A unifying framework for joint trait analysis under a non-infinitesimal model

Ruth Johnson; Huwenbo Shi; Bogdan Pasaniuc; Sriram Sankararaman

doi:10.1093/bioinformatics/bty254

. 2018 Jun 27;34(13):i195–i201. doi: 10.1093/bioinformatics/bty254

A unifying framework for joint trait analysis under a non-infinitesimal model

Ruth Johnson ^1,^✉, Huwenbo Shi ², Bogdan Pasaniuc ^2,^3,^4,^#, Sriram Sankararaman ^1,^2,^3,^#

PMCID: PMC6022541 PMID: 29949958

Abstract

Motivation

A large proportion of risk regions identified by genome-wide association studies (GWAS) are shared across multiple diseases and traits. Understanding whether this clustering is due to sharing of causal variants or chance colocalization can provide insights into shared etiology of complex traits and diseases.

Results

In this work, we propose a flexible, unifying framework to quantify the overlap between a pair of traits called UNITY (Unifying Non-Infinitesimal Trait analYsis). We formulate a Bayesian generative model that relates the overlap between pairs of traits to GWAS summary statistic data under a non-infinitesimal genetic architecture underlying each trait. We propose a Metropolis–Hastings sampler to compute the posterior density of the genetic overlap parameters in this model. We validate our method through comprehensive simulations and analyze summary statistics from height and body mass index GWAS to show that it produces estimates consistent with the known genetic makeup of both traits.

Availability and implementation

The UNITY software is made freely available to the research community at: https://github.com/bogdanlab/UNITY.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Genome-wide association studies (GWAS) have identified thousands of regions in the genome that contain variants that contribute to risk for many diseases. Many of these risk regions have been implicated in multiple phenotypes such as autism and schizophrenia (Autism Spectrum Disorders Working Group of The Psychiatric Genomics Consortium et al., 2017), multiple autoimmune diseases (Cotsapas et al., 2011; Ramos et al., 2011; Richard-Miceli and Criswell, 2012), Crohn’s disease and psoriasis (Ellinghaus et al., 2012), and many others. Understanding which causal variants are shared among diseases can provide novel etiological insight as well as provide evidence of potential shared causal mechanisms between complex traits. In addition, identifying which variants contribute to multiple traits can help decipher which molecular traits (e.g. gene expression) contribute to disease risk (Giambartolomei et al., 2014; Hormozdiari et al., 2016); genetic variants that causally alter gene expression as well as disease risk can link a particular gene to a given disease.

Genetic overlap has been analyzed both at the genome-wide level and local level, where the latter refers to analysis done within a given genomic region. Genetic correlation, a measure that quantifies the similarity in the genetic effects on pairs of traits, is commonly used for assessing the relationship between two traits and can be applied either genome-wide or to local data (Bulik-Sullivan et al., 2015; Shi et al., 2017). Many of the models for estimating genome-wide genetic correlation assume an infinitesimal genetic architecture where all SNPs, or single nucleotide polymorphisms, are assumed to have a very small effect on the trait. In contrast to genetic correlation, colocalization methods aim to estimate whether the GWAS association signals for two traits at the same region are due to the same causal variant across the traits or chance (Giambartolomei et al., 2014; Hormozdiari et al., 2016). The methods that relax the infinitesimal assumption either assume a single causal variant per region or limit the number of potential causal variants a priori, often due to computational considerations (Giambartolomei et al., 2014; Hormozdiari et al., 2016). Although, both genetic correlation and colocalization aim to describe the genetic sharing between traits, these methods have been utilized largely independently of each other.

In this work, we present a unifying statistical model that ties together genetic correlation and colocalization. To accomplish this, we present a fully generative Bayesian statistical model that models the shared as well distinct genetic variants underlying a pair of traits. The model allows for sparse genetic architectures (where only a small fraction of variants are causally impacting the traits). The model is richly parametrized: allowing us to jointly model global parameters such as the proportion of variants that are causal for both as well for either trait, the trait heritability, the correlation of the effect sizes at the causal SNPs and local parameters such as the effect of a single SNP on each of the traits.

A challenge of a non-infinitesimal genetic architecture is that it presents a computationally challenging inference problem. Performing inference under this model often involves explicitly enumerating all causal configurations of the SNPs. This exponential search space of $2^{2 M}$ , where M is the number of SNPs analyzed, proves intractable given the large genetic datasets now available. We propose Unifying Non-Infinitesimal Trait analYsis (UNITY) that relies on Markov Chain Monte Carlo (MCMC) to approximate the posterior probabilities of the model parameters. In this work, we focus on estimating the proportion of shared and trait-specific causal variants since parameters such as heritability and genetic correlation can be estimated using previous methods (Bulik-Sullivan et al., 2015). Additionally, a key advantage of the method is that it only requires summary level association statistic data, which bypasses many of the privacy concerns associated with individual level data. With the widespread availability of GWAS summary statistics (Pasaniuc and Price, 2017), we expect that a method operating only on summary statistics would prove most useful for the research community. Through comprehensive simulations and an analysis of height and body mass index (BMI), we show that our method can accurately estimate the proportion of shared causal SNPs between two complex traits.

2 Materials and methods

2.1 Generative model

Here, we introduce a Bayesian framework for estimating the proportion of causal variants shared between a pair of complex traits. The input to our method is the vector of signed effect sizes at each SNP for each trait (we only analyze SNPs for which effect size estimates are available for both traits). We model the genetic as well as non-genetic variances in each trait, the genetic correlation among the traits, and the proportion of causal SNPs that are shared across traits as well as are unique to each. The proportion of causal SNPs shared between the traits is denoted by p₁₁, the proportion of causal SNPs specific to trait 1 and trait 2 as p₁₀ and p₀₁, respectively, and the proportion of non-causal SNPs is denoted by p₀₀, where $p_{00} + p_{10} + p_{01} + p_{11} = 1$ . For each trait $p \in {1, 2}$ , we denote the genetic variance $σ_{p}^{2}$ (which is the same as its heritability as $h_{p}^{2}$ if the trait is standardized), the environmental noise as $σ_{e_{p}}^{2} = \frac{1 - h_{p}^{2}}{N_{p}}$ , where N_p denotes the sample size for trait p, and the genetic correlation between the two traits as ρ. Altogether, our model has the following parameters: $(σ_{1}^{2}, σ_{2}^{2}, ρ, p_{00}, p_{10}, p_{01}, p_{11})$ .

We assume that trait p ( $p \in {1, 2}$ ) measured in individual i, $y_{p, i}$ is a linear function of standardized genotypes $x_{=} (x_{i, 1}, \dots, x_{i, M})$ measured across M SNPs with SNP effect sizes $β_{p} = (β_{p, 1}, \dots, β_{p, M})$ and independent additive noise term $ϵ_{p, i}$ . Further, we assume that there are no sample overlaps across the two studies.

y_{1, i} = \sum_{m = 1}^{M} β_{1, m} x_{i, m} + ϵ_{1, i} y_{2, i} = \sum_{m = 1}^{M} β_{2, m} x_{i, m} + ϵ_{2, i}

i \in {1, \dots, N_{1}} i \in {1, \dots, N_{2}}

ϵ_{p, i} \overset{iid}{\sim} N (0, σ_{e_{p}}^{2})

A SNP m is causal for trait p if its true effect $β_{p, m} \neq 0$ and it is not causal otherwise. We denote the probability of a SNP being causal for every combination of the two traits as: $p = (p_{00}, p_{10}, p_{01}, p_{11})$ .

Denoting the causal effect sizes for trait $p, p \in {1, 2}$ across all SNPs $γ_{p} = (γ_{p, 1}, \dots, γ_{p, M})$ , we assume that the causal effect sizes for each SNP are independent, allowing us to model the effect sizes at SNP m for each of the two traits $(γ_{1, m}, γ_{2, m})$ as a random vector drawn from a bi-variate normal distribution centered at zero with the following covariance matrix:

(\begin{array}{l} γ_{1, m} \\ γ_{2, m} \end{array}) | (σ_{1}^{2}, σ_{2}^{2}, ρ, p) \sim N ((\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} \frac{σ_{1}^{2}}{M (p_{11} + p_{10})} & \frac{σ_{1} σ_{2} ρ}{M (p_{11})} \\ \frac{σ_{1} σ_{2} ρ}{M (p_{11})} & \frac{σ_{2}^{2}}{M (p_{11} + p_{01})} \end{array}))

$C_{p} = (C_{p, 1}, \dots, C_{p, M})$ denotes the causal indicator vector for trait p, where $C_{p, m} = 1$ if SNP m is causal for trait p and 0 otherwise. $(C_{1, m}, C_{2, m})$ is a random vector drawn from a discrete distribution with parameters given by $p$ :

P ((\begin{array}{l} C_{1, m} \\ C_{2, m} \end{array}) = (\begin{array}{l} a \\ b \end{array}) | p) = p_{a b}, a, b \in {0, 1}

The true effect sizes for each trait p at SNP m, $β_{p, m}$ , conditioned on the causal status at a SNP is the element-wise product of the causal indicator vector and the true causal effect sizes.

(\begin{array}{l} β_{1, m} \\ β_{2, m} \end{array}) | (\begin{array}{l} C_{1, m} \\ C_{2, m} \end{array}), (\begin{array}{l} γ_{1, m} \\ γ_{2, m} \end{array}) = (\begin{array}{l} γ_{1, m} C_{1, m} \\ γ_{2, m} C_{2, m} \end{array})

We can model the conditional distribution of the GWAS summary statistics given the true effect sizes, where ${\hat{β}}_{p, m}$ is the estimated marginal effect size of the mth SNP for trait p (Shi et al., 2017):

\begin{array}{l} (\begin{array}{l} {\hat{β}}_{1, 1 : M} \\ {\hat{β}}_{2, 1 : M} \end{array}) | ((\begin{array}{l} β_{1, 1 : M} \\ β_{2, 1 : M} \end{array}), σ_{e_{1}}^{2}, σ_{e_{2}}^{2}) \sim N ((\begin{array}{l} V β_{1, 1 : M} \\ V β_{2, 1 : M} \end{array}), Σ_{e}) \\ Σ_{e} = (\begin{array}{l} σ_{e_{1}}^{2} V & 0 \\ 0 & σ_{e_{2}}^{2} V \end{array}) \end{array}

$V$ is the matrix of correlations among the SNPs, i.e. the linkage disequilibrium (LD) matrix. $V$ can be estimated from a reference panel of genotypes collected from a population that is genetically similar to the populations for which summary statistics are available. Alternately, when performing inference at the genome-wide level, we can prune the list of SNPs such that they come from independent LD blocks. LD-pruning creates an approximately independent subset of SNPs in which case $V$ can be approximated by the identity matrix, $I$ . In this work, we restrict our attention to the case where $V \approx I$ .

We impose a Dirichlet prior on $p$ :

ρ | λ \sim Dir (λ)

Here $λ = (λ_{1}, λ_{2}, λ_{3}, λ_{4})$ . In practice, we set $λ_{1} = λ_{2} = λ_{3} = λ_{4} = λ = 0.20$ :

In principle, we can also impose priors on the remaining parameters, i.e. the trait heritability $(σ_{1}^{2}, σ_{2}^{2})$ and their genetic correlation ρ and estimate all of these parameters jointly with $p$ in a fully Bayesian model. These parameters can be estimated using other methods (Bulik-Sullivan et al., 2015) and, in this work, we fix the values of these parameters to their estimates and focus on estimating $p$ .

Given the parameters ( $σ_{1}^{2}, σ_{2}^{2}$ ,ρ, λ), the joint distribution of the probability of causal configurations $p$ , the causal indicator vectors $C_{1}, C_{2}$ , the causal effect sizes $γ_{1}, γ_{2}$ , and the estimated effect sizes ${\hat{β}}_{1}, {\hat{β}}_{2}$ is given by:

\begin{array}{l} P ({\hat{β}}_{1}, {\hat{β}}_{2}, C_{1}, C_{2}, γ_{1}, γ_{2}, p | σ_{1}^{2}, σ_{2}^{2}, ρ, λ) \\ = P (p | λ) \\ \times \prod_{m = 1}^{M} {P ((\begin{array}{l} C_{1, m} \\ C_{2, m} \end{array}) | p) P ((\begin{array}{l} γ_{1, m} \\ γ_{2, m} \end{array}) | (σ_{1}^{2}, σ_{2}^{2}, ρ, p)) \\ \times P ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}) | ((\begin{array}{l} γ_{1, m} \\ γ_{2, m} \end{array}), (\begin{array}{l} C_{1, m} \\ C_{2, m} \end{array}), σ_{e_{1}}^{2}, σ_{e_{2}}^{2}))} \end{array}

Integrating over the hidden variables $C_{1}, C_{2}, γ_{1}, γ_{2}$ , we obtain:

\begin{array}{l} P ({\hat{β}}_{1}, {\hat{β}}_{2}, p | σ_{1}^{2}, σ_{2}^{2}, ρ, λ)) \\ = P (p | λ) \\ \times \prod_{m = 1}^{M} [\begin{array}{l} \int \sum_{\begin{matrix} C_{1, m} \\ C_{2, m} \end{matrix}} {P ((\begin{array}{l} C_{1, m} \\ C_{2, m} \end{array}) | p) P ((\begin{array}{l} γ_{1, m} \\ γ_{2, m} \end{array}) | (σ_{1}^{2}, σ_{2}^{2}, ρ, p)) \\ \times P ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}) | ((\begin{array}{l} γ_{1, m} \\ γ_{2, m} \end{array}), (\begin{array}{l} C_{1, m} \\ C_{2, m} \end{array}), σ_{e_{1}}^{2}, σ_{e_{2}}^{2}))} d γ_{1, m} d γ_{2, m} \end{array}] \\ = Dir (p; λ) \\ \times \prod_{m = 1}^{M} [p_{00} N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} σ_{e_{1}}^{2} & 0 \\ 0 & σ_{e_{2}}^{2} \end{array})) \\ \begin{array}{l} + p_{10} N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} \frac{σ_{1}^{2}}{M (p_{11} + p_{10})} + σ_{e_{1}}^{2} & 0 \\ 0 & σ_{e_{2}}^{2} \end{array})) \\ + p_{01} N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} σ_{e_{1}}^{2} & 0 \\ 0 & \frac{σ_{2}^{2}}{M (p_{11} + p_{01})} + σ_{e_{2}}^{2} \end{array})) \\ + p_{11} N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} \frac{σ_{1}^{2}}{M (p_{11} + p_{10})} + σ_{e_{1}}^{2} & \frac{σ_{1} σ_{2}}{M (p_{11})} ρ \\ \frac{σ_{1} σ_{2}}{M (p_{11})} ρ & \frac{σ_{2}^{2}}{M (p_{11} + p_{01})} + σ_{e_{2}}^{2} \end{array}))] \end{array} \end{array}

2.2 Parameter inference in our model

Given the generative model described in the previous section, the inference problem lies in computing the posterior distribution of $p$ given the estimated summary statistics

P (p | {\hat{β}}_{1}, {\hat{β}}_{2}, σ_{1}^{2}, σ_{2}^{2}, ρ, λ)

= \frac{P (p, {\hat{β}}_{1}, {\hat{β}}_{2} | σ_{1}^{2}, σ_{2}^{2}, ρ, λ)}{P ({\hat{β}}_{1}, {\hat{β}}_{2} | σ_{1}^{2}, σ_{2}^{2}, ρ, λ)}

The true joint posterior distribution is intractable. Thus, we use MCMC (Brooks et al., 2011) to approximate the posterior distribution. MCMC approximates the target posterior distribution

$P (p | {\hat{β}}_{1}, {\hat{β}}_{2}, σ_{1}^{2}, σ_{2}^{2}, ρ)$ by a sequence of random samples ${(p^{(t)})}_{t = 1}^{T}$ drawn from a Markov chain constructed so that the stationary distribution of the chain is the target posterior.

P (p | {\hat{β}}_{1}, {\hat{β}}_{2}, σ_{1}^{2}, σ_{2}^{2}, ρ, λ) \approx \frac{1}{T} \sum_{t = 1}^{T} δ_{p^{(t)}} (p)

In our setting, we use a random-walk Metropolis–Hastings algorithm (Metropolis et al., 1953) that generates a sample $p^{(t + 1)}$ at iteration t + 1 given the sample $p^{(t)}$ at the previous iteration using the following proposal distribution that generates a proposed sample $p^{*}$ which is then accepted or rejected depending on the Metropolis–Hastings ratio (which depends on the ratio of the posterior probability density at the proposed parameter to the previous parameter):

p^{*} \sim Dir (d)

d = λ + B p^{(t)}

Here, B is a constant that controls the variance of the proposal distribution. In practice, we found that B = 10 yields effective mixing.

The final step in specifying the MCMC algorithm lies in computing the ratio of the posterior probability density at the proposed parameter to the original parameter. Computation of the ratio requires the evaluation of the posterior probability only up to a normalization constant:

\begin{array}{l} P (p | {\hat{β}}_{1}, {\hat{β}}_{2}, σ_{1}^{2}, σ_{2}^{2}, ρ, λ) \\ \propto P ({\hat{β}}_{1}, {\hat{β}}_{2}, p | σ_{1}^{2}, σ_{2}^{2}, ρ, λ) \\ = Dir (p; λ) \\ \times [\prod_{m = 1}^{M} N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} σ_{e_{1}}^{2} & 0 \\ 0 & σ_{e_{2}}^{2} \end{array})) \cdot (p_{00}) \\ + N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} \frac{σ_{1}^{2}}{M (p_{11} + p_{10})} + σ_{e_{1}}^{2} & 0 \\ 0 & σ_{e_{2}}^{2} \end{array})) \cdot (p_{10}) \\ + N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} σ_{e_{1}}^{2} & 0 \\ 0 & \frac{σ_{2}^{2}}{M (p_{11} + p_{01})} + σ_{e_{2}}^{2} \end{array})) \cdot (p_{01}) \\ + N ((\begin{array}{l} {\hat{β}}_{1, m} \\ {\hat{β}}_{2, m} \end{array}); (\begin{array}{l} 0 \\ 0 \end{array}), (\begin{array}{l} \frac{σ_{1}^{2}}{M (p_{11} + p_{10})} + σ_{e_{1}}^{2} & \frac{σ_{1} σ_{2}}{M (p_{11})} ρ \\ \frac{σ_{1} σ_{2}}{M (p_{11})} ρ & \frac{σ_{2}^{2}}{M (p_{11} + p_{01})} + σ_{e_{2}}^{2} \end{array})) \\ \cdot (p_{11})] \end{array}

2.3 Efficient mixing of MCMC chains

In any practical application of MCMC, the number of iterations, burn-in period, and initialization point are critical to ensuring convergence and accurate estimates. Slow mixing of the MCMC chains can occur if the starting point is at a region of low posterior density. As opposed to selecting a random starting point, we carefully select the initialization of each chain by choosing the set of parameters that yields the highest posterior density. We use the limited-memory Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (Byrd et al., 1994) to determine the maximum a posteriori estimates for $p_{00}, p_{10}, p_{01}, p_{11}$ . We repeat this 10 times, initializing the optimization algorithm with random starting values drawn from the prior. We compute the posterior density of all 10 candidate starting values and select the set that yields the highest density. This set of parameters is then used as the starting point for our MCMC chain. In addition, to diagnose convergence, we use 100 Markov chains all initialized using the scheme described above. Our final estimate is the mean of all samples drawn from the 100 chains.

2.4 Note on runtime

We assessed the performance based on the number of seconds per iteration of the MCMC sampler. The main computation is calculating the likelihood at each iteration, which is directly dependent on the number of SNPs per trait. The complexity of the algorithm is $O (m)$ , where m is the number of SNPs. We empirically demonstrate that our method is linear in the number of SNPs through simulation (Supplementary Fig. S1). In addition, the runtime is invariably connected to the number of iterations required for the MCMC to converge. We find that using the maximum a posteriori probability (MAP) estimate as an initialization value leads to fast convergence, requiring only 500 iterations in practice.

3 Results

UNITY provides a novel generalized framework to jointly model GWAS summary statistics data of two complex traits, incorporating fundamental genetic parameters, such as heritability and genetic correlation, and makes minimal assumptions in inference procedures. Since UNITY assumes a non-infinitesimal model, it allows for very sparse genetic architectures, i.e. by setting $p_{00} \approx 1$ . However, this non-infinitesimal model can also be generalized to the infinitesimal model by setting $p_{00} \approx 0, p_{10} \approx 0, p_{01} \approx 0, p_{11} \approx 1$ .

3.1 UNITY generalizes colocalization and genetic correlation

We discuss a comparison of the parameters of UNITY with those obtained by other methods that perform cross-trait analysis and the underlying assumptions of each method. We first analyze the cross-trait LD score regression model (Bulik-Sullivan et al., 2015), which estimates genome-wide genetic correlation based on the random-effect model, making the implicit assumption that every SNP has a non-zero effect. In contrast to cross-trait LD score regression, UNITY assumes a generalized non-infinitesimal model, explicitly modeling a sparse genetic architecture. We also compare UNITY with methods that do not make the infinitesimal model assumption. While models such as PleioPred explicitly model the proportion of trait-specific and shared causal variants $p_{00}, p_{10}, p_{10}, p_{11}$ , the main goal of this method is to perform genetic risk prediction (Hu et al., 2017a) rather than estimating these proportions.

We compare UNITY with COLOC (Giambartolomei et al., 2014) and eCAVIAR (Hormozdiari et al., 2016), Bayesian methods to assess the evidence of colocalization, i.e. whether GWAS signals of two traits are driven the same underlying causal variants. Both methods explicitly model $p = (p_{00}, p_{10}, p_{10}, p_{11})$ (Giambartolomei et al., 2014; Hormozdiari et al., 2016). However, COLOC makes the simplifying assumption that there is at most one-causal variant at a region (Giambartolomei et al., 2014), allowing it to not explicitly model LD. And although eCAVIAR allows for multiple causal variants and explicitly models LD, it restricts the maximum number of causal variants at six per region for computational efficiency (Hormozdiari et al., 2016). In comparison with these methods, UNITY allows for any number of causal variants while making the assumption that there is no LD between the SNPs. We outline a summary of the relationship between UNITY and all methods described in Table 1.

Table 1.

Displayed is a summary of current methods that perform joint trait analysis and the relationship to the parameters in UNITY

Method	h²	ρ	$p$	Misc.
UNITY	*	*	*
Cross-trait LD Score regression (Bulik-Sullivan et al., 2015)	*	*	$p_{11} \approx 1$
PleioPred (Hu et al., 2017a)	*	*	*	infers $p$ to estimate effect sizes
COLOC (Giambartolomei et al., 2014)	–	–	*	max 1 causal
eCAVIAR (Hormozdiari et al., 2016)	–	–	*	max 6 causals

Open in a new tab

Boxes with an (*) denote the values that a method models. Note that this summary is not exhaustive.

To empirically demonstrate the benefit of the relaxed assumptions of UNITY as compared to current methods, we conduct a modest comparison against COLOC (Giambartolomei et al., 2014). We simulated 100 regions of 500 SNPs with multiple causal variants. We perform colocalization analysis over all of the regions using COLOC. When there are causal variants independently associated with each trait and shared variants, COLOC estimates that the association within the region is driven only by two independent variants, where one is specific to trait 1 and the other is specific to trait 2. Because COLOC assumes at most one-causal variant per region, the method is unable to distinguish between a variant that independently drives only one trait versus a variant that is colocalized when both cases are present. For completeness, we also included a simulation that follows the assumption underlying COLOC of the one-causal setting. The full table listing these results in outlined in Supplementary Table S2. However, we are unable to directly compare estimates with COLOC because there is not a clear mapping between the estimates of COLOC and the estimated parameters of UNITY, thus any direct comparison would be an unfair comparison due to the mismatch in the models.

3.2 Simulations

We generated summary statistics for 500 SNPs from two synthetic GWAS. The causal effect sizes for each SNP, $γ_{p, m}$ , were drawn jointly from a multivariate normal distribution where $h_{1}^{2}, h_{2}^{2}, ρ$ denote the heritability of each trait and the genetic correlation. We denote the number of SNPs as M and the proportion of causal variants for each trait as p₁₀, p₀₁ and the proportion of shared casuals as p₁₁:

(\begin{array}{l} γ_{1, m} \\ γ_{2, m} \end{array}) \sim (\begin{array}{l} \frac{h_{1}^{2}}{M (p_{11} + p_{10})} & \frac{h_{1} h_{2} ρ}{M (p_{11})} \\ \frac{h_{1} h_{2} ρ}{M (p_{11})} & \frac{h_{2}^{2}}{M (p_{11} + p_{01})} \end{array})

To simulate causal SNPs, we drew an $M \times 4$ matrix from a multinomial distribution parametrized by $p$ where the mth row of values denotes whether a SNP is causal for neither trait, only trait 1, only trait 2, or neither trait. Using this, we constructed two $M \times 1$ causal indicator vectors, C₁, C₂, where $C_{1, m}, C_{2, m} = 1$ if the mth SNP was causal for both traits, $C_{1, m} = 1, C_{2, m} = 0$ if the SNP was only causal for trait 1, $C_{1, m} = 0, C_{2, m} = 1$ if it was only causal for trait 2, and $C_{1, m}, C_{2, m} = 0$ if the SNP was non-causal. To get the true effect sizes, we multiplied element-wise $β_{1} = C_{1} ° γ_{1}$ and $β_{2} = C_{2} ° γ_{2}$ where we are essentially zeroing out any entry from the causal effect vector where a SNP is non-causal.

To compute the estimated GWAS effect sizes, ${\hat{β}}_{p}$ , we assumed $cov (ϵ_{1}, ϵ_{2}) = 0$ , so random noise terms $ϵ_{1}, ϵ_{2}$ were drawn from two normal distributions $N (0, \frac{1 - h_{1}^{2}}{N_{1}})$ and $N (0, \frac{1 - h_{2}^{2}}{N_{2}})$ respectively. We assume that the SNPs being used at the genome-wide level will be LD-pruned such that there is very little or no correlation structure. Thus, we set the LD matrix V = I_M, where I_M is an M × M identify matrix. We then draw the estimated effect sizes from a conditional distribution of the GWAS summary statistics, as described in Section 2.

First, we confirm that our method accurately predicts the proportion of causal variants under varying sample sizes and heritability estimates. We tested a variety of simulation frameworks where we fixed the genetic correlation and heritabilities of the two traits. We ran each simulation for 500 iterations and used the first quarter of the iterations as burn-in. We vary the proportion of causal variants contributing to only trait 1 (p₁₀), proportion of causal variants for only trait 2 (p₀₁), and the proportion of casual variants contributing to both traits (p₁₁). As shown in Figure 1, we can see that UNITY performs robustly across each scenario.

Fig. 1. — We estimate the proportion of causal variants under four simulation frameworks where we vary the sample size (N₁, N₂), heritability ( $h_{1}^{2} = h_{2}^{2}$ ), and proportion of causal variants. First, we first simulated values where the total proportion of causal variants is low: $p_{00} = 0.89, p_{10} = 0.05, p_{01} = 0.05, p_{11} = 0.01$ , along with a low sample size and high heritability: $h_{1}^{2} = 0.05, h_{2}^{2} = 0.05, ρ = 0, N_{1} = 1000, N_{2} = 1000$ , as shown in (a). Second, we tested the model with the same proportion of causal variants, but with a larger sample size and smaller heritability: $h_{1}^{2} = 0.001, h_{2}^{2} = 0.001, ρ = 0, N_{1} = 100 000, N_{2} = 100 000$ , shown in (b). Third, we simulated data with a higher proportion of causal variants, $p_{00} = 0.50, p_{10} = 0.20, p_{01} = 0.20, p_{11} = 0.10$ . Using the same sets of heritabilities and sample sizes from the first two simulations, we tested the prediction accuracy of our model. (c) denotes the simulation with low sample size and high heritability, and (d) denotes the simulation with high sample size and low heritability. The dotted red lines denote the true proportion of causal SNPs in each simulation

Next, to assess how UNITY performs with varying levels of heritability, we continued to fix ρ = 0, but varied the values of the heritability. Note that we used low heritability values due to the low number of simulated SNPs (M = 500). From Figure 2, we can see that the estimates reflect the prior distribution of $(p_{00}, p_{10}, p_{01}, p_{11})$ when the heritability is very low. We also show in Figure 3 that our estimates are invariant to the correlation between phenotypes.

Fig. 2. — We simulate the following proportion of causal variants $p_{00} = 0.97, p_{10} = 0.01, p_{01} = 0.01, p_{11} = 0.01$ and vary the heritability ( $h_{1}^{2} = h_{2}^{2}$ ) while fixing $ρ, N_{1}, N_{2}, M$ . We vary the heritability from.01 to 5e−7 and plot the estimated proportion of non-causal variants (a), proportion of causal variants for trait 1 (b), proportion of causal variants for trait 2 (c) and proportion of shared causal variants (d). We note that as the heritability goes down, the data become less informative and the estimates reflect the prior

Fig. 3. — We simulate the following proportion of causal variants $p_{00} = 0.97, p_{10} = 0.01, p_{01} = 0.01, p_{11} = 0.01$ and vary the genetic correlation from 0 to 0.50 while fixing $h_{1}^{2}, h_{2}^{2}, N_{1}, N_{2}, M$ . We only show the estimate of p₁₁, since this would be the only estimate directly affected by the presence of genetic correlation

To assess the role of sample size in our inference, we performed simulations where we varied the number of individuals from 1000 to 250 000. We find that the recommended sample size should be at least 50 000 individuals to yield precise results (Supplementary Fig. S2). Additionally, to further assess the performance of the method, we also performed simulations where $h_{1}^{2} \neq h_{2}^{2}$ and when $p_{10} \neq p_{01}$ . Through simulation, we demonstrate that our method is robust to these scenarios, with detailed results provided in Supplementary Figures S3 and S4.

Finally, through simulations, we empirically demonstrate that our method is well calibrated under the null hypothesis, defined as: (i) $p_{10} = 0$ , (ii) $p_{01} = 0$ and (iii) $p_{11} = 0$ . To demonstrate this, we simulated 100 000 SNPs with 100 000 individuals where $h_{1}^{2} = 0.25, h_{2}^{2} = 0.25, ρ = 0$ . For each hypothesis, we set the parameter of interest exactly to 0 and then then simulated 2% causal variants between the remaining parameters. For example, for null hypothesis (1), the corresponding set of simulation parameters would be: $p_{10} = 0, p_{01} = 0.01, p_{11} = 0.01$ . Using UNITY, we estimated the null parameter and report the posterior mean and standard deviation in Table 2. Note that UNITY estimates the null parameter very close to zero, but not exactly zero. This is because there is a non-zero prior on the set of parameters, making it not possible to be exactly zero, but can instead be asymptotically close.

Table 2.

We present the posterior means and standard deviations estimated when the proportion of causal variants is set exactly to zero for trait 1 and trait 2, and when the shared proportion is exactly zero

Hypothesis	Null parameter	Mean	SD
1	p₁₀	0.0006	0.0023
2	p₀₁	0.0004	0.0005
3	p₁₁	0.0002	0.0003

Open in a new tab

3.2.1 LD-pruning to identify approximately independent SNPs

To rigorously assess the role of LD in our model, we demonstrate a sufficient LD-pruning scheme through simulations. To model a realistic LD structure, we used SNPs from 1000 Genomes (Consortium et al., 2012a) to compute the LD for each of the approximately independent LD blocks identified in Berisa and Pickrell (2016). We filtered rare SNPs by minor allele frequency, MAF ≤ 0.05, and used 1 million SNPs sampled across the LD blocks. We chose only a subset of 1 million SNPs because this closely reflects the number of SNPs genotyped on SNP arrays. We simulated the GWAS effect sizes as outlined in Section 3.1, where the heritabilities for each of the each traits was set to $h_{1}^{2} = 0.50$ and $h_{2}^{2} = 0.50$ (which is similar to the estimated SNP heritability for height), and genetic correlation ρ = 0.

To assess the role of LD-pruning, we divided the genome into K kilobase non-overlapping windows and selected a SNP from each window. We varied K to assess the minimal window size necessary to create a subset of approximately independent SNPs. In addition, we used cross-trait LD Score regression to estimate the heritabilities for both traits and the genetic correlation after pruning, which were subsequently used in the inference. Through simulations, we determined that a 5 KB window provides precise estimates (Supplementary Table S1).

3.3 Empirical analysis of BMI and height

We downloaded GWAS summary data for both height and BMI from the GIANT consortium (Allen et al., 2010; Speliotes et al., 2010) where each study has >170 000 individuals. First, we overlapped each GWAS by rsid to get SNPs present in both studies. Then for each trait, we filtered out SNPs with a minor allele frequency $\leq 0.05$ . Additionally, we performed LD-pruning by taking a SNP from every 5 KB window.

We used cross-trait LD Score to estimate the heritability and genetic correlation parameters: $h_{H}^{2} = 0.2390, h_{B}^{2} = 0.1566, ρ = - 0.0845$ . Denoting height as the first trait and BMI as the second, we estimated the proportion of causal variants for each trait as, $p_{00} = 0.9519, p_{10} = 0.0062, p_{01} = 0.01579, p_{11} = 0.0262$ . We summarize the distribution of estimated causal SNPs in Figure 4.

Fig. 4. — We show the distribution of estimated non-causal and causal SNPs from the height and BMI analysis

Our results are consistent with the known genetic makeup of BMI and height. Since BMI is a function of an individual’s height and weight, we expect all of the contributing variants for height to also contribute to BMI. UNITY predicts more BMI-only specific variants than height-only variants. We hypothesize that the BMI specific variants are those that contribute to weight, whereas the variants that contribute to height in the BMI dataset were already captured in the p₁₁ estimate. In principal, we would expect p₁₀ to be zero since SNPs contributing to height also contribute to BMI. We expect this could be due to the non-zero prior on p₁₀. Because of this, the estimate can never truly be zero but can be asymptotically close.

4 Discussion

In this work, we introduce a statistical framework for quantifying the relationship between two complex traits. The key advantage of our method is that it makes very few assumptions about the data and few restrictions during inference. Rather than relying on assumptions about a trait’s genetic architecture, we let the data describe the underlying genetics. By using a Metropolis–Hastings sampling framework, we can calculate a variety of likelihoods without relying on any conjugate prior pairings. For example, although we choose to model the causal effect sizes through a multivariate normal, one could choose another distribution, and the sampling procedure would still hold even if the new distribution did not have a conjugate prior. Finally, by operating exclusively on GWAS summary statistic data, we aim to encourage future large-scale meta analyses, since obtaining individual level data are not always readily available.

We conclude with several limitations and potential future directions of our framework. First, as the size of genetic datasets grow, subsampling methods such as MCMC may prove computationally intractable. Alternatives include using adaptive MCMC to accelerate mixing and convergence or variational methods that do not require subsampling. Additionally, we have yet to rigorously quantify the effects of LD in our model in practice for local inference. We leave rigorous comparison between UNITY and other relevant methods as future work.

Additionally, recent integrative methods have shown that the incorporation of a variants functional genomic context can improve both power and accuracy in identifying potential causal variants (Hu et al., 2017b; Kichaev et al., 2014; Li and Kellis, 2016; Pickrell, 2014). Large-scale initiatives such as the ENCODE (Consortium et al., 2012b) and ROADMAP (Kundaje et al., 2015) projects have provided comprehensive databases of tissue-specific functional genomic annotations. Combining this rich atlas of functional data and the genetic information from GWAS will likely uncover novel insights into disease biology. We leave the incorporation of functional elements as a potential direction for future work.

Supplementary Material

Supplementary Data

Click here for additional data file.^{(127.8KB, pdf)}

Acknowledgements

We are grateful to Kathryn Burch, Claudia Giambartolomei and Gleb Kichaev for helpful and insightful discussions.

Funding

This work was funded in part by National Institutes of Health (NIH), under awards R01HG009120, R01HG006399, U01CA194393, R00GM111744, R35GM125055, National Science Foundation Grant III-1705121, an Alfred P. Sloan Research Fellowship, and a gift from the Okawa Foundation.

Conflict of Interest: none declared.

References

Allen H.L. et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467, 832.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Autism Spectrum Disorders Working Group of the Psychiatric Genomics Consortium Anney R.J. et al. (2017) Meta-analysis of gwas of over 16, 000 individuals with autism spectrum disorder highlights a novel locus at 10q24. 32 and a significant overlap with schizophrenia. Mol. Autism, 8, 1–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
Berisa T., Pickrell J.K. (2016) Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32, 283. [DOI] [PMC free article] [PubMed] [Google Scholar]
Brooks S. et al. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton, FL. [Google Scholar]
Bulik-Sullivan B. et al. (2015) An atlas of genetic correlations across human diseases and traits. Nat. Genet., 47, 1236.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Byrd R.H. et al. (1994). Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program., 63, 129–156.
1000 Genomes Project Consortium Abecasis G.R. et al. (2012) An integrated map of genetic variation from 1, 092 human genomes. Nature, 491, 56.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cotsapas C. et al. (2011) Pervasive sharing of genetic effects in autoimmune disease. PLoS Genet., 7, e1002254.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ellinghaus D. et al. (2012) Combined analysis of genome-wide association studies for crohn disease and psoriasis identifies seven shared susceptibility loci. Am. J. Hum. Genet., 90, 636–647. [DOI] [PMC free article] [PubMed] [Google Scholar]
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Giambartolomei C. et al. (2014) Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet., 10, e1004383.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hormozdiari F. et al. (2016) Colocalization of gwas and eqtl signals detects target genes. Am. J. Hum. Genet., 99, 1245–1260. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hu Y. et al. (2017) Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction. PLoS Genet., 13, e1006836.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hu Y. et al. (2017) Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput. Biol., 13, e1005589.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kundaje A. et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kichaev G. et al. (2014) Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet. 10, e1004722.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li Y., Kellis M. (2016) Joint bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Res., 44, e144. [DOI] [PMC free article] [PubMed] [Google Scholar]
Metropolis N. et al. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys., 21, 1087–1092. [Google Scholar]
Pasaniuc B., Price A.L. (2017) Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet., 18, 117.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pickrell J.K. (2014) Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am. J. Hum. Genet., 94, 559–573. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ramos P.S. et al. (2011) A comprehensive analysis of shared loci between systemic lupus erythematosus (sle) and sixteen autoimmune diseases reveals limited genetic overlap. PLoS Genet., 7, e1002406.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Richard-Miceli C., Criswell L.A. (2012) Emerging patterns of genetic overlap across autoimmune disorders. Genome Med., 4, 6.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shi H. et al. (2017) Local genetic correlation gives insights into the shared genetic architecture of complex traits. Am. J. Hum. Genet., 101, 737–751. [DOI] [PMC free article] [PubMed] [Google Scholar]
Speliotes E.K. et al. (2010) Association analyses of 249, 796 individuals reveal 18 new loci associated with body mass index. Nat. Genet., 42, 937.. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Click here for additional data file.^{(127.8KB, pdf)}

[bty254-B1] Allen H.L. et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467, 832.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B2] Autism Spectrum Disorders Working Group of the Psychiatric Genomics Consortium Anney R.J. et al. (2017) Meta-analysis of gwas of over 16, 000 individuals with autism spectrum disorder highlights a novel locus at 10q24. 32 and a significant overlap with schizophrenia. Mol. Autism, 8, 1–17. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B3] Berisa T., Pickrell J.K. (2016) Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32, 283. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B4] Brooks S. et al. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton, FL. [Google Scholar]

[bty254-B5] Bulik-Sullivan B. et al. (2015) An atlas of genetic correlations across human diseases and traits. Nat. Genet., 47, 1236.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B6] Byrd R.H. et al. (1994). Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program., 63, 129–156.

[bty254-B7] 1000 Genomes Project Consortium Abecasis G.R. et al. (2012) An integrated map of genetic variation from 1, 092 human genomes. Nature, 491, 56.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B8] Cotsapas C. et al. (2011) Pervasive sharing of genetic effects in autoimmune disease. PLoS Genet., 7, e1002254.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B9] Ellinghaus D. et al. (2012) Combined analysis of genome-wide association studies for crohn disease and psoriasis identifies seven shared susceptibility loci. Am. J. Hum. Genet., 90, 636–647. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B10] ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B11] Giambartolomei C. et al. (2014) Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet., 10, e1004383.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B12] Hormozdiari F. et al. (2016) Colocalization of gwas and eqtl signals detects target genes. Am. J. Hum. Genet., 99, 1245–1260. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B13] Hu Y. et al. (2017) Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction. PLoS Genet., 13, e1006836.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B14] Hu Y. et al. (2017) Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput. Biol., 13, e1005589.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B15] Kundaje A. et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B16] Kichaev G. et al. (2014) Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet. 10, e1004722.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B17] Li Y., Kellis M. (2016) Joint bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Res., 44, e144. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B18] Metropolis N. et al. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys., 21, 1087–1092. [Google Scholar]

[bty254-B19] Pasaniuc B., Price A.L. (2017) Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet., 18, 117.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B20] Pickrell J.K. (2014) Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am. J. Hum. Genet., 94, 559–573. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B21] Ramos P.S. et al. (2011) A comprehensive analysis of shared loci between systemic lupus erythematosus (sle) and sixteen autoimmune diseases reveals limited genetic overlap. PLoS Genet., 7, e1002406.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B22] Richard-Miceli C., Criswell L.A. (2012) Emerging patterns of genetic overlap across autoimmune disorders. Genome Med., 4, 6.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B23] Shi H. et al. (2017) Local genetic correlation gives insights into the shared genetic architecture of complex traits. Am. J. Hum. Genet., 101, 737–751. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bty254-B24] Speliotes E.K. et al. (2010) Association analyses of 249, 796 individuals reveal 18 new loci associated with body mass index. Nat. Genet., 42, 937.. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

A unifying framework for joint trait analysis under a non-infinitesimal model

Ruth Johnson

Huwenbo Shi

Bogdan Pasaniuc

Sriram Sankararaman