Abstract
More and more large cohort studies have conducted or are conducting genome-wide association studies (GWAS) to reveal the genetic components of many complex human diseases. These large cohort studies often collected a broad array of correlated phenotypes that reflect common physiological processes. By jointly analyzing these correlated traits, we can gain more power by aggregating multiple weak effects and shed light on the mechanisms underlying complex human diseases. The majority of existing multi-trait association test methods are based on jointly modeling the multivariate traits conditional on the genotype as covariate, and can readily accommodate the imputed SNPs by using their imputed dosage as a covariate. An alternative class of multi-trait association tests is based on the inverted regression, which models the distribution of genotypes conditional on the covariate and multivariate traits, and has been shown to have competitive performance. To our knowledge, all existing inverted regression approaches have implicitly used the “best-guess” genotypes, which is not efficient and known to lead to dramatic power loss, and there have not been any proposed methods of incorporating imputation uncertainty into inverted regressions. In this work, we propose a general and efficient framework that can account for the imputation uncertainty to further improve the association test power of inverted regression models for imputed SNPs. We demonstrate through extensive numerical studies that the proposed method has competitive performance. We further illustrate its usefulness by application to association test of diabetes-related glycemic traits in the Atherosclerosis Risk in Communities (ARIC) Study.
Keywords: GEE, GWAS, Imputation, SNP
1 Introduction
Genetic studies often collect multiple phenotypes, which can be analyzed jointly to increase power by aggregating multiple weak effects and to provide additional insights into the etiology of complex human diseases (Solovieff et al., 2013). Existing multi-trait association test methods (see, e.g., Ferreira and Purcell, 2009; Liu et al., 2009; Yang et al., 2010; Rasmussen-Torvik et al., 2010; O’Reilly et al., 2012; Tang and Ferreira, 2012; van der Sluis et al., 2013; He et al., 2013; Schifano et al., 2013; Stephens, 2013; Seoane et al., 2014) can be broadly classified into two categories. The first jointly models the multiple correlated outcomes with a multivariate regression model. The second is based on the inverted regression model, where the genotypes are regressed on the covariates and multivariate outcomes to estimate and test the multi-trait associations, typically with an ordinal multinomial regression model. For example, O’Reilly et al. (2012) adopted the proportional odds model (POM), and Wu and Pankow (2015) proposed the adjacent category logit (ACL) model. For the multivariate regression based approach, it is straightforward to accommodate imputed SNPs by using their imputation dosages as the covariate. For the inverted regression approach, however, to our knowledge all existing methods have implicitly used the “best-guess” genotypes, which is inefficient and known to lead to dramatic power loss, and no method has been proposed in the literature to incorporate the imputation uncertainty into inverted regressions. We propose a general and efficient GEE modeling approach that extends the inverted regression model to multi-trait association tests of imputed SNPs.
2 Materials and Methods
2.1 Genotype based multinomial regression model
Consider a collection of continuous traits Y = (y1, … , ym)T, a p-vector of covariates X to be adjusted (which could contain both ancestry and non-ancestry covariates, e.g., ancestry principal components, age and gender), and a genotype score G (number of minor alleles). Assume the multivariate normal trait model, (Y |G,X) ~ N(γ0 + γXX + γG, Σ), where γ0 is an m-vector, γX is an m × p matrix, γ is an m-vector, and Σ is an m × m covariance matrix. The null hypothesis of no multi-trait association is H0 : γ = 0. When the population genotype distribution Pr(G|X) is modeled with a logistic regression model (which holds when, e.g., the genotypes follow the Hardy-Weinberg equilibrium within each ancestry population), we can derive an adjacent-category logit (ACL) model (Wu and Pankow, 2015)
$$\log\frac{\phi_g}{\phi_{g-1}} = \alpha_g + \beta_X^{T}X + \beta^{T}Y, \quad g = 1, 2, \tag{1}$$
where ϕg = Pr(G = g|X, Y ) is the conditional genotype distribution probability, αg is a genotype-specific intercept, βX is a p-vector, and β is an m-vector (specifically β = Σ−1γ). The multi-trait association amounts to testing H0 : β = 0. A closely related approach is the MultiPhen method (O’Reilly et al., 2012), which assumed the proportional odds model (POM) for analyzing the three genotypes
$$\log\frac{\Pr(G \ge g \mid X, Y)}{\Pr(G < g \mid X, Y)} = \tilde\alpha_g + \tilde\beta_X^{T}X + \tilde\beta^{T}Y, \quad g = 1, 2, \tag{2}$$
where α̃g are intercepts, β̃X is a p-vector, and β̃ is an m-vector. The multi-trait association amounts to testing H0 : β̃ = 0. In general the POM provides a good approximation to the ACL, and the two approaches have similar performance for directly genotyped/observed SNPs (Wu and Pankow, 2015). We remark that the inverted regression approach assumes that the genotypes are directly observed, and all existing methods have implicitly used the “best-guess” genotypes for imputed SNPs. For both the inverted regression and multivariate regression models, the main parameters of interest form a vector of length m. The inverted regression model has a smaller number of nuisance parameters, p+2, compared to the multivariate regression model, m+mp+m(m+1)/2.
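The relation β = Σ−1γ can be traced in a few lines with Bayes' rule. The following is a sketch under the multivariate normal trait model of this section; the shorthand η_g(X) for the log ratio of population genotype probabilities is introduced here only for illustration.

```latex
\begin{aligned}
\log\frac{\phi_g}{\phi_{g-1}}
 &= \underbrace{\log\frac{\Pr(G=g\mid X)}{\Pr(G=g-1\mid X)}}_{\eta_g(X)}
  + \log\frac{f(Y\mid G=g,X)}{f(Y\mid G=g-1,X)} \\
 &= \eta_g(X)
  + \gamma^{T}\Sigma^{-1}\bigl(Y-\gamma_0-\gamma_X X\bigr)
  - \frac{2g-1}{2}\,\gamma^{T}\Sigma^{-1}\gamma .
\end{aligned}
```

The only term involving Y is (Σ−1γ)T Y, so the coefficient of Y is the same for both adjacent-category logits and equals Σ−1γ; the remaining terms are absorbed into the intercepts and the covariate coefficients when η_g(X) is linear in X up to a genotype-specific constant.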
2.2 Genotype imputation
To facilitate SNP association studies and meta-analysis across studies, many ungenotyped SNPs are typically imputed based on an external reference panel of existing samples, e.g., the HapMap and the 1000 Genomes Project (Browning and Browning, 2009; Howie et al., 2009; Li et al., 2010). These imputation approaches rely on the intuition that individuals can share short stretches of haplotypes inherited from distant common ancestors. Once these stretches are identified using the genotyped SNPs, alleles for intervening SNPs that are not genotyped in the individuals can then be imputed based on those individuals with measured SNPs (i.e., reference panel samples) (Li et al., 2009, 2010). A typical imputation takes as input the haplotypes for polymorphic markers in the reference panel (e.g., the phased HapMap or 1000 Genomes chromosomes), and the directly genotyped markers in the individuals to be imputed. The sequence of markers is modeled as a mosaic of the set of reference haplotypes based on a Hidden Markov Model (HMM) (Li and Stephens, 2003; Stephens and Scheet, 2005). In the HMM, the reference haplotypes are treated as the hidden states, and the genotyped markers are treated as the observed signals. The HMM parameters are estimated iteratively and missing genotypes are sampled at each iteration based on the current HMM estimates. The sampled genotype counts over all iterations are aggregated to give an indication of the relative probability of observing each possible genotype (Li et al., 2010). The relative fractions of the three genotypes comprise the imputation scores for an imputed SNP.
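To make the two common summaries of imputation scores concrete, the sketch below computes the imputed dosage (expected minor-allele count) and the “best-guess” genotype (most probable genotype) for three individuals. The score values are made up for illustration and do not come from any study.

```python
import numpy as np

# Hypothetical imputation scores: one row per individual, columns are the
# posterior probabilities of genotypes 0, 1 and 2 (illustrative values only).
scores = np.array([
    [0.90, 0.09, 0.01],
    [0.40, 0.55, 0.05],
    [0.05, 0.45, 0.50],
])

# Imputed dosage: expected number of minor alleles, p1 + 2 * p2.
dosage = scores @ np.array([0.0, 1.0, 2.0])

# "Best-guess" genotype: the genotype with the largest imputation score.
best_guess = scores.argmax(axis=1)

print(dosage)
print(best_guess)
```

The second individual shows what the best-guess call discards: the call is genotype 1, while the scores still put substantial mass on genotype 0.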
In the following, we develop two modeling approaches for incorporating the imputation scores into the inverted regression. The first is rooted in weighted multinomial regression with robust GEE covariance estimates (Lipsitz et al., 1994; Preisser et al., 2002). The second is based on fractional multinomial regression modeling (Murteira and Ramalho, 2014), which is well suited to modeling the imputed genotype proportions. We will further show that these two modeling approaches are equivalent.
2.3 Association test of imputed SNPs: weighted multinomial regression
We develop a computationally fast weighted regression approach, where the imputation scores are treated as weights. Since the same sample will be used three times (for the three genotype scores), we need to take into account their dependence in the estimation of parameter covariance. The model-based covariance estimate from the independent weighted regression will under-estimate the variation. We propose to use the robust GEE sandwich covariance (Liang and Zeger, 1986). Specifically here we adopt the approach of Lipsitz et al. (1994) for modeling the multinomial outcomes, and the modeling framework of Preisser et al. (2002) for incorporating weights in the GEE.
For a collection of n unrelated individuals, denote Xi as the covariate, and Yi as the m-vector of outcomes for sample i = 1, … , n. Consider testing association of an imputed SNP. For the i-th sample, denote (pi0, pi1, pi2) as the imputation scores (posterior probabilities) of genotype 0, 1, 2. Denote Pr(Gi = k|Xi, Yi) = ϕik, i = 1, … , n,k = 0, 1, 2. We convert the genotype score into a bivariate indicator of being the first two genotypes: the genotype scores 0/1/2 are coded as (1,0), (0,1), (0,0) respectively. For the i-th sample, the three imputed genotypes (0,1,2) are represented by the working vector Gi = (1, 0, 0, 1, 0, 0)T. We define a probability vector μi = (ϕi0, ϕi1, ϕi0, ϕi1, ϕi0, ϕi1)T. Denote the imputation score matrix Wi = diag(pi0, pi0, pi1, pi1, pi2, pi2). Assume a block-diagonal working covariance matrix Vi with the 2×2 diagonal blocks equal to diag(ϕi0, ϕi1)–(ϕi0, ϕi1)T (ϕi0, ϕi1), which is the multinomial covariance matrix. Denote θ as the collection of all model parameters. We use the following estimating equations for model estimation and inference
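$$\sum_{i=1}^{n} U_i(\theta) = \sum_{i=1}^{n} D_i^{T} V_i^{-1} W_i \,(G_i - \mu_i) = 0, \tag{3}$$
where $D_i = \partial\mu_i/\partial\theta^{T}$.
The robust sandwich covariance of θ̂ can then be computed as $B^{-1}\Omega B^{-1}$, where $B = \sum_{i=1}^{n} D_i^{T} V_i^{-1} W_i D_i$ and $\Omega = \sum_{i=1}^{n} U_i U_i^{T}$, and we plug in the estimated θ̂.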
Let θ̂2 denote the m-vector of estimated regression parameters of main interest for the multivariate traits Y (i.e., β in the ACL and β̃ in the POM). Denote the corresponding covariance of θ̂2 as V. The statistic $Q = \hat\theta_2^{T} V^{-1} \hat\theta_2$, which asymptotically has a null m-DF chi-square distribution, can be used to test the multi-trait association. When genetic effects are similar across traits, we can further improve the multi-trait association test power using a 1-DF statistic to test linear combinations of θ2, following the line of O’Brien (1984). To test for similar or similarly scaled effects across different traits, we propose the test statistics $T = 1_m^{T}\hat\Sigma_0^{-1}V^{-1}\hat\theta_2 / (1_m^{T}\hat\Sigma_0^{-1}V^{-1}\hat\Sigma_0^{-1}1_m)^{1/2}$ and $T' = S^{T}\hat\Sigma_0^{-1}V^{-1}\hat\theta_2 / (S^{T}\hat\Sigma_0^{-1}V^{-1}\hat\Sigma_0^{-1}S)^{1/2}$, where 1m is a column vector of m ones, S = [diag(Σ̂0)]½ is the vector of residual standard deviations, and Σ̂0 is computed as the sample covariance matrix of the residual vector from regressing Y on X (see Appendix for details). Their p-values can be computed based on the standard normal distribution. This generic GEE modeling approach can be readily generalized to analyze imputed SNPs using any inverted regression method.
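As an illustration of how the omnibus and 1-DF statistics can be evaluated once θ̂2, V and Σ̂0 are available, here is a minimal sketch. It assumes the Wald form Q = θ̂2T V−1 θ̂2 for the omnibus test and the 1-DF forms given in the Appendix, and it uses two-sided normal p-values; the function name `multi_trait_tests` and the returned dictionary keys are hypothetical.

```python
import numpy as np
from scipy import stats

def multi_trait_tests(theta2, V, Sigma0):
    """Omnibus (m-DF) and two 1-DF tests from theta2-hat, its covariance V,
    and the residual trait covariance Sigma0 (all assumed already estimated)."""
    theta2 = np.asarray(theta2, dtype=float)
    V = np.asarray(V, dtype=float)
    Sigma0 = np.asarray(Sigma0, dtype=float)
    m = theta2.size
    Vinv_theta = np.linalg.solve(V, theta2)          # V^{-1} theta2

    # Omnibus Wald statistic, referred to a chi-square with m DF.
    Q = float(theta2 @ Vinv_theta)
    p_Q = float(stats.chi2.sf(Q, df=m))

    def one_df(c):
        # c' Sigma0^{-1} V^{-1} theta2 / sqrt(c' Sigma0^{-1} V^{-1} Sigma0^{-1} c)
        a = np.linalg.solve(Sigma0, c)
        z = float(a @ Vinv_theta / np.sqrt(a @ np.linalg.solve(V, a)))
        return z, float(2.0 * stats.norm.sf(abs(z)))  # two-sided (assumption)

    T, p_T = one_df(np.ones(m))                       # common effect
    Ts, p_Ts = one_df(np.sqrt(np.diag(Sigma0)))       # common scaled effect
    return {"Q": (Q, p_Q), "T": (T, p_T), "T_scaled": (Ts, p_Ts)}

# Toy usage with made-up inputs.
res = multi_trait_tests([1.0, 1.0], np.eye(2), np.eye(2))
print(res)
```

Solving linear systems instead of forming explicit inverses keeps the computation stable when V or Σ̂0 is close to singular, which can happen with highly correlated traits.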
In the following, we show that the proposed GEE modeling approach is equivalent to a fractional multinomial regression model, which provides more intuitive justifications to model the imputed genotype scores.
2.4 Genotype based fractional multinomial regression model
For the ith individual, note that its imputation scores (pi0, pi1, pi2) give the relative fractions of the three genotypes under ideal repeated sampling: for N individuals with the same characteristics (including covariate values) as the ith individual, N(pi0, pi1, pi2) will be the observed counts of the three genotypes, and naturally we can model them with a three-category multinomial distribution. Thus we can model the imputation scores with a multinomial distribution based quasi-likelihood, $L_i = \prod_{k=0}^{2}\phi_{ik}^{p_{ik}}$, and study the following log quasi-likelihood for parameter estimation, $L = \sum_{i=1}^{n}\sum_{k=0}^{2} p_{ik}\log\phi_{ik}$. This model is also known as the fractional multinomial regression model (Murteira and Ramalho, 2014). We maximize L to obtain the quasi-maximum likelihood estimates (QMLE) for the parameters, θ̃ = arg maxθ L, and compute the asymptotic covariance matrix based on the GEE as follows. Denote $\tilde U_i = \sum_{k=0}^{2} p_{ik}\,\partial\log\phi_{ik}/\partial\theta$. The estimator θ̃ is obtained by solving the estimating equations $\sum_{i=1}^{n}\tilde U_i = 0$, and its robust sandwich covariance matrix can then be computed as Ṽ = B̃−1ΩB̃−1, where $\tilde B = -\sum_{i=1}^{n}\partial\tilde U_i/\partial\theta^{T}$ and $\Omega = \sum_{i=1}^{n}\tilde U_i\tilde U_i^{T}$, and we plug in the estimated θ̃. We can show that this QMLE leads to the same estimates as the previous weighted GEE approach. Specifically, we can show that Ũi = Ui (see Appendix for technical derivations). This QMLE can be cast into a weighted multinomial regression model and can be readily and quickly solved using existing software.
Previous derivations have assumed the additive genetic model, and they can be easily extended to recessive and dominant genetic models (see Supplementary materials).
In the following we conduct simulation studies to investigate the performance of the proposed methods for testing the multi-trait association of imputed SNPs.
3 Simulation study
We simulate a standard normal covariate X1 and an ancestry Bernoulli covariate X2 with probability 0.5 (population indicator). The SNP genotype G is simulated from a Binomial distribution, Binom(2,f0), where the minor allele frequency (MAF) f0 = p0 + p1X2. We conducted simulations for testing m related traits of 1,000 unrelated individuals. Each time we simulate the m traits from a multivariate normal distribution with a compound symmetry correlation matrix with correlation ρ. The first trait has a variance of 2 and all the other traits have unit variance, i.e., (σ1², … , σm²) = (2, 1, … , 1). We set E(Yk) = 1 + 0.5X1 + 0.5X2 + γkG for odd index k, and E(Yk) = 1+X1 + X2 + γkG for even index k. For a given SNP G, we simulate its imputation probabilities from the Dirichlet distribution with parameters (α0, α1, α2), where αG = τ and αg = (1 − τ )/2 for g ≠ G, with larger τ reflecting higher imputation accuracy. We used 10^6 experiments to evaluate the type I error, and 10^4 experiments to evaluate the power under various combinations of (γ1, … , γm). We conducted simulations for various parameter settings. Here we report the results for m = 4, p0 = 0.3, p1 = 0.1, ρ = 0.2, 0.5, and τ = 0.8, 0.95. The conclusions remain the same for other settings.
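One replicate of this simulation design can be generated as sketched below. This is an illustrative reading of the description above, not the authors' code; the function name `simulate_once` and its defaults are hypothetical, and the Dirichlet draws are obtained by normalizing independent gamma variates so that each individual can have its own parameter vector.

```python
import numpy as np

def simulate_once(n=1000, m=4, rho=0.2, tau=0.8, p0=0.3, p1=0.1,
                  gamma=(0.2, 0.2, 0.2, 0.2), seed=0):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)                       # standard normal covariate
    x2 = rng.binomial(1, 0.5, size=n)             # ancestry indicator
    G = rng.binomial(2, p0 + p1 * x2)             # genotype with MAF p0 + p1 * X2

    # Compound-symmetry correlation with variances (2, 1, ..., 1).
    sd = np.sqrt(np.r_[2.0, np.ones(m - 1)])
    R = np.full((m, m), rho)
    np.fill_diagonal(R, 1.0)
    Sigma = R * np.outer(sd, sd)

    # Covariate slope 0.5 for odd-indexed traits and 1 for even-indexed traits.
    k = np.arange(1, m + 1)
    slope = np.where(k % 2 == 1, 0.5, 1.0)
    mean = 1.0 + np.outer(x1 + x2, slope) + np.outer(G, np.asarray(gamma, float))
    Y = mean + rng.multivariate_normal(np.zeros(m), Sigma, size=n)

    # Dirichlet imputation scores: parameter tau on the true genotype,
    # (1 - tau) / 2 on the other two genotypes.
    alpha = np.full((n, 3), (1.0 - tau) / 2.0)
    alpha[np.arange(n), G] = tau
    draws = rng.gamma(alpha)
    P = draws / draws.sum(axis=1, keepdims=True)
    return x1, x2, G, Y, P

x1, x2, G, Y, P = simulate_once()
print(Y.shape, P.shape, P[np.arange(len(G)), G].mean())
```

Under this Dirichlet specification the expected score on the true genotype equals τ, which is one way to read τ as an accuracy parameter.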
We studied the two inverted regression methods, the ACL and POM based GEE tests. For comparison we included the multiple linear regression model (MLM) based efficient GEE score tests (Avery et al., 2011; He et al., 2013), which have been shown to appropriately control the type I errors and to have the overall best performance compared to other methods (e.g., TATES of van der Sluis et al., 2013 and other univariate test based methods) in extensive numerical studies. All methods reported three p-values based on the m-DF omnibus test and two 1-DF tests assuming common or common scaled effects. Denote the respective three tests as (Qa, Ta, T′a) for the ACL GEE test, (Qo, To, T′o) for the POM GEE test, and (Qs, Ts, T′s) for the MLM GEE test. We use the imputed dosage as a covariate in the MLM GEE tests. In the Appendix, we show that the MLM GEE tests are essentially based on a joint model of the multivariate traits with the imputation dosage as a covariate. As a by-product, we derive very fast numerical algorithms for genome-wide association tests. For illustration, we also include the naive approach of modeling the “best-guess” genotypes for the two inverted regression methods, denoted as (Q̃a, T̃a, T̃′a) for the ACL, and (Q̃o, T̃o, T̃′o) for the POM respectively.
Table 1 summarizes the estimated type I errors. Overall we can see that for the two inverted regression (ACL and POM) based tests, using the “best-guess” genotypes leads to slightly conservative type I errors compared to their corresponding GEE tests that properly account for the imputation uncertainty. The ACL “best-guess” tests are generally more conservative compared to the corresponding POM “best-guess” tests, which control the type I error rate reasonably well. All ACL based tests appropriately control the type I errors. The POM GEE tests have slightly inflated type I errors at small significance level. The MLM GEE tests have well-controlled type I errors.
Table 1.
ρ | τ | α | Qa | Ta | T′a | Q̃a | T̃a | T̃′a | Qo | To | T′o | Q̃o | T̃o | T̃′o | Qs | Ts | T′s
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0.2 | 0.8 | 10⁻⁴ | 1.13 | 1.08 | 1.07 | 0.76 | 0.82 | 0.80 | 1.32 | 1.31 | 1.29 | 0.86 | 0.96 | 0.98 | 0.89 | 1.01 | 1.05
0.2 | 0.8 | 10⁻³ | 1.09 | 1.07 | 1.06 | 0.84 | 0.89 | 0.89 | 1.15 | 1.13 | 1.17 | 0.94 | 1.01 | 1.02 | 0.96 | 1.00 | 1.00
0.2 | 0.8 | 10⁻² | 1.05 | 1.03 | 1.04 | 0.92 | 0.95 | 0.95 | 1.08 | 1.05 | 1.05 | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 1.01
0.5 | 0.8 | 10⁻⁴ | 1.03 | 0.94 | 1.09 | 0.65 | 0.69 | 0.74 | 1.26 | 1.18 | 1.22 | 0.76 | 0.79 | 0.89 | 0.81 | 0.84 | 0.96
0.5 | 0.8 | 10⁻³ | 1.10 | 1.03 | 1.02 | 0.80 | 0.88 | 0.88 | 1.15 | 1.11 | 1.11 | 0.94 | 0.96 | 0.96 | 0.92 | 0.95 | 0.96
0.5 | 0.8 | 10⁻² | 1.06 | 1.02 | 1.03 | 0.92 | 0.94 | 0.94 | 1.09 | 1.06 | 1.04 | 0.99 | 0.98 | 0.99 | 0.98 | 0.98 | 0.99
0.2 | 0.95 | 10⁻⁴ | 0.95 | 1.00 | 0.99 | 0.72 | 0.83 | 0.81 | 1.48 | 1.25 | 1.20 | 0.92 | 0.94 | 0.96 | 0.69 | 0.92 | 0.93
0.2 | 0.95 | 10⁻³ | 1.05 | 1.03 | 1.05 | 0.82 | 0.84 | 0.90 | 1.17 | 1.09 | 1.10 | 0.96 | 0.92 | 0.97 | 0.97 | 0.98 | 1.02
0.2 | 0.95 | 10⁻² | 1.03 | 1.02 | 1.01 | 0.94 | 0.95 | 0.95 | 1.06 | 1.05 | 1.04 | 0.99 | 1.00 | 1.00 | 0.98 | 1.01 | 1.00
0.5 | 0.95 | 10⁻⁴ | 1.04 | 0.97 | 0.96 | 0.76 | 0.73 | 0.71 | 1.51 | 1.27 | 1.25 | 0.96 | 0.87 | 0.84 | 0.94 | 0.94 | 0.88
0.5 | 0.95 | 10⁻³ | 1.02 | 0.99 | 1.00 | 0.82 | 0.88 | 0.88 | 1.16 | 1.10 | 1.04 | 0.96 | 0.95 | 0.96 | 0.93 | 0.95 | 0.96
0.5 | 0.95 | 10⁻² | 0.99 | 1.01 | 1.01 | 0.90 | 0.94 | 0.95 | 1.05 | 1.02 | 1.04 | 0.97 | 0.99 | 0.99 | 0.96 | 0.99 | 1.00
Tables 2 and 3 summarize the power under τ = 0.8 and τ = 0.95 respectively. The 1-DF tests are the most powerful when either the γj or the γj/σj are close to each other. Not surprisingly, using the “best-guess” genotypes leads to power loss for the two inverted regression (ACL and POM) based tests, especially under lower imputation accuracy. The ACL GEE tests have performance comparable to the MLM GEE tests under relatively high imputation accuracy (τ = 0.95). For imputed SNPs with lower accuracy (τ = 0.8), the ACL GEE tests have improved power compared to the MLM GEE tests. Overall the POM GEE tests have the largest power among all methods, which needs to be interpreted with caution since the POM GEE tests have slightly inflated type I errors, as shown in Table 1. Under the same imputation uncertainty, when multiple traits have similar genetic effects, all tests have larger power under ρ = 0.2 than under ρ = 0.5; when genetic effects differ across traits, all tests have larger power under ρ = 0.5 than under ρ = 0.2. Thus the joint multi-trait association test works well when combining highly correlated traits with heterogeneous genetic effects or weakly correlated traits with similar genetic effects.
Table 2.
ρ | τ | (γ1, γ2, γ3, γ4) | Qa | Ta | T′a | Q̃a | T̃a | T̃′a | Qo | To | T′o | Q̃o | T̃o | T̃′o | Qs | Ts | T′s
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0.2 | 0.8 | (0.3,0,0,0) | 0.066 | 0 | 0.001 | 0.031 | 0 | 0 | 0.078 | 0 | 0.002 | 0.041 | 0 | 0.001 | 0.057 | 0 | 0.001
0.2 | 0.8 | (0.3,0.2,0.1,0) | 0.274 | 0.051 | 0.111 | 0.152 | 0.026 | 0.059 | 0.307 | 0.063 | 0.130 | 0.183 | 0.034 | 0.074 | 0.248 | 0.047 | 0.102
0.2 | 0.8 | (0.25,0.18,0.18,0.18) | 0.293 | 0.496 | 0.528 | 0.168 | 0.335 | 0.367 | 0.325 | 0.531 | 0.563 | 0.202 | 0.380 | 0.413 | 0.268 | 0.480 | 0.509
0.2 | 0.8 | (0.2,0.2,0.2,0.2) | 0.379 | 0.623 | 0.591 | 0.226 | 0.452 | 0.422 | 0.415 | 0.659 | 0.626 | 0.267 | 0.500 | 0.472 | 0.351 | 0.606 | 0.573
0.5 | 0.8 | (0.3,0,0,0) | 0.216 | 0 | 0.001 | 0.116 | 0 | 0 | 0.246 | 0 | 0.001 | 0.143 | 0 | 0.001 | 0.195 | 0 | 0.001
0.5 | 0.8 | (0.3,0.2,0.1,0) | 0.341 | 0.004 | 0.028 | 0.200 | 0.002 | 0.013 | 0.375 | 0.005 | 0.034 | 0.236 | 0.002 | 0.018 | 0.313 | 0.004 | 0.026
0.5 | 0.8 | (0.25,0.18,0.18,0.18) | 0.085 | 0.176 | 0.218 | 0.040 | 0.101 | 0.129 | 0.102 | 0.196 | 0.242 | 0.052 | 0.119 | 0.150 | 0.073 | 0.166 | 0.206
0.5 | 0.8 | (0.2,0.2,0.2,0.2) | 0.137 | 0.309 | 0.250 | 0.070 | 0.190 | 0.149 | 0.157 | 0.336 | 0.274 | 0.088 | 0.220 | 0.172 | 0.121 | 0.293 | 0.236
Table 3.
ρ | τ | (γ1, γ2, γ3, γ4) | Qa | Ta | T′a | Q̃a | T̃a | T̃′a | Qo | To | T′o | Q̃o | T̃o | T̃′o | Qs | Ts | T′s
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0.2 | 0.95 | (0.3,0,0,0) | 0.242 | 0 | 0.003 | 0.206 | 0 | 0.002 | 0.256 | 0 | 0.004 | 0.222 | 0 | 0.002 | 0.229 | 0 | 0.003
0.2 | 0.95 | (0.3,0.2,0.1,0) | 0.658 | 0.140 | 0.286 | 0.601 | 0.108 | 0.245 | 0.674 | 0.157 | 0.309 | 0.622 | 0.131 | 0.272 | 0.643 | 0.148 | 0.291
0.2 | 0.95 | (0.25,0.18,0.18,0.18) | 0.684 | 0.850 | 0.872 | 0.634 | 0.819 | 0.843 | 0.699 | 0.859 | 0.879 | 0.657 | 0.832 | 0.852 | 0.672 | 0.848 | 0.870
0.2 | 0.95 | (0.2,0.2,0.2,0.2) | 0.781 | 0.924 | 0.906 | 0.732 | 0.904 | 0.879 | 0.793 | 0.928 | 0.913 | 0.752 | 0.911 | 0.890 | 0.768 | 0.922 | 0.903
0.5 | 0.95 | (0.3,0,0,0) | 0.571 | 0 | 0.001 | 0.515 | 0 | 0 | 0.588 | 0 | 0.001 | 0.543 | 0 | 0.001 | 0.557 | 0 | 0.001
0.5 | 0.95 | (0.3,0.2,0.1,0) | 0.742 | 0.010 | 0.084 | 0.693 | 0.006 | 0.063 | 0.756 | 0.012 | 0.102 | 0.717 | 0.009 | 0.079 | 0.731 | 0.013 | 0.091
0.5 | 0.95 | (0.25,0.18,0.18,0.18) | 0.278 | 0.442 | 0.518 | 0.239 | 0.396 | 0.470 | 0.292 | 0.455 | 0.526 | 0.255 | 0.418 | 0.486 | 0.268 | 0.436 | 0.514
0.5 | 0.95 | (0.2,0.2,0.2,0.2) | 0.417 | 0.666 | 0.580 | 0.367 | 0.617 | 0.526 | 0.432 | 0.677 | 0.589 | 0.390 | 0.633 | 0.547 | 0.403 | 0.662 | 0.577
We also performed simulation studies for less frequent and rare variants (MAF of 0.1, 0.05, and 0.01). The complete results are available in the Supplementary Materials. The overall conclusions remain the same.
4 ARIC GWAS of diabetes-related glycemic traits
The Atherosclerosis Risk in Communities (ARIC) study (The ARIC Investigators, 1989) is a multi-center prospective investigation of atherosclerotic disease in men and women aged 45–64 years at baseline. They were recruited from four U.S. communities: Forsyth County, North Carolina; Jackson, Mississippi; suburban areas of Minneapolis, Minnesota; and Washington County, Maryland. A total of 15,792 individuals participated in the baseline examination in 1987–1989. The vast majority of ARIC participants are of European (73%) or African ancestry (26%). Among 15,792 ARIC participants, we jointly analyzed the four fasting glucose levels of 5947 genotyped ARIC white participants who were non-diabetic at four visits measured approximately three years apart. Excluded from the analysis are a total of 9845 participants due to the following reasons: (1) 4314 participants are non-white; (2) 2751 participants do not complete all four visits; (3) 1556 participants have diabetes diagnosis or unknown diabetes status at any of the four visits; (4) 373 participants have no fasting glucose measurements for at least one of the four visits; (5) 851 participants do not have GWAS data. All ARIC participants have complete information on age, gender, and study center. The ARIC Study design, plasma glucose measurement, genotyping and other covariates have been described previously (Rasmussen-Torvik et al., 2010). The glucose levels had an average correlation of 0.55 between visits. We applied an additive genetic model and adjusted for age, gender and study center (population indicators).
For illustration, we analyze the typed and imputed SNPs on chromosomes 1 and 2. We test the common SNPs with MAF ≥ 0.05 and imputation R2 ≥ 0.3, which leads to 163,048 and 189,023 SNPs on chromosomes 1 and 2 respectively. There were no genome-wide significant SNPs (p-value ≤ 5 × 10−8) on chromosome 1, and multiple significant SNPs on chromosome 2. Specifically, for the three tests (the m-DF omnibus test and the two 1-DF tests assuming common or common scaled effects), the ACL GEE tests (Qa, Ta, T′a) identified 56, 60, and 60 significant SNPs, the POM GEE tests (Qo, To, T′o) identified 56, 56, and 59 SNPs, and the MLM GEE tests (Qs, Ts, T′s) identified 56, 59, and 60 SNPs. All the identified SNPs are genome-wide significant in a meta-analysis of 21 fasting glucose GWAS with around 46,186 non-diabetic participants conducted by the MAGIC Consortium (Dupuis et al., 2010). Compared to the MLM test Ts, the ACL test Ta identified one additional genome-wide significant SNP, rs1260326, with a p-value of 3.3 × 10−8. The p-value reported by the MAGIC meta-analysis of fasting glucose was 4.3 × 10−13. Compared to the POM test To, the ACL test Ta identified four additional genome-wide significant SNPs, rs1260326, rs574981, rs549410 and rs550151, with p-values of 3.3×10−8, 9.1×10−9, 9.1×10−9, and 7.5×10−9 respectively. Their respective p-values reported by the MAGIC meta-analysis of fasting glucose were 4.3 × 10−13, 8.6 × 10−14, 1.7 × 10−13, and 1.2 × 10−13.
All identified significant SNPs on chromosome 2 are imputed, with imputation R2 in the range of 0.90 to 0.9998. To our knowledge, all previous inverted regression approaches have implicitly used the “best-guess” genotypes. When using the “best-guess” genotypes, three of the inverted regression tests lost SNPs at the genome-wide significance level compared to their corresponding GEE tests using the imputation scores: one test missed one significant SNP, rs1260326; another missed three significant SNPs, rs574981, rs549410 and rs550151; and a third missed one SNP, rs1260326.
The QQ-plots of p-values for chromosome 1 and 2 SNPs are shown in Figures 1 and 2 respectively. We also compute the genomic control (GC) parameters, defined here as the mean of the 1-DF chi-square test statistics, and the mean of the 4-DF chi-square test statistics scaled by four. The three methods have similar GC values: 1.01–1.02 for chromosome 1 and 1.04–1.10 for chromosome 2.
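The mean-based GC parameter described above is a one-line computation. The sketch below applies it to simulated null statistics, since the ARIC statistics are not available here; the function name `gc_mean` is hypothetical, and for the 1-DF tests the chi-square statistic is taken to be the squared normal test statistic.

```python
import numpy as np

def gc_mean(chisq_stats, df):
    """Mean of chi-square statistics divided by their degrees of freedom."""
    return float(np.mean(np.asarray(chisq_stats, dtype=float)) / df)

# Simulated null statistics standing in for genome-wide results.
rng = np.random.default_rng(0)
null_q = rng.chisquare(df=4, size=200_000)   # 4-DF omnibus statistics
null_t = rng.normal(size=200_000) ** 2       # squared 1-DF statistics

print(gc_mean(null_q, 4), gc_mean(null_t, 1))
```

Values near 1 indicate statistics that behave as expected under the null on average; values well above 1 suggest inflation.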
5 Discussion
Most existing GWAS have primarily focused on testing single trait associations, which has led to the discovery of many genome-wide significant variants for many human diseases and traits. However, for most complex human diseases and traits, the heritability or trait variance explained by these identified variants remains very small, which indicates substantial “missing heritability” and more variants with small or moderate effects yet to be discovered. Recently there have been many efforts to conduct joint association tests of correlated traits that reflect common physiological processes, in order to identify more genetic variants and provide additional insights into the disease etiology. Testing multiple correlated traits can aggregate weak variant effects to improve the genetic association test power. Among the existing multi-trait association test methods, the inverted regression approach models the conditional distribution of genotypes given covariates and multivariate traits, and provides a convenient and powerful approach with competitive performance. However, analyzing imputed SNPs is not straightforward for the inverted regression models, in contrast to the trait based regression modeling approach, which can readily use the imputed dosage as a covariate. In this paper, we proposed a general GEE based inverted regression modeling method to appropriately and efficiently test the multi-trait association of imputed SNPs. We show that the naive approach of analyzing “best-guess” genotypes can lead to dramatic power loss, while the proposed GEE based modeling approach offers much improved power and has comparable or larger power compared to the dosage based trait regression modeling approach.
For genome-wide association analyses, speed and robustness are both key issues. The proposed GEE modeling approach is robust and computationally fast. It is worthwhile to explore likelihood based approaches (e.g., mixed effects modeling or, more generally, likelihood ratio tests), which could bring more power under correct model assumptions than the typical Wald test based GEE modeling approach.
In this paper, we have focused on the multiple continuous traits association test of single variants. It is worthwhile to extend the inverted regression methods to association test at the gene level (Guo et al., 2013; van der Sluis et al., 2015), and generally to joint association test of mixed outcomes.
Supplementary Material
Acknowledgments
This research was supported in part by NIH grant GM083345 and CA134848. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We want to thank the associate editor and reviewers for their constructive comments which have greatly improved the presentation of the paper.
The ARIC Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C), R01HL087641, R01HL59367 and R01HL086694; National Human Genome Research Institute contract U01HG004402; and National Institutes of Health contract HHSN268200625226C. The authors thank the staff and participants of the ARIC study for their important contributions. Infrastructure was partly supported by Grant Number UL1RR025005, a component of the National Institutes of Health and NIH Roadmap for Medical Research.
Appendix
Equivalence of QMLE and weighted GEE estimates
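In the weighted GEE approach, the three imputed genotypes (0,1,2) are represented by the working vector Gi = (1, 0, 0, 1, 0, 0)T for the i-th sample. Denote a probability vector μi = (ϕi0, ϕi1, ϕi0, ϕi1, ϕi0, ϕi1)T. Denote the imputation score matrix Wi = diag(pi0, pi0, pi1, pi1, pi2, pi2). Assume a block-diagonal working covariance matrix Vi with the 2 × 2 diagonal blocks equal to Ai = diag(ϕi0, ϕi1) − (ϕi0, ϕi1)T (ϕi0, ϕi1), which is the multinomial covariance matrix. Denote θ as the collection of all model parameters. The weighted GEE for the i-th sample is defined as $U_i = D_i^{T} V_i^{-1} W_i (G_i - \mu_i)$, where $D_i = \partial\mu_i/\partial\theta^{T}$. First note that
$$A_i^{-1} = \mathrm{diag}\left(\phi_{i0}^{-1}, \phi_{i1}^{-1}\right) + \phi_{i2}^{-1}\, 1_2 1_2^{T},$$
and we can check that $A_i^{-1}\{(1,0)^{T} - (\phi_{i0},\phi_{i1})^{T}\} = (\phi_{i0}^{-1}, 0)^{T}$, $A_i^{-1}\{(0,1)^{T} - (\phi_{i0},\phi_{i1})^{T}\} = (0, \phi_{i1}^{-1})^{T}$, and $A_i^{-1}(\phi_{i0},\phi_{i1})^{T} = \phi_{i2}^{-1} 1_2$. Therefore we have
$$U_i = \frac{p_{i0}}{\phi_{i0}}\frac{\partial\phi_{i0}}{\partial\theta} + \frac{p_{i1}}{\phi_{i1}}\frac{\partial\phi_{i1}}{\partial\theta} - \frac{p_{i2}}{\phi_{i2}}\left(\frac{\partial\phi_{i0}}{\partial\theta} + \frac{\partial\phi_{i1}}{\partial\theta}\right).$$
Note that ϕi2 = 1 − ϕi0 − ϕi1, and hence we have
$$U_i = \sum_{k=0}^{2}\frac{p_{ik}}{\phi_{ik}}\frac{\partial\phi_{ik}}{\partial\theta} = \sum_{k=0}^{2} p_{ik}\frac{\partial\log\phi_{ik}}{\partial\theta} = \tilde U_i .$$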
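The equivalence Ũi = Ui can also be checked numerically for a single sample. The sketch below builds the six-dimensional working quantities (Gi, μi, Wi, Vi, Di) exactly as defined in this appendix and compares the weighted GEE contribution with the quasi-score. The softmax parameterization with a random Jacobian `J` is an assumption made only to obtain some valid derivatives of the ϕik; the function name is hypothetical.

```python
import numpy as np

def score_equivalence_check(seed=1, q=4):
    rng = np.random.default_rng(seed)
    J = rng.normal(size=(3, q))                  # hypothetical d(logits)/d(theta)
    phi = rng.dirichlet(np.ones(3))              # model probabilities (phi_i0, phi_i1, phi_i2)
    p = rng.dirichlet(np.ones(3))                # imputation scores (p_i0, p_i1, p_i2)

    # Softmax derivatives: d(phi)/d(theta) = (diag(phi) - phi phi^T) J, 3 x q.
    dphi = (np.diag(phi) - np.outer(phi, phi)) @ J

    # Weighted GEE contribution D^T V^{-1} W (G - mu).
    Delta = dphi[:2]                             # derivatives of (phi_i0, phi_i1)
    D = np.vstack([Delta, Delta, Delta])
    A = np.diag(phi[:2]) - np.outer(phi[:2], phi[:2])
    V = np.kron(np.eye(3), A)                    # block-diagonal working covariance
    W = np.diag(np.repeat(p, 2))                 # diag(p0, p0, p1, p1, p2, p2)
    G = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
    mu = np.tile(phi[:2], 3)
    U_gee = D.T @ np.linalg.solve(V, W @ (G - mu))

    # Quasi-score: sum_k p_k * d log(phi_k) / d theta.
    U_qmle = (dphi / phi[:, None]).T @ p
    return U_gee, U_qmle

U_gee, U_qmle = score_equivalence_check()
print(np.max(np.abs(U_gee - U_qmle)))
```

The agreement relies only on the imputation scores summing to one and on ϕi2 = 1 − ϕi0 − ϕi1, so it does not depend on the particular regression model used for the ϕik.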
GEE score test of multiple continuous traits
Here we show that for multiple continuous traits, the MLM GEE test of He et al. (2013) is essentially based on a joint model of the multivariate traits with the imputation dosage as a covariate. Given the observations, denote the n × p covariate matrix as X (intercept included), the genotype dosage vector as G, and the kth outcome vector as Yk, k = 1, … , K. Consider the following joint multivariate linear regression model, Yk = Xαk + Gβk + Ek, where Ek = (εk1, … , εkn)T. We model the error vector with a zero-mean multivariate normal distribution with $\mathrm{Var}(\varepsilon_{ki}) = \sigma_k^2$ and Corr(εki, εli) = ρkl. Denote the n × n projection matrix H = X(XTX)−1XT. The scaled score statistic for testing βk is uk = (Yk − HYk)TG/σ̂k. Denote the score vector U = (u1, … , uK)T. Since I − H is symmetric and idempotent, we can equivalently write $u_k = (Y_k - HY_k)^{T}(G - HG)/\hat\sigma_k$. Hence we have asymptotically Var(uk) = ||G−HG||² and Cov(uk, ul) = ||G−HG||²ρkl. The MLM GEE test is based on U and its estimated covariance, $Q_s = U^{T}\,\widehat{\mathrm{Cov}}(U)^{-1}\,U$, which asymptotically follows a K-DF chi-square distribution under the null. He et al. (2013) consistently estimated Cov(U) based on the efficient score vectors (Lin, 2005a,b). We can easily verify that the efficient score vectors are Zk = (Yk − HYk) ∘ (G − HG)/σ̂k, where ∘ is the Hadamard product (element-wise product). Note that the residual vectors Yk – HYk and H can be pre-computed, and we just need to compute G – HG for each SNP to test the genome-wide multi-trait associations.
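The fast computation described above can be sketched as follows. This is an illustrative implementation under two labeled assumptions: σ̂k is taken as the residual standard deviation from regressing Yk on X, and Cov(U) is estimated by the sum of outer products of the per-sample efficient scores. The function name `mlm_gee_score_test` is hypothetical.

```python
import numpy as np
from scipy import stats

def mlm_gee_score_test(Y, X, G):
    """Y: (n, K) traits; X: (n, p) covariates including intercept; G: (n,) dosage.
    Returns the K-DF statistic, its p-value, and the score vector U."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    G = np.asarray(G, dtype=float)
    n, K = Y.shape
    p = X.shape[1]

    # Orthonormal basis of the column space of X, so that H M = Qx (Qx^T M)
    # without forming the n x n projection matrix H.
    Qx, _ = np.linalg.qr(X)

    # Pre-computable once per genome scan: trait residuals and their scales.
    Ry = Y - Qx @ (Qx.T @ Y)
    sigma = np.sqrt((Ry ** 2).sum(axis=0) / (n - p))      # assumption: residual SD

    # Per-SNP work: residualize the dosage and form the efficient scores.
    Rg = G - Qx @ (Qx.T @ G)
    Z = (Ry / sigma) * Rg[:, None]                        # n x K, columns are Z_k
    U = Z.sum(axis=0)                                     # u_k = sum_i Z_ki
    C = Z.T @ Z                                           # assumption: empirical Cov(U)
    Q = float(U @ np.linalg.solve(C, U))
    return Q, float(stats.chi2.sf(Q, df=K)), U

# Toy usage on simulated null data.
rng = np.random.default_rng(0)
n = 300
X_demo = np.column_stack([np.ones(n), rng.normal(size=n)])
Y_demo = rng.normal(size=(n, 3))
G_demo = rng.binomial(2, 0.3, size=n).astype(float)
print(mlm_gee_score_test(Y_demo, X_demo, G_demo)[:2])
```

In a genome-wide scan only the two dosage lines and the final small K × K solve are repeated per SNP, which is where the speed-up comes from.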
1-DF multi-trait association test
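Consider $U = a^T \hat\theta_2$. $U$ asymptotically follows a normal distribution, $U \sim N(a^T \eta, a^T V a)$, where $\eta$ is the true value of $\theta_2$. For the ACL (1), we have $\theta_2 = \Sigma^{-1}\gamma$, where $\Sigma$ is the covariance matrix of $Y$ and $\gamma$ is the vector of corresponding marginal genetic effects. For the POM (2), we assume $\theta_2 \approx \Sigma^{-1}\gamma$ since the POM approximates the ACL. Assuming a common genetic effect $\nu$ for all traits, we have $\eta = \nu \Sigma^{-1} 1_m$, where $1_m$ is the $m$-vector of ones. The effect size of $U$ is then proportional to $\nu (a^T \Sigma^{-1} 1_m)/(a^T V a)^{1/2} = \nu b^T V^{-1/2} \Sigma^{-1} 1_m$, where $b = V^{1/2} a/(a^T V a)^{1/2}$ (note $b^T b = 1$). By the Cauchy-Schwarz inequality, taking $b \propto V^{-1/2} \Sigma^{-1} 1_m$, that is $a \propto V^{-1} \Sigma^{-1} 1_m$, will maximize the effect size. Therefore we use the following statistic

$$T = \frac{1_m^T \Sigma^{-1} V^{-1} \hat\theta_2}{(1_m^T \Sigma^{-1} V^{-1} \Sigma^{-1} 1_m)^{1/2}}.$$

With a common scaled genotype effect for all traits, we have $\eta = \nu \Sigma^{-1} S$, where $S = [\mathrm{diag}(\Sigma)]^{1/2}$ is the vector of trait standard deviations. Similarly we can derive $T' = S^T \Sigma^{-1} V^{-1} \hat\theta_2/(S^T \Sigma^{-1} V^{-1} \Sigma^{-1} S)^{1/2}$. In practice we estimate $\Sigma$ by $\hat\Sigma_0$, the sample covariance matrix of $\tilde Y$, the residuals from regressing $Y$ on $X$.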
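As a small illustration of the maximization argument, the sketch below computes a statistic of the form $a^T \hat\theta_2/(a^T V a)^{1/2}$ with $a = V^{-1}\Sigma^{-1}w$, where $w = 1_m$ gives the common-effect version and $w = S$ gives the scaled version $T'$. The function name `one_df_stat` and the simulated inputs are assumptions made for illustration only; in practice $\hat\theta_2$, $V$ and $\hat\Sigma_0$ would come from the fitted inverted regression model.

```python
import numpy as np

def one_df_stat(theta2_hat, V, Sigma, scaled=False):
    """1-DF statistic a^T theta2_hat / sqrt(a^T V a), a = V^{-1} Sigma^{-1} w.

    w is the vector of ones (common effect) or the trait standard
    deviations sqrt(diag(Sigma)) (common scaled effect).
    """
    m = len(theta2_hat)
    w = np.sqrt(np.diag(Sigma)) if scaled else np.ones(m)
    a = np.linalg.solve(V, np.linalg.solve(Sigma, w))
    return a @ theta2_hat / np.sqrt(a @ V @ a)

# Hypothetical inputs: random symmetric positive-definite V and Sigma.
rng = np.random.default_rng(1)
m = 4
B = rng.normal(size=(m, m))
Sigma = B @ B.T + m * np.eye(m)
R = rng.normal(size=(m, m))
V = R @ R.T + m * np.eye(m)
theta2_hat = rng.normal(size=m)

T = one_df_stat(theta2_hat, V, Sigma)                      # common effect
T_scaled = one_df_stat(theta2_hat, V, Sigma, scaled=True)  # common scaled effect
print(T, T_scaled)
```

Under the null each statistic is compared with a standard normal distribution. Using linear solves instead of explicit matrix inverses is a numerical convenience and does not change the statistic.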
References
- Avery CL, He Q, North KE, Ambite JL, Boerwinkle E, Fornage M, Hindorff LA, Kooperberg C, Meigs JB, Pankow JS, Pendergrass SA, Psaty BM, Ritchie MD, Rotter JI, Taylor KD, Wilkens LR, Heiss G, Lin DY. A phenomics-based strategy identifies loci on APOC1, BRAP, and PLCG1 associated with metabolic syndrome phenotype domains. PLoS Genet. 2011;7(10):e1002322. doi: 10.1371/journal.pgen.1002322.
- Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. American Journal of Human Genetics. 2009;84(2):210–223. doi: 10.1016/j.ajhg.2009.01.005.
- Dupuis J, Langenberg C, Prokopenko I, Saxena R, Soranzo N, Jackson AU, et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics. 2010;42(2):105–116. doi: 10.1038/ng.520.
- Ferreira MAR, Purcell SM. A multivariate test of association. Bioinformatics. 2009;25(1):132–133. doi: 10.1093/bioinformatics/btn563.
- Guo X, Liu Z, Wang X, Zhang H. Genetic association test for multiple traits at gene level. Genetic Epidemiology. 2013;37(1):122–129. doi: 10.1002/gepi.21688.
- He Q, Avery CL, Lin DY. A general framework for association tests with multivariate traits in large-scale genomics studies. Genetic Epidemiology. 2013;37(8):759–767. doi: 10.1002/gepi.21759.
- Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5(6):e1000529. doi: 10.1371/journal.pgen.1000529.
- Klei L, Luca D, Devlin B, Roeder K. Pleiotropy and principal components of heritability combine to increase power for association analysis. Genetic Epidemiology. 2008;32(1):9–19. doi: 10.1002/gepi.20257.
- Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165(4):2213–2233. doi: 10.1093/genetics/165.4.2213.
- Li Y, Willer C, Sanna S, Abecasis G. Genotype imputation. Annual Review of Genomics and Human Genetics. 2009;10:387–406. doi: 10.1146/annurev.genom.9.081307.164242.
- Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology. 2010;34(8):816–834. doi: 10.1002/gepi.20533.
- Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
- Lin DY. On rapid simulation of P values in association studies. American Journal of Human Genetics. 2005a;77(3):513–514. doi: 10.1086/432817.
- Lin DY. An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics. 2005b;21(6):781–787. doi: 10.1093/bioinformatics/bti053.
- Lipsitz SR, Kim K, Zhao L. Analysis of repeated categorical data using generalized estimating equations. Statistics in Medicine. 1994;13(11):1149–1163. doi: 10.1002/sim.4780131106.
- Liu J, Pei Y, Papasian CJ, Deng HW. Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations. Genetic Epidemiology. 2009;33(3):217–227. doi: 10.1002/gepi.20372.
- Murteira JMR, Ramalho JJS. Regression analysis of multivariate fractional data. Econometric Reviews. 2014; in press. doi: 10.1080/07474938.2013.806849.
- O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40(4):1079–1087.
- O’Reilly PF, Hoggart CJ, Pomyen Y, Calboli FCF, Elliott P, Jarvelin MR, Coin LJM. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS ONE. 2012;7(5):e34861. doi: 10.1371/journal.pone.0034861.
- Preisser JS, Lohman KK, Rathouz PJ. Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. Statistics in Medicine. 2002;21(20):3035–3054. doi: 10.1002/sim.1241.
- Rasmussen-Torvik LJ, Alonso A, Li M, Kao W, Köttgen A, Yan Y, Couper D, Boerwinkle E, Bielinski SJ, Pankow JS. Impact of repeated measures and sample selection on genome-wide association studies of fasting glucose. Genetic Epidemiology. 2010;34(7):665–673. doi: 10.1002/gepi.20525.
- Schifano E, Li L, Christiani D, Lin X. Genome-wide association analysis for multiple continuous secondary phenotypes. American Journal of Human Genetics. 2013;92(5):744–759. doi: 10.1016/j.ajhg.2013.04.004.
- Seoane JA, Campbell C, Day INM, Casas JP, Gaunt TR. Canonical correlation analysis for gene-based pleiotropy discovery. PLoS Comput Biol. 2014;10(10):e1003876. doi: 10.1371/journal.pcbi.1003876.
- Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics. 2013;14(7):483–495. doi: 10.1038/nrg3461.
- Stephens M. A unified framework for association analysis with multiple related phenotypes. PLoS ONE. 2013;8(7):e65245. doi: 10.1371/journal.pone.0065245.
- Stephens M, Scheet P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. American Journal of Human Genetics. 2005;76(3):449–462. doi: 10.1086/428594.
- Tang CS, Ferreira MAR. A gene-based test of association using canonical correlation analysis. Bioinformatics. 2012;28(6):845–850. doi: 10.1093/bioinformatics/bts051.
- The ARIC Investigators. The atherosclerosis risk in communities (ARIC) study: design and objectives. American Journal of Epidemiology. 1989;129(4):687–702.
- van der Sluis S, Dolan CV, Li J, Song Y, Sham P, Posthuma D, Li MX. MGAS: a powerful tool for multivariate gene-based genome-wide association analysis. Bioinformatics. 2015;31(7):1007–1015. doi: 10.1093/bioinformatics/btu783.
- van der Sluis S, Posthuma D, Dolan CV. TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet. 2013;9(1):e1003235. doi: 10.1371/journal.pgen.1003235.
- Wu B, Pankow JS. Statistical methods for association tests of multiple continuous traits in genome-wide association studies. Annals of Human Genetics. 2015;79(4):282–293. doi: 10.1111/ahg.12110.
- Yang Q, Wu H, Guo CY, Fox CS. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genetic Epidemiology. 2010;34(5):444–454. doi: 10.1002/gepi.20497.