Abstract
Multiple correlated traits are often collected in genetic studies. By jointly analyzing multiple traits, we can increase power by aggregating multiple weak effects and gain additional insights into the genetic architecture of complex human diseases. In this article, we propose a multivariate linear regression-based method to test the joint association of multiple quantitative traits. It accommodates arbitrary covariates and has very accurate control of type I errors. We also discuss fast and accurate p value computation, especially for genome-wide association studies with small-to-medium sample sizes. We demonstrate through extensive numerical studies that the proposed method has competitive performance. Its usefulness is further illustrated with an application to genome-wide association analysis of diabetes-related traits in the Atherosclerosis Risk in Communities (ARIC) study, where we found associations with diabetes-related traits that have not been reported before. We implemented the proposed methods in a publicly available R package.
1. Introduction
Over the past ten years, many epidemiologic studies have used genome-wide association studies (GWAS) to identify genetic components of many complex human diseases. These large cohort studies often collected a broad array of correlated traits that often reflect common physiological processes. By jointly analyzing these correlated traits, we can often gain more power by aggregating multiple weak effects and shed light on the mechanisms underlying complex human diseases [1].
Many methods have been proposed recently to detect SNP associations with multiple correlated traits (see, e.g., [2–13]). A direct approach is based on the minimum trait p value [6], which typically requires permutations to compute the significance p value. A related approach is the trait-based association test using an extended Simes procedure (TATES; [10]), which combines the univariate trait p values while correcting for the correlations among the multivariate traits. Various dimension reduction methods have also been proposed; they summarize the multivariate traits into a univariate outcome and then apply the traditional univariate association test. Examples include principal component analysis (PCA) [2], principal components of heritability (PCH) [3], and averaging longitudinally observed traits [7, 14]. PCA is an unsupervised dimension reduction method, and the top PC may not necessarily reflect the association signal. Sample splitting is typically used in PCH for significance calculations and may lead to loss of power.
Multivariate trait testing methods generally perform better than univariate analysis-based approaches [15]. Among the multivariate testing methods, a popular approach is canonical correlation analysis (CCA) [4, 16, 17], which is fast to compute but inflexible and unable to accommodate covariates. Liu et al. [5] proposed a GEE model [18] to jointly analyze one continuous and one binary trait. In Avery et al. [19] and He et al. [11], GEE-based marginal generalized linear modeling of multivariate traits is adopted for efficient multitrait association testing. Schifano et al. [20] proposed a closely related GEE-based scaled marginal association test of multiple secondary continuous traits. Sitlani et al. [13] explored GEE modeling of longitudinally measured traits for association testing. These GEE-based methods typically avoid explicitly modeling the trait correlations. Another set of multivariate approaches is based on the inverted regression of genotypes to test the overall trait effects. For example, proportional odds regression modeling of genotypes was proposed as a convenient approach to testing multitrait associations [8, 21, 22]. A related adjacent category logistic regression of genotypes was proposed by Wu and Pankow [12]. The inverted regression approach does not easily accommodate imputed SNPs and has generally used the “best-guess” genotypes, which is known to lead to a loss of power. In contrast, the multivariate trait regression approach can easily test imputed SNPs by using the imputation dosage as a covariate.
In this article, we explore an alternative multivariate regression framework to explicitly model the trait correlation and adjust for covariates to test multitrait associations. We compute the analytical p values for the proposed tests based on the F-distributions that offer very accurate type I error control with good finite sample performance. We also exploit the parallel nature of genome-wide association test to develop very efficient numerical algorithms that are extremely scalable to genome-wide association tests of millions of SNPs. We demonstrate through extensive numerical studies that the proposed methods have very competitive performance compared to existing methods. We further illustrate the usefulness of the proposed methods through an application to genome-wide association study of multiple diabetes-related glycemic traits.
2. Methods
We first discuss a multivariate linear regression-based framework for modeling the multiple quantitative traits and then derive the Wald type statistics for testing multitrait associations.
2.1. Multivariate Linear Regression Model
Consider m continuous traits Y = (y1,…, ym)T, a covariate vector X = (x1,…, xp)T of length p (which could contain an ancestry indicator or principal components), and a genotype score G coding the number of minor alleles. Consider the multivariate normal trait model:
$$Y = \beta_0 + \beta_X X + \beta_1 G + \epsilon, \tag{1}$$
where β0 is a vector of length m, βX is an m × p matrix, β1 is a vector of length m, and the random error ϵ is of length m and is assumed to follow a zero mean multivariate normal distribution with covariance Σ, ϵ ~ N(0, Σ). Multivariate trait association amounts to testing H0 : β1 = 0. Here we have assumed the same covariates for all traits, which is the case for our ARIC study GWAS example (see Application to ARIC GWAS of Glycemic Traits) and many typical GWAS. In the supplementary materials, we discuss the possible scenario with different covariates for each trait. The trait model (1) is a multivariate linear model (MLM; see, e.g., [23, chapter 8] and [24, chapter 9]).
Given observations for n unrelated individuals, for individual i, denote Yi as the outcome, Xi as the covariate, and Gi as the genotype score. Denote Y = (Y1,…, Yn)T, X = (X1,…, Xn)T, G = (G1,…, Gn)T, and design matrix Z = (1n, X, G) of dimension n × (p + 2), where 1n = (1,…, 1)T is a column vector of n ones.
Denote the m × (p + 2) parameter matrix β = (β0, βX, β1). We can check that the maximum likelihood estimators (MLEs) are (see, e.g., [23, p. 294])
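$$\hat{\beta}^{T} = (Z^{T}Z)^{-1} Z^{T} Y, \qquad \hat{\Sigma} = \frac{1}{n}\,(Y - Z\hat{\beta}^{T})^{T}(Y - Z\hat{\beta}^{T}). \tag{2}$$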
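As a minimal sketch of the estimation step, the snippet below computes the MLEs for one SNP with NumPy, assuming the standard closed forms for a multivariate linear model: least-squares coefficients and a residual cross-product divided by n. The function name `mlm_mle` and the simulated inputs are illustrative assumptions, not part of the authors' R package.

```python
import numpy as np

def mlm_mle(Y, X, G):
    """MLEs for the multivariate linear model Y = b0 + bX X + b1 G + error.

    Y: (n, m) traits; X: (n, p) covariates; G: (n,) genotype scores.
    Returns beta_hat with shape (m, p + 2) and Sigma_hat with shape (m, m).
    """
    n = Y.shape[0]
    Z = np.column_stack([np.ones(n), X, G])        # n x (p + 2) design matrix
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # (p + 2) x m least-squares fit
    resid = Y - Z @ coef
    Sigma_hat = resid.T @ resid / n                # the MLE divides by n
    return coef.T, Sigma_hat

# Illustrative data (hypothetical effect sizes).
rng = np.random.default_rng(0)
n, m, p = 200, 3, 2
X = rng.normal(size=(n, p))
G = rng.binomial(2, 0.3, size=n).astype(float)
Y = 1.0 + X @ rng.normal(size=(p, m)) + 0.2 * G[:, None] + rng.normal(size=(n, m))
beta_hat, Sigma_hat = mlm_mle(Y, X, G)
```

The last column of `beta_hat` holds the estimated genotype effects, one per trait.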
2.2. Conducting Multivariate Association Tests
Denote the vector operator vec(), which stacks the columns of a matrix into a vector. Denote A = ZTZ. For the MLEs (2) of the MLM model (1), we can check that (see, e.g., [23, p. 296])
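$$\mathrm{vec}(\hat{\beta}) \sim N\left(\mathrm{vec}(\beta),\; A^{-1} \otimes \Sigma\right), \tag{3}$$
where ⊗ denotes the Kronecker product and $n\hat{\Sigma}$ independently follows a Wishart distribution, Wm(Σ, n − p − 2), with n − p − 2 degrees of freedom (DFs) and scale matrix Σ.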
Define the n × (p + 1) design matrix $Z_0 = (1_n, X)$ and the corresponding n × n hat matrix $H = Z_0 (Z_0^{T} Z_0)^{-1} Z_0^{T}$. Let $P = I - H$ and $G_e = PG$. Here I is an n × n identity matrix. We can check that
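$$\hat{\beta}_1 = \frac{Y^{T} G_e}{G_e^{T} G_e}, \qquad \hat{\beta}_1 \sim N\left(\beta_1,\; \frac{\Sigma}{G_e^{T} G_e}\right). \tag{4}$$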
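The residualization shortcut can be checked numerically. The sketch below assumes the usual partitioned-regression identity: the genotype coefficient from the full fit equals Y^T Ge / (Ge^T Ge), and the last diagonal element of (Z^T Z)^{-1} equals 1 / (Ge^T Ge). Variable names and simulated values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 300, 2, 2
X = rng.normal(size=(n, p))
G = rng.binomial(2, 0.25, size=n).astype(float)
Y = rng.normal(size=(n, m)) + 0.3 * G[:, None]

# Full fit: the last row of the coefficient matrix is the genotype effect.
Z = np.column_stack([np.ones(n), X, G])
beta1_full = np.linalg.lstsq(Z, Y, rcond=None)[0][-1]

# Shortcut: residualize G on Z0 = (1, X), then one inner product per trait.
Z0 = np.column_stack([np.ones(n), X])
Ge = G - Z0 @ np.linalg.lstsq(Z0, G, rcond=None)[0]   # Ge = (I - H) G
beta1_short = Y.T @ Ge / (Ge @ Ge)

# The genotype entry of (Z^T Z)^{-1} reduces to 1 / (Ge^T Ge).
a_last = np.linalg.inv(Z.T @ Z)[-1, -1]
```

Both routes give the same estimate, which is what makes a per-SNP refit of the full model unnecessary.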
We test the multitrait association with the following Wald statistic:
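$$Q = \hat{\beta}_1^{T} \left(\frac{\hat{\Sigma}}{G_e^{T} G_e}\right)^{-1} \hat{\beta}_1 = (G_e^{T} G_e)\, \hat{\beta}_1^{T} \hat{\Sigma}^{-1} \hat{\beta}_1. \tag{5}$$
Note that $\hat{\beta}_1$ and $\hat{\Sigma}$ are independent. Under the null hypothesis, ((n − p − 1 − m)/mn)Q follows the F-distribution with (m, n − p − 1 − m) DFs (see, e.g., [25, p. 541]).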
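A minimal single-SNP sketch of the omnibus test follows. It assumes Q = (Ge^T Ge) β̂1^T Σ̂^{-1} β̂1 with Σ̂ the MLE (divisor n), which is the scaling under which ((n − p − 1 − m)/mn)Q has the stated F-distribution. The function name `wald_q_test` is hypothetical, and SciPy supplies the F tail probability.

```python
import numpy as np
from scipy import stats

def wald_q_test(Y, X, G):
    """Omnibus Wald statistic Q and its F-based p value for one SNP."""
    n, m = Y.shape
    p = X.shape[1]
    Z0 = np.column_stack([np.ones(n), X])
    Ge = G - Z0 @ np.linalg.lstsq(Z0, G, rcond=None)[0]   # genotype residual
    E = Y - Z0 @ np.linalg.lstsq(Z0, Y, rcond=None)[0]    # null-model trait residuals
    gg = Ge @ Ge
    beta1 = E.T @ Ge / gg                                 # equals Y^T Ge / (Ge^T Ge)
    Sigma_hat = (E.T @ E - gg * np.outer(beta1, beta1)) / n
    Q = gg * beta1 @ np.linalg.solve(Sigma_hat, beta1)
    df2 = n - p - 1 - m
    return Q, stats.f.sf(df2 / (m * n) * Q, m, df2)

# Illustrative null data: traits depend on covariates but not on G.
rng = np.random.default_rng(2)
n, m = 500, 4
X = rng.normal(size=(n, 2))
G = rng.binomial(2, 0.2, size=n).astype(float)
Y = rng.normal(size=(n, m)) + X @ rng.normal(size=(2, m))
Q, pval = wald_q_test(Y, X, G)
```

For a single trait (m = 1) the scaled statistic reduces to the squared t statistic of ordinary least squares, which gives a convenient sanity check.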
In the supplementary materials, we analytically show that the CCA test approach [4] is equivalent to a Score test statistic under the MLM model (1) when there are no covariates other than the genotype. Therefore, the proposed MLM-based Wald test can be treated as a natural and flexible generalization of the CCA: (I) it can accommodate any covariates; (II) it is based on the more powerful Wald test instead of the Score test for an association test of quantitative traits; (III) it has an exact F-distribution for the multivariate normally distributed traits and hence has very accurate control of type I errors for any sample sizes without the need of asymptotic approximation; and (IV) it is very fast to compute (see next section for details) and extremely scalable to genome-wide association tests of millions of SNPs.
When genetic effects are similar across traits, we can further improve the multivariate association test power using a test statistic with 1-DF following the lines of O'Brien [26], which performed a Wald test of linear combinations of β1. We can derive similar Wald tests under the MLM (1) (see supplementary materials for technical details). When the genotype effects are the same across different traits, we study the following test statistic:
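$$T = \frac{\sqrt{G_e^{T} G_e}\; 1_m^{T} \hat{\Sigma}^{-1} \hat{\beta}_1}{\sqrt{1_m^{T} \hat{\Sigma}^{-1} 1_m}}, \tag{6}$$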
where 1m is an m × 1 column vector of ones. When the scaled genotype effects are the same across different traits, we study the following test statistic:
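$$T' = \frac{\sqrt{G_e^{T} G_e}\; S^{T} \hat{\Sigma}^{-1} \hat{\beta}_1}{\sqrt{S^{T} \hat{\Sigma}^{-1} S}}, \tag{7}$$
where S is a column vector of estimated standard errors: $S = (\hat{\sigma}_{11}^{1/2}, \ldots, \hat{\sigma}_{mm}^{1/2})^{T}$, with $\hat{\sigma}_{jj}$ the jth diagonal element of $\hat{\Sigma}$.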
Under the null hypothesis, both T and T′ follow the asymptotic standard normal distribution. To improve the finite sample performance, we can compare ((n − p − 1 − m)/n)T and ((n − p − 1 − m)/n)T′ to a t-distribution with (n − p − 1 − m)-DF.
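The 1-DF statistics can be sketched as follows. This assumes the O'Brien-type generalized least squares forms T = sqrt(Ge^T Ge) w^T Σ̂^{-1} β̂1 / sqrt(w^T Σ̂^{-1} w), with w = 1m for the common-effect test and w = S (square roots of the diagonal of Σ̂) for the common scaled-effect test; these forms are an assumption here rather than a quotation of the authors' code. Only the asymptotic standard normal p values are computed, and the function name `one_df_tests` is hypothetical.

```python
import math
import numpy as np

def one_df_tests(beta1, Sigma_hat, gg):
    """Return T, T' and their two-sided asymptotic normal p values.

    beta1: (m,) genotype effects; Sigma_hat: (m, m) covariance MLE;
    gg: the scalar Ge^T Ge.
    """
    ones = np.ones(beta1.shape[0])
    S = np.sqrt(np.diag(Sigma_hat))              # estimated trait standard deviations

    def stat(w):
        Sw = np.linalg.solve(Sigma_hat, w)       # Sigma_hat^{-1} w
        return math.sqrt(gg) * (Sw @ beta1) / math.sqrt(w @ Sw)

    T, T_prime = stat(ones), stat(S)
    pvals = tuple(math.erfc(abs(t) / math.sqrt(2.0)) for t in (T, T_prime))
    return T, T_prime, pvals

# Two unit-variance, uncorrelated traits with equal effects: T and T' coincide.
T, T_prime, (p_T, p_Tp) = one_df_tests(np.array([0.1, 0.1]), np.eye(2), gg=100.0)
```

When all traits have the same estimated variance, the weight vectors 1m and S are proportional, so the two statistics agree.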
2.3. Efficient Computation of GWAS Wald Test Statistics
For a typical GWAS with millions of SNPs, rather than fitting an MLM for each SNP separately, we developed a very efficient algorithm that estimates the MLMs for all SNPs using matrix decomposition tricks along the lines of Voorman et al. [27]. For Z0, denote its singular value decomposition (SVD) as $Z_0 = UDV^{T}$, where U is an n × (p + 1) matrix with orthonormal columns, D is a (p + 1)×(p + 1) diagonal matrix, and V is a (p + 1)×(p + 1) orthogonal matrix. The null MLM hat matrix can then be computed as $H = UU^{T}$, and $G_e = G - U(U^{T}G)$. Denote the null MLM residual matrix as $E = Y - U(U^{T}Y)$, and let $V_0 = E^{T}E$. In (4), we have shown that the genotype effect can be efficiently computed as $\hat{\beta}_1 = Y^{T}G_e / (G_e^{T}G_e)$. We can then compute the covariance matrix MLE as $\hat{\Sigma} = \{V_0 - (G_e^{T}G_e)\,\hat{\beta}_1\hat{\beta}_1^{T}\}/n$. Here both V0 and U need to be computed only once and can be stored for use with all SNPs.
Operationally, we can also apply the popular PLINK tool [28] to test multitrait association. We first obtain the residuals of the multivariate traits and genotypes after adjusting for all covariates. We then input the residuals into the PLINK CCA test approach [4]. Technically, we need to adjust the PLINK output p value using an F-distribution with different DFs (see supplementary materials for technical details).
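The precompute-once idea can be sketched for a block of SNPs as below. U, E, and V0 are computed a single time, and Q is then obtained for every SNP without a per-SNP matrix inverse. The closed form Q = n g a / (1 − g a), with g = Ge^T Ge and a = β̂1^T V0^{-1} β̂1, comes from applying the Sherman-Morrison identity to Σ̂ = (V0 − g β̂1 β̂1^T)/n; it is an algebraic convenience assumed here, not necessarily how the authors' package is implemented. The function name `gwas_scan_q` is hypothetical.

```python
import numpy as np

def gwas_scan_q(Y, X, Gmat):
    """Omnibus Wald statistics Q for all SNPs (columns of Gmat, shape (n, K)).

    Assumes Z0 = (1, X) has full column rank and no SNP is monomorphic
    after residualization (otherwise Ge^T Ge would be zero).
    """
    n = Y.shape[0]
    Z0 = np.column_stack([np.ones(n), X])
    U = np.linalg.svd(Z0, full_matrices=False)[0]   # n x (p + 1), orthonormal columns
    E = Y - U @ (U.T @ Y)                           # null-model trait residuals
    V0 = E.T @ E                                    # m x m, shared by all SNPs
    Ge = Gmat - U @ (U.T @ Gmat)                    # residualized genotypes, n x K
    gg = np.einsum("ij,ij->j", Ge, Ge)              # Ge^T Ge for each SNP
    B = (E.T @ Ge) / gg                             # m x K, column k is beta1 of SNP k
    a = np.einsum("ij,ij->j", B, np.linalg.solve(V0, B))
    return n * gg * a / (1.0 - gg * a)

rng = np.random.default_rng(4)
n, m, K = 400, 3, 5
X = rng.normal(size=(n, 2))
Gmat = rng.binomial(2, 0.3, size=(n, K)).astype(float)
Y = rng.normal(size=(n, m)) + 0.5 * X[:, :1]
Q_all = gwas_scan_q(Y, X, Gmat)
```

The per-SNP cost is dominated by the two matrix products involving `Gmat`, so SNPs can be processed in large chunks.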
3. Results
3.1. Simulation Studies
We consider three forms of Wald statistics: Q is the omnibus test, and T and T′ are the 1-DF test assuming common or common scaled effects. The GEE-based approaches of He et al. [11] are computationally very efficient, have been shown to appropriately control the type I errors, and have the overall best detection power compared to the other methods (e.g., TATES of [10] and other univariate test-based methods) in extensive numerical studies. Here we compared the proposed methods to their GEE score tests, denoted as (Qs, Ts, Ts′), which are the m-DF omnibus test and 1-DF tests assuming a common effect or common scaled effect.
We consider a standard normal covariate X1 and a Bernoulli covariate X2 with probability of 0.5. The SNP genotype score G is simulated from a Binomial distribution, Binom(2, f0), where the minor allele frequency (MAF) f0 = p0 + p1X2. Here X2 is essentially a population indicator and we have simulated SNPs under population stratification.
We conducted simulations for testing m = 2, 4, and 8 related traits in 1,000 unrelated individuals. In each replicate, we simulate the m traits from a multivariate normal distribution with a compound symmetry correlation matrix with correlation ρ. The first trait has a variance of 2 and all the other traits have unit variance. We set E(Yi) = 1 + 0.5X1 + 0.5X2 + γiG for i = 1, 3, …, m − 1, and E(Yk) = 1 + X1 + X2 + γkG for k = 2, 4, …, m.
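One replicate of this design can be generated as in the sketch below. It follows the description in the text (normal and Bernoulli covariates, a stratified minor allele frequency p0 + p1 X2, compound symmetry correlation, first trait variance 2); the function name `simulate` and the chosen effect sizes are illustrative assumptions.

```python
import numpy as np

def simulate(n, m, rho, gamma, p0=0.1, p1=0.1, rng=None):
    """Simulate one data set: traits Y (n, m), covariates X (n, 2), genotypes G (n,)."""
    rng = np.random.default_rng() if rng is None else rng
    X1 = rng.normal(size=n)
    X2 = rng.binomial(1, 0.5, size=n)              # population indicator
    G = rng.binomial(2, p0 + p1 * X2)              # MAF differs between populations
    sd = np.ones(m)
    sd[0] = np.sqrt(2.0)                           # first trait has variance 2
    R = np.full((m, m), rho)
    np.fill_diagonal(R, 1.0)                       # compound symmetry correlation
    Sigma = R * np.outer(sd, sd)
    # Traits 1, 3, ... use covariate coefficient 0.5; traits 2, 4, ... use 1.
    cov_coef = np.where(np.arange(m) % 2 == 0, 0.5, 1.0)
    mean = 1.0 + np.outer(X1 + X2, cov_coef) + np.outer(G, gamma)
    Y = mean + rng.multivariate_normal(np.zeros(m), Sigma, size=n)
    return Y, np.column_stack([X1, X2]), G.astype(float)

Y, X, G = simulate(1000, 4, 0.5, gamma=np.array([0.3, 0.2, 0.1, 0.0]),
                   rng=np.random.default_rng(3))
```

Setting `gamma` to zeros gives null data for type I error evaluation.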
We used 10 million experiments to evaluate the type I error and 10^5 experiments to evaluate the power under various combinations of (γ1,…, γm). We conducted simulations for p0 = 0.1, 0.3; p1 = 0.1; and ρ = 0, 0.2, 0.5, 0.8. Here we report the results for m = 2, 8; ρ = 0, 0.5; and p0 = 0.1. The conclusions remain the same for other settings (data not shown).
Tables 1 and 2 summarize the estimated type I errors, reported as the ratio of the empirical type I error to the nominal significance level α. Overall, the type I errors are well controlled for the proposed methods, while the GEE score tests are conservative, especially for a large number of traits (m = 8). The proposed Wald test Q follows the exact F-distribution under the null hypothesis, and hence its type I errors are well controlled under all settings. The GEE tests rely on large-sample asymptotic distributions and therefore generally need a large sample size for good control of type I errors, especially for a larger number of traits (which involve more model parameters).
Table 1.
Test | ρ = 0, α = 10−5 | ρ = 0, α = 10−4 | ρ = 0, α = 10−3 | ρ = 0.5, α = 10−5 | ρ = 0.5, α = 10−4 | ρ = 0.5, α = 10−3 |
---|---|---|---|---|---|---|
Qs | 0.69 | 0.79 | 0.89 | 0.67 | 0.79 | 0.89 |
Ts | 0.74 | 0.85 | 0.93 | 0.71 | 0.83 | 0.92 |
Ts′ | 0.74 | 0.85 | 0.89 | 0.71 | 0.83 | 0.92 |
Q | 1.04 | 1.00 | 1.00 | 1.03 | 1.01 | 1.00 |
T | 0.98 | 0.99 | 1.01 | 0.97 | 0.99 | 1.00 |
T′ | 0.96 | 0.98 | 1.00 | 0.96 | 0.98 | 0.99 |
Table 2.
Test | ρ = 0, α = 10−5 | ρ = 0, α = 10−4 | ρ = 0, α = 10−3 | ρ = 0.5, α = 10−5 | ρ = 0.5, α = 10−4 | ρ = 0.5, α = 10−3 |
---|---|---|---|---|---|---|
Qs | 0.43 | 0.62 | 0.75 | 0.44 | 0.60 | 0.75 |
Ts | 0.74 | 0.84 | 0.93 | 0.77 | 0.85 | 0.93 |
Ts′ | 0.74 | 0.84 | 0.93 | 0.78 | 0.85 | 0.93 |
Q | 0.94 | 0.99 | 1.00 | 0.94 | 1.00 | 1.00 |
T | 1.03 | 1.03 | 1.02 | 1.05 | 1.04 | 1.03 |
T′ | 1.03 | 1.03 | 1.03 | 1.03 | 0.99 | 0.99 |
Tables 3 and 4 summarize the power for m = 2 and m = 8, respectively. T is the most powerful when γj are close to each other, and T′ is the most powerful when γj/σj are close to each other. In general, the proposed MLM-based Wald tests perform better than the corresponding GEE-based score tests, especially when testing a large number of traits. This agrees with the general principle that the Wald test is typically more powerful than the GEE-based test.
Table 3.
(γ1, γ2) | (γ1/σ1, γ2/σ2) | Q | T | T′ | Qs | Ts | Ts′ |
---|---|---|---|---|---|---|---|
ρ = 0.5 | | | | | | | |
(0.3,0) | (0.21,0) | 0.375 | 0.001 | 0.024 | 0.334 | 0.001 | 0.019 |
(0.3,0.1) | (0.21,0.1) | 0.206 | 0.047 | 0.146 | 0.177 | 0.039 | 0.126 |
(0.25,0.18) | (0.18,0.18) | 0.180 | 0.221 | 0.258 | 0.154 | 0.194 | 0.233 |
(0.3,0.25) | (0.21,0.25) | 0.523 | 0.617 | 0.619 | 0.476 | 0.573 | 0.582 |
(0.2,0.2) | (0.14,0.2) | 0.179 | 0.257 | 0.215 | 0.154 | 0.23 | 0.193 |
(0.2,0.25) | (0.14,0.25) | 0.410 | 0.501 | 0.369 | 0.367 | 0.466 | 0.333 |
(0.25,0.25) | (0.18,0.25) | 0.449 | 0.560 | 0.492 | 0.403 | 0.521 | 0.455 |
(0,0.25) | (0,0.25) | 0.638 | 0.278 | 0.052 | 0.59 | 0.247 | 0.040 |
(0,0.3) | (0,0.3) | 0.893 | 0.525 | 0.121 | 0.865 | 0.477 | 0.093 |
(0.1,0.25) | (0.07,0.25) | 0.465 | 0.485 | 0.372 | 0.418 | 0.448 | 0.330 |
(0.1,0.3) | (0.07,0.3) | 0.744 | 0.726 | 0.590 | 0.700 | 0.688 | 0.534 |
(0.2,0.3) | (0.14,0.3) | 0.845 | 0.891 | 0.842 | 0.810 | 0.870 | 0.810 |
ρ = 0 | | | | | | | |
(0.3,0) | (0.21,0) | 0.206 | 0.026 | 0.063 | 0.178 | 0.020 | 0.051 |
(0.3,0.1) | (0.21,0.1) | 0.316 | 0.249 | 0.337 | 0.278 | 0.215 | 0.304 |
(0.25,0.18) | (0.18,0.18) | 0.419 | 0.510 | 0.530 | 0.376 | 0.471 | 0.494 |
(0.3,0.25) | (0.21,0.25) | 0.830 | 0.891 | 0.892 | 0.796 | 0.868 | 0.870 |
(0.2,0.2) | (0.14,0.2) | 0.375 | 0.486 | 0.462 | 0.333 | 0.449 | 0.427 |
(0.2,0.25) | (0.14,0.25) | 0.631 | 0.727 | 0.677 | 0.584 | 0.692 | 0.636 |
(0.25,0.25) | (0.18,0.25) | 0.734 | 0.820 | 0.801 | 0.690 | 0.792 | 0.771 |
(0,0.25) | (0,0.25) | 0.405 | 0.249 | 0.134 | 0.36 | 0.217 | 0.107 |
(0,0.3) | (0,0.3) | 0.701 | 0.485 | 0.29 | 0.657 | 0.437 | 0.235 |
(0.1,0.25) | (0.07,0.25) | 0.451 | 0.385 | 0.165 | 0.406 | 0.356 | 0.140 |
(0.1,0.3) | (0.07,0.3) | 0.769 | 0.639 | 0.301 | 0.728 | 0.605 | 0.257 |
(0.2,0.3) | (0.14,0.3) | 0.700 | 0.743 | 0.545 | 0.655 | 0.713 | 0.500 |
Table 4.
(γ1,…, γ8) | Q | T | T′ | Qs | Ts | Ts′ |
---|---|---|---|---|---|---|
ρ = 0.5 | | | | | | |
γ 1 = 0.3, γi>1 = 0 | 0.303 | 0.001 | 0 | 0.229 | 0 | 0 |
(.3, .2, .1, .05,0,…, 0) | 0.696 | 0 | 0.008 | 0.599 | 0 | 0.005 |
γ 1 = 0.2, γi>1 = 0.15 | 0.045 | 0.201 | 0.220 | 0.030 | 0.169 | 0.195 |
γ i = 0.15 | 0.048 | 0.237 | 0.193 | 0.032 | 0.204 | 0.170 |
ρ = 0 | | | | | | |
γ 1 = 0.3, γi>1 = 0 | 0.063 | 0.001 | 0.004 | 0.043 | 0.001 | 0.002 |
(.3, .2, .1, .05,0,…, 0) | 0.467 | 0.156 | 0.224 | 0.372 | 0.102 | 0.152 |
γ 1 = 0.2, γi>1 = 0.15 | 0.934 | 0.996 | 0.997 | 0.887 | 0.992 | 0.993 |
γ i = 0.15 | 0.912 | 0.995 | 0.994 | 0.855 | 0.989 | 0.988 |
The chi-square statistic ((n − p − 1)/n)Q is commonly used in practice and referred to an m-DF chi-square distribution to compute multitrait association test's p values, which can lead to significantly inflated type I errors at stringent genome-wide significance levels. Figure 1 shows the ratio of actual significance level of Wald test's p values computed using the chi-square distribution and F-distribution, respectively. We can see that the type I error based on the chi-square distribution is inflated: more so for larger number of traits, smaller significance level, and smaller sample size. For example, when testing m = 8 traits with p = 2 covariates and n = 500 samples, under genome-wide significance level 5 × 10−8, the actual significance level of chi-square distribution p value is 3.42 × 5 × 10−8 = 1.7 × 10−7. Using the chi-square distribution to compute p values will lead to very small inflation only when the sample size is large, such as in the meta-analysis of multiple GWAS studies. For typical GWAS with small-to-medium sample sizes, we recommend using the appropriate F-distribution to compute significance p values to reduce false positive findings.
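The inflation ratio described above can be computed directly, as in the following sketch. It takes the chi-square critical value for ((n − p − 1)/n)Q, maps it back to the Q scale, and evaluates the exact null tail probability from the F-distribution of ((n − p − 1 − m)/mn)Q. The function name `chisq_inflation` is hypothetical, and the result depends on this reading of the scaling, so it is meant as an illustration rather than a reproduction of Figure 1.

```python
from scipy import stats

def chisq_inflation(alpha, n, m, p):
    """Ratio of the actual to the nominal level when using the chi-square reference."""
    c = stats.chi2.isf(alpha, m)            # critical value for (n - p - 1) / n * Q
    q_crit = c * n / (n - p - 1)            # the same threshold on the Q scale
    df2 = n - p - 1 - m
    actual = stats.f.sf(df2 / (m * n) * q_crit, m, df2)   # exact null tail probability
    return actual / alpha

ratio_small = chisq_inflation(5e-8, n=500, m=8, p=2)
ratio_large = chisq_inflation(5e-8, n=5000, m=8, p=2)
```

A ratio above 1 means the chi-square p values are anticonservative; the ratio shrinks toward 1 as n grows.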
3.2. Application to ARIC GWAS of Glycemic Traits
The Atherosclerosis Risk in Communities (ARIC) study [29] is a population-based, multicenter prospective investigation of cardiovascular disease. Men and women aged 45–64 years at baseline were recruited from four US communities: Forsyth County, North Carolina; Jackson, Mississippi; suburban areas of Minneapolis, Minnesota; and Washington County, Maryland. A total of 15,792 individuals participated in the baseline examination during the period of 1987–1989. The vast majority of ARIC participants are of European (73%) or African (26%) ancestry. We conducted two analyses of diabetes-related glycemic traits in ARIC GWAS data, which has been imputed to around 2.5 million HapMap SNPs using MaCH [30]. We included in the analysis those common SNPs with MAF ≥0.05 and imputation score R2 ≥ 0.3.
As a proof of concept, we first analyzed four fasting glucose levels in 5947 nondiabetic ARIC white participants measured at four visits (visits 1–4) conducted approximately three years apart. The average correlation of glucose levels is 0.55. We applied an additive genetic model with imputed dosage as a covariate and adjusted for age, gender, and study center in all tests. By analyzing four fasting glucose measures jointly, T′ identified 104 significant SNPs, T identified 103, Ts′ identified 102, Ts identified 101, and Q and Qs identified the same set of 95 SNPs at the genome-wide significance level 5 × 10−8. Analyzing each glucose measure separately identified 34, 84, 37, and 64 genome-wide significant SNPs at visits 1, 2, 3, and 4, respectively. All the identified SNPs by different methods are genome-wide significant in the MAGIC Consortium, a meta-analysis of 21 fasting glucose GWAS which together included 46,186 nondiabetic participants [31].
Compared to Ts′, the two additional SNPs identified by T′, rs780093 and rs780094, had p values of 4.8 × 10−8 and 4.8 × 10−8 using T′. Their respective MAGIC meta-analysis' p values were 2.9 × 10−13 and 2.5 × 10−12. Compared to Ts, the two additional SNPs identified by T, rs1260326 and rs11688384, had p values of 4.7 × 10−8 and 4.0 × 10−8 using T. Their respective MAGIC meta-analysis' p values were 4.3 × 10−13 and 4.1 × 10−10.
Second, we jointly analyzed three distinct diabetes-related glycemic traits measured at visit 4 in 5068 nondiabetic white ARIC participants: fasting glucose, fasting insulin, and glucose level 2 hours after an oral glucose challenge. We applied an additive genetic model with imputed dosage as a covariate and adjusted for age, gender, and study center. To account for the skewed distribution of fasting insulin, we adopted the Box-Cox transformation with an estimated power of 0.35 [32]. The three traits had an average pairwise correlation of 0.31. When analyzing fasting insulin or 2-hour glucose levels individually, we did not identify any significant SNPs at the genome-wide significance level (5 × 10−8). For joint testing of all three traits, Ts, Ts′, T, and T′ identified none, Qs identified 139, and Q identified 140 genome-wide significant SNPs, among which 61 and 61 SNPs, respectively, were reported as genome-wide significant in the MAGIC meta-analyses of fasting glucose, fasting insulin, or 2-hour glucose levels [31, 33].
Compared to Qs, Q identified two additional genome-wide significant SNPs, rs4665987 and rs853780, with p values of 4.9 × 10−8 and 4.9 × 10−8, respectively. MAGIC meta-analysis of fasting glucose reported a p value of 2.1 × 10−38 for rs853780. Its MAGIC meta-analyses of fasting insulin and 2-hour glucose p values are 0.054 and 0.477, respectively. For rs4665987 (near GCKR on chromosome 2:27755825), MAGIC meta-analysis' p values for the fasting glucose, fasting insulin, and 2-hour glucose levels are 4.6 × 10−6, 0.04, and 9.3 × 10−5, respectively. This SNP was genome-wide significantly associated with human serum metabolite levels in a GWAS of 8330 Finnish individuals [34] and several other GWAS [35–38]. Compared to Q, Qs reported one additional genome-wide significant SNP, rs17540154, with p value of 4.3 × 10−8. The MAGIC meta-analysis of fasting glucose reported a p value of 8.7 × 10−38 for rs17540154. Its MAGIC meta-analyses of fasting insulin and 2-hour glucose p values are 0.101 and 0.720, respectively.
Among the identified significant SNPs by joint testing, there were 79 novel genome-wide significant SNPs that have not been reported as significantly associated with diabetes-related fasting glucose and insulin levels before. Among them, one SNP, rs4665987, is located on chromosome 2:27755825 and 78 other SNPs are clustered on chromosomes 15:62132921 to 15:62396389. Interestingly, six of them (listed in Table 5) were genome-wide significant in the MAGIC meta-analysis of proinsulin level [39]. The list of all identified SNPs with detailed analysis' results is available in the supplementary materials.
Table 5.
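SNP | Chr | bp | ARIC joint test's p value (Wald) | ARIC joint test's p value (GEE) | MAGIC meta-analysis' p value (FG) | MAGIC meta-analysis' p value (FI) | MAGIC meta-analysis' p value (2hFG) | MAGIC meta-analysis' p value (FP) |
---|---|---|---|---|---|---|---|---|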
rs4502156 | 15 | 62383155 | 5.4E − 09 | 7.9E − 09 | 8.4E − 08 | 6.7E − 01 | 8.2E − 05 | 3.8E − 11 |
rs7163757 | 15 | 62391608 | 1.4E − 08 | 1.8E − 08 | 4.2E − 07 | 5.7E − 01 | 1.9E − 05 | 3.9E − 11 |
rs8037894 | 15 | 62394264 | 1.2E − 08 | 1.6E − 08 | 4.1E − 07 | 4.8E − 01 | 3.5E − 05 | 8.7E − 11 |
rs6494307 | 15 | 62394690 | 1.7E − 08 | 2.1E − 08 | 3.3E − 07 | 4.9E − 01 | 2.7E − 05 | 4.1E − 11 |
rs7167878 | 15 | 62396189 | 1.7E − 08 | 2.1E − 08 | 4.6E − 07 | 4.5E − 01 | 2.4E − 05 | 4.1E − 11 |
rs7172432 | 15 | 62396389 | 1.7E − 08 | 2.2E − 08 | 6.5E − 07 | 3.3E − 01 | 1.9E − 05 | 4.3E − 11 |
4. Discussion
So far typical effect sizes of most identified genetic variants for many diseases or traits are very small and they have only explained a very small proportion of the overall disease heritability or trait variation. It is commonly accepted that there are many more common variants with relatively small-to-medium effect sizes or rare variants with larger effect sizes yet to be discovered. To identify these additional variants, very large sample sizes will be needed. One approach is to form a consortium to facilitate meta-analysis of many studies, but development of these genetics consortia is generally time-consuming and logistically challenging. Meanwhile the recently studied joint association test of multiple correlated traits offers an alternative approach to boost power in that it can often dramatically improve the association test power by “enlarging the sample size” through the incorporation of many correlated traits that are typically collected in most large genetic studies and may share genetic determinants. Another strategy to further improve the detection power is to use a variant-set association test, which has been proven to be very useful (see, e.g., [16, 17, 40–42]). It is worthwhile to generalize the proposed Wald tests to develop more accurate and powerful association tests of variant sets across multiple traits.
Here we have focused on testing a relatively small number of correlated quantitative traits, which has enabled us to develop accurate and powerful association tests without the asymptotic approximations adopted in the GEE approach. The GEE approach is more general, since it can be applied to any mix of quantitative and discrete traits, but it is conservative. It will be interesting to extend the proposed methods to phenome-wide association studies (PheWAS) with a large collection of phenotypes [43–45] and to develop more powerful joint association tests of quantitative and discrete traits.
In the previous discussions, we have assumed the same set of covariates across all traits. For differing covariates, we provide technical details regarding model estimation and extensive simulation studies confirming that the proposed methods accurately control type I errors and perform favorably compared to existing methods (see the supplementary materials for complete results). In summary, we recommend the proposed multivariate linear regression-based test as a complementary approach to enhancing the power of analyzing multiple quantitative traits in unrelated individuals. Our numerical studies suggest that the omnibus Wald test generally has robust and good performance. The 1-DF Wald tests can perform well owing to their reduced DFs, but they can be sensitive to the underlying assumptions. It will be worthwhile to develop adaptive and powerful tests. We have implemented the proposed methods in an R package available at http://www.github.com/baolinwu/MTAR. We provide sample R code to install and use the package in the supplementary materials. The developed algorithms are very efficient and extremely scalable to genome-wide association tests.
Acknowledgments
This research was supported in part by NIH Grants GM083345 and CA134848. The authors are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. The ARIC study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute Contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C), R01HL087641, R01HL59367, and R01HL086694; National Human Genome Research Institute Contract U01HG004402; and National Institutes of Health Contract HHSN268200625226C. The authors thank the staff and participants of the ARIC study for their important contributions. Infrastructure was partly supported by Grant no. UL1RR025005, a component of the National Institutes of Health and NIH Roadmap for Medical Research.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Supplementary Materials
References
- 1.Solovieff N., Cotsapas C., Lee P. H., Purcell S. M., Smoller J. W. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics. 2013;14(7):483–495. doi: 10.1038/nrg3461. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Wang K., Abbott D. A principal components regression approach to multilocus genetic association studies. Genetic Epidemiology. 2008;32(2):108–118. doi: 10.1002/gepi.20266. [DOI] [PubMed] [Google Scholar]
- 3.Klei L., Luca D., Devlin B., Roeder K. Pleiotropy and principal components of heritability combine to increase power for association analysis. Genetic Epidemiology. 2008;32(1):9–19. doi: 10.1002/gepi.20257. [DOI] [PubMed] [Google Scholar]
- 4.Ferreira M. A. R., Purcell S. M. A multivariate test of association. Bioinformatics. 2009;25(1):132–133. doi: 10.1093/bioinformatics/btn563. [DOI] [PubMed] [Google Scholar]
- 5.Liu J., Pei Y., Papasian C. J., Deng H.-W. Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations. Genetic Epidemiology. 2009;33(3):217–227. doi: 10.1002/gepi.20372. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Yang Q., Wu H., Guo C.-Y., Fox C. S. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genetic Epidemiology. 2010;34(5):444–454. doi: 10.1002/gepi.20497. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Rasmussen-Torvik L. J., Alonso A., Li M., et al. Impact of repeated measures and sample selection on genome-wide association studies of fasting glucose. Genetic Epidemiology. 2010;34(7):665–673. doi: 10.1002/gepi.20525. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.O'Reilly P. F., Hoggart C. J., Pomyen Y., et al. MultiPhen: Joint model of multiple phenotypes can increase discovery in GWAS. PLoS ONE. 2012;7(5) doi: 10.1371/journal.pone.0034861.e34861 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Stephens M. A Unified Framework for Association Analysis with Multiple Related Phenotypes. PLoS ONE. 2013;8(7) doi: 10.1371/journal.pone.0065245.e65245 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.van der Sluis S., Posthuma D., Dolan C. V. TATES: Efficient Multivariate Genotype-Phenotype Analysis for Genome-Wide Association Studies. PLoS Genetics. 2013;9(1) doi: 10.1371/journal.pgen.1003235.e1003235 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.He Q., Avery C. L., Lin D.-Y. A general framework for association tests with multivariate traits in large-scale genomics studies. Genetic Epidemiology. 2013;37(8):759–767. doi: 10.1002/gepi.21759. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Wu B., Pankow J. S. Statistical Methods for Association Tests of Multiple Continuous Traits in Genome-Wide Association Studies. Annals of Human Genetics. 2015;79(4):282–293. doi: 10.1111/ahg.12110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Sitlani C. M., Rice K. M., Lumley T., et al. Generalized estimating equations for genome-wide association studies using longitudinal phenotype data. Statistics in Medicine. 2015;34(1):118–130. doi: 10.1002/sim.6323. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Ganesh S. K., Chasman D. I., Larson M. G. Effects of long-term averaging of quantitative blood pressure traits on the detection of genetic associations. The American Journal of Human Genetics. 2014;95(1):49–65. doi: 10.1016/j.ajhg.2014.06.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Galesloot T. E., Van Steen K., Kiemeney L. A. L. M., Janss L. L., Vermeulen S. H. A comparison of multivariate genome-wide association methods. PLoS ONE. 2014;9(4) doi: 10.1371/journal.pone.0095923.e95923 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Tang C. S., Ferreira M. A. R. A gene-based test of association using canonical correlation analysis. Bioinformatics. 2012;28(6):845–850. doi: 10.1093/bioinformatics/bts051.bts051 [DOI] [PubMed] [Google Scholar]
- 17. Seoane J. A., Campbell C., Day I. N. M., Casas J. P., Gaunt T. R. Canonical Correlation Analysis for Gene-Based Pleiotropy Discovery. PLoS Computational Biology. 2014;10(10) doi: 10.1371/journal.pcbi.1003876.
- 18. Liang K. Y., Zeger S. L. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22. doi: 10.1093/biomet/73.1.13.
- 19. Avery C. L., He Q., North K. E., et al. A phenomics-based strategy identifies loci on APOC1, BRAP, and PLCG1 associated with metabolic syndrome phenotype domains. PLoS Genetics. 2011;7(10):e1002322. doi: 10.1371/journal.pgen.1002322.
- 20. Schifano E. D., Li L., Christiani D. C., Lin X. Genome-wide association analysis for multiple continuous secondary phenotypes. American Journal of Human Genetics. 2013;92(5):744–759. doi: 10.1016/j.ajhg.2013.04.004.
- 21. Wang K. Testing genetic association by regressing genotype over multiple phenotypes. PLoS ONE. 2014;9(9):e106918. doi: 10.1371/journal.pone.0106918.
- 22. Guo X., Li Y., Ding X., He M., Wang X., Zhang H. Association tests of multiple phenotypes: ATeMP. PLoS ONE. 2015;10(10):e0140348. doi: 10.1371/journal.pone.0140348.
- 23. Anderson T. W. An Introduction to Multivariate Statistical Analysis. 3rd. New York, NY, USA: Wiley; 2003.
- 24. Fox J. Applied Regression Analysis and Generalized Linear Models. 2nd. Los Angeles, CA, USA: SAGE Publications, Inc.; 2008.
- 25. Rao C. R. Linear Methods of Statistical Induction and their Applications. 2nd. New York, NY, USA: Wiley; 1973.
- 26. O'Brien P. C. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40(4):1079–1087. doi: 10.2307/2531158.
- 27. Voorman A., Rice K., Lumley T. Fast computation for genome-wide association studies using boosted one-step statistics. Bioinformatics. 2012;28(14):1818–1822. doi: 10.1093/bioinformatics/bts291.
- 28. Purcell S., Neale B., Todd-Brown K., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007;81(3):559–575. doi: 10.1086/519795.
- 29. The ARIC Investigators. The Atherosclerosis Risk in Communities (ARIC) study: design and objectives. American Journal of Epidemiology. 1989;129(4):687–702. doi: 10.1093/oxfordjournals.aje.a115184.
- 30. Li Y., Willer C. J., Ding J., Scheet P., Abecasis G. R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology. 2010;34(8):816–834. doi: 10.1002/gepi.20533.
- 31. Dupuis J., Langenberg C., Prokopenko I., et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics. 2010;42(2):105–116. doi: 10.1038/ng.520.
- 32. Box G. E. P., Cox D. R. An analysis of transformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1964;26:211–252.
- 33. Saxena R., Hivert M. F., Langenberg C. Genetic variation in GIPR influences the glucose and insulin responses to an oral glucose challenge. Nature Genetics. 2010;42(2):142–148. doi: 10.1038/ng.521.
- 34. Kettunen J., Tukiainen T., Sarin A.-P., et al. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature Genetics. 2012;44(3):269–276. doi: 10.1038/ng.1073.
- 35. Tang W., Basu S., Kong X., et al. Genome-wide association study identifies novel loci for plasma levels of protein C: The ARIC study. Blood. 2010;116(23):5032–5036. doi: 10.1182/blood-2010-05-283739.
- 36. Barber M. J., Mangravite L. M., Hyde C. L., et al. Genome-Wide Association of Lipid-Lowering Response to Statins in Combined Study Populations. PLoS ONE. 2010;5(3):e9763. doi: 10.1371/journal.pone.0009763.
- 37. Tin A., Woodward O. M., Kao W. H. L., et al. Genome-wide association study for serum urate concentrations and gout among African Americans identifies genomic risk loci and a novel URAT1 loss-of-function allele. Human Molecular Genetics. 2011;20(20):4056–4068. doi: 10.1093/hmg/ddr307.
- 38. Kottgen A., Albrecht E., Teumer A., et al. Genome-wide association analyses identify 18 new loci associated with serum urate concentrations. Nature Genetics. 2013;45(2):145–154. doi: 10.1038/ng.2500.
- 39. Strawbridge R. J., Dupuis J., Prokopenko I. Genome-wide association identifies nine common variants associated with fasting proinsulin levels and provides new insights into the pathophysiology of type 2 diabetes. Diabetes. 2011;60(10):2624–2634. doi: 10.2337/db11-0415.
- 40. Wu M. C., Kraft P., Epstein M. P., et al. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics. 2010;86(6):929–942. doi: 10.1016/j.ajhg.2010.05.002.
- 41. Wu M. C., Lee S., Cai T., Li Y., Boehnke M., Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
- 42. Lee S., Abecasis G. R., Boehnke M., Lin X. Rare-variant association analysis: study designs and statistical tests. American Journal of Human Genetics. 2014;95(1):5–23. doi: 10.1016/j.ajhg.2014.06.009.
- 43. Pendergrass S. A., Brown-Gentry K., Dudek S. M., et al. The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery. Genetic Epidemiology. 2011;35(5):410–422. doi: 10.1002/gepi.20589.
- 44. Pendergrass S. A., Brown-Gentry K., Dudek S., et al. Phenome-Wide Association Study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) network. PLoS Genetics. 2013;9(1):e1003087. doi: 10.1371/journal.pgen.1003087.
- 45. Cronin R. M., Field J. R., Bradford Y., et al. Phenome-wide association studies demonstrating pleiotropy of genetic variants within FTO with and without adjustment for body mass index. Applied Genetic Epidemiology. 2014;5, article 250 doi: 10.3389/fgene.2014.00250.