Author manuscript; available in PMC: 2017 Feb 1.
Published in final edited form as: Genet Epidemiol. 2016 Jan 18;40(2):91–100. doi: 10.1002/gepi.21945

Sequence kernel association test of multiple continuous phenotypes

Baolin Wu 1,*, James S Pankow 2,*
PMCID: PMC4724299  NIHMSID: NIHMS743687  PMID: 26782911

Abstract

Genetic studies often collect multiple correlated traits, which could be analyzed jointly to increase power by aggregating multiple weak effects and provide additional insights into the etiology of complex human diseases. Existing methods for multiple trait association tests have primarily focused on common variants. There is a surprising dearth of published methods for testing the association of rare variants with multiple correlated traits. In this paper, we extend the commonly used sequence kernel association test (SKAT) for single trait analysis to test for the joint association of rare variant sets with multiple traits. We investigate the performance of the proposed method through extensive simulation studies. We further illustrate its usefulness with application to the analysis of diabetes-related traits in the Atherosclerosis Risk in Communities (ARIC) Study. We identified an exome-wide significant rare variant set in the gene YAP1 worthy of further investigations.

Keywords: GWAS, Rare variant, Score statistic, SKAT

1 Introduction

More and more large cohort studies have conducted or are conducting genome-wide association studies (GWAS) to reveal the genetic components of many complex human diseases. These large cohort studies often collected a broad array of correlated phenotypes that reflect common physiological processes. By jointly analyzing these correlated traits, we can gain more power by aggregating multiple weak effects and shed light on the mechanisms underlying complex human diseases (Solovieff et al., 2013). We propose to study multivariate test statistics to detect the joint association of a rare variant set with multiple correlated continuous traits.

Many methods have been developed for testing the association of common variants with multiple traits (see, e.g., Ferreira and Purcell, 2009; Liu et al., 2009b; Yang et al., 2010; Rasmussen-Torvik et al., 2010; Avery et al., 2011; O'Reilly et al., 2012; Maity et al., 2012; van der Sluis et al., 2013; He et al., 2013; Schifano et al., 2013; Wu and Pankow, 2015; Hua and Ghosh, 2015). In GWAS, the identified common variants explain only a small proportion of the phenotypic variance for most complex traits studied to date. Manolio et al. (2009) have suggested that rare variants could contribute substantially to the missing heritability. Tests of individual rare variants typically lack power because of low minor allele frequencies (MAF). A commonly used alternative is the gene-level association test, which tests the joint effects of the rare variants within a gene to enrich the association signal. Two commonly used rare variant set analyses are the burden test (see, e.g., Morgenthaler and Thilly, 2007; Madsen and Browning, 2009; Morris and Zeggini, 2010; Price et al., 2010; Lin and Tang, 2011) and the variance component score test (see, e.g., Wu et al., 2010; Neale et al., 2011; Wu et al., 2011; Lee et al., 2012). The burden test works well for variants with similar effects but can lose substantial power when both protective and deleterious variants are present. The sequence kernel association test (SKAT; Wu et al., 2011) is based on the variance component score test, works well under various combinations of protective and deleterious variants, and is the most widely used approach for rare variant set association tests.

Existing methods for multi-trait association tests have primarily focused on common variants. There is a surprising dearth of published methods for analyzing the association of rare variant sets with multiple traits. In this paper, we extend the commonly used SKAT for single-trait analysis to test for the joint association of a rare variant set with multiple continuous traits. We investigate the performance of the proposed method through extensive simulation studies. We further illustrate its usefulness with application to the analysis of diabetes-related traits in the Atherosclerosis Risk in Communities (ARIC) Study.

2 Methods

Consider $n$ unrelated individuals sequenced in a region with $m$ rare variants and $K$ measured continuous traits. For individual $i = 1, \ldots, n$, let $Y_i = (y_{i1}, \ldots, y_{iK})^T$ denote the $K$ continuous traits, $X_i = (x_{i1}, \ldots, x_{ip})$ the $p$ covariates (including the intercept) to be adjusted for, and $G_i = (g_{i1}, \ldots, g_{im})$ the genotypes (number of minor alleles or imputed dosage) for the $m$ variants. Here for simplicity we have assumed a common set of covariates for all traits. The following methods can be readily adapted to the case of differing covariates.

We model $y_{ik}$ marginally as normally distributed with mean $\mu_{ik} = X_i\alpha_k + G_i\beta_k$ and variance $\sigma_k^2$, where $\alpha_k = (\alpha_{1k}, \ldots, \alpha_{pk})^T$ and $\beta_k = (\beta_{1k}, \ldots, \beta_{mk})^T$ are coefficient vectors of length $p$ and $m$ respectively. Denote the correlation matrix of $Y_i$ as $\Sigma = (\rho_{kl})$, where $\rho_{kl} = \mathrm{Cor}(y_{ik}, y_{il})$. The joint association test of the rare variant set tests $H_0: \beta_1 = \cdots = \beta_K = 0$ versus $H_a: \beta_{jk} \neq 0$ for some $j, k$.

2.1 Multi-trait sequence kernel association test

For the $j$-th variant, we first show that its score statistics for testing $\beta_{jk}$ across different traits have a correlation structure determined by the trait covariance matrix. Denote $X = (X_1^T, \ldots, X_n^T)^T$ as the $n \times p$ design matrix, and the associated hat matrix as $H = X(X^TX)^{-1}X^T$.

Consider testing the association of the $j$-th variant and $k$-th trait, $j = 1, \ldots, m$, $k = 1, \ldots, K$. We regress the $j$-th variant and $k$-th trait on the covariates $X$, and define the corresponding predicted values as $(\hat g_{1j}, \ldots, \hat g_{nj})^T = H(g_{1j}, \ldots, g_{nj})^T$ and $(\hat\mu_{1k}, \ldots, \hat\mu_{nk})^T = H(y_{1k}, \ldots, y_{nk})^T$. Denote $\tau_j = \sqrt{\sum_{i=1}^n (g_{ij} - \hat g_{ij})^2}$, $j = 1, \ldots, m$, and the correlation of genotype residuals for the $j$-th and $l$-th variants as

$$r_{jl} = \frac{\sum_{i=1}^n (g_{ij} - \hat g_{ij})(g_{il} - \hat g_{il})}{\tau_j \tau_l},$$

which can be interpreted as the (adjusted) linkage disequilibrium (LD, genotype correlation after adjusting for covariates). For rare variants, the LD is typically small. Define an $m \times m$ (adjusted) LD matrix $R = (\tau_j \tau_l r_{jl})$, $j = 1, \ldots, m$, $l = 1, \ldots, m$.
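As a rough illustration of the quantities just defined, the following NumPy sketch computes $\tau_j$, the adjusted LD $r_{jl}$, and the matrix $R$ from a genotype matrix and a covariate matrix. The function name `adjusted_ld` and the toy data are assumptions made for illustration and are not taken from the authors' R implementation; genotypes are stored here as an $n \times m$ array with one row per individual.

```python
import numpy as np

def adjusted_ld(G, X):
    """Return (tau, r, R) for genotypes G (n x m) adjusted for covariates X (n x p)."""
    # Hat matrix H = X (X^T X)^{-1} X^T projects onto the covariate space.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    G_res = G - H @ G                          # residuals g_ij - ghat_ij
    tau = np.sqrt((G_res ** 2).sum(axis=0))    # tau_j
    R = G_res.T @ G_res                        # entries tau_j * tau_l * r_jl
    r = R / np.outer(tau, tau)                 # adjusted LD (correlation) matrix
    return tau, r, R

# Toy data (hypothetical): intercept plus one covariate, five variants.
rng = np.random.default_rng(0)
n, m = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.binomial(2, 0.05, size=(n, m)).astype(float)
tau, r, R = adjusted_ld(G, X)
```

In this sketch `R` equals the cross-product of the residualized genotypes, which matches the form $R = GPG^T$ used in the Appendix once the genotype matrix is transposed to the $m \times n$ layout adopted there.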

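We can check that the score statistic for testing $\beta_{jk}$ is proportional to $U_{jk} = \sum_{i=1}^n g_{ij}(y_{ik} - \hat\mu_{ik})$ (up to a scale parameter, the estimated variance $\sigma_k^2$). We can further show that (see Appendix for details)

$$\mathrm{Var}(U_{jk}) = \tau_j^2\sigma_k^2, \quad \mathrm{Cor}(U_{jk}, U_{jl}) = \rho_{kl}, \quad \mathrm{Cor}(U_{jk}, U_{ik}) = r_{ji}, \quad \mathrm{Cor}(U_{jk}, U_{il}) = r_{ji}\rho_{kl}.$$

In the last two expressions, the first subscript $i$ of $U_{ik}$ and $U_{il}$ indexes a second variant, $i \neq j$.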

Therefore for the same trait, score statistics of different variants are dependent with correlations reflecting their adjusted LD. For the same variant, their score statistics across different traits reflect the trait correlations. As for different variants and different traits, the correlation of their score statistics is the product of the trait correlation and adjusted variant LD. Therefore for rare variant set with small LD, the correlation of score statistics involving any two different variants is typically small.

In the same spirit as the SKAT for association test of single trait, we propose the following multi-trait sequence kernel association test for the joint effects of variant set (denoted as MSKAT)

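$$Q = \sum_{j=1}^m w_j^2 S_j^T \hat\Sigma^{-1} S_j, \qquad (1)$$

where $w_j$ is a pre-specified weight (typically determined by the variant MAF; see the simulation section), and $S_j = (s_{j1}, \ldots, s_{jK})^T$ with $s_{jk} = U_{jk}\hat\sigma_k^{-1}$, $k = 1, \ldots, K$. Here we have standardized the score statistics based on the trait variation, and $\hat\sigma_k^2$ and $\hat\Sigma_{kl} = \hat\rho_{kl}$ are all estimated under the null model

$$\hat\sigma_k^2 = \frac{1}{n-p}\sum_{i=1}^n (y_{ik} - \hat\mu_{ik})^2, \qquad \hat\rho_{kl} = \frac{\sum_{i=1}^n (y_{ik} - \hat\mu_{ik})(y_{il} - \hat\mu_{il})}{(n-p)\hat\sigma_k\hat\sigma_l}.$$

Define a $K \times m$ matrix $S = (S_1, \ldots, S_m)$. The MSKAT statistic can be written as $Q = \mathrm{vec}(S)^T (W^2 \otimes \hat\Sigma^{-1}) \mathrm{vec}(S)$, where $W = \mathrm{diag}(w_1, \ldots, w_m)$, $\otimes$ denotes the matrix Kronecker product, and $\mathrm{vec}(\cdot)$ is the vector operator, which stacks the columns of a matrix into a vector.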

Alternatively we can consider summing the SKAT statistics over the individual traits

$$Q' = \sum_{k=1}^K \sum_{j=1}^m w_j^2 s_{jk}^2 = \mathrm{vec}(S)^T (W^2 \otimes I_K) \mathrm{vec}(S),$$

where $I_K$ denotes the $K \times K$ identity matrix.
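The two statistics can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' R code: the function name `mskat_statistics`, the array layouts (rows are individuals), and the toy data are all hypothetical. The sketch also checks numerically that the sum form of Q agrees with its Kronecker form.

```python
import numpy as np

def mskat_statistics(Y, X, G, w):
    """Y: n x K traits, X: n x p covariates (with intercept),
    G: n x m genotypes, w: length-m variant weights."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    resid = Y - H @ Y                                   # y_ik - muhat_ik (null model)
    sigma = np.sqrt((resid ** 2).sum(axis=0) / (n - p))  # sigmahat_k
    Sigma = (resid.T @ resid) / (n - p) / np.outer(sigma, sigma)  # rhohat_kl
    U = G.T @ resid                                     # m x K score statistics U_jk
    S = (U / sigma).T                                   # K x m, columns are S_j
    Sigma_inv = np.linalg.inv(Sigma)
    # Q = sum_j w_j^2 S_j^T Sigma^{-1} S_j
    Q = float(np.sum(w ** 2 * np.einsum("kj,kl,lj->j", S, Sigma_inv, S)))
    # Q' = sum_j sum_k w_j^2 s_jk^2
    Q_prime = float(np.sum(w ** 2 * (S ** 2).sum(axis=0)))
    return Q, Q_prime, S, Sigma

# Toy null data (hypothetical).
rng = np.random.default_rng(1)
n, m, K = 300, 4, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.binomial(2, 0.05, size=(n, m)).astype(float)
Y = rng.normal(size=(n, K))
w = rng.uniform(0.5, 2.0, size=m)

Q, Q_prime, S, Sigma = mskat_statistics(Y, X, G, w)

# Kronecker forms: vec(S) stacks the columns S_1, ..., S_m.
vecS = S.flatten(order="F")
Q_kron = vecS @ np.kron(np.diag(w ** 2), np.linalg.inv(Sigma)) @ vecS
Qp_kron = vecS @ np.kron(np.diag(w ** 2), np.eye(K)) @ vecS
```

The column-major flattening matters: it is what makes the block-diagonal matrix $W^2 \otimes \hat\Sigma^{-1}$ line up with the per-variant vectors $S_j$.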

A closely related approach is the multivariate kernel machine regression (MKMR) proposed by Maity et al. (2012) and further studied by Hua and Ghosh (2015). MKMR has mainly been studied for testing common variant set association. By incorporating variant weights, MKMR can be readily generalized to test rare variant set association. Both MKMR and the proposed Q/Q′ are quadratic functions of the multivariate outcomes, and hence their significance p-values can all be computed analytically and quickly based on mixtures of chi-square distributions. For Q/Q′, we have standardized the outcomes and worked on the individual trait-variant score statistics (hence implicitly using the linear kernel). MKMR, in contrast, models the unscaled outcomes directly, constructs the quadratic test statistic assuming a different working covariance matrix, and can use a general kernel matrix for the association test.

2.2 Significance p-value calculation

In the Appendix we show that the asymptotic null covariance of $\mathrm{vec}(S)$ is $R \otimes \hat\Sigma$, and that both Q and Q′ can be derived as variance component score tests under a working linear mixed effects model. Both Q and Q′ are quadratic functions of $\mathrm{vec}(S)$, and under the null both asymptotically follow a weighted sum of independent 1 degree of freedom (1-DF) chi-square distributions. For Q, the weights are $\lambda_{kj}$, where $\lambda_{kj} = \lambda_j$, $k = 1, \ldots, K$, $j = 1, \ldots, m$, and $(\lambda_1, \ldots, \lambda_m)$ are the eigenvalues of the matrix $WRW$. For Q′, the weights are $\lambda_j\gamma_k$, $j = 1, \ldots, m$, $k = 1, \ldots, K$, where $(\gamma_1, \ldots, \gamma_K)$ are the eigenvalues of $\hat\Sigma$.
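The mixture weights described above can be obtained from two small eigendecompositions rather than one of size $mK$. The sketch below is illustrative only: `mixture_weights` is a hypothetical helper, the matrices are toy inputs, and the final step of turning weights into a p-value (Davies' method or a saddlepoint approximation) is deliberately left out.

```python
import numpy as np

def mixture_weights(R, Sigma, w):
    """Chi-square mixture weights for Q and Q' given R (m x m), Sigma (K x K), w (m,)."""
    W = np.diag(w)
    lam = np.linalg.eigvalsh(W @ R @ W)       # lambda_1, ..., lambda_m
    gam = np.linalg.eigvalsh(Sigma)           # gamma_1, ..., gamma_K
    K = Sigma.shape[0]
    weights_Q = np.repeat(lam, K)             # each lambda_j appears K times
    weights_Qp = np.outer(lam, gam).ravel()   # all products lambda_j * gamma_k
    return weights_Q, weights_Qp

# Toy inputs (hypothetical): a positive definite R and a compound-symmetry Sigma.
rng = np.random.default_rng(3)
m, K, rho = 3, 4, 0.5
A = rng.normal(size=(m, m))
R = A @ A.T
Sigma = (1 - rho) * np.eye(K) + rho * np.ones((K, K))
w = np.array([1.0, 2.0, 0.5])
W = np.diag(w)

weights_Q, weights_Qp = mixture_weights(R, Sigma, w)

# Cross-check against the eigenvalues of the full Kronecker products.
full_Q = np.linalg.eigvalsh(np.kron(W @ R @ W, np.eye(K)))
full_Qp = np.linalg.eigvalsh(np.kron(W @ R @ W, Sigma))
```

Working with the $m \times m$ and $K \times K$ factors separately avoids forming the $mK \times mK$ matrix, which is one plausible reason the analytic p-values are cheap to compute.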

We can quickly compute the significance p-values of Q and Q′ without the need of intensive permutations using a mix of Davies' method (Davies, 1980), and a modified scaled chi-square distribution approximation method of Liu et al. (2009a) following Wu et al. (2011); Lee et al. (2012), or the saddlepoint approximation method (Kuonen, 1999) following Chen et al. (2014); Wu et al. (2015).

3 Results

3.1 Simulation studies

We conduct simulation studies to investigate the performance of the proposed MSKAT statistics. For comparison, we included the Bonferroni corrected minimum p-value based on the individual trait SKAT significance p-values (denoted as Pmin), and the MKMR using the variant weighted linear kernel.

We simulated 1000 individuals and considered two covariates: a standard normal covariate $X_1$, and a binary ancestry indicator $X_2$ with $\Pr(X_2 = 1) = 0.5$. We consider testing $K = 4$ related traits with a compound-symmetry correlation matrix: $Y_1 = 1 + 0.5X_1 + 0.5X_2 + \eta_1 + \varepsilon_1$, $Y_2 = 1 + X_1 + X_2 + \eta_2 + \varepsilon_2$, $Y_3 = 1 + 0.5X_1 + 0.5X_2 + \eta_3 + \varepsilon_3$, and $Y_4 = 1 + X_1 + X_2 + \eta_4 + \varepsilon_4$, where $(\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4)$ are zero-mean normal with variances $(\sigma_1^2 = 2, \sigma_2^2 = 1, \sigma_3^2 = 1, \sigma_4^2 = 1)$ and pairwise correlation $\rho$, and $(\eta_1, \eta_2, \eta_3, \eta_4)$ are contributions from the set of rare variants, which are simulated as follows.

Using a calibrated coalescent model (Schaffner et al., 2005), we first generated 10,000 European-like haplotypes of length 1000 kb. In each replicate we randomly paired the haplotypes to simulate 1000 individuals. We studied the rare variants with MAF ≤ 0.01 in a randomly selected gene region of length 10 kb, denoted as $(G_1, \ldots, G_m)$. We model the rare variant contribution to the $k$-th trait as $\eta_k = \sum_{j=1}^m \beta_{kj} G_j$, $k = 1, \ldots, K$. For all methods, we assign the variant weights following Wu et al. (2011), namely the Beta(1, 25) density function evaluated at the rare variant MAF.
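To give a feel for how strongly this weighting favors rarer variants, the sketch below evaluates the Beta(1, 25) density at a few MAF values using only the Python standard library. The function name `beta_density` and the chosen MAF grid are illustrative assumptions.

```python
import math

def beta_density(x, a=1.0, b=25.0):
    """Beta(a, b) probability density at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

# Weights at a few illustrative MAF values; rarer variants get larger weights.
for maf in (0.0005, 0.001, 0.005, 0.01):
    print(f"MAF={maf:.4f}  weight={beta_density(maf):.3f}")
```

With parameters 1 and 25 the density reduces to $25(1 - p)^{24}$, so across the range MAF ≤ 0.01 used here the weights vary only from about 25 down to roughly 19.6.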

We used 10 million experiments under the null to evaluate the type I error, and 10,000 experiments under various combinations of $\beta_{kj}$ to evaluate the power. We conducted simulations for ρ = 0.2, 0.5, and 0.8. We evaluated the type I error at the nominal significance levels $\alpha = 10^{-5}$, $10^{-4}$, and $10^{-3}$ by setting all $\beta_{kj} = 0$. For the power evaluation, we set the $\beta_{kj}$ separately for each trait as follows. Each time we randomly selected a proportion θ of the rare variants and set their $|\beta_{kj}| = -d\log_{10}(p_j)$, where $p_j$ is the MAF of the $j$-th rare variant. The remaining (null) rare variants have zero coefficients. We have thus assumed that rarer variants have larger effect sizes. We conducted simulations for (1) θ = 0.25, d = 0.25, (2) θ = 0.5, d = 0.2, (3) θ = 0.75, d = 0.15. They correspond to absolute regression coefficients of 0.5, 0.4 and 0.3 respectively for MAF = 0.01. The signs of $\beta_{kj}$ were randomly set as 1 or −1. We also conducted simulations in which all $\beta_{kj}$ have the same sign (see supplementary materials for complete details). We considered four scenarios, in which only the first L traits were associated with the rare variant set, L = 1, 2, 3, 4. Intuitively, in the first scenario (L = 1), where only the first trait is associated with the rare variant set, we expect that the minimum p-value based approach or testing the first trait alone will have good performance. But we will show that by simultaneously testing correlated null traits, the proposed MSKAT can actually improve the detection power compared to testing the first trait alone. When multiple correlated traits are associated with the rare variant set, the proposed MSKAT can offer much greater detection power than the minimum p-value based approach.
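The effect-size rule can be checked with a short calculation. The helper name `effect_size` is hypothetical, and the second MAF value (0.001) is chosen here only to show how the magnitude grows for rarer variants.

```python
import math

def effect_size(d, maf):
    """|beta| = -d * log10(MAF); rarer variants receive larger effects."""
    return -d * math.log10(maf)

# The three (theta, d) settings used in the simulations.
settings = [(0.25, 0.25), (0.5, 0.2), (0.75, 0.15)]
for theta, d in settings:
    at_1pct = effect_size(d, 0.01)     # magnitude at MAF = 0.01
    at_01pct = effect_size(d, 0.001)   # magnitude at MAF = 0.001 (illustrative)
    print(f"theta={theta}, d={d}: {at_1pct:.2f} at MAF 0.01, {at_01pct:.2f} at MAF 0.001")
```

Because $\log_{10}(0.01) = -2$, the magnitude at MAF = 0.01 is simply $2d$, which reproduces the values 0.5, 0.4 and 0.3 quoted in the text.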

Figures 1 to 3 show the QQ plots of null significance p-values for all methods under different trait correlations ρ. Table 1 summarizes the estimated type I errors under different significance levels. Overall we can see that the type I errors are well controlled under all considered scenarios for all methods. Pmin is based on the Bonferroni corrected significance level and tends to be very conservative under strong trait dependence (ρ = 0.8), while MKMR, Q and Q′ perform more consistently across trait correlations.

Figure 1. QQ plots of null significance p-values for all methods: ρ = 0.2

Figure 3. QQ plots of null significance p-values for all methods: ρ = 0.8

Table 1.

Type I errors (divided by the nominal significance level α) of multivariate tests for four continuous traits with pairwise correlation ρ: Q is the proposed MSKAT statistic incorporating the trait correlation, Q′ is the sum of individual trait SKAT statistics, Pmin is the Bonferroni corrected minimum p-value based on the individual trait SKAT significance p-values, and MKMR is the multivariate kernel machine regression approach.

| ρ | 0.2 | 0.2 | 0.2 | 0.5 | 0.5 | 0.5 | 0.8 | 0.8 | 0.8 |
|---|---|---|---|---|---|---|---|---|---|
| α | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ |
| Q | 0.80 | 0.82 | 0.89 | 0.80 | 0.82 | 0.89 | 0.80 | 0.82 | 0.89 |
| Q′ | 0.86 | 0.84 | 0.90 | 0.94 | 0.86 | 0.91 | 0.94 | 0.87 | 0.91 |
| Pmin | 0.70 | 0.81 | 0.89 | 0.77 | 0.82 | 0.88 | 0.77 | 0.70 | 0.73 |
| MKMR | 0.88 | 0.89 | 0.93 | 0.91 | 0.89 | 0.93 | 0.88 | 0.90 | 0.93 |
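When reading the ratios in Table 1 it helps to know how much Monte Carlo noise to expect. The sketch below computes the standard error of the ratio (empirical type I error divided by α) under the simplifying assumptions, made here for illustration, that the 10 million null replicates are independent and that a test is exactly calibrated. The helper name `ratio_se` is hypothetical.

```python
import math

def ratio_se(alpha, n_rep):
    """Standard error of (empirical type I error / alpha) under binomial sampling."""
    return math.sqrt(alpha * (1 - alpha) / n_rep) / alpha

n_rep = 10_000_000
for alpha in (1e-5, 1e-4, 1e-3):
    expected_rejections = alpha * n_rep
    print(f"alpha={alpha:g}: about {expected_rejections:.0f} expected rejections, "
          f"ratio SE of about {ratio_se(alpha, n_rep):.3f}")
```

Under these assumptions the standard error is about 0.10 at $\alpha = 10^{-5}$, 0.03 at $10^{-4}$ and 0.01 at $10^{-3}$, so ratios near 0.9 at the largest level are unlikely to reflect simulation noise alone, while the ratios at the smallest level are estimated much less precisely.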

Tables 2, 3, 4, and 5 summarize the power at significance level $\alpha = 10^{-4}$ for L = 1, 2, 3, 4 respectively. When only the first trait is associated with the rare variant set (Table 2), Pmin performs better than MKMR, Q and Q′ under weak trait correlation (ρ = 0.2). MKMR and the MSKAT statistic Q benefit from increased trait correlations, and offer much improved power by incorporating strongly correlated null traits.

Table 2.

Power of multivariate tests for four continuous traits with pairwise correlation ρ: Q is the proposed MSKAT statistic incorporating the trait correlation, Q′ is the sum of individual trait SKAT statistics, Pmin is the Bonferroni corrected minimum p-value based on the individual trait SKAT significance p-values, and MKMR is the multivariate kernel machine regression approach. Only the first trait is associated with the rare variant set (L = 1). The causal rare variant proportion is θ and their regression coefficient is set as −d log10(MAF). The highest powered tests in each column are bold-faced.

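| (d, θ) | (0.25, 0.25) | (0.25, 0.25) | (0.25, 0.25) | (0.2, 0.5) | (0.2, 0.5) | (0.2, 0.5) | (0.15, 0.75) | (0.15, 0.75) | (0.15, 0.75) |
|---|---|---|---|---|---|---|---|---|---|
| ρ | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 |
| Q | 0.016 | 0.038 | 0.204 | 0.024 | 0.054 | 0.280 | 0.017 | 0.036 | 0.205 |
| Q′ | 0.008 | 0.003 | 0.001 | 0.012 | 0.004 | 0.002 | 0.008 | 0.002 | 0.001 |
| Pmin | 0.024 | 0.024 | 0.024 | 0.035 | 0.035 | 0.035 | 0.022 | 0.022 | 0.023 |
| MKMR | 0.003 | 0.011 | 0.091 | 0.004 | 0.016 | 0.128 | 0.002 | 0.010 | 0.089 |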

Table 3.

Power of multivariate tests for four continuous traits with pairwise correlation ρ: Q is the proposed MSKAT statistic incorporating the trait correlation, Q′ is the sum of individual trait SKAT statistics, Pmin is the Bonferroni corrected minimum p-value based on the individual trait SKAT significance p-values, and MKMR is the multivariate kernel machine regression approach. Only the first L = 2 traits are associated with the rare variant set. The causal rare variant proportion is θ and their regression coefficient is set as −d log10(MAF). The highest powered tests in each column are bold-faced.

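| (d, θ) | (0.25, 0.25) | (0.25, 0.25) | (0.25, 0.25) | (0.2, 0.5) | (0.2, 0.5) | (0.2, 0.5) | (0.15, 0.75) | (0.15, 0.75) | (0.15, 0.75) |
|---|---|---|---|---|---|---|---|---|---|
| ρ | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 |
| Q | 0.147 | 0.294 | 0.749 | 0.223 | 0.427 | 0.896 | 0.156 | 0.326 | 0.854 |
| Q′ | 0.085 | 0.029 | 0.012 | 0.136 | 0.048 | 0.019 | 0.088 | 0.030 | 0.012 |
| Pmin | 0.123 | 0.123 | 0.123 | 0.175 | 0.175 | 0.174 | 0.119 | 0.118 | 0.118 |
| MKMR | 0.117 | 0.262 | 0.721 | 0.177 | 0.381 | 0.879 | 0.120 | 0.285 | 0.837 |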

Table 4.

Power of multivariate tests for four continuous traits with pairwise correlation ρ: Q is the proposed MSKAT statistic incorporating the trait correlation, Q′ is the sum of individual trait SKAT statistics, Pmin is the Bonferroni corrected minimum p-value based on the individual trait SKAT significance p-values, and MKMR is the multivariate kernel machine regression approach. Only the first L = 3 traits are associated with the rare variant set. The causal rare variant proportion is θ and their regression coefficient is set as −d log10(MAF). The highest powered tests in each column are bold-faced.

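| (d, θ) | (0.25, 0.25) | (0.25, 0.25) | (0.25, 0.25) | (0.2, 0.5) | (0.2, 0.5) | (0.2, 0.5) | (0.15, 0.75) | (0.15, 0.75) | (0.15, 0.75) |
|---|---|---|---|---|---|---|---|---|---|
| ρ | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 |
| Q | 0.372 | 0.604 | 0.945 | 0.534 | 0.774 | 0.989 | 0.421 | 0.686 | 0.982 |
| Q′ | 0.250 | 0.100 | 0.041 | 0.385 | 0.171 | 0.073 | 0.278 | 0.108 | 0.043 |
| Pmin | 0.206 | 0.206 | 0.204 | 0.283 | 0.281 | 0.276 | 0.199 | 0.196 | 0.193 |
| MKMR | 0.348 | 0.596 | 0.944 | 0.503 | 0.765 | 0.988 | 0.394 | 0.680 | 0.981 |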

Table 5.

Power of multivariate tests for four continuous traits with pairwise correlation ρ: Q is the proposed MSKAT statistic incorporating the trait correlation, Q′ is the sum of individual trait SKAT statistics, Pmin is the Bonferroni corrected minimum p-value based on the individual trait SKAT significance p-values, and MKMR is the multivariate kernel machine regression approach. All L = 4 traits are associated with the rare variant set. The causal rare variant proportion is θ and their regression coefficient is set as −d log10(MAF). The highest powered tests in each column are bold-faced.

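| (d, θ) | (0.25, 0.25) | (0.25, 0.25) | (0.25, 0.25) | (0.2, 0.5) | (0.2, 0.5) | (0.2, 0.5) | (0.15, 0.75) | (0.15, 0.75) | (0.15, 0.75) |
|---|---|---|---|---|---|---|---|---|---|
| ρ | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 |
| Q | 0.597 | 0.807 | 0.988 | 0.779 | 0.928 | 0.999 | 0.681 | 0.882 | 0.997 |
| Q′ | 0.453 | 0.225 | 0.099 | 0.646 | 0.376 | 0.183 | 0.523 | 0.257 | 0.110 |
| Pmin | 0.277 | 0.275 | 0.271 | 0.376 | 0.372 | 0.366 | 0.266 | 0.262 | 0.257 |
| MKMR | 0.586 | 0.801 | 0.987 | 0.767 | 0.926 | 0.998 | 0.672 | 0.880 | 0.996 |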

The statistic Q′ ignored the trait dependence by directly summing over individual trait SKAT statistics. Overall we can see that it suffered power loss with increasing trait correlations. The minimum p-value based approach Pmin had nearly constant power across different trait correlations.

When multiple correlated traits are associated with the rare variant set (Tables 3, 4, and 5), the MSKAT statistic Q had the best overall performance. Again Q′ lost power with increasing trait correlations, and Pmin had nearly constant power across different trait correlations. Both MKMR and the MSKAT statistic Q accounted for the trait dependence, and gained power with increasing trait correlations.

Overall we can see that the proposed MSKAT statistic Q is an attractive approach with good power across a wide range of alternatives.

An interesting scenario is one in which only the first trait Y1 is associated with the rare variant set and all the others are null traits (L = 1). Stephens (2013) and Wu and Pankow (2015) have reported that joint testing that incorporates a correlated null trait can improve the power for testing the association of common variants. Table 6 compares the SKAT based rare variant set testing of Y1 alone with the joint multivariate testing under the previous simulation settings. We can see that jointly testing highly correlated traits can have greater power than testing Y1 alone. In general both MKMR and the proposed MSKAT statistic Q can benefit from the trait correlations to substantially improve the detection power. The minimum p-value based approach is largely unaffected by the trait correlations.

Table 6.

Detection power incorporating correlated null traits: only the first trait Y1 is associated with the rare variant set. The causal rare variant proportion is θ and their regression coefficient is set as −d log10(MAF). We compared the multivariate trait based test approach, Q, Q′, Pmin and MKMR to the SKAT applied to testing Y1 only, denoted as SKAT(Y1). The highest powered tests in each row are bold-faced.

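$\alpha = 10^{-4}$, d = 0.25, θ = 0.25

| ρ | SKAT(Y1) | Q | Q′ | Pmin | MKMR |
|---|---|---|---|---|---|
| 0.2 | 0.038 | 0.016 | 0.008 | 0.024 | 0.003 |
| 0.5 | 0.038 | 0.038 | 0.003 | 0.024 | 0.011 |
| 0.8 | 0.038 | 0.204 | 0.001 | 0.024 | 0.091 |

$\alpha = 10^{-4}$, d = 0.2, θ = 0.5

| ρ | SKAT(Y1) | Q | Q′ | Pmin | MKMR |
|---|---|---|---|---|---|
| 0.2 | 0.054 | 0.024 | 0.012 | 0.035 | 0.004 |
| 0.5 | 0.054 | 0.054 | 0.004 | 0.035 | 0.016 |
| 0.8 | 0.054 | 0.280 | 0.002 | 0.035 | 0.128 |

$\alpha = 10^{-4}$, d = 0.15, θ = 0.75

| ρ | SKAT(Y1) | Q | Q′ | Pmin | MKMR |
|---|---|---|---|---|---|
| 0.2 | 0.036 | 0.017 | 0.008 | 0.022 | 0.002 |
| 0.5 | 0.036 | 0.036 | 0.002 | 0.022 | 0.010 |
| 0.8 | 0.036 | 0.205 | 0.001 | 0.023 | 0.089 |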

3.2 ARIC GWAS study

The Atherosclerosis Risk in Communities (ARIC) study (The ARIC Investigators, 1989) is a multi-center prospective investigation of atherosclerotic disease. Men and women aged 45–64 years at baseline were recruited from four U.S. communities: Forsyth County, North Carolina; Jackson, Mississippi; suburban areas of Minneapolis, Minnesota; and Washington County, Maryland. A total of 15,792 individuals participated in the baseline examination in 1987–1989. The vast majority of ARIC participants are of European (73%) or African ancestry (26%). We applied the proposed MSKAT and other competing methods in ARIC to test for association between diabetes-related traits and rare variants in each gene. Genotypes were obtained from the Illumina HumanExome BeadChip (Grove et al., 2013), which has information on 247,870 variants. We jointly analyzed fasting (≥ 8 hr) glucose levels measured at four visits each approximately three years apart in 5866 non-diabetic white participants. The ARIC Study design and methods for measurement of plasma glucose and other covariates have been described previously (Rasmussen-Torvik et al., 2010). The glucose levels had an average correlation of 0.55 between visits. We applied an additive genetic model and adjusted for age, gender and study center (population indicators).

We included rare variants with minor allele counts larger than 1 and MAF less than 0.01. In total we analyzed 12,439 rare variant sets in the exome-wide scan for glucose levels. We set the exome-wide significance level at $4.0 \times 10^{-6}$, which is the Bonferroni corrected significance level for the total number of tested rare variant sets. When analyzing glucose at each visit separately, we did not identify any rare variant set at the exome-wide significance level. When the 4 visits were analyzed jointly, the MSKAT Q identified a set of two rare variants in the gene YAP1 (p-value = $2.4 \times 10^{-6}$) that passed exome-wide significance. By contrast, the Q′, Pmin and MKMR tests did not identify any significant rare variant set. For the gene YAP1, Q′ reported a p-value of $8.8 \times 10^{-3}$, MKMR reported a p-value of $1.5 \times 10^{-4}$, and the SKAT p-values for the four measures were 0.007, 0.006, 0.216, and 0.102 respectively. The two rare YAP1 variants (rs61746398 and rs112417656) have 6 and 28 alleles detected in the samples respectively, and their estimated regression coefficients (expressed in mg/dL per copy of the minor allele) and associated p-values (listed in parentheses) at four visits are −5.0 (0.123), −11.0 (0.001), −2.7 (0.473), −11.6 (0.001) for rs61746398, and 4.0 (0.008), −3.8 (0.020), −2.3 (0.190), −1.2 (0.487) for rs112417656. Overall the rare variant effects exhibit some heterogeneity across four visits.
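The thresholds in this paragraph can be reproduced with simple arithmetic. The sketch below assumes a family-wise error rate of 0.05, which the text does not state explicitly but which is consistent with the reported level, and uses the usual definition of a Bonferroni corrected minimum p-value (the smallest p-value multiplied by the number of tests, capped at 1). Both helper names are hypothetical.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return family_alpha / n_tests

def bonferroni_min_p(p_values):
    """Bonferroni corrected minimum p-value across several tests."""
    return min(1.0, len(p_values) * min(p_values))

# 0.05 is an assumed family-wise level; 12,439 sets were tested.
threshold = bonferroni_threshold(0.05, 12439)

# Per-visit SKAT p-values reported for YAP1.
skat_p = [0.007, 0.006, 0.216, 0.102]
p_min = bonferroni_min_p(skat_p)

print(f"exome-wide threshold: {threshold:.2e}")
print(f"Pmin for YAP1: {p_min:.3f}")
print(f"MSKAT Q passes: {2.4e-6 < threshold}, Pmin passes: {p_min < threshold}")
```

Under these assumptions the threshold rounds to $4.0 \times 10^{-6}$ and the corrected minimum p-value for YAP1 is 0.024, far from exome-wide significance, in line with the contrast drawn in the text between the joint test and the per-visit analyses.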

YAP1 is a potent oncogene and known to play a role in the development and progression of multiple cancers (Wu et al., 2010; Chen et al., 2013; Dixit et al., 2014). Common variants in YAP1 are associated with polycystic ovary syndrome (PCOS) (Louwers et al., 2013), an obesity-related endocrine disorder sharing similar epidemiological and patho-physiological features with type 2 diabetes. Among Han Chinese PCOS patients, the risk genotype for one of these common YAP1 variants (rs11225161) was also associated with higher blood glucose 30 min and 60 min after an oral glucose tolerance test (Li et al., 2012). Further research is needed on the possible role of identified rare variants in YAP1 in glucose metabolism.

4 Discussion

In the application, we have analyzed multiple measures of fasting glucose levels over time as an illustrative example. As suggested by our extensive simulation studies, the proposed methods could also be useful for different correlated traits (for example, BMI and waist circumference). In our simulations, we have assumed positive correlations between traits. Since the sequence kernel association test and the proposed methods are all quadratic forms of the score statistics, they are invariant to any location and scale transformations of the outcomes. Therefore we expect that the same conclusions will hold for traits with negative correlations (for example, HDL and LDL).

In this paper we have focused on the joint analysis of multiple continuous traits. Our results have shown that joint association test can improve the variant detection power even when jointly testing additional strongly correlated null traits. It will be interesting to generalize the methods to study the joint test of mixed outcomes, which analytically and computationally are more challenging.

We have implemented the proposed methods in R programs posted at http://www.umn.edu/~baolin/research/mskat_Rcode.html. The proposed methods are very easy to program, and the implementations are extremely efficient: the significance p-values are computed analytically and quickly without the need for intensive permutation or random sampling. Figure 4 compares MKMR and the proposed methods in terms of the average CPU seconds used to compute significance p-values for rare variant sets on a single Linux workstation with a 2.5 GHz CPU and 24 GB memory. We follow the previous simulation setup with two covariates and four correlated traits (ρ = 0.5). We consider two sample sizes, n = 1000 and n = 5000, and three variant set sizes: 25, 50 and 100 variants in each variant set. MKMR does not scale well with the sample size and uses substantially more time than the proposed methods. Here we report the average time for one variant set using MKMR, and the average time for 1000 variant sets using the proposed methods. Overall we can see that both methods scale roughly linearly with the variant set size, and the proposed methods are very efficient and many orders of magnitude faster than MKMR.

Figure 4. Average CPU seconds used to analyze variant sets: the x-axis shows the variant set size, the y-axis shows the average CPU seconds used (over 10 simulations); the first plot is for computing both p-values of Q and Q′ for 1000 variant sets, the other two plots are for computing p-values of MKMR for one variant set. To save computation time and make the comparison fair, we have used the same analytical approach as the proposed methods to compute significance p-values for MKMR instead of the empirical random sampling approach as in Maity et al. (2012).

In summary, we recommend the proposed MSKAT statistic Q as a complementary approach for enhancing the power to detect rare variant sets by jointly testing multiple correlated traits, which are often collected in genetic studies.

Supplementary Material

Supp Info

Figure 2. QQ plots of null significance p-values for all methods: ρ = 0.5

Acknowledgements

This research was supported in part by NIH grant GM083345 and CA134848. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations, and the ARIC publications committee for helpful comments. We want to thank the editor and reviewer for their constructive comments which have greatly improved the presentation of the paper.

The ARIC Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute (NHLBI) contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN2682011000010C, HHSN2682011000011C, and HHSN2682011000012C). The authors thank the staff and participants of the ARIC study for their important contributions. Support for exome chip genotyping in the ARIC Study was provided by the National Institutes of Health (NIH) American Recovery and Reinvestment Act of 2009 (ARRA) (5RC2HL102419).

APPENDIX

Covariance of score statistics

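Consider the score statistic $U_{jk} = \sum_{i=1}^n g_{ij}(y_{ik} - \hat\mu_{ik})$. Denote $P = I_n - H$, where $I_n$ is the $n \times n$ identity matrix. Let $G_j = (g_{1j}, \ldots, g_{nj})^T$ and $Y_k = (y_{1k}, \ldots, y_{nk})^T$. Using matrix notation, the score statistic can be written as $U_{jk} = G_j^T P Y_k$. Since $P$ is symmetric and idempotent, $\mathrm{Var}(U_{jk}) = \sigma_k^2 G_j^T P^2 G_j = \sigma_k^2\tau_j^2$, and

$$\begin{aligned}
\mathrm{Cov}(U_{jk}, U_{jl}) &= \mathrm{Cov}(G_j^T P Y_k, G_j^T P Y_l) = \sigma_k\sigma_l\rho_{kl}\tau_j^2, \\
\mathrm{Cov}(U_{jk}, U_{ik}) &= \mathrm{Cov}(G_j^T P Y_k, G_i^T P Y_k) = \sigma_k^2\tau_j\tau_i r_{ji}, \\
\mathrm{Cov}(U_{jk}, U_{il}) &= \mathrm{Cov}(G_j^T P Y_k, G_i^T P Y_l) = \sigma_k\sigma_l\rho_{kl}\tau_j\tau_i r_{ji},
\end{aligned}$$

and hence

$$\mathrm{Cor}(U_{jk}, U_{jl}) = \rho_{kl}, \quad \mathrm{Cor}(U_{jk}, U_{ik}) = r_{ji}, \quad \mathrm{Cor}(U_{jk}, U_{il}) = r_{ji}\rho_{kl}.$$

We estimate $\hat\sigma_k^2 = \sum_{i=1}^n (y_{ik} - \hat\mu_{ik})^2 / (n-p)$ and $\hat\rho_{kl} = \sum_{i=1}^n (y_{ik} - \hat\mu_{ik})(y_{il} - \hat\mu_{il}) / [(n-p)\hat\sigma_k\hat\sigma_l]$.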

Variance component score test for multi-trait association

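Denote the response matrix $Y = (Y_1, \ldots, Y_n)$ of dimension $K \times n$. We model $Y_i = (X_i\alpha)^T + (G_i\beta)^T + \epsilon_i$, where $\alpha = (\alpha_1, \ldots, \alpha_K)$ and $\beta = (\beta_1, \ldots, \beta_K)$ are coefficient matrices of dimension $p \times K$ and $m \times K$ respectively. For simplicity of notation, we assume that the responses have been standardized, i.e., $\epsilon_i \sim N(0, \Sigma)$.

Denote the vector operator as $\mathrm{vec}(\cdot)$, which stacks the columns of a matrix into a vector. Assume that $\mathrm{vec}(\beta^T) \sim N(0, \sigma_0^2 B)$, where $B$ is a pre-defined matrix of dimension $(mK) \times (mK)$. Then testing the joint effects of the rare variants reduces to testing $H_0: \sigma_0^2 = 0$.

Denote $G = (G_1^T, \ldots, G_n^T)$, of dimension $m \times n$, and $E = (\epsilon_1, \ldots, \epsilon_n)$. We have $\mathrm{Cov}(\mathrm{vec}(E)) = I_n \otimes \Sigma$, and $Y = (X\alpha)^T + \beta^T G + E$. Note that we can write $\mathrm{vec}(\beta^T G) = (G^T \otimes I_K)\mathrm{vec}(\beta^T)$, hence $\mathrm{Cov}[\mathrm{vec}(\beta^T G)] = \sigma_0^2 (G^T \otimes I_K) B (G \otimes I_K)$. Denote $A = \mathrm{Cov}[\mathrm{vec}(Y)]$. We can check that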
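The vec identity used in this step is an instance of the general rule $\mathrm{vec}(AXB) = (B^T \otimes A)\mathrm{vec}(X)$ with $A = I_K$, $X = \beta^T$ and $B = G$. A quick numerical check in NumPy, using small random matrices whose sizes are chosen arbitrarily for illustration:

```python
import numpy as np

def vec(M):
    """Stack the columns of a matrix into a vector (column-major order)."""
    return M.flatten(order="F")

# Toy dimensions (hypothetical); G follows the Appendix layout, m x n.
rng = np.random.default_rng(2)
n, m, K = 6, 3, 2
G = rng.normal(size=(m, n))
beta = rng.normal(size=(m, K))

lhs = vec(beta.T @ G)                          # vec of the K x n matrix beta^T G
rhs = np.kron(G.T, np.eye(K)) @ vec(beta.T)    # (G^T kron I_K) vec(beta^T)

print(np.allclose(lhs, rhs))
```

The check relies on column-major stacking; with NumPy's default row-major `flatten` the two sides would not line up.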

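$$A = I_n \otimes \Sigma + \sigma_0^2 (G^T \otimes I_K) B (G \otimes I_K),$$

and $E[\mathrm{vec}(Y)] = \mathrm{vec}((X\alpha)^T)$. Therefore, up to an additive constant, the log likelihood is

$$\ell = -\frac{1}{2}\log|A| - \frac{1}{2}[\mathrm{vec}(Y) - \mathrm{vec}((X\alpha)^T)]^T A^{-1} [\mathrm{vec}(Y) - \mathrm{vec}((X\alpha)^T)].$$

Note that $\frac{\partial}{\partial\sigma_0^2} A^{-1} = -A^{-1}[(G^T \otimes I_K) B (G \otimes I_K)]A^{-1}$. Therefore the score statistic for testing $\sigma_0^2 = 0$ is proportional to

$$Q_v = \mathrm{vec}(Y - (X\hat\alpha)^T)^T A_0^{-1}[(G^T \otimes I_K) B (G \otimes I_K)]A_0^{-1}\mathrm{vec}(Y - (X\hat\alpha)^T),$$

where $A_0 = I_n \otimes \hat\Sigma$, and $\hat\alpha$ and $\hat\Sigma$ are obtained under the null model by solving

$$\max_{\alpha, \Sigma} \; -\frac{1}{2}\log|I_n \otimes \Sigma| - \frac{1}{2}[\mathrm{vec}(Y) - \mathrm{vec}((X\alpha)^T)]^T (I_n \otimes \Sigma^{-1}) [\mathrm{vec}(Y) - \mathrm{vec}((X\alpha)^T)],$$

which can be solved by fitting $K$ separate linear regression models and then estimating $\Sigma$ by the residual covariance matrix. Denote the projection matrix $P = I_n - X(X^TX)^{-1}X^T$. We can check that $Y - (X\hat\alpha)^T = YP$.

Note that $A_0^{-1} = I_n \otimes \hat\Sigma^{-1}$, and $A_0^{-1}\mathrm{vec}(Y - (X\hat\alpha)^T) = \mathrm{vec}(\hat\Sigma^{-1}YP)$. Therefore

$$Q_v = \mathrm{vec}(\hat\Sigma^{-1}YP)^T [(G^T \otimes I_K) B (G \otimes I_K)] \mathrm{vec}(\hat\Sigma^{-1}YP).$$

Denote $S = YPG^T$. Note that $(G \otimes I_K)\mathrm{vec}(\hat\Sigma^{-1}YP) = \mathrm{vec}(\hat\Sigma^{-1}S) = (I_m \otimes \hat\Sigma^{-1})\mathrm{vec}(S)$. And hence

$$Q_v = \mathrm{vec}(S)^T (I_m \otimes \hat\Sigma^{-1}) B (I_m \otimes \hat\Sigma^{-1}) \mathrm{vec}(S).$$

We can choose $B = W^2 \otimes \hat\Sigma$, where $W = \mathrm{diag}(w_1, \ldots, w_m)$ is a diagonal matrix of pre-defined weights assigned to each variant. Here we assume a priori that the regression coefficients for a given variant have the same dependence structure as the covariance of the traits, and regression coefficients across different variants are independent. Then we can derive the proposed MSKAT statistic

$$Q = \mathrm{vec}(S)^T (W^2 \otimes \hat\Sigma^{-1}) \mathrm{vec}(S).$$

If we choose $B = W^2 \otimes \hat\Sigma^2$, we have

$$Q' = \mathrm{vec}(S)^T (W^2 \otimes I_K) \mathrm{vec}(S).$$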
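The step from the general form of the score statistic to the two specific statistics follows from the mixed-product property of the Kronecker product, $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$. The short derivation below spells this out for the two choices of the working matrix; it is a worked check of the algebra rather than additional material from the paper.

```latex
% Choice 1: working covariance proportional to the trait correlation
\begin{aligned}
(I_m \otimes \hat\Sigma^{-1})\,(W^2 \otimes \hat\Sigma)\,(I_m \otimes \hat\Sigma^{-1})
  &= (I_m W^2 I_m) \otimes (\hat\Sigma^{-1}\hat\Sigma\,\hat\Sigma^{-1}) \\
  &= W^2 \otimes \hat\Sigma^{-1},
\end{aligned}
% which gives Q = vec(S)^T (W^2 \otimes \hat\Sigma^{-1}) vec(S).

% Choice 2: working covariance proportional to the squared trait correlation
\begin{aligned}
(I_m \otimes \hat\Sigma^{-1})\,(W^2 \otimes \hat\Sigma^{2})\,(I_m \otimes \hat\Sigma^{-1})
  &= (I_m W^2 I_m) \otimes (\hat\Sigma^{-1}\hat\Sigma^{2}\,\hat\Sigma^{-1}) \\
  &= W^2 \otimes I_K,
\end{aligned}
% which gives Q' = vec(S)^T (W^2 \otimes I_K) vec(S).
```

Seen this way, Q and Q′ differ only in the prior dependence assumed among the effects of a single variant across traits.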

Asymptotic p-value computation

Note that

vec(S)=vec(YPGT)=(GPIK)vec(Y).

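Therefore, under the null hypothesis, $\mathrm{vec}(S)$ asymptotically follows a multivariate normal distribution with mean zero and covariance

$$\mathrm{Cov}(\mathrm{vec}(S)) = (GP \otimes I_K)(I_n \otimes \hat\Sigma)(PG^T \otimes I_K) = R \otimes \hat\Sigma, \quad R = G P G^T.$$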
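The covariance identity relies on the mixed-product property of the Kronecker product together with the fact that $P$ is symmetric and idempotent. A minimal numerical check on toy inputs (all sizes and random matrices are assumptions for illustration) could look like this:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, K = 8, 3, 2                                   # toy sizes (assumed)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.integers(0, 3, size=(m, n)).astype(float)   # m x n genotype matrix
M = rng.normal(size=(K, K))
Sigma = M @ M.T + np.eye(K)                         # a positive definite stand-in

P = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Left-hand side: the three-factor product written out in full.
lhs = (np.kron(G @ P, np.eye(K))
       @ np.kron(np.eye(n), Sigma)
       @ np.kron(P @ G.T, np.eye(K)))

# Right-hand side: R kron Sigma with R = G P G^T.
R = G @ P @ G.T
rhs = np.kron(R, Sigma)
```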

We can further check that $\mathrm{Cov}[(W \otimes I_K)\,\mathrm{vec}(S)] = (WRW) \otimes \hat\Sigma$ and $\mathrm{Cov}[(W \otimes \hat\Sigma^{-1/2})\,\mathrm{vec}(S)] = (WRW) \otimes I_K$. Therefore the null distributions of $Q$ and $Q'$ are weighted sums of independent 1-DF chi-square random variables, with weights given by the eigenvalues of $(WRW) \otimes I_K$ and $(WRW) \otimes \hat\Sigma$, respectively. The eigenvalues of a Kronecker product are all pairwise products of the eigenvalues of the individual matrices.
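The eigenvalue property and the resulting tail probability can be sketched as follows. The Monte Carlo routine is a simple stand-in chosen for illustration, not the numerical method used in the paper; analytic algorithms for linear combinations of chi-square variables, such as Davies (1980) in the reference list, serve the same purpose. The function names and toy matrices are assumptions.

```python
import numpy as np

def kron_eigvals(M1, M2):
    """Eigenvalues of M1 kron M2 for symmetric M1, M2: all pairwise products."""
    e1 = np.linalg.eigvalsh(M1)
    e2 = np.linalg.eigvalsh(M2)
    return np.outer(e1, e2).ravel()

def mixture_pvalue_mc(q, lam, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(sum_j lam_j * chi2_1 > q)."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    lam = lam[lam > 1e-12 * lam.max()]      # drop numerically zero weights
    draws = rng.chisquare(1, size=(n_draws, lam.size)) @ lam
    return float(np.mean(draws > q))

rng = np.random.default_rng(3)
m, K = 4, 3
A1 = rng.normal(size=(m, m))
WRW = A1 @ A1.T                             # stand-in for W R W
A2 = rng.normal(size=(K, K))
Sigma_hat = A2 @ A2.T + np.eye(K)           # stand-in for the trait covariance

lam_Q = kron_eigvals(WRW, np.eye(K))        # weights for Q
lam_Qp = kron_eigvals(WRW, Sigma_hat)       # weights for Q'
p_example = mixture_pvalue_mc(q=2.0 * lam_Qp.sum(), lam=lam_Qp)
```

For $Q$, each eigenvalue of $WRW$ simply appears $K$ times, so only an $m \times m$ eigendecomposition is needed. Monte Carlo cannot resolve the very small p-values required for exome-wide significance, which is where analytic methods become necessary.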

References

  1. Avery CL, He Q, North KE, Ambite JL, Boerwinkle E, Fornage M, et al. A phenomics-based strategy identifies loci on APOC1, BRAP, and PLCG1 associated with metabolic syndrome phenotype domains. PLoS Genet. 2011;7(10):e1002322. doi: 10.1371/journal.pgen.1002322.
  2. Chen H, Lumley T, Brody J, Heard-Costa NL, Fox CS, Cupples LA, Dupuis J. Sequence kernel association test for survival traits. Genetic Epidemiology. 2014;38(3):191–197. doi: 10.1002/gepi.21791.
  3. Chen W, Wang W, Zhu B, Guo H, Sun Y, Ming J, Shen N, Li Z, Wang Z, Liu L, Cai B, Duan J, Li J, Liu C, Zhong R, Hu W, Huang T, Miao X. A functional variant rs1820453 in YAP1 and breast cancer risk in Chinese population. PLoS ONE. 2013;8(11):e79056. doi: 10.1371/journal.pone.0079056.
  4. Davies RB. Algorithm AS 155: the distribution of a linear combination of χ2 random variables. Applied Statistics. 1980;29(3):323–333.
  5. Dixit D, Ghildiyal R, Anto NP, Sen E. Chaetocin-induced ROS-mediated apoptosis involves ATM–YAP1 axis and JNK–dependent inhibition of glucose metabolism. Cell Death & Disease. 2014;5(5):e1212. doi: 10.1038/cddis.2014.179.
  6. Ferreira MAR, Purcell SM. A multivariate test of association. Bioinformatics. 2009;25(1):132–133. doi: 10.1093/bioinformatics/btn563.
  7. Grove ML, Yu B, Cochran BJ, Haritunians T, Bis JC, Taylor KD, et al. Best practices and joint calling of the HumanExome BeadChip: the CHARGE consortium. PLoS ONE. 2013;8(7):e68095. doi: 10.1371/journal.pone.0068095.
  8. He Q, Avery CL, Lin DY. A general framework for association tests with multivariate traits in large-scale genomics studies. Genetic Epidemiology. 2013;37(8):759–767. doi: 10.1002/gepi.21759.
  9. Hua WY, Ghosh D. Equivalence of kernel machine regression and kernel distance covariance for multidimensional phenotype association studies. Biometrics. 2015. doi: 10.1111/biom.12314.
  10. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999;86(4):929–935.
  11. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–775. doi: 10.1093/biostatistics/kxs014.
  12. Li T, Zhao H, Zhao X, Zhang B, Cui L, Shi Y, Li G, Wang P, Chen ZJ. Identification of YAP1 as a novel susceptibility gene for polycystic ovary syndrome. Journal of Medical Genetics. 2012;49(4):254–257. doi: 10.1136/jmedgenet-2011-100727.
  13. Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. The American Journal of Human Genetics. 2011;89(3):354–367. doi: 10.1016/j.ajhg.2011.07.015.
  14. Liu H, Tang Y, Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Computational Statistics & Data Analysis. 2009a;53(4):853–856.
  15. Liu J, Pei Y, Papasian CJ, Deng HW. Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations. Genetic Epidemiology. 2009b;33(3):217–227. doi: 10.1002/gepi.20372.
  16. Louwers YV, Stolk L, Uitterlinden AG, Laven JSE. Cross-ethnic meta-analysis of genetic variants for polycystic ovary syndrome. The Journal of Clinical Endocrinology and Metabolism. 2013;98(12):E2006–2012. doi: 10.1210/jc.2013-2495.
  17. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384. doi: 10.1371/journal.pgen.1000384.
  18. Maity A, Sullivan PF, Tzeng JY. Multivariate phenotype association analysis by marker-set kernel machine regression. Genetic Epidemiology. 2012;36(7):686–695. doi: 10.1002/gepi.21663.
  19. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494.
  20. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research. 2007;615(1–2):28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
  21. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology. 2010;34(2):188–193. doi: 10.1002/gepi.20450.
  22. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7(3):e1001322. doi: 10.1371/journal.pgen.1001322.
  23. O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FCF, Elliott P, Jarvelin MR, Coin LJM. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS ONE. 2012;7(5):e34861. doi: 10.1371/journal.pone.0034861.
  24. Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. The American Journal of Human Genetics. 2010;86(6):832–838. doi: 10.1016/j.ajhg.2010.04.005.
  25. Rasmussen-Torvik LJ, Alonso A, Li M, Kao W, Köttgen A, Yan Y, Couper D, Boerwinkle E, Bielinski SJ, Pankow JS. Impact of repeated measures and sample selection on genome-wide association studies of fasting glucose. Genetic Epidemiology. 2010;34(7):665–673. doi: 10.1002/gepi.20525.
  26. Schifano E, Li L, Christiani D, Lin X. Genome-wide association analysis for multiple continuous secondary phenotypes. The American Journal of Human Genetics. 2013;92(5):744–759. doi: 10.1016/j.ajhg.2013.04.004.
  27. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Research. 2005;15(11):1576–1583. doi: 10.1101/gr.3709305.
  28. Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics. 2013;14(7):483–495. doi: 10.1038/nrg3461.
  29. Stephens M. A unified framework for association analysis with multiple related phenotypes. PLoS ONE. 2013;8(7):e65245. doi: 10.1371/journal.pone.0065245.
  30. The ARIC Investigators. The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. American Journal of Epidemiology. 1989;129(4):687–702.
  31. van der Sluis S, Posthuma D, Dolan CV. TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet. 2013;9(1):e1003235. doi: 10.1371/journal.pgen.1003235.
  32. Wu C, Xu B, Yuan P, Miao X, Liu Y, Guan Y, Yu D, Xu J, Zhang T, Shen H, Wu T, Lin D. Genome-wide interrogation identifies YAP1 variants associated with survival of small-cell lung cancer patients. Cancer Research. 2010;70(23):9721–9729. doi: 10.1158/0008-5472.CAN-10-1493.
  33. Wu B, Pankow JS. Statistical methods for association tests of multiple continuous traits in genome-wide association studies. Annals of Human Genetics. 2015;79(4):282–293. doi: 10.1111/ahg.12110.
  34. Wu B, Pankow JS, Guan W. Sequence kernel association analysis of rare variant set based on the marginal regression model for binary traits. Genetic Epidemiology. 2015;39(6):399–405. doi: 10.1002/gepi.21913.
  35. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful SNP-set analysis for case-control genome-wide association studies. The American Journal of Human Genetics. 2010;86(6):929–942. doi: 10.1016/j.ajhg.2010.05.002.
  36. Wu M, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
  37. Yang Q, Wu H, Guo CY, Fox CS. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genetic Epidemiology. 2010;34(5):444–454. doi: 10.1002/gepi.20497.
