Author manuscript; available in PMC: 2016 Jul 12.
Published in final edited form as: Hum Hered. 2016 Apr 29;80(3):126–138. doi: 10.1159/000445057

Rare-Variant Kernel Machine Test for Longitudinal Data from Population and Family Samples

Qi Yan 1, Daniel E Weeks 2, Hemant K Tiwari 3, Nengjun Yi 3, Kui Zhang 4, Guimin Gao 5, Wan-Yu Lin 6, Xiang-Yang Lou 7, Wei Chen 1,2,*, Nianjun Liu 3,*
PMCID: PMC4940283  NIHMSID: NIHMS768751  PMID: 27161037

Abstract

Objective

The kernel machine (KM) test reportedly performs well in the set-based association test of rare variants. Many studies have been conducted to measure phenotypes at multiple time points, but the standard KM methodology has only been available for phenotypes at a single time point. In addition, family-based designs have been widely used in genetic association studies; therefore, the data analysis method used must appropriately handle familial relatedness. A rare variant test does not currently exist for longitudinal data from family samples. Therefore, in this paper, we aim to introduce an association test for rare variants, which includes multiple longitudinal phenotype measurements for either population or family samples.

Methods

This approach uses KM regression based on the linear mixed model framework and is applicable to longitudinal data from either population (L-KM) or family samples (LF-KM).

Results

In our population-based simulation studies, L-KM has good control of Type I error rate and increased power in all the scenarios we considered, compared with other competing methods. Conversely, in the family-based simulation studies, we found an inflated Type I error rate when L-KM was applied directly to the family samples, whereas LF-KM retained the desired Type I error rate and had the best power performance overall. Finally, we illustrate the utility of our proposed LF-KM approach by analyzing data from an association study between rare variants and blood pressure from the Genetic Analysis Workshop 18 (GAW18).

Conclusion

We propose a method for rare-variant association testing in population and family samples, using phenotypes measured at multiple time points for each subject. The proposed method has the best power performance compared to competing approaches in our simulation study.

Keywords: rare variant, longitudinal data, linear mixed model, linear kernel function, family samples

Introduction

Genome-wide association studies (GWASs) have been widely used to identify common genetic variants that are associated with complex human diseases [1–5]. For a typical GWAS, hundreds or thousands of subjects are recruited and genotyped at hundreds of thousands of genetic variants (e.g., single nucleotide polymorphisms [SNPs]). An association between the phenotype and each of the SNPs is usually tested sequentially via a single-marker test. However, the traditional single-marker association test is not powerful enough to detect rare variants that may play key roles in influencing complex diseases [6,7]. To increase the statistical power, many set-based methods [7–17] have been developed to evaluate the joint effects of a group of genetic variants in a predefined genomic region on the phenotypes of interest. Among the existing methods, one powerful, flexible, and computationally efficient test is the sequence kernel association test (SKAT) [18,19]. In this kernel machine (KM) approach, weights are assigned to each marker to improve the power. In addition, KM can easily include other covariates in the model. Both linear and nonlinear kernels may be used for the genotype-phenotype relationship. Furthermore, the KM test statistic follows a mixture of chi-square distributions. Thus, p-values can be computed analytically and quickly without employing resampling.

In many genetic studies, phenotypes are measured at multiple time points for each subject [20–23]. Quite a few methods [24–27] have been developed for longitudinal genetic data analysis, but very few of these methods are for rare variants [24]. We believe that a method in which all time points are accounted for jointly in an association test would likely improve the power. In addition, family-based designs have been widely used to study the association between diseases and genetic factors [28–33]. In GWASs that employ population samples, the association between quantitative phenotypes and genetic markers is usually investigated by applying a general linear model; however, statistics based on general linear models result in inflated Type I error rates when applied to family data directly [34–36] because they ignore familial correlations. Instead of a general linear model, a linear mixed model including a random polygenic effect can be used to handle the familial correlation. The covariance of random polygenic effects in all subjects can be expressed by a kinship matrix. This linear mixed model with a kinship matrix has typically been used when dealing with family samples in GWASs [37]. SKAT was recently extended to family samples by including the kinship matrix [38–42].

In this study, we propose a method for rare-variant association testing in population and family samples, using phenotypes measured at multiple time points for each subject. In our population-based simulation studies, we found that when the longitudinal kernel machine test (L-KM) considered all time points, it was more powerful than approaches that only considered some of the time points, although all the methods we tested had correct Type I error rates. Conversely, in our family-based simulation studies, we found an inflated Type I error rate when L-KM was directly applied to the family sample, whereas the longitudinal family kernel machine test (LF-KM) retained the desired Type I error rate and had the best power overall compared with the methods that only considered some of the time points. Another study reported similar findings [24]. Finally, we assessed the association between rare genetic variants and blood pressure from the Genetic Analysis Workshop 18 (GAW18) data to demonstrate our approach. These data present challenges for analysis, including large pedigrees, repeated measures, and whole-genome sequence data [43–45].

Methods

KM Regression in a Linear Mixed Model Framework

We set the stage for our subsequent derivations by defining, for a trait measured at a single time point, a linear mixed model setup similar to that of Chen et al. [38]. Let there be n subjects with q genetic variants. The n × 1 vector of the quantitative trait y follows a linear mixed model:

y=Xβ+Gγ+u+ε,

where X is an n × p covariate matrix, β is a p × 1 vector containing parameters for the fixed effects (an intercept and p − 1 covariates), G is an n × q genotype matrix for the q genetic variants of interest where an additive genetic model is assumed (i.e., coded as 0, 1, or 2 representing the copies of minor alleles) for illustration, γ is a q × 1 vector for the random effects of the q genetic variants, u is an n × 1 vector for the random effects due to covariates (e.g., time for longitudinal data or relatedness in families), and ε is an n × 1 vector for the random error. The random effect $\gamma_j$ for variant j is assumed to be normally distributed with mean zero and variance $\tau w_j$; thus, the null hypothesis H0: γ = 0 is equivalent to H0: τ = 0, which can be tested with a variance component score test [19] in the mixed model. The random variables ε and u are also assumed to be normally distributed, and are uncorrelated with each other and also with γ:

$$\gamma \sim N(0, \tau W)$$
$$u \sim N(0, K)$$
$$\varepsilon \sim N(0, \sigma_E^2 I),$$

where W is a predefined q × q diagonal weight matrix for the variants and may use $w_j = \mathrm{Beta}(\mathrm{MAF}_j; 1, 25)$ as in SKAT, K is an n × n covariance matrix, and $\sigma_E^2$ is the error variance.
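As a small illustration of the weighting scheme just described, the sketch below evaluates the Beta(1, 25) density at each variant's MAF and uses it as the diagonal of W, following the formula as written in the text. The function names `beta_density` and `variant_weights` are hypothetical and not part of any package.

```python
import math

def beta_density(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def variant_weights(mafs, a=1.0, b=25.0):
    """Diagonal entries of W, one per variant (hypothetical helper)."""
    return [beta_density(maf, a, b) for maf in mafs]

# Illustrative MAFs only: rarer variants receive larger weights.
weights = variant_weights([0.001, 0.01, 0.04])
```

With a = 1 and b = 25 the density reduces to 25(1 − MAF)^24, so the weight decreases monotonically as the MAF grows.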

Following the same rationale as in the derivation of the SKAT score statistic [4648] (refer to the Appendix for a detailed derivation), the test statistic is:

$$Q = (y - X\hat{\beta})' \hat{\Sigma}^{-1} G W G' \hat{\Sigma}^{-1} (y - X\hat{\beta}),$$

where $\hat{\beta}$ is the vector of estimated fixed effects of covariates under H0, and $\hat{\Sigma} = \hat{K} + \hat{\sigma}_E^2 I$ is the estimated variance-covariance matrix under H0. The statistic Q is a quadratic form of $(y - X\hat{\beta})$, and therefore follows a mixture of chi-square distributions [49], although some of the parameters are estimated [50]. Thus, the p-values can be calculated by numerical algorithms, such as moment-matching methods [47,51], Davies’ method [52], and Kuonen’s saddlepoint method [53]. In this study, we used Davies’ method.
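The statistic can be sketched numerically as follows. This is a minimal illustration with NumPy, assuming Σ is already estimated and treated as known; it is not the authors' implementation. The function name `km_score_statistic` is hypothetical. The returned eigenvalues are those of $W^{1/2} G' P G W^{1/2}$ with $P = \Sigma^{-1} - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}$, the usual mixing weights for a quadratic form of this kind, which a routine such as Davies' method would take as input.

```python
import numpy as np

def km_score_statistic(y, X, G, w, Sigma):
    """Return Q and the chi-square mixture weights under H0 (sketch)."""
    Sinv_X = np.linalg.solve(Sigma, X)
    Sinv_y = np.linalg.solve(Sigma, y)
    # Generalized least squares estimate of the fixed effects under H0.
    beta_hat = np.linalg.solve(X.T @ Sinv_X, X.T @ Sinv_y)
    resid = y - X @ beta_hat
    # Score vector G' Sigma^-1 (y - X beta_hat), one entry per variant.
    score = G.T @ np.linalg.solve(Sigma, resid)
    Q = float(score @ (w * score))          # equals score' W score
    # Projection matrix P; Sinv_X.T equals X' Sigma^-1 because Sigma is symmetric.
    Sinv = np.linalg.inv(Sigma)
    P = Sinv - Sinv_X @ np.linalg.solve(X.T @ Sinv_X, Sinv_X.T)
    half = G * np.sqrt(w)                   # G W^{1/2}, scales each column
    lam = np.linalg.eigvalsh(half.T @ P @ half)
    return Q, lam
```

Because Σ and β̂ do not depend on the genotypes, only G and w change from one variant set to the next.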

L-KM Regression for Quantitative Traits in Population Data

We now extend the above model by incorporating intercept and time as both fixed and random effects. Different covariance structures for longitudinal data, such as compound symmetry, autoregressive, and Toeplitz, can be easily implemented under this framework.

Under the null hypothesis, H0: τ = 0, for the i-th subject at time point j, the random intercept and time model is:

$$y_{ij} = \beta_0 + t_{ij}\beta_1 + b_{0i} + t_{ij} b_{1i} + \varepsilon_{ij},$$

where $t_{ij}$ indicates time. $\beta_0$ and $\beta_1$ are the fixed effects of intercept and time, while $b_{0i}$ and $b_{1i}$ are the random effects of intercept and time for the i-th subject. For one subject, the model can be rewritten as

$$y_i = X_i\beta + Z_i b_i + \varepsilon_i.$$

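We assume that there are m time points. Thus, $y_i = (y_{i1}, y_{i2}, \ldots, y_{im})'$ is an m × 1 vector, $X_i$ is an m × 2 matrix for intercept and time, $\beta = (\beta_0, \beta_1)'$, and $b_i = (b_{0i}, b_{1i})'$. For simplicity, we did not include other covariates (which can be easily included) in the model; therefore, $Z_i$ is the same as $X_i$, and

$$\mathrm{Var}(b_i) = \begin{pmatrix} \sigma_{\mathrm{int}}^2 & \sigma_{\mathrm{cov}} \\ \sigma_{\mathrm{cov}} & \sigma_{\mathrm{time}}^2 \end{pmatrix}$$
$$\mathrm{Var}(y_i) = Z_i \mathrm{Var}(b_i) Z_i' + \sigma_E^2 I_{m \times m}.$$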

For the whole data set, the variance term is:

$$\mathrm{Var}(y) = I \otimes Z_i \mathrm{Var}(b_i) Z_i' + \sigma_E^2 I = \Sigma,$$

where y is an n·m × 1 vector, and ⊗ is the Kronecker product, used here to produce a block-diagonal matrix. The variance terms $\sigma_{\mathrm{int}}^2$, $\sigma_{\mathrm{time}}^2$, $\sigma_{\mathrm{cov}}$, and $\sigma_E^2$ can be estimated from the data (e.g., using the R package nlme [54]), and then the L-KM test statistic Q can be constructed in the same way as in the above section.
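To make the block-diagonal structure concrete, here is a minimal NumPy sketch that assembles Σ for a balanced design in which every subject is measured at the same time points. The function name `population_covariance` and the numeric values are illustrative assumptions; in practice the variance components would be estimated from the data.

```python
import numpy as np

def population_covariance(n_subjects, times, D, sigma_e2):
    """Sigma = I_n (x) (Z D Z') + sigma_e2 * I for a balanced design."""
    times = np.asarray(times, dtype=float)
    Z = np.column_stack([np.ones_like(times), times])  # m x 2: intercept, time
    block = Z @ D @ Z.T                                 # m x m block for one subject
    Sigma = np.kron(np.eye(n_subjects), block)         # block-diagonal across subjects
    return Sigma + sigma_e2 * np.eye(n_subjects * len(times))

# Illustrative (hypothetical) variance components, not estimates from data.
D = np.array([[1.0, -0.5], [-0.5, 1.0]])
Sigma = population_covariance(n_subjects=3, times=[0, 1, 2, 3, 4], D=D, sigma_e2=1.0)
```

Observations from different subjects fall in different blocks, so their covariance is zero; only repeated measures on the same subject are correlated.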

LF-KM Regression for Quantitative Traits of Family Data

For pedigree data, familial correlation can be added to the model as an additional random variable. Under the null hypothesis, H0: τ = 0, for the i-th subject in the k-th family at time point j, the random intercept and time model becomes:

$$y_{ijk} = \beta_0 + t_{ijk}\beta_1 + b_{0ik} + t_{ijk} b_{1ik} + \delta_{ik} + \varepsilon_{ijk},$$

where $\beta_0$ and $\beta_1$ are the fixed effects of intercept and time, while $b_{0ik}$ and $b_{1ik}$ are the random effects of intercept and time. $\delta_{ik}$ is the random effect for familial correlation. For one subject with m time point observations, the model can be rewritten in vector form as:

$$y_{ik} = X_{ik}\beta + Z_{ik} b_{ik} + \delta_{ik} + \varepsilon_{ik}.$$

Again, we assume m time points and no other covariates; thus, yik is an m × 1 vector, Xik and Zik are the same m×2 matrix for intercept and time. For illustration, we consider the model for a trio family:

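$$y_k = X_k\beta + Z_k b_k + \delta_k + \varepsilon_k$$
$$\mathrm{Var}(Z_k b_k) = I_{3 \times 3} \otimes Z_{ik} \mathrm{Var}(b_{ik}) Z_{ik}' = I_{3 \times 3} \otimes Z_{ik} \begin{pmatrix} \sigma_{\mathrm{int}}^2 & \sigma_{\mathrm{cov}} \\ \sigma_{\mathrm{cov}} & \sigma_{\mathrm{time}}^2 \end{pmatrix} Z_{ik}'$$
$$\mathrm{Var}(\delta_k) = \sigma_G^2 \cdot J_k \Phi_k J_k' = \sigma_G^2 \cdot \begin{bmatrix} 1_{m \times 1} & 0_{m \times 1} & 0_{m \times 1} \\ 0_{m \times 1} & 1_{m \times 1} & 0_{m \times 1} \\ 0_{m \times 1} & 0_{m \times 1} & 1_{m \times 1} \end{bmatrix} \Phi_k \begin{bmatrix} 1_{m \times 1} & 0_{m \times 1} & 0_{m \times 1} \\ 0_{m \times 1} & 1_{m \times 1} & 0_{m \times 1} \\ 0_{m \times 1} & 0_{m \times 1} & 1_{m \times 1} \end{bmatrix}'$$
$$\Phi_k = \begin{bmatrix} 1 & 0 & 0.5 \\ 0 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{bmatrix},$$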

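where $y_k$ is a 3m × 1 vector, and $\Phi_k$ is twice the kinship matrix for a trio family. The total variance for the family is:

$$\mathrm{Var}(y_k) = \mathrm{Var}(Z_k b_k) + \mathrm{Var}(\delta_k) + \sigma_E^2 I_{3m \times 3m}.$$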

For the whole data set with multiple families, we assume n individuals from the families. The variance term is:

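$$\mathrm{Var}(y) = I \otimes Z_{ik} \mathrm{Var}(b_{ik}) Z_{ik}' + \sigma_G^2 \cdot J \Phi J' + \sigma_E^2 I = \Sigma$$
$$J = \begin{bmatrix} 1_{m \times 1} & \cdots & 0_{m \times 1} \\ \vdots & \ddots & \vdots \\ 0_{m \times 1} & \cdots & 1_{m \times 1} \end{bmatrix}_{nm \times n},$$

where $\sigma_{\mathrm{int}}^2$, $\sigma_{\mathrm{time}}^2$, $\sigma_{\mathrm{cov}}$, and $\sigma_E^2$ represent the same variance/covariance terms as in the population-based model. $\sigma_G^2$ represents the variance term for the random effects of familial correlation. Φ is twice the n × n kinship matrix obtained from the data. All the variance terms can be estimated (e.g., using the R package pedigreemm [55]), and then the LF-KM test statistic Q can be constructed as above.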
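The family covariance can be assembled in the same spirit. The NumPy sketch below builds J, the longitudinal block-diagonal term, and the familial term for one trio with five time points. The function name `family_covariance` and all numeric values are hypothetical, chosen only to show how the three terms combine.

```python
import numpy as np

def family_covariance(times, D, Phi, sigma_g2, sigma_e2):
    """Sigma = I (x) (Z D Z') + sigma_g2 * J Phi J' + sigma_e2 * I (balanced design)."""
    times = np.asarray(times, dtype=float)
    m, n = len(times), Phi.shape[0]
    Z = np.column_stack([np.ones(m), times])       # m x 2: intercept, time
    J = np.kron(np.eye(n), np.ones((m, 1)))        # nm x n, maps subjects to their rows
    longitudinal = np.kron(np.eye(n), Z @ D @ Z.T)
    familial = sigma_g2 * (J @ Phi @ J.T)
    return longitudinal + familial + sigma_e2 * np.eye(n * m)

# Twice the kinship matrix for a trio ordered as father, mother, child.
Phi_trio = np.array([[1.0, 0.0, 0.5],
                     [0.0, 1.0, 0.5],
                     [0.5, 0.5, 1.0]])
D = np.array([[1.0, -0.5], [-0.5, 1.0]])           # illustrative values
Sigma_trio = family_covariance([0, 1, 2, 3, 4], D, Phi_trio, sigma_g2=1.0, sigma_e2=1.0)
```

The familial term is constant across time points: any observation on the father and any observation on the child share covariance $0.5\,\sigma_G^2$, while father and mother remain uncorrelated.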

Population-based Simulation Study

Simulation of sample genotypes

We simulated 1,000 unrelated samples based on a matrix of 10,000 haplotypes over a 200-kb region generated by the calibrated coalescent model [56], mimicking the European ancestry linkage disequilibrium (LD) structure. Only rare variants (minor allele frequency [MAF] < 0.05) were kept, and 2,000 haplotypes were randomly selected to form the unrelated subjects’ haplotypes. Then, 30 neighboring SNPs with at least one copy of the minor allele (i.e., excluding non-polymorphic variants) were used in the analysis. We simulated 100 genotype datasets.

Type I error rate

For each of the 100 genotype data sets generated, we simulated 1,000 sets of quantitative longitudinal phenotypes based on five time points. The phenotypes for subject i were generated via the model:

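$$y_i = 0.05 \cdot X_{1i} + 0.5 \cdot X_{2i} + 0.1 \cdot t_i + v_i,$$

where $X_{1i}$ is a continuous covariate generated from a normal distribution with a mean of 50 and a standard deviation of 5 (this single value was repeated five times to mimic the five time points); $X_{2i}$ is a dichotomous covariate generated from a Bernoulli distribution with a probability of 0.5, which was also repeated five times; $t_i$ is the time point, assuming values of 0, 1, 2, 3, or 4; $v_i$ is random error that follows a multivariate normal distribution with a mean of 0 and covariance matrix $\mathrm{Var}(y_i)$:

$$\mathrm{Var}(y_i) = Z_i \begin{pmatrix} \sigma_{\mathrm{int}}^2 & \sigma_{\mathrm{cov}} \\ \sigma_{\mathrm{cov}} & \sigma_{\mathrm{time}}^2 \end{pmatrix} Z_i' + \sigma_E^2 I_{5 \times 5} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{pmatrix} + I_{5 \times 5},$$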

where the variances of the random effects of intercept and time as well as the random error were set to 1, and the covariance between the intercept and time was set to −0.5, which assumes the rate of change is slow when the baseline value is large (e.g., it is less likely that a patient with high blood pressure at baseline will quickly reach an even higher blood pressure). The phenotypes of all of the individuals were generated in the same manner, and the 1,000 sets of simulated phenotypes for each of the 100 genotype data sets were used to evaluate the Type I error rate.
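One way to draw phenotypes from this null model is to simulate the random intercept and slope directly, which yields the same covariance as sampling $v_i$ from the multivariate normal distribution. The sketch below uses only the Python standard library; the function names are hypothetical and the construction is an illustration under the stated parameter values, not the authors' simulation code.

```python
import math
import random

TIMES = [0, 1, 2, 3, 4]

def simulate_random_part(rng, var_int=1.0, var_time=1.0, cov=-0.5, var_err=1.0):
    """Draw v_i = b0 + t * b1 + e over the five time points."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0 = math.sqrt(var_int) * z1
    # Cholesky-style construction so that Cov(b0, b1) = cov and Var(b1) = var_time.
    b1 = (cov / math.sqrt(var_int)) * z1 + math.sqrt(var_time - cov ** 2 / var_int) * z2
    return [b0 + t * b1 + rng.gauss(0, math.sqrt(var_err)) for t in TIMES]

def simulate_null_phenotype(rng):
    """One subject's five phenotype values under the null model."""
    x1 = rng.gauss(50, 5)                 # continuous covariate, constant over time
    x2 = 1 if rng.random() < 0.5 else 0   # dichotomous covariate, constant over time
    v = simulate_random_part(rng)
    return [0.05 * x1 + 0.5 * x2 + 0.1 * t + v_t for t, v_t in zip(TIMES, v)]

rng = random.Random(2016)
y_i = simulate_null_phenotype(rng)
```

Under these settings the implied variance of $v_i$ is 2 at time 0 and 14 at time 4, matching the diagonal of the covariance matrix above.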

Using these unrelated population samples, we compared L-KM using all five time points (abbreviated as L-KM-m5) with four other approaches: (1) L-KM using a subset of three time points (L-KM-m3), (2) L-KM with 20% of the observations randomly assigned to missing (L-KM-missing), (3) KM on the averaged phenotype over five time points (avg-KM), and (4) KM on the phenotype at the last time point (last-KM).

Power evaluation

We generated the same genotypes as described above. The quantitative phenotypes for one subject were generated via the following model:

$$y_i = 0.05 \cdot X_{1i} + 0.5 \cdot X_{2i} + 0.1 \cdot t_i + \sum_{j=1}^{q} \beta_j G_j + v_i,$$

where $X_{1i}$, $X_{2i}$, $t_i$, and $v_i$ were set up the same way as described above. $G_1, G_2, \ldots, G_q$ are the genotypes of causal SNPs, and $\beta_1, \beta_2, \ldots, \beta_q$ are coefficients of the causal SNPs. We assumed that 30% of all variants are disease susceptibility variants. Furthermore, each $\beta_j$ was set as $c\,|\log_{10} \mathrm{MAF}_j|$ in order to assign larger effects to rarer variants, where c = 0.4 was chosen, such that when MAF = 0.0001, β = 1.6, following the literature [19]. Because causal variants might not influence the phenotype in a consistent direction, we also assumed that one-third of the causal variants were protective (i.e., 20% risk variants and 10% protective variants), with $\beta_j = -c\,|\log_{10} \mathrm{MAF}_j|$. The phenotypes for all of the individuals were generated in the same manner, and the 1,000 sets of simulated phenotypes for each of the 100 genotype data sets were used to evaluate power. We compared L-KM-m5 with L-KM-m3, L-KM-missing, avg-KM, and last-KM.
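The effect-size rule is simple enough to check by hand; the short sketch below encodes it with a hypothetical helper `causal_effect`, using c = 0.4 as in the text.

```python
import math

def causal_effect(maf, c=0.4, protective=False):
    """Effect size c * |log10(MAF)|, negated for protective variants."""
    beta = c * abs(math.log10(maf))
    return -beta if protective else beta

# Illustrative MAFs: the rarer the variant, the larger the effect.
effects = [causal_effect(maf) for maf in (0.0001, 0.001, 0.01)]
```

For MAF = 0.0001 this gives 0.4 × 4 = 1.6, the value quoted in the text, and the effect shrinks to 0.8 at MAF = 0.01.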

Family-based Simulation Study

Simulation of sample genotypes

To simulate family data, we used the aforementioned pool of 10,000 haplotypes over a 200-kb region and family structures as shown in Figure 1. First, haplotypes were randomly selected for all founders. The offspring haplotypes were generated by randomly transmitting one of the two haplotypes of each parent to the child. For the scenario of trio families (Figure 1A), we generated 300 families with a father, a mother, and one offspring in each family. For the three-generation scenario, we generated 100 families (Figure 1B). Furthermore, 30 neighboring polymorphic SNPs were used in the analysis, and we simulated 100 genotype datasets for each of the two scenarios.
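The transmission step can be sketched as follows for a single trio. This is a simplified illustration: founders are drawn from the pool with replacement and no recombination is modeled, both of which are assumptions rather than details stated in the text. All function names are hypothetical.

```python
import random

def transmit(parent, rng):
    """Pass on one of the parent's two haplotypes, chosen at random."""
    return parent[rng.randrange(2)]

def simulate_trio(pool, rng):
    """Return (father, mother, child), each a pair of haplotypes."""
    father = (rng.choice(pool), rng.choice(pool))
    mother = (rng.choice(pool), rng.choice(pool))
    child = (transmit(father, rng), transmit(mother, rng))
    return father, mother, child

def genotype(person):
    """Additive coding: minor-allele count (0, 1, or 2) at each site."""
    h1, h2 = person
    return [a + b for a, b in zip(h1, h2)]

# Toy haplotype pool over four sites (1 = minor allele); values are made up.
pool = [(0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 0), (1, 0, 0, 0)]
rng = random.Random(7)
father, mother, child = simulate_trio(pool, rng)
```

A three-generation pedigree follows the same pattern, applying `transmit` generation by generation so that parents are simulated before their offspring.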

Figure 1. Two pedigree structures used in the simulation studies.

Type I error rate

To evaluate the Type I error rates, we simulated 1,000 sets of phenotypes for each of the 100 genotype datasets. The quantitative phenotypes for one trio family were generated via the following model:

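$$y_k = 0.05 \cdot X_{1k} + 0.5 \cdot X_{2k} + 0.1 \cdot t_k + \delta_k + v_k,$$

where $X_{1k}$ and $X_{2k}$ are the same as described above. Both $X_{1k}$ and $X_{2k}$ repeat five times for five time points for one subject, and each family includes three subjects; time $t_k$ is a sequence of 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4; $v_k$ follows a multivariate normal distribution with means of 0 and a covariance matrix, $\mathrm{Var}(y_k)$:

$$
\begin{aligned}
\mathrm{Var}(y_k) &= I_{3 \times 3} \otimes Z_{ik} \begin{pmatrix} \sigma_{\mathrm{int}}^2 & \sigma_{\mathrm{cov}} \\ \sigma_{\mathrm{cov}} & \sigma_{\mathrm{time}}^2 \end{pmatrix} Z_{ik}' + \sigma_G^2 \cdot J_k \Phi_k J_k' + \sigma_E^2 I_{15 \times 15} \\
&= I_{3 \times 3} \otimes \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{pmatrix} \\
&\quad + \begin{bmatrix} 1_{5 \times 1} & 0_{5 \times 1} & 0_{5 \times 1} \\ 0_{5 \times 1} & 1_{5 \times 1} & 0_{5 \times 1} \\ 0_{5 \times 1} & 0_{5 \times 1} & 1_{5 \times 1} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0.5 \\ 0 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{bmatrix} \begin{bmatrix} 1_{5 \times 1} & 0_{5 \times 1} & 0_{5 \times 1} \\ 0_{5 \times 1} & 1_{5 \times 1} & 0_{5 \times 1} \\ 0_{5 \times 1} & 0_{5 \times 1} & 1_{5 \times 1} \end{bmatrix}' + I_{15 \times 15},
\end{aligned}
$$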

where the variance of the random effects of intercept, time, genetic variance, and error were set to 1, and the covariance between the intercept and time was set to −0.5. The three-generation family scenario was set up in an analogous way, but with a more complicated kinship matrix. In this work, the kinship coefficients were directly obtained from the known pedigree structures. If genome-wide genotype data are available, it may be more advantageous to use genetic markers to estimate the kinship coefficients [57–62]. Actually, with enough genetic markers to estimate kinship coefficients, the proposed method is a unified approach that allows for any relationship in the data, even cryptic relatedness.
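For a known pedigree, kinship coefficients can be computed with the standard recursion over parents. The sketch below is one minimal way to do this, assuming individuals are listed with parents before their children and that founders are unrelated and non-inbred; the function name `kinship_matrix` and the input layout are hypothetical.

```python
def kinship_matrix(pedigree):
    """pedigree: list of (id, father_id, mother_id); founders use None for parents.

    Individuals must be ordered so that parents appear before their children.
    """
    index = {person[0]: k for k, person in enumerate(pedigree)}
    n = len(pedigree)
    phi = [[0.0] * n for _ in range(n)]
    for i, (_, father, mother) in enumerate(pedigree):
        founder = father is None or mother is None
        if founder:
            phi[i][i] = 0.5
        else:
            f, m = index[father], index[mother]
            # Self-kinship grows with the kinship between the two parents.
            phi[i][i] = 0.5 * (1.0 + phi[f][m])
        for j in range(i):
            # Kinship with an earlier individual is the average over i's parents.
            value = 0.0 if founder else 0.5 * (phi[f][j] + phi[m][j])
            phi[i][j] = phi[j][i] = value
    return phi

trio = [("father", None, None), ("mother", None, None), ("child", "father", "mother")]
twice_kinship = [[2 * value for value in row] for row in kinship_matrix(trio)]
```

For the trio, doubling the result reproduces the matrix $\Phi_k$ given earlier, with 1 on the diagonal, 0 between the parents, and 0.5 between each parent and the child.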

We compared the Type I error rates of LF-KM using all five time points (abbreviated as LF-KM-m5) to the Type I error rates of five other approaches: (1) LF-KM using a subset of three time points (LF-KM-m3), (2) LF-KM with 20% of the observations randomly assigned to missing (LF-KM-missing), (3) family KM on the averaged phenotype over five time points (avg-F-KM), (4) family KM on the last time point phenotype (last-F-KM), and (5) L-KM-m5.

Power evaluation

For the causal gene sets, we used the same genotypes as described in the evaluation of the Type I error rates. The quantitative phenotypes for one family were generated via the model:

$$y_k = 0.05 \cdot X_{1k} + 0.5 \cdot X_{2k} + 0.1 \cdot t_k + \sum_{j=1}^{q} \beta_j G_j + v_k,$$

where $X_{1k}$, $X_{2k}$, $t_k$, and $v_k$ were set up the same way as for the Type I error evaluation. $G_1, G_2, \ldots, G_q$ are the genotypes of the causal SNPs. We assumed that 30% of all variants were disease susceptibility variants, and each $\beta_j$ was set to $c\,|\log_{10} \mathrm{MAF}_j|$, where c = 0.4 [19]. We also considered a situation in which one-third of the causal variants were protective (i.e., 20% disease variants and 10% protective variants), where we set $\beta_j = -c\,|\log_{10} \mathrm{MAF}_j|$ for the protective variants. Again, the phenotypes for all families were simulated in the same manner, and these 1,000 sets of phenotypes for each of the 100 genotype data sets were used to evaluate the statistical power. We compared LF-KM-m5 with LF-KM-m3, LF-KM-missing, avg-F-KM, and last-F-KM.

GAW18 Data

To demonstrate the utility of our proposed method for real data, we analyzed data from GAW18, which consisted of whole genome sequence data in a pedigree-based sample with longitudinal phenotypes for blood pressure and hypertension. Although 200 replicates of simulated longitudinal data are available, we only utilized the real phenotypes. The GAW18 data are drawn from T2D-GENES Project 2, and these families were obtained from two studies: the San Antonio Family Heart Study (SAFHS) [63] and the San Antonio Family Diabetes/Gallbladder Study (SAFDGS) [64]. Only odd-numbered autosomes are available. The sequence dataset contains dosage (minor allele counts); a Merlin-based procedure was employed to impute missing dosages [65]. The dosage values can assume any decimal number from 0 to 2, instead of 0, 1, and 2 as in the additive model. Only rare variants (MAF < 0.05) were used in the analysis, and weights similar to those employed by Wu et al. [19] were used. We assigned SNPs to a gene if they were located within 5 kb of the gene on either side. In total, 11,096 genes were used in the analysis.
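The gene assignment rule amounts to a window lookup. A minimal sketch is shown below; the function name `snps_in_gene` and the coordinates are made up for illustration, and positions are assumed to be on the same chromosome as the gene.

```python
def snps_in_gene(snp_positions, gene_start, gene_stop, flank=5000):
    """Return SNP positions inside the gene or within `flank` bp on either side."""
    lower, upper = gene_start - flank, gene_stop + flank
    return [pos for pos in snp_positions if lower <= pos <= upper]

# Hypothetical coordinates: a gene spanning 15,000-25,000 bp.
positions = [9000, 10000, 18000, 25000, 30001]
assigned = snps_in_gene(positions, gene_start=15000, gene_stop=25000)
```

Because windows of neighboring genes can overlap, a SNP may be assigned to more than one gene under this rule.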

We evaluated the association of genetic variants with diastolic and systolic blood pressure (DBP and SBP, respectively), which are considered heritable traits [66]. In total, 855 subjects from 20 families were used in the analysis, and each subject had up to 4 exams (sample size at Exam 1: 855, Exam 2: 605, Exam 3: 622, and Exam 4: 233). We applied LF-KM to test the association between DBP and SBP and each of the genes, with adjustment for age and sex. Some subjects had fewer than four observations, but this linear mixed model-based approach was flexible enough to accommodate missing observations. Under the assumption of missing at random, the incomplete data can be directly analyzed by a linear mixed model [67]. Even though subjects may have missing observations at different time points, the covariance matrix for each subject is constructed from the available data at the observed time points. The linear mixed model only needs to be fitted once under the null hypothesis for one set of genome-wide sequencing data. The covariate estimates β̂ and variance-covariance matrix Σ̂ can then be saved and reused for constructing LF-KM statistics with the genotypes of each gene; therefore, the overall computation time is still fast even for genome-wide data.

Results

Simulation of the Type I Error Rate

When applied to population samples, all of the methods used (i.e., L-KM-m5, L-KM-m3, L-KM-missing, avg-KM, and last-KM) had empirical Type I error rates close to the nominal level (Table 1, Figure 2). When the L-KM-m5 statistic was applied to family data, the Type I error rate was inflated (Table 2). In contrast, LF-KM-m5, LF-KM-m3, LF-KM-missing, avg-F-KM, and last-F-KM retained the desired Type I error rates. Similar patterns can be observed in the QQ plots shown in Figure 3, indicating that LF-KM-m5, LF-KM-m3, LF-KM-missing, avg-F-KM, and last-F-KM control Type I error well, whereas L-KM-m5’s Type I error rate is more severely inflated in the data with three-generation families than that with the trio families.
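To judge how close an empirical rate is to the nominal level, it helps to know the Monte Carlo standard error of the estimate. The sketch below computes the empirical rate from a list of p-values and the binomial standard error for a given number of tests; with 100 genotype data sets and 1,000 phenotype sets each, there are 100,000 tests per method. The function names are hypothetical, and this check is an illustration rather than part of the original analysis.

```python
import math

def empirical_type1_rate(p_values, alpha):
    """Fraction of null p-values falling below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

def monte_carlo_se(alpha, n_tests):
    """Binomial standard error of an empirical rate when the true rate is alpha."""
    return math.sqrt(alpha * (1 - alpha) / n_tests)

n_tests = 100 * 1000                      # genotype data sets x phenotype sets
se_05 = monte_carlo_se(0.05, n_tests)     # roughly 0.0007
```

At α = 0.05 the standard error is about 0.0007, so deviations of a few thousandths from the nominal level are within the range expected from simulation noise, whereas rates of 0.10 or more are not.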

Table 1. Type I error rates of the longitudinal kernel machine (KM) using five time points (L-KM-m5), longitudinal KM using three time points (L-KM-m3), L-KM with 20% of the observations randomly assigned to missing (L-KM-missing), KM using the averaged phenotype (avg-KM), and KM using the last time point phenotype (last-KM) from the population-based simulation at significance levels of 0.05, 0.01, 0.005, and 0.001.

| Method | α = 0.05 | α = 0.01 | α = 0.005 | α = 0.001 |
|---|---|---|---|---|
| L-KM-m5 | 0.05004 | 0.00988 | 0.00466 | 0.00081 |
| L-KM-m3 | 0.05040 | 0.00975 | 0.00481 | 0.00091 |
| L-KM-missing | 0.05041 | 0.00954 | 0.00477 | 0.00093 |
| avg-KM | 0.04935 | 0.00941 | 0.00472 | 0.00090 |
| last-KM | 0.04789 | 0.00976 | 0.00496 | 0.00098 |

Figure 2. QQ plot of the p-values for population samples, with a 95% pointwise confidence band (gray area) that was computed under the assumption that the p-values were drawn independently from a uniform [0, 1] distribution.

Table 2. Type I error rates of the longitudinal family KM using five time points (LF-KM-m5), longitudinal family KM using three time points (LF-KM-m3), LF-KM with 20% of the observations randomly assigned to missing (LF-KM-missing), family KM using the averaged phenotype (avg-F-KM), family KM using the last time point phenotype (last-F-KM), and longitudinal KM using five time points (L-KM-m5) from the family-based simulation at significance levels of 0.05, 0.01, 0.005, and 0.001.

| Sample | Method | α = 0.05 | α = 0.01 | α = 0.005 | α = 0.001 |
|---|---|---|---|---|---|
| 300 trios | LF-KM-m5 | 0.04736 | 0.00923 | 0.00465 | 0.00071 |
| 300 trios | LF-KM-m3 | 0.04776 | 0.00893 | 0.00455 | 0.00073 |
| 300 trios | LF-KM-missing | 0.04852 | 0.00904 | 0.00461 | 0.00084 |
| 300 trios | avg-F-KM | 0.04870 | 0.00922 | 0.00439 | 0.00083 |
| 300 trios | last-F-KM | 0.04795 | 0.00932 | 0.00437 | 0.00091 |
| 300 trios | L-KM-m5 | 0.10023 | 0.02528 | 0.01417 | 0.00329 |
| 100 three-generation families | LF-KM-m5 | 0.04903 | 0.00980 | 0.00484 | 0.00099 |
| 100 three-generation families | LF-KM-m3 | 0.04914 | 0.00971 | 0.00480 | 0.00094 |
| 100 three-generation families | LF-KM-missing | 0.04918 | 0.00990 | 0.00512 | 0.00100 |
| 100 three-generation families | avg-F-KM | 0.04958 | 0.00978 | 0.00472 | 0.00089 |
| 100 three-generation families | last-F-KM | 0.04946 | 0.00963 | 0.00487 | 0.00080 |
| 100 three-generation families | L-KM-m5 | 0.16505 | 0.04928 | 0.02885 | 0.00874 |

Figure 3. QQ plots of the p-values for family samples, with a 95% pointwise confidence band (gray area) that was computed under the assumption that the p-values were drawn independently from a uniform [0, 1] distribution. (A) Trio families; (B) Three-generation families.

Statistical Power Comparison

When we compared the power of the statistics on the population samples (Figure 4), the power of L-KM-m5 was consistently higher than that of the other methods. This was expected because L-KM-m5 made full use of the data; in contrast, L-KM-m3, L-KM-missing and last-KM used only a subset of the time points, and avg-KM used the averaged phenotype, thus losing information from the longitudinal observations. Similarly, when evaluated using the family data, LF-KM-m5 outperformed the other methods as expected (Figure 5). Note that L-KM-m5 is not included in Figure 5 because of its inflated Type I error rate with family data.

Figure 4. Power comparison for the population samples (the α level axis uses a log10 scale). (A) A 30% disease variants scenario; (B) 20% disease variants and 10% protective variants scenario.

Figure 5. Power comparison for the family samples (the α level axis uses a log10 scale). (A) A 30% disease variants scenario for trio families; (B) 20% disease variants and 10% protective variants scenario for trio families; (C) A 30% disease variants scenario for three-generation families; (D) 20% disease variants and 10% protective variants scenario for three-generation families.

GAW18 Data Analysis Results

We used the LF-KM statistic to analyze the GAW18 data for an association between longitudinal DBP and SBP and 11,096 genes; only rare variants (MAF < 0.05) were used. We found 11 genes that were associated with DBP and 4 genes that were associated with SBP, with p-values < 0.0001 (Figure 6; Table 3). The DNAH9 gene was significantly associated with DBP, with a p-value < 4.5 × 10⁻⁶ (equivalent to an α level of 0.05 after Bonferroni correction). DNAH9 contains 69 exons extending over 373 kb, and is a rather large gene [68]. Among the genes in Table 3, INO80 was suggestively associated with both DBP (p-value = 6.44 × 10⁻⁵) and SBP (p-value = 6.08 × 10⁻⁵). Among the genes that were previously shown to be associated with DBP and SBP in a large-scale GWAS [66], only two of them had p-values < 0.05 in the LF-KM portion of our analysis (Table 4). Note that in our analysis, common variants were not evaluated (and only genes on odd autosomes were evaluated). This could be the reason that we did not replicate many genes that were found to be significant in the previous GWAS [66]. We also examined whether any signals were detectable when only using non-synonymous rare SNPs, and found that a region containing DNAH9, CDRT4 and FAM18B2 on chromosome 17 was associated with diastolic blood pressure whether analyzing all rare SNPs or only non-synonymous rare SNPs (Figures 6 and 1S).
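The Bonferroni threshold quoted above follows directly from the number of genes tested. The sketch below reproduces it and applies both the stringent and the suggestive cut-offs to three of the DBP p-values reported in Table 3; the helper name `bonferroni_threshold` is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

threshold = bonferroni_threshold(0.05, 11096)     # about 4.5e-6

# DBP p-values for three genes, taken from Table 3.
gene_p_values = {"DNAH9": 1.49e-6, "CDRT4": 1.15e-5, "INO80": 6.44e-5}
significant = [gene for gene, p in gene_p_values.items() if p < threshold]
suggestive = [gene for gene, p in gene_p_values.items() if p < 1e-4]
```

Only DNAH9 passes the Bonferroni-corrected level, while all three genes fall below the suggestive level of 1 × 10⁻⁴.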

Figure 6. (A) –log10(p-values) of the association between 11,096 genes and diastolic blood pressure (DBP); (B) –log10(p-values) of the association between 11,096 genes and systolic blood pressure (SBP). The blue line is the suggestive significance level, 1 × 10⁻⁴, and the red line is the stringent Bonferroni-corrected significance level, 4.5 × 10⁻⁶.

Table 3. Genes potentially associated with diastolic blood pressure (DBP) and systolic blood pressure (SBP) at a significance level of α = 1 × 10⁻⁴.

| Trait | Gene | Chr | Start | Stop | N SNPs | LF-KM P-value |
|---|---|---|---|---|---|---|
| DBP | DNAH9 | 17 | 11496748 | 11878485 | 1569 | 1.49 × 10⁻⁶ |
| DBP | CDRT4 | 17 | 15334332 | 15471875 | 600 | 1.15 × 10⁻⁵ |
| DBP | FAM18B2 | 17 | 15336258 | 15471945 | 585 | 1.16 × 10⁻⁵ |
| DBP | CDRT1 | 17 | 15463798 | 15528018 | 219 | 1.51 × 10⁻⁵ |
| DBP | SCO1 | 17 | 10578654 | 10605885 | 179 | 1.79 × 10⁻⁵ |
| DBP | HS3ST3B1 | 17 | 14199400 | 14257721 | 294 | 3.21 × 10⁻⁵ |
| DBP | TRIM16 | 17 | 15526274 | 15592613 | 322 | 3.35 × 10⁻⁵ |
| DBP | CCL18 | 17 | 34386640 | 34404392 | 87 | 3.72 × 10⁻⁵ |
| DBP | INO80 | 15 | 41266078 | 41413444 | 518 | 6.44 × 10⁻⁵ |
| DBP | C17orf48 | 17 | 10595931 | 10619550 | 135 | 7.75 × 10⁻⁵ |
| DBP | FBLIM1 | 1 | 16078102 | 16118089 | 175 | 9.99 × 10⁻⁵ |
| SBP | PRDM10 | 11 | 129764607 | 129877730 | 416 | 4.22 × 10⁻⁵ |
| SBP | INO80 | 15 | 41266078 | 41413444 | 518 | 6.08 × 10⁻⁵ |
| SBP | TRIP10 | 19 | 6734691 | 6756537 | 75 | 7.98 × 10⁻⁵ |
| SBP | SMN1 | 5 | 70215768 | 70357324 | 63 | 8.84 × 10⁻⁵ |

Table 4. Genes significant at α = 0.05 that were also shown to be associated with DBP and SBP in a large-scale genome-wide association study (GWAS) [57].

| Trait | Gene | Chr | Start | Stop | N SNPs | LF-KM P-value |
|---|---|---|---|---|---|---|
| DBP | MOV10 | 1 | 113210763 | 113248368 | 124 | 0.022 |
| SBP | MOV10 | 1 | 113210763 | 113248368 | 124 | 0.013 |
| SBP | SLC4A7 | 3 | 27409214 | 27530911 | 497 | 0.017 |

Analysis of the GAW18 data took 32.8 hours on a single computing node with a 3 GHz CPU and 4 GB memory. Using a computer cluster with multiple nodes, we anticipate that genome-wide data analysis should be finished within hours using our proposed method.

Discussion

In this work, we developed two statistics (L-KM and LF-KM) using a linear mixed model framework, which can be employed to analyze longitudinal data with quantitative traits while properly adjusting for any family structure. As set-based analysis methods that test a set of genetic variants jointly, L-KM and LF-KM share the advantages of set-based methods, such as improved power and reduced multiple testing. Another approach [24] is also able to test the association between longitudinal phenotypes and genes. In this algorithm, the test is based on collapsing markers within a gene via an aggregated index; however, this could introduce substantial noise to the summarized value. In addition, permutation is used to evaluate p-values, which is computationally intensive, particularly on a genome-wide scale. On the other hand, our proposed methods do not collapse markers into a single value; thus, they allow each marker to have different directions and magnitudes of effects. Our proposed kernel regression-based methods can analytically compute p-values without resampling, leading to a substantial reduction in computation time. Furthermore, our proposed methods are computationally efficient because they are basically score tests, and thus the null model (which does not include genotypes) only needs to be fitted once for the whole genome. Different methods have higher power in different scenarios. When most of the genetic variants analyzed are causal and the directions of their effects are the same, the optimal sequence kernel association tests (SKAT-O) and burden tests have higher power [69] than SKAT. When most of the variants in a region are not causal and the directions of effects of causal variants are different, SKAT has higher power [69] than SKAT-O and burden tests.
The L-KM and LF-KM statistics can be extended to the “optimal” framework by combining them with the burden test statistic (see Supplementary Materials for details), which is similar to the extension of SKAT to SKAT-O [69]. The optimal methods have empirical Type I error rates close to the nominal level (Figures 3S and 4S) and increased power when causal variants are in the same direction (Figures 5S and 6S).

In the population-based simulation studies, we showed that L-KM preserves the desired Type I error rates. When multiple measurements were available at different time points, we showed that L-KM achieves higher power than the competing approaches by using measurements at all time points. In the family-based simulation studies, we showed that using L-KM on data with related samples results in an inflated Type I error rate, while LF-KM had the correct Type I error rate because it considered the familial structure in the model. Analogously, LF-KM achieves the best power performance when using all observations. Based on our simulation study, L-KM is a good choice for genetic analysis of longitudinal data for population samples, and LF-KM is a good choice for family samples. It should be noted that LF-KM is more general and includes L-KM as a special case, where each individual can be treated as being from one family.

The L-KM and LF-KM computation times depend on the complexity of the model, which is influenced by factors such as the sample size, the number of genetic variants, and the number of repeated measures. The computation time required to fit the model under the null hypothesis may not be the critical cost in a genome-wide analysis. Because L-KM and LF-KM are score tests, the estimates of the fixed-effect coefficients and the covariance matrices under the null hypothesis do not depend on the genetic variants, so the linear mixed model under the null hypothesis only needs to be fitted once for the whole genome. The covariate estimates β̂ and variance-covariance matrix Σ̂ can then be saved and reused for constructing test statistics for all the genes. Instead, evaluating the test statistic gene by gene takes the majority of the computation time. If the number of markers in a gene is large, matrix inversion is still computationally intensive. If the total number of observations is large, the runtime may be infeasible on a single computing node (Figure 2S). Although parallel and powerful computing facilities are available, it is always advantageous to use fast algorithms, such as EMMA/EMMAX [58,70], TASSEL [70], and several new fast linear mixed model algorithms [71–74], to make our approach faster and more efficient.
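
The "fit the null model once, reuse it for every gene" pattern can be sketched as follows. This is a minimal illustration, not the authors' R implementation: the covariance matrix `Sigma` is simply assumed known here (in practice it would come from a fitted linear mixed model), and the function names, gene names, unit weights, and simulated data are all hypothetical.

```python
import numpy as np


def null_model_pieces(y, X, Sigma):
    """Return Sigma^{-1}(y - X beta_hat); depends only on the null model."""
    Sigma_inv = np.linalg.inv(Sigma)
    XtSi = X.T @ Sigma_inv
    beta_hat = np.linalg.solve(XtSi @ X, XtSi @ y)
    return Sigma_inv @ (y - X @ beta_hat)


def q_statistic(scaled_resid, G, w):
    """Q = r' Sigma^{-1} G W G' Sigma^{-1} r with diagonal W = diag(w)."""
    s = G.T @ scaled_resid          # one score per variant
    return float(np.sum(w * s ** 2))


rng = np.random.default_rng(1)
n_subj, n_time = 15, 4
n = n_subj * n_time

# Toy design: intercept plus time; observations are grouped by subject.
X = np.column_stack([np.ones(n), np.tile(np.arange(n_time), n_subj)])
# Toy covariance: subject-level random intercept plus independent noise.
Sigma = 0.5 * np.kron(np.eye(n_subj), np.ones((n_time, n_time))) + np.eye(n)
y = X @ np.array([1.0, 0.2]) + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

# Step 1: null-model quantities are computed a single time.
scaled_resid = null_model_pieces(y, X, Sigma)

# Step 2: each gene only needs its genotype matrix (rows repeated over time).
genes = {
    name: np.repeat(rng.binomial(2, 0.1, size=(n_subj, q)), n_time, axis=0).astype(float)
    for name, q in [("geneA", 3), ("geneB", 6)]
}
q_values = {
    name: q_statistic(scaled_resid, G, np.ones(G.shape[1]))
    for name, G in genes.items()
}
```

Because `scaled_resid` is computed outside the loop over genes, the per-gene cost in this sketch is one matrix-vector product; the p-value calculation, which is not shown here, adds the eigenvalue step described in the Appendix.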

In the real data analysis, we only included rare variants to illustrate our proposed method; we did not perform a thorough analysis or attempt to draw biological conclusions from this illustrative analysis. In practice, both common and rare variants are usually available and both must be analyzed. There are different analysis strategies to handle this. One strategy is to include both common and rare variants in the set-based analysis [75]. Another strategy is to group rare variants into sets [7–9,19,76] and analyze common variants individually. A third strategy is to group rare variants and common variants into different sets and test them either together or separately [12,77–80]. The comparison of these strategies is beyond the scope of this work and also depends on the data being analyzed.

Although the method we propose includes some assumptions, the framework is general and flexible. Covariates can be easily incorporated into the model. The L-KM and LF-KM algorithms were implemented in R (http://www.r-project.org) and the source code is available online (http://www.pitt.edu/~qiy17/Softwares.html).

Supplementary Material

Supplemental Material

Acknowledgments

This work was supported in part by grants EPS1158862 (Q.Y., H.K.T.) and 1462990 (X.L. and N.L.) from the National Science Foundation; grants GM081488, R03DE024198, R01HL092173, P60AR064172, UL1TR001417 (N.L.), 5R01GM069430-08 (N.Y.), 5R01DA025095 (X.L.), GM073766 (G.G.), R01 HG007358 (W.C.), and R01 EY024226 (W.C.) from the National Institutes of Health; a grant from Children’s Hospital of Pittsburgh of the UPMC Health System; and grants 102-2628-B-002-039-MY3 from the Ministry of Science and Technology of Taiwan and NTU-CESRP-104R7622-8 from National Taiwan University (W-Y.L.). GAW18 was provided by the T2D-GENES Consortium, which was supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.

Appendix

Derivation of the Kernel Machine Test (KM) in a Linear Mixed Model Framework

For the linear mixed model, the log likelihood is

$$
l = C - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta).
$$

To derive the score test for H0: τ = 0, we take the first derivative with respect to τ

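$$
\frac{dl}{d\tau} = -\frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}GWG'\right) + \frac{1}{2}(y - X\beta)'\Sigma^{-1}GWG'\Sigma^{-1}(y - X\beta),
$$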

where dl/dτ is the score function of τ under H0: τ = 0. If Σ is replaced with its estimate, the first term in the score function is fixed and does not depend on the phenotype y. Following the same rationale as used in the derivation of the sequence kernel association test (SKAT) score statistic [46–48], we used twice the second term as our test statistic.
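
For readers who want the intermediate step, the derivative above follows from two standard matrix-calculus identities. The sketch assumes, as the form of the score implies, that τ enters the covariance only through a term τGWG′, so that ∂Σ/∂τ = GWG′; the symbol Σ_0 for the remaining (null) part of the covariance is introduced here purely for illustration.

$$
\begin{aligned}
\Sigma &= \tau\, GWG' + \Sigma_0, \qquad \frac{\partial \Sigma}{\partial \tau} = GWG',\\
\frac{\partial \log|\Sigma|}{\partial \tau} &= \operatorname{tr}\left(\Sigma^{-1}\frac{\partial \Sigma}{\partial \tau}\right) = \operatorname{tr}\left(\Sigma^{-1}GWG'\right),\\
\frac{\partial \Sigma^{-1}}{\partial \tau} &= -\Sigma^{-1}\frac{\partial \Sigma}{\partial \tau}\Sigma^{-1} = -\Sigma^{-1}GWG'\Sigma^{-1},\\
\frac{dl}{d\tau} &= -\frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}GWG'\right) + \frac{1}{2}(y - X\beta)'\Sigma^{-1}GWG'\Sigma^{-1}(y - X\beta).
\end{aligned}
$$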

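Under the null hypothesis, the linear mixed model is y = Xβ + u + ε, and the estimates are

$$
\begin{aligned}
\hat{\Sigma} &= \hat{K} + \hat{\sigma}_E^2 I,\\
\hat{\beta} &= (X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}y.
\end{aligned}
$$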

Replacing the variance components with their maximum likelihood estimators (MLEs), we have

$$
Q = (y - X\hat{\beta})'\hat{\Sigma}^{-1}GWG'\hat{\Sigma}^{-1}(y - X\hat{\beta})
$$

as the test statistic. Under the null hypothesis, the variance of the residual is

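$$
\operatorname{Var}(y - X\hat{\beta}) = \operatorname{Var}\left(y - X(X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}y\right) = \hat{\Sigma} - X(X'\hat{\Sigma}^{-1}X)^{-1}X' = P_0.
$$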
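
The last equality can be checked directly. The sketch below treats Σ̂ as fixed, uses Var(y) = Σ̂ under the null hypothesis, and introduces the shorthand symbols A and H only for convenience.

$$
\begin{aligned}
A &:= X(X'\hat{\Sigma}^{-1}X)^{-1}X', \qquad H := A\hat{\Sigma}^{-1}, \qquad y - X\hat{\beta} = (I - H)y,\\
H\hat{\Sigma} &= A, \qquad \hat{\Sigma}H' = A, \qquad H\hat{\Sigma}H' = A\hat{\Sigma}^{-1}A = A,\\
\operatorname{Var}(y - X\hat{\beta}) &= (I - H)\hat{\Sigma}(I - H)' = \hat{\Sigma} - A - A + A = \hat{\Sigma} - A = P_0.
\end{aligned}
$$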

The statistic Q is a quadratic form of y − Xβ̂ and follows a mixture of chi-square distributions under H0. Thus,

$$
Q \sim \sum_{i=1}^{q} \lambda_i \chi^2_{1,i},
$$

where λi are the eigenvalues of the matrix $W^{1/2}G'\hat{\Sigma}^{-1}P_0\hat{\Sigma}^{-1}GW^{1/2}$ [38], which can be proved by the Theorem of the Distribution of Quadratic Forms and the Theorem of Equal Eigenvalues. The p-values can be calculated by different algorithms, such as Davies’ method [52].
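
The eigenvalue step can be sketched numerically. The example below is an illustration under stated assumptions, not the authors' implementation: the data are simulated, `Sigma` is assumed known rather than estimated, the weights are set to one, and the p-value is approximated by plain Monte Carlo sampling from the chi-square mixture as a simple stand-in for Davies' method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_time, q = 10, 4, 5
n = n_subj * n_time

# Toy data: intercept + time covariates, genotypes repeated over time points.
X = np.column_stack([np.ones(n), np.tile(np.arange(n_time), n_subj)])
G = np.repeat(rng.binomial(2, 0.2, size=(n_subj, q)), n_time, axis=0).astype(float)
w = np.ones(q)  # hypothetical unit weights, W = diag(w)
Sigma = 0.5 * np.kron(np.eye(n_subj), np.ones((n_time, n_time))) + np.eye(n)
y = X @ np.array([1.0, 0.2]) + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

# Null-model quantities: beta_hat, residual, and P0 = Sigma - X (X' S^-1 X)^-1 X'.
Sigma_inv = np.linalg.inv(Sigma)
XtSi = X.T @ Sigma_inv
beta_hat = np.linalg.solve(XtSi @ X, XtSi @ y)
resid = y - X @ beta_hat
P0 = Sigma - X @ np.linalg.solve(XtSi @ X, X.T)

# K = W^{1/2} G' Sigma^{-1} (q x n); Q = resid' K' K resid.
K = np.sqrt(w)[:, None] * (G.T @ Sigma_inv)
Q = float(np.sum((K @ resid) ** 2))

# Mixture weights: eigenvalues of the q x q matrix K P0 K'.
lam = np.linalg.eigvalsh(K @ P0 @ K.T)

# Monte Carlo approximation of P(sum_i lam_i * chi2_1 >= Q).
draws = rng.chisquare(1, size=(100_000, q)) @ lam
p_value = float(np.mean(draws >= Q))
```

A useful sanity check on such a sketch is that the eigenvalues sum to the trace of $\hat{\Sigma}^{-1}GWG'\hat{\Sigma}^{-1}P_0$, which is the expected value of Q under the null hypothesis. An exact or saddlepoint method would replace the Monte Carlo step when very small p-values are needed.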

The Theorem of Equal Eigenvalues states that for matrices A and B, the eigenvalues of AB and BA are equal if one of A and B is invertible. More generally, for an n × q matrix A and a q × n matrix B, AB and BA have the same nonzero eigenvalues, which is the form needed here. In our case, $(y - X\hat{\beta}) \sim N(0, P_0)$ and $K_0 = \hat{\Sigma}^{-1}GWG'\hat{\Sigma}^{-1} = (W^{1/2}G'\hat{\Sigma}^{-1})'(W^{1/2}G'\hat{\Sigma}^{-1}) = K'K$. Therefore, λi in $Q \sim \sum_{i=1}^{n}\lambda_i\chi^2_{1,i}$ are the eigenvalues of the matrix K′KP0 according to the theorem in Yuan and Bentler [49], which is equivalent to the nonzero eigenvalues λi in $Q \sim \sum_{i=1}^{q}\lambda_i\chi^2_{1,i}$ of the matrix $KP_0K' = W^{1/2}G'\hat{\Sigma}^{-1}P_0\hat{\Sigma}^{-1}GW^{1/2}$ according to the Theorem of Equal Eigenvalues. The reason for using this matrix form is that sample size n is usually larger than the number of markers q in one gene, and thus the size of KP0K′ is smaller than K′KP0. Therefore, less computation is involved when using KP0K′. If sample size n is smaller than the number of markers q, there is no need to use KP0K′.
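
The equivalence of the two matrix forms can be verified numerically on a small example. The sketch below uses randomly generated inputs (all sizes, weights, and the covariance matrix are hypothetical) and compares the eigenvalues of the n × n matrix K′KP0 with those of the q × q matrix KP0K′.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 30, 4

X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.binomial(2, 0.2, size=(n, q)).astype(float)
w = rng.uniform(0.5, 2.0, size=q)      # hypothetical positive weights
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n + np.eye(n)        # arbitrary positive-definite covariance

Sigma_inv = np.linalg.inv(Sigma)
P0 = Sigma - X @ np.linalg.solve(X.T @ Sigma_inv @ X, X.T)
K = np.sqrt(w)[:, None] * (G.T @ Sigma_inv)   # q x n

big = K.T @ K @ P0      # n x n form, K'K P0
small = K @ P0 @ K.T    # q x q form, K P0 K'

# Sort eigenvalues in decreasing order; the n x n form is not symmetric,
# so a general eigenvalue routine is used and the real parts are kept.
eig_big = np.sort(np.linalg.eigvals(big).real)[::-1]
eig_small = np.sort(np.linalg.eigvalsh(small))[::-1]

same_nonzero = np.allclose(eig_big[:q], eig_small, atol=1e-6)
rest_is_zero = np.allclose(eig_big[q:], 0.0, atol=1e-6)
```

In this setting both flags are expected to be true: the q leading eigenvalues agree and the remaining n − q eigenvalues of the larger matrix are numerically zero, so only the smaller symmetric eigenproblem needs to be solved.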

Footnotes

The authors also declare that they have no conflicts of interest.

References

  • 1.Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
  • 2.Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, Wacholder S, Wang Z, Welch R, Hutchinson A, Wang J, Yu K, Chatterjee N, Orr N, Willett WC, Colditz GA, Ziegler RG, Berg CD, Buys SS, McCarty CA, Feigelson HS, Calle EE, Thun MJ, Hayes RB, Tucker M, Gerhard DS, Fraumeni JF, Jr, Hoover RN, Thomas G, Chanock SJ. A genome-wide association study identifies alleles in fgfr2 associated with risk of sporadic postmenopausal breast cancer. Nature genetics. 2007;39:870–874. doi: 10.1038/ng2075.
  • 3.Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, Minichiello MJ, Fearnhead P, Yu K, Chatterjee N, Wang Z, Welch R, Staats BJ, Calle EE, Feigelson HS, Thun MJ, Rodriguez C, Albanes D, Virtamo J, Weinstein S, Schumacher FR, Giovannucci E, Willett WC, Cancel-Tassin G, Cussenot O, Valeri A, Andriole GL, Gelmann EP, Tucker M, Gerhard DS, Fraumeni JF, Jr, Hoover R, Hunter DJ, Chanock SJ, Thomas G. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nature genetics. 2007;39:645–649. doi: 10.1038/ng2022.
  • 4.Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America. 2009;106:9362–9367. doi: 10.1073/pnas.0903103106.
  • 5.Manolio TA, Brooks LD, Collins FS. A hapmap harvest of insights into the genetics of common disease. The Journal of clinical investigation. 2008;118:1590–1605. doi: 10.1172/JCI34772.
  • 6.Schork NJ, Murray SS, Frazer KA, Topol EJ. Common vs. Rare allele hypotheses for complex diseases. Current opinion in genetics & development. 2009;19:212–219. doi: 10.1016/j.gde.2009.04.010.
  • 7.Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. American journal of human genetics. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
  • 8.Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
  • 9.Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation research. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
  • 10.Li B, Leal SM. Discovery of rare variants via sequencing: Implications for the design of complex trait association studies. PLoS genetics. 2009;5:e1000481. doi: 10.1371/journal.pgen.1000481.
  • 11.Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. American journal of human genetics. 2010;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
  • 12.Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Human heredity. 2010;70:42–54. doi: 10.1159/000288704.
  • 13.Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic epidemiology. 2010;34:188–193. doi: 10.1002/gepi.20450.
  • 14.Lin WY, Yi N, Zhi D, Zhang K, Gao G, Tiwari HK, Liu N. Haplotype-based methods for detecting uncommon causal variants with common snps. Genetic epidemiology. 2012;36:572–582. doi: 10.1002/gepi.21650.
  • 15.Lin WY, Yi N, Lou XY, Zhi D, Zhang K, Gao G, Tiwari HK, Liu N. Haplotype kernel association test as a powerful method to identify chromosomal regions harboring uncommon causal variants. Genetic epidemiology. 2013;37:560–570. doi: 10.1002/gepi.21740.
  • 16.Lin WY, Lou XY, Gao G, Liu N. Rare variant association testing by adaptive combination of p-values. PloS one. 2014;9:e85728. doi: 10.1371/journal.pone.0085728.
  • 17.Yan Q, Tiwari HK, Yi N, Lin WY, Gao G, Lou XY, Cui X, Liu N. Kernel-machine testing coupled with a rank-truncation method for genetic pathway analysis. Genetic epidemiology. 2014;38:447–456. doi: 10.1002/gepi.21813.
  • 18.Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful snp-set analysis for case-control genome-wide association studies. American journal of human genetics. 2010;86:929–942. doi: 10.1016/j.ajhg.2010.05.002.
  • 19.Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. American journal of human genetics. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
  • 20.Aulchenko YS, Ripatti S, Lindqvist I, Boomsma D, Heid IM, Pramstaller PP, Penninx BW, Janssens AC, Wilson JF, Spector T, Martin NG, Pedersen NL, Kyvik KO, Kaprio J, Hofman A, Freimer NB, Jarvelin MR, Gyllensten U, Campbell H, Rudan I, Johansson A, Marroni F, Hayward C, Vitart V, Jonasson I, Pattaro C, Wright A, Hastie N, Pichler I, Hicks AA, Falchi M, Willemsen G, Hottenga JJ, de Geus EJ, Montgomery GW, Whitfield J, Magnusson P, Saharinen J, Perola M, Silander K, Isaacs A, Sijbrands EJ, Uitterlinden AG, Witteman JC, Oostra BA, Elliott P, Ruokonen A, Sabatti C, Gieger C, Meitinger T, Kronenberg F, Doring A, Wichmann HE, Smit JH, McCarthy MI, van Duijn CM, Peltonen L, Consortium E. Loci influencing lipid levels and coronary heart disease risk in 16 european population cohorts. Nature genetics. 2009;41:47–55. doi: 10.1038/ng.269.
  • 21.Kamatani Y, Matsuda K, Okada Y, Kubo M, Hosono N, Daigo Y, Nakamura Y, Kamatani N. Genome-wide association study of hematological and biochemical traits in a japanese population. Nature genetics. 2010;42:210–215. doi: 10.1038/ng.531.
  • 22.Kathiresan S, Manning AK, Demissie S, D'Agostino RB, Surti A, Guiducci C, Gianniny L, Burtt NP, Melander O, Orho-Melander M, Arnett DK, Peloso GM, Ordovas JM, Cupples LA. A genome-wide association study for blood lipid phenotypes in the framingham heart study. BMC medical genetics. 2007;8(Suppl 1):S17. doi: 10.1186/1471-2350-8-S1-S17.
  • 23.Sabatti C, Service SK, Hartikainen AL, Pouta A, Ripatti S, Brodsky J, Jones CG, Zaitlen NA, Varilo T, Kaakinen M, Sovio U, Ruokonen A, Laitinen J, Jakkula E, Coin L, Hoggart C, Collins A, Turunen H, Gabriel S, Elliot P, McCarthy MI, Daly MJ, Jarvelin MR, Freimer NB, Peltonen L. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature genetics. 2009;41:35–46. doi: 10.1038/ng.271.
  • 24.Wang S, Fang S, Sha Q, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants with longitudinal data. BMC proceedings. 2014;8:S91. doi: 10.1186/1753-6561-8-S1-S91.
  • 25.Furlotte NA, Eskin E, Eyheramendy S. Genome-wide association mapping with longitudinal data. Genetic epidemiology. 2012;36:463–471. doi: 10.1002/gepi.21640.
  • 26.Melton PE, Almasy LA. Bivariate association analysis of longitudinal phenotypes in families. BMC proceedings. 2014;8:S90. doi: 10.1186/1753-6561-8-S1-S90.
  • 27.Fan R, Zhang Y, Albert PS, Liu A, Wang Y, Xiong M. Longitudinal association analysis of quantitative traits. Genetic epidemiology. 2012;36:856–869. doi: 10.1002/gepi.21673.
  • 28.Falk CT, Rubinstein P. Haplotype relative risks: An easy reliable way to construct a proper control sample for risk calculations. Annals of human genetics. 1987;51:227–233. doi: 10.1111/j.1469-1809.1987.tb00875.x.
  • 29.Ott J. Statistical properties of the haplotype relative risk. Genetic epidemiology. 1989;6:127–130. doi: 10.1002/gepi.1370060124.
  • 30.Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). American journal of human genetics. 1993;52:506–516.
  • 31.Terwilliger JD, Ott J. A haplotype-based 'haplotype relative risk' approach to detecting allelic associations. Human heredity. 1992;42:337–346. doi: 10.1159/000154096.
  • 32.Ott J, Kamatani Y, Lathrop M. Family-based designs for genome-wide association studies. Nature reviews Genetics. 2011;12:465–474. doi: 10.1038/nrg2989.
  • 33.George VT, Elston RC. Testing the association between polymorphic markers and quantitative traits in pedigrees. Genetic epidemiology. 1987;4:193–201. doi: 10.1002/gepi.1370040304.
  • 34.Chen T, Santawisook P, Wu Z. A multi-level model for analyzing whole genome sequencing family data with longitudinal traits. BMC proceedings. 2014;8:S86. doi: 10.1186/1753-6561-8-S1-S86.
  • 35.Wu YY, Briollais L. Mixed-effects models for joint modeling of sequence data in longitudinal studies. BMC proceedings. 2014;8:S92. doi: 10.1186/1753-6561-8-S1-S92.
  • 36.Zhou H, Zhou J, Sobel EM, Lange K. Fast genome-wide pedigree quantitative trait loci analysis using mendel. BMC proceedings. 2014;8:S93. doi: 10.1186/1753-6561-8-S1-S93.
  • 37.Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Human heredity. 2000;50:211–223. doi: 10.1159/000022918.
  • 38.Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genetic epidemiology. 2013;37:196–204. doi: 10.1002/gepi.21703.
  • 39.Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser PA, Lin X. Snp set association analysis for familial data. Genetic epidemiology. 2012;36:797–810. doi: 10.1002/gepi.21676.
  • 40.Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, Richards JB, Ciampi A, Greenwood CM. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genetic epidemiology. 2013;37:366–376. doi: 10.1002/gepi.21725.
  • 41.Yan Q, Tiwari HK, Yi N, Gao G, Zhang K, Lin WY, Lou XY, Cui X, Liu N. A sequence kernel association test for dichotomous traits in family samples under a generalized linear mixed model. Human heredity. 2015;79:60–68. doi: 10.1159/000375409.
  • 42.Yan Q, Weeks DE, Celedon JC, Tiwari HK, Li B, Wang X, Lin WY, Lou XY, Gao G, Chen W, Liu N. Associating multivariate quantitative phenotypes with genetic variants in family samples with a novel kernel machine regression method. Genetics. 2015;201:1329–1339. doi: 10.1534/genetics.115.178590.
  • 43.Paterson AD. Drinking from the holy grail: Analysis of whole-genome sequencing from the genetic analysis workshop 18. Genetic epidemiology. 2014;38(Suppl 1):S1–S4. doi: 10.1002/gepi.21818.
  • 44.Chen H, Malzahn D, Balliu B, Li C, Bailey JN. Testing genetic association with rare and common variants in family data. Genetic epidemiology. 2014;38(Suppl 1):S37–S43. doi: 10.1002/gepi.21823.
  • 45.Cordell HJ. Summary of results and discussions from the gene-based tests group at genetic analysis workshop 18. Genetic epidemiology. 2014;38(Suppl 1):S44–S48. doi: 10.1002/gepi.21824.
  • 46.Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A powerful and flexible multilocus association test for quantitative traits. American journal of human genetics. 2008;82:386–397. doi: 10.1016/j.ajhg.2007.10.010.
  • 47.Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics. 2007;63:1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x.
  • 48.Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2003;4:57–74. doi: 10.1093/biostatistics/4.1.57.
  • 49.Yuan KH, Bentler PM. Two simple approximations to the distributions of quadratic forms. The British journal of mathematical and statistical psychology. 2010;63:273–291. doi: 10.1348/000711009X449771.
  • 50.Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser PA, Lin X. Snp set association analysis for familial data. Genetic epidemiology. 2012;36:797–810. doi: 10.1002/gepi.21676.
  • 51.Satterthwaite FE. An approximate distribution of estimates of variance components. Biometrics. 1946;2:110–114.
  • 52.Davies R. The distribution of a linear combination of chi-square random variables. J R Stat Soc Ser C Appl Stat. 1980;29:323–333.
  • 53.Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999;86:929–935.
  • 54.Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team. nlme: Linear and nonlinear mixed effects models. R package version 3.1-118. 2014. http://CRAN.R-project.org/package=nlme.
  • 55.Vazquez AI, Bates DM, Rosa GJ, Gianola D, Weigel KA. Technical note: An r package for fitting generalized linear mixed models in animal breeding. Journal of animal science. 2010;88:497–504. doi: 10.2527/jas.2009-1952.
  • 56.Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome research. 2005;15:1576–1583. doi: 10.1101/gr.3709305.
  • 57.Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12. doi: 10.1007/BF01441146.
  • 58.Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nature genetics. 2010;42:348–354. doi: 10.1038/ng.548.
  • 59.Lynch M, Ritland K. Estimation of pairwise relatedness with molecular markers. Genetics. 1999;152:1753–1766. doi: 10.1093/genetics/152.4.1753.
  • 60.Ritland K. Multilocus estimation of pairwise relatedness with dominant markers. Molecular ecology. 2005;14:3157–3165. doi: 10.1111/j.1365-294X.2005.02667.x.
  • 61.Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature genetics. 2006;38:203–208. doi: 10.1038/ng1702.
  • 62.Liu N, Zhao H, Patki A, Limdi NA, Allison DB. Controlling population structure in human genetic association studies with samples of unrelated individuals. Statistics and its interface. 2011;4:317–326. doi: 10.4310/sii.2011.v4.n3.a6.
  • 63.Mitchell BD, Kammerer CM, Blangero J, Mahaney MC, Rainwater DL, Dyke B, Hixson JE, Henkel RD, Sharp RM, Comuzzie AG, VandeBerg JL, Stern MP, MacCluer JW. Genetic and environmental contributions to cardiovascular risk factors in mexican americans. The san antonio family heart study. Circulation. 1996;94:2159–2170. doi: 10.1161/01.cir.94.9.2159.
  • 64.Hunt KJ, Lehman DM, Arya R, Fowler S, Leach RJ, Goring HH, Almasy L, Blangero J, Dyer TD, Duggirala R, Stern MP. Genome-wide linkage analyses of type 2 diabetes in mexican americans: The san antonio family diabetes/gallbladder study. Diabetes. 2005;54:2655–2662. doi: 10.2337/diabetes.54.9.2655.
  • 65.Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nature genetics. 2002;30:97–101. doi: 10.1038/ng786.
  • 66.International Consortium for Blood Pressure Genome-Wide Association S. Ehret GB, Munroe PB, Rice KM, Bochud M, Johnson AD, Chasman DI, Smith AV, Tobin MD, Verwoert GC, Hwang SJ, Pihur V, Vollenweider P, O'Reilly PF, Amin N, Bragg-Gresham JL, Teumer A, Glazer NL, Launer L, Zhao JH, Aulchenko Y, Heath S, Sober S, Parsa A, Luan J, Arora P, Dehghan A, Zhang F, Lucas G, Hicks AA, Jackson AU, Peden JF, Tanaka T, Wild SH, Rudan I, Igl W, Milaneschi Y, Parker AN, Fava C, Chambers JC, Fox ER, Kumari M, Go MJ, van der Harst P, Kao WH, Sjogren M, Vinay DG, Alexander M, Tabara Y, Shaw-Hawkins S, Whincup PH, Liu Y, Shi G, Kuusisto J, Tayo B, Seielstad M, Sim X, Nguyen KD, Lehtimaki T, Matullo G, Wu Y, Gaunt TR, Onland-Moret NC, Cooper MN, Platou CG, Org E, Hardy R, Dahgam S, Palmen J, Vitart V, Braund PS, Kuznetsova T, Uiterwaal CS, Adeyemo A, Palmas W, Campbell H, Ludwig B, Tomaszewski M, Tzoulaki I, Palmer ND, consortium CA, Consortium CK, KidneyGen C, EchoGen c, consortium C-H, Aspelund T, Garcia M, Chang YP, O'Connell JR, Steinle NI, Grobbee DE, Arking DE, Kardia SL, Morrison AC, Hernandez D, Najjar S, McArdle WL, Hadley D, Brown MJ, Connell JM, Hingorani AD, Day IN, Lawlor DA, Beilby JP, Lawrence RW, Clarke R, Hopewell JC, Ongen H, Dreisbach AW, Li Y, Young JH, Bis JC, Kahonen M, Viikari J, Adair LS, Lee NR, Chen MH, Olden M, Pattaro C, Bolton JA, Kottgen A, Bergmann S, Mooser V, Chaturvedi N, Frayling TM, Islam M, Jafar TH, Erdmann J, Kulkarni SR, Bornstein SR, Grassler J, Groop L, Voight BF, Kettunen J, Howard P, Taylor A, Guarrera S, Ricceri F, Emilsson V, Plump A, Barroso I, Khaw KT, Weder AB, Hunt SC, Sun YV, Bergman RN, Collins FS, Bonnycastle LL, Scott LJ, Stringham HM, Peltonen L, Perola M, Vartiainen E, Brand SM, Staessen JA, Wang TJ, Burton PR, Soler Artigas M, Dong Y, Snieder H, Wang X, Zhu H, Lohman KK, Rudock ME, Heckbert SR, Smith NL, Wiggins KL, Doumatey A, Shriner D, Veldre G, Viigimaa M, Kinra S, Prabhakaran D, Tripathy V, Langefeld CD, Rosengren A, 
Thelle DS, Corsi AM, Singleton A, Forrester T, Hilton G, McKenzie CA, Salako T, Iwai N, Kita Y, Ogihara T, Ohkubo T, Okamura T, Ueshima H, Umemura S, Eyheramendy S, Meitinger T, Wichmann HE, Cho YS, Kim HL, Lee JY, Scott J, Sehmi JS, Zhang W, Hedblad B, Nilsson P, Smith GD, Wong A, Narisu N, Stancakova A, Raffel LJ, Yao J, Kathiresan S, O'Donnell CJ, Schwartz SM, Ikram MA, Longstreth WT, Jr, Mosley TH, Seshadri S, Shrine NR, Wain LV, Morken MA, Swift AJ, Laitinen J, Prokopenko I, Zitting P, Cooper JA, Humphries SE, Danesh J, Rasheed A, Goel A, Hamsten A, Watkins H, Bakker SJ, van Gilst WH, Janipalli CS, Mani KR, Yajnik CS, Hofman A, Mattace-Raso FU, Oostra BA, Demirkan A, Isaacs A, Rivadeneira F, Lakatta EG, Orru M, Scuteri A, Ala-Korpela M, Kangas AJ, Lyytikainen LP, Soininen P, Tukiainen T, Wurtz P, Ong RT, Dorr M, Kroemer HK, Volker U, Volzke H, Galan P, Hercberg S, Lathrop M, Zelenika D, Deloukas P, Mangino M, Spector TD, Zhai G, Meschia JF, Nalls MA, Sharma P, Terzic J, Kumar MV, Denniff M, Zukowska-Szczechowska E, Wagenknecht LE, Fowkes FG, Charchar FJ, Schwarz PE, Hayward C, Guo X, Rotimi C, Bots ML, Brand E, Samani NJ, Polasek O, Talmud PJ, Nyberg F, Kuh D, Laan M, Hveem K, Palmer LJ, van der Schouw YT, Casas JP, Mohlke KL, Vineis P, Raitakari O, Ganesh SK, Wong TY, Tai ES, Cooper RS, Laakso M, Rao DC, Harris TB, Morris RW, Dominiczak AF, Kivimaki M, Marmot MG, Miki T, Saleheen D, Chandak GR, Coresh J, Navis G, Salomaa V, Han BG, Zhu X, Kooner JS, Melander O, Ridker PM, Bandinelli S, Gyllensten UB, Wright AF, Wilson JF, Ferrucci L, Farrall M, Tuomilehto J, Pramstaller PP, Elosua R, Soranzo N, Sijbrands EJ, Altshuler D, Loos RJ, Shuldiner AR, Gieger C, Meneton P, Uitterlinden AG, Wareham NJ, Gudnason V, Rotter JI, Rettig R, Uda M, Strachan DP, Witteman JC, Hartikainen AL, Beckmann JS, Boerwinkle E, Vasan RS, Boehnke M, Larson MG, Jarvelin MR, Psaty BM, Abecasis GR, Chakravarti A, Elliott P, van Duijn CM, Newton-Cheh C, Levy D, Caulfield MJ, Johnson T. 
Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature. 2011;478:103–109. doi: 10.1038/nature10405.
  • 67.Laird NM. Missing data in longitudinal studies. Statistics in medicine. 1988;7:305–315. doi: 10.1002/sim.4780070131.
  • 68.Bartoloni L, Blouin JL, Maiti AK, Sainsbury A, Rossier C, Gehrig C, She JX, Marron MP, Lander ES, Meeks M, Chung E, Armengot M, Jorissen M, Scott HS, Delozier-Blanchet CD, Gardiner RM, Antonarakis SE. Axonemal beta heavy chain dynein dnah9: Cdna sequence, genomic structure, and investigation of its role in primary ciliary dyskinesia. Genomics. 2001;72:21–33. doi: 10.1006/geno.2000.6462.
  • 69.Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13:762–775. doi: 10.1093/biostatistics/kxs014.
  • 70.Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM, Buckler ES. Mixed linear model approach adapted for genome-wide association studies. Nature genetics. 2010;42:355–360. doi: 10.1038/ng.546.
  • 71.Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature methods. 2014;11:407–409. doi: 10.1038/nmeth.2848.
  • 72.Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. Fast linear mixed models for genome-wide association studies. Nature methods. 2011;8:833–835. doi: 10.1038/nmeth.1681.
  • 73.Svishcheva GR, Axenovich TI, Belonogova NM, van Duijn CM, Aulchenko YS. Rapid variance components-based method for whole-genome association analysis. Nature genetics. 2012;44:1166–1170. doi: 10.1038/ng.2410.
  • 74.Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nature genetics. 2012;44:821–824. doi: 10.1038/ng.2310.
  • 75.Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. American journal of human genetics. 2013;92:841–853. doi: 10.1016/j.ajhg.2013.04.015.
  • 76.Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Team NGESP-ELP. Christiani DC, Wurfel MM, Lin X. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. American journal of human genetics. 2012;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
  • 77.Wang T, Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. American journal of human genetics. 2007;80:353–360. doi: 10.1086/511312.
  • 78.Chapman J, Whittaker J. Analysis of multiple snps in a candidate gene or region. Genetic epidemiology. 2008;32:560–566. doi: 10.1002/gepi.20330.
  • 79.Pan W. Asymptotic tests of association with multiple snps in linkage disequilibrium. Genetic epidemiology. 2009;33:497–507. doi: 10.1002/gepi.20402.
  • 80.Yi N, Liu N, Zhi D, Li J. Hierarchical generalized linear models for multiple groups of rare and common variants: Jointly estimating group and individual-variant effects. PLoS genetics. 2011;7:e1002382. doi: 10.1371/journal.pgen.1002382.
