American Journal of Human Genetics. 2019 Jan 10;104(2):260–274. doi: 10.1016/j.ajhg.2018.12.012

Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies

Han Chen 1,2, Jennifer E Huffman 3, Jennifer A Brody 4, Chaolong Wang 5, Seunggeun Lee 6, Zilin Li 7, Stephanie M Gogarten 8, Tamar Sofer 9,10, Lawrence F Bielak 11, Joshua C Bis 4, John Blangero 12, Russell P Bowler 13, Brian E Cade 9,10, Michael H Cho 14,15, Adolfo Correa 16, Joanne E Curran 12, Paul S de Vries 1, David C Glahn 17,18, Xiuqing Guo 19, Andrew D Johnson 20, Sharon Kardia 11, Charles Kooperberg 21, Joshua P Lewis 22, Xiaoming Liu 23, Rasika A Mathias 24, Braxton D Mitchell 22,25, Jeffrey R O’Connell 22, Patricia A Peyser 11, Wendy S Post 26, Alex P Reiner 21, Stephen S Rich 27, Jerome I Rotter 19, Edwin K Silverman 14,15, Jennifer A Smith 11, Ramachandran S Vasan 20,28,29, James G Wilson 30, Lisa R Yanek 24; NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium; TOPMed Hematology and Hemostasis Working Group, Susan Redline 9,10,31, Nicholas L Smith 4,32,33,34, Eric Boerwinkle 1,35, Ingrid B Borecki 8, L Adrienne Cupples 20,36, Cathy C Laurie 8, Alanna C Morrison 1, Kenneth M Rice 8, Xihong Lin 7,37,
PMCID: PMC6372261  PMID: 30639324

Abstract

With advances in whole-genome sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate unit tests. The burden test and sequence kernel association test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and later have been extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-set mixed model association tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute’s Trans-Omics for Precision Medicine (TOPMed) program. SMMATs share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMATs correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests in a real data example of analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.

Keywords: generalized linear mixed model, variant set association test, whole-genome sequencing, TOPMed, rare variants, population structure, relatedness

Introduction

In recent years, massive DNA sequence data have been generated. Large-scale whole-genome sequencing projects, such as the National Heart, Lung, and Blood Institute’s (NHLBI) Trans-Omics for Precision Medicine (TOPMed) program and the National Human Genome Research Institute’s (NHGRI) Genome Sequencing Project (GSP), have produced whole-genome sequences from more than 120,000 samples. The designs of the studies from which participants are drawn need not be uniform or simple; for example, TOPMed includes population-based cohorts, family studies, and case-control studies, some of which are conducted in recently admixed populations, and some of which involve large pedigrees of closely related participants.

In population-based cohorts and case-control studies, population stratification and cryptic relatedness are major sources of confounding that need to be accounted for in association tests. For common single-variant analysis, linear mixed models that use an estimated genetic relationship matrix (GRM) to account for both population stratification and cryptic relatedness have been widely applied in genome-wide association studies (GWASs) to analyze structured and related samples.1, 2, 3, 4, 5, 6 For binary traits, however, we previously showed that linear mixed models may not be appropriate in the presence of population stratification due to misspecified mean-variance relationships. Therefore, we instead proposed a computationally efficient method GMMAT7 to perform common single-variant tests in GWASs by fitting generalized linear mixed models (GLMMs),8 which simultaneously account for population structure, cryptic relatedness, and shared environmental effects, using multiple variance components and/or random effects.

Hundreds of millions of genetic variants, most of them with a low or extremely low minor allele frequency (MAF), are being analyzed in large-scale sequencing projects such as TOPMed and GSP. Yet single-variant tests that have been widely used in GWASs are generally underpowered for analyzing rare genetic variants from sequencing studies. To circumvent this problem, statistical tests such as the burden test,9, 10, 11, 12 sequence kernel association test (SKAT),13 and their various combinations14, 15, 16 have been proposed. These tests analyze multiple genetic variants in sets, grouped by genes, genomic regions, or other bioinformatic aggregation units. Most of these tests were originally developed to analyze samples of unrelated individuals and were later extended to family data with known pedigree structures in the parametric mixed model and semiparametric generalized estimating equation frameworks.17, 18, 19, 20, 21, 22, 23

Linear mixed models using a single random effect with the GRM covariance matrix to account for population structure have been developed and implemented in software programs for sequencing data analysis, such as EPACTS and Rvtests.24 Meta-analysis methods for family data have been developed and implemented in seqMeta and RAREMETAL,25, 26 but only for continuous traits in the linear mixed model framework. Moreover, these existing methods do not account for cryptic relatedness and between-subject relatedness from multiple sources and have not been applied to large-scale whole-genome sequencing studies with complex study samples, due to statistical and computational challenges.

One challenge is that among traditional variant set tests such as burden tests and SKAT, no single approach is uniformly most powerful. Another challenge is that existing hybrid tests that combine burden tests and SKAT, such as SKAT-O,14 MiST,15 and aSPU,16 are powerful but are subject to much greater computational loads than either the burden test or SKAT alone in the GLMM framework. Of note, SKAT-O is slower than SKAT because it searches on a grid for the optimal linear combination of the burden test and SKAT statistics. MiST requires adjusting for the genetic burden as a covariate in the SKAT model and hence needs to fit a burden model for each variant set. In large samples of possibly related individuals, extension of MiST is not as practical as in unrelated samples, since fitting a mixed effects model using the burden score for each variant set (or each test unit) is computationally intensive across the genome. Finally, aSPU uses a permutation or Monte Carlo simulation procedure to compute the p values, which can also be challenging in the context of large-scale whole-genome sequencing studies with both population structure and relatedness. Therefore, there is a pressing need to develop powerful and computationally efficient statistical methods for large-scale whole-genome sequencing studies.

To address these statistical and computational challenges, we develop the variant set mixed model association tests (SMMATs), computationally efficient variant set tests for both continuous and binary traits, which are applicable to structured and related samples with potential multiple sources of correlations, from large-scale whole-genome sequencing studies. We include four tests in the SMMAT framework: the burden test (SMMAT-B), SKAT (SMMAT-S), SKAT-O (SMMAT-O), and an efficient hybrid test to combine the burden test and SKAT (SMMAT-E), with power improvements over mixed model-based burden test, SKAT and SKAT-O. All four SMMATs share the same reduced model under the null hypothesis, i.e., the GLMM with only covariates, which needs to be fit only once for all genetic variant sets in an analysis. We show that all of these tests can be constructed using shared single-variant scores and their covariance matrices, thus further improving the computational efficiency in practice compared to performing these tests separately. Moreover, it has been shown that single-variant scores and their covariance matrices can also be used in the meta-analysis of variant set tests,25, 27 and thus SMMAT has been implemented to be directly applicable to combining multi-cohort studies ranging from unstructured independent samples to structured and related samples. Finally, we develop a unified analysis pipeline in our software package GMMAT that implements SMMAT variant set tests in both single study (pooled analysis) and meta-analysis contexts to facilitate research on rare genetic variants from large-scale sequencing studies. We demonstrate the application of our method to the analysis of fibrinogen levels in the TOPMed study.

Material and Methods

Generalized Linear Mixed Models (GLMMs)

We formulate the SMMATs (SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E) from the same GLMM

$$g(\mu_i) = X_i \alpha + G_i \beta + b_i, \qquad \text{(Equation 1)}$$

where $g(\cdot)$ is a monotonic “link” function that connects the mean of phenotype $y_i$, denoted by $\mu_i = E(y_i \mid X_i, G_i, b_i)$, for subject $i$ of $n$ samples, to the covariate row vector $X_i$, the genotype row vector $G_i$ for $q$ genetic variants in a set, and the random effects $b_i$ that account for population structure and relatedness. The phenotypes $y_i$ follow a distribution in the exponential family. For continuous traits, we usually assume that $y_i$ follow a normal distribution and use an identity link function; for binary traits, we assume $y_i$ follow a Bernoulli distribution and use a logit link function. In Equation 1, $\alpha$ is a $p \times 1$ vector of fixed covariate effects including an intercept, and the genotype effects $\beta$ are assumed to be a $q \times 1$ vector whose distribution has mean $W 1_q \beta_0$ and covariance $\theta W^2$, where $W = \mathrm{diag}\{w_j\}$ is a pre-specified $q \times q$ matrix assigning weights to each variant, $\theta$ is a variance component parameter, and $1_q$ is a column vector of length $q$ with all elements 1. We assume that $b \sim N(0, \sum_{k=1}^{K} \nu_k \Phi_k)$ is an $n \times 1$ vector of random effects with each entry $b_i$, $K$ variance component parameters $\nu_k$, and known $n \times n$ relatedness matrices $\Phi_k$ ($1 \le k \le K$). We allow for multiple random effects to account for complex sampling designs such as hierarchical designs, shared environmental effects, and repeated measures from longitudinal studies.

SMMAT-B, SMMAT-S, and SMMAT-O

In Equation 1, testing the genotype effects of $q$ variants, $H_0: \beta = 0$, is equivalent to testing the null hypothesis $H_0: \beta_0 = 0$ and $\theta = 0$. The reduced GLMM under this null hypothesis specifies that

$$g(\mu_{0i}) = X_i \alpha + b_i, \qquad \text{(Equation 2)}$$

where $\mu_{0i} = E(y_i \mid X_i, b_i)$. If we test $H_0: \beta_0 = 0$ under the assumption that $\theta = 0$, a burden score test SMMAT-B can be constructed as

$$T_B = \frac{(y - \hat{\mu}_0)^T G W 1_q 1_q^T W G^T (y - \hat{\mu}_0)}{\hat{\phi}^2},$$

where $y = (y_1\ y_2\ \cdots\ y_n)^T$ is an $n \times 1$ vector of phenotypes $y_i$, $\hat{\mu}_0$ is a vector of fitted mean values under the model in Equation 2, $G = (G_1^T\ G_2^T\ \cdots\ G_n^T)^T$ is an $n \times q$ genotype matrix of the variant set in the test, and $\hat{\phi}$ is an estimate of the dispersion parameter (or the residual variance) $\phi$. Under $H_0: \beta_0 = 0$, the statistic $T_B$ asymptotically follows $\xi_B \chi_1^2$, where the scalar $\xi_B = 1_q^T W G^T \hat{P} G W 1_q$, $\chi_1^2$ is a chi-square distribution with 1 df, and $\hat{P} = \hat{\Sigma}^{-1} - \hat{\Sigma}^{-1} X (X^T \hat{\Sigma}^{-1} X)^{-1} X^T \hat{\Sigma}^{-1}$ is the $n \times n$ projection matrix of the null GLMM (Equation 2), $X = (X_1^T\ X_2^T\ \cdots\ X_n^T)^T$ is an $n \times p$ covariate matrix, $\hat{\Sigma} = \hat{V} + \sum_{k=1}^{K} \hat{\nu}_k \Phi_k$ with $\hat{V} = \hat{\phi} I_n$ for continuous traits in linear mixed models, and $\hat{V} = \mathrm{diag}\{1/(\hat{\mu}_{0i}(1 - \hat{\mu}_{0i}))\}$ for binary traits in logistic mixed models (where the dispersion parameter $\phi$ is known to be 1).

On the other hand, if we test $H_0: \theta = 0$ under the assumption $\beta_0 = 0$, a variance component score-type test SMMAT-S can be constructed as

$$T_S = \frac{(y - \hat{\mu}_0)^T G W W G^T (y - \hat{\mu}_0)}{\hat{\phi}^2}.$$

Under $H_0: \theta = 0$, $T_S$ asymptotically follows $\sum_{j=1}^{q} \xi_{Sj} \chi_{1,j}^2$, where $\chi_{1,j}^2$ are independent chi-square distributions with 1 df, and $\xi_{Sj}$ are the eigenvalues of $\Xi_S = W G^T \hat{P} G W$.
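As a minimal numerical sketch of the two statistics above (not the GMMAT implementation), the following Python/NumPy code builds the projection matrix from a given covariance matrix and covariates, then evaluates T_B, ξ_B, T_S, and the eigenvalues of Ξ_S. The function names, the toy data, and the assumption that the variance components are already known are all illustrative. For a linear mixed model with known variance components, the residual y − μ̂_0 (including predicted random effects) equals φ̂ P̂ y, which is what the toy example uses.

```python
import numpy as np

def null_projection(Sigma, X):
    """P = Sigma^-1 - Sigma^-1 X (X' Sigma^-1 X)^-1 X' Sigma^-1."""
    Sigma_inv = np.linalg.inv(Sigma)
    SiX = Sigma_inv @ X
    return Sigma_inv - SiX @ np.linalg.solve(X.T @ SiX, SiX.T)

def burden_and_skat(resid, G, w, P, phi):
    """Return T_B, xi_B, T_S, and the eigenvalues of Xi_S = W G' P G W."""
    U = w * (G.T @ resid) / phi                 # W G'(y - mu0) / phi, length q
    T_B = float(U.sum() ** 2)                   # burden statistic
    T_S = float(U @ U)                          # SKAT-type statistic
    Xi_S = w[:, None] * (G.T @ P @ G) * w[None, :]
    xi_B = float(Xi_S.sum())                    # 1_q' Xi_S 1_q
    eig_S = np.linalg.eigvalsh(Xi_S)
    return T_B, xi_B, T_S, eig_S

# Toy data (hypothetical): 60 subjects, 5 variants, intercept plus one covariate.
rng = np.random.default_rng(1)
n, q = 60, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.binomial(2, 0.1, size=(n, q)).astype(float)
A = rng.normal(size=(n, n))
Phi = A @ A.T / n                               # stand-in for a relatedness matrix
phi, nu = 1.0, 0.5                              # variance components assumed known
Sigma = phi * np.eye(n) + nu * Phi
y = rng.multivariate_normal(np.zeros(n), Sigma)
w = np.ones(q)                                  # equal weights for simplicity

P = null_projection(Sigma, X)
resid = phi * (P @ y)                           # y - mu0_hat in a linear mixed model
T_B, xi_B, T_S, eig_S = burden_and_skat(resid, G, w, P, phi)
```

In this sketch T_B / xi_B would be referred to a chi-square distribution with 1 df, and T_S to the mixture of chi-squares weighted by eig_S; computing the mixture p value needs a dedicated routine that is not shown here.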

If one assumes $\beta_0$ has mean 0 and variance $\gamma$, $\beta$ then follows a distribution with mean 0 and covariance $\tau W \{(1 - \rho) I_q + \rho 1_q 1_q^T\} W$, where $\tau = \gamma + \theta$ and $\rho = \gamma / (\gamma + \theta)$, which takes values between 0 and 1. The joint null hypothesis $H_0: \beta_0 = 0$ and $\theta = 0$ is equivalent to $H_0: \tau = 0$. Given $\rho$, a variance component score-type test can be constructed as

$$T_\rho = \rho T_B + (1 - \rho) T_S.$$
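The covariance structure and the form of the statistic can be checked in a few lines. The derivation below is a sketch that uses only the definitions given for Equation 1, plus the assumption that the random mean parameter is independent of the variant-specific deviations, so that the law of total covariance applies.

```latex
\begin{aligned}
\operatorname{Cov}(\beta)
  &= E\{\operatorname{Cov}(\beta \mid \beta_0)\} + \operatorname{Cov}\{E(\beta \mid \beta_0)\} \\
  &= \theta W^2 + \gamma\, W 1_q 1_q^T W \\
  &= W \{\theta I_q + \gamma 1_q 1_q^T\} W \\
  &= \tau\, W \{(1 - \rho) I_q + \rho 1_q 1_q^T\} W,
  \qquad \tau = \gamma + \theta,\quad \rho = \gamma / (\gamma + \theta), \\[4pt]
\frac{(y - \hat{\mu}_0)^T G W \{(1 - \rho) I_q + \rho 1_q 1_q^T\} W G^T (y - \hat{\mu}_0)}{\hat{\phi}^2}
  &= (1 - \rho) T_S + \rho T_B .
\end{aligned}
```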

If $\rho = 1$, $T_\rho$ becomes the SMMAT-B burden statistic $T_B$, which assumes $\beta$ are the same for all $q$ variants after weighting. If $\rho = 0$, $T_\rho$ becomes the SMMAT-S SKAT statistic $T_S$. If an optimal $\rho$ is obtained by minimizing the p value of $T_\rho$, then SMMAT-O can be constructed, with its p value calculated using a one-dimensional numerical integration, following SKAT-O.14 A key advantage of SMMAT-O is that it maximizes the power by using the optimal linear combination of the mixed model burden test SMMAT-B and the mixed model SKAT SMMAT-S. As it requires a grid search over $\rho$, it is computationally considerably more expensive than SMMAT-B and SMMAT-S. We propose in the next section a computationally much more efficient method to combine SMMAT-B and SMMAT-S.

SMMAT-E

An alternative joint test to SMMAT-O for $H_0: \beta_0 = 0$ and $\theta = 0$ can be constructed using two asymptotically independent tests: a test for $H_0: \beta_0 = 0$ versus $H_1: \beta_0 \neq 0$ under the constraint $\theta = 0$, and a test for $H_0: \theta = 0$ versus $H_1: \theta > 0$ with $\beta_0$ as a nuisance parameter that is estimated under $H_0: \theta = 0$. In unrelated samples, this testing strategy is a special case of MiST adjusting for the genotype burden score as a single fixed-effects covariate,15 which requires the burden model to be fit for each SNP set. We note that the first test is SMMAT-B $T_B$ in the SMMAT framework, and the second test $T_\theta$ can be constructed from the null burden GLMM

$$g(\mu_{Bi}) = X_i \alpha + G_i W 1_q \beta_0 + b_i, \qquad \text{(Equation 3)}$$

where $\mu_{Bi} = E(y_i \mid X_i, G_i W 1_q, b_i)$ is the mean of $y_i$ in the burden GLMM. We can construct a SKAT-type statistic adjusting for the genetic burden

$$T_\theta = \frac{(y - \tilde{\mu}_B)^T G W W G^T (y - \tilde{\mu}_B)}{\tilde{\phi}^2},$$

where $\tilde{\mu}_B$ is a vector of fitted values $\tilde{\mu}_{Bi}$ using the burden GLMM in Equation 3 for a given variant set. However, fitting this burden GLMM separately for each variant set is computationally expensive in large-scale whole-genome association studies.

Therefore, we propose a different computationally efficient strategy by assuming that the mean of genetic effects $\beta_0$ is not large, a reasonable assumption for most genomic regions and most complex human diseases. Then we can construct $T_\theta$ efficiently without refitting the burden GLMMs in Equation 3 for each variant set across the genome. We show in Appendix A that $T_\theta$ can be approximated by

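$$T_\theta \approx \hat{\phi}^{-2} (y - \hat{\mu}_0)^T G W \{I_q - 1_q (1_q^T W G^T \hat{P} G W 1_q)^{-1} 1_q^T W G^T \hat{P} G W\} \{I_q - W G^T \hat{P} G W 1_q (1_q^T W G^T \hat{P} G W 1_q)^{-1} 1_q^T\} W G^T (y - \hat{\mu}_0).$$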

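Therefore, under $H_0: \theta = 0$, $T_\theta$ asymptotically approximately follows $\sum_{j=1}^{q} \xi_{\theta j} \chi_{1,j}^2$, where $\chi_{1,j}^2$ are independent chi-square distributions with 1 df, and $\xi_{\theta j}$ are the eigenvalues of $\Xi_\theta = W G^T \hat{P} G W - W G^T \hat{P} G W 1_q (1_q^T W G^T \hat{P} G W 1_q)^{-1} 1_q^T W G^T \hat{P} G W$. By the central limit theorem, both $W G^T (y - \tilde{\mu}_B) / \tilde{\phi}$ and $1_q^T W G^T (y - \hat{\mu}_0) / \hat{\phi}$ are asymptotically normal, and their covariance matrix is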

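$$\mathrm{Cov}\left(\frac{W G^T (y - \tilde{\mu}_B)}{\tilde{\phi}}, \frac{1_q^T W G^T (y - \hat{\mu}_0)}{\hat{\phi}}\right) \approx \{I_q - W G^T \hat{P} G W 1_q (1_q^T W G^T \hat{P} G W 1_q)^{-1} 1_q^T\} W G^T \hat{P} G W 1_q = 0.$$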

Therefore, $T_\theta$ and $T_B$ are approximately asymptotically independent. Let $p_\theta$ and $p_B$ be the p values of the two tests, respectively; then the SMMAT-E p value $p_E$ is computed using Fisher’s method with a chi-square distribution with 4 df as $p_E = P(\chi_4^2 > -2\log(p_\theta p_B))$.
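Because the chi-square distribution with 4 df has a closed-form tail probability, exp(−x/2)(1 + x/2), Fisher’s combination of two p values needs no special library. The short Python sketch below illustrates this; the function name is hypothetical, and both inputs are assumed to be strictly positive p values from independent tests.

```python
import math

def smmat_e_pvalue(p_theta, p_burden):
    """Fisher's method for two independent p values (chi-square with 4 df)."""
    # Sum the logs instead of multiplying first, to limit underflow for tiny p values.
    x = -2.0 * (math.log(p_theta) + math.log(p_burden))
    # Survival function of a chi-square with 4 df: exp(-x/2) * (1 + x/2).
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

p_E = smmat_e_pvalue(0.05, 0.05)
```

Equivalently, writing p for the product of the two p values, the combined p value is p(1 − log p).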

Meta-analysis

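SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E can all be conducted in the meta-analysis context. Assuming the single-variant scores $S = G^T (y - \hat{\mu}_0) / \hat{\phi}$ and their covariance matrix $\Psi = G^T \hat{P} G$ are computed for each variant set in each study, we can reconstruct $T_B = S^T W 1_q 1_q^T W S$ with $\xi_B = 1_q^T W \Psi W 1_q$; $T_S = S^T W W S$ with $\Xi_S = W \Psi W$; $T_\rho = \rho T_B + (1 - \rho) T_S$; and $T_\theta = S^T W \{I_q - 1_q (1_q^T W \Psi W 1_q)^{-1} 1_q^T W \Psi W\} \{I_q - W \Psi W 1_q (1_q^T W \Psi W 1_q)^{-1} 1_q^T\} W S$ with $\Xi_\theta = W \Psi W - W \Psi W 1_q (1_q^T W \Psi W 1_q)^{-1} 1_q^T W \Psi W$.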
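The reconstruction from summary statistics can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the GMMAT code: the function name, the dictionary keys, and the randomly generated S and Ψ are hypothetical, and the weights are passed as the vector of diagonal entries of W.

```python
import numpy as np

def smmat_from_summary(S, Psi, w):
    """Rebuild the test statistics from scores S (length q) and covariance Psi (q x q)."""
    q = S.shape[0]
    one = np.ones(q)
    U = w * S                                    # W S
    A = w[:, None] * Psi * w[None, :]            # W Psi W, i.e. Xi_S
    A1 = A @ one                                 # W Psi W 1_q
    xi_B = float(one @ A1)                       # 1_q' W Psi W 1_q
    M = np.eye(q) - np.outer(A1, one) / xi_B     # I_q - W Psi W 1_q (xi_B)^-1 1_q'
    MU = M @ U
    return {
        "T_B": float(U.sum() ** 2),
        "xi_B": xi_B,
        "T_S": float(U @ U),
        "Xi_S": A,
        "T_theta": float(MU @ MU),               # S' W M' M W S
        "Xi_theta": A - np.outer(A1, A1) / xi_B,
        "M": M,
    }

# Toy summary statistics (hypothetical values).
rng = np.random.default_rng(7)
q = 6
B = rng.normal(size=(40, q))
Psi = B.T @ B                                    # a positive semi-definite covariance
S = rng.normal(size=q)
w = rng.uniform(0.5, 2.0, size=q)

out = smmat_from_summary(S, Psi, w)
T_rho = 0.3 * out["T_B"] + 0.7 * out["T_S"]      # T_rho at rho = 0.3
```

Two identities visible in the formulas hold numerically here: Ξ_θ annihilates the vector of ones, and the adjusted score has zero covariance with the burden score.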

For each variant set, let $m = 1, 2, \ldots, M$ be the index of studies and $S_m$ and $\Psi_m$ be the single-variant scores and covariance matrix from study $m$. In testing the “weak” null hypothesis28 of summary genetic effects $H_0: \beta = 0$,25, 27 we can compute meta summary statistics $S = \sum_{m=1}^{M} S_m$ and $\Psi = \sum_{m=1}^{M} \Psi_m$ and use them in SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E. If a genetic variant is monomorphic in a study, its single-variant score statistic and the corresponding row and column in the covariance matrix will be set to 0 for that study. When combining studies with very different sample characteristics, testing the “strong” null hypothesis28 that genetic effects in all studies are 0 is sometimes desired. In the general case, we may choose to group studies that are similar and test whether the summary genetic effects in all groups are 0, for example, in the meta-analysis of multi-ethnic samples. Let $c = 1, 2, \ldots, C$ index a partition of the $M$ studies ($C \le M$), where $C$ is the number of ethnicities, and let $S_{cm}$ and $\Psi_{cm}$ be the single-variant scores and covariance matrix from study $m$ in partition $c$ ($m = 1, 2, \ldots, M_c$ in partition $c$, and $\sum_{c=1}^{C} M_c = M$), such that genetic effects for the same variant are summarized within each partition $c$ but heterogeneous across partitions.27 We can then compute summary statistics $S = (\sum_{m=1}^{M_1} S_{1m}^T\ \sum_{m=1}^{M_2} S_{2m}^T\ \cdots\ \sum_{m=1}^{M_C} S_{Cm}^T)^T$ and $\Psi = \mathrm{diag}\{\sum_{m=1}^{M_c} \Psi_{cm}\}$. Note that $S$ is now a vector of length $Cq$ and $\Psi$ is a block-diagonal matrix with $C$ blocks of $q \times q$ matrices, one for each partition of studies (with total dimension $Cq \times Cq$), so we should replace $W$, $1_q$, and $I_q$ by $I_C \otimes W$ (where $\otimes$ denotes the Kronecker product), $1_{Cq}$, and $I_{Cq}$, respectively, in the above expressions for $T_B$, $T_\rho$, $T_S$, and $T_\theta$ for meta-analysis.
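The two ways of combining studies can be sketched as follows. The helper names and the toy studies are hypothetical; the code sums scores and covariance matrices across all studies for the single-group case, and stacks per-partition sums into a longer score vector and a block-diagonal covariance matrix for the partitioned case.

```python
import numpy as np

def combine_hom(scores, covs):
    """All studies in one group: sum the scores and the covariance matrices."""
    return np.sum(scores, axis=0), np.sum(covs, axis=0)

def combine_het(groups):
    """groups holds one (scores, covs) pair per partition; returns stacked S, block-diagonal Psi."""
    C = len(groups)
    q = len(groups[0][0][0])
    S = np.concatenate([np.sum(scores, axis=0) for scores, _ in groups])
    Psi = np.zeros((C * q, C * q))
    for c, (_, covs) in enumerate(groups):
        Psi[c * q:(c + 1) * q, c * q:(c + 1) * q] = np.sum(covs, axis=0)
    return S, Psi

def burden(S, Psi, w_diag):
    """T_B and xi_B, with the diagonal of the weight matrix given as a vector."""
    U = w_diag * S
    return float(U.sum() ** 2), float(w_diag @ Psi @ w_diag)

# Four toy studies (hypothetical), split into two partitions of two studies each.
rng = np.random.default_rng(3)
q = 4

def toy_study():
    B = rng.normal(size=(30, q))
    return rng.normal(size=q), B.T @ B

studies = [toy_study() for _ in range(4)]
scores = [s for s, _ in studies]
covs = [p for _, p in studies]
w = rng.uniform(0.5, 2.0, size=q)

S_hom, Psi_hom = combine_hom(scores, covs)
S_het, Psi_het = combine_het([(scores[:2], covs[:2]), (scores[2:], covs[2:])])
W_het = np.kron(np.eye(2), np.diag(w))           # I_C (Kronecker) W
w_het = np.diag(W_het)                           # its diagonal: w repeated C times

TB_hom, xiB_hom = burden(S_hom, Psi_hom, w)
TB_het, xiB_het = burden(S_het, Psi_het, w_het)
```

In this sketch the burden statistic and its scaling constant come out identical under both strategies, because the stacked vector of ones simply adds the partition sums back together; the SKAT-type statistics do differ between the two.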

Simulation Studies

Type I Error in Single-Cohort Studies

We performed coalescent simulations to generate sequence data with 100 genetic variants in each set, and 10,000 independent sets for 8,000 individuals from a 20 × 20 grid of spatially continuous populations with migration rate between adjacent cells M = 10 (Figure 1A). Within each cell, we paired 20 individuals into 10 families and simulated 2 children for each family using gene dropping,29 and in total we had 4,000 families and 16,000 individuals. For continuous traits, in each simulation replicate, we simulated the phenotype $y_{ij}$ for individual $j$ in family $i$ under the null hypothesis of no genetic association from

$$y_{ij} = \alpha_1 Z_i + b_{ij} + \varepsilon_{ij}, \qquad \text{(Equation 4)}$$

where the “population effect” $\alpha_1 = 1$ and the population indicator $Z_i = 1$ if family $i$ was from a 10 × 10 grid in the top left of the map (population 1) and $Z_i = 0$ otherwise (population 2). The familial random effects were simulated as

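$$b_i = \begin{pmatrix} b_{i1} \\ b_{i2} \\ b_{i3} \\ b_{i4} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.5 & 0 & 0.25 & 0.25 \\ 0 & 0.5 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.5 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.5 \end{pmatrix} \right), \qquad \text{(Equation 5)}$$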

and the random error $\varepsilon_{ij} \sim N(0, 1)$ for each individual $j$ in family $i$. Then we randomly sampled 3,500 individuals from the 10 × 10 grid in the top left and 6,500 individuals from the rest of the map. The family identifier was removed for all individuals in the analysis, so that there were both population structure and cryptic relatedness in the sample. We compared SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E in analyzing 10,000 independent variant sets based on a linear mixed model using our GMMAT package, including random effects with their covariance matrix proportional to the GRM, and adjusted for the first ten principal components (PCs) of ancestry. We repeated this 4,000 times to get p values combined from 40 million independent genetic variant sets for each test.
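The null phenotype model in Equations 4 and 5 can be sketched in a few lines of NumPy. The reading of the Equation 5 covariance as 0.5 times twice the kinship matrix of a nuclear family (two unrelated parents followed by two children) is an interpretation, and the ordering of families so that the first 1,000 (100 cells × 10 families) belong to population 1 is an assumption made for this toy example only.

```python
import numpy as np

# Kinship coefficients for a nuclear family: members 1-2 are parents, 3-4 their children.
kinship = np.array([
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
])
sigma2_b = 0.5                                   # assumed familial variance component
cov_b = sigma2_b * 2.0 * kinship                 # covariance matrix in Equation 5

rng = np.random.default_rng(2019)
n_families = 4000
alpha1 = 1.0                                     # population effect
Z = (np.arange(n_families) < 1000).astype(float) # 1 for population 1 (assumed ordering)

b = rng.multivariate_normal(np.zeros(4), cov_b, size=n_families)
eps = rng.normal(size=(n_families, 4))
y = alpha1 * Z[:, None] + b + eps                # Equation 4, one row per family
```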

Figure 1. Map of Spatially Continuous Populations from Which Genotypes Were Simulated Based on the Coalescent Model

(A) Map for a single-cohort simulation study: the top left 10 × 10 grid formed population 1, and the rest formed population 2.

(B) Map for a meta-analysis simulation study: scenario A studies were unrelated individuals sampled from population 1 only; scenario B studies were related individuals sampled from specific regions in population 1 and population 2; scenario C studies were unrelated individuals sampled from specific regions in population 1 and population 2; and scenario D studies were related individuals sampled from specific regions in population 2 only.

For binary traits, in each simulation replicate, we simulated the phenotype $y_{ij}$ for individual $j$ in family $i$ under the null hypothesis of no genetic association from

$$\log\left(\frac{P(y_{ij} = 1)}{1 - P(y_{ij} = 1)}\right) = \alpha_0 + b_{ij}, \qquad \text{(Equation 6)}$$

where $\alpha_0$ was chosen such that the disease prevalence was 0.01 in all populations, and the familial random effects $b_{ij}$ were simulated in the same way as for continuous traits. Then we randomly sampled 2,500 case subjects and 1,000 control subjects from the 10 × 10 grid in the top left (population 1), and 2,500 case subjects and 4,000 control subjects from the rest of the map (population 2) to form a hypothetical study with balanced case and control subjects in combined populations. Therefore, there was confounding by population structure resulting from unequal sampling, even though the disease prevalence was the same. We removed the family identifier, compared SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E in analyzing 10,000 independent variant sets based on a logistic mixed model using our GMMAT package, similarly as described above, and repeated this 4,000 times to get p values combined from 40 million independent genetic variant sets for each test.
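The text does not state how the intercept was obtained. One possible approach, sketched here as an assumption, is to integrate the logistic function over the marginal distribution of a single random effect (variance 0.5, the diagonal of Equation 5) and solve for the intercept by bisection. The function names, the integration grid, and the search interval are all illustrative choices.

```python
import math

def prevalence(alpha0, var_b=0.5, n_grid=2001, z_max=8.0):
    """E[expit(alpha0 + b)] for b ~ N(0, var_b), using the trapezoidal rule."""
    sd = math.sqrt(var_b)
    h = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = -z_max + k * h
        weight = h if 0 < k < n_grid - 1 else h / 2.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * density / (1.0 + math.exp(-(alpha0 + sd * z)))
    return total

def solve_alpha0(target=0.01, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection on alpha0; prevalence() is increasing in alpha0."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prevalence(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha0 = solve_alpha0()
```

Because the random effect spreads the linear predictor around the intercept, the solution lies below the logit of 0.01 that would apply without random effects.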

Type I Error in Meta-analysis

We also conducted simulation studies in the meta-analysis context to evaluate the type I error rates. We considered four scenarios: unrelated individuals, without confounding by population structure (scenario A studies); related individuals, with confounding by population structure (scenario B studies); unrelated individuals, with confounding by population structure (scenario C studies); and related individuals, without confounding by population structure (scenario D studies).

For scenario A studies, we simulated 16 unrelated individuals in each cell from the 10 × 10 grid in the top left of the map (Figure 1B). For continuous traits, we simulated the phenotype $y_{ij}$ from Equation 4, with $\alpha_1 = 0$ and $b_{ij} = 0$, and randomly sampled 1,000 individuals. For binary traits, we simulated $y_{ij}$ from Equation 6, with $b_{ij} = 0$, and randomly sampled 500 case subjects and 500 control subjects.

For scenario B studies, we simulated eight unrelated individuals, paired them into four families, and simulated two children for each family in each cell from the 10 × 10 grid in the center of the map (Figure 1B). For continuous traits, we simulated the phenotype $y_{ij}$ from Equation 4, with $\alpha_1 = 1$ and the population indicator $Z_i = 1$ if family $i$ was from population 1, and $Z_i = 0$ if from population 2. Familial random effects $b_{ij}$ were simulated using Equation 5, and we randomly sampled 350 individuals from population 1 and 650 individuals from population 2. For binary traits, we simulated $y_{ij}$ from Equation 6, with $b_{ij}$ from Equation 5, and randomly sampled 250 case subjects and 100 control subjects from population 1, and 250 case subjects and 400 control subjects from population 2.

For scenario C studies, we simulated 16 unrelated individuals in each cell from the 20 × 5 grid in the top of the map (Figure 1B). For continuous traits, we simulated the phenotype $y_{ij}$ from Equation 4, with $\alpha_1 = 1$, the population indicator $Z_i = 1$ if family $i$ was from population 1 and $Z_i = 0$ if from population 2, and $b_{ij} = 0$, and we randomly sampled 350 individuals from population 1 and 650 individuals from population 2. For binary traits, we simulated $y_{ij}$ from Equation 6, with $b_{ij} = 0$, and randomly sampled 250 case subjects and 100 control subjects from population 1 and 250 case subjects and 400 control subjects from population 2.

For scenario D studies, we simulated 8 unrelated individuals, paired them into 4 families, and simulated 2 children for each family in each cell from the 20 × 5 grid in the bottom of the map (Figure 1B). For continuous traits, we simulated the phenotype $y_{ij}$ from Equation 4, with $\alpha_1 = 0$ and familial random effects $b_{ij}$ simulated using Equation 5, and we randomly sampled 1,000 individuals. For binary traits, we simulated $y_{ij}$ from Equation 6, with $b_{ij}$ from Equation 5, and randomly sampled 500 case subjects and 500 control subjects.

In each simulation replicate, we simulated 3 studies from each scenario, totaling 12 studies with a combined sample size of 12,000 (6,000 case subjects and 6,000 control subjects for binary traits). We compared SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E using two meta-analysis strategies: all studies in the same group, and scenario A, B, C, and D studies in four separate groups. In the latter case, three studies from the same scenario were grouped in the same partition with shared genetic effects, while studies from different scenarios were allowed to have heterogeneous genetic effects. Variants are included in the meta-analysis as long as they are polymorphic in at least one of the 12 studies. We repeated 4,000 simulation replicates to get p values from 40 million independent genetic variant sets.

Power

We used the same genotype data as in the single-cohort type I error simulations and evaluated the empirical power of SMMAT-B, SMMAT-S, SMMAT-O, SMMAT-E, and the GLMM extension of MiST (GLMM-MiST) that combines the p value of SMMAT-B (Equation 2) and the p value of SMMAT-S (Equation 3) using Fisher’s method. All tests were performed using weights equal to a beta distribution density function with parameters 1 and 25 on the MAF of each variant.13 We considered 9 scenarios, with the proportion of causal variants in a test unit changing from 10% to 20% to 50%, and the proportion of variants with negative effects out of causal variants changing from 100% to 80% to 50%. For continuous traits, we simulated the phenotype $y_{ij}$ for individual $j$ in family $i$ from

$$y_{ij} = \alpha_1 Z_i + \sum_l G_{ijl} \beta_l + b_{ij} + \varepsilon_{ij},$$

where $\alpha_1 = 1$, the population indicator $Z_i = 1$ if family $i$ was from population 1 and $Z_i = 0$ if from population 2, $G_{ijl}$ was the centered genotype for causal variant $l$ of individual $j$ in family $i$, the causal effect size was $|\beta_l| = c\,|\log_{10} \mathrm{MAF}_l|$ for variant $l$ with $\mathrm{MAF}_l$, where the constant $c$ was set to 0.2, 0.1, and 0.05 when the proportion of causal variants was 10%, 20%, and 50%, the familial random effects $b_{ij}$ were simulated using Equation 5, and the random error $\varepsilon_{ij} \sim N(0, 1)$. We randomly sampled 35% individuals from population 1 and 65% individuals from population 2.
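Both the variant weights and the causal effect sizes are simple functions of the MAF. The sketch below evaluates the beta(1, 25) density used as the weight and the effect-size magnitude c|log10 MAF|; the function names are hypothetical, and the MAF is assumed to lie strictly between 0 and 1.

```python
import math

def beta_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density evaluated at the MAF, computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(maf) + (b - 1.0) * math.log1p(-maf))

def causal_effect_size(maf, c):
    """Magnitude of the causal effect, |beta_l| = c * |log10(MAF_l)|."""
    return c * abs(math.log10(maf))

for maf in (0.0001, 0.001, 0.01, 0.05):
    print(maf, beta_weight(maf), causal_effect_size(maf, c=0.2))
```

With parameters 1 and 25 the density reduces to 25(1 − MAF)^24, so rarer variants receive larger weights, and the effect-size rule likewise gives rarer causal variants larger effects.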

For binary traits, we simulated the phenotype $y_{ij}$ for individual $j$ in family $i$ from

$$\log\left(\frac{P(y_{ij} = 1)}{1 - P(y_{ij} = 1)}\right) = \alpha_0 + \sum_l G_{ijl} \beta_l + b_{ij},$$

where $\alpha_0$ was chosen such that the disease prevalence was 0.01 in all populations, $G_{ijl}$ was the centered genotype for causal variant $l$ of individual $j$ in family $i$, the causal effect size was $|\beta_l| = c\,|\log_{10} \mathrm{MAF}_l|$ for variant $l$ with $\mathrm{MAF}_l$, where the constant $c$ was set to 0.3, 0.2, and 0.1 when the proportion of causal variants was 10%, 20%, and 50%, and the familial random effects $b_{ij}$ were simulated using Equation 5. We randomly sampled 35% individuals (with 25% case subjects and 10% control subjects out of the total sample size) from population 1, and 65% individuals (with 25% case subjects and 40% control subjects out of the total sample size) from population 2 to form a hypothetical study with balanced case and control subjects in combined populations.

For both continuous and binary traits, we varied the total sample size from 2,000 to 5,000 to 10,000, repeated 1,000 simulation replicates for each scenario under the alternative hypothesis, and compared the empirical power at the significance level of 2.5 × 10⁻⁶.

TOPMed Example Involving Fibrinogen Levels

Samples with both plasma fibrinogen measures and whole-genome sequence data (Freeze 5b) from the following 11 TOPMed studies were included in the analysis: the Old Order Amish Study (Amish), Cleveland Family Study (CFS), Genetic Epidemiology of COPD Study (COPDGene), Framingham Heart Study (FHS), Jackson Heart Study (JHS), San Antonio Family Study (SAFS), the Atherosclerosis Risk in Communities (ARIC) Study, Genetic Studies of Atherosclerosis Risk (GeneSTAR), Genetic Epidemiology Network of Arteriopathy (GENOA), the Multi-Ethnic Study of Atherosclerosis (MESA), and Women’s Health Initiative (WHI). The TOPMed studies were approved by institutional review boards at participating institutions, and informed consent was obtained from all study participants. Amish, CFS, FHS, JHS, and SAFS are family-based studies with differing degrees of relatedness. The total sample size was 23,763. Within each study and each ethnicity, measured fibrinogen levels were adjusted for age, sex, and study-specific covariates, and the residuals were rank normalized and rescaled by multiplying by the original standard deviation, so that the transformed phenotype data have the same variances as on the original scale. The transformed phenotype data were pooled together in the analysis, using a heteroscedastic linear mixed model30 allowing for different residual variances in each study/ethnicity, adjusting for study, ethnicity, sequence center, and top ten ancestry PCs31 as fixed-effects covariates, and including a GRM calculated by mixed model analysis for pedigrees and populations (MMAP) to model the random effects for relatedness. 
Rare and low-frequency genetic variants on chromosome 4 with MAF less than 5%, including all singletons and extremely rare variants, were included in our rare variant association analysis of fibrinogen levels using the sliding window method32 with 4 kb non-overlapping windows, using SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E with weights equal to a beta distribution density function with parameters 1 and 25 on the MAF of each variant.13 As sensitivity analyses, we also included 1 kb, 10 kb, and 40 kb non-overlapping sliding windows, as well as an analysis using 4 kb windows with no ancestry PC adjustment. The analyses were performed using the GMMAT App (v.0.9.3), which includes the implementation of the SMMAT method, with 32 parallel threads on a single computing node with 240 GB total memory in the Analysis Commons.33 To benchmark the computational speed in running SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E, we also ran re-analyses to perform each test separately, using summary statistics from the sliding window analysis and a single thread on a computing node with 15 GB total memory in the Analysis Commons.
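Grouping variants into non-overlapping windows amounts to integer division of the base-pair position by the window length. The sketch below illustrates this under stated assumptions: positions are taken to be 1-based, windows are anchored at position 1, and the function name and input layout (a list of position and MAF pairs) are hypothetical rather than the GMMAT App interface.

```python
from collections import defaultdict

def sliding_window_sets(variants, window_bp=4000, maf_max=0.05):
    """Group (position, maf) pairs into non-overlapping windows, keeping variants with MAF below maf_max."""
    windows = defaultdict(list)
    for pos, maf in variants:
        if 0.0 < maf < maf_max:
            # Window k covers positions k * window_bp + 1 through (k + 1) * window_bp.
            windows[(pos - 1) // window_bp].append(pos)
    return dict(windows)

variants = [(1, 0.01), (4000, 0.2), (4001, 0.001), (12000, 0.049)]
sets = sliding_window_sets(variants)
```

Changing window_bp to 1,000, 10,000, or 40,000 would give the window sizes used in the sensitivity analyses.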

Results

Simulation Studies

Table 1 shows the empirical type I error rates of SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E at significance levels of 0.05, 0.0001, and 2.5 × 10⁻⁶ in the variant set analyses of continuous and binary traits in single-cohort simulation studies. All four tests have well-controlled type I error rates at these significance levels, suggesting that GLMMs can be effective in adjusting for population structure and cryptic relatedness in complex study samples. This is also consistent with the quantile-quantile (QQ) plots in Figure 2, which show neither inflation nor deflation in the tail.

Table 1. Empirical Type I Error Rates of SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E in Single-Cohort Simulation Studies at Significance Levels of 0.05, 0.0001, and 2.5 × 10⁻⁶

| Test | Continuous Traits, Level 0.05 | Continuous Traits, Level 0.0001 | Continuous Traits, Level 2.5 × 10⁻⁶ | Binary Traits, Level 0.05 | Binary Traits, Level 0.0001 | Binary Traits, Level 2.5 × 10⁻⁶ |
| --- | --- | --- | --- | --- | --- | --- |
| SMMAT-B | 0.047 | 8.7 × 10⁻⁵ | 2.0 × 10⁻⁶ | 0.049 | 9.6 × 10⁻⁵ | 2.0 × 10⁻⁶ |
| SMMAT-S | 0.048 | 8.7 × 10⁻⁵ | 2.0 × 10⁻⁶ | 0.049 | 9.5 × 10⁻⁵ | 2.3 × 10⁻⁶ |
| SMMAT-O | 0.050 | 1.1 × 10⁻⁴ | 3.0 × 10⁻⁶ | 0.052 | 1.2 × 10⁻⁴ | 3.0 × 10⁻⁶ |
| SMMAT-E | 0.050 | 1.0 × 10⁻⁴ | 3.0 × 10⁻⁶ | 0.050 | 9.9 × 10⁻⁵ | 2.0 × 10⁻⁶ |

The total sample size was 10,000, and results from 4,000 simulation replicates were combined to get 40 million genetic variant sets.
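To judge how far an empirical rate may drift from its nominal level by chance alone, a binomial approximation is a useful yardstick. The sketch below assumes the 40 million variant-set tests are independent; the function name is hypothetical.

```python
import math

def type1_error_summary(alpha, n_tests):
    """Expected number of rejections and Monte Carlo standard error of the empirical rate."""
    expected_rejections = alpha * n_tests
    std_error = math.sqrt(alpha * (1.0 - alpha) / n_tests)
    return expected_rejections, std_error

n_tests = 40_000_000
for alpha in (0.05, 0.0001, 2.5e-6):
    expected, se = type1_error_summary(alpha, n_tests)
    print(f"level {alpha:g}: expected rejections {expected:.0f}, standard error {se:.2e}")
```

At the 2.5 × 10⁻⁶ level this gives about 100 expected rejections and a standard error of about 2.5 × 10⁻⁷, so empirical rates between roughly 2.0 × 10⁻⁶ and 3.0 × 10⁻⁶ lie within about two standard errors of nominal.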

Figure 2. Quantile-Quantile Plots of SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E in the Analysis of 10,000 Samples in Single-Cohort Studies with Both Population Structure and Cryptic Relatedness, under the Null Hypothesis of No Genetic Association

(A) Continuous traits in linear mixed models.

(B) Binary traits in logistic mixed models.

Table 2 and Figure 3 show simulation results of SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E in meta-analyses combining four types of studies (with and without confounding by population structure, with and without cryptic relatedness), assuming either that all studies are in the same group (hom) or that they fall into four separate groups (het). We note that the SMMAT-B statistic TB has the same form under these two meta-analysis strategies,27 so we included seven tests in the simulation studies. In het SMMAT-S, SMMAT-O, and SMMAT-E, studies from the same scenario were grouped together to assume shared genetic effects. Under the null hypothesis of no genetic associations, hom SMMAT-O shows very mild inflation in our simulation settings, but the other six tests in the SMMAT framework control type I error rates well at significance levels of 0.05, 0.0001, and 2.5 × 10−6 and have well-calibrated tail probabilities, for both continuous and binary traits.

Table 2.

Empirical Type I Error Rates of SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E Assuming All Studies in the Same Group (hom) and Scenario A, B, C, and D Studies in Four Separate Groups (het), in Meta-analysis Simulation Studies at Significance Levels of 0.05, 0.0001, and 2.5 × 10−6


Test           Continuous Traits                        Binary Traits
Level          0.05     0.0001        2.5 × 10−6        0.05     0.0001        2.5 × 10−6
SMMAT-B        0.051    1.0 × 10−4    2.6 × 10−6        0.051    1.1 × 10−4    2.5 × 10−6
Hom SMMAT-S    0.051    1.0 × 10−4    2.6 × 10−6        0.051    1.1 × 10−4    2.1 × 10−6
Het SMMAT-S    0.051    1.0 × 10−4    2.8 × 10−6        0.052    1.0 × 10−4    2.4 × 10−6
Hom SMMAT-O    0.053    1.3 × 10−4    4.0 × 10−6        0.053    1.4 × 10−4    3.4 × 10−6
Het SMMAT-O    0.052    1.1 × 10−4    2.6 × 10−6        0.052    1.1 × 10−4    2.2 × 10−6
Hom SMMAT-E    0.051    1.0 × 10−4    2.5 × 10−6        0.051    1.1 × 10−4    2.6 × 10−6
Het SMMAT-E    0.051    1.0 × 10−4    2.8 × 10−6        0.052    1.1 × 10−4    3.0 × 10−6

The total sample size was 12,000 from 12 studies, and results from 4,000 simulation replicates were combined to get 40 million genetic variant sets.

Figure 3.


Quantile-Quantile Plots of SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E in the Meta-analysis of 12 Studies with a Total Sample Size of 12,000, under the Null Hypothesis of No Genetic Association

(A) Continuous traits in linear mixed models, all studies in the same group.

(B) Binary traits in logistic mixed models, all studies in the same group.

(C) Continuous traits in linear mixed models, scenario A, B, C, and D studies in four separate groups.

(D) Binary traits in logistic mixed models, scenario A, B, C, and D studies in four separate groups.

Figures 4 and 5 present the empirical power for causal variant sets at the significance level of 2.5 × 10−6 for continuous and binary traits, respectively. The power increases with the sample size. As the proportion of causal variants with effects in the same direction drops from 100% to 80% to 50% in each row, the power drops for all tests, but most substantially for the burden test SMMAT-B. When the sample size is large (i.e., 10,000 samples), SMMAT-E and GLMM-MiST have the highest power, for both continuous and binary traits in all nine simulation scenarios. SMMAT-E and GLMM-MiST have almost the same power in all these settings, while GLMM-MiST requires fitting a separate GLMM for each variant set. When all genetic variants in a test unit are causal with large effects in the same direction (a simulation scenario in favor of SMMAT-B, see Supplemental Material and Methods for details), SMMAT-B has the highest power, followed by SMMAT-O and SMMAT-E or GLMM-MiST (Figures S1A and S1B). On the log scale, SMMAT-E and GLMM-MiST p values are very close (Figures S1C and S1D).

Figure 4.


Empirical Power of Linear Mixed Model-Based SMMAT-B, SMMAT-S, SMMAT-O, SMMAT-E, and GLMM-MiST in Continuous Trait Analysis of 2,000, 5,000, and 10,000 Samples

(A–C) 10% causal variants with 100% (A), 80% (B), or 50% (C) negative effects.

(D–F) 20% causal variants with 100% (D), 80% (E), or 50% (F) negative effects.

(G–I) 50% causal variants with 100% (G), 80% (H), or 50% (I) negative effects.

Effect sizes were simulated using the same parameter in each row, but different across rows.

Figure 5.


Empirical Power of Logistic Mixed Model-Based SMMAT-B, SMMAT-S, SMMAT-O, SMMAT-E, and GLMM-MiST in Binary Trait Analysis of 2,000, 5,000, and 10,000 Samples

(A–C) 10% causal variants with 100% (A), 80% (B), or 50% (C) negative effects.

(D–F) 20% causal variants with 100% (D), 80% (E), or 50% (F) negative effects.

(G–I) 50% causal variants with 100% (G), 80% (H), or 50% (I) negative effects.

Effect sizes were simulated using the same parameter in each row, but different across rows.

In the presence of genetic relatedness from multiple sources (see Supplemental Material and Methods for details), linear and logistic mixed models with single GRM random effects and with multiple random effects all control type I errors for continuous and binary traits (Figure S2). The multiple random effects model is more powerful than the single GRM random effects models for continuous traits in our simulation settings, although the single GRM random effects models with and without ancestry PC adjustment have almost the same power (Figure S3). For binary traits, compared to the single GRM random effects model adjusting for ten ancestry PCs as fixed effects, the multiple random effects model is slightly more powerful and the single GRM random effects model with no ancestry PC adjustment is generally slightly less powerful in our simulation settings (Figure S4).

TOPMed Example Involving Fibrinogen Levels

We compared the results from SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E in an analysis of fibrinogen levels, using chromosome 4 (including the genomic region that encodes the fibrinogen protein, FGB) whole-genome sequence data from 11 TOPMed studies. Previous studies have reported two rare variants within FGB on chromosome 4, rs6054 (hg38 position 154,568,456) and rs201909029 (hg38 position 154,567,636), that are associated with lower fibrinogen levels, with similar effect sizes in all ancestry groups.34 In the sliding window analysis, we grouped low-frequency and rare genetic variants with MAF less than 5% into 46,859 non-overlapping 4 kb windows containing at least one variant. The number of variants in each window passing the MAF filter ranged from 1 to 1,290, with a median of 351 (first quartile 326 and third quartile 380). The QQ plot (Figure 6A) shows that all four tests have well-calibrated tail probabilities. Table 3 summarizes heteroscedastic linear mixed model-based SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E p values in FGB and flanking regions. SMMAT-S, SMMAT-O, and SMMAT-E give their most significant results in the 4 kb window 154,554–154,558 kb, with p values 1.6 × 10−17, 8.9 × 10−17, and 6.2 × 10−19, respectively, whereas the SMMAT-B p value is much larger (6.9 × 10−5). In the 4 kb window that covers both known associated rare variants rs6054 and rs201909029 (window 154,566–154,570 kb), SMMAT-E gives the smallest p value (3.1 × 10−17), followed by SMMAT-S (p value 9.7 × 10−17), SMMAT-O (p value 3.3 × 10−16), and SMMAT-B (p value 1.6 × 10−8).

Figure 6.


TOPMed Fibrinogen Level SMMAT Analysis Results via a Heteroscedastic Linear Mixed Model on Rare Variants with MAF < 5% in Non-overlapping 4 kb Sliding Windows on Chromosome 4 (n = 23,763)

(A) Quantile-quantile plot.

(B) p values on the log scale versus physical positions of the windows on chromosome 4 (build hg38).

Table 3.

TOPMed Fibrinogen-Level SMMAT p Values in Known Association Gene FGB and Flanking Regions on Chromosome 4, using a Heteroscedastic Linear Mixed Model on Rare Variants with MAF < 5% (n = 23,763)

Start (kb)  End (kb)  No. of Variants  SMMAT-B       SMMAT-S       SMMAT-O       SMMAT-E
154,554     154,558   348              6.9 × 10−5    1.6 × 10−17   8.9 × 10−17   6.2 × 10−19
154,558     154,562   370              0.078         3.7 × 10−11   2.4 × 10−10   3.7 × 10−14
154,562     154,566   326              0.76          1.5 × 10−9    3.5 × 10−9    4.2 × 10−10
154,566     154,570   309              1.6 × 10−8    9.7 × 10−17   3.3 × 10−16   3.1 × 10−17
154,570     154,574   332              0.030         1.9 × 10−7    5.2 × 10−7    8.9 × 10−8
154,574     154,578   349              2.1 × 10−7    7.3 × 10−7    2.8 × 10−7    4.1 × 10−13
154,578     154,582   342              1.7 × 10−4    2.7 × 10−5    2.8 × 10−5    2.1 × 10−9

Physical positions of each window are on build hg38.

In this TOPMed data example, linear mixed models with and without adjusting for ten ancestry PCs as fixed-effects covariates gave very close p values (Figure S5). When we changed the window size from 4 kb to 1 kb (Figure S6), 10 kb (Figure S7), and 40 kb (Figure S8), the QQ plots showed that the analyses were well calibrated and the same association was identified. Regardless of the window size, SMMAT-E almost always gave the smallest p values, except for the 1 kb window 154,567–154,568 kb, which covers rs201909029. For this 1 kb window, none of the tests gave significant p values after adjusting for multiple testing, indicating a potential lack of power, since rs201909029 has a minor allele count of only 33 in our TOPMed samples (Table S1).

Computation Time

Table 4 shows the CPU time for running the sliding window analysis for 23,763 individuals with TOPMed whole-genome sequence data and fibrinogen levels, using summary statistics from 46,859 non-overlapping 4 kb windows on chromosome 4. The GMMAT App (v.0.9.3) in the Analysis Commons cloud computing platform has implemented SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E, with the option of running one or more tests in an analysis. SMMAT-B results are automatically included when running SMMAT-O or SMMAT-E, and SMMAT-S p values are also output when running SMMAT-O. Of the four tests in Table 4, SMMAT-B takes the shortest time, as its p value calculation does not involve any eigen-decomposition of covariance matrices. SMMAT-S takes only about 10 min longer than SMMAT-B for the eigen-decomposition of 46,859 covariance matrices. SMMAT-E takes about 12 min longer than SMMAT-S and gives both SMMAT-B and SMMAT-E p values. SMMAT-O takes 175 min longer than SMMAT-S, as more eigen-decompositions are performed in SMMAT-O when it searches for the optimal combination of SMMAT-B and SMMAT-S on a grid of ρ values. We did not include GLMM-MiST in the analysis, because it took 159 min of CPU time to fit a single GLMM for this TOPMed sample. By extrapolation, it would take more than 14 years of CPU time to analyze 23,763 related individuals with 46,859 windows using GLMM-MiST, which requires a separate GLMM fit for each window.

Table 4.

CPU Time in the TOPMed Fibrinogen Level SMMAT using Summary Statistics from a Sliding Window Analysis using Non-overlapping 4 kb Windows on Chromosome 4 (n = 23,763)

Test Time (min)
SMMAT-B 81
SMMAT-S 91
SMMAT-O 266
SMMAT-E 103

Tests were performed using the GMMAT App (v.0.9.3) with one single thread on a computing node with 15 GB total memory in the Analysis Commons.
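The GLMM-MiST extrapolation quoted in the text can be reproduced with simple arithmetic. The sketch below only multiplies the numbers reported above and assumes one GLMM fit per window at a constant 159 min; the comparison with SMMAT-E is indicative only, because the SMMAT-E time in Table 4 starts from precomputed summary statistics.

```python
GLMM_FIT_MIN = 159   # CPU minutes to fit one GLMM on the TOPMed sample (from the text)
N_WINDOWS = 46_859   # non-overlapping 4 kb windows on chromosome 4
SMMAT_E_MIN = 103    # SMMAT-E CPU minutes for all windows (Table 4)

# One GLMM fit per window, assuming every fit costs the same.
total_min = GLMM_FIT_MIN * N_WINDOWS
years = total_min / (60 * 24 * 365.25)

print(f"{total_min:,} CPU minutes, about {years:.1f} CPU years")
print(f"roughly {total_min / SMMAT_E_MIN:,.0f} times the SMMAT-E time in Table 4")
```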

Discussion

We have developed and implemented SMMAT, a family of computationally efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. This framework includes extensions of three widely used variant set tests for unrelated individuals to complex study samples with population structure and cryptic relatedness: the burden test (SMMAT-B), SKAT (SMMAT-S), and SKAT-O (SMMAT-O), as well as a new efficient hybrid test that combines the mixed model burden and SKAT tests (SMMAT-E). Specifically, SMMAT-E is constructed by combining the burden test and an adjusted mixed model SKAT statistic that is approximately asymptotically independent of the mixed model burden test statistic. This is similar in spirit to MiST in the non-mixed model setting,15 but differs from MiST in that it does not require fitting a separate mixed effect burden model for each variant set with the set genetic burden as a fixed-effects covariate. Instead, we use matrix projections to approximate the adjusted SKAT statistic from a global null model without any fixed effects for the variant set-specific genetic burden. Of note, this global null model needs to be fit only once in a whole-genome analysis, which greatly reduces the computational cost. The approximation is highly accurate, even in the presence of large genetic effects. We show in simulation studies and the TOPMed fibrinogen example that SMMAT-E is more powerful than the other three tests in large samples, at a computational cost on almost the same scale as SMMAT-B and SMMAT-S. Therefore, SMMAT-E is recommended in the analysis of large-scale whole-genome sequencing studies.

In the SMMAT framework, different weighting strategies can be used. One can use a function of the MAF,11, 13 or external measures based on functional annotation such as CADD,35 Eigen,36 FATHMM-XF,37 or tissue-specific annotations, such as GENOSKYLINE,38 as the weight for each variant in a set. In the analysis of fibrinogen levels in TOPMed, we used MAF-based weights. Recently, unified variant set tests allowing for multiple functional annotations have been developed,39 and the SMMAT framework can possibly be extended to accommodate multiple weights. Nevertheless, the optimal weighting strategy in rare variant analysis remains an open question and an active field of research.

In our SMMAT implementation in the GMMAT App, SMMAT-E combines the burden test p value pB with an asymptotically independent adjusted SKAT p value pθ using Fisher’s method, but other forms of combination may also be applied.40 For example, previous studies have shown that Tippett’s procedure based on the minimum of pθ and pB might be more powerful than Fisher’s method in MiST when only one of the p values is small.15 Alternatively, instead of combining the p values, weighted linear combinations of chi-square statistics have been proposed41, 42, 43 and they can also be applied to combine the burden test statistic TB and the asymptotically independent SKAT statistic Tθ in the SMMAT framework.
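The two p value combinations mentioned above have simple closed forms for two independent p values, sketched below. The function names are hypothetical and this is not the GMMAT App code. For Fisher’s method, −2(ln pB + ln pθ) follows a chi-square distribution with 4 degrees of freedom under the null, whose upper tail probability at that value is pB pθ {1 − ln(pB pθ)}; for Tippett’s procedure, the combined p value is 1 − (1 − min(pB, pθ))^2.

```python
import math


def _check(p):
    if not 0.0 < p <= 1.0:
        raise ValueError("p values must lie in (0, 1]")


def fisher_two(p_b, p_theta):
    """Fisher's combination of two independent p values (chi-square, 4 df)."""
    _check(p_b)
    _check(p_theta)
    prod = p_b * p_theta
    return prod * (1.0 - math.log(prod))


def tippett_two(p_b, p_theta):
    """Tippett's minimum-p combination of two independent p values."""
    _check(p_b)
    _check(p_theta)
    p_min = min(p_b, p_theta)
    if p_min == 1.0:
        return 1.0
    # 1 - (1 - p_min)^2, written to stay accurate for very small p_min
    return -math.expm1(2.0 * math.log1p(-p_min))


# Illustrative inputs only: one small p value versus two moderately small ones.
for p_b, p_theta in ((0.5, 1e-6), (1e-3, 1e-3)):
    print(p_b, p_theta, fisher_two(p_b, p_theta), tippett_two(p_b, p_theta))
```

In the first illustrative pair only one p value is small and Tippett’s result is smaller; in the second pair both are small and Fisher’s result is smaller, which matches the trade-off described in the text.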

SMMAT also has some limitations. SMMAT p values are computed based on asymptotic distributions, which may not be accurate in small samples, especially for binary traits and heavily skewed continuous traits. For continuous traits, small-sample inference procedures have been proposed for SKAT,44, 45 and the same methodology can be applied to SMMAT. For ultra-rare genetic variants with very low minor allele counts, the single-variant scores used to construct SMMAT-B, SMMAT-S, SMMAT-O, and SMMAT-E may not be close to a normal distribution, even if the total sample size is large. If there are only ultra-rare variants (e.g., singletons, doubletons) in a test region and the number of variants is small, SMMAT-B might be the best analysis strategy as its asymptotic property depends on the cumulative minor allele counts. Moreover, the asymptotic issue of single-variant scores also exists for binary traits with highly unbalanced case-control ratios. A saddlepoint approximation approach has been proposed to match the cumulant generating function of the single-variant scores,46 and it has recently been extended to GLMMs.47

Fitting GLMMs with a GRM has O(n^3) complexity in general, where n is the sample size. We have overcome this computational challenge by fitting only one GLMM in a whole-genome analysis and using matrix multiplications with O(n^2) complexity for each variant set in SMMAT. In large-scale whole-genome sequencing studies, solutions to other computational challenges are being proposed. For example, when the number of variants q in SKAT is very large, eigendecomposition of the covariance matrix, which has O(min(n,q)^3) complexity, could be computationally expensive. Recently, the fastSKAT approach has been proposed to efficiently approximate the null distribution of SKAT when q is very large,48 and the same strategy can be applied to speed up SMMAT p value calculation for very large q. On the other hand, as the sample size in ongoing large-scale sequencing projects such as TOPMed eventually expands to hundreds of thousands, using a full n × n GRM would not be computationally practical in pooled analyses, as it may take several weeks to fit even a single GLMM with O(n^3) complexity and an O(n^2) memory footprint. Meta-analyses, which combine summary statistics from study-specific or ancestry-specific analyses, may be a more appealing analysis strategy in that situation. Essentially equivalently, in pooled analyses, using a sparse and/or block-diagonal GRM, with each block corresponding to an individual study in meta-analyses, will help reduce the computational cost of fitting GLMMs, provided one uses specialized routines for manipulation of sparse matrices.49 Although whole-genome sequencing studies have not yet been conducted in large biobanks with sample sizes on the scale of millions of individuals, it is expected that calculating the GRM itself would become a major computational bottleneck. Recently, GRM-free mixed effects models such as BOLT-LMM6, 50 and SAIGE47 have been developed for single variant tests, and we note that extension of these methods to the SMMAT framework will further reduce the computational cost in biobank-scale whole-genome sequencing studies in the future.

In summary, SMMAT provides a flexible and practical statistical framework for large-scale whole-genome sequencing studies with complex study samples, with balanced power and computational performance. With continuing advances in technology, lower costs, and the development of new analytical methods, large-scale whole-genome sequencing studies will facilitate human genetic research and enhance our understanding of complex diseases and traits.

Declaration of Interests

In the past three years, E.K.S. received honoraria from Novartis for Continuing Medical Education Seminars and grant and travel support from GlaxoSmithKline. M.H.C. has received grant support from GlaxoSmithKline.

Acknowledgments

This work was supported by National Institutes of Health grants R00 HL130593 (to H.C.), U01 HL120393 (to H.C. and J.E.H.), and R35 CA197449, P01-CA134294, U01-HG009088, U19-CA203654, and R01-HL113338 (to X. Lin). The authors acknowledge the Texas Advanced Computing Center (TACC, https://www.tacc.utexas.edu) at The University of Texas at Austin for providing high performance computing (HPC) resources that have contributed to the research results reported within this paper. Whole-genome sequence analysis of fibrinogen levels in TOPMed was performed in the Analysis Commons on DNAnexus, a hosting platform that uses Amazon Web Services (AWS) to provide a cloud data management and computing environment for large genomic data projects. The Analysis Commons was funded by NIH R01 HL131136. Phenotype harmonization and aggregation of the fibrinogen levels across TOPMed studies were supported in part by NIH R01 HL139553. Detailed TOPMed and study-specific acknowledgments can be found in Supplemental Acknowledgments. The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute, the National Institutes of Health, or the U.S. Department of Health and Human Services. We thank the referees for their helpful comments that have helped improve the paper.

Published: January 10, 2019

Footnotes

Supplemental Data include eight figures, one table, Supplemental Material and Methods, Supplemental Acknowledgments, and the full authorship list with affiliations of the Trans-Omics for Precision Medicine (TOPMed) Consortium and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.12.012.

Appendix A: Approximations in SMMAT-E

Here we derive the approximations used in SMMAT-E to construct the SKAT-type statistic adjusting for the genetic burden

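$$
T_\theta = \frac{(y - \tilde{\mu}_B)^T G W W G^T (y - \tilde{\mu}_B)}{\tilde{\phi}^2}.
$$

Let $\tilde{\phi}$, $\tilde{\alpha}$, $\tilde{\beta}_0$, $\tilde{b}_i$, $\tilde{V}$, and $\tilde{\Sigma}$ be estimates for $\phi$, $\alpha$, $\beta_0$, $b_i$, $V$, and $\Sigma$, respectively, from the burden GLMM (Equation 3). We define $\tilde{Y} = y$ as the phenotype vector for continuous traits, and the “working vector” with components $\tilde{Y}_i = X_i\tilde{\alpha} + G_i W 1_q \tilde{\beta}_0 + \tilde{b}_i + \{\tilde{\mu}_{Bi}(1 - \tilde{\mu}_{Bi})\}^{-1}(y_i - \tilde{\mu}_{Bi})$ at convergence of the logistic burden mixed model for binary traits (Equation 3), where $\tilde{\alpha}$, $\tilde{\beta}_0$, $\tilde{b}_i$ are fixed-effects and random-effects estimates from the burden GLMM. We have

$$
\begin{aligned}
\frac{y - \tilde{\mu}_B}{\tilde{\phi}}
&= \tilde{V}^{-1}\left(\tilde{Y} - X\tilde{\alpha} - G W 1_q \tilde{\beta}_0 - \tilde{b}\right) \\
&= \tilde{\Sigma}^{-1}\left(\tilde{Y} - X\tilde{\alpha} - G W 1_q \tilde{\beta}_0\right) \\
&= \tilde{\Sigma}^{-1}\left\{\tilde{Y} -
\begin{pmatrix} X & G W 1_q \end{pmatrix}
\begin{pmatrix}
X^T\tilde{\Sigma}^{-1}X & X^T\tilde{\Sigma}^{-1} G W 1_q \\
1_q^T W G^T \tilde{\Sigma}^{-1} X & 1_q^T W G^T \tilde{\Sigma}^{-1} G W 1_q
\end{pmatrix}^{-1}
\begin{pmatrix} X^T\tilde{\Sigma}^{-1} \\ 1_q^T W G^T \tilde{\Sigma}^{-1} \end{pmatrix}
\tilde{Y}\right\} \\
&= \left\{\tilde{\Sigma}^{-1} - \tilde{\Sigma}^{-1}X(X^T\tilde{\Sigma}^{-1}X)^{-1}X^T\tilde{\Sigma}^{-1}\right\}\tilde{Y}
 - \left\{\tilde{\Sigma}^{-1} - \tilde{\Sigma}^{-1}X(X^T\tilde{\Sigma}^{-1}X)^{-1}X^T\tilde{\Sigma}^{-1}\right\} G W 1_q \\
&\quad \times \left[1_q^T W G^T \left\{\tilde{\Sigma}^{-1} - \tilde{\Sigma}^{-1}X(X^T\tilde{\Sigma}^{-1}X)^{-1}X^T\tilde{\Sigma}^{-1}\right\} G W 1_q\right]^{-1}
 1_q^T W G^T \left\{\tilde{\Sigma}^{-1} - \tilde{\Sigma}^{-1}X(X^T\tilde{\Sigma}^{-1}X)^{-1}X^T\tilde{\Sigma}^{-1}\right\}\tilde{Y}.
\end{aligned}
$$

Note that $\tilde{\phi} = 1$ for binary traits. Moreover, since the true value of $\beta_0$ is small, assuming including the genetic burden $G_i W 1_q$ in the second term in Equation 3 does not dramatically change the variance component estimates for $\nu_k$ and $\phi$ (and for binary traits, also the “working vector” $\tilde{Y}$ at convergence of the model from Equation 2), we have the approximation $\tilde{\Sigma}^{-1} - \tilde{\Sigma}^{-1}X(X^T\tilde{\Sigma}^{-1}X)^{-1}X^T\tilde{\Sigma}^{-1} \approx \hat{P}$ and $(y - \hat{\mu}_0)/\hat{\phi} \approx \hat{P}\tilde{Y}$, then

$$
\begin{aligned}
\frac{W G^T (y - \tilde{\mu}_B)}{\tilde{\phi}}
&\approx W G^T\left\{\hat{P}\tilde{Y} - \hat{P} G W 1_q \left(1_q^T W G^T \hat{P} G W 1_q\right)^{-1} 1_q^T W G^T \hat{P}\tilde{Y}\right\} \\
&\approx \left\{I_q - W G^T \hat{P} G W 1_q \left(1_q^T W G^T \hat{P} G W 1_q\right)^{-1} 1_q^T\right\} \frac{W G^T (y - \hat{\mu}_0)}{\hat{\phi}}.
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
T_\theta &= \frac{(y - \tilde{\mu}_B)^T G W W G^T (y - \tilde{\mu}_B)}{\tilde{\phi}^2} \\
&\approx \hat{\phi}^{-2} (y - \hat{\mu}_0)^T G W
\left\{I_q - 1_q \left(1_q^T W G^T \hat{P} G W 1_q\right)^{-1} 1_q^T W G^T \hat{P} G W\right\} \\
&\quad \times \left\{I_q - W G^T \hat{P} G W 1_q \left(1_q^T W G^T \hat{P} G W 1_q\right)^{-1} 1_q^T\right\}
W G^T (y - \hat{\mu}_0).
\end{aligned}
$$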
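The final approximation in this appendix can be read as a projection applied to the null-model score vector, which the following minimal sketch makes concrete. It is not the GMMAT implementation: the function name and the toy numbers are hypothetical, `scores` stands in for the vector W G^T (y − μ̂0)/ϕ̂, and `cov` stands in for the q × q matrix W G^T P̂ G W, both assumed to have been computed already from the null model.

```python
def adjusted_skat_statistic(scores, cov):
    """Approximate T_theta from null-model scores and their covariance.

    scores: length-q list, standing in for W G^T (y - mu0_hat) / phi_hat
    cov:    q x q symmetric list of lists, standing in for W G^T P_hat G W
    Returns (T_theta, adjusted_scores).
    """
    v_one = [sum(row) for row in cov]   # V 1_q
    denom = sum(v_one)                  # 1_q^T V 1_q
    if denom <= 0.0:
        raise ValueError("1_q^T V 1_q must be positive")
    burden = sum(scores)                # 1_q^T S, the burden score
    # {I_q - V 1_q (1_q^T V 1_q)^{-1} 1_q^T} S
    adjusted = [s - v * burden / denom for s, v in zip(scores, v_one)]
    return sum(a * a for a in adjusted), adjusted


# Toy inputs for illustration only.
toy_scores = [1.0, 2.0]
toy_cov = [[2.0, 1.0],
           [1.0, 3.0]]
t_theta, adj = adjusted_skat_statistic(toy_scores, toy_cov)
print(t_theta, adj, sum(adj))
```

By construction the adjusted scores sum to zero, so the burden direction has been removed from them. No separate burden model is fitted: the adjustment uses only quantities from the global null model.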

Web Resources


References

  • 1. Kang H.M., Zaitlen N.A., Wade C.M., Kirby A., Heckerman D., Daly M.J., Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723. doi: 10.1534/genetics.107.080101.
  • 2. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548.
  • 3. Lippert C., Listgarten J., Liu Y., Kadie C.M., Davidson R.I., Heckerman D. FaST linear mixed models for genome-wide association studies. Nat. Methods. 2011;8:833–835. doi: 10.1038/nmeth.1681.
  • 4. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824. doi: 10.1038/ng.2310.
  • 5. Pirinen M., Donnelly P., Spencer C.C.A. Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies. Ann. Appl. Stat. 2013;7:369–390.
  • 6. Loh P.R., Tucker G., Bulik-Sullivan B.K., Vilhjálmsson B.J., Finucane H.K., Salem R.M., Chasman D.I., Ridker P.M., Neale B.M., Berger B. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190.
  • 7. Chen H., Wang C., Conomos M.P., Stilp A.M., Li Z., Sofer T., Szpiro A.A., Chen W., Brehm J.M., Celedón J.C. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 2016;98:653–666. doi: 10.1016/j.ajhg.2016.02.012.
  • 8.Breslow N.E., Clayton D.G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 1993;88:9–25. [Google Scholar]
  • 9.Morgenthaler S., Thilly W.G. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST) Mutat. Res. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003. [DOI] [PubMed] [Google Scholar]
  • 10.Li B., Leal S.M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Madsen B.E., Browning S.R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Morris A.P., Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet. Epidemiol. 2010;34:188–193. doi: 10.1002/gepi.20450. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Lee S., Wu M.C., Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13:762–775. doi: 10.1093/biostatistics/kxs014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Sun J., Zheng Y., Hsu L. A unified mixed-effects model for rare-variant association in sequencing studies. Genet. Epidemiol. 2013;37:334–344. doi: 10.1002/gepi.21717. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Pan W., Kim J., Zhang Y., Shen X., Wei P. A powerful and adaptive association test for rare variants. Genetics. 2014;197:1081–1095. doi: 10.1534/genetics.114.165035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Schifano E.D., Epstein M.P., Bielak L.F., Jhun M.A., Kardia S.L., Peyser P.A., Lin X. SNP set association analysis for familial data. Genet. Epidemiol. 2012;36:797–810. doi: 10.1002/gepi.21676. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Chen H., Meigs J.B., Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet. Epidemiol. 2013;37:196–204. doi: 10.1002/gepi.21703. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Oualkacha K., Dastani Z., Li R., Cingolani P.E., Spector T.D., Hammond C.J., Richards J.B., Ciampi A., Greenwood C.M. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet. Epidemiol. 2013;37:366–376. doi: 10.1002/gepi.21725. [DOI] [PubMed] [Google Scholar]
  • 20.Wang X., Lee S., Zhu X., Redline S., Lin X. GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet. Epidemiol. 2013;37:778–786. doi: 10.1002/gepi.21763. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Jiang D., McPeek M.S. Robust rare variant association testing for quantitative traits in samples with related individuals. Genet. Epidemiol. 2014;38:10–20. doi: 10.1002/gepi.21775. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Yan Q., Tiwari H.K., Yi N., Gao G., Zhang K., Lin W.Y., Lou X.Y., Cui X., Liu N. A sequence kernel association test for dichotomous traits in family samples under a generalized linear mixed model. Hum. Hered. 2015;79:60–68. doi: 10.1159/000375409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Park J.Y., Wu C., Basu S., McGue M., Pan W. Adaptive SNP-Set association testing in generalized linear mixed models with application to family studies. Behav. Genet. 2018;48:55–66. doi: 10.1007/s10519-017-9883-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zhan X., Hu Y., Li B., Abecasis G.R., Liu D.J. RVTESTS: an efficient and comprehensive tool for rare variant association analysis using sequence data. Bioinformatics. 2016;32:1423–1426. doi: 10.1093/bioinformatics/btw079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Liu D.J., Peloso G.M., Zhan X., Holmen O.L., Zawistowski M., Feng S., Nikpay M., Auer P.L., Goel A., Zhang H. Meta-analysis of gene-level tests for rare variant association. Nat. Genet. 2014;46:200–204. doi: 10.1038/ng.2852. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Feng S., Pistis G., Zhang H., Zawistowski M., Mulas A., Zoledziewska M., Holmen O.L., Busonero F., Sanna S., Hveem K. Methods for association analysis and meta-analysis of rare variants in families. Genet. Epidemiol. 2015;39:227–238. doi: 10.1002/gepi.21892. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Lee S., Teslovich T.M., Boehnke M., Lin X. General framework for meta-analysis of rare variants in sequencing association studies. Am. J. Hum. Genet. 2013;93:42–53. doi: 10.1016/j.ajhg.2013.05.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Rice K., Higgins J.P., Lumley T. A re-evaluation of fixed effect(s) meta-analysis. J. R. Stat. Soc. A. 2017;181:205–227. [Google Scholar]
  • 29.MacCluer J.W., VandeBerg J.L., Read B., Ryder O.A. Pedigree analysis by computer simulation. Zoo Biol. 1986;5:147–160. [Google Scholar]
  • 30.Conomos M.P., Laurie C.A., Stilp A.M., Gogarten S.M., McHugh C.P., Nelson S.C., Sofer T., Fernández-Rhodes L., Justice A.E., Graff M. Genetic diversity and association studies in US Hispanic/Latino populations: Applications in the Hispanic Community Health Study/Study of Latinos. Am. J. Hum. Genet. 2016;98:165–184. doi: 10.1016/j.ajhg.2015.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Conomos M.P., Miller M.B., Thornton T.A. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet. Epidemiol. 2015;39:276–293. doi: 10.1002/gepi.21896. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Morrison A.C., Huang Z., Yu B., Metcalf G., Liu X., Ballantyne C., Coresh J., Yu F., Muzny D., Feofanova E. Practical approaches for whole-genome sequence analysis of heart- and blood-related traits. Am. J. Hum. Genet. 2017;100:205–215. doi: 10.1016/j.ajhg.2016.12.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Brody J.A., Morrison A.C., Bis J.C., O’Connell J.R., Brown M.R., Huffman J.E., Ames D.C., Carroll A., Conomos M.P., Gabriel S., NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium. TOPMed Hematology and Hemostasis Working Group. CHARGE Analysis and Bioinformatics Working Group Analysis commons, a team approach to discovery in a big-data environment for genetic epidemiology. Nat. Genet. 2017;49:1560–1563. doi: 10.1038/ng.3968. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Huffman J.E., de Vries P.S., Morrison A.C., Sabater-Lleal M., Kacprowski T., Auer P.L., Brody J.A., Chasman D.I., Chen M.H., Guo X. Rare and low-frequency variants and their association with plasma levels of fibrinogen, FVII, FVIII, and vWF. Blood. 2015;126:e19–e29. doi: 10.1182/blood-2015-02-624551.
  • 35.Kircher M., Witten D.M., Jain P., O’Roak B.J., Cooper G.M., Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
  • 36.Ionita-Laza I., McCallum K., Xu B., Buxbaum J.D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 2016;48:214–220. doi: 10.1038/ng.3477.
  • 37.Rogers M.F., Shihab H.A., Mort M., Cooper D.N., Gaunt T.R., Campbell C. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics. 2018;34:511–513. doi: 10.1093/bioinformatics/btx536.
  • 38.Lu Q., Powles R.L., Wang Q., He B.J., Zhao H. Integrative tissue-specific functional annotations in the human genome provide novel insights on many complex traits and improve signal prioritization in genome wide association studies. PLoS Genet. 2016;12:e1005947. doi: 10.1371/journal.pgen.1005947.
  • 39.He Z., Xu B., Lee S., Ionita-Laza I. Unified sequence-based association tests allowing for multiple functional annotations and meta-analysis of noncoding variation in Metabochip data. Am. J. Hum. Genet. 2017;101:340–352. doi: 10.1016/j.ajhg.2017.07.011.
  • 40.Koziol J.A., Perlman M.D. Combining independent chi-squared tests. J. Am. Stat. Assoc. 1978;73:753–763.
  • 41.Wu M.C., Maity A., Lee S., Simmons E.M., Harmon Q.E., Lin X., Engel S.M., Molldrem J.J., Armistead P.M. Kernel machine SNP-set testing under multiple candidate kernels. Genet. Epidemiol. 2013;37:267–275. doi: 10.1002/gepi.21715.
  • 42.Ionita-Laza I., Lee S., Makarov V., Buxbaum J.D., Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am. J. Hum. Genet. 2013;92:841–853. doi: 10.1016/j.ajhg.2013.04.015.
  • 43.Su Y.R., Di C., Bien S., Huang L., Dong X., Abecasis G., Berndt S., Bezieau S., Brenner H., Caan B. A mixed-effects model for powerful association tests in integrative functional genomics. Am. J. Hum. Genet. 2018;102:904–919. doi: 10.1016/j.ajhg.2018.03.019.
  • 44.Chen J., Chen W., Zhao N., Wu M.C., Schaid D.J. Small sample kernel association tests for human genetic and microbiome association studies. Genet. Epidemiol. 2016;40:5–19. doi: 10.1002/gepi.21934.
  • 45.Zhou J.J., Hu T., Qiao D., Cho M.H., Zhou H. Boosting gene mapping power and efficiency with efficient exact variance component tests of single nucleotide polymorphism sets. Genetics. 2016;204:921–931. doi: 10.1534/genetics.116.190454.
  • 46.Dey R., Schmidt E.M., Abecasis G.R., Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 2017;101:37–49. doi: 10.1016/j.ajhg.2017.05.014.
  • 47.Zhou W., Nielsen J.B., Fritsche L.G., Dey R., Gabrielsen M.E., Wolford B.N., LeFaive J., VandeHaar P., Gagliano S.A., Gifford A. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y.
  • 48.Lumley T., Brody J., Peloso G., Morrison A., Rice K. FastSKAT: Sequence kernel association tests for very large sets of markers. Genet. Epidemiol. 2018;42:516–527. doi: 10.1002/gepi.22136.
  • 49.Bates D., Maechler M., Davis T.A., Oehlschlägel J., Riedy J., R Core Team. Matrix: Sparse and Dense Matrix Classes and Methods. R package Version 1.2-14. 2018. https://CRAN.R-project.org/package=Matrix
  • 50.Loh P.R., Kichaev G., Gazal S., Schoech A.P., Price A.L. Mixed-model association for biobank-scale datasets. Nat. Genet. 2018;50:906–908. doi: 10.1038/s41588-018-0144-6.

Associated Data


Supplementary Materials

Document S1. Figures S1–S8, Table S1, Supplemental Material and Methods, Supplemental Acknowledgments, and TOPMed Consortium Author List
mmc1.pdf (1.7MB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf (4.1MB, pdf)

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
