Author manuscript; available in PMC: 2018 Sep 11.
Published in final edited form as: Ann Appl Stat. 2018 Sep 11;12(3):1558–1582. doi: 10.1214/17-AOAS1121

ADAPTIVE-WEIGHT BURDEN TEST FOR ASSOCIATIONS BETWEEN QUANTITATIVE TRAITS AND GENOTYPE DATA WITH COMPLEX CORRELATIONS

Xiaowei Wu 1, Ting Guan 2, Dajiang J Liu 3, Luis G León Novelo 4, Dipankar Bandyopadhyay 5
PMCID: PMC6133321  NIHMSID: NIHMS940504  PMID: 30214655

Abstract

High-throughput sequencing has often been used to screen samples from pedigrees or with population structure, producing genotype data with complex correlations rendered from both familial relation and linkage disequilibrium. With such data, it is critical to account for these genotypic correlations when assessing the contribution of variants by gene or pathway. Recognizing the limitations of existing association testing methods, we propose Adaptive-weight Burden Test (ABT), a retrospective, mixed-model test for genetic association of quantitative traits on genotype data with complex correlations. This method makes full use of genotypic correlations across both samples and variants, and adopts “data-driven” weights to improve power. We derive the ABT statistic and its explicit distribution under the null hypothesis, and demonstrate through simulation studies that it is generally more powerful than the fixed-weight burden test and family-based SKAT in various scenarios, controlling for the type I error rate. Further investigation reveals the connection of ABT with kernel tests, as well as the adaptability of its weights to the direction of genetic effects. The application of ABT is illustrated by a whole genome analysis of genes with common and rare variants associated with fasting glucose from the NHLBI “Grand Opportunity” Exome Sequencing Project.

Keywords and phrases: Genetic association test, burden test, kernel test, adaptive weight, bi-directional genotypic correlation

MSC 2010 subject classifications: Primary 62F03, secondary 62P10

1. Introduction

Next-generation sequencing technologies have been undergoing rapid evolution in recent years, enabling high-resolution genotyping in a fast, efficient, and cost-effective way (Shendure and Ji, 2008; Ansorge, 2009). These technologies, when applied to family or population based samples, produce rich resources of genotype data with complex correlations, i.e., they are correlated across both samples and variants due to familial relation (or population stratification) and linkage disequilibrium (LD), respectively. Though such genotypic correlations could potentially benefit many applications including data imputation, quality control, and functional annotation, how to effectively use this information for assessing genetic associations with complex diseases remains a challenge.

In genome-wide association studies (GWASs), two types of tests have been commonly used depending on whether single-variant effect or joint effect of multiple variants in a predefined genomic region is of interest. Compared to single-variant association tests, multiple-variant association tests (i.e., SNP-set association tests) are believed to be advantageous in that they are more powerful in aggregating small signals from each single variant, especially when the minor allele frequency is low (Asimit and Zeggini, 2010), capturing possible interactions among variants (Ma, Clark and Keinan, 2013), reducing multiple testing burden (Wu et al., 2010), and leading to interpretable results targeted to genes or LD haplotype blocks (Li et al., 2011).

A number of multiple-variant association tests have been proposed. They generally fall under two broad categories: “burden tests” and “kernel tests”. Burden tests group variants into a single variable called genetic burden score by some transformations or projections, and then perform association testing on the burden score. Typical collapsing methods, including rare variant indicator or weighted sum, have been developed for both unrelated (Morgenthaler and Thilly, 2007; Li and Leal, 2008; Madsen and Browning, 2009; Price et al., 2010a) and related (Chen, Meigs and Dupuis, 2013; Schaid et al., 2013) individuals. Other dimension reduction techniques, such as Fourier transformation (Wang and Elston, 2007), principal component analysis (Gauderman et al., 2007; Wang and Abbott, 2008), and partial least-squares regression (Chun et al., 2011), have also been applied in grouping multiple variants. Although easy to implement, burden tests rely largely on the assumption of homogeneity in the magnitude and direction of the genetic effects from individual variants. When the genomic region of interest includes both risk (positively associated) and protective (negatively associated) variants, or when inappropriate weights (contradicting the true genetic effects) are used, burden tests may experience loss of power.

Recently, kernel tests have received increasing attention in GWASs. With their roots in kernel-machine regression (Wu et al., 2010) and mixed effects models, kernel tests adopt a statistic with a quadratic form, which is essentially a weighted sum of the score statistics for testing individual variant effects (Wu et al., 2011). Methods in this broad category include the C-alpha test (Neale et al., 2011), SKAT (Wu et al., 2011), and LSKM (Kwee et al., 2008) for unrelated individuals, and have been extended to allow related individuals (Schifano et al., 2012; Chen, Meigs and Dupuis, 2013; Schaid et al., 2013; Wang et al., 2013a, b). In particular, this approach assumes random effects for individual variants, and tests the regression coefficients of the variants by a variance-component score test. Since aggregation is on individual variant statistics instead of on the variants themselves, kernel tests allow different directions and magnitudes of effects for individual variants. It has been shown that burden tests are more powerful when the variants to be tested are mostly causal with effects of the same direction and similar magnitudes, whereas kernel tests are more powerful when the effects of causal variants are in different directions or a large proportion of neutral variants is present (Wu et al., 2011; Chen, Meigs and Dupuis, 2013; Schaid et al., 2013; Lee, Wu and Lin, 2012; Wang, Chen and Yang, 2012). To borrow strength from both approaches and avoid loss of power in certain scenarios, it is possible to combine test statistics from the two categories (Lee, Wu and Lin, 2012; Lee et al., 2012; Jiang and McPeek, 2013), although determining the optimal combination weight may be problematic in real data applications, and the null distribution of the combined statistic is usually hard to derive.

In the category of burden tests, considerable effort has been made to seek weighting strategies that allow for the presence of both risk and protective variants (Han and Pan, 2010; Liu and Leal, 2010; Lin and Tang, 2011; Sha et al., 2012; Sha and Zhang, 2014; Fang, Zhang and Sha, 2014). However, as the optimal weights are functions of the genotype data, the resulting test statistic does not follow the $\chi^2_1$ null distribution that holds for score-based fixed-weight burden tests. None of these methods derive the explicit null distribution after employing the optimal weights; they use permutations to evaluate the p-value instead. These permutation tests are computationally expensive, especially for whole-genome analysis. Moreover, when genotypic correlations exist across both samples and variants, the application of these methods is severely restricted. For samples with related individuals, permutations are not straightforward. In addition, because all these methods use prospective regression (which considers the trait as random and the genotype data as fixed), the LD correlation among the variants under consideration is often hard to incorporate into the weights. Furthermore, the optimality of the resulting statistic needs theoretical justification that is non-trivial.

In this paper, we focus on association mapping of quantitative traits on genotype data with complex correlations. Specifically, we try to answer the following questions: (1) How to model the genotypic correlations across both samples and variants in an efficient way? (2) How to choose the optimal weights of a burden test in such a scenario? (3) Is there a connection between burden tests and kernel tests, under certain weighting strategies? (4) How are the optimal weights adaptive to the direction of individual variant effects? To address these, we propose the Adaptive-weight Burden Test (ABT), a retrospective, mixed-model test which is able to incorporate complex correlations in genotype data and adopts data-driven weights to improve power. We show that the explicit null distribution of the ABT can be obtained by appropriately projecting genotype data and combining independent individual score tests. Therefore, ABT is computationally more efficient than other, permutation-based optimal tests.

The rest of the paper is organized as follows. After a brief introduction to the relevant background of genetic association testing and some preliminaries regarding the single-variant MASTOR test (Jakobsdottir and McPeek, 2013), including its extension to a retrospective fixed-weight burden test, we introduce the ABT test in Section 2, sketch some of its theoretical properties, and illuminate its connection to the kernel tests. In Section 3, we demonstrate via synthetic data that our proposed ABT test is generally more powerful than the fixed-weight burden test and family-based SKAT across various scenarios, while controlling the type I error. In addition, the weights of ABT are able to adapt to the direction of individual variant effects. In Section 4, we illustrate the use of our ABT test via a whole genome analysis of association between fasting glucose and previously reported genes with common and rare variants from the NHLBI “Grand Opportunity” Exome Sequencing Project (GO-ESP). Finally, in Section 5, we present some concluding remarks.

2. Statistical framework of the ABT test

Consider an association study where a group of $n$ individuals are sampled for phenotype, covariate and genotype data. For simplicity, we do not assume any missingness in the data (i.e., each individual is assumed to have complete data in phenotypes, covariates and genotypes), though all the results hereafter can be extended to the incomplete data case, in a way similar to Jakobsdottir and McPeek (2013). The phenotype data are collected for a quantitative trait, and denoted by a vector $Y$ of length $n$. The covariates form an $n \times k$ matrix $Z$, with the columns representing $k$ non-genetic variables (intercept included) such as age, sex, BMI, etc. We consider testing for association between the quantitative trait and a genetic region of $m$ variant sites. Each typed variant is assumed to be biallelic, with the alleles arbitrarily labeled as “0” and “1”. So the genotype data can be written as an $n \times m$ matrix $G = [G_1, G_2, \cdots, G_m]$ with the $(i, j)$th element coded as $G_{ij} = \frac{1}{2} \times$ (the number of alleles of type 1 in individual $i$ at variant site $j$), $1 \le i \le n$, $1 \le j \le m$. These $m$ variants are further assumed to have a certain LD structure, with the correlation matrix defined by

$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1m} \\ r_{12} & 1 & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1m} & r_{2m} & \cdots & 1 \end{pmatrix},$$

where $r_{ij} = (p_{11} - p_i p_j)/\sqrt{p_i(1-p_i)p_j(1-p_j)}$, $1 \le i \ne j \le m$, is the correlation coefficient between variant $i$ and variant $j$, $p_i$ and $p_j$ are the allele frequencies of variants $i$ and $j$ respectively, and $p_{11}$ is the frequency of the haplotype having alleles 1 at both variants. In addition to the correlation among genetic markers, we also consider the correlation (i.e., relatedness) among sampled individuals in this current work. We assume a known pedigree structure for the sampled individuals and define the kinship matrix by

$$\Phi = \begin{pmatrix} 1+h_1 & 2\phi_{12} & \cdots & 2\phi_{1n} \\ 2\phi_{12} & 1+h_2 & \cdots & 2\phi_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 2\phi_{1n} & 2\phi_{2n} & \cdots & 1+h_n \end{pmatrix},$$

where $h_i$ is the inbreeding coefficient of individual $i$, and $\phi_{ij}$ is the kinship coefficient between individuals $i$ and $j$, $1 \le i, j \le n$. For outbred individuals, the kinship matrix can be considered as the correlation matrix among individual genotypes. As a special case, for unrelated individuals, the kinship matrix reduces to an identity matrix.
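As a small illustration of the two correlation structures just defined, the sketch below evaluates the LD correlation coefficient from haplotype and allele frequencies and assembles the kinship matrix for a single outbred nuclear family. The function name `ld_correlation`, the frequencies, and the family layout are hypothetical choices made for this example only.

```python
import numpy as np

def ld_correlation(p11, p_i, p_j):
    """Correlation r_ij between two biallelic variants from haplotype frequencies."""
    return (p11 - p_i * p_j) / np.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

# Hypothetical allele and haplotype frequencies for two variants.
r_linked = ld_correlation(p11=0.15, p_i=0.3, p_j=0.4)    # alleles 1 co-occur more than expected
r_unlinked = ld_correlation(p11=0.12, p_i=0.3, p_j=0.4)  # p11 = p_i * p_j: linkage equilibrium

# Kinship matrix Phi for one outbred nuclear family: father, mother, two children.
# Inbreeding coefficients h_i are 0; kinship coefficients phi_ij are 1/4 for
# parent-child and full-sibling pairs, and 0 for the two unrelated parents.
phi = np.array([
    [0.00, 0.00, 0.25, 0.25],
    [0.00, 0.00, 0.25, 0.25],
    [0.25, 0.25, 0.00, 0.25],
    [0.25, 0.25, 0.25, 0.00],
])
h = np.zeros(4)
Phi = 2 * phi + np.diag(1 + h)   # off-diagonal entries 2*phi_ij, diagonal entries 1 + h_i

print(round(r_linked, 3), round(r_unlinked, 3))
print(Phi)
```

For a sample made of several independent families, the full kinship matrix would be block diagonal with one such block per family.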

To conveniently model correlations in the genotype matrix caused by both familial relation and LD, we will treat genotypes as random and conduct a retrospective analysis based on G|Y, Z. This approach, originated from MQLS (Thornton and McPeek, 2007) and MASTOR (Jakobsdottir and McPeek, 2013), is different from most existing association testing methods for related individuals, such as MONSTER (Jiang and McPeek, 2013), family-based burden test and family-based SKAT (Chen, Meigs and Dupuis, 2013), which are based on a prospective model Y |G, Z. The next few sub-sections proceed as follows. For a better understanding, we begin with a brief introduction to the single-variant MASTOR test, and extend it to a retrospective, fixed-weight burden test. Next, we derive the ABT statistic, and illuminate its connection with kernel tests.

2.1. MASTOR for single-variant effect

In single-variant analysis, MASTOR (Jakobsdottir and McPeek, 2013) is a retrospective, quasi-likelihood score test for genetic association of a quantitative trait in samples with related individuals. MASTOR gains additional power by making full use of the sample relationship information to incorporate partially missing data, which gives it an advantage over competing methods. Considering a biallelic genetic variant $X$ of interest (an example in our general setting described above is to let $X = G_j$, $1 \le j \le m$), the MASTOR test statistic (for complete data) takes the form:

$$S_{MAS} = \frac{(V^T X)^2}{\widehat{\mathrm{Var}}_0(V^T X \mid Y, Z)}, \qquad (2.1)$$

where $V = \hat\Sigma_0^{-1}(Y - Z\hat\beta_0)$ is the transformed phenotypic residual obtained from the phenotype model $Y = Z\beta_0 + \varepsilon$, $\varepsilon \sim N(0, \Sigma_0)$. Here $\beta_0$ represents the regression effect under the null hypothesis of no genetic association, and $\Sigma_0$ is the trait covariance matrix under the null, usually taking a variance component form $\sigma_e^2 I + \sigma_a^2 \Phi$. With $P = \hat\Sigma_0^{-1} - \hat\Sigma_0^{-1} Z (Z^T \hat\Sigma_0^{-1} Z)^{-1} Z^T \hat\Sigma_0^{-1}$, $V$ is often written compactly as $PY$. Under the assumption of the retrospective model that $\mathrm{Var}_0(X \mid Y, Z) = \sigma_X^2 \Phi$, where $\sigma_X^2$ is an unknown scalar representing the variance of variant $X$, Equation (2.1) can be written as:

$$S_{MAS} = \frac{(V^T X)^2}{(V^T \Phi V)\,\hat\sigma_X^2} = \frac{(Y^T P X)^2}{(Y^T P \Phi P Y)\,\hat\sigma_X^2}. \qquad (2.2)$$

When Hardy-Weinberg equilibrium (HWE) is assumed at the variant, a simple estimator of $\sigma_X^2$ can be obtained as $\hat\sigma_X^2 = \hat p(1-\hat p)/2$, where $\hat p = (1^T \Phi^{-1} 1)^{-1} 1^T \Phi^{-1} X$ is the best linear unbiased estimator (BLUE) (McPeek, Wu and Ober, 2004) of the allele frequency $p$ of $X$, and $1$ denotes a vector with every element equal to 1. In practice, a more general and robust estimator $\hat\sigma_X^2 = X^T U X/(n-1)$ can be used instead (Thornton and McPeek, 2010), where $U = \Phi^{-1} - \Phi^{-1} 1 (1^T \Phi^{-1} 1)^{-1} 1^T \Phi^{-1}$. Under the null, the MASTOR statistic follows a $\chi^2_1$ distribution.
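The computation in (2.2) can be sketched numerically as follows. This is a minimal illustration rather than the MASTOR software: the function name `mastor_stat` is hypothetical, the variance components are treated as known (in practice they would be estimated under the null), and the sib-pair data are toy values.

```python
import math
import numpy as np

def mastor_stat(Y, Z, X, Phi, sigma_e2, sigma_a2):
    """Complete-data statistic (2.2) with the robust estimator of sigma_X^2."""
    n = len(Y)
    Sigma0 = sigma_e2 * np.eye(n) + sigma_a2 * Phi
    Si = np.linalg.inv(Sigma0)
    # P = Si - Si Z (Z' Si Z)^{-1} Z' Si, so that V = P Y
    P = Si - Si @ Z @ np.linalg.solve(Z.T @ Si @ Z, Z.T @ Si)
    V = P @ Y
    # Robust estimator: sigma_X^2 = X' U X / (n - 1)
    Pi = np.linalg.inv(Phi)
    one = np.ones(n)
    U = Pi - np.outer(Pi @ one, one @ Pi) / (one @ Pi @ one)
    sigma_X2 = X @ U @ X / (n - 1)
    S = (V @ X) ** 2 / ((V @ Phi @ V) * sigma_X2)
    p_value = math.erfc(math.sqrt(S / 2))   # upper tail of chi-square with 1 df
    return S, p_value, V

rng = np.random.default_rng(0)
n_pairs = 100
Phi = np.kron(np.eye(n_pairs), np.array([[1.0, 0.5], [0.5, 1.0]]))  # hypothetical sib pairs
n = 2 * n_pairs
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = Z @ np.array([1.0, 0.6]) + rng.normal(size=n)   # trait simulated under the null
X = rng.binomial(2, 0.3, size=n) / 2                # toy genotypes, drawn independently for brevity
S, p_value, V = mastor_stat(Y, Z, X, Phi, sigma_e2=1.0, sigma_a2=0.5)
print(S, p_value)
```

The p-value uses the identity that the upper tail of a $\chi^2_1$ variable at $s$ equals $\mathrm{erfc}(\sqrt{s/2})$, which avoids any dependency beyond the standard library.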

2.2. Retrospective, fixed-weight burden test

MASTOR can be extended to multiple-variant testing, i.e., to test association between a trait and a set of genetic variants. An easy extension is through burden tests, which are constructed following a two-step procedure: first collapse multiple variants into a genetic burden score by a linear combination, and then obtain the test statistic as in single-variant tests. Following this formulation, we introduce a fixed-weight burden test for quantitative traits and genotype data with LD and sample relatedness. We abbreviate this method as FBT, where the F stands for “fixed-weight” or “family-based”; in the latter sense, however, this retrospective burden test differs from the prospective famBT method of Chen, Meigs and Dupuis (2013). Fixed-weight here refers to the setting of prescribed weights, in contrast to the weights derived in the next subsection, which are data-adaptive. Unlike other burden tests, FBT is based on a retrospective model analogous to that in MASTOR, and thus also possesses desirable properties such as the ability to incorporate partially missing data and robustness to misspecification of the phenotype model. A similar method can be found in Schaid et al. (2013), with a different approach adopted for defining residuals in the null trait model and deriving covariances of the genetic burden score.

For genotype data $G$ consisting of $m$ variants with corresponding allele frequencies $p_1, p_2, \ldots, p_m$, we consider the weighted sum burden score:

$$X = \sum_{i=1}^{m} w_i G_i = GW, \qquad (2.3)$$

where $W = [w_1, w_2, \ldots, w_m]^T$ is a prescribed $m \times 1$ weight vector. Following the MASTOR statistic (2.1), a fixed-weight burden test statistic based on $X$ can be constructed as:

$$S_{FBT} = \frac{(V^T X)^2}{V^T \hat\Sigma_X V}, \qquad (2.4)$$

where $\hat\Sigma_X$ is an estimator of the covariance matrix of $X$ under the null. Analogous to MASTOR, it can be shown that, conditional on $W$, $Y$, $Z$, $S_{FBT}$ follows a $\chi^2_1$ distribution under the null hypothesis that the genetic score $X$ is not associated with $Y$, i.e., the set of variants $G$ has no genetic effects on $Y$ after being collapsed into $X$. Furthermore, if the vectorized $G$ has a Kronecker product covariance structure, then $\Sigma_X$ can be expressed in terms of the weight vector, across-column covariance, and across-row correlation as $\Sigma_X = (W^T D R D W)\Phi$, where $D = \mathrm{diag}\{\sigma_j\}$, $1 \le j \le m$, and $\sigma_j$ is the marginal standard deviation of variant $G_j$ (see S.1 of Supplementary Materials).

Correspondingly, if $R$ and $\Phi$ are assumed known, an appropriate estimator of $\Sigma_X$ is:

$$\hat\Sigma_X = (W^T \hat D R \hat D W)\Phi, \qquad (2.5)$$

where the $j$th diagonal element of $\hat D$ is estimated as $\hat\sigma_j = \sqrt{\hat p_j(1-\hat p_j)/2}$, $1 \le j \le m$, under HWE, with $\hat p_j$ the BLUE of the allele frequency $p_j$ as previously defined. Therefore, the FBT statistic becomes

$$S_{FBT} = \frac{W^T G^T V V^T G W}{[W^T(\hat D R \hat D)W]\,[V^T \Phi V]}. \qquad (2.6)$$

Clearly, $S_{FBT}$ is invariant to the scale of $W$. As a special case, when $m = 1$, $S_{FBT}$ in (2.6) reduces to $S_{MAS}$ in (2.2).
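The two properties just stated, scale invariance in $W$ and the reduction to the single-variant statistic when $m = 1$, can be checked numerically. The sketch below assumes unrelated individuals ($\Phi = I$), variants in linkage equilibrium ($R = I$, taken as known), and a null model with intercept only and $\Sigma_0 = I$; the function name `fbt_stat` and all data are hypothetical.

```python
import numpy as np

def fbt_stat(V, G, W, R, Phi):
    """Fixed-weight burden statistic (2.6) with HWE-based marginal SDs."""
    n = G.shape[0]
    Pi_one = np.linalg.solve(Phi, np.ones(n))
    p_hat = Pi_one @ G / Pi_one.sum()              # BLUE of each allele frequency
    D = np.diag(np.sqrt(p_hat * (1 - p_hat) / 2))  # estimated marginal SDs
    b = G.T @ V
    return (W @ b) ** 2 / ((W @ D @ R @ D @ W) * (V @ Phi @ V))

rng = np.random.default_rng(0)
n, m = 300, 4
maf = np.array([0.15, 0.25, 0.35, 0.45])
G = rng.binomial(2, maf, size=(n, m)) / 2          # genotypes coded as half the allele count
Phi, R = np.eye(n), np.eye(m)
Y = rng.normal(size=n)
V = Y - Y.mean()                                   # null residual for this toy setting

W = 1 / np.sqrt(maf * (1 - maf))                   # prescribed weights
S1 = fbt_stat(V, G, W, R, Phi)
S2 = fbt_stat(V, G, 5.0 * W, R, Phi)               # rescaled weights give the same statistic

# m = 1: (2.6) coincides with (2.2) under the HWE variance estimator.
x = G[:, 0]
p1 = x.mean()
S_single = fbt_stat(V, G[:, :1], np.ones(1), np.eye(1), Phi)
S_mas = (V @ x) ** 2 / ((V @ V) * p1 * (1 - p1) / 2)
print(S1, S2, S_single, S_mas)
```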

In general, the complex correlation structure in the genotype data, i.e., both $R$ and $\Phi$ in (2.6), is not known a priori and needs to be estimated from the data. For simplicity, in our analysis, we assume that the across-row correlation $\Phi$ (or the entire pedigree structure) is known. This is reasonable because information on individual relatedness is available for most family-based studies (Splansky et al., 2007). When population structure or cryptic relatedness is present, $\Phi$ can be estimated from genome-screen data (Thornton and McPeek, 2010). As for the across-column correlation $R$, one may obtain its estimate from a reference population, for example, the one provided by the 1000 Genomes Project (Consortium, 2010). When reference panel information is not available, a simple and practical way to calculate $\hat R$ from $G$ is to use the sample correlation of the matrix $\Phi^{-1/2} G$. This estimation, however, has a nonnegligible impact on the performance of the test statistic, in terms of type I error and power. This issue will be further discussed in Section 3.3.
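The fallback estimator of $R$ described above can be sketched in a few lines. The helper name `estimate_R` is hypothetical, the symmetric inverse square root of $\Phi$ is taken via an eigendecomposition (one of several valid choices), and the sib-pair genotypes are toy data in which variants 0 and 1 are made to agree for about 70% of individuals.

```python
import numpy as np

def estimate_R(G, Phi):
    """Sample correlation of the columns of Phi^{-1/2} G."""
    evals, evecs = np.linalg.eigh(Phi)
    Phi_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return np.corrcoef(Phi_inv_sqrt @ G, rowvar=False)

rng = np.random.default_rng(0)
n_pairs, m = 150, 3
Phi = np.kron(np.eye(n_pairs), np.array([[1.0, 0.5], [0.5, 1.0]]))  # hypothetical sib pairs
n = 2 * n_pairs
G = rng.binomial(2, 0.3, size=(n, m)) / 2
linked = rng.random(n) < 0.7
G[linked, 1] = G[linked, 0]            # make variants 0 and 1 strongly correlated

R_hat = estimate_R(G, Phi)
R_plain = estimate_R(G, np.eye(n))     # unrelated case: the ordinary sample correlation
print(np.round(R_hat, 2))
```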

2.3. Adaptive-weight burden test

A common problem in existing burden tests is that the variant weights are pre-specified or set in some ad hoc way, so that there is no theoretical justification for the test to be statistically powerful. For example, one may use a simple-sum method to assign equal weights to the variants. A better strategy is the Weighted-Sum method (Madsen and Browning, 2009), which assigns weights according to allele frequency, i.e., $w_j \propto 1/\sqrt{\hat p_j(1-\hat p_j)}$. It has been shown that, with prescribed weights, burden tests cannot distinguish the effects of risk and protective variants, i.e., they lose power when the minor alleles across sites have effects in different directions, some positive and some negative (Wu et al., 2011; Chen, Meigs and Dupuis, 2013; Schaid et al., 2013). To overcome this deficiency, several adaptive burden tests have been proposed (Han and Pan, 2010; Liu and Leal, 2010; Lin and Tang, 2011; Sha et al., 2012; Sha and Zhang, 2014; Fang, Zhang and Sha, 2014); however, most of these weighting strategies are empirical and cannot make full use of the complex correlation information. Hence, although easy to implement, burden tests are usually not preferred in real applications where no a priori information exists on the effect of individual variants. This motivates us to look for a burden test that is able to “let the data speak for themselves”, i.e., with weights adaptive to the direction of individual variant effects. Such a test is called the adaptive-weight burden test, abbreviated as ABT. Let $A = \hat D R \hat D$ and $B = bb^T$ where $b = G^T V$. From the generalized Rayleigh quotient form of (2.6), we can show that the weight vector $W^*$ that maximizes $S_{FBT}$ satisfies

$$W^* \propto A^{-1} b = (\hat D R \hat D)^{-1} G^T V. \qquad (2.7)$$

Another representation of $W^*$ is a vector proportional to (or, with the same direction as) the eigenvector of $A^{-1}B$ corresponding to the largest eigenvalue. We refer to burden tests with such weights as ABT and denote the maximized burden statistic by $S_{ABT}$. It follows by plugging $W^*$ into (2.6) that

$$S_{ABT} = \frac{V^T G (\hat D R \hat D)^{-1} G^T V}{V^T \Phi V}. \qquad (2.8)$$
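The maximization behind (2.7) and (2.8) can be spelled out with the Cauchy-Schwarz inequality. The following is a sketch, assuming $A = \hat D R \hat D$ is positive definite and writing $c = V^T \Phi V > 0$ (a shorthand introduced here for brevity):

$$
\begin{aligned}
S_{FBT}(W) &= \frac{(W^T b)^2}{c\, W^T A W}
 = \frac{\left[(A^{1/2} W)^T (A^{-1/2} b)\right]^2}{c\,(A^{1/2} W)^T (A^{1/2} W)} \\
&\le \frac{(A^{-1/2} b)^T (A^{-1/2} b)}{c}
 = \frac{b^T A^{-1} b}{c}
 = \frac{V^T G (\hat D R \hat D)^{-1} G^T V}{V^T \Phi V}.
\end{aligned}
$$

Equality holds exactly when $A^{1/2} W$ is proportional to $A^{-1/2} b$, that is, when $W \propto A^{-1} b$, which recovers both the optimal direction in (2.7) and the maximized value in (2.8).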

The key question now pertains to the derivation of the null distribution of $S_{ABT}$.

Note that maximizing $S_{FBT}$ enables the weights to adapt to the direction of individual variant effects; however, this optimization does not necessarily lead to maximal power. Since $W^*$ itself is a function of $G$, $S_{ABT}$ may not follow a $\chi^2_1$ distribution under the null hypothesis. One may attempt to obtain its null distribution by integrating out $W^*$ through $\int f_0(S_{FBT} \mid W = W^*)\,dF(W^*)$, where $f_0$ is the PDF of $\chi^2_1$ (the null density of $S_{FBT}$), and $F$ denotes the distribution function of $W^*$ determined by (2.7). We notice, however, that the term $G(\hat D R \hat D)^{-1/2}$ in the numerator of (2.8) is a transformed genotype matrix of $m$ genetic variants with only across-row correlations. Therefore, by matrix algebra, $S_{ABT}$ can be considered as the sum of the MASTOR statistics from $m$ independent variants (after appropriate projection to eliminate LD correlations and standardize marginal variances), and hence follows a $\chi^2_m$ distribution under the null hypothesis. This finding greatly simplifies the p-value calculation in contrast to other permutation-based approaches (Han and Pan, 2010; Liu and Leal, 2010; Lin and Tang, 2011; Sha et al., 2012; Sha and Zhang, 2014; Fang, Zhang and Sha, 2014), and makes ABT suitable for real applications of whole genome analysis. When $R$ is unknown in (2.8), we replace it by an estimate $\hat R$ from $G$. We will show through a connection with kernel tests that in such a case the null distribution of $S_{ABT}$ can be determined by a mixture of $\chi^2_1$'s.
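A short numerical check may help make the construction concrete. Under simplifying assumptions (unrelated individuals, $\Phi = I$; known $R = I$; intercept-only null model with $\Sigma_0 = I$; toy data), the sketch below confirms that plugging the weights (2.7) into (2.6) reproduces (2.8), that randomly drawn weights never exceed it, and that $S_{ABT}$ equals a sum of $m$ squared standardized scores on the decorrelated genotypes. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 400, 5
maf = rng.uniform(0.1, 0.5, m)
G = rng.binomial(2, maf, size=(n, m)) / 2      # toy genotypes, unrelated, no LD
Phi, R = np.eye(n), np.eye(m)                  # assumed known
Y = rng.normal(size=n)
V = Y - Y.mean()                               # null residual for this toy setting

p_hat = G.mean(axis=0)                         # BLUE reduces to the column mean when Phi = I
D = np.diag(np.sqrt(p_hat * (1 - p_hat) / 2))
A = D @ R @ D
b = G.T @ V
c = V @ Phi @ V

def s_fbt(W):
    """Burden statistic (2.6) as a function of the weight vector."""
    return (W @ b) ** 2 / ((W @ A @ W) * c)

W_star = np.linalg.solve(A, b)                 # optimal weights (2.7), up to scale
S_abt = b @ np.linalg.solve(A, b) / c          # statistic (2.8)

# Decorrelated view: S_ABT as a sum of m squared standardized scores.
evals, evecs = np.linalg.eigh(A)
A_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
scores = (V @ G @ A_inv_sqrt) / np.sqrt(c)

random_best = max(s_fbt(rng.normal(size=m)) for _ in range(200))
print(S_abt, s_fbt(W_star), scores @ scores, random_best)
```

With known $R$, the p-value would then be the upper tail of a $\chi^2_m$ distribution evaluated at `S_abt`.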

As a simplified example using the configuration of unrelated individuals and common variants (see Scenario I in Section 3.1), Figure 1 shows the empirical cumulative distribution functions (ECDFs) of $S_{ABT}$ and $S_{FBT}$ obtained from 10,000 simulated replicates under the null. The simulation uses $n = 1{,}600$ individuals and a varying number of variants $m = 10, 50, 100$, with moderate correlation (0.5) between the latent standard normal random variables which are dichotomized to generate haplotypes in LD. The details are available in Section 3.1.

Fig 1. Empirical CDFs (shown by triangular markers) of $S_{ABT}$ and $S_{FBT}$ based on 10,000 simulated replicates under the null hypothesis, and their theoretical CDFs (shown by solid lines).

2.4. ABT as a kernel test with generalized Madsen-Browning weights

The novel finding of the null distribution of $S_{ABT}$ illustrated in the previous subsection motivates us to further explore the relation between ABT and family-based kernel tests. Kernel tests (also called “quadratic tests”, or “variance component tests”) assume random effects for the regression coefficients of multiple variants. Specifically, family-based kernel tests under our study design are based on a prospective model:

$$Y = Z\beta + G\gamma + \varepsilon, \quad \varepsilon \sim N(0, \sigma_e^2 I + \sigma_a^2 \Phi). \qquad (2.9)$$

In this model, $\gamma$ is a vector of the genetic effects, and its $j$th component, $1 \le j \le m$, follows a distribution with mean 0 and variance $w_j^2 \tau$. Here, $w_j$ is a fixed, prescribed weight for the $j$th variant effect, $1 \le j \le m$. Testing $\gamma = 0$ is equivalent to testing the common variance component $\tau = 0$. Following this formulation, Chen, Meigs and Dupuis (2013) developed the famSKAT statistic for testing association for quantitative traits in family samples as

$$S_{KT} = V^T G W W G^T V. \qquad (2.10)$$

Under the null hypothesis, $S_{KT}$ follows the distribution of $\sum_{i=1}^{m} \lambda_i \chi^2_{1,i}$, where the $\lambda_i$'s are the eigenvalues of the matrix $W G^T P G W$ and the $\chi^2_{1,i}$'s are independent $\chi^2_1$ random variables. We note that the $W$ matrix in the above formula plays the role of the square root of the $W$ matrix defined in Chen, Meigs and Dupuis (2013), and the $P$ matrix in our notation is connected to the $P_0$ matrix defined in Chen, Meigs and Dupuis (2013) by $P = \hat\Sigma^{-1} P_0 \hat\Sigma^{-1}$. Comparing our ABT statistic (2.8) with the famSKAT statistic (2.10), we see that ABT, though straightforwardly derived from the fixed-weight burden test under the retrospective setting, can be treated formally as a kernel test with the weight matrix

$$W^{\#} = (V^T \Phi V)^{-1/2} (\hat D R \hat D)^{-1/2}. \qquad (2.11)$$

This interesting finding shows that, using data-driven weights selected under the guidance of complex correlations Φ and R in the genotype data, burden tests and kernel tests reach a formal equivalence, regardless of fixed or random effect models, and the underlying prospective or retrospective models they are based on.

From the kernel test perspective, the weight matrix $W^{\#} \propto (\hat D R \hat D)^{-1/2}$. The diagonal components of this matrix, when all variants in the genetic region to be tested are in linkage equilibrium, i.e., $R = I$, become the commonly used Madsen-Browning weights (Madsen and Browning, 2009), i.e., $w_j \propto 1/\sqrt{\hat p_j(1-\hat p_j)}$. We therefore call $W^{\#}$ in Equation (2.11) the generalized Madsen-Browning (GMB) weights. We note that the GMB weights refer to the entire matrix $W^{\#}$, not just its diagonal components, because in the presence of LD the weight of an individual variant statistic should also be affected by the weights of other variant statistics at linked sites. Another way to see the analogy between the ABT statistic and the famSKAT statistic is through a two-step calculation: first eliminate LD correlations in the genotype matrix $G$ by right multiplying by $(\hat D R \hat D)^{-1/2}$, and then obtain a weighted sum of the individual score statistics for the transformed variants from the decorrelated genotype matrix $G(\hat D R \hat D)^{-1/2}$. The weight matrix used in the second step is the diagonal matrix $(V^T \Phi V)^{-1/2} I$. This indicates that ABT is indeed a kernel test on the decorrelated genotype matrix, using identical weights equal to $(V^T \Phi V)^{-1/2}$. It is not surprising that the weights of these individual score statistics do not depend on the marginal variances, because the decorrelated genotype matrix $G(\hat D R \hat D)^{-1/2}$ has an identity covariance matrix.

As with kernel tests, we may obtain an alternative form of the null distribution of ABT as $\sum_{i=1}^{m} \lambda_i \chi^2_{1,i}$, where the $\lambda_i$'s are the eigenvalues of the matrix $W^{\#} G^T P G W^{\#}$ and the $\chi^2_{1,i}$'s are independent $\chi^2_1$ random variables. We will call $\chi^2_m$ the theoretical null distribution of ABT, and call the mixture of $\chi^2_{1,i}$'s the practical null distribution of ABT. The latter is usually used when the complex genotypic covariance structure does not follow a Kronecker product form, i.e., $\Phi$ and $R$ are not separable (Fuentes, 2006), and $R$ is unknown and needs to be estimated from $G$. These two null distributions and their impact on the performance of ABT will be further explored in Section 3.3. Since ABT is derived from the burden test framework, yet takes the form of a kernel test, we expect it to be more appropriate than the unified method that linearly combines fixed-weight burden and kernel test statistics as $aS_{FBT} + (1-a)S_{KT}$, $0 \le a \le 1$ (Lee, Wu and Lin, 2012; Jiang and McPeek, 2013). Further evidence can be found in our simulation results in Section 3.4.
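The practical null distribution can be illustrated as follows. This is a sketch under simplifying assumptions: unrelated individuals ($\Phi = I$), $\Sigma_0 = I$ treated as known, $R$ estimated from $G$, and a Monte Carlo approximation of the mixture tail probability used here as a stand-in for an exact numerical method. All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 500, 5
maf = rng.uniform(0.1, 0.5, m)
G = rng.binomial(2, maf, size=(n, m)) / 2          # toy genotypes, unrelated individuals
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = Z @ np.array([1.0, 0.6]) + rng.normal(size=n)  # null trait, Sigma_0 = I assumed known
Phi = np.eye(n)

P = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)  # P matrix when Sigma_0 = I
V = P @ Y

p_hat = G.mean(axis=0)
D = np.diag(np.sqrt(p_hat * (1 - p_hat) / 2))
R_hat = np.corrcoef(G, rowvar=False)               # R estimated from G (Phi = I)
A = D @ R_hat @ D

# GMB weight matrix (2.11): (V' Phi V)^{-1/2} * A^{-1/2}
evals, evecs = np.linalg.eigh(A)
W_sharp = evecs @ np.diag(evals ** -0.5) @ evecs.T / np.sqrt(V @ Phi @ V)

S_abt = V @ G @ W_sharp @ W_sharp @ G.T @ V        # kernel form of (2.8)
lam = np.linalg.eigvalsh(W_sharp @ G.T @ P @ G @ W_sharp)

# Monte Carlo tail probability of sum_i lam_i * chi2_{1,i}.
draws = rng.chisquare(1, size=(100_000, m)) @ lam
p_practical = np.mean(draws >= S_abt)
print(np.round(lam, 3), S_abt, p_practical)
```

In this idealized setting the eigenvalues come out close to 1, so the mixture is close to $\chi^2_m$; the two null distributions are expected to differ more when $\hat R$ is a noisier estimate.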

3. Simulation studies

In this section, we perform simulation studies to assess the type I error rate of ABT and compare its power to that of FBT, family-based kernel test (abbreviated as KT), and MONSTER (Jiang and McPeek, 2013). As a by-product of these simulations, we also illustrate that the weights of ABT can adapt to the direction of the true genetic effects.

3.1. Data generation scenarios

We simulate data under nine genotypic scenarios (I–IX) using different variant (common/rare/mixed) and individual (unrelated/related/mixed) settings, as shown in the table below:

Table 1. Data generation scenarios

| Individual setting | Common variants | Rare variants | Mixed variants |
|---|---|---|---|
| Unrelated | I | II | III |
| Related | IV | V | VI |
| Mixed | VII | VIII | IX |

In these simulations, we generate genotype data for 1,600 individuals and a varying number of variants with minor allele frequencies (MAFs) sampled independently from a uniform distribution. For common (Scenarios I, IV, VII) and rare variants (Scenarios II, V, VIII), the MAFs are sampled from unif(0.1, 0.5) and unif(0.01, 0.1), respectively. For mixed variants (Scenarios III, VI, IX), we consider both common and rare variants in equal proportions. To simulate genotype data for unrelated individuals (Scenarios I, II, III), we first generate a latent continuous random sample for each individual from a multivariate normal distribution with mean 0 and compound symmetric covariance matrix $\Omega = (1-\eta)I + \eta 1 1^T$. These latent samples are then dichotomized by thresholding according to the variants' MAFs to form binary haplotypes. Finally, the genotype data $G$ are obtained by adding two independent haplotypes, inducing an LD structure that depends on the prespecified parameter $\eta$. Note that the LD covariance matrix depends on both $\eta$ and the variants' MAFs, as described in S.2 of Supplementary Materials. Also, after dichotomization the haplotype covariance matrix is no longer $\Omega$; however, the parameter $\eta$ can still be used to roughly indicate the LD correlation level among variants in $G$. We also note that this latent correlation coefficient $\eta$ should not be confused with the random effect parameter $\rho$ used in Jiang and McPeek (2013), as they have distinct meanings: $\eta$ describes the latent correlation of the LD structure among variants in the retrospective model, whereas $\rho$ captures the heterogeneity of the random effects in the prospective model.
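The latent-variable construction for unrelated individuals can be sketched as below. This is an illustration of the idea rather than the authors' simulation code: the function name `simulate_haplotypes` is hypothetical, the compound symmetric latent covariance is produced with a shared factor, and genotypes are coded as half the allele count to match Section 2.

```python
from statistics import NormalDist
import numpy as np

def simulate_haplotypes(n_hap, maf, eta, rng):
    """Binary haplotypes from thresholded latent normals with covariance (1-eta)I + eta*11^T."""
    m = len(maf)
    shared = rng.normal(size=(n_hap, 1))            # common factor, induces latent correlation eta
    own = rng.normal(size=(n_hap, m))
    latent = np.sqrt(eta) * shared + np.sqrt(1 - eta) * own
    cut = np.array([NormalDist().inv_cdf(float(p)) for p in maf])
    return (latent < cut).astype(int)               # allele 1 occurs with probability maf_j

rng = np.random.default_rng(3)
n, m, eta = 1600, 10, 0.5
maf = rng.uniform(0.1, 0.5, m)                      # common-variant setting
G = 0.5 * (simulate_haplotypes(n, maf, eta, rng) + simulate_haplotypes(n, maf, eta, rng))

R_emp = np.corrcoef(G, rowvar=False)
mean_ld = (R_emp.sum() - m) / (m * (m - 1))         # average off-diagonal correlation
print(np.round(G.mean(axis=0) - maf, 3), round(mean_ld, 3))
```

The average off-diagonal correlation is positive but smaller than $\eta$, in line with the remark that dichotomization changes the covariance and that $\eta$ only roughly indicates the LD level.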

Simulations for related individuals (Scenarios IV, V, VI) are based on a pedigree configuration with 100 outbred, 3-generation families, each containing 16 individuals related as in Figure 2. In order to simulate $G$ with correlations across both samples and variants, we first generate multiple-variant genotype data independently for founders, as described above. The genotype data for non-founders are then generated by Mendelian “gene-dropping” along generations, assuming no recombination within haplotypes. In the mixed individual setting (Scenarios VII, VIII, IX), the samples are drawn from 80 families (related as in Figure 2) and 320 unrelated individuals.

Fig 2. Basic family structure of 16 members coming from 3 generations, used in our simulation studies to generate data from related individuals.
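The gene-dropping step for non-founders can be illustrated on a minimal pedigree. The sketch below does not reproduce the 16-member structure of Figure 2; it uses a hypothetical father, mother, child, and grandchild, with founder haplotypes drawn at random purely for illustration, and passes whole haplotypes without recombination.

```python
import numpy as np

def drop_genes(father, mother, rng):
    """Child haplotype pair: one whole haplotype from each parent, no recombination."""
    return np.stack([father[rng.integers(2)], mother[rng.integers(2)]])

rng = np.random.default_rng(4)
m = 6
# Founders: arrays of shape (2, m), one row per haplotype (hypothetical random alleles).
father = rng.integers(0, 2, size=(2, m))
mother = rng.integers(0, 2, size=(2, m))
spouse = rng.integers(0, 2, size=(2, m))       # a married-in founder for the next generation

child = drop_genes(father, mother, rng)
grandchild = drop_genes(child, spouse, rng)

genotype = 0.5 * child.sum(axis=0)             # half the count of allele 1 at each site
print(child, grandchild, genotype, sep="\n")
```

In the full simulation, the founder haplotypes would come from the latent-normal procedure for unrelated individuals, so that LD among variants is inherited along with the haplotypes.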

The quantitative trait data are generated from model (2.9). We assume that the design matrix $Z$ includes the intercept and a covariate sampled independently from the standard normal distribution, with the corresponding coefficient vector $\beta = (1, 0.6)^T$. For the genotype data $G$ generated under Scenarios I–IX, we determine the genetic effects of the variants (denoted by $\gamma$, a vector of length $m$) in each scenario. In the simulation of type I error, we set $\gamma = 0$. For the simulation of power, the genetic effects are generated under four settings with different proportions of risk/protective/neutral variants, where the ratio of risk to protective variants varies from balanced (1:1) to unbalanced (2:1). The covariance of the quantitative trait $Y$ (or the error term $\varepsilon$) takes a variance-components form, where $\sigma_e^2$ represents variance due to random measurement error and $\sigma_a^2$ stands for variance attributed to additive polygenic random effects. The settings of the parameters $\sigma_e^2$, $\sigma_a^2$, and $\gamma$ in the simulations are listed in Table 2. For better illustration, we set larger magnitudes for $\gamma$ in the rare variant Scenarios (II, V, VIII) to achieve enough power.

Table 2.

Parameter settings of variance components and genetic effects in simulations

Scenario
σe2
σa2
R/P/N under Ha Settings of γ under Ha
I, III 10 0 45/45/10 γRiidunif(0.05,0.2),
γPiidunif(-0.2,-0.05)
50/40/10 γRiidunif(0.05,0.175),
γPiidunif(-0.23,-0.05)
55/35/10 γRiidunif(0.05,0.155),
γPiidunif(-0.271,-0.05)
60/30/10 γRiidunif(0.05,0.138),
γPiidunif(-0.325,-0.05)

II 10 0 45/45/10 γRiidunif(0.05,0.5),
γPiidunif(-0.5,-0.05)
50/40/10 γRiidunif(0.05,0.445),
γPiidunif(-0.569,-0.05)
55/35/10 γRiidunif(0.05,0.4),
γPiidunif(-0.657,-0.05)
60/30/10 γRiidunif(0.05,0.363),
γPiidunif(-0.775,-0.05)

IV, VI, VII, IX 2 2 45/45/10 γRiidunif(0.05,0.2),
γPiidunif(-0.2,-0.05)
50/40/10 γRiidunif(0.05,0.175),
γPiidunif(-0.23,-0.05)
55/35/10 γRiidunif(0.05,0.155),
γPiidunif(-0.271,-0.05)
60/30/10 γRiidunif(0.05,0.138),
γPiidunif(-0.325,-0.05)

V, VIII 2 2 45/45/10 γRiidunif(0.05,0.5),
γPiidunif(-0.5,-0.05)
50/40/10 γRiidunif(0.05,0.445),
γPiidunif(-0.569,-0.05)
55/35/10 γRiidunif(0.05,0.4),
γPiidunif(-0.657,-0.05)
60/30/10 γRiidunif(0.05,0.363),
γPiidunif(-0.775,-0.05)

Note: Variance components $\sigma_e^2$ and $\sigma_a^2$ represent variances attributed to random measurement error and additive polygenic random effects in the phenotypic model. Under $H_a$, each scenario contains four settings of genetic variants depending on the percentages of risk, protective, and neutral variants (R/P/N). The genetic effects of risk and protective variants are denoted by $\gamma_R$ and $\gamma_P$, respectively.
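The trait-generation step can be sketched as follows for the balanced 45/45/10 setting of Scenario I. Several pieces are assumptions made for illustration and are not confirmed by this section: that model (2.9) has the additive form Y = Zβ + Gγ + ε, that the variance-components covariance is $\sigma_e^2 I + \sigma_a^2 K$ for a relatedness matrix K (the identity below, for unrelated individuals), and the toy genotype matrix. All function names are hypothetical.

```python
import numpy as np

def sample_effects(m, n_risk, n_prot, risk_range, prot_range, rng):
    """Effects ordered as risk, protective, then neutral (zero) variants."""
    gamma = np.zeros(m)
    gamma[:n_risk] = rng.uniform(*risk_range, size=n_risk)
    gamma[n_risk:n_risk + n_prot] = rng.uniform(*prot_range, size=n_prot)
    return gamma

def simulate_trait(G, gamma, relatedness, sigma_e2, sigma_a2, rng):
    """Y = Z beta + G gamma + eps, with an assumed variance-components error."""
    n = G.shape[0]
    # Intercept plus one standard normal covariate, beta = (1, 0.6)^T.
    Z = np.column_stack([np.ones(n), rng.standard_normal(n)])
    beta = np.array([1.0, 0.6])
    # Assumed covariance form: sigma_e^2 I + sigma_a^2 K.
    cov = sigma_e2 * np.eye(n) + sigma_a2 * relatedness
    eps = np.linalg.cholesky(cov) @ rng.standard_normal(n)
    return Z @ beta + G @ gamma + eps

rng = np.random.default_rng(4)                       # arbitrary seed
n, m = 200, 100                                      # small n to keep the sketch fast
G_toy = rng.binomial(2, 0.3, size=(n, m))            # placeholder genotypes
gamma = sample_effects(m, 45, 45, (0.05, 0.2), (-0.2, -0.05), rng)
y = simulate_trait(G_toy, gamma, np.eye(n), sigma_e2=10.0, sigma_a2=0.0, rng=rng)
```

Setting `gamma` to a zero vector gives the null model used for the type I error simulations.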

3.2. Assessment of type I error

To assess type I error, we generate 10,000 simulated data replicates from the trait model (2.9) under the null γ = 0, for each scenario and each combination of m = 10, 50, 100 and η = 0.2, 0.5, 0.8. Based on the association testing results using ABT, we obtain the empirical type I error rates at nominal levels 0.01 and 0.05, as presented in Table 3. Under all scenarios and for all nine combinations of m and η, the empirical type I error rates of ABT are not significantly different from the nominal levels, based on a z-test at level 0.05. This indicates that ABT correctly controls the type I error.

Table 3.

Empirical type I error of ABT under various scenarios and nominal levels

Nominal level = 0.01

| m | η | I | II | III | IV | V | VI | VII | VIII | IX |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.2 | 0.0105 | 0.0106 | 0.0093 | 0.0109 | 0.0100 | 0.0108 | 0.0087 | 0.0090 | 0.0099 |
| 10 | 0.5 | 0.0112 | 0.0094 | 0.0102 | 0.0098 | 0.0115 | 0.0092 | 0.0108 | 0.0091 | 0.0087 |
| 10 | 0.8 | 0.0111 | 0.0094 | 0.0096 | 0.0100 | 0.0083 | 0.0104 | 0.0087 | 0.0089 | 0.0083 |
| 50 | 0.2 | 0.0103 | 0.0089 | 0.0100 | 0.0090 | 0.0097 | 0.0109 | 0.0089 | 0.0091 | 0.0087 |
| 50 | 0.5 | 0.0085 | 0.0104 | 0.0099 | 0.0106 | 0.0088 | 0.0083 | 0.0094 | 0.0080 | 0.0098 |
| 50 | 0.8 | 0.0102 | 0.0092 | 0.0094 | 0.0089 | 0.0095 | 0.0080 | 0.0083 | 0.0089 | 0.0086 |
| 100 | 0.2 | 0.0092 | 0.0082 | 0.0083 | 0.0089 | 0.0087 | 0.0081 | 0.0097 | 0.0080 | 0.0089 |
| 100 | 0.5 | 0.0088 | 0.0087 | 0.0082 | 0.0102 | 0.0085 | 0.0082 | 0.0086 | 0.0092 | 0.0098 |
| 100 | 0.8 | 0.0087 | 0.0088 | 0.0090 | 0.0087 | 0.0096 | 0.0087 | 0.0085 | 0.0080 | 0.0085 |

Nominal level = 0.05

| m | η | I | II | III | IV | V | VI | VII | VIII | IX |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.2 | 0.0490 | 0.0512 | 0.0499 | 0.0499 | 0.0517 | 0.0509 | 0.0518 | 0.0473 | 0.0525 |
| 10 | 0.5 | 0.0475 | 0.0495 | 0.0492 | 0.0487 | 0.0497 | 0.0497 | 0.0505 | 0.0504 | 0.0513 |
| 10 | 0.8 | 0.0500 | 0.0484 | 0.0488 | 0.0521 | 0.0467 | 0.0513 | 0.0499 | 0.0474 | 0.0487 |
| 50 | 0.2 | 0.0510 | 0.0484 | 0.0497 | 0.0461 | 0.0476 | 0.0475 | 0.0468 | 0.0462 | 0.0473 |
| 50 | 0.5 | 0.0483 | 0.0500 | 0.0465 | 0.0459 | 0.0478 | 0.0504 | 0.0472 | 0.0465 | 0.0484 |
| 50 | 0.8 | 0.0487 | 0.0493 | 0.0486 | 0.0487 | 0.0482 | 0.0459 | 0.0480 | 0.0466 | 0.0468 |
| 100 | 0.2 | 0.0480 | 0.0487 | 0.0473 | 0.0474 | 0.0468 | 0.0475 | 0.0477 | 0.0466 | 0.0476 |
| 100 | 0.5 | 0.0460 | 0.0460 | 0.0458 | 0.0501 | 0.0462 | 0.0462 | 0.0462 | 0.0482 | 0.0481 |
| 100 | 0.8 | 0.0458 | 0.0479 | 0.0479 | 0.0481 | 0.0496 | 0.0465 | 0.0469 | 0.0490 | 0.0469 |

Note: Each entry is a type I error rate estimate, defined as the proportion of p-values smaller than the nominal level under the null hypothesis, based on 10,000 simulated genotype replicates. The large-sample 95% CIs for nominal levels 0.01 and 0.05 are [0.0080, 0.0120] and [0.0457, 0.0543], respectively.
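The intervals quoted in the note follow from the normal approximation to a binomial proportion, α ± 1.96·sqrt(α(1 − α)/N) with N = 10,000 replicates. A minimal sketch (the function name is hypothetical):

```python
import math

def nominal_interval(alpha, n_rep, z=1.96):
    """Large-sample 95% interval for an empirical rate when the true rate is alpha."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - half_width, alpha + half_width

lo01, hi01 = nominal_interval(0.01, 10_000)
lo05, hi05 = nominal_interval(0.05, 10_000)
print(f"[{lo01:.4f}, {hi01:.4f}]  [{lo05:.4f}, {hi05:.4f}]")
```

An empirical rate falling inside the interval for its nominal level is not significantly different from that level under the two-sided z-test at level 0.05.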

3.3. Conservativeness of ABT and analysis of eigenvalues

It is worth noting that the type I error rates presented in Table 3 are calculated from the practical null distribution of ABT rather than from the theoretical null distribution. In practice, we found that when correlations exist across both samples and variants, using the theoretical null distribution $\chi_m^2$ tends to make ABT conservative. We include the type I error results for ABT based on $\chi_m^2$ in S.3 of the Supplementary Materials. There, we observe smaller empirical type I error rates as η increases, or when related individuals are included in the sample (Scenarios IV–IX). This phenomenon may be attributable to bias in the estimation of DRD. Comparing equations (2.8) and (2.2) reveals that ABT is essentially a generalized, multiple-variant MASTOR test; when R is known, the m transformed variants from the decorrelated genotype matrix $G(\hat{D}R\hat{D})^{-1/2}$ are independent, and hence yield a $\chi_m^2$ null. However, in practice (and also in our simulations), when the complex genotypic covariance structure does not follow a Kronecker product form and the LD covariance needs to be estimated, the across-column sample covariance of $\Phi^{-1/2}G$ may not provide an accurate estimate of DRD. Hence, the across-column covariance matrix of $G(\hat{D}R\hat{D})^{-1/2}$ is not an identity matrix, though it is nearly so in some cases, e.g., Scenarios I, II, and III with unrelated individuals only. The residual across-column correlation among the transformed variants results in an empirical null distribution $\sum_{i=1}^{m} \lambda_i \chi_{1,i}^2$ with $\sum_{i=1}^{m} \lambda_i < m$, which causes the conservativeness of ABT based on $\chi_m^2$.
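The effect of referring a statistic to $\chi_m^2$ when its null is really a weighted sum of $\chi_1^2$ variables can be seen with a small Monte Carlo sketch. The eigenvalues, the observed statistic, and the number of draws below are hypothetical, and Monte Carlo is used only for simplicity; it is not claimed to be how the practical null distribution is evaluated in the paper.

```python
import numpy as np

def mixture_pvalue(stat, lam, rng, n_draws=200_000):
    """Monte Carlo tail probability P(sum_i lam_i * chi2_1 >= stat)."""
    lam = np.asarray(lam, dtype=float)
    draws = rng.chisquare(1, size=(n_draws, lam.size)) @ lam
    return float(np.mean(draws >= stat))

rng = np.random.default_rng(2)        # arbitrary seed
m = 10
lam = np.full(m, 0.8)                 # hypothetical eigenvalues, sum = 8 < m
stat = 15.0                           # hypothetical observed statistic

p_mixture = mixture_pvalue(stat, lam, rng)         # weighted chi-square null
p_chi2_m = mixture_pvalue(stat, np.ones(m), rng)   # all weights 1 is chi2_m
```

With these inputs the $\chi_m^2$ reference gives a p-value roughly three times larger than the mixture reference, so a test calibrated against $\chi_m^2$ rejects too rarely when the eigenvalues sum to less than m.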

To better illustrate how the complex covariance structure in genotype data influences the eigenvalues $\lambda_i$ of the matrix $W^{\#} G^T P G W^{\#}$, we present boxplots of $\sum_{i=1}^{m} \lambda_i$ in Figure 3 (panels A and B), generated from 10,000 simulated replicates for m = 10 and η = 0.2, 0.5, 0.8 following Scenarios IX and I, respectively. We observe that, for Scenario IX where complex correlations exist, $\sum_{i=1}^{m} \lambda_i$ deviates from 10 and shows a decreasing trend as η increases (Figure 3, panel A), due to the residual correlation among the transformed variants. In contrast, for Scenario I where only LD correlation exists, DRD can be accurately estimated, and hence the transformed variants can be treated as independent, so that the null distribution of ABT is well approximated by $\chi_m^2$. This can be seen from $\sum_{i=1}^{m} \lambda_i \approx m$ in Figure 3, panel B. These results provide a reasonable explanation for the conservativeness of ABT based on $\chi_m^2$, as observed in the additional type I error results in S.3 of the Supplementary Materials. In Figure 3, panels C and D, we further present boxplots of the individual $\lambda_i$'s (ordered from largest to smallest) generated from the 10,000 simulated replicates for m = 10 and η = 0.8 from Scenarios IX and I, respectively. When complex genotypic correlation exists (Scenario IX with η = 0.8), the individual $\lambda_i$'s are dispersed around 1, with a majority less than 1. On the other hand, when only LD correlation exists (Scenario I with η = 0.8), all individual $\lambda_i$'s are very close to 1. This shows that the across-column covariance matrix of $G(\hat{D}R\hat{D})^{-1/2}$ is nearly an identity matrix in this case.
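The link between covariance estimation and the eigenvalues can be illustrated with a generic whitening example. This is a toy sketch on continuous data with a compound symmetric covariance, not the ABT decorrelation itself: the data, the deliberately misspecified covariance `S_bad`, and all names are assumptions chosen for illustration.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(3)            # arbitrary seed
n, m, eta = 5000, 5, 0.8
# Continuous stand-in for genotypes with covariance (1 - eta) I + eta 1 1^T.
X = (np.sqrt(eta) * rng.standard_normal((n, 1))
     + np.sqrt(1.0 - eta) * rng.standard_normal((n, m)))

# Whitening with an accurate covariance estimate: eigenvalues all equal 1.
S_hat = np.cov(X, rowvar=False)
eig_good = np.linalg.eigvalsh(np.cov(X @ inv_sqrt(S_hat), rowvar=False))

# Whitening with an inaccurate covariance (here eta = 0.5 instead of 0.8):
# the transformed columns stay correlated and the eigenvalues spread out.
S_bad = 0.5 * np.eye(m) + 0.5 * np.ones((m, m))
eig_bad = np.linalg.eigvalsh(np.cov(X @ inv_sqrt(S_bad), rowvar=False))
```

In this toy case the misspecified whitening leaves most eigenvalues below 1 and their sum below m, which mirrors the pattern described for Scenario IX; a different misspecification could push the sum in the other direction.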

Fig 3.

Variability of eigenvalues in Scenarios IX and I. Panel A: Boxplots of the sum of eigenvalues from 10,000 simulated replicates in Scenario IX for m = 10 and η = 0.2, 0.5, 0.8; Panel B: Boxplots of the sum of eigenvalues from 10,000 simulated replicates in Scenario I for m = 10 and η = 0.2, 0.5, 0.8; Panel C: Boxplots of individual eigenvalues from 10,000 simulated replicates in Scenario IX for m = 10 and η = 0.8; Panel D: Boxplots of individual eigenvalues from 10,000 simulated replicates in Scenario I for m = 10 and η = 0.8.

3.4. Power comparison

To assess power, we simulate 1,000 data replicates using the parameter settings described in Table 2, for each scenario with m = 100 and η = 0.2, 0.5, 0.8. Here, we use equal weights for the FBT, KT, and MONSTER statistics. The MONSTER test is constructed on a grid of 11 equally spaced points, $\rho_1 = 0, \rho_2 = 0.1, \ldots, \rho_{10} = 0.9, \rho_{11} = 1$, as in its original paper (Jiang and McPeek, 2013). The empirical power of FBT, KT, MONSTER, and ABT for typical scenarios (II, IV, V, IX) at α = 0.05 is shown in Figure 4. In all these scenarios, ABT outperforms the other three methods for almost all settings of η and γ. In particular, we observe that FBT loses power in balanced settings of γ, KT has good overall performance, and MONSTER, as a linear combination of FBT and KT, usually has slightly higher power than KT. ABT has power comparable to KT and MONSTER in weak LD settings (η = 0.2), and higher power than the others when η is moderate (0.5) or large (0.8). This is expected, since ABT incorporates the LD covariance information through the retrospective model whereas the other methods cannot.
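Empirical power is the fraction of replicates whose p-value falls below α, and with 1,000 replicates its binomial standard error is at most about 0.016. A small sketch, using placeholder p-values drawn from a Beta distribution purely so that the code runs (they are not results from the paper); the function name is hypothetical:

```python
import numpy as np

def empirical_power(pvalues, alpha=0.05):
    """Rejection rate at level alpha and its binomial standard error."""
    p = np.asarray(pvalues, dtype=float)
    power = float(np.mean(p < alpha))
    se = float(np.sqrt(power * (1.0 - power) / p.size))
    return power, se

# The 11-point grid 0, 0.1, ..., 1 used for MONSTER.
rho_grid = np.linspace(0.0, 1.0, 11)

rng = np.random.default_rng(6)                    # arbitrary seed
placeholder_p = rng.beta(0.3, 1.0, size=1000)     # stand-in p-values only
power, se = empirical_power(placeholder_p)
```

The standard error gives a sense of how large a gap between two power curves in a figure needs to be before it exceeds simulation noise.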

Fig 4.

Empirical power of FBT, KT, MONSTER, and ABT, at α = 0.05, based on 1,000 simulated replicates with m = 100. Panel A: Scenario IV; Panel B: Scenario II; Panel C: Scenario V; Panel D: Scenario IX.

3.5. Adaptability of weights W* to the direction of genetic effects

One clear advantage of ABT over fixed-weight tests is that the weights W* adapt to the direction of the true genetic effects. In Section 2.3, we have shown that W* maximizes the test statistic of FBT. To understand how W* helps gain power relative to prescribed weights, we compare the signs of W* to those of the genetic effects γ using the simulated data sets from the power analysis. Figure 5, panel A presents boxplots of the weights W* across the 1,000 replicates in Scenario IV for m = 40, η = 0.8 with the balanced setting of γ. Under this setting, the first 45% of the components of γ are positive, the next 45% are negative, and the remaining 10% are zero. The boxplots show that, on average, the weights W* track the direction of the true genetic effects, resulting in a stronger association for the weighted-sum genetic score. In contrast, if one adopts FBT with the Madsen-Browning weights (which are all positive), the effects of risk and protective variants partially cancel in the linear combination, thereby weakening the association of the weighted-sum genetic score. Figure 5, panel B shows the average mismatch rates (i.e., the proportion of causal variants with γ × W* < 0, where × denotes the component-wise product of two vectors) across the 1,000 replicates in Scenarios II, IV, V, and IX for m = 100, η varying from 0 to 0.9, and the balanced setting of γ. We observe that for these scenarios the average mismatch rates are about 22%-34% and increase slowly with η. Intuitively, lower mismatch rates indicate better adaptability of W*, leading to higher power. This intuition is supported by the consistently lower average mismatch rate in Scenarios IV and V (blue and green lines compared to black and red lines in Figure 5, panel B) and the consistently higher power of ABT in Scenarios IV and V (panels A and C in Figure 4 compared to panels B and D).
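The mismatch rate defined above is straightforward to compute. In this sketch the effect vector and both weight vectors are small hypothetical examples, not simulated or estimated values from the paper:

```python
import numpy as np

def mismatch_rate(gamma, weights):
    """Proportion of causal variants (gamma != 0) whose weight has the opposite sign."""
    gamma = np.asarray(gamma, dtype=float)
    weights = np.asarray(weights, dtype=float)
    causal = gamma != 0
    return float(np.mean(gamma[causal] * weights[causal] < 0))

gamma = np.array([0.1, 0.2, -0.1, -0.2, 0.0])        # 2 risk, 2 protective, 1 neutral
w_adaptive = np.array([0.5, -0.1, -0.4, -0.3, 0.2])  # hypothetical data-driven weights
w_positive = np.ones(5)                              # all-positive prescribed weights

rate_adaptive = mismatch_rate(gamma, w_adaptive)     # 1 of 4 causal variants mismatched
rate_positive = mismatch_rate(gamma, w_positive)     # both protective variants mismatched
```

Any all-positive weighting mismatches every protective variant, so in a balanced setting its mismatch rate is 50% regardless of the data, which is the cancellation effect described for FBT with Madsen-Browning weights.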

Fig 5.

Illustration of adaptability of W* to the direction of true genetic effects. Panel A: Boxplots of weights W* across 1,000 simulated replicates in Scenario IV for m = 40 and η = 0.9 with balanced setting of γ; Panel B: Average mismatch rates across 1,000 simulated replicates in Scenarios II, IV, V, and IX for m = 100, η varying from 0 to 0.9, and under balanced setting of γ.

4. Application: Association Analysis of the GO-ESP data

The NHLBI "Grand Opportunity" Exome Sequencing Project (GO-ESP) is a study for identifying genetic variants in coding regions (exons) of the human genome that are associated with heart, lung, and blood diseases. By pioneering the application of next-generation exome sequencing across diverse, richly phenotyped populations, this project aims to discover novel genes and mechanisms contributing to heart, lung, and blood disorders. Our use of the GO-ESP data was approved by the Institutional Review Board of Virginia Tech. In the GO-ESP (dbGaP Study Accession: phs000401.v12.p10), a total of 499 Framingham Heart Study (FHS) participants were selected for exome sequencing and sent to two sequencing centers, BroadGO and SeattleGO, for processing. As these individuals may come from multiple families and LD may exist among densely genotyped variants, we applied ABT to test for gene-based associations with fasting glucose levels. After applying QC metrics, 464 individuals are represented in the GO-ESP exome sequencing data in dbGaP, from which we excluded 5 individuals with missing fasting glucose measurements. Among the remaining 459 individuals, 200 are from 76 families and the rest are unrelated. We obtained fasting glucose levels of the sampled individuals at multiple time points (some individuals are from cohort 2 of the FHS with at most 8 measurements, whereas others are from cohort 3 with at most 2 measurements), and treated the average fasting glucose level (log-transformed) as the quantitative trait. We then adjusted the quantitative trait for age and sex, and tested its association with the gene regions located on all 22 chromosomes.
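The trait construction described above (averaging repeated measurements, log-transforming, and adjusting for age and sex) might look like the following sketch. Two choices are assumptions, since the text does not settle them: the logarithm is taken of the per-individual average rather than averaging logged values, and the adjustment is done by taking residuals from a least-squares regression. The six individuals and their values are made up for illustration.

```python
import numpy as np

def adjusted_trait(measurements, age, sex):
    """Log of the per-individual mean, residualized on intercept, age, and sex."""
    # Individuals may have different numbers of measurements.
    mean_level = np.array([np.mean(x) for x in measurements])
    y = np.log(mean_level)                     # assumption: log of the average
    X = np.column_stack([np.ones(len(y)), age, sex])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef                        # residuals used as the adjusted trait

rng = np.random.default_rng(5)                 # arbitrary seed
# Hypothetical repeated measurements: between 1 and 8 per individual.
measurements = [rng.uniform(80, 110, size=k) for k in (8, 2, 5, 1, 2, 8)]
age = np.array([61.0, 35.0, 58.0, 40.0, 29.0, 66.0])
sex = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
trait = adjusted_trait(measurements, age, sex)
```

An alternative with the same intent is to keep the unadjusted trait and include age and sex as columns of the covariate matrix in the association model.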

Table 4 reports the ten most significant p-values for testing the association of fasting glucose level with gene regions on all 22 chromosomes in the GO-ESP data, using ABT, FBT, KT, and MONSTER. Among these top-ranked genes, RNF214 was previously identified as one of the candidate genes that affect the quantitative variations of plasma lipid and glucose levels and obesity traits in F2 mice (Stewart et al. 2010). It has been reported that variants from the RNF214/BACE1 gene are associated with cardiovascular events among women with migraine (Schurks et al. 2011). OR2F2 was found to be differentially expressed in pregnant women with positive screening and negative diagnosis for gestational diabetes (Rafael et al. 2016). SMAD5 was identified as a regulator of insulin signaling that, specifically, regulates Akt2 expression and insulin-induced glucose uptake in L6 myotubes (Fernando et al. 2010). The UBASH3A gene encodes one of two family members belonging to the T-cell ubiquitin ligand family. This gene has been extensively studied and recognized as a risk factor for both type 1 diabetes (Jeffrey et al. 2009; Struan et al. 2009) and celiac disease (Plagnol et al. 2011). The TNFSF13B gene was found in a gene co-expression network associated with plasma HDL and glucose levels (Marcel et al. 2010).

Table 4.

P-values for association of fasting glucose level with all gene regions on 22 chromosomes in the GO-ESP data

| Gene | Chr | # genotyped SNPs | p-value (ABT) | p-value (FBT) | p-value (KT) | p-value (MONSTER) |
|---|---|---|---|---|---|---|
| EDF1 | 9 | 7 | 2.43 × 10^−11 | 3.22 × 10^−7 | 9.11 × 10^−11 | 1.18 × 10^−4 |
| RNF214 | 11 | 19 | 3.93 × 10^−11 | 7.51 × 10^−8 | 1.21 × 10^−12 | 7.73 × 10^−5 |
| CENPL | 1 | 4 | 7.83 × 10^−11 | 1.77 × 10^−8 | 8.21 × 10^−11 | 1.65 × 10^−5 |
| OR2F2 | 7 | 8 | 1.15 × 10^−9 | 4.33 × 10^−1 | 3.54 × 10^−5 | 1.64 × 10^−6 |
| SMAD5 | 5 | 9 | 2.99 × 10^−9 | 1.55 × 10^−3 | 7.66 × 10^−6 | 9.26 × 10^−7 |
| UBASH3A | 21 | 41 | 4.44 × 10^−9 | 9.46 × 10^−2 | 1.16 × 10^−5 | 3.16 × 10^−5 |
| IGLL3P | 22 | 13 | 4.92 × 10^−9 | 8.48 × 10^−3 | 7.02 × 10^−9 | 5.56 × 10^−5 |
| TNFSF13B | 13 | 8 | 9.30 × 10^−9 | 8.27 × 10^−5 | 3.03 × 10^−7 | 3.88 × 10^−5 |
| MIF4GD | 17 | 8 | 1.55 × 10^−8 | 2.76 × 10^−4 | 3.14 × 10^−5 | 1.55 × 10^−5 |
| MILR1 | 17 | 6 | 2.21 × 10^−8 | 9.67 × 10^−5 | 6.95 × 10^−10 | 8.66 × 10^−6 |

Note: The quantitative trait, average fasting glucose level (log-transformed), was adjusted for age and sex. For FBT, KT, and MONSTER, Madsen-Browning weights were used. MIM numbers of genes: EDF1 [605107], CENPL [611503], SMAD5 [603110], UBASH3A [605736], TNFSF13B [603969], MIF4GD [612072].

5. Conclusions

Burden tests and kernel tests are arguably the two major classes of methods for multiple-variant association analysis that are widely used in genetic association studies. Though each has its advantages, they share two deficiencies when applied to genotype data with complex correlations: (i) existing methods in both classes are mostly developed under the prospective regression setting (which focuses on characterizing the conditional expectation of random trait measurements given covariates and genotypes), where accounting for LD correlations among variants is not straightforward, and (ii) the weights adopted in both classes are usually prescribed, so they cannot adapt to the direction of individual variant effects and therefore may not achieve optimal performance for association testing. In view of these issues, we first develop a retrospective, fixed-weight burden test that incorporates genotypic correlations across both samples and variants, and then employ data-driven weights to maximize the statistic of this fixed-weight burden test. The resulting adaptive-weight burden test can be easily constructed by first projecting the genotype data to eliminate LD correlations, and then combining independent MASTOR tests on the transformed variants. The ABT sheds light on a number of aspects, as described below.

First, by using a retrospective setting and treating genotype data as random, ABT is able to directly model correlations across both samples and variants, which exist universally in genotype data collected for current genetic association studies. This is highly desirable because very few existing methods make full use of the valuable genotypic correlation information arising from both Mendelian inheritance and non-random association of alleles at different loci. A literature search revealed that MONSTER is perhaps the closest existing method that accounts for such complex genetic correlations. However, because of its prospective setting, MONSTER cannot model LD correlation among variants directly, though this information can be thought of as being carried implicitly through the covariance $R_\rho$ of the variant random effects vector β. Moreover, its use in real applications is largely restricted by the compound symmetric form of the covariance structure $R_\rho$. Modeling phenotypes as fixed is also theoretically appealing because it makes fewer assumptions about the phenotypic covariance structure (Price et al., 2010b). These facts show the advantage of retrospective modeling over prospective modeling in capturing LD correlation.

Second, ABT adopts data-driven weights in the collapsing procedure. Unlike many existing weighting strategies for burden tests or kernel tests, which lack theoretical justification, these weights are chosen to maximize the fixed-weight burden test statistic, which gives ABT a principled basis for gaining power. The calculation of the weights W* is straightforward, and the resulting ABT statistic is shown to have an explicit null distribution. Through extensive simulations, we demonstrated that ABT controls the type I error and achieves higher power than the three competing methods FBT, KT, and MONSTER in almost all scenarios with moderate or large LD correlation. Further investigation reveals that the weights of ABT are able to adapt to the direction of the true genetic effects. This overcomes the main drawback of fixed-weight burden tests, which lose power in the presence of both risk and protective variants.

Third, although ABT is derived from a burden test perspective, we showed that it is formally equivalent to a family-based kernel test with the GMB weights. This finding can be used to guide the selection of weights in traditional kernel tests to accommodate LD correlation among variants. It also motivates statistical geneticists to reconsider and explore in depth the relation between burden tests and kernel tests. Our simulations demonstrate that when genotypic correlations exist across both samples and variants, using the theoretical null distribution $\chi_m^2$ results in a conservative test. A plausible explanation is the covariance estimation bias in the decorrelation procedure, and the additional eigenvalue analysis in our simulation study supports this conjecture. Hence, we suggest using the practical null distribution (a mixture of $\chi_1^2$'s) when complex genotypic correlations are not separable and when the LD correlation R needs to be estimated from the genotype data G.

Finally, as a retrospective association test, ABT is expected to have several additional advantages, for example, in borrowing information from partially informative data and in maintaining robustness to phenotype model misspecification. The present work is an initial study towards a complete exploration of the underpinnings and characteristics of this newly developed method. As more high-throughput sequencing data on samples with complex correlation structure become available in recent GWASs, it is important to investigate the performance of ABT in detecting association for rare variants, and for samples with population structure or cryptic relatedness. This will be pursued elsewhere.

Acknowledgments

The authors would like to thank the Editor, the Associate Editor and the Reviewers, whose constructive and insightful comments contributed to a significantly improved version of this article. Bandyopadhyay’s research was partially supported by grants R03DE023372 and R01DE024984 from the NIH/NIDCR and VCU Massey Cancer Center Support Grant P30CA016059 from NIH/NCI.

Footnotes

SUPPLEMENTARY MATERIAL

Mathematical justifications and additional results (.pdf). The supplementary materials of the paper are organized as follows. Supplement S.1 provides the theoretical justification of the covariance matrix of the genetic burden score X. Supplement S.2 derives the LD covariance of the simulated genotype data for founders. In Supplement S.3, additional results from Section 3.3 on the empirical type I error of ABT based on the $\chi_m^2$ null distribution in simulation studies are summarized in Table S1.

Contributor Information

Xiaowei Wu, Department of Statistics, Virginia Tech, 250 Drillfield Drive, MC0439, Blacksburg, VA 24061, USA.

Ting Guan, Department of Statistics, Virginia Tech, 250 Drillfield Drive, MC0439, Blacksburg, VA 24061, USA.

Dajiang J. Liu, Department of Public Health Sciences, Hershey Institute of Personalized Medicine, Pennsylvania State University College of Medicine, Hershey, PA 17033, USA

Luis G. León Novelo, Department of Biostatistics, School of Public Health, University of Texas Health Science Center, Houston, TX 77030, USA

Dipankar Bandyopadhyay, Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA.

References

  1. Ansorge WJ. Next-generation DNA Sequencing Techniques. New Biotechnology. 2009;25(4):195–203. doi: 10.1016/j.nbt.2008.12.009.
  2. Asimit J, Zeggini E. Rare Variant Association Analysis Methods for Complex Traits. Annual Review of Genetics. 2010;44:293–308. doi: 10.1146/annurev-genet-102209-163421.
  3. Chen H, Meigs JB, Dupuis J. Sequence Kernel Association Test for Quantitative Traits in Family Samples. Genetic Epidemiology. 2013;37(2):196–204. doi: 10.1002/gepi.21703.
  4. Chun H, Ballard DH, Cho J, Zhao H. Identification of Association Between Disease and Multiple Markers Via Sparse Partial Least-squares Regression. Genetic Epidemiology. 2011;35(6):479–486. doi: 10.1002/gepi.20596.
  5. The 1000 Genomes Project Consortium. A Map of Human Genome Variation from Population-scale Sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
  6. Fang S, Zhang S, Sha Q. Detecting Association of Rare Variants by Testing an Optimally Weighted Combination of Variants for Quantitative Traits in General Families. Annals of Human Genetics. 2014;77(6):524–534. doi: 10.1111/ahg.12038.
  7. Fuentes M. Testing for Separability of Spatial-temporal Covariance Functions. Journal of Statistical Planning and Inference. 2006;136:447–466.
  8. Gauderman WJ, Murcray C, Gilliland F, Conti DV. Testing Association Between Disease and Multiple SNPs in a Candidate Gene. Genetic Epidemiology. 2007;31(5):383–395. doi: 10.1002/gepi.20219.
  9. Han F, Pan W. A Data-adaptive Sum Test for Disease Association with Multiple Common Or Rare Variants. Human Heredity. 2010;70(1):42–54. doi: 10.1159/000288704.
  10. Jakobsdottir J, McPeek MS. MASTOR: Mixed-model Association Mapping of Quantitative Traits in Samples with Related Individuals. American Journal of Human Genetics. 2013;92:652–666. doi: 10.1016/j.ajhg.2013.03.014.
  11. Jiang D, McPeek MS. Robust Rare Variant Association Testing for Quantitative Traits in Samples with Related Individuals. Genetic Epidemiology. 2013;38(1):1–20. doi: 10.1002/gepi.21775.
  12. Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A Powerful and Flexible Multilocus Association Test for Quantitative Traits. American Journal of Human Genetics. 2008;82:386–397. doi: 10.1016/j.ajhg.2007.10.010.
  13. Lee S, Wu MC, Lin X. Optimal Tests for Rare Variant Effects in Sequencing Association Studies. Biostatistics. 2012;13(4):762–775. doi: 10.1093/biostatistics/kxs014.
  14. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Wurfel MM, Lin X NHLBI GO Exome Sequencing Project-ESP Lung Project Team, D. C. Optimal Unified Approach for Rare-variant Association Testing with Application to Small-sample Case-control Whole-exome Sequencing Studies. American Journal of Human Genetics. 2012;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
  15. Li B, Leal SM. Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data. American Journal of Human Genetics. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
  16. Li MX, Gui HS, Kwan JS, Sham PC. GATES: a Rapid and Powerful Gene-based Association Test Using Extended Simes Procedure. American Journal of Human Genetics. 2011;88:283–293. doi: 10.1016/j.ajhg.2011.01.019.
  17. Lin DY, Tang ZZ. A General Framework for Detecting Disease Associations with Rare Variants in Sequencing Studies. American Journal of Human Genetics. 2011;89:354–367. doi: 10.1016/j.ajhg.2011.07.015.
  18. Liu DJ, Leal SM. A Novel Adaptive Method for the Analysis of Next-generation Sequencing Data to Detect Complex Trait Associations with Rare Variants Due to Gene Main Effects and Interactions. PLoS Genetics. 2010;6:e1001156. doi: 10.1371/journal.pgen.1001156.
  19. Ma L, Clark AG, Keinan A. Gene-based Testing of Interactions in Association Studies of Quantitative Traits. PLoS Genetics. 2013;9:e1003321. doi: 10.1371/journal.pgen.1003321.
  20. Madsen BE, Browning SR. A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. PLoS Genetics. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
  21. McPeek MS, Wu X, Ober C. Best Linear Unbiased Allele-frequency Estimation in Complex Pedigrees. Biometrics. 2004;60:359–367. doi: 10.1111/j.0006-341X.2004.00180.x.
  22. Morgenthaler S, Thilly WG. A Strategy to Discover Genes That Carry Multi-allelic Or Mono-allelic Risk for Common Diseases: A Cohort Allelic Sums Test (CAST). Mutation Research. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
  23. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an Unusual Distribution of Rare Variants. PLoS Genetics. 2011;7:e1001322. doi: 10.1371/journal.pgen.1001322.
  24. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled Association Tests for Rare Variants in Exon-resequencing Studies. American Journal of Human Genetics. 2010a;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
  25. Price AL, Zaitlen NA, Reich D, Patterson N. New Approaches to Population Stratification in Genome-wide Association Studies. Nature Reviews Genetics. 2010b;11(7):459–463. doi: 10.1038/nrg2813.
  26. Schaid DJ, McDonnell SK, Sinnwell JP, Thibodeau SM. Multiple Genetic Variant Association Testing by Collapsing and Kernel Methods with Pedigree Or Population Structured Data. Genetic Epidemiology. 2013;37(5):409–418. doi: 10.1002/gepi.21727.
  27. Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser PA, Lin X. SNP Set Association Analysis for Familial Data. Genetic Epidemiology. 2012;36(8):797–810. doi: 10.1002/gepi.21676.
  28. Sha Q, Zhang S. A Novel Test for Testing the Optimally Weighted Combination of Rare and Common Variants Based on Data of Parents and Affected Children. Genetic Epidemiology. 2014;38(2):135–143. doi: 10.1002/gepi.21787.
  29. Sha Q, Wang X, Wang X, Zhang S. Detecting Association of Rare and Common Variants by Testing an Optimally Weighted Combination of Variants. Genetic Epidemiology. 2012;36(6):561–571. doi: 10.1002/gepi.21649.
  30. Shendure J, Ji H. Next-generation DNA Sequencing. Nature Biotechnology. 2008;26(10):1135–1145. doi: 10.1038/nbt1486.
  31. Splansky GL, Corey D, Yang Q, Atwood LD, Cupples LA, Benjamin EJ, D'Agostino RB, Fox CS, Larson MG, Murabito JM, et al. The Third Generation Cohort of the National Heart, Lung, and Blood Institute's Framingham Heart Study: Design, Recruitment, and Initial Examination. American Journal of Epidemiology. 2007;165:1328–1335. doi: 10.1093/aje/kwm021.
  32. Thornton T, McPeek MS. Case-control Association Testing with Related Individuals: A More Powerful Quasi-likelihood Score Test. American Journal of Human Genetics. 2007;81:321–337. doi: 10.1086/519497.
  33. Thornton T, McPeek MS. ROADTRIPS: Case-control Association Testing with Partially Or Completely Unknown Population and Pedigree Structure. American Journal of Human Genetics. 2010;86:172–184. doi: 10.1016/j.ajhg.2010.01.001.
  34. Wang K, Abbott D. A Principal Components Regression Approach to Multilocus Genetic Association Studies. Genetic Epidemiology. 2008;32(2):108–118. doi: 10.1002/gepi.20266.
  35. Wang Y, Chen YH, Yang Q. Joint Rare Variant Association Test of the Average and Individual Effects for Sequencing Studies. PLoS One. 2012;7:e32485. doi: 10.1371/journal.pone.0032485.
  36. Wang T, Elston RC. Improved Power by Use of a Weighted Score Test for Linkage Disequilibrium Mapping. American Journal of Human Genetics. 2007;80:353–360. doi: 10.1086/511312.
  37. Wang X, Morris NJ, Zhu X, Elston RC. A Variance Component Based Multi-marker Association Test Using Family and Unrelated Data. BMC Genetics. 2013a;14:17. doi: 10.1186/1471-2156-14-17.
  38. Wang X, Lee S, Zhu X, Redline S, Lin X. GEE-based SNP Set Association Test for Continuous and Discrete Traits in Family Based Association Studies. Genetic Epidemiology. 2013b;37(8):778–786. doi: 10.1002/gepi.21763.
  39. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful SNP-set Analysis for Case-control Genome-wide Association Studies. American Journal of Human Genetics. 2010;86:929–942. doi: 10.1016/j.ajhg.2010.05.002.
  40. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. American Journal of Human Genetics. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
