Author manuscript; available in PMC: 2015 Jun 8.
Published in final edited form as: Genet Epidemiol. 2015 Mar 4;39(4):227–238. doi: 10.1002/gepi.21892

Methods for Association Analysis and Meta-Analysis of Rare Variants in Families

Shuang Feng 1, Giorgio Pistis 1,2, He Zhang 3, Matthew Zawistowski 1, Antonella Mulas 2, Magdalena Zoledziewska 2, Oddgeir L Holmen 4,5, Fabio Busonero 2, Serena Sanna 2, Kristian Hveem 4,6, Cristen Willer 3,7,8, Francesco Cucca 2, Dajiang J Liu 9, Gonçalo R Abecasis 1,*
PMCID: PMC4459524  NIHMSID: NIHMS692335  PMID: 25740221

Abstract

Advances in exome sequencing and the development of exome genotyping arrays are enabling explorations of association between rare coding variants and complex traits. To ensure power for these rare variant analyses, a variety of association tests that group variants by gene or functional unit have been proposed. Here, we extend these tests to family-based studies. We develop family-based burden tests, variable frequency threshold tests and sequence kernel association tests (SKAT). Through simulations we compare the performance of different tests. We describe situations where family-based studies provide greater power than studies of unrelated individuals to detect rare variants associated with moderate to large changes in trait values. Broadly speaking, we find that when sample sizes are limited and only a modest fraction of all trait-associated variants can be identified, family samples are more powerful. Finally, we illustrate our approach by analyzing the relationship between coding variants and HDL in 11,556 individuals from the HUNT and SardiNIA studies, demonstrating association for coding variants in the APOC3, CETP, LIPC, LIPG, and LPL genes and illustrating the value of family samples, meta-analysis and gene-level tests. Our methods are implemented in freely available C++ code.

Keywords: Rare variant, association, family sample, population sample, gene-level tests, study design, meta-analysis

Introduction

Variants of functional consequence, including non-synonymous, splice altering, and protein truncating variants, usually segregate at very low frequency in human populations [Abecasis et al. 2010; Abecasis et al. 2012; Marth et al. 2011; Nelson et al. 2012]. Recent advances in exome sequencing and the development of exome genotyping arrays are enabling explorations of their contributions to complex disease [Kiezun et al. 2012].

Association of rare variants with disease will bring biological insights about disease processes, but standard variant-by-variant association tests lack power when applied to these variants. Our work builds upon three strategies to increase the power of rare variant association studies: grouping variants by gene or functional unit, combining results across many studies through meta-analysis, and analysis of family samples.

Grouping rare variants by gene or functional unit [Li and Leal 2008], whether with weights [Madsen and Browning 2009] or without [Morris and Zeggini 2010], is now a popular strategy for rare variant association analysis [Lee et al. 2012a; Lee et al. 2012b; Lin and Tang 2011; Price et al. 2010; Wu et al. 2011]. The approach assumes that rare variants in the same gene or functional unit have similar functional consequences. When the assumption is correct and rare variants in a region are analyzed together, association signals will be clearer than when evaluating variants individually.

A second strategy to increase power is meta-analysis, which increases sample size and provides a practical approach to difficulties in data-sharing and concerns about heterogeneity [Lin and Zeng 2010; Willer et al. 2010]. Meta-analysis of single variants has been key in establishing association between common variants and complex diseases [Scott et al. 2007; Willer et al. 2010]. Meta-analysis methods for rare variant association tests have now been proposed, although these initial proposals and their implementations have generally focused on samples of unrelated individuals [Lee et al. 2013; Liu et al. 2014; Tang and Lin 2013].

Finally, a third strategy is to study samples of closely related individuals, increasing the odds that multiple copies of each rare variant are observed. Family samples are key in studies of Mendelian disorders but can also have advantages for studies of complex traits [Laird and Lange 2006, 2008; Ott et al. 2011]. For example, they can be more robust to population stratification (which may be more acute in rare variant association studies [Gravel et al. 2011]), allow checks for genotyping errors (improving data quality [Abecasis et al. 2001; Abecasis et al. 2002]) and can be enriched for variants of large effect by focusing on families with multiple individuals with extreme phenotypes. Early tests for family-based association [Abecasis et al. 2000; Laird et al. 2000; Laird and Lange 2008] focused on analysis of transmission disequilibrium, but newer tests rely on variance component models [Chen and Abecasis 2007; Kang et al. 2010] to account for stratification, resulting in tests of association that are typically more powerful [Chen and Abecasis 2007]. Our work also builds on computational enhancements in methods for variance component analysis, which have now been extended to samples of unrelated individuals (using empirical kinship matrices, estimated from genotype data) [Kang et al. 2010; Lippert et al. 2011; Zhou and Stephens 2012].

Here, we describe family-based association tests for rare variants that allow analysis of quantitative traits, with or without covariates, and show how these tests can be applied in meta-analysis settings. Our methods are based on the insight that gene-level test statistics can be constructed from single variant score statistics and estimates of the covariance between those [Liu et al. 2014]. We first analyze single variants using efficient computational algorithms for evaluation of variance component models [Lippert et al. 2011]. We then develop family-based burden (weighted and un-weighted), sequence-kernel association (SKAT), and variable frequency threshold (VT) tests. Using simulation we show that type I error is well controlled and compare different testing approaches. As expected, SKAT tests are more powerful when the fraction of associated variants in each gene is small or associated rare variants have opposite directions of effect; VT tests are more robust to the choice of allele frequency threshold for grouping variants. Our analysis of exome chip genotypes and HDL level data from the HUNT and SardiNIA studies shows that our methods are well calibrated and powerful enough to identify several signals at lipid associated loci.

There has been much recent work focused on extending gene-level association tests to families. Examples include various family-based burden tests [De et al. 2013; Saad and Wijsman 2014; Schaid et al. 2013] and variance component based tests [Chen et al. 2013; Ionita-Laza et al. 2013; Saad and Wijsman 2014; Schaid et al. 2013; Schifano et al. 2012; Svishcheva et al. 2014]. A key difference in our implementation, compared to previous work [Chen et al. 2013; Saad and Wijsman 2014; Schaid et al. 2013; Schifano et al. 2012; Svishcheva et al. 2014], is that we construct our gene-level statistics using single-variant statistics as input. This allows us to quickly re-evaluate gene-level statistics when gene definitions change, makes it practical to implement variable frequency-threshold based tests, and facilitates meta-analyses. To ensure computational efficiency in genome-wide analyses, our implementation uses a score test that requires fitting a maximum likelihood model only once (rather than a Wald test that would require it for every gene [Saad and Wijsman 2014; Madsen and Browning 2009]). We also focused on methods that could accommodate a diverse mix of family structures or even samples that include both families and unrelated individuals. This is in contrast to transmission-based tests, such as those proposed by De et al. [De et al. 2013; Ionita-Laza et al. 2013], that are limited to simpler family structures and cannot account for cryptic relatedness. As usual, we expect transmission-based tests may provide greater protection against stratification – but at the cost of greatly reduced power.

We characterize settings where family studies can provide greater power to detect rare variants with moderate to large phenotypic consequences than studies of unrelated individuals. In studies of unselected samples, this is due to a “Jackpot” effect, whereby a family sample may more easily include multiple copies of trait-associated rare alleles in one locus. While for each locus the expected number of rare alleles will be the same in a family sample or an unrelated sample of the same size, family samples are much more likely to exceed this expectation by a large amount. Our simulations show that this difference can have a large impact on power. All the methods described here are implemented in freely available C++ code and tools.

Methods

In this section, we first describe a variance component model to handle familial relationships. Then, we describe how single variant association statistics and their covariance matrices can be calculated and how gene-level association tests can be constructed. Next, we describe meta-analytic approaches for both single variant and gene-level association tests. Finally, we discuss the computational cost of our proposed approach and provide practical suggestions to improve computational performance.

Modeling Relatedness

In a sample of n individuals, we model the observed phenotype vector (y) as a sum of covariate effects (specified by a design matrix X and a vector of covariate effects β), additive genetic effects (modeled in vector g) and non-shared environmental effects (modeled in vector ε). Thus:

y=Xβ+g+ε. (Equation 1)

We assume that genetic effects are normally distributed, with mean 0 and covariance $2\sigma_g^2 K$, where the matrix K summarizes kinship coefficients [Lange 1997] between sampled individuals and $\sigma_g^2$ is a positive scalar describing the genetic contribution to the overall variance. We assume that non-shared environmental effects are normally distributed with mean 0 and covariance $\sigma_e^2 I$, where I is the identity matrix.

To estimate K, we either use known pedigree structure to define it [Lange 1997] or else use the Balding-Nichols empirical estimator [Astle and Balding 2009], which uses observed genotypes to estimate kinship as $\hat{K} = \frac{1}{v} \sum_{i=1}^{v} \frac{(G_i - 2 f_i \mathbf{1})(G_i - 2 f_i \mathbf{1})^T}{4 f_i (1 - f_i)}$ (here, v is the count of variants, $G_i$ is a genotype vector where each element encodes the number of observed minor alleles in a particular individual, $\mathbf{1}$ is a vector of ones, and $f_i$ is the estimated allele frequency for the ith variant). Model parameters $\hat{\beta}$, $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$ are estimated using maximum likelihood and the efficient algorithm described in Lippert et al. [Lippert et al. 2011]. For convenience, let the estimated covariance matrix of y be $\hat{\Omega} = 2\hat{\sigma}_g^2 \hat{K} + \hat{\sigma}_e^2 I$.
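The empirical kinship estimator can be illustrated with a short sketch. This is a minimal pure-Python illustration of the formula for $\hat{K}$, not the authors' C++ implementation; the function name `empirical_kinship` and the list-of-lists data layout are assumptions made for the example.

```python
def empirical_kinship(genotypes, freqs):
    """Empirical kinship estimate from minor-allele counts.

    genotypes: list of v variants, each a list of n allele counts (0, 1 or 2).
    freqs:     list of v estimated allele frequencies, each strictly in (0, 1).
    Returns an n x n symmetric matrix as a list of lists.
    (Hypothetical helper; names and layout are illustrative only.)
    """
    v = len(genotypes)
    n = len(genotypes[0])
    kin = [[0.0] * n for _ in range(n)]
    for g, f in zip(genotypes, freqs):
        centered = [x - 2.0 * f for x in g]   # G_i - 2 f_i 1
        scale = 4.0 * f * (1.0 - f)           # 4 f_i (1 - f_i)
        for a in range(n):
            for b in range(a, n):             # accumulate the upper triangle only
                kin[a][b] += centered[a] * centered[b] / scale
    for a in range(n):
        for b in range(a, n):
            kin[a][b] /= v                    # average over the v variants
            kin[b][a] = kin[a][b]             # mirror into the lower triangle
    return kin
```

Under this scaling the expected diagonal entry for a non-inbred individual is 0.5, which matches the use of $2\sigma_g^2 K$ as the genetic covariance.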

Single-variant Association Tests and Summary Statistics

Since our gene level association tests will build on single-variant test statistics [Chen and Abecasis 2007], we will first describe single variant test statistics and their corresponding variance-covariance matrix.

Consider the model

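$$y = X\beta + \gamma_i (G_i - \bar{G}_i) + g + \varepsilon.$$

This model is a refinement of equation (1) above, adding a scalar parameter $\gamma_i$ to measure the additive genetic effect of the ith variant. As usual [Lange 1997], the score statistic for testing $H_0: \gamma_i = 0$ is:

$$U_i = (G_i - \bar{G}_i)^T \hat{\Omega}^{-1} (y - X\beta).$$

And the variance-covariance matrix of these statistics (with the genotype vectors $G_i$ as the columns of $G$) is:

$$V = (G - \bar{G})^T \left( \hat{\Omega}^{-1} - \hat{\Omega}^{-1} X (X^T \hat{\Omega}^{-1} X)^{-1} X^T \hat{\Omega}^{-1} \right) (G - \bar{G}).$$

Under the null, test statistics $T_i = U_i^2 / V_{ii}$ are asymptotically distributed as chi-squared with one degree of freedom.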
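To make the single-variant score test concrete, the sketch below evaluates the score statistic, its variance and the chi-squared test statistic for one variant. It is a simplified illustration under stated assumptions: the design matrix X contains only an intercept, β is replaced by its generalized least squares estimate under the null, and the inverse of the estimated covariance matrix is supplied by the caller. The function name is hypothetical.

```python
import math


def score_test_intercept_only(y, g, omega_inv):
    """Single-variant score test when X is a column of ones (assumption).

    y:         phenotype values (length n).
    g:         minor-allele counts for one variant (length n).
    omega_inv: symmetric n x n inverse of the estimated covariance of y.
    Returns (U, V, T, p) with T = U^2 / V referred to chi-squared(1).
    """
    n = len(y)
    w_ones = [sum(row) for row in omega_inv]              # Omega^{-1} 1
    d = sum(w_ones)                                       # 1^T Omega^{-1} 1
    beta_hat = sum(w * yi for w, yi in zip(w_ones, y)) / d  # GLS intercept
    resid = [yi - beta_hat for yi in y]                   # y - X beta_hat
    g_mean = sum(g) / n
    gc = [gi - g_mean for gi in g]                        # G_i - mean(G_i)
    w_gc = [sum(row[j] * gc[j] for j in range(n)) for row in omega_inv]
    u = sum(a * r for a, r in zip(w_gc, resid))           # score statistic
    # Variance: gc^T Omega^{-1} gc minus the part explained by the intercept.
    v = sum(a * b for a, b in zip(gc, w_gc)) \
        - sum(a * b for a, b in zip(gc, w_ones)) ** 2 / d
    t = u * u / v
    p = math.erfc(math.sqrt(t / 2.0))                     # chi-squared(1) tail
    return u, v, t, p


# Unrelated individuals with unit variance: Omega is the identity matrix.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
u, v, t, p = score_test_intercept_only([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1], identity)
```

With related individuals, only `omega_inv` changes; the rest of the calculation is the same.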

Gene-level Association Tests for Family Samples

Using single variant statistics Ui and their variance-covariance matrix V, we are now ready to construct a variety of gene-level association test statistics that combine information across variants.

The simplest statistic for a burden test is to estimate the average genetic effect across a series of variants satisfying certain functional (for example, non-synonymous or protein truncating variants) and frequency criteria (for example, allele frequency <0.05). Then the rare variant burden for each individual can be defined as a weighted sum of allele counts for variants satisfying these criteria. Abstractly, we define the rare variant burden as $Gw$, where $w = (w_1, w_2, \ldots, w_m)^T$ is a vector of weights for each of the m variants in the gene. A regression parameter measuring the average effect of each variant can be estimated using the model:

$$y = X\beta + \gamma (G - \bar{G}) w + g + \varepsilon.$$

To test the null hypothesis $\gamma = 0$, we use a score statistic, expressed as a function of single variant statistics, $w^T U$, with variance $w^T V w$.

Then the burden test statistic $T_{burden} = w^T U / \sqrt{w^T V w}$ is asymptotically normal with mean zero and variance one.
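Because the burden statistic depends only on U, V and the weights, it can be computed without revisiting individual-level data. The following is a minimal sketch of that calculation; the function name and the two-sided normal p-value are illustrative choices, not taken from the authors' software.

```python
import math


def burden_test(u, v, w):
    """Burden statistic w^T U / sqrt(w^T V w) and a two-sided normal p-value.

    u: single-variant score statistics (length m).
    v: m x m variance-covariance matrix of the scores (list of lists).
    w: variant weights (length m); all ones gives the un-weighted test.
    """
    m = len(u)
    numerator = sum(w[i] * u[i] for i in range(m))
    variance = sum(w[i] * v[i][j] * w[j] for i in range(m) for j in range(m))
    t = numerator / math.sqrt(variance)
    p = math.erfc(abs(t) / math.sqrt(2.0))  # P(|Z| > |t|) for standard normal Z
    return t, p


# Toy inputs (made up for illustration).
t_burden, p_burden = burden_test([1.0, 2.0], [[1.0, 0.5], [0.5, 2.0]], [1.0, 1.0])
```

Changing the weights or dropping a variant only requires re-evaluating these sums, which is what makes regrouping variants inexpensive.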

Variable Threshold Tests for Family Samples

The simplest burden tests will be effective when appropriate frequency thresholds and functional annotation are used to select functional variants for analysis. However, this is challenging to do, because the optimal frequency thresholds will vary by gene and by phenotype [Lange et al. 2014]. One possibility is to define a test statistic that considers many alternative frequency thresholds [Lin and Tang 2011; Price et al. 2010].

Following the suggestions of Price et al. [2010] and Lin and Tang [2011], we will define the variable threshold test statistic as the maximal absolute value of burden test statistics across all possible frequency thresholds, $T_{VT} = \max_F |T_{burden}^F|$, where $T_{burden}^F = \phi_F^T U / \sqrt{\phi_F^T V \phi_F}$ is the burden test statistic calculated with frequency threshold F and $\phi_F$ is a vector of 0s and 1s indicating whether a variant has allele frequency below F. Burden statistics calculated using different frequency thresholds jointly follow a multivariate normal distribution with mean 0 and variance-covariance matrix $\psi_{ij} = \phi_i^T V \phi_j / \sqrt{\phi_i^T V \phi_i \, \phi_j^T V \phi_j}$ [Lin and Tang 2011]. P-values can be evaluated using the cumulative distribution function of this multivariate normal distribution [Genz 1992].
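The variable threshold statistic and the correlation matrix of the per-threshold burden statistics can be sketched as follows. Two assumptions are made for illustration: each distinct observed allele frequency is used as a candidate threshold, and a variant is included when its frequency is at or below that threshold. The p-value step, which needs a multivariate normal integration routine, is deliberately left out, and V is assumed positive definite.

```python
import math


def quad_form(a, v, b):
    """Return a^T V b for vectors a, b and a square matrix v (list of lists)."""
    return sum(a[i] * v[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def variable_threshold(u, v, freqs):
    """Per-threshold burden statistics, their correlation matrix, and T_VT.

    u, v:  single-variant scores and their covariance matrix.
    freqs: allele frequency of each variant (hypothetical input).
    """
    thresholds = sorted(set(freqs))
    # phi_F: indicator of variants with frequency at or below threshold F.
    phis = [[1.0 if f <= cut else 0.0 for f in freqs] for cut in thresholds]
    variances = [quad_form(phi, v, phi) for phi in phis]
    stats = [
        sum(p * ui for p, ui in zip(phi, u)) / math.sqrt(var)
        for phi, var in zip(phis, variances)
    ]
    k = len(phis)
    psi = [
        [quad_form(phis[i], v, phis[j]) / math.sqrt(variances[i] * variances[j])
         for j in range(k)]
        for i in range(k)
    ]
    t_vt = max(abs(t) for t in stats)
    return t_vt, stats, psi


t_vt, vt_stats, psi = variable_threshold(
    [1.0, 2.0], [[1.0, 0.5], [0.5, 2.0]], [0.001, 0.005]
)
```

The matrix `psi` is what a multivariate normal routine would take as input to turn `t_vt` into a p-value.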

Sequence Kernel Association Tests

Another refinement is to use a test statistic that allows for variants in the same gene to modify the phenotype in opposite directions [Chen et al. 2013; Ionita-Laza et al. 2013; Wu et al. 2011; Yan et al. 2014]. For example, in some genes [Abifadel et al. 2003], both gain-of-function and loss-of-function alleles have been described and these signals might cancel each other in a standard burden analysis. The model for this type of test is

$$y = X\beta + (G - \bar{G})\gamma + g + \varepsilon,$$

In this alternative model, $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_m)^T$ and the single variant effects $\gamma_i$ are assumed to follow a shared distribution, with mean 0 and variance $\tau w_i$. We test the null hypothesis of no association using the statistic $T_{SKAT} = U^T W U$ to evaluate whether $\tau$ is nonzero [Chen et al. 2013; Wu et al. 2011]. As usual, $W = \mathrm{diag}(w_1, w_2, \ldots, w_m)$ is a diagonal matrix indicating the weight of each variant. $T_{SKAT}$ is distributed as a mixture of chi-squared variables, $\sum_{i} \lambda_i \chi^2_{1(i)}$, where the weights $\lambda_1, \lambda_2, \ldots, \lambda_m$ correspond to the eigenvalues of $V^{1/2} W V^{1/2}$ and the $\chi^2_{1(i)}$ correspond to independently distributed chi-squared variables, each with 1 degree of freedom [Wu et al. 2011]. P-values can be evaluated using the Davies algorithm [Davies 1980] or a moment matching algorithm [Liu et al. 2009].
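As an illustration, the SKAT statistic and the first two moments of its null distribution can be obtained from U, V and the weights alone. The sketch below uses the identities that the sum of the eigenvalues of $V^{1/2} W V^{1/2}$ equals the trace of $WV$, and the sum of their squares equals the trace of $(WV)^2$, so no eigendecomposition is needed. It does not implement the Davies algorithm or the moment matching method of Liu et al.; it only returns the quantities a simple two-moment approximation would start from, and the function name is hypothetical.

```python
def skat_statistic(u, v, w):
    """SKAT statistic U^T W U with its null mean and variance.

    u: single-variant score statistics (length m).
    v: m x m variance-covariance matrix of the scores.
    w: diagonal entries of the weight matrix W (length m).
    Null mean is sum(lambda_i); null variance is 2 * sum(lambda_i^2).
    """
    m = len(u)
    q = sum(w[i] * u[i] ** 2 for i in range(m))              # U^T W U, W diagonal
    wv = [[w[i] * v[i][j] for j in range(m)] for i in range(m)]  # matrix W V
    sum_lambda = sum(wv[i][i] for i in range(m))              # trace(W V)
    sum_lambda_sq = sum(wv[i][j] * wv[j][i]                   # trace((W V)^2)
                        for i in range(m) for j in range(m))
    return q, sum_lambda, 2.0 * sum_lambda_sq


# Toy inputs (made up for illustration).
q_skat, null_mean, null_var = skat_statistic(
    [1.0, 2.0], [[1.0, 0.5], [0.5, 2.0]], [1.0, 1.0]
)
```

Note that the squared scores enter the statistic, so variants with opposite directions of effect add to the signal instead of cancelling as they would in a burden test.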

Meta-Analysis

Since we derived all the statistics above from single variant score statistics and their covariance matrix, our approach can be readily extended to meta-analyses. We first define the overall single variant score statistics and their variance-covariance matrix as $U_{meta,i} = \sum_{k=1}^{s} U_{i,k}$ and $V_{meta,ij} = \sum_{k=1}^{s} V_{ij,k}$, where $U_{i,k}$ and $V_{ij,k}$ are the single variant score statistic and variance-covariance matrix components from study k and s is the total number of studies. Whenever variant i is unobserved in study k, we set $U_{i,k} = 0$ and $V_{ij,k} = 0$ for all j. Next, we simply calculate burden, VT and SKAT meta-analysis statistics using the formulae above.
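The combination step amounts to summing per-study scores and covariances, with zeros for variants a study did not observe. A minimal sketch follows; the dictionary-based layout and the function name are assumptions made for the example, not a description of any file format.

```python
def combine_studies(studies, variants):
    """Sum single-variant scores and covariances across studies.

    studies:  list of (u, v) pairs, where u maps variant id -> score and
              v maps (variant id, variant id) -> covariance entry.
    variants: ordered list of variant ids defining rows and columns.
    Variants missing from a study contribute 0 to both U and V.
    """
    m = len(variants)
    u_meta = [0.0] * m
    v_meta = [[0.0] * m for _ in range(m)]
    for u, v in studies:
        for i, vi in enumerate(variants):
            u_meta[i] += u.get(vi, 0.0)
            for j, vj in enumerate(variants):
                v_meta[i][j] += v.get((vi, vj), 0.0)
    return u_meta, v_meta


# Study 2 did not observe variant "b" (toy values, made up for illustration).
study1 = ({"a": 1.0, "b": 2.0},
          {("a", "a"): 1.0, ("a", "b"): 0.5, ("b", "a"): 0.5, ("b", "b"): 2.0})
study2 = ({"a": 0.5}, {("a", "a"): 1.0})
u_meta, v_meta = combine_studies([study1, study2], ["a", "b"])
```

The combined `u_meta` and `v_meta` can then be passed to the same burden, VT or SKAT calculations used for a single study.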

Computational Efficiency

Since we rely on score statistics and their covariance, we only need to fit the linear mixed model once under the null hypothesis. Fitting parameters for this null mixed model is a major part of the computational cost of our approach. Standard EM or Newton–Raphson methods require calculating the inverse of the covariance matrix in each iteration – with time complexity $O(n^3)$, too costly for large datasets. Instead, we used the computationally efficient algorithm described in Lippert et al. [Lippert et al. 2011] to estimate the variance components and fixed effects under the null (Equation 1). The algorithm begins with a one-time singular value decomposition (SVD) of the relationship matrix, a step which has time complexity $O(n^3)$. The results of this decomposition are used in a factorization that transforms the phenotype vector and design matrix so that transformed phenotypes are identically and independently distributed. This second step has time complexity $O(n^2)$. After transformation, the cost of updating the log likelihood becomes linear with respect to sample size n (instead of $O(n^3)$ using the standard approach). Calculating the score statistics and their covariance for all single variants simply requires a transformation of genotypes and has time complexity $O(mn^2)$ for a dataset with m variants. In practice, we calculate covariance of score statistics from markers within a sliding window. For large samples, calculating the SVD of the relationship matrix is the computationally most expensive step. A similar idea with comparable computational efficiency has also been described in Zhou et al. [Zhou and Stephens 2012]. Both ideas build upon the algorithm described by Kang et al. and implemented in EMMAX [Kang et al. 2010].
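The reason a single decomposition suffices can be seen from a short derivation. This is a sketch of the standard argument, with notation ($Q$, $S$, $s_j$) introduced here for illustration and not taken from the original text. Writing the symmetric relationship matrix as $K = Q S Q^T$ with $Q$ orthogonal and $S$ diagonal:

```latex
\begin{aligned}
K &= Q S Q^{T}, \qquad S = \mathrm{diag}(s_1, \ldots, s_n), \\
\Omega &= 2\sigma_g^2 K + \sigma_e^2 I
        = Q \left( 2\sigma_g^2 S + \sigma_e^2 I \right) Q^{T}, \\
Q^{T} y &\sim N\!\left( Q^{T} X \beta,\; 2\sigma_g^2 S + \sigma_e^2 I \right).
\end{aligned}
```

The covariance of the rotated phenotype $Q^T y$ is diagonal for every value of $(\sigma_g^2, \sigma_e^2)$, so its entries are independent with variances $2\sigma_g^2 s_j + \sigma_e^2$, and rescaling each entry by its standard deviation gives unit variance. Each likelihood evaluation then reduces to a sum of n univariate terms, while the rotations $Q^T y$ and $Q^T X$ are computed only once.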

When variants are grouped in gene-level tests, the computational cost of calculating the combined test statistics is small after single variants have been analyzed. Obtaining p-values corresponding to these statistics, especially for SKAT and VT analyses, can still be challenging when the number of rare variants in a gene is large. To speed up this step, we used computationally efficient algorithms to evaluate the multivariate normal probabilities [Genz 1992] and the mixture chi-squared distribution [Davies 1980].

Simulation

We carried out a series of simulations to evaluate the performance of our method. We first simulated a set of 1000 base-pair sequences, which is close to the length of an average protein coding sequence in humans, using the coalescent (as implemented in the program ms [Hudson 2002]) and a demographic model calibrated to mimic European population history [Adams and Hudson 2004; Novembre et al. 2008]. We then carried out gene-dropping simulations [Abecasis et al. 2002] using these simulated sequences as founder haplotypes that were propagated through various pedigree structures (Figure 1).

Figure 1. Pedigree Structures Used in Simulations.

To evaluate power, we assigned a fraction of variants below a desired frequency threshold (<0.01 in simulations unless noted otherwise) as causal. Typically, we assigned minor alleles at causal variants to all have effects in the same direction but, in some cases, a fraction of causal minor alleles were assigned effects in the opposite direction. When assigning effect sizes to causal variants, we considered two trait-generating models: an equal variance model (where the effect size for each variant is proportional to $1/\sqrt{p(1-p)}$, a function of the allele frequency p that ensures each causal variant explains the same amount of trait variance) and an equal effect-size model (where the effect size is the same for all causal variants, irrespective of allele frequency). In the equal effect size model, relatively common variants explain a larger amount of the variance, while in the equal variance model, rarer variants have larger effect sizes (see Figure S1 for demonstration). Genetic effects were set so that the total variance explained by each gene ($h^2_{gene}$) was in the 0.1-2% range. Empirical power was calculated using 10,000 simulations for each parameter combination. We used $\alpha = 1 \times 10^{-8}$ for single variant association power and $\alpha = 2.5 \times 10^{-6}$ for gene-level association power. Type I error rate for gene-level tests was estimated using 5,000,000 simulations. To compare studies of families and unrelated individuals, we held the number of genotyped (or sequenced) individuals constant and compared our ability to detect associated variants in studies using different sampling units. In simulations and the subsequent association analyses, kinship matrices estimated from pedigree were used to fit the null linear mixed model.
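The two trait-generating models can be illustrated with a small sketch that assigns effect sizes for a target gene-level variance explained. It assumes a trait with unit variance, additive effects, Hardy-Weinberg proportions and independent causal variants, so that a variant with frequency p and effect β explains $2p(1-p)\beta^2$ of the trait variance. These assumptions and the function name are ours and may differ from the simulation code used in the study.

```python
import math


def causal_effect_sizes(freqs, h2_gene, model="equal_variance"):
    """Effect sizes for causal variants explaining h2_gene in total.

    freqs:   allele frequencies of the causal variants.
    h2_gene: fraction of trait variance explained by the gene (trait variance 1).
    model:   "equal_variance" (each variant explains h2_gene / m) or
             "equal_effect" (all variants share one effect size).
    """
    m = len(freqs)
    if model == "equal_variance":
        # beta_i is proportional to 1 / sqrt(p_i (1 - p_i)).
        return [math.sqrt(h2_gene / m / (2.0 * p * (1.0 - p))) for p in freqs]
    if model == "equal_effect":
        total = sum(2.0 * p * (1.0 - p) for p in freqs)
        return [math.sqrt(h2_gene / total)] * m
    raise ValueError("model must be 'equal_variance' or 'equal_effect'")


freqs = [0.001, 0.01]
betas_ev = causal_effect_sizes(freqs, 0.01, "equal_variance")
betas_ee = causal_effect_sizes(freqs, 0.01, "equal_effect")
```

In the equal variance case the rarer variant receives the larger effect, while in the equal effect case the more common variant accounts for most of the variance explained.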

SardiNIA and HUNT Samples Description

To further evaluate how our method performs in real data analysis, we used exome chip data from the HUNT [Holmen et al. 2014a; Holmen et al. 2014b] and SardiNIA [Giorgio et al. 2014; Pilia et al. 2006] studies, which genotyped 5,803 and 6,602 individuals, respectively. Here, we analyze HDL, adjusted for age and sex (Table S3). Genotypes were called using the Illumina GenCall algorithm in combination with zCall V2.2. Detailed QC procedures can be found in Holmen et al. [Holmen et al. 2014a] for the HUNT study and Pistis et al. [Giorgio et al. 2014] for the SardiNIA study.

Results

Type I Error Rate

To evaluate type I error rate, we simulated family samples of 1,000 or 5,000 individuals with family structures matching 3 generation pedigrees with 10 (Pedigree10) or 50 (Pedigree50) individuals (see Figure 1 for details). Within each gene, variants with frequency <.01 were grouped for analysis. Each type I error estimate summarizes results from five-million simulations. Table S1 shows that the type I error of our gene-level association tests is well controlled for a variety of pedigree structures. Empirical error rates are a little below nominal levels when sample sizes are small (N= 1,000), but approach nominal significance as sample size increases (N=5,000).

Power of Different Rare Variant Association Tests

Next, we evaluated the power of our proposed association tests under various scenarios. We used significance level $\alpha = 2.5 \times 10^{-6}$, which corresponds to Bonferroni adjustment for testing of 20,000 genes (0.05/20,000). We first simulated samples of 5,000 individuals distributed in 3-generation pedigrees with 10 individuals each (Pedigree10 in Figure 1). Variants with frequency <1% (<5% where noted) explained 1% of the variance in a simulated quantitative trait. When all associated variants had the same effect size and the proportion of causal variants was small (∼20%), SKAT had the largest power. When this proportion grew larger (∼80%), VT became the most powerful test (Table 1). Although we did not simulate a relationship between frequency and effect size among causal variants, VT provided greater power because it sometimes excluded relatively common unassociated variants from consideration, reducing noise. When the fraction of causal variants is small, methods that explicitly allow for heterogeneity in effect sizes do better: since no correlation between causality and effect size was simulated, VT cannot easily exclude most of the unassociated variants. In practice, the true list of causal variants is usually unknown, and allele frequency is often a good proxy to identify variants likely to modify gene function [Nelson et al. 2012]. In a simplified scenario where only causal variants were grouped and other variants were discarded, the basic burden test became optimal (Table 1).

Table 1. Power when Causal Variants All Increase Trait Values and Have the Same Effect Sizes.

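| MAF Cutoff | Causal Percentage | By MAF cutoff: Burden | By MAF cutoff: Madsen-Browning | By MAF cutoff: VT | By MAF cutoff: SKAT (a) | Causal only (b): Burden | Causal only (b): Madsen-Browning | Causal only (b): VT | Causal only (b): SKAT |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 20% | 9.7 | 3 | 13.1 | 36.6 | 94.3 | 86.7 | 92.9 | 82.6 |
| 0.01 | 80% | 82.4 | 64.7 | 88.1 | 61 | 96 | 82.1 | 94.3 | 70.7 |
| 0.05 | 20% | 14.6 | 2.6 | 24.9 | 36.3 | 95.4 | 75.3 | 93.8 | 86.5 |
| 0.05 | 80% | 81.3 | 39.5 | 89.2 | 75 | 96.3 | 55.3 | 94.3 | 82.9 |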

Simulated samples each had 5,000 individuals, organized in families with Pedigree10 structure (see Figure 1). Causal variants were selected among those identified in simulated 1,000 base-pair sequences and explained 1% of trait variance. Each causal variant had the same effect size and direction. Power is tabulated as a percentage of simulations exceeding significance threshold. Significance level $\alpha = 2.5 \times 10^{-6}$ was used in all simulations.

(a) Power calculated from Madsen-Browning weighted SKAT.

(b) Power when grouping only causal variants. This column represents the largest power we can achieve for each simulation setting.

We next considered more complex scenarios. When 20% of causal variants decreased trait values and the remainder increased trait values, the power of burden and VT tests dropped dramatically and SKAT became the most powerful test, regardless of the proportion of causal variants (Table 2). When we set up our simulation so that each variant explained the same fraction of trait variance (and, thus, so that rarer variants had larger effects), SKAT remained the most powerful test when the proportion of causal variants was small, but the Madsen-Browning weighted burden (MB) test outperformed VT and SKAT when the proportion of causal variants was large (80%) (Table 3). This was expected since, in this setting, relative effect sizes match those predicted by the Madsen-Browning weighting scheme.

Table 2. Power Comparison when Causal Variants Can Have Opposite Effects.

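| MAF Cutoff | Causal Percentage | By MAF cutoff: Burden | By MAF cutoff: Madsen-Browning | By MAF cutoff: VT | By MAF cutoff: SKAT | Causal only: Burden | Causal only: Madsen-Browning | Causal only: VT | Causal only: SKAT |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 20% | 4.6 | 0.4 | 6.0 | 36.7 | 38.9 | 21.1 | 43.4 | 83.2 |
| 0.01 | 80% | 30.5 | 10.4 | 33.4 | 60.0 | 42.6 | 18.8 | 42.2 | 69.0 |
| 0.05 | 20% | 11.7 | 1.3 | 15.0 | 35.7 | 55.4 | 22.3 | 58.3 | 88.3 |
| 0.05 | 80% | 44.0 | 7.8 | 47.1 | 74.7 | 55.1 | 12.2 | 54.3 | 81.6 |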

Simulated samples each had 5,000 individuals, organized in families with Pedigree10 structure (see Figure 1). Causal variants were selected among those identified in simulated 1,000 base-pair sequences and explained 1% of trait variance. Among causal variants, 20% were randomly selected to be trait-decreasing, and the remaining causal variants were trait-increasing. Power is tabulated as a percentage of simulations exceeding significance threshold. Significance level $\alpha = 2.5 \times 10^{-6}$ was used in all simulations.

Table 3. Power Comparison when Causal Variants All Increase Trait Values and Explain the Same Amount of Trait Variance.

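| MAF Cutoff | Causal Percentage | By MAF cutoff: Burden | By MAF cutoff: Madsen-Browning | By MAF cutoff: VT | By MAF cutoff: SKAT | Causal only: Burden | Causal only: Madsen-Browning | Causal only: VT | Causal only: SKAT |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 20% | 4.3 | 4.2 | 9.1 | 20.8 | 88.7 | 94.9 | 90.8 | 67.0 |
| 0.01 | 80% | 66.9 | 86.6 | 85.4 | 20.1 | 85.5 | 97.1 | 93.8 | 27.0 |
| 0.05 | 20% | 3.8 | 5.1 | 9.3 | 9.8 | 78.8 | 98.0 | 90.1 | 53.0 |
| 0.05 | 80% | 38.6 | 88.5 | 82.1 | 9.4 | 56.0 | 97.9 | 92.6 | 12.4 |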

Simulated samples each had 5,000 individuals, organized in families with Pedigree10 structure (see Figure 1). Causal variants were selected among those identified in simulated 1,000 base-pair sequences and explained 1% of trait variance. Each causal variant explained the same amount of trait variance. All causal variants were trait-increasing. Power is tabulated as a percentage of simulations exceeding significance threshold. Significance level $\alpha = 2.5 \times 10^{-6}$ was used in all simulations.

Power when Misspecifying Frequency Threshold

We next investigated the impact of misspecifying frequency thresholds during analysis. Figure S2A shows that when causal variants have the same effect sizes, VT and Madsen-Browning-weighted burden tests perform well as long as the frequency cut-off used during analysis is larger than the cutoff used for simulation. In contrast, the power of SKAT and simple burden tests is greatly reduced when incorrect frequency thresholds are used for analysis. Figure S2B shows that when rare causal variants have larger effects and all variants explain the same amount of trait variance, all tests reach maximum power at a frequency threshold less than or equal to 0.01, the threshold for simulating causal variants. Whereas the power of VT and MB remains close to optimal, the power of SKAT and the simple burden tests drops greatly as the frequency threshold used for analysis increases and non-causal and small effect variants enter the analysis. In real data analysis, because the true disease model is unclear, we recommend using multiple frequency thresholds with SKAT or simple burden tests [Lange et al. 2014].

Relative Power of Family Samples and Unrelated Individuals

We used simulations to compare the benefits of samples of families and unrelated individuals in association studies. Family samples can allow many copies of the same trait associated rare alleles to be observed in a single study. Variability in allele counts is larger in families, particularly in pedigrees with many descendants for each founder. For example, for a variant with allele frequency 0.0005 (∼5 alleles expected when 5000 individuals are sequenced), the standard deviation of the allele counts in a sample matching Pedigree50 (from Figure 1) is >3 times larger than a sample of unrelated individuals (see Table S2 for details) – meaning that the chance of observing >10 copies of the variant is 20% when families matching Pedigree50 are sampled, but 4% in samples of unrelated individuals.

We speculated that the increased variability in allele counts in family samples would mean that family samples might sometimes hit a “jackpot” and sample many copies of a trait associated rare allele, increasing power. This speculation was supported by our simulations: a sample of 5,000 individuals in families matching Pedigree50 provides >2-fold greater power to detect a variant with frequency 0.001 and effect size 1 than a population sample of the same size (power was 0.9% in a sample of unrelated individuals and 2.3% in a sample of families, Figure S3). This increase may seem paltry, but it is important to remember that many susceptibility loci underlie each human complex trait: if there are hundreds of such loci and power increases from 0.9% to 2.3% at each of those, the odds of a successful discovery will increase dramatically. The idea of a “jackpot” effect was also supported by close examination of our simulation results. Among all 10,000 simulated samples, the average frequency of trait associated alleles was 0.0010, but in samples that have association p-value $< 1 \times 10^{-8}$, the frequency of trait associated alleles was higher, averaging 0.0032, a >3-fold increase. The relative advantages of family samples over unrelated samples decrease in settings where power (and, typically, the number of expected rare allele carriers) is high. For example, when sample size increases, allele frequency increases, or effect size (or variance explained) increases, unrelated samples quickly become more powerful (Figure S3).

Consistent with patterns in single variant association power, Figure 2 shows that family studies have similar advantages in studies of gene-level rare variant associations. For example, in a sample of 5,000 individuals, power to detect a gene where 20% of variants with frequency <1% are causal and explain 0.5% of trait variance increases from 1% for unrelated individuals to 13% for family samples.

Figure 2. Power to Detect Gene-Level Association in Family and Population Samples.

All samples had 5,000 individuals. All family samples used the Pedigree50 structure (see Figure 1 for details). In every simulation, 10,000 haplotypes were simulated and 20% of variants with MAF<0.01 were randomly selected as causal variants, each explaining the same amount of trait variance. Then, a subset of simulated haplotypes were selected as founder haplotypes, segregated through families according to Mendel's laws, and used to simulate quantitative traits. Power of the SKAT test was evaluated using 10,000 simulations and significance level $\alpha = 2.5 \times 10^{-6}$.

Advantages in power from studies of families are strongly correlated with the variance of allele counts (which is a function of family size and pedigree structure). For example, a sample of families matching Pedigree50 (Figure 1) has the largest variance in allele counts (Table S2) and also the largest power for detecting a gene explaining 0.5% of trait variance in a sample of 5,000 individuals (Figure 3), whereas a sample of families matching Nuclear4 (Figure 1) has the smallest variance in allele counts and provides the smallest increase in power relative to samples of unrelated individuals (in this simulation, 20% of variants with frequency <1% were causal). All family samples have larger variance in allele counts than unrelated samples.

Figure 3. Power to Detect Gene-Level Association as a Function of Pedigree Structure.

In each simulation, 20% of variants with MAF<0.01 were randomly assigned as causal, each explaining the same amount of trait variance. Together, causal variants explained 0.5% of trait variance. For comparison, the red line shows the power for one variant with frequency of 0.5 and explaining 0.5% of the trait variance. Power of the SKAT test was evaluated using 10,000 simulations and significance level α = 2.5 × 10⁻⁶.

The advantage of family samples extends to extremely rare variants. Figure 4A shows that when 20% of singleton variants (defined as alleles present only once in our initial pool of 10,000 simulated sequences) in a gene were causal and together explained 0.5% of trait variance, power to detect gene-level association increased dramatically from 3.5% in a study of 5,000 unrelated individuals to as much as 19.3% in a study of 5,000 related individuals. Figure 4B shows that when sample size increases to 10,000 individuals, the window where family samples are more advantageous becomes narrower.

Figure 4. Power to Detect Gene-Level Association When Singletons are Causal.

In each simulation, 10,000 haplotypes were simulated. 20% of singletons from these haplotypes were chosen as causal variants, together explaining various proportions of trait variance. Trait heritability was 40%. Then, a subset of haplotypes was used to seed founder haplotypes in each family sample. Only singletons or private variants were grouped for association tests. Power was evaluated in samples of 5,000 individuals (panel A) or 10,000 individuals (panel B) using 10,000 simulations and significance level α = 2.5 × 10⁻⁶. See Figure 1 for details of pedigree structures.

In all examples highlighted so far, family studies outperform studies of unrelated individuals, but in all of these examples power was low for both families and unrelated individuals. We expect that this is actually a common situation in human genetic studies: there may be very large numbers of trait-associated loci, but any single study may only provide enough power to detect a few of these. To explore this situation directly, we estimated power to detect at least one of several disease-associated loci. Assuming power to detect association at a specific gene is p and n genes with similar effect variants exist, the power to detect at least one of these is 1-(1-p)^n, assuming independent genes. Figure 5 shows dramatic advantages in the power to detect at least one of 20 trait-associated genes, each explaining the same proportion of trait variance. For example, power to detect at least one gene explaining 0.5% of trait variance (when 20% of variants in the gene with frequency <1% are causal) when 20 such genes exist is >90% in a sample of 5,000 individuals distributed in families matching Pedigree50 (Figure 1), but only ∼20% in a sample of 5,000 unrelated individuals. The power advantage in family samples increases with the variability in allele counts, which in turn is driven by pedigree structure (Figure 6).
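The calculation of power to detect at least one of several independent genes can be checked with a few lines of code. The sketch below is a minimal illustration; the function name is hypothetical, and the per-gene power values of 13% and 1% are the single-gene figures quoted earlier for family and unrelated samples of 5,000 individuals.

```python
def power_at_least_one(per_gene_power, n_genes):
    """Probability of detecting at least one of n_genes independent genes,
    each detected with probability per_gene_power."""
    if not 0.0 <= per_gene_power <= 1.0:
        raise ValueError("per_gene_power must lie in [0, 1]")
    if n_genes < 0:
        raise ValueError("n_genes must be non-negative")
    # Complement of the probability that every gene is missed.
    return 1.0 - (1.0 - per_gene_power) ** n_genes


if __name__ == "__main__":
    for label, p in (("family sample", 0.13), ("unrelated sample", 0.01)):
        print(f"{label}: {power_at_least_one(p, 20):.3f}")
```

With 20 genes, per-gene power of 13% gives roughly 94% power to detect at least one gene, whereas per-gene power of 1% gives roughly 18%, in line with the >90% and ∼20% values described for Figure 5.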

Figure 5. Power to Detect at least One of Twenty Causal Genes.

Assuming power to detect association at a specific gene is p and n genes with similar effect variants exist, then the power to detect at least one of these is 1-(1-p)^n. See Figure 2 for power to detect a single gene and additional details of simulation settings.

Figure 6. Power to Detect at least One of Twenty Causal Genes as a Function of Pedigree Structure.

The blue bars show power to detect at least one gene where rare variants explain 20% of trait variance and 20 such genes exist. The red line shows the power to detect at least one common variant with frequency 0.5 that explains 0.5% of trait variance when 20 such variants exist. See the legend of Figure 3 for simulation settings and the legend of Figure 5 for how the power to detect at least one of n genes with similar effect variants is calculated.

Families matching Pedigree50 are not easy to find. For a more realistic comparison of the power of studies of families and unrelated individuals, we repeated our simulations using the family structures and phenotypes observed in the SardiNIA sample. To preserve the correlation of phenotypes among family members, we started with observed HDL values together with sex, age and age-squared as covariates. Figure S4 shows that the SardiNIA families provide larger power for discovering rare variants with moderate effect sizes than studies of the same number of unrelated individuals. For example, the SardiNIA sample provides 1.6% power to detect a variant with frequency 0.0001 and an effect of 2.5 trait standard deviation units, whereas unrelated samples provide only 0.05% power (Figure S4A). If 100 such variants exist, the SardiNIA sample provides ∼80% power to detect at least one, but an equal number of unrelated individuals provides only ∼5% power to detect at least one such variant (Figure S4B). When allele frequency increases (Figure S4C, S4D, S4E, S4F), the SardiNIA sample is still advantageous when effect sizes are moderate.

Real Data Analysis Using SardiNIA and HUNT Studies

To evaluate our approach further, we meta-analyzed blood HDL levels for 11,556 individuals from the HUNT and SardiNIA studies (see Table S3 for descriptive statistics for traits). Overall, 93,831 and 76,828 sites were polymorphic in the HUNT and SardiNIA studies respectively, resulting in 117,958 polymorphic variants when combining the two studies (Table S4). Among those, 52,700 variants were shared by both studies (Table S5), 41,130 variants were unique to the HUNT study, and 24,128 variants were unique to the SardiNIA study (Table S6). Using our meta-analysis method, both shared and non-shared variants contribute to association signals.
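One common way for both shared and study-specific variants to contribute to a meta-analysis is to sum per-variant score statistics and their variances across studies, with a study simply contributing nothing for variants it did not observe. The sketch below illustrates that general idea only. It is an assumption made for illustration, not a description of the code used here, and the variant labels and numbers are invented.

```python
import math


def meta_analyze(studies):
    """Combine per-variant score statistics across studies.

    `studies` is a list of dicts mapping a variant label to a tuple
    (score statistic U, variance V of U). Variants missing from a study
    contribute nothing from that study. Returns a dict of combined Z-scores.
    """
    combined = {}
    for study in studies:
        for variant, (u, v) in study.items():
            u_sum, v_sum = combined.get(variant, (0.0, 0.0))
            combined[variant] = (u_sum + u, v_sum + v)
    return {
        variant: u / math.sqrt(v)
        for variant, (u, v) in combined.items()
        if v > 0.0
    }


if __name__ == "__main__":
    # Hypothetical summary statistics for two studies.
    study_a = {"var1": (2.0, 4.0), "var2": (3.0, 9.0)}
    study_b = {"var1": (1.0, 5.0), "var3": (-4.0, 16.0)}
    for variant, z in sorted(meta_analyze([study_a, study_b]).items()):
        print(variant, round(z, 3))
```

In this toy input, var1 is shared by both studies while var2 and var3 are each seen in only one, yet all three receive a combined statistic.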

We first generated summary statistics for each study, adjusting for relatedness using empirical kinship matrices estimated from genotype data. Within each sample, test statistics were well calibrated, with genomic control values of 1.00 in the HUNT study and 1.01 in the SardiNIA sample (see Figure S5 for QQ plots). To illustrate the importance of taking into account phenotype correlations, consider that analyzing the SardiNIA exome chip data and treating the samples as unrelated results in a genomic control value of 1.45, which is unacceptably high (results not shown); using our approach, genomic control becomes 1.01. We next proceeded to meta-analyze single variants. Figure S5 shows that our meta-analysis statistics were also well calibrated, with genomic control value <1.05, both for common and rare variants. At a significance threshold of p<4.23×10⁻⁷ (corresponding to 0.05/117,958), we found significantly associated low-frequency and rare variants at CETP, LIPC, LIPG, and LPL for HDL (MAF < 5%; see Figure S6 for Manhattan plots). Significant rare variants were only found in LIPC and LIPG (MAF < 1%).

We then proceeded to gene-level meta-analyses. Again, test statistics appear well calibrated, with genomic control value <1.05 (see Figure S7 for QQ plots). Also, by examining QQ plots from the SardiNIA and HUNT studies (see Figure S7), we discovered that, in the analysis of rare variants in family samples or samples from isolated populations, a small number of individuals can be quite influential, such that all variants shared among this set of individuals (or families) will exhibit similar and often small p-values. This can lead to apparent inflation in QQ plots, where confidence intervals are calculated assuming all statistics are independent. At a significance threshold of p<2.84×10⁻⁶ (corresponding to 0.05/17,574 and thus allowing for the number of genes tested), we found association at APOC3, CETP, LIPC, LIPG, and LPL for HDL (see Table 4 for tabulated results and Figure S8 for Manhattan plots). Among those, APOC3, LIPG, and LPL had evidence of association stronger than the most significant single variant in the region. In APOC3, none of the individual low-frequency and rare variants had a p-value lower than 10⁻⁴ on its own (Table 4).
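For readers unfamiliar with genomic control, the value is usually computed as the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). The sketch below shows that conventional calculation starting from p-values; the function names are illustrative and it is not taken from the software described here.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_1df_from_p(p_value):
    """Convert a two-sided p-value in (0, 1] to a 1-df chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(p_value / 2.0)  # lower-tail quantile, z <= 0
    return z * z


def genomic_control(p_values):
    """Median observed chi-square divided by the null median of chi-square(1)."""
    observed_median = median(chi2_1df_from_p(p) for p in p_values)
    expected_median = _STD_NORMAL.inv_cdf(0.75) ** 2  # approximately 0.455
    return observed_median / expected_median


if __name__ == "__main__":
    # Evenly spaced p-values mimic a perfectly calibrated null distribution.
    n = 1001
    null_p = [(i - 0.5) / n for i in range(1, n + 1)]
    print(round(genomic_control(null_p), 3))
```

A value near 1 indicates well-calibrated statistics, while values well above 1, such as the 1.45 obtained when relatedness is ignored, indicate inflation.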

Table 4. Significant Genesᵃ from Gene-Level Meta-Analysis of HUNT and SardiNIA Exome Chip Data (HDL).

| Gene | Burden | Madsen-Browning | VT (Actual MAF Cutoff) | SKATᶜ | Variants Includedᵈ | MAF | Effect Size (SD) | Single Variant p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| APOC3ᵇ | 2.3×10⁻⁶ | 1.9×10⁻⁶ | 6.4×10⁻⁶ (6.1×10⁻⁴) | 4.5×10⁻⁵ | 11:116701560:G:A | 4.8×10⁻⁴ | 0.959 | 1.4×10⁻³ |
| | | | | | 11:116701353:C:T | 5.6×10⁻⁴ | 1.009 | 1.5×10⁻³ |
| | | | | | 11:116701354:G:A | 6.1×10⁻⁴ | 0.528 | 5.7×10⁻² |
| CETP | 6×10⁻²⁰ | 2.7×10⁻³ | 2.4×10⁻¹⁹ (3.2×10⁻²) | 1.2×10⁻²⁰ | 16:57015091:G:C | 3.2×10⁻² | -0.359 | 1.3×10⁻²⁰ |
| | | | | | 16:57007387:C:T | 4.3×10⁻⁵ | 2.241 | 2.3×10⁻² |
| | | | | | 16:56995935:C:G | 4.3×10⁻⁵ | -1.572 | 1.1×10⁻¹ |
| | | | | | 16:57012039:G:A | 4.3×10⁻⁵ | -0.803 | 4.2×10⁻¹ |
| | | | | | 16:57009022:G:A | 1.7×10⁻⁴ | 0.309 | 5.3×10⁻¹ |
| | | | | | 16:57015076:G:A | 2.2×10⁻⁴ | 0.144 | 7.4×10⁻¹ |
| | | | | | 16:57012094:A:G | 4.3×10⁻⁵ | 0.182 | 8.5×10⁻¹ |
| LIPGᵇ | 1.3×10⁻¹⁰ | 6.7×10⁻⁹ | 4.5×10⁻¹⁰ (9.4×10⁻³) | 1.9×10⁻⁸ | 18:47109955:A:G | 9.4×10⁻³ | 0.375 | 4.5×10⁻⁸ |
| | | | | | 18:47113165:C:T | 9.1×10⁻⁴ | 0.668 | 2.3×10⁻³ |
| | | | | | 18:47109939:G:A | 1.7×10⁻⁴ | 1.012 | 3.9×10⁻² |
| | | | | | 18:47101838:G:A | 4.3×10⁻⁵ | 1.000 | 3.1×10⁻¹ |
| LPLᵇ | 3.7×10⁻¹¹ | 4.5×10⁻⁵ | 1.2×10⁻¹⁰ (2.0×10⁻²) | 2×10⁻¹¹ | 8:19813529:A:G | 2.0×10⁻² | -0.273 | 1.3×10⁻⁸ |
| | | | | | 8:19805708:G:A | 1.1×10⁻² | -0.254 | 7.5×10⁻⁵ |
| | | | | | 8:19816888:C:T | 1.1×10⁻³ | 0.234 | 2.3×10⁻¹ |
| | | | | | 8:19819628:T:G | 4.3×10⁻⁵ | 0.193 | 8.4×10⁻¹ |
| LIPC | 1.8×10⁻⁴ | 1.5×10⁻⁴ | 3.2×10⁻⁵ (6.2×10⁻³) | 1.7×10⁻⁷ | 15:58855748:C:T | 6.2×10⁻³ | 0.539 | 4.9×10⁻¹⁰ |
| | | | | | 15:58837989:G:A | 7.4×10⁻⁴ | 0.542 | 2.5×10⁻² |
| | | | | | 15:58833993:G:A | 3.1×10⁻² | 0.054 | 1.7×10⁻¹ |
| | | | | | 15:58830716:G:A | 8.7×10⁻⁵ | 0.123 | 8.6×10⁻¹ |
| | | | | | 15:58853079:A:C | 5.9×10⁻³ | -0.003 | 9.8×10⁻¹ |
| | | | | | 15:58860956:G:A | 4.3×10⁻⁵ | 0.025 | 9.8×10⁻¹ |
ᵃ Significance level 2.84×10⁻⁶ was used for reporting significant genes. Non-synonymous, splice, and stop variants with MAF<0.05 were included in the analysis.

ᵇ The gene-level p-value is smaller than the p-value for each of the single variants included in the test.

ᶜ P-values of SKAT were generated using the weights suggested in Wu et al. [2011].

ᵈ Variants are in the following format: CHR:POS:REF:ALT.

Comparison with other Methods and Tools

To validate our approach, we compared our implementation to several others in a simulated family sample of 10,000 individuals distributed across 1,000 families matching Pedigree10 (see Figure 1). We simulated 4,000 genes, each 1,000 base pairs long, in families from a pool of haplotypes. A quantitative trait was simulated under the null. Variants with MAF<0.05 were grouped for gene-level tests. Pedigree-based kinship matrices were used in all analyses. We then analyzed the simulated sample using our own famrvtest (SKAT, burden and VT tests), pedgene (burden and kernel tests) [Schaid et al. 2013], famSKAT [Chen et al. 2013], and FFBSKAT [Svishcheva et al. 2014]. Figure S9 shows that all tests generate well-controlled QQ plots under the null.

To compare methods under the alternative, we simulated a dataset of 5,000 individuals (500 × Pedigree10) with a 1,000 base-pair gene in which 50% of variants with MAF<0.05 were causal and together explained 1% of trait variance. We simulated data sets where all causal variants had the same direction of effect and also where half of the causal variants had opposite effects. In this simulation, our method always matched or slightly outperformed alternative implementations (see Figure S10).

These comparisons also allowed us to evaluate computational performance and requirements for our tool. Wherever possible, we tried to provide faster computation and lower memory use, while still allowing for flexible input formats and varied choices of association tests. famrvtest is implemented as a C++ command-line tool, uses computationally efficient algorithms to fit linear mixed models [Lippert et al. 2011], and recognizes pedigree-based kinship estimates as block-diagonal matrices to save computational effort. For our simulated dataset with 10,000 individuals and 164,323 variants distributed across 4,000 genes, analysis with famrvtest required 1.5 hours and 1.3GB of memory to calculate both SKAT and burden test statistics, a savings of up to 10-100 fold relative to alternative tools (see Table S7).

Discussion

Gene-level association tests and meta-analysis are important tools for discovering rare variant associations. We have proposed a series of methods that facilitate these analyses in family samples (or in samples where cryptic relatedness is modeled using variance components). Our C++ tools implement simple burden tests (weighted or unweighted), variable threshold tests, and SKAT tests; the latter outperform the other tests when only a small fraction of variants in each gene is causal or when variants with opposite effects reside in the same gene.
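The contrast between burden and SKAT-type tests when effects point in opposite directions can be seen from the textbook forms of the two statistics. The sketch below assumes per-variant score statistics U with covariance matrix V and weights w, and uses the commonly cited forms: burden = (Σ w_j U_j)² / (wᵀVw) and SKAT Q = Σ w_j² U_j². These generic forms, the function names, and the toy inputs are assumptions for illustration rather than a transcription of our implementation. Obtaining a p-value for Q additionally requires the distribution of a weighted sum of chi-square variables, which is not shown.

```python
def burden_statistic(u, v, w):
    """Weighted burden chi-square (1 df): (w'U)^2 / (w'Vw)."""
    numerator = sum(wj * uj for wj, uj in zip(w, u)) ** 2
    m = len(w)
    denominator = sum(w[j] * v[j][k] * w[k] for j in range(m) for k in range(m))
    return numerator / denominator


def skat_statistic(u, w):
    """SKAT-type quadratic form: sum of squared weighted score statistics."""
    return sum((wj * uj) ** 2 for wj, uj in zip(w, u))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy covariance: independent variants
    weights = [1.0, 1.0]
    same_direction = [3.0, 3.0]
    opposite_direction = [3.0, -3.0]
    for label, scores in (("same", same_direction), ("opposite", opposite_direction)):
        print(label,
              burden_statistic(scores, identity, weights),
              skat_statistic(scores, weights))
```

In this toy example, the burden statistic collapses to zero when the two score statistics cancel, while the SKAT-type statistic is unchanged, because it sums squared contributions.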

We compare the relative benefits of family samples and population samples. By simulation, we show that family samples can provide substantially greater power for rare variant association studies because of a “jackpot” effect: the potential for observing many copies of a trait-associated rare variant. This advantage is likely to be extremely important in the first generation of rare variant association studies, each of which is only expected to detect a small fraction of all the true rare variant association signals. An example of the successful discovery of such a variant is rs72658864/V578A in LDLR, a rare variant associated with LDL with effect size 23.7 mg/dl [Sanna et al. 2011]. This variant was observed with frequency 0.00035 in the SardiNIA sample, where it was present in multiple families, but has not yet been observed in the 1000 Genomes [Abecasis et al. 2012] or the NHLBI Exome Sequencing Projects [Fu et al. 2013; Tennessen et al. 2012], suggesting that it is rare indeed.

We demonstrate the utility of our methods by analyzing two samples with complex inter-relatedness. Meta-analysis of SardiNIA and HUNT resulted in a well-calibrated genomic control value of 1.02 and increased signal at many loci known to be associated with HDL – demonstrating the feasibility of including family samples in rare variant meta-analysis. We expect that meta-analysis will be useful not only for combining data across studies but also to facilitate analysis of large samples genotyped or sequenced across multiple platforms or analyzed using a single platform but in a batched manner.

We foresee several potential areas for refinement of our methods. For example, a limitation of our current approach to meta-analysis is that cross-study relatedness and sample overlap are not modeled. In genome-wide studies, it may be possible to overcome this limitation by using the genome-wide correlation of test statistics between pairs of studies to calculate an adjustment factor that could account for overlap or relatedness between individuals in two studies, as suggested for single marker meta-analyses [Lin and Sullivan 2009]. An extension of this idea has also been proposed [Han et al. 2013]. Extending our methods to non-coding variants will also be attractive, particularly since the majority of trait-associated variants found to date are located in non-coding regions. A difficulty will be the development of good grouping strategies for non-coding variants, where interpretation of functional consequence is more challenging. Another challenge we foresee is the extension of our methods to discrete traits. The natural way to do this is to consider an underlying continuous liability scale and use multivariate integration to fit the model, but there may be more computationally efficient alternatives to be discovered.

In summary, we have proposed a series of gene-level association tests for family samples and methods for calculating these in a meta-analysis of related and/or unrelated samples. We also implemented our methods in freely available and open source C++ tools: http://genome.sph.umich.edu/wiki/FamRvTest and http://genome.sph.umich.edu/wiki/RAREMETAL. We hope these tools and methods will facilitate the next round of gene-mapping studies.

Supplementary Material

Supplemental

Acknowledgments

We thank Drs Michael Boehnke and Peter Song for helpful suggestions and comments. We thank Scott Vrieze for helpful discussions.

References

  1. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
  2. Abecasis GR, Cardon LR, Cookson WO. A general test of association for quantitative traits in nuclear families. American journal of human genetics. 2000;66:279–292. doi: 10.1086/302698.
  3. Abecasis GR, Cherny SS, Cardon LR. The impact of genotyping error on family-based analysis of quantitative traits. European journal of human genetics : EJHG. 2001;9:130–134. doi: 10.1038/sj.ejhg.5200594.
  4. Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nature genetics. 2002;30:97–101. doi: 10.1038/ng786.
  5. Abifadel M, Varret M, Rabes JP, Allard D, Ouguerram K, Devillers M, Cruaud C, Benjannet S, Wickham L, Erlich D, et al. Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nature genetics. 2003;34:154–156. doi: 10.1038/ng1161.
  6. Adams AM, Hudson RR. Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms. Genetics. 2004;168:1699–1712. doi: 10.1534/genetics.104.030171.
  7. Astle W, Balding DJ. Population Structure and Cryptic Relatedness in Genetic Association Studies. Statistical Science. 2009;24:451–471.
  8. Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genetic epidemiology. 2013;37:196–204. doi: 10.1002/gepi.21703.
  9. Chen WM, Abecasis GR. Family-based association tests for genomewide association scans. American journal of human genetics. 2007;81:913–926. doi: 10.1086/521580.
  10. Davies R. The distribution of a linear combination of chi-square random variables. J R Stat Soc Ser C Appl Stat. 1980;29.
  11. De G, Yip WK, Ionita-Laza I, Laird N. Rare variant analysis for family-based design. PloS one. 2013;8:e48495. doi: 10.1371/journal.pone.0048495.
  12. Fu W, O'Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Rieder MJ, Altshuler D, Shendure J, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493:216–220. doi: 10.1038/nature11690.
  13. Genz A. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics. 1992;1:141–150.
  14. Giorgio P, Eleonora P, Vrieze SI, Sidore C, Steri M, Danjou F, Busonero F, Mulas A, Zoledziewska M, Maschio A, et al. Rare variants genotype imputation with thousands of study-specific whole-genome sequences: Implications for cost-effective study designs. European Journal of Human Genetics. 2014. doi: 10.1038/ejhg.2014.216.
  15. Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, Yu F, Gibbs RA, Bustamante CD. Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences of the United States of America. 2011;108:11983–11988. doi: 10.1073/pnas.1019276108.
  16. Holmen OL, Zhang H, Fan Y, Hovelson DH, Schmidt EM, Zhou W, Guo Y, Zhang J, Langhammer A, Lochen ML, et al. Systematic evaluation of coding variation identifies a candidate causal variant in TM6SF2 influencing total cholesterol and myocardial infarction risk. Nature genetics. 2014a;46:345–351. doi: 10.1038/ng.2926.
  17. Holmen TL, Bratberg G, Krokstad S, Langhammer A, Hveem K, Midthjell K, Heggland J, Holmen J. Cohort profile of the Young-HUNT Study, Norway: A population-based study of adolescents. International journal of epidemiology. 2014b;43:536–544. doi: 10.1093/ije/dys232.
  18. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337.
  19. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Family-based association tests for sequence data, and comparisons with population-based association tests. European journal of human genetics : EJHG. 2013. doi: 10.1038/ejhg.2012.308.
  20. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nature genetics. 2010;42:348–354. doi: 10.1038/ng.548.
  21. Laird NM, Horvath S, Xu X. Implementing a unified approach to family-based tests of association. Genetic epidemiology. 2000;19(Suppl 1):S36–42. doi: 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M.
  22. Laird NM, Lange C. Family-based designs in the age of large-scale gene-association studies. Nature reviews Genetics. 2006;7:385–394. doi: 10.1038/nrg1839.
  23. Laird NM, Lange C. Family-based methods for linkage and association analysis. Advances in genetics. 2008;60:219–252. doi: 10.1016/S0065-2660(07)00410-5.
  24. Lange K. Mathematical and Statistical Methods for Genetic Analysis. 1997.
  25. Lange LA, Hu Y, Zhang H, Xue C, Schmidt EM, Tang ZZ, Bizon C, Lange EM, Smith JD, Turner EH, et al. Whole-exome sequencing identifies rare and low-frequency coding variants associated with LDL cholesterol. American journal of human genetics. 2014;94:233–245. doi: 10.1016/j.ajhg.2014.01.010.
  26. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, Lin X. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. American journal of human genetics. 2012a;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
  27. Lee S, Teslovich TM, Boehnke M, Lin X. General Framework for Meta-analysis of Rare Variants in Sequencing Association Studies. American journal of human genetics. 2013. doi: 10.1016/j.ajhg.2013.05.010.
  28. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012b;13:762–775. doi: 10.1093/biostatistics/kxs014.
  29. Lin DY, Sullivan PF. Meta-analysis of genome-wide association studies with overlapping subjects. American journal of human genetics. 2009;85:862–872. doi: 10.1016/j.ajhg.2009.11.001.
  30. Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. American journal of human genetics. 2011;89:354–367. doi: 10.1016/j.ajhg.2011.07.015.
  31. Lin DY, Zeng D. On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika. 2010;97:321–332. doi: 10.1093/biomet/asq006.
  32. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genome-wide association studies. Nature methods. 2011;8:833–835. doi: 10.1038/nmeth.1681.
  33. Liu DJ, Peloso GM, Zhan X, Holmen OL, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, et al. Meta-analysis of gene-level tests for rare variant association. Nature genetics. 2014;46:200–204. doi: 10.1038/ng.2852.
  34. Liu H, Tang Y, Zhang H. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput Stat Data Anal. 2009:853–856.
  35. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
  36. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic epidemiology. 2010;34:188–193. doi: 10.1002/gepi.20450.
  37. Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337:100–104. doi: 10.1126/science.1217876.
  38. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101. doi: 10.1038/nature07331.
  39. Ott J, Kamatani Y, Lathrop M. Family-based designs for genome-wide association studies. Nature reviews Genetics. 2011;12:465–474. doi: 10.1038/nrg2989.
  40. Pilia G, Chen WM, Scuteri A, Orru M, Albai G, Dei M, Lai S, Usala G, Lai M, Loi P, et al. Heritability of cardiovascular and personality traits in 6,148 Sardinians. PLoS genetics. 2006;2:e132. doi: 10.1371/journal.pgen.0020132.
  41. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. American journal of human genetics. 2010;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
  42. Saad M, Wijsman EM. Power of family-based association designs to detect rare variants in large pedigrees using imputed genotypes. Genetic epidemiology. 2014;38:1–9. doi: 10.1002/gepi.21776.
  43. Sanna S, Li B, Mulas A, Sidore C, Kang HM, Jackson AU, Piras MG, Usala G, Maninchedda G, Sassu A, et al. Fine mapping of five loci associated with low-density lipoprotein cholesterol detects variants that double the explained heritability. PLoS genetics. 2011;7:e1002198. doi: 10.1371/journal.pgen.1002198.
  44. Schaid DJ, McDonnell SK, Sinnwell JP, Thibodeau SN. Multiple genetic variant association testing by collapsing and kernel methods with pedigree or population structured data. Genetic epidemiology. 2013;37:409–418. doi: 10.1002/gepi.21727.
  45. Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser PA, Lin X. SNP Set Association Analysis for Familial Data. Genetic epidemiology. 2012. doi: 10.1002/gepi.21676.
  46. Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science. 2007;316:1341–1345. doi: 10.1126/science.1142382.
  47. Svishcheva GR, Belonogova NM, Axenovich TI. FFBSKAT: fast family-based sequence kernel association test. PloS one. 2014;9:e99407. doi: 10.1371/journal.pone.0099407.
  48. Tang ZZ, Lin DY. MASS: meta-analysis of score statistics for sequencing studies. Bioinformatics. 2013. doi: 10.1093/bioinformatics/btt280.
  49. Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240.
  50. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26:2190–2191. doi: 10.1093/bioinformatics/btq340.
  51. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. American journal of human genetics. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
  52. Yan Q, Tiwari HK, Yi N, Lin WY, Gao G, Lou XY, Cui X, Liu N. Kernel-machine testing coupled with a rank-truncation method for genetic pathway analysis. Genetic epidemiology. 2014;38:447–456. doi: 10.1002/gepi.21813.
  53. Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nature genetics. 2012;44:821–824. doi: 10.1038/ng.2310.
