Author manuscript; available in PMC: 2014 Jul 1.
Published in final edited form as: Curr Protoc Hum Genet. 2013 Jul;0 1:10.1002/0471142905.hg0126s78. doi: 10.1002/0471142905.hg0126s78

Identifying rare variants associated with complex traits via sequencing

Bingshan Li 1, Dajiang J Liu 2, Suzanne M Leal 3
PMCID: PMC3830981  NIHMSID: NIHMS506755  PMID: 23853079

Abstract

Although genome-wide association studies have been successful in detecting associations with common variants, there is currently an increasing interest in identifying low frequency and rare variants associated with complex traits. Next-generation sequencing technologies make it feasible to survey the full spectrum of genetic variation in coding regions or the entire genome. Due to the low frequency of rare variants, coupled with allelic heterogeneity, however, the association analysis for rare variants is challenging and traditional methods are ineffective. Recently a battery of new statistical methods has been proposed for identifying rare variants associated with complex traits. These methods test for associations by aggregating multiple rare variants across a gene or a genomic region, or a group of variants in the genome. In this Unit, we describe key concepts for rare variant association for complex traits, survey some of the recent methods and discuss their statistical power under various scenarios, and provide practical guidance on analyzing next-generation sequencing data for identifying rare variants associated with complex traits.

INTRODUCTION

To date, numerous genes involved in disease and trait etiology have been identified through linkage and association studies. For Mendelian diseases, linkage analysis is usually used to localize the genomic regions harboring causal variants, and fine mapping methods are then used to pinpoint causal genes and variants (see Unit 1.19 for linkage analysis). Thus far the underlying genetic cause of ~3,500 Mendelian disorders is known (http://www.ncbi.nlm.nih.gov/omim). Unlike for Mendelian diseases, which are caused by rare, highly penetrant genetic variants, the genetic basis of complex traits remains largely unknown. Until recently the pursuit of understanding the genetic etiology of complex traits has been almost solely based on the common-disease common-variants (CDCV) hypothesis (Hirschhorn and Daly, 2005; Iyengar and Elston, 2007; Schork et al., 2009; Smith and Lusis, 2002), which asserts that common complex diseases are due to common variants (e.g. minor allele frequency (MAF) >0.05) with usually little or no allelic heterogeneity within a locus. Single nucleotide polymorphisms (SNPs) are the most prevalent form of common genetic variation and have become de facto markers for localizing common disease-causing variants. Because neighboring common variants can be in strong linkage disequilibrium (LD) and thus provide redundant information, it is only necessary to survey a subset of SNPs (i.e. tag SNPs) to achieve genome-wide coverage (see also Unit 1.4). When disease-causing common variants are in strong LD with one or more tag SNPs, genetic effects of causal variants lead to association signals observable at the tag SNPs. This is the basis for genome-wide association studies (GWAS) and makes it cost-effective to genotype genome-wide tag SNPs on thousands of samples using microarray chips.

By design this is an indirect mapping strategy that rarely tests causal variants directly. To date, >8,500 associated SNPs have been identified for a variety of complex traits (the National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/gwastudies). Nevertheless, most identified common associated SNPs only have weak genetic effects (e.g. odds ratio <1.5), and collectively these identified variants only account for a small proportion of heritability for most complex traits (Maher, 2008; Manolio et al., 2009), suggesting that other major mechanisms are involved in the genetic etiology of complex traits. Of great interest is the common-disease rare-variants (CDRV) hypothesis (Iyengar and Elston, 2007; Schork et al., 2009; Smith and Lusis, 2002), which asserts that common diseases are due to rare variants, and contrary to the CDCV hypothesis, rare variants often exhibit extreme allelic heterogeneity. To date several studies have unraveled functional roles of rare variants in complex traits (Ahituv et al., 2007; Bodmer and Bonilla, 2008; Cohen et al., 2004; Ji et al., 2008; Romeo et al., 2007), making the role of rare variants in complex trait etiology an attractive avenue to pursue. Rare variants are inefficiently tagged in GWAS due to the low correlation between rare variants (e.g. MAF <1%) and common tag SNPs with a much higher MAF (e.g. >5%) (Li and Leal, 2008). Therefore associations with causal rare variants will usually be missed in GWAS. Sequencing can uncover the full spectrum of genetic variation and is the optimal approach to identifying rare genetic variants for association studies. Owing to the advancement of cost-effective next-generation sequencing (NGS) technologies (Mardis, 2008; Shendure and Ji, 2008), it is now feasible to sequence whole exomes (i.e. the protein coding regions) in hundreds or even thousands of samples, and sequencing whole genomes will soon be practical. Our understanding of the allelic architecture of complex traits will grow as more whole-genome and exome sequencing studies are carried out, driven by the continuously decreasing cost of sequencing.

Traditional analysis methods used in GWAS are single-marker tests, where SNPs are tested individually and multiple testing is corrected for to control the family-wise error rate (FWER). Due to the extensive LD among common SNPs, permutation-based approaches can be used to account for the inter-SNP correlation. However, for GWAS, using permutation-based methods to control the FWER is computationally intensive and not practical for large-scale studies. An alternative approach is to estimate the effective number of independent tests based on the LD patterns and use the Bonferroni approach to control the FWER. In practice a p-value of < $5\times10^{-8}$ is used to declare genome-wide statistical significance based on European populations, corresponding to correcting for 1 million independent tests at a family-wise significance level of 0.05 (Dudbridge and Gusnanto, 2008). This cutoff is still used even if fewer or more than one million tests are performed. Although testing individual variants for association can also be done for rare variants, it is inevitably underpowered (Li and Leal, 2008) unless the sample size is very large. To address these challenges, numerous novel analysis strategies have been developed to identify rare variants associated with complex traits. This battery of new methods, often referred to as collapsing, group-wise or pooled approaches, generally involves aggregating multiple rare variants in a gene, a region, or any arbitrary set in the genome, and testing the association of the group of rare variants as a whole. In this commentary, we describe some key concepts related to rare variants, summarize recently developed aggregation methods, provide practical guidance on analysis of rare variation obtained from sequencing, and discuss further challenges that need to be addressed for rare variant association studies.

KEY CONCEPTS

LD, Association Mapping and Rare variants

Recall that linkage disequilibrium (LD) is the non-random association of alleles at different loci: the co-occurrence of alleles at two loci on the same haplotype is either more or less frequent than expected from random pairing of the two alleles based on their allele frequencies (Hartl and Clark, 2007). Let $A$ and $a$ denote the two alleles at locus 1, and $B$ and $b$ the two alleles at locus 2, with corresponding allele frequencies $p_A$, $p_a = 1 - p_A$ for locus 1 and $p_B$, $p_b = 1 - p_B$ for locus 2. Without loss of generality (WLOG) we assume that the minor alleles at the two loci are $A$ and $B$ respectively, and $p_A \le p_B$. Denote the frequency of the two-locus haplotype $H_{AB}$ as $p_{AB}$. Then the LD between the two loci is $D_{AB} = p_{AB} - p_A p_B$, where $p_A p_B$ is the expected frequency of $H_{AB}$ under the assumption of no association between the two loci (i.e. no LD).

The concept of LD is the basis for LD-based indirect association mapping, where a variant in LD with the causal variant is analyzed instead of the causal variant itself. For example, when the underlying causal allele is $A$ at locus 1 and $D_{AB} \ne 0$, the frequency of allele $A$ in cases is different from that in controls. Due to the LD between $A$ and $B$ we will also observe differential frequencies of allele $B$ in cases vs. controls. Therefore it is not necessary to test the causal SNPs directly, and LD-based association mapping was the most commonly employed strategy prior to the sequencing era. There are different measures of LD, and for association studies the most relevant measure is $r^2$, the squared correlation coefficient between alleles at the two loci, which can be calculated as $r^2 = D_{AB}^2 / [p_A(1-p_A)p_B(1-p_B)] = (p_{AB} - p_A p_B)^2 / [p_A(1-p_A)p_B(1-p_B)]$. There is a simple relationship between the sample size and $r^2$ when the underlying disease model is multiplicative: for fixed statistical power, the sample size required is inversely proportional to $r^2$ (Pritchard and Przeworski, 2001). For example, the sample size needs to be doubled to achieve the same power if the $r^2$ between the tag SNP and the causal variant is 0.5. The higher the $r^2$, the better the power for association studies.
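The two quantities defined above are simple to compute once the haplotype frequency is known. The following is a minimal sketch; the function names and the example frequencies are illustrative assumptions, not values from any study.

```python
def ld_measures(p_a, p_b, p_ab):
    """Return (D, r^2) for two biallelic loci.

    p_a, p_b: frequencies of alleles A and B.
    p_ab: frequency of the two-locus haplotype carrying both A and B.
    """
    d = p_ab - p_a * p_b  # D_AB = p_AB - p_A * p_B
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


def sample_size_inflation(r2):
    """Factor by which the sample size grows, for fixed power under a
    multiplicative model, when a tag SNP is tested instead of the causal
    variant: 1 / r^2."""
    return 1.0 / r2


# Hypothetical frequencies chosen only for illustration.
d, r2 = ld_measures(p_a=0.2, p_b=0.3, p_ab=0.15)
print(d, r2, sample_size_inflation(r2))
```

With $r^2 = 0.5$ the inflation factor is 2, matching the doubling of sample size described in the text.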

There is extensive LD in the human genome, and the HapMap project (Altshuler et al., 2010) comprehensively surveyed the LD among common variants in various populations. A necessary condition to achieve high $r^2$ is the similarity of the allele frequencies at two loci. For given allele frequencies, the maximum $r^2$, which is not necessarily equal to 1, is achieved when the two minor alleles are on the same haplotype and the haplotypes were not broken by historical recombination events. In this case, the two loci are in complete LD (i.e. $D' = 1$) and only three haplotypes ($H_{AB}$, $H_{aB}$ and $H_{ab}$, with $p_{AB} = p_A$) are observed in the population. The maximum $r^2$ between the two loci is $(p_{AB} - p_A p_B)^2 / [p_A(1-p_A)p_B(1-p_B)] = p_A(1-p_B) / [p_B(1-p_A)]$. Only when $p_A = p_B$, i.e. the variant frequency is exactly the same for the two loci, is $r^2 = 1$; in this case, only two haplotypes ($H_{AB}$ and $H_{ab}$) are observed and the two loci are in perfect LD. When the allele frequencies of two loci are very different, $r^2$ will never be very large. To see this, assume $p_A \ll p_B$ and $1 - p_A \approx 1$. Then the maximum $r^2 \approx (p_A / p_B)(1 - p_B)$, which is $\ll 1$ since $p_A$ is very small compared to $p_B$. This indicates that causal rare variants are most likely to be missed in GWAS for single marker tests, since GWAS chips are designed to include predominantly common variants (e.g. MAF >0.05).
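To make the frequency-mismatch argument concrete, the upper bound on $r^2$ can be evaluated for a few pairs of allele frequencies. This is a small sketch; the function name and the frequency pairs are hypothetical choices for illustration.

```python
def max_r2(p_a, p_b):
    """Upper bound on r^2 for two loci with minor allele frequencies p_a and p_b.

    The bound is attained when every copy of the rarer allele sits on a
    haplotype that also carries the other minor allele (p_AB = p_A).
    """
    if p_a > p_b:
        p_a, p_b = p_b, p_a  # enforce p_a <= p_b, as assumed in the text
    return p_a * (1 - p_b) / (p_b * (1 - p_a))


# A tag SNP with MAF 0.2 paired with causal variants of decreasing frequency.
for p_a, p_b in [(0.2, 0.2), (0.05, 0.2), (0.005, 0.2)]:
    print(p_a, p_b, round(max_r2(p_a, p_b), 4))
```

The bound equals 1 only when the two frequencies match and shrinks roughly in proportion to $p_A / p_B$ as the causal variant becomes rarer.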

When considering haplotypes across multiple SNPs, some rare causal variants are likely to be tagged by rare haplotypes, and methods based on rare haplotype analysis (Li et al., 2010a; Zhu et al., 2010) may be used to identify such signals. It is expected, however, that the majority of unobserved causal rare variants reside on common haplotypes, so rare haplotype-based methods may tag only a small proportion of causal rare variants. Other strategies are therefore in order for mapping rare variants associated with complex traits. Assume $A$ and $B$ are the minor alleles of two rare variants, that is $p_A \ll 1$, $p_B \ll 1$, and $1 - p_A \approx 1$, $1 - p_B \approx 1$. We also assume WLOG that the $A$ allele was introduced in the population later than allele $B$. By chance it is more likely that allele $A$ occurred on a haplotype not carrying the $B$ allele. In this case there are 3 haplotypes ($H_{Ab}$, $H_{aB}$ and $H_{ab}$) after the introduction of the $A$ allele, and $r^2 = (0 - p_A p_B)^2 / [p_A(1-p_A)p_B(1-p_B)] \approx p_A p_B \approx 0$. Due to the extremely weak $r^2$ between rare variants, it is clear that LD-based indirect mapping via traditional genotyping is not a viable option for rare variants.

Sequencing, then, seems the optimal approach to uncover and identify associated rare variants. Since association tests are performed directly on potentially causal variants, sequencing-based association analysis is a direct mapping approach, without the fine mapping step that follows LD-based association studies. Traditional Sanger sequencing (Sanger et al., 1977) is laborious, low-throughput and expensive. NGS technologies, on the other hand, make it feasible to sequence targeted regions, exomes and whole genomes of thousands of samples and hold great promise for genetic studies of complex traits. The 1000 Genomes Project (Abecasis et al., 2012; Consortium, 2010) utilized NGS platforms to provide a comprehensive catalogue of genetic variation in various populations. NGS is routinely used for genetic studies of various traits and is expected to reveal a comprehensive allelic architecture and genetic etiology for complex traits in the near future.

Sequencing Strategies

For sequencing studies one of the key factors that determines the accuracy and completeness of the underlying genetic variation spectrum is the sequencing depth of coverage, which is defined as the average number of reads mapping to each position in the genome (see Units 18.2, 18.3, 18.4). For example, let the total number of sequenced reads be $N$, each with $R$ bases. The depth of coverage can be calculated as $N \times R / L$, where $L$ is the length of the genomic region. For genome-level coverage, $L \approx 3\times10^{9}$, the total length of the genome, and for exome-level coverage $L$ is the total length of the corresponding exonic target regions for a particular capture technology. Although high-depth (e.g. >30×) whole-genome sequencing (WGS) is ideal for complete surveys of genomic variants, high coverage sequence data is associated with high costs. This strategy is still not practical to date for sequencing the whole genome of thousands of samples. Two alternative strategies, each with specific goals, have been commonly used in current large-scale sequencing studies: low-depth WGS and high-depth exome sequencing. Low-depth WGS is designed to sequence the whole genome to 4–6× and is a cost-effective approach to obtain whole-genome coverage of genetic variation. Because coverage at each position is insufficient for inferring the underlying genotypes, a haplotype-based approach, e.g. MaCH/Thunder (Howie et al., 2012; Li et al., 2011; Li et al., 2010b) or IMPUTE-2 (Howie et al., 2012; Howie et al., 2009), is required to jointly call genotypes, utilizing the LD among variants across the genome. Due to extensive LD among common variants, this approach is best suited for studying common and low frequency variants (e.g. MAF >1%). However, as discussed before, the LD (in terms of $r^2$) among rare variants or between rare and common variants is extremely low and the LD-based joint calling may not be ideal for very rare variants (e.g. MAF <0.5%).
Alternatively, exome sequencing selectively sequences to a high depth the coding regions of the genome (~30 megabases (Ng et al., 2009)). Although the exome occupies only ~1% of the genome, it is estimated that it harbors ~85% of disease-causing variants (Choi et al., 2009), although this number may change dramatically when non-coding regions are extensively studied through WGS. Exome sequencing starts with targeted capture of coding sequences, followed by sequencing of the enriched exomes to deep coverage (e.g. >100×). Because exome sequencing is still considerably less expensive than WGS and promises to identify causal coding variants, it is currently the most popular approach to study the genetic etiology of Mendelian and complex traits. Although both rare and common variants in the coding region can be accurately inferred from sequencing, exome sequencing for the most part excludes non-coding regions and is likely to miss regulatory variants. With the continuously decreasing cost of NGS, high-coverage WGS will become common practice in the next few years to study the non-coding portion of the genome.
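The depth-of-coverage formula $N \times R / L$ can be checked with quick arithmetic. The sketch below assumes, as an idealization, that every read maps to the target; the read counts and read length are hypothetical numbers chosen to reproduce the 30× genome and 100× exome figures mentioned above.

```python
def depth_of_coverage(n_reads, read_length, target_length):
    """Average depth = (number of reads * bases per read) / target length.

    Idealized: assumes all reads map uniquely within the target region.
    """
    return n_reads * read_length / target_length


# Hypothetical run sizes, 100-base reads.
genome_depth = depth_of_coverage(n_reads=900_000_000, read_length=100, target_length=3e9)
exome_depth = depth_of_coverage(n_reads=30_000_000, read_length=100, target_length=30e6)
print(genome_depth, exome_depth)
```

In practice, off-target and unmapped reads lower the realized depth, so these figures should be read as upper bounds.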

Likelihood Models for Genetic Association

Let $n$ be the number of individuals in a sample, $c$ the number of covariates to be included, and $k$ the total number of variants to be tested in a gene. In this Unit we use “gene” to refer to any collection of variants that are to be analyzed together, e.g. a gene, a region, a pathway, or any arbitrary set of variants in the genome. For $i = 1, \ldots, n$, let $y_i$ be the phenotype of the $i$th individual; for $i = 1, \ldots, n$ and $j = 1, \ldots, k$, let $X_{ij}$ denote the number of rare alleles the $i$th individual carries at the $j$th variant; for $i = 1, \ldots, n$ and $j = 1, \ldots, c$, let $Z_{ij}$ denote the value of the $j$th covariate of the $i$th individual. We represent the genotype and covariate data of the $i$th individual in vector form

$$X_i = (X_{i1}, \ldots, X_{ik})^{\top}, \qquad Z_i = (1, Z_{i1}, \ldots, Z_{ic})^{\top}$$

Similarly, we let $X_j$ denote the genotypes of the $j$th variant across all samples, and use $X$ and $Z$ to represent the data matrices of genotypes and covariates respectively, where $X$ is an $n$ by $k$ matrix and $Z$ is an $n$ by $c$ matrix. We can represent the genotype-phenotype relationship in the regression framework as follows:

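$$y_i = \beta^{\top} X_i + \gamma^{\top} Z_i + \varepsilon = \sum_{j=1}^{k} \beta_j X_{ij} + \sum_{j=1}^{c} \gamma_j Z_{ij} + \varepsilon \tag{1}$$

$$\mathrm{logit}(p_i) = \beta^{\top} X_i + \gamma^{\top} Z_i = \sum_{j=1}^{k} \beta_j X_{ij} + \sum_{j=1}^{c} \gamma_j Z_{ij} \tag{2}$$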

Model (1) is for normally distributed quantitative traits and model (2) is for dichotomous traits, where $\beta$ is the vector of the genetic effects of $X_i$ and $\gamma$ is a vector of the effects of covariates $Z_i$ on $y_i$. For dichotomous traits let $p_i$ denote the probability of being a case given the $i$th individual's genotypes and covariates, and $\mathrm{logit}(p_i) = \ln[p_i / (1 - p_i)]$. In this setup the intercept is included in the $\gamma$ vector. The coefficients $\beta_j$ are the log odds ratios (ORs) of the rare alleles for case/control data and the additive effects for quantitative traits. Here we include in the model all variants, including the non-causal ones that are to be analyzed together. For non-causal variants the $\beta_j$ are zero and pose no modeling difficulties. Under the null hypothesis that $X_i$ is not associated with $y_i$, $\beta = 0$, i.e. $\beta_1 = \ldots = \beta_k = 0$. For binary traits assuming a logistic model (2), the likelihood for the $i$th individual, when that individual is a case, is $L_i(\beta) = e^{\beta^{\top} X_i + \gamma^{\top} Z_i} / (1 + e^{\beta^{\top} X_i + \gamma^{\top} Z_i})$, and one minus this quantity for a control. For quantitative traits assuming a linear model (1), the likelihood for the $i$th individual is $L_i(\beta) = (\sigma\sqrt{2\pi})^{-1} e^{-(y_i - \mu)^2 / (2\sigma^2)}$, where $\mu$ is the population mean of the trait. Combining all individual data, the likelihood is $L(\beta) = \prod_{i=1}^{n} L_i(\beta)$. To avoid redundant discussions for both dichotomous and quantitative traits, we will use dichotomous traits to describe association studies, unless otherwise specified. Most of the principles apply directly to quantitative traits as well in the regression framework. Based on the likelihood models, commonly used hypothesis testing approaches for genetic effects $\beta$ are described below.

Likelihood ratio tests

Let $L_1$ be the maximum likelihood in the full model over the parameters $\beta$ and $\gamma$, and $L_0$ be the maximum likelihood in the null model over the parameter space $\gamma$ while fixing $\beta_1 = \ldots = \beta_k = 0$. Under the null hypothesis, the Likelihood Ratio Test (LRT) statistic $\lambda = -2\ln(L_0 / L_1)$ asymptotically follows a $\chi^2_k$ distribution with $k$ degrees of freedom (d.f.). An asymptotic p value can be obtained by comparing the LRT statistic with the $\chi^2_k$ distribution for large sample sizes. When $k > 1$, this tests for the overall effect of all SNPs as a whole but not individual SNP effects.
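As an illustration of the LRT in its simplest setting, consider model (2) with a single binary predictor (carrier vs. non-carrier) and no covariates other than the intercept. In that special case both maximized likelihoods have closed forms, so no iterative fitting is needed. The sketch below is written under those assumptions; the function names and the counts are hypothetical.

```python
import math


def binom_loglik(successes, total):
    """Binomial log-likelihood evaluated at the MLE p = successes / total."""
    ll = 0.0
    if successes > 0:
        ll += successes * math.log(successes / total)
    if successes < total:
        ll += (total - successes) * math.log(1 - successes / total)
    return ll


def lrt_binary_predictor(case_carriers, case_noncarriers,
                         ctrl_carriers, ctrl_noncarriers):
    """LRT of beta = 0 in logit(p) = gamma + beta * x, x in {0, 1}, k = 1."""
    carriers = case_carriers + ctrl_carriers
    noncarriers = case_noncarriers + ctrl_noncarriers
    cases = case_carriers + case_noncarriers
    n = carriers + noncarriers
    # Full model: a separate case probability for carriers and non-carriers.
    ll_full = (binom_loglik(case_carriers, carriers)
               + binom_loglik(case_noncarriers, noncarriers))
    # Null model: one common case probability.
    ll_null = binom_loglik(cases, n)
    stat = max(-2.0 * (ll_null - ll_full), 0.0)  # guard against rounding
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat, p_value = lrt_binary_predictor(30, 70, 10, 90)
print(stat, p_value)
```

With covariates or $k > 1$ the two likelihoods must be maximized numerically, which standard regression software does.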

Wald tests

When maximizing the likelihood over the parameters in the full model, the maximum likelihood estimates (MLEs) of the $\beta_j$ and their corresponding standard errors can be obtained from standard statistical packages. Let $\hat\beta_j$ denote the MLE of $\beta_j$ and $SE(\hat\beta_j)$ denote the standard error of $\hat\beta_j$. Under the null hypothesis, the Wald statistic $w = \hat\beta_j / SE(\hat\beta_j)$ asymptotically follows a standard normal distribution, and an asymptotic p value can be calculated by comparing the statistic with the standard normal distribution. The Wald test can be carried out for any of the $\beta_j$ conditional on covariates and other SNPs.
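For a single binary predictor with no covariates beyond the intercept, the MLE of the log odds ratio and its asymptotic standard error have well-known closed forms from the 2-by-2 table, which gives a compact illustration of the Wald test. This is a sketch under that simplifying assumption; the function name and counts are hypothetical, and all four cells must be non-zero.

```python
import math


def wald_binary_predictor(a, b, c, d):
    """Wald test for a binary predictor in a logistic model without covariates.

    a: case carriers, b: case non-carriers,
    c: control carriers, d: control non-carriers (all must be > 0).
    """
    beta_hat = math.log((a * d) / (b * c))        # MLE of the log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # asymptotic standard error
    w = beta_hat / se
    p_value = math.erfc(abs(w) / math.sqrt(2))     # two-sided normal p value
    return beta_hat, se, w, p_value


beta_hat, se, w, p_value = wald_binary_predictor(30, 70, 10, 90)
print(beta_hat, se, w, p_value)
```

When a cell count is zero, which is common for rare variants, the estimate is undefined; this is one practical symptom of the sparsity problems discussed later in this Unit.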

Score tests

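Let $\alpha = (\beta, \gamma)$ be the combined vector of the parameters $\beta$ and $\gamma$, and $A_i = (X_i, Z_i)$ be the combined vector of $X_i$ and $Z_i$. Then the score statistic for $\alpha$ is $U_\alpha = \sum_{i=1}^{n} \partial \ln L_i(\alpha) / \partial\alpha = \sum_{i=1}^{n} (y_i - \tilde{y}_i) A_i$, where $\tilde{y}_i$ is the expected phenotype for the $i$th individual. Under the null hypothesis its expectation is zero, and its covariance matrix is $V_\alpha = -\sum_{i=1}^{n} \partial^2 \ln L_i(\alpha) / \partial\alpha^2 = \sum_{i=1}^{n} \tilde{y}_i (1 - \tilde{y}_i) A_i A_i^{\top}$. Since we are only interested in testing the SNP effects $\beta$, the covariate effects $\gamma$ are considered nuisance parameters. To eliminate $\gamma$, let $\tilde{y}_i$ be the fitted phenotype values after regressing out the covariates, i.e. $\tilde{y}_i = \mathrm{logit}^{-1}(\hat\gamma^{\top} Z_i)$. The score vector for $\beta$ is $U_\beta = \sum_{i=1}^{n} (y_i - \tilde{y}_i) X_i$. Under the null, $U_\beta$ follows a multivariate normal distribution asymptotically: $U_\beta \sim N_k(0, V_\beta)$, where $V_\beta$ is the covariance matrix of $U_\beta$ under the null. $V_\beta$ can be obtained from $V_\alpha$ as $V_\beta = V_{\beta\beta} - V_{\beta\gamma} V_{\gamma\gamma}^{-1} V_{\gamma\beta}$, where $V_{\beta\beta}$, $V_{\beta\gamma}$ and $V_{\gamma\gamma}$ are the corresponding submatrices of $V_\alpha$. To test the null hypothesis that $\beta = 0$, a score test statistic can be computed as $S = U_\beta^{\top} V_\beta^{-1} U_\beta$, which asymptotically follows a $\chi^2_k$ distribution with $k$ d.f. As for the LRT, the score test only tests the overall effect of all SNPs, not individual SNP effects.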
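The formulas above simplify considerably for one variant ($k = 1$) with an intercept as the only covariate: the fitted value under the null is the case proportion $\bar{y}$, and $V_\beta = V_{\beta\beta} - V_{\beta\gamma} V_{\gamma\gamma}^{-1} V_{\gamma\beta}$ reduces to $\bar{y}(1 - \bar{y}) \sum_i (x_i - \bar{x})^2$. The sketch below implements only that special case; the function name and toy data are assumptions for illustration.

```python
import math


def score_test_single_variant(y, x):
    """Score test of beta = 0 for one variant, intercept-only null model.

    y: list of 0/1 phenotypes; x: list of genotype codes (e.g. 0, 1, 2).
    """
    n = len(y)
    y_bar = sum(y) / n  # fitted value for everyone under the null
    x_bar = sum(x) / n
    u = sum((yi - y_bar) * xi for yi, xi in zip(y, x))            # U_beta
    v = y_bar * (1 - y_bar) * sum((xi - x_bar) ** 2 for xi in x)  # V_beta
    s = u * u / v
    p_value = math.erfc(math.sqrt(s / 2))  # chi-square, 1 d.f.
    return u, v, s, p_value


# Toy data: 100 cases (30 carriers) followed by 100 controls (10 carriers).
y = [1] * 100 + [0] * 100
x = [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90
u, v, s, p_value = score_test_single_variant(y, x)
print(u, v, s, p_value)
```

No model fitting under the alternative is required, which is why score tests are computationally attractive for genome-wide scans.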

STRATEGIC APPROACH

Traditional Approaches

In GWAS the most commonly used analysis approach is the single-marker test, where a statistical test is carried out for each marker and a threshold of $5\times10^{-8}$ is used to correct for multiple testing and control the FWER at the genome level. In GWAS and candidate gene studies, in addition to single-marker tests, if the hypothesis is whether a gene harbors association signals, multi-marker tests are often applied to jointly test the overall effect of the markers in a gene or region as a whole. Although both approaches can be applied to rare variants, the following sections show that both are underpowered when the CDRV hypothesis holds.

Single-marker tests

The simplest approach to genome-wide analysis is to analyze individual variants separately. Without any model assumptions, a 2 d.f. Pearson $\chi^2$ test can be performed on a 2 by 3 contingency table to compare the frequencies of the three genotypes of a variant in cases vs. controls. For rare variants, the frequency of rare-allele homozygotes may be very low and Pearson $\chi^2$ tests may have inflated type I error. One remedy is to group the rare homozygotes and the heterozygotes together (i.e. assuming a dominant model). After the grouping a 1 d.f. Pearson $\chi^2$ test can be performed and is expected to achieve improved power over a 2 d.f. test. For complex traits, an additive genetic model is usually assumed for the 3 genotypes, where carrying an extra copy of the variant allele increases the genetic risk. A convenient way to code the genotype is 0, 1, 2 for genotypes carrying 0, 1 or 2 rare alleles respectively. To test for association in regression models, only one variant is included in (1) or (2), and the test for $\beta = 0$ can be carried out through a Wald test, an LRT, or a score test (equivalent to the commonly used Cochran-Armitage test for trend), all with 1 d.f.

The 0, 1, 2 coding for the genotype is not the most powerful approach for all scenarios. If prior knowledge is available, other coding approaches can be used to reflect the genetic effect of each genotype. For example, a 0, 0, 1 coding is for the recessive model, where carrying one copy of variant allele does not increase disease risk, and 0, 1, 1 is for the dominant model, where the increase in disease risk is the same for heterozygous and homozygous variant carriers. Other ways can be used as well to represent complex models. Although an additive model is unlikely to be strictly correct, the trend test (i.e. score test) is not to test the linearity, and as long as there is a trend, which is likely to hold for complex traits, the trend test is expected to be robust and achieve increased power due to the parsimony of the model. For rare variants uncovered through sequencing, it is likely that association tests are carried out directly on causal variants, and therefore flexible genetic models can be used if prior knowledge is available for specific diseases or variants.
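The three codings described above amount to a lookup from the number of rare alleles to a model-specific score. A minimal sketch follows; the dictionary and function names are illustrative assumptions.

```python
# Score assigned to genotypes carrying 0, 1 or 2 rare alleles under each model.
CODINGS = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}


def code_genotypes(rare_allele_counts, model="additive"):
    """Recode a list of rare-allele counts (0, 1 or 2) under a genetic model."""
    coding = CODINGS[model]
    return [coding[g] for g in rare_allele_counts]


counts = [0, 1, 2, 1, 0]
for model in CODINGS:
    print(model, code_genotypes(counts, model))
```

The recoded vector can then be used as the single predictor in model (1) or (2).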

Multi-marker tests

Often the interest is in testing whether multiple variants in a gene, region or any collection of variants as a whole are associated with the phenotype. This can be achieved using the LRT or score tests discussed before. These multi-marker tests can only jointly test the effect of all markers as a whole, and if the null hypothesis is rejected it is not known which variants are associated with the phenotype. It is possible that all of them or only a subset is associated with the phenotype. To pinpoint associated variants, single-marker tests may be needed to examine individual variant effects.

Limitations of traditional methods when applied to rare variants

The performance of various testing strategies is heavily influenced by the underlying genetic models, and both the power and type I error can be dramatically different under CDCV and CDRV hypotheses. Let's use a gene as an example. Two key features that are different in the two hypotheses influence power. First it is most likely that there is only one causal variant per gene under CDCV but extreme allelic heterogeneity in a gene is often the case when the CDRV hypothesis holds; that is, for the CDCV hypothesis only one causal variant contributes to the association signal while for the CDRV hypothesis multiple rare variants independently influence the phenotype. The second difference is that common variants in a gene are often in strong LD so that multiple common variants can be used to tag the underlying causal common variant, while rare variants are often weakly correlated. Given these differences, the power for single marker tests of rare variants is low for the following reasons: 1) very few individuals in a sample carry rare alleles at single variant sites and therefore the association signal is weak due to low frequencies; 2) in the presence of allelic heterogeneity distinct causal variants in a gene are observed in affected samples and the association signal of individual variants is weakened by each other (Slager et al., 2000); 3) rare variants are drastically more abundant than common ones (Keinan and Clark, 2012; Nelson et al., 2012; Tennessen et al., 2012) and are only weakly correlated, resulting in a severe penalty to correct for multiple testing. All of these make the single marker test an unfavorable approach.

For multi-marker tests, allelic heterogeneity poses a less severe problem than for single-marker tests, since multiple causal variants jointly contribute to the association signal. As a result, multi-marker tests can be more powerful than single-marker tests (Li and Leal, 2008). However there is a large penalty in terms of degrees of freedom for multi-marker tests due to the excess of rare variants to be tested within a gene and the power is degraded with increasing numbers of rare variants. Although ultimately single marker tests can be used to pinpoint individual causal rare variants when the sample size is sufficiently large, this is currently not feasible due to the prohibitive cost of sequencing thousands of samples. For the modest sample size of current sequencing studies, a promising strategy is testing for rare variant associations by aggregating multiple rare variants across a “gene”. This approach is described in detail in the following sections.

For rare variants, not only the statistical power but also the type I error rate is negatively affected. Due to the low frequency of rare variants, the sparsity of the data may make the asymptotic results inaccurate for modest sample sizes. For example, likelihood ratio tests for case/control data are often anti-conservative due to the numerical instability of the likelihood maximization, and conversely Wald and score tests are often conservative. In such situations permutation is usually carried out to obtain empirical p values. Even after aggregating multiple rare variants (see the following sections), the cumulative allele frequency may not be sufficient for asymptotic results to hold, and permutation is often required to calculate empirical p values.
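A generic way to obtain an empirical p value is to shuffle case/control labels while holding genotypes fixed and count how often the permuted statistic is at least as large as the observed one. The sketch below makes several assumptions that are not from the text: the function names, the toy statistic, the default of 1000 permutations, and the common (hits + 1) / (permutations + 1) estimator, which avoids reporting a p value of exactly zero.

```python
import random


def permutation_p_value(y, x, statistic, n_perm=1000, seed=0):
    """Empirical p value for `statistic(y, x)` by permuting phenotype labels.

    Larger statistic values are taken as stronger evidence of association.
    """
    rng = random.Random(seed)
    observed = statistic(y, x)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any genotype-phenotype link
        if statistic(y_perm, x) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def abs_carrier_difference(y, x):
    """Toy statistic: |case carriers - control carriers|.

    Only sensible with equal numbers of cases and controls.
    """
    case_carriers = sum(1 for yi, xi in zip(y, x) if yi == 1 and xi > 0)
    ctrl_carriers = sum(1 for yi, xi in zip(y, x) if yi == 0 and xi > 0)
    return abs(case_carriers - ctrl_carriers)


y = [1] * 50 + [0] * 50
x = [1] * 10 + [0] * 90  # all 10 carriers are cases
print(permutation_p_value(y, x, abs_carrier_difference))
```

Simple label shuffling assumes exchangeability under the null, so it does not by itself adjust for covariates.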

Aggregation Association Analysis For Rare Variants

The goal is to test whether multiple rare variants in a gene as a whole are associated with the phenotype. This class of tests is often referred to as aggregation association tests. The major advantage of this approach is the dimension reduction achieved by aggregating multiple rare variants into a single unit of analysis. Specifically, in the regression framework (1) or (2), where $X_i$ represents the genotypes of the $k$ rare variants in a gene carried by the $i$th individual, the key is to reduce the dimension of $\beta$ from $k$ to one or a small number. A variety of aggregation methods have been proposed (see review papers (Asimit and Zeggini, 2010; Bansal et al., 2010; Dering et al., 2011; Ladouceur et al., 2012; Stitziel et al., 2011)) and we present some of them in the following categories.

1. Burden Tests

This category of aggregation association methods aims to test whether there is an excess of rare variants in cases or controls. A general approach is to collapse or aggregate multiple rare variants in a gene into a single “super” variant and perform the association test on this single “super” variant. Such an aggregation, if done properly, can achieve the following benefits: 1) the low frequencies of multiple rare variants, when summed, increase the overall frequency of the “super” variant and 2) the degrees of freedom are reduced from $k$ to 1. This results in both an enrichment of signals and a reduction of dimensionality. Formally, this aggregation can be represented in the regression model $\mathrm{logit}(p_i) = \beta_a \delta(X_{i1}, \ldots, X_{ik}) + \sum_{j=1}^{c} \gamma_j Z_{ij}$, where $\delta(X_{i1}, \ldots, X_{ik})$ is a function that summarizes multiple rare variants into a single number, which represents a single “super” variant, and $\beta_a$ is the genetic effect of the “super” variant. By aggregating, the original null hypothesis of $\beta_1 = \ldots = \beta_k = 0$ is equivalent to the null hypothesis $\beta_a = 0$. The association test of multiple rare variants thus becomes a single d.f. test, greatly reducing dimensionality. The central component of burden tests is the construction of the aggregation function $\delta(X_{i1}, \ldots, X_{ik})$. Although an appropriately constructed aggregation can increase power, inclusion of non-causal variants can dramatically reduce the power. Several aggregation approaches and their performance in various scenarios are discussed below.

Indicator function

The simplest collapsing way is the use of an indicator function (Li and Leal, 2008),

$$\delta(X_{i1}, \ldots, X_{ik}) = \begin{cases} 1 & \text{if } \sum_{j=1}^{k} X_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$

This simple approach codes 1 for individuals that carry one or more rare alleles within the tested genetic region and zero if all variants are major alleles. For this method, the association test is transformed into testing whether the frequency of rare variant carriers in cases is different from that in controls. Single marker tests using LRT, Wald or score statistics can be carried out as previously described. In addition, a 2-by-2 table can be constructed with numbers of rare variant carriers and non-carriers, and a Fisher exact test can be performed. Although simple, this strategy has an intuitive interpretation in terms of OR of rare variant carriers and even if other more involved methods are used this simple counting can serve as an estimate of the overall genetic effect of rare variants.
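The indicator collapsing and the accompanying 2-by-2 Fisher exact test can be sketched with the standard library alone. The function names and toy genotypes are assumptions, and the two-sided p value below uses one common convention (summing the probabilities of all tables no more likely than the observed one); other conventions exist.

```python
from math import comb


def collapse_indicator(genotypes):
    """Return 1 for individuals carrying at least one rare allele, else 0.

    genotypes: one list of rare-allele counts (X_i1, ..., X_ik) per individual.
    """
    return [1 if sum(row) > 0 else 0 for row in genotypes]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Toy data: 4 cases then 4 controls, 2 variant sites each.
genotypes = [[1, 0], [0, 1], [1, 1], [0, 0],
             [0, 0], [0, 0], [0, 1], [0, 0]]
carrier = collapse_indicator(genotypes)
a, b = sum(carrier[:4]), 4 - sum(carrier[:4])  # cases: carriers, non-carriers
c, d = sum(carrier[4:]), 4 - sum(carrier[4:])  # controls
print(carrier, fisher_exact_two_sided(a, b, c, d))
```

The same carrier indicator can be used as the single predictor in the regression model when covariates need to be included.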

Variant counting

The indicator collapsing method implicitly assumes that an individual most likely carries only one rare variant. However, for larger genes or when multiple genetic regions are analyzed as a single unit, the probability that an individual carries more than one rare variant increases. If it is assumed that individuals with more than one rare variant have an increased risk of being affected, ignoring this information may reduce the power to detect an association. A simple extension is to count the number of rare variants each individual carries (Li and Leal, 2009), i.e.

$$\delta(X_{i1},\ldots,X_{ik}) = \sum_{j=1}^{k} X_{ij},$$

where Xij is coded as the number of rare alleles the ith individual carries at the jth variant site. As in the single-marker approach, Wald, score or LRT tests can be used. In this aggregation, no simple contingency tables can be tabulated due to the potential LD between rare variants. The estimated βa can be interpreted as the log(OR) per rare variant in a gene on average.

Weighted Sum Statistic

It is likely that different rare variants have differential genetic effects. For example, causal variants with strong deleterious effects are under strong purifying selection and therefore more likely to be rare (Gorlov et al., 2008; Keinan and Clark, 2012; Tennessen et al., 2012), and nonsynonymous variants are more likely to affect the gene function than synonymous variants. In either the indicator collapsing or variant counting approach, such information is ignored, and power loss is expected when rare variants to be collapsed have different effects. To take this into account, Madsen and Browning proposed a weighted-sum statistic (WSS) to aggregate multiple rare variants, that is

$$\delta(X_{i1},\ldots,X_{ik}) = \sum_{j=1}^{k} w_j X_{ij},$$

where wj is the weight assigned to the jth rare variant. Specifically, a frequency-dependent weighting is used, $w_j = 1/\sqrt{p_j(1-p_j)}$, where pj is the allele frequency of the jth rare variant in controls, estimated as $p_j = (m_j^U + 1)/(2n_j^U + 2)$, in which $m_j^U$ is the number of minor alleles of the jth variant in controls and $n_j^U$ is the number of controls with non-missing data. Here the numbers 1 and 2 are added to avoid a frequency estimate of zero, which can cause numerical instability. In WSS, rarer variants are up-weighted so that rare alleles contribute more to the test statistic. The genetic scores are then ranked and the WSS is calculated as the sum of the ranks of the cases. Since the frequency estimation depends on the phenotype, i.e. only unaffected individuals are used, empirical p values are obtained by permuting case/control status.
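
The WSS procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: tied scores receive average ranks, the empirical p value is one-sided (excess of high scores in cases), and the function names and toy data are hypothetical. The original publication may differ in details such as tie handling and how the permutation distribution is summarized.

```python
import random
from math import sqrt

def wss_statistic(genotypes, status):
    """Sum of the case ranks of the weighted scores; weights come from controls only."""
    controls = [g for g, s in zip(genotypes, status) if s == 0]
    weights = []
    for j in range(len(genotypes[0])):
        m_u = sum(g[j] for g in controls)            # minor alleles in controls
        p_j = (m_u + 1) / (2 * len(controls) + 2)    # smoothed frequency estimate
        weights.append(1 / sqrt(p_j * (1 - p_j)))
    scores = [sum(w * x for w, x in zip(weights, g)) for g in genotypes]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                            # assign average ranks to ties
        k = i
        while k + 1 < len(order) and scores[order[k + 1]] == scores[order[i]]:
            k += 1
        for t in range(i, k + 1):
            ranks[order[t]] = (i + k) / 2 + 1
        i = k + 1
    return sum(r for r, s in zip(ranks, status) if s == 1)

def wss_permutation_p(genotypes, status, n_perm=1000, seed=0):
    """Empirical one-sided p value obtained by permuting case/control status."""
    rng = random.Random(seed)
    observed = wss_statistic(genotypes, status)
    perm = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if wss_statistic(genotypes, perm) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 4 cases followed by 4 controls, 2 rare-variant sites
genotypes = [[1, 0], [0, 1], [1, 1], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0]]
status = [1, 1, 1, 1, 0, 0, 0, 0]
observed, p_emp = wss_permutation_p(genotypes, status, n_perm=200)
print(observed, p_emp)
```

Note that the weights are recomputed inside every permutation, because the frequency estimates depend on which individuals are labelled as controls.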

The WSS as originally proposed cannot account for covariate effects and permutation is needed to obtain empirical p values, which is computationally expensive. The same aggregation can be implemented in regression models and has been extended to general score tests (Lin and Tang, 2011). Such a setup can readily incorporate covariates and obtain estimates of genetic effects efficiently. To obtain asymptotic p values in regression models, frequency estimates should not be dependent upon phenotypes, since inflation of type I error is expected if frequencies are estimated based on controls only.

Weighting scheme

The assumption of up-weighting rarer variants is that rarer variants are more likely to have larger effects. The frequency-based weighting scheme proposed in WSS is arbitrary and may have reduced power when the weighting is far away from the true relationship. To see how weighting can affect the power and how the upper bound of power can be achieved by optimal weighting schemes, we can compare the weighted sum approach with the true model. The weighted sum model is as follows:

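$$\mathrm{logit}(p_i) = \beta_a \sum_{j=1}^{k} w_j X_{ij} + \sum_{j=1}^{c}\gamma_j Z_{ij} = \sum_{j=1}^{k} \beta_a w_j X_{ij} + \sum_{j=1}^{c}\gamma_j Z_{ij} \qquad (3)$$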

Assuming the true model is (2), comparing (3) with (2) shows that when wj = βj / βa, (3) reduces to the true model (2). This indicates that when wj for the jth variant is assigned proportional to its true genetic effect, the aggregation can achieve the optimal power (Lin and Tang, 2011). Conversely, improper weighting schemes that dramatically deviate from this relationship are detrimental to the power. Ideally a zero weight should be assigned to non-associated variants; accidentally up-weighting instead of down-weighting non-causal variants amplifies noise and reduces power. Theoretically no uniformly most powerful test exists for this multi-dimensional problem (Cox and Hinkley, 1977) and for a fixed weighting scheme the power depends on the alternative hypothesis (i.e. the true genetic model). Empirical evaluation of rare variant analysis methods in various scenarios is consistent with the theory (Ladouceur et al., 2012). Although generally it is impossible to know the true model a priori, the ability to utilize prior knowledge to assign weights that are close to the true genetic model is the key to achieving increased power of aggregation analysis. It should be noted that it is not the absolute value of the weights but rather the ratio of the weights for rare variants that determines the power, since the model is unchanged if the weights are multiplied by a non-zero constant and βa is divided by the same constant. Although optimal weighting is not possible in reality, it is helpful to calculate the relative ratio of weights assigned to rare variants when designing weighting schemes. A weighting scheme that generates extremely large ratios is questionable and can lead to a great decrease in power.
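
To make the point about weight ratios concrete, the short Python sketch below rescales a set of weights so that the smallest equals 1 and reports the largest relative weight. The allele frequencies and the two weighting schemes are hypothetical examples chosen for illustration; `relative_weights` is not a function from any existing package.

```python
from math import sqrt

def relative_weights(weights):
    """Rescale weights so that the smallest positive weight equals 1."""
    base = min(w for w in weights if w > 0)
    return [w / base for w in weights]

mafs = [0.0005, 0.001, 0.005, 0.01]                   # hypothetical allele frequencies
freq_weights = [1 / sqrt(p * (1 - p)) for p in mafs]  # 1/sqrt(p(1-p)) weighting
inv_freq_weights = [1 / p for p in mafs]              # a more aggressive scheme, for contrast

ratio_freq = max(relative_weights(freq_weights))
ratio_inv = max(relative_weights(inv_freq_weights))
print(round(ratio_freq, 2), round(ratio_inv, 2))

# Multiplying all weights by a constant leaves the relative weights unchanged,
# which is why only the ratios matter for power.
scaled = relative_weights([3 * w for w in freq_weights])
```

In this example the rarest variant receives roughly 4.5 times the weight of the most common one under the first scheme and 20 times under the second, which shows how quickly the choice of scheme changes the relative contribution of each variant.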

Madsen and Browning used a frequency-related weighting scheme. This weighting scheme will achieve increased power when $w_j = 1/\sqrt{p_j(1-p_j)} \propto \log(\mathrm{OR}_j)$ for the jth variant. Although rarer variants are more likely to be functional due to strong purifying selection, it may be difficult to justify this relationship between ORs and frequencies. The variable threshold (VT) method (Price et al., 2010) proposed to explicitly incorporate functional prediction scores from PolyPhen-2 (Adzhubei et al., 2010; Ramensky et al., 2002) as weights for individual variants in the aggregation testing. Assuming that functional prediction scores reflect the genetic effect of individual variants on the trait under study, this approach has the potential to increase power compared to collapsing or equal weighting schemes. Some tools specifically generate predictive functional scores for nonsynonymous variants (Bromberg and Rost, 2007; Ferrer-Costa et al., 2005) while others are more general tools that can assess the potential of a variant to be disease causing or evaluate sequence conservation across species (Cooper et al., 2005; Schwarz et al., 2010; Siepel et al., 2005). Since these scores are from external sources and not dependent upon phenotype and genotype data, asymptotic results hold for large sample sizes. One of the challenges is that functional prediction scores from different bioinformatics tools are often not consistent and it is unclear how to integrate the inconsistent predictions into the analysis. Additionally, even if bioinformatics could predict with high accuracy that a variant is causal for one phenotype, this does not guarantee that it is causal for the trait under study.

2. Mixed-effects models

For case/control studies, burden tests look for an enrichment of rare variants in cases compared to controls for risk alleles, or an excess of protective alleles in controls compared to cases. Burden tests will achieve the greatest power when all causal variants have the same direction of genetic effects. When a portion of causal variants has effects in opposite directions, i.e. protective and detrimental or increasing and decreasing quantitative trait values, aggregating variants will weaken the overall association signal, resulting in reduced power to detect an association. In an extreme scenario, when half of the variants decrease disease risk and the other half increase disease risk, the association signal can be completely cancelled out. Theoretically, this can be solved by assigning negative weights to protective variants. However, it is impossible in reality to decide which set of variants has risk effects and which set is protective. New methods have been developed to deal with this situation.

One of the early developments to tackle this problem for case/control data was the C-alpha test (Neale et al., 2011), which compares the observed variance of allele counts to the expected variance under the null hypothesis of no association for dichotomous traits. Let the number of observed rare alleles be nj for the jth variant. When no variants are associated with the phenotype, the rare allele count in cases at the jth variant site follows a binomial distribution with parameters (nj, pj), where pj=p0 for all j=1,…, k and p0 is the expected proportion of the allele count in cases (e.g. p0=0.5 when the numbers of cases and controls are equal). When causal variants have different effect sizes and directions, not all pj's are equal to p0 and the data are a mixture of binomial distributions. Since any mixture of binomial distributions creates over-dispersion, testing for increased variance over its expected value under the null is the foundation of the C-alpha test.
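
The over-dispersion idea can be illustrated with a small Python sketch. It assumes the unstandardized C-alpha statistic takes the form T = Σj [(yj − nj p0)² − nj p0 (1 − p0)], where yj is the rare allele count in cases at the jth variant; the standardization and p value computation of the published test are omitted, and the counts below are hypothetical.

```python
def c_alpha_statistic(case_counts, total_counts, p0):
    """Observed minus expected binomial variance, summed over variants."""
    return sum(
        (y - n * p0) ** 2 - n * p0 * (1 - p0)
        for y, n in zip(case_counts, total_counts)
    )

total_counts = [4, 4, 4, 4]      # rare allele copies observed at each variant
p0 = 0.5                         # equal numbers of cases and controls

# Two risk variants (all copies in cases) and two protective variants (all in controls)
mixed_effects = c_alpha_statistic([4, 0, 4, 0], total_counts, p0)

# Copies split evenly between cases and controls at every variant
no_effect = c_alpha_statistic([2, 2, 2, 2], total_counts, p0)

print(mixed_effects, no_effect)
```

In the first scenario cases carry exactly half of all rare alleles, so a burden comparison sees no excess, yet the statistic is large and positive because the per-variant counts are far more dispersed than a single binomial distribution would predict.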

To also address the problem of variants having effects in opposite directions, the Sequence Kernel Association Test (SKAT) (Wu et al., 2011) was developed in a more general framework, and it has been shown that C-alpha is a special case of SKAT (Wu et al., 2011). Compared to the current implementation of the C-alpha test, SKAT has the flexibility of accommodating more features, such as including covariates (e.g. principal components for adjusting for population stratification), accounting for LD among variants, allowing for the analysis of both qualitative and quantitative traits, weighting of variants based on frequencies or functional prediction scores, and handling complex genetic models (e.g. epistasis effects) (Wu et al., 2011). SKAT is a variance-component score test in a multiple regression model (1) or (2). SKAT assumes that each βj follows an arbitrary distribution with a mean of zero and a variance of wjτ, where τ is a variance component and wj is the weight for the jth variant. Under this setup, the original null hypothesis is equivalent to H0: τ = 0. A variance component score test in a mixed-effects model can be used to test this hypothesis. The SKAT score statistic is defined as $Q = (y - \tilde{y})' K (y - \tilde{y})$, where $K = XWX'$, $\tilde{y}$ is the predicted mean of the phenotype under the null hypothesis (i.e. obtained by fitting (1) or (2) without the βj's, as described before), and W = diag(w1,…,wk), with wj being the weight for the jth variant. Since it is a score test, it can be efficiently computed to obtain asymptotic p values (see Wu et al., 2011 for details).

An attractive feature of SKAT is that flexible genetic models and prior knowledge can be incorporated in the K matrix. K is an n by n matrix, with the (i, i')-th entry equal to K(Xi, Xi'), representing the genetic similarity of the ith and the i'th individuals. K(.,.) is called a kernel function and different kernels can be constructed depending on hypotheses about genetic models of variants for specific studies and genes. The simplest kernel is the weighted linear kernel, i.e. $K(X_i, X_{i'}) = \sum_{j=1}^{k} w_j X_{ij} X_{i'j}$. As discussed for WSS, a good weighting scheme can increase power while one that does not reflect the true underlying genetic model can reduce power. If equal weights are used and no covariates are included for case/control data, SKAT is equivalent to C-alpha (Wu et al., 2011). In the original paper, the authors proposed to use a beta distribution to specify weights due to its flexibility in accommodating a wide range of weights. Specifically, $w_j = \mathrm{Beta}(\mathrm{MAF}_j; a_1, a_2)$, where a1 and a2 are pre-specified parameters of a beta distribution and MAFj is the rare allele frequency estimated across both cases and controls. The authors suggested a1=1 and a2=25, which increases the weights for rare variants and decreases the weights for common variants. Other values can also be used to reflect different prior knowledge; for example, a1=a2=1 corresponds to assigning equal weight to all variants and a1=a2=0.5 specifies $w_j \propto [\mathrm{MAF}_j(1-\mathrm{MAF}_j)]^{-1/2}$, which puts strong weights on rare variants. There may not be a simple relationship between the allele frequency and the genetic effect, and other prior information can be incorporated to guide the weighting, such as the functional prediction scores discussed in the VT test. Other complex kernels can also be constructed to accommodate more complex models such as epistasis. Interested readers are referred to Wu et al. (2011) for details.
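
The behavior of the beta-density weights can be checked numerically. The following Python sketch evaluates the beta density with the standard library only; the function name `beta_density` and the example allele frequencies are assumptions made for illustration, and the sketch follows the convention stated above that the weight equals the density evaluated at the allele frequency.

```python
from math import gamma, pi, sqrt

def beta_density(x, a1, a2):
    """Density of the Beta(a1, a2) distribution evaluated at x in (0, 1)."""
    norm = gamma(a1 + a2) / (gamma(a1) * gamma(a2))
    return norm * x ** (a1 - 1) * (1 - x) ** (a2 - 1)

mafs = [0.001, 0.05, 0.3]                              # hypothetical allele frequencies

default_w = [beta_density(f, 1, 25) for f in mafs]     # a1=1, a2=25: favors rare variants
flat_w = [beta_density(f, 1, 1) for f in mafs]         # a1=a2=1: equal weights
half_w = [beta_density(f, 0.5, 0.5) for f in mafs]     # a1=a2=0.5

# With a1=a2=0.5 the density equals 1 / (pi * sqrt(x (1 - x)))
closed_form = [1 / (pi * sqrt(f * (1 - f))) for f in mafs]

for f, w in zip(mafs, default_w):
    print(f, round(w, 4))
```

With a1=1 and a2=25 the weight falls from about 24 at a frequency of 0.001 to about 7 at 0.05 and is close to zero at 0.3, so relatively common variants contribute very little to the statistic.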

SKAT was proposed as a variance component score test in a mixed effects model and has a connection with the score test in a fixed effect regression model. For the linear kernel, i.e. $K(X_i, X_{i'}) = \sum_{j=1}^{k} w_j X_{ij} X_{i'j}$, it can be shown that $Q = \sum_{j=1}^{k} w_j S_j^2$, where $S_j = X_j'(y - \tilde{y})$ is the individual score statistic of the jth variant (Wu et al., 2011). Similarly, the WSS score statistic is $S = \sum_{i=1}^{n} (y_i - \tilde{y}_i) \sum_{j=1}^{k} w_j X_{ij} = \sum_{j=1}^{k} w_j X_j'(y - \tilde{y}) = \sum_{j=1}^{k} w_j S_j$. SKAT also has connections with other tests; see (Pan, 2009) for more discussion. For single marker tests, i.e. k=1, SKAT is equivalent to single-marker score tests. When k>1, their behaviors become different, and the relative power depends on genetic models and how weights are assigned. For example, when two rare variants have opposite directions of effect, it is most likely that the score statistic of the risk allele is positive while the score statistic of the other variant is negative. The association signal is cancelled out when positive weights are assigned to both variants, resulting in reduced power as shown in Weighting Scheme. In contrast, the squared score statistics in SKAT eliminate the direction issue and all variants contribute positive scores to the test statistic. Therefore SKAT can gain more power compared to burden tests in the presence of opposite directions of genetic effects. However, there is a tradeoff between power and robustness. When a large proportion of rare variants have the same direction of effects, SKAT will be less powerful than burden methods, and it is generally not known when to apply SKAT or burden tests due to the lack of prior knowledge of genetic models for complex traits. It is plausible, especially for dichotomous traits, that genetic variants in the same gene are likely to affect the gene function in a similar fashion, and burden tests may be more powerful if the analysis unit is a gene.
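
The contrast between the two statistics can be seen in a toy calculation. The Python sketch below assumes no covariates, so that the null-model prediction is simply the phenotype mean, and uses equal weights; the function names and data are hypothetical, and no p values are computed.

```python
def score_statistics(genotypes, phenotype):
    """Per-variant scores S_j = sum_i X_ij (y_i - y_bar), with no covariates."""
    y_bar = sum(phenotype) / len(phenotype)
    resid = [y - y_bar for y in phenotype]
    n_sites = len(genotypes[0])
    return [sum(g[j] * r for g, r in zip(genotypes, resid)) for j in range(n_sites)]

def skat_q(scores, weights):
    """Linear-kernel SKAT statistic: weighted sum of squared scores."""
    return sum(w * s ** 2 for w, s in zip(weights, scores))

def burden_score(scores, weights):
    """Weighted-sum (burden) score: weighted sum of signed scores."""
    return sum(w * s for w, s in zip(weights, scores))

# Variant 0 is carried by two cases (risk), variant 1 by two controls (protective)
genotypes = [[1, 0], [1, 0], [0, 0], [0, 0],    # cases
             [0, 1], [0, 1], [0, 0], [0, 0]]    # controls
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
weights = [1.0, 1.0]

scores = score_statistics(genotypes, phenotype)
q = skat_q(scores, weights)
s = burden_score(scores, weights)
print(scores, q, s)
```

The two variant scores are +1 and −1, so the burden score is exactly zero while the SKAT statistic equals 2: the signed scores cancel in the sum but not in the sum of squares.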

3. Data-driven approaches

To carry out aggregation analyses, several criteria need to be determined to increase statistical power. These criteria include for example the frequency cutoff of rare variants and the weight for each variant. Usually these criteria are either pre-specified or from external sources. However the choices are usually arbitrary and may not be proper for some traits or genes. An alternative is to let the data drive these choices – that is to select appropriate criteria based on the phenotype and genotype data under study. Since this class of methods uses the same data for both feature selection and hypothesis testing, permutation procedures are usually needed to obtain empirical p values. We describe a few of these methods and discuss their performance.

Variable threshold method

For a complex trait it is likely that causal variants span a wide spectrum of allele frequencies and that the allelic architecture varies widely from trait to trait. Although variants with allele frequencies less than 0.01 are commonly used in practice for aggregation analyses, there is no clear biological justification for this threshold. An improper threshold may dramatically reduce power by excluding causal variants and including non-causal variants. To avoid arbitrary specification of frequency cutoffs, the variable threshold method (Price et al., 2010) was proposed to automatically select the “optimal” frequency threshold for rare variants and include the variants with frequencies below this threshold in aggregation analyses. Specifically, for a given weighting scheme (e.g. functional prediction, inverse of allele frequency, or no weighting), a statistic ST is calculated for each threshold T in the range between the lower bound TL and the upper bound TU, and the maximum statistic Smax and the corresponding threshold Tmax are recorded. Since Smax depends on phenotype and genotype, the distribution of Smax under the null is generally unknown and permutations are required to assess the significance of Smax. Specifically, the same procedure is carried out to obtain Smax_perm for each permuted dataset and Smax is compared to the distribution of Smax_perm to obtain empirical p values. Although a different statistic was used in the original publication of the VT method, the standard score statistic in a regression model can be used to carry out the same VT testing procedure owing to its desirable statistical properties (Lin and Tang, 2011).

The VT method essentially performs many tests to find the optimal cutoff and uses permutation to correct for multiple testing when obtaining empirical p values. When Tmax is far from a user-specified cutoff, VT is expected to achieve increased power by finding the proper threshold, and if Tmax is close to the pre-specified threshold, VT suffers a loss of power due to correcting for multiple testing. To reduce the search space, it may be desirable to set TL and TU to confine the search to a smaller range. For example, it may not be necessary or desirable to include variants with frequencies >0.1 in the aggregation association test. VT also explicitly incorporates functional prediction scores (e.g. PolyPhen-2 scores (Adzhubei et al., 2010; Ramensky et al., 2002)) in weighting variants, and proper weighting can in turn increase the effectiveness of selecting the “optimal” threshold.
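
The search-and-permute logic of the VT procedure can be sketched as follows. This is a simplified illustration and not the published implementation: it uses an unweighted burden count, a simple z-like statistic, the absolute value of that statistic (a two-sided choice made here as an assumption), and candidate thresholds taken from the observed allele frequencies. All names and data are hypothetical.

```python
import random
from math import sqrt

def burden_z(genotypes, phenotype, mafs, threshold):
    """Burden statistic using only variants with frequency <= threshold."""
    y_bar = sum(phenotype) / len(phenotype)
    keep = [j for j, f in enumerate(mafs) if f <= threshold]
    burden = [sum(g[j] for j in keep) for g in genotypes]
    denom = sqrt(sum(b * b for b in burden))
    if denom == 0:
        return 0.0
    return sum(b * (y - y_bar) for b, y in zip(burden, phenotype)) / denom

def vt_test(genotypes, phenotype, t_low=0.0, t_high=0.1, n_perm=1000, seed=0):
    """Return (S_max, T_max, empirical p value) over thresholds in (t_low, t_high]."""
    n = len(genotypes)
    mafs = [sum(g[j] for g in genotypes) / (2 * n) for j in range(len(genotypes[0]))]
    thresholds = sorted({f for f in mafs if t_low < f <= t_high})
    if not thresholds:
        raise ValueError("no variant frequencies fall inside the search range")

    def s_max(pheno):
        return max((abs(burden_z(genotypes, pheno, mafs, t)), t) for t in thresholds)

    observed, t_max = s_max(phenotype)
    rng = random.Random(seed)
    perm = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)                 # permute phenotypes, keep genotypes fixed
        if s_max(perm)[0] >= observed:    # the same maximization is repeated per permutation
            hits += 1
    return observed, t_max, (hits + 1) / (n_perm + 1)

# Toy data: site 0 is rarer and carried only by cases; site 1 is more common and balanced
genotypes = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0],
             [0, 1], [0, 1], [0, 0], [0, 0], [0, 0]]
phenotype = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
s_obs, t_best, p_emp = vt_test(genotypes, phenotype, t_high=0.25, n_perm=200)
print(round(s_obs, 4), t_best, p_emp)
```

Because the maximization over thresholds is repeated within each permutation, the empirical p value already accounts for the number of thresholds examined.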

Adaptive weighting methods

Although allele frequency and functional prediction scores are used for weighting, they may not reflect true genetic models. Adaptive weighting methods have been proposed to utilize phenotype and genotype data to guide the weighting of variants. The estimated regression coefficients (EREC) method (Lin and Tang, 2011) first estimates the regression coefficients in the general model (1) or (2) and then incorporates these estimates of individual coefficients in the weighting scheme. As described before, the optimal weights are the true βj's, which are unknown. It is tempting to use the maximum likelihood estimates $\hat{\beta}_j$ to guide the assignment of the weights of individual variants. In the EREC approach, it was proposed to use $w_j = \hat{\beta}_j + \delta$ as the weight for the jth variant, where δ is a constant. The value of δ is arbitrary and the EREC method recommends that it be set to 1 for dichotomous and 2 for quantitative traits when the sample size is <2000. The behavior of the EREC method depends on the choice of the constant δ. When δ = 0, it is equivalent to assigning $\hat{\beta}_j$ as the weight of the jth variant. Since the $\hat{\beta}_j$ are maximum likelihood estimates, plugging them in as the weights in model (3) results in the maximum likelihood of the original model (2). Therefore, testing based on such a weighting scheme is asymptotically equivalent to the multi-marker LRT or score test with k d.f. At the other extreme, when $\delta \gg \hat{\beta}_j$ for all j, all weights are close to a constant and this is equivalent to the variant counting approach. Therefore EREC can be viewed as a method between multi-marker tests and the variant counting approach. Further modification can be made such that δ is no longer a constant but varies across variants when desired, reflecting the prior belief about the genetic effects. The EREC method is expected to achieve both robustness, due to its multi-marker test feature, and increased power, owing to its collapsing functionality. However, it is not clear how the selection of δ values affects the statistical power in various scenarios.
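
The two limiting behaviors of the EREC weights can be illustrated with a few lines of Python. The coefficient estimates below are hypothetical numbers chosen for illustration, not output from a fitted model, and `erec_weights` is an illustrative helper rather than a function from the EREC software.

```python
def erec_weights(beta_hat, delta):
    """EREC-style weights: estimated coefficient plus a constant."""
    return [b + delta for b in beta_hat]

beta_hat = [0.8, -0.3, 0.1]                      # hypothetical coefficient estimates

w_zero = erec_weights(beta_hat, 0.0)             # weights equal the estimates themselves
w_one = erec_weights(beta_hat, 1.0)              # value suggested for dichotomous traits
w_large = erec_weights(beta_hat, 1000.0)         # delta much larger than the estimates

spread_large = max(w_large) / min(w_large)       # close to 1: nearly equal weights
print(w_zero, w_one, round(spread_large, 4))
```

With δ = 0 the weights keep the sign and relative size of each estimate, whereas with a very large δ the ratio between the largest and smallest weight approaches 1, which corresponds to simple variant counting.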

Other data-driven methods use similar approaches for dynamic weight assignment. For example, the kernel-based adaptive cluster method (Liu and Leal, 2010a) and the data-adaptive sum test (Han and Pan, 2010) assign weights adaptively to variants based upon variant counts from the data. The RARECOVER method (Bhatia et al., 2010) uses a variable selection approach to select the “optimal” set of variants that maximizes the burden test statistic (i.e. finding the optimal assignment of zero weights to a subset of variants). A general class of adaptive methods has also been proposed (Pan and Shen, 2011). Statistical significance for these tests is usually evaluated by permutation. We omit the details of these methods; interested readers can refer to the original papers.

4. Hybrid methods

Both burden tests and non-burden methods (e.g. multi-marker tests and SKAT) have increased power for certain genetic models and are under-powered in other situations, and all have reduced power when non-causal variants are included in the analysis. Hybrid methods have been proposed to combine methods that are powerful for variants with the same effects and methods that are robust when either non-causal variants or variants with opposite effects are present. We will describe in this section the combined multivariate and collapsing (CMC) method (Li and Leal, 2008) and the SKAT-O approach (Lee et al., 2012a; Lee et al., 2012b).

CMC

The CMC method explicitly combines burden tests and multivariate tests to achieve both increased power and robustness. Intuitively, it collapses subsets of the k rare variants and then jointly tests the collapsed subsets in a multivariate test. Based on the regression framework (2), an example of the CMC method is as follows:

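$$\mathrm{logit}(p_i) = \beta_1^{CMC}\,\delta_1(X_{ij}, j \in \Delta_1) + \beta_2^{CMC}\,\delta_2(X_{ij}, j \in \Delta_2) + \sum_{j \notin \Delta_1 \cup \Delta_2} \beta_j X_{ij} + \sum_{j=1}^{c}\gamma_j Z_{ij}$$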

In the above model, Δ1 and Δ2 are two sets of rare variants that are to be aggregated, and $\beta_1^{CMC}$ and $\beta_2^{CMC}$ are the genetic effects of the collapsed “super” variants in the two subsets. The CMC method was proposed with the indicator function for collapsing, but it can also be implemented using other collapsing approaches such as the weighted sum score statistic. The null hypothesis in the CMC method becomes H0: $\beta_1^{CMC} = \beta_2^{CMC} = \beta_j = 0$ for all $j \notin \Delta_1 \cup \Delta_2$. A multiple d.f. LRT or score test can be carried out to jointly test the hypothesis. Here the dimension reduction is achieved for the rare variants in Δ1 and Δ2 for increased power while the multivariate test of the subsets achieves robustness. It has been shown that common non-causal variants have greater detrimental effects on the power of burden tests than on that of multivariate tests (Li and Leal, 2008), and if variants of different frequencies are to be analyzed in a gene, collapsing only rare variants using CMC is expected to be robust. Other criteria can also be used to collapse subsets of rare variants, e.g. rare functional variants affecting splicing and stop codons may be collapsed in one subset while less dramatic changes like missense variants are collapsed in another. The CMC method is a flexible framework in that both complete collapsing and multi-marker tests without collapsing are special cases of CMC at the two extremes. However, the flexibility often requires appropriate user-defined subsets, which may not be obvious in practice.
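
The construction of the CMC covariates can be sketched as follows. This Python example only builds the design columns (one indicator per collapsed subset plus the uncollapsed variants); the subsequent multivariate LRT or score test is not shown. The function name `cmc_design`, the subsets, and the genotypes are hypothetical.

```python
def cmc_design(genotypes, subsets):
    """Collapse each listed subset with an indicator; keep the other sites as-is."""
    collapsed = {j for subset in subsets for j in subset}
    n_sites = len(genotypes[0])
    rows = []
    for g in genotypes:
        # One column per collapsed subset: 1 if any rare allele is carried in that subset
        row = [1 if any(g[j] > 0 for j in subset) else 0 for subset in subsets]
        # Remaining variants enter the model individually
        row += [g[j] for j in range(n_sites) if j not in collapsed]
        rows.append(row)
    return rows

# Sites 0-1 form subset 1, site 2 forms subset 2, site 3 is analyzed on its own
genotypes = [[1, 0, 0, 2],
             [0, 0, 1, 0],
             [0, 0, 0, 1]]
design = cmc_design(genotypes, subsets=[[0, 1], [2]])
print(design)
```

Here four variant columns are reduced to three covariates, so the joint test has 3 d.f. instead of 4; with larger subsets the reduction in dimensionality is correspondingly greater.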

SKAT-O

SKAT was developed to tackle the problem that both risk and protective alleles may be present in a gene while prior knowledge about the directionality of causal variants is rarely available. Lee et al. developed the SKAT-Optimal (SKAT-O) test, which combines a burden test and SKAT in a single framework (Lee et al., 2012a; Lee et al., 2012b). Recall that in SKAT each βj is assumed to follow an arbitrary distribution with a mean of zero and a variance of wjτ (see Mixed-effects models). All βj's are assumed to be independent in SKAT. The new class of tests is formulated as a generalized family of SKAT through a family of kernels that incorporate a correlation structure among variant effects. Specifically, Lee et al. used an exchangeable correlation structure in which the correlation matrix of the βj's is Rρ = (1− ρ)I + ρ11', where I is the k by k identity matrix with 1 on the diagonal and zero elsewhere, and 11' is a k by k matrix with all entries being 1. This matrix is a compound symmetry correlation structure with 1 on the diagonal and ρ for all off-diagonal entries. The statistic for dichotomous traits in model (2) is

$$Q_\rho = (y - \tilde{y})' K_\rho (y - \tilde{y})$$

This is similar to the original SKAT statistic with the exception that the kernel is replaced by $K_\rho = XWR_\rho WX'$. By separating the Rρ matrix into the sum of two parts, it can be seen that Qρ is a linear combination of a burden test and SKAT, i.e. Qρ = (1− ρ)QSKAT + ρQburden. The statistic is calculated as $Q_{optimal} = \min_{0<\rho<1} p_\rho$, where pρ is the p value calculated for a specific ρ. SKAT-O uses a grid search approach to find the ρ value that minimizes pρ: set a grid 0 < ρ1 <…<ρn < 1, calculate pρ1,…, pρn and obtain Qoptimal = min{pρ1,…, pρn}. For large sample sizes, the p value of Qoptimal is derived analytically to evaluate the significance. Simulation studies suggest that SKAT-O outperforms SKAT and burden tests in a wide range of scenarios (Lee et al., 2012a; Lee et al., 2012b). The correlation ρ determines the relative contribution of each test to the SKAT-O statistic. As the decomposition shows, when ρ = 0 the statistic reduces to SKAT and when ρ = 1 it is equivalent to a burden test, while for 0 < ρ < 1 it unifies the two kinds of tests. Since ρ is estimated from the data, SKAT-O is also a data-driven approach. Using a similar argument, EREC can also be viewed as a data-driven hybrid method. Like other data-driven approaches, SKAT-O involves multiple testing (i.e. searching for the optimal ρ), and because of the correction for this search it will be less powerful than SKAT when the true ρ is close to zero and less powerful than burden tests when the true ρ is close to one.
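
The decomposition Qρ = (1− ρ)QSKAT + ρQburden can be verified numerically. The Python sketch below works directly with weighted score statistics sj = wjSj (hypothetical values), evaluates the quadratic form with the compound symmetry matrix Rρ, and compares it with the linear combination. The p value calculation and the minimization over ρ are not shown, and the grid includes the endpoints 0 and 1 only to display the two limiting cases.

```python
def q_rho_quadratic(weighted_scores, rho):
    """Quadratic form s' R_rho s with R_rho = (1 - rho) I + rho 11'."""
    k = len(weighted_scores)
    total = 0.0
    for j in range(k):
        for m in range(k):
            r_jm = 1.0 if j == m else rho     # 1 on the diagonal, rho elsewhere
            total += weighted_scores[j] * r_jm * weighted_scores[m]
    return total

def q_rho_combination(weighted_scores, rho):
    """Same statistic written as (1 - rho) Q_SKAT + rho Q_burden."""
    q_skat = sum(s * s for s in weighted_scores)     # sum of squared scores
    q_burden = sum(weighted_scores) ** 2             # square of the summed scores
    return (1 - rho) * q_skat + rho * q_burden

weighted_scores = [1.0, -1.0, 0.5]                   # hypothetical w_j * S_j values
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
pairs = [(q_rho_quadratic(weighted_scores, r), q_rho_combination(weighted_scores, r))
         for r in grid]
for r, (a, b) in zip(grid, pairs):
    print(r, a, b)
```

At ρ = 0 both forms equal the sum of squared scores (2.25, the SKAT component) and at ρ = 1 both equal the squared sum (0.25, the burden component); intermediate values interpolate linearly between the two.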

Replication

To rule out the possibility of spurious associations due to confounding factors, e.g. population stratification and sequencing batch effects, it is critical to replicate findings in independent samples. As discussed above, there are no uniformly most powerful tests for this type of analysis strategy and to potentially reduce false negatives it is a viable approach to carry out different tests in the initial study. If applying multiple tests is not corrected for in the initial study, there will be inflation of type I error rates. Therefore replication studies are extremely important to confirm association findings.

Two replication approaches can be employed. A simple strategy is to genotype the variants discovered in the initial study in an independent panel and then perform appropriate rare variant aggregation tests on these genotyped variants (Liu and Leal, 2010b). This “variant-based” approach is cost-effective and can quickly generate genotype data on a large replication sample. However, if causal rare variants were not uncovered in the initial study, these causal variants will be missed in the replication panel and a loss of power is expected. An alternative strategy is to sequence the genes or regions of interest in a replication sample and carry out association tests on the variants uncovered in the new sample (Liu and Leal, 2010b). This “sequence-based” approach is likely to uncover additional causal rare variants but is more expensive and time-consuming. Which replication strategy to choose depends on specific study goals and designs. For example, if the interest is to identify a more complete allelic spectrum for clinical applications, the sequence-based strategy is a better approach than the variant-based approach. Extensive simulations have demonstrated that sequence-based replication is usually more powerful than variant-based replication; however, for most situations the difference is not dramatic (Liu and Leal, 2010b). Given the low cost of genotyping compared to sequencing, the variant-based strategy is attractive for large-scale replication studies of complex traits.

Estimates of Genetic Effects

After a genetic association is identified, it is desirable to estimate genetic effects for the identified association and quantify the explained genetic variance, in addition to the p values that are usually reported. In rare variant burden tests, since multiple variants are aggregated as a single “super” variant, the slope parameter βa in the regression model measures the change in mean trait value per unit of change in the burden score. For example, it has been shown that for quantitative traits βa is a weighted average of the individual β's (Pan, 2009). For the same dataset, different aggregation strategies (indicator function, differential weights in weighted sum approaches) will lead to different estimates of the genetic effects. Therefore, these estimates do not have a straightforward interpretation. On the other hand, the phenotypic variance explained by the burden score has a natural interpretation. In fact, it has been shown that the locus genetic variance explained by the burden score will always be lower than the true genetic variance, unless optimal weights are assigned (Liu and Leal, 2012a). This suggests that some proportion of the heritability may still be missing and not explained by the burden analysis, even if an association is established between the gene and the phenotype. It is still necessary to pinpoint causal variants to more precisely estimate the contribution of causal rare variants to complex trait etiology.

Sequencing analysis pipeline for rare variant association

Previous sections present various analysis methods for rare variant association studies. For most practical applications sequencing data will be used to identify genes with associated rare variants. In this section we describe a practical pipeline for analyzing sequencing data to identify rare variant associations with a primary focus on exome sequencing.

1. Variant calling

To identify rare variant associations it is critical to accurately call variants from NGS data. NGS reads are usually short with moderate error rates. For example, typical reads from NGS platforms (e.g. Illumina) are around 100 bp with error rates of ~0.5–1% per base. It is important to be aware that these imperfect sequences may lead to incorrect variant calls. The first step is to align short reads to the human reference genome. A variety of software is available (Nielsen et al., 2011) and a widely used tool is BWA (Li and Durbin, 2010). After the initial alignment, reads around short insertions or deletions need to be realigned, duplicate reads need to be removed and raw base quality scores need to be recalibrated (DePristo et al., 2011; Nielsen et al., 2011). These pre-processing steps are meant to generate accurate alignments of bases with calibrated quality scores. After these steps, it is helpful to generate summary statistics about the alignment, such as the fraction of reads mapped to the target regions, the distribution of depths on the target regions, and base and mapping quality scores. Contamination may also be checked based on the alignment (Nielsen et al., 2011). Outlier samples may be identified and removed from downstream analyses. After the sample cleanup, the next step is to identify variant sites and individual genotypes from the aligned bases across study samples. The standard variant calling tools, e.g. GATK (DePristo et al., 2011) and samtools (Li et al., 2009), are likelihood-based and generate quality scores for variant calling. The current calling algorithms recommend joint calling of multiple samples to increase the accuracy for common variants and decrease the number of false positive rare variant calls (DePristo et al., 2011; Li, 2011). If the samples are related, PolyMutt (Li et al., 2012a) or TrioCaller (Chen et al., 2013) can be used to perform family-aware variant calling. After this step, an initial Variant Call Format (VCF) (Danecek et al., 2011) file is generated to store all variant sites and individual genotypes, with quality scores indicating the confidence of the calls.

2. Annotation

This step annotates all identified variant sites with functional features. Several software packages (Liu et al., 2011; Wang et al., 2010) are available and we will describe ANNOVAR (Wang et al., 2010), a tool that integrates multiple databases and can annotate variants with a variety of functional information. For gene-centric features, it can annotate variants as synonymous, nonsynonymous, stop gain/loss, splicing, 5'UTR and 3'UTR, and intronic. It generates the corresponding positions and amino-acid changes for coding variants. These annotations are based on transcripts and some variants may have multiple annotations for different transcripts in the same gene or transcripts from overlapping genes. For nonsynonymous variants, it also provides functionality prediction scores from a variety of prediction algorithms, including PolyPhen-2 (Adzhubei et al., 2010; Ramensky et al., 2002), SIFT (Ng and Henikoff, 2003), LRT (Chun and Fay, 2009) and MutationTaster (Schwarz et al., 2010). For all variants ANNOVAR outputs sequence conservation scores such as GERP++ (Cooper et al., 2005) and PhastCons (Siepel et al., 2005). Other information includes dbSNP IDs and allele frequencies in the 1000 Genomes Project (Abecasis et al., 2012; Consortium, 2010) and the NHLBI Exome Sequencing Project from the Exome Variant Server (EVS) (Emond et al., 2012; Tennessen et al., 2012). All these categories of information are useful for selecting promising variants for the analysis and for constructing sensible weighting schemes in rare variant aggregation analysis.

3. Quality assessment of variant calling

It is common practice to filter out false positive variant calls using machine learning approaches (Abecasis et al., 2012; Consortium, 2010; DePristo et al., 2011) with features that are predictive of read mis-alignment (e.g. mapping quality, sequence repeats, mappability). After filtering “bad” calls, it is helpful to check the Ti/Tv ratio, i.e. the ratio of the number of transitions (A<->G and C<->T) to the number of transversions (all other nucleotide changes) relative to the reference alleles. There are twice as many possible transversions as transitions, so if false variant calls are random with respect to Ti or Tv changes, we expect a Ti/Tv ratio of ~0.5. Because transitions occur more readily than transversions, true variants are expected to show a Ti/Tv ratio higher than 0.5. At the genome level the observed Ti/Tv ratio is ~2.2–2.3, and for coding variants the Ti/Tv ratio is slightly over 3 (Abecasis et al., 2012; Consortium, 2010; Tennessen et al., 2012). A significantly reduced Ti/Tv ratio indicates an excess of false positive variant calls. The Ti/Tv ratio analysis is particularly important for checking the quality of novel variant sites that are not present in public databases (Ng et al., 2009).
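The Ti/Tv calculation can be sketched in a few lines. The example first enumerates all 12 possible single-nucleotide changes to show where the value of 0.5 for random calls comes from, and then computes the ratio for a small made-up call set. The function name and the toy data are assumptions for illustration.

```python
from itertools import permutations

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(changes):
    """Ratio of transitions to transversions among (ref, alt) SNV pairs."""
    ti = sum(1 for change in changes if change in TRANSITIONS)
    tv = len(changes) - ti
    if tv == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return ti / tv

# All 12 possible single-nucleotide changes: 4 transitions and 8 transversions,
# so calls that are random with respect to change type give a ratio of 0.5.
all_changes = list(permutations("ACGT", 2))
random_expectation = ti_tv_ratio(all_changes)

# Toy call set (made-up data): 6 transitions and 2 transversions.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
         ("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
observed = ti_tv_ratio(calls)
print(random_expectation, observed)
```

In practice the ratio is most informative when computed separately for known and novel sites, since false positives concentrate among the novel calls.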

4. Aggregation analysis

After generating a clean set of genotype calls with functional features, the key step is to perform association analyses to identify associated rare variants. Although the focus is on aggregation analysis, it is always desirable to also perform single marker tests to identify relatively common variants with larger genetic effects. As we show in the discussion of various aggregation analyses, no single method is superior to the others in all scenarios. Performance depends largely on the true underlying genetic model, which is unknown. Here we provide some practical strategies that we hope are useful in genetic association studies. This is still an active research area, and practitioners are encouraged to apply appropriate approaches and adopt new advances in their studies.

For gene-based analyses, the first step is to determine which variants to include in the aggregation analysis. If only rare variants are to be included, a set of pre-specified allele frequency cutoffs may be used, e.g. MAF cutoffs ranging from 0.05 to 0.01, or the VT method may be used. When selecting a cutoff, it is worth noting that approaches based on frequency estimates from controls need to use permutation to calculate p values to avoid inflation of type I error rates. This is because, under the null hypothesis that rare variants are not associated with the trait, the expected frequency in controls under this selection criterion is lower than that in cases, which equals the allele frequency in the population. The next step is to determine which functional classes of variants to include. In practice, the first priority may be stop gain/loss, splicing and missense variants, with gene-based analyses performed on this category of variants. For dichotomous traits it is natural to apply burden tests to test for an enrichment of these “functional” variants in cases or controls, possibly weighting each variant based on prior knowledge such as functional prediction scores. Before applying weights to variants, it is desirable to check the ratios of the weights so that their range is compatible with complex traits (e.g. it is hardly believable that the OR of one variant is hundreds of times higher than that of another, and if non-causal variants are accidentally weighted more highly than others, the association signal will be dramatically diluted by noise). For quantitative traits, under either random or extreme sampling designs, rare variants may influence trait values in different directions, in which case SKAT-O can be applied; however, it is biologically plausible that a large proportion of causal variants have effects in the same direction, so burden tests should also be considered.

For complex diseases with strong indication of a specific genetic model, specific aggregation tests may be constructed accordingly. For example, if a recessive model is suggested, compound heterozygote modeling (e.g. collapsing heterozygotes whose rare alleles are on different haplotypes) can be used to test highly conserved functional variants, e.g. splice site and stop gain/loss variants.
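The variant selection and burden testing steps described above can be illustrated with a minimal collapsing test. Rare variants are selected with a MAF cutoff, each individual is reduced to a carrier indicator, and a p value is obtained by permuting case/control labels. This is a sketch in the spirit of collapsing and burden tests, not the implementation of any published package. The function names, the toy genotypes and the MAF values (assumed here to come from an external reference rather than from the study controls) are hypothetical.

```python
import random

def collapse(genotypes, mafs, maf_cutoff=0.01):
    """Per-individual carrier indicator: 1 if any variant below the cutoff is carried.

    genotypes[i][j] is the minor-allele count (0, 1 or 2) of individual i at variant j.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]

def carrier_difference(carriers, is_case):
    """Carrier proportion in cases minus carrier proportion in controls."""
    cases = [c for c, y in zip(carriers, is_case) if y]
    controls = [c for c, y in zip(carriers, is_case) if not y]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def permutation_p_value(carriers, is_case, n_perm=2000, seed=1):
    """Two-sided empirical p value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = abs(carrier_difference(carriers, is_case))
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if abs(carrier_difference(carriers, labels)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero
    return (hits + 1) / (n_perm + 1)

# Toy data (made up): two rare variants and one common variant in one gene.
mafs = [0.004, 0.008, 0.20]
case_rows = [[1, 0, 1]] * 6 + [[0, 1, 0]] * 4 + [[0, 0, 1]] * 10
control_rows = [[0, 1, 0]] * 1 + [[0, 0, 1]] * 9 + [[0, 0, 0]] * 10
genotypes = case_rows + control_rows
is_case = [True] * 20 + [False] * 20

carriers = collapse(genotypes, mafs)
p_value = permutation_p_value(carriers, is_case)
print(sum(carriers), p_value)
```

The common variant is excluded by the cutoff, so only the 11 rare-variant carriers contribute to the test. Weighted burden scores follow the same pattern, with the carrier indicator replaced by a weighted sum of allele counts.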

Aggregation tests can also be applied to pathways or gene-sets. It may not be desirable, however, to aggregate all rare variants, since the total number of rare variants may be so large that the association signal contributed by a small number of causal variants becomes undetectable. The power of various tests in this setting has not been investigated in realistic simulations. It is expected that only a few genes in a pathway or gene-set harbor causal variants, and the CMC method or SKAT-O may be used to guard against the noise from non-causal genes. Prior knowledge from other resources (e.g. gene-expression data) or data-driven approaches can be used to sub-select promising genes (e.g. genes expressed in appropriate tissues) to reduce the dimension prior to aggregation analyses.

After obtaining exome-wide p values, it is recommended to draw a QQ plot to check the behavior of each test method. Systematic deviations from the expectation indicate problems with the test or the data. If inflation of type I error is observed, it can be due to the use of an anti-conservative test (e.g. LRT for extremely rare variants (Li and Leal, 2008)); conversely, deflation can be observed when conservative tests (Fisher's exact test, or score tests when the sample size is not large) are used. These problems can sometimes be circumvented by obtaining empirical p values via permutation. Confounding factors, such as population stratification or sequencing batch effects, can generate false association signals, and permutation will not resolve these issues. In such situations it is important to explore the data to identify and correct the confounding to avoid spurious associations (see COMMENTARY for more details).
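The coordinates of a QQ plot are simple to compute: the sorted observed p values are paired with their expected values under the null hypothesis, both on the -log10 scale. The sketch below uses the common plotting position (i - 0.5)/n, which is one of several conventions; the function name and the p values are made up for illustration.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10 p pairs for a QQ plot.

    Under the null, the i-th smallest of n p values is expected near (i - 0.5) / n.
    P values must be strictly positive for the logarithm to be defined.
    """
    n = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i - 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed, start=1)]

# Made-up gene-level p values.
p_values = [0.91, 0.04, 0.33, 0.58, 0.0002, 0.76, 0.12, 0.47, 0.25, 0.66]
points = qq_points(p_values)
top_expected, top_observed = points[0]
print(round(top_expected, 3), round(top_observed, 3))
```

Points lying along the diagonal indicate a well-behaved test. A curve that departs from the diagonal across the whole range, rather than only in the tail, suggests a miscalibrated test or confounding.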

5. Follow-up studies

It is important to carry out follow-up studies to confirm any significant findings from the initial sequencing study, and this is particularly true for rare variant associations. It is of particular interest to assay experimentally the effect of candidate genes on phenotypes, but this strategy is time-consuming and labor-intensive. The most economical approach is to replicate candidate genes in an independent sample. Given the effectiveness of the variant-based approach (Liu and Leal, 2010b), this strategy is more practical for assaying thousands of samples. Custom chips can be designed to target the top candidate genes. Of particular note is the exome-chip design, which includes on the array ~240,000 non-synonymous and splice site variants identified by sequencing >12,000 individuals (Do et al., 2012) (http://genome.sph.umich.edu/wiki/Exome_Chip_Design). Although the exome-chip can be used for replication, it is currently also used for primary association studies of coding variants. The exome-chip is being genotyped on >1,000,000 individuals with phenotype data for a wide variety of traits (Do et al., 2012). For replication studies, it is ideal that the same analysis strategy used in the discovery panel to select top candidate genes is applied to the replication data, although it is also desirable to explore other genes for additional signals in the exome-chip data.

Software packages for rare variant analyses

Most original papers describing analysis methods provide software for carrying out the proposed methods. General tools that implement a battery of published methods are also available. Some of the methods can be easily implemented in statistical packages such as R (Team, 2008). Due to the complexity of rare variant analyses, currently available tools may not fulfill the analysis needs of a specific study. In such situations it is desirable to implement custom-designed approaches, for example in R, to carry out the required analyses. The following lists a few software packages that implement most of the methods discussed in this Unit.

COMMENTARY

Study designs

In this Unit we focus on unrelated case/control or quantitative trait designs. Other study designs provide attractive alternatives. For example, family studies were largely ignored in the GWAS era due to their low power for common variants. However, there is a resurgence of family studies with the advent of rare variant searches, and it is still debated whether family or unrelated designs are more powerful for identifying rare variants with larger genetic effects. It is argued that collecting families with multiple affected individuals can enrich for causal rare variants, and sequencing such families is expected to achieve improved power over unrelated designs (Cirulli and Goldstein, 2010; Peng et al., 2010). A particular advantage of family studies is the ease of replication of rare variant findings. For example, a much reduced sample size is needed when additional family members of individuals who carry candidate rare variants are ascertained for replication. By contrast, a large sample is required to observe enough copies of rare variants in unrelated individuals. Family samples, however, are much more difficult to collect. It is likely that both designs will be carried out in the future and will be complementary to each other. Although we only discussed methods that are designed for unrelated individuals, some are extendable to family studies and a few other methods are available as well (Fang et al., 2012; Zhu and Xiong, 2012).

For quantitative traits, sampling individuals with extremely low or high phenotypes is more powerful than random sampling (Barnett et al., 2012; Huang and Lin, 2007; Liu and Leal, 2012b), and this extreme sampling strategy has been successful in identifying rare variants in sequencing studies (Cohen et al., 2005; Cohen et al., 2004; Cohen et al., 2006; Romeo et al., 2007). One simple analysis approach is to treat the two extremes as cases and controls, in which case all methods discussed in this Unit are readily applicable. However, this simple strategy ignores the information carried in individual phenotype values. For traits that are normally distributed, with sampling based only on phenotype value cutoffs, the extreme phenotypes follow truncated normal distributions, and likelihood models conditional on the extreme sampling are expected to achieve improved power (Barnett et al., 2012; Huang and Lin, 2007; Liu and Leal, 2012b). In practice, one should be cautious about applying such methods if the traits do not follow a normal distribution or if additional criteria are used to select the extremes. In these situations the extreme phenotypes may not follow truncated normal distributions, and it is recommended that case/control methods be used to gain robustness.
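The conditional likelihood mentioned above can be illustrated for a single individual. If the trait is normal with mean mu and standard deviation sigma, and an individual is sequenced only when the phenotype falls at or below a lower cutoff or at or above an upper cutoff, the density of the observed phenotype is the normal density divided by the probability of being selected. The sketch below is a generic truncated-normal calculation under these assumptions, not the specific model of any of the cited methods. In an association model mu would depend on the genotype (e.g. on a burden score), which is omitted here, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def extreme_log_likelihood(y, mu, sigma, lower, upper):
    """Log density of phenotype y given selection because y <= lower or y >= upper."""
    if lower < y < upper:
        raise ValueError("phenotype is not in either extreme")
    # Probability that a random individual falls in one of the two tails
    selection_prob = normal_cdf(lower, mu, sigma) + 1 - normal_cdf(upper, mu, sigma)
    return math.log(normal_pdf(y, mu, sigma)) - math.log(selection_prob)

# Standard normal trait with the lowest and highest 10% sampled (cutoffs near +/-1.2816).
loglik = extreme_log_likelihood(2.0, mu=0.0, sigma=1.0, lower=-1.2816, upper=1.2816)
unconditional = math.log(normal_pdf(2.0, 0.0, 1.0))
print(round(loglik, 4), round(unconditional, 4))
```

Ignoring the selection step amounts to using the unconditional density, which misstates the likelihood by the log of the selection probability and biases parameter estimates when that probability depends on the parameters.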

Confounding factors for rare variant associations

In addition to the statistical challenges of rare variant association analyses, two confounding factors, namely sequencing batch effects and population stratification, warrant further discussion because they can produce spurious associations. These confounding effects have not been investigated extensively, but their impact on association results can be substantial. Advanced methods are needed for studies in which such confounding effects are present.

Batch effects of sequencing refer to differences in variant and genotype calls between cases and controls that arise from technical factors, such as DNA sources, sequencing technologies, sequencing depth, calling approaches and post-processing. For example, sequencing depth has been shown to be such a confounding factor in the 1000 Genomes Project data (Abecasis et al., 2012; Consortium, 2010). It is not uncommon for different sequencing strategies to be used in the same study. For example, newer technologies with reduced error rates and increased coverage are used for later phases of sequencing, resulting in better genotype calls; different capture kits have an even more dramatic confounding effect due to varying capture evenness and target regions. These factors on their own can introduce systematic differences between cases and controls, and in combination they may lead to strong batch effects that generate spurious associations. It is generally true that genotypes of rare variants are more difficult to infer from sequencing data than those of common variants. For the methods that up-weight variants more heavily the rarer they are, it is unclear how potential false rare variant calls affect power and false positive associations, and this warrants additional investigation. Even in the benign case where heterogeneity of the sequencing strategies does not lead to confounding in a well-designed study, power loss is expected if these differences are ignored. It is helpful to explore batch effects using, for example, principal component analysis and to take appropriate action once confounding factors are identified. Until sequencing technologies mature and achieve high accuracy, different sequencing platforms (e.g. reagents, software pipelines, etc.) are likely to be used in the same study, and advanced methods that take such sequencing effects into account may be needed to increase analysis power while controlling for batch effects.
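As a small illustration of using principal component analysis to explore batch effects, the sketch below computes each sample's score on the first principal component of a matrix of rare-allele calls and checks whether the two sequencing batches separate. Power iteration in pure Python is used only to keep the example dependency-free; in practice an established PCA implementation applied to many variants, or to per-sample quality metrics, would be used. The function name and the toy call matrix are assumptions for illustration.

```python
import math

def first_pc_scores(matrix, n_iter=200):
    """Scores of each row (sample) on the first principal component.

    Uses power iteration on the column-centered data. Assumes the data are
    not constant, so the normalization below never divides by zero.
    """
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in matrix]
    v = [1.0 / math.sqrt(m)] * m  # starting direction
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        # Multiply the scores by X^T to obtain (X^T X) v, then renormalize
        new_v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(w * w for w in new_v))
        v = [w / norm for w in new_v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]

# Toy rare-allele calls (made up): batch B shows extra calls at the first three sites.
calls = [
    [0, 0, 0, 1, 0],  # batch A
    [0, 0, 0, 0, 1],  # batch A
    [0, 0, 0, 0, 0],  # batch A
    [1, 1, 1, 0, 0],  # batch B
    [1, 1, 0, 1, 0],  # batch B
    [1, 0, 1, 0, 0],  # batch B
]
scores = first_pc_scores(calls)
batch_a, batch_b = scores[:3], scores[3:]
print([round(s, 2) for s in scores])
```

When samples cluster by batch on a leading component, as they do in this toy example, the batch variable is a candidate confounder and should be examined against case/control status before association results are interpreted.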

The problem of population stratification in association studies is well recognized, and effective methods based on PCA (Price et al., 2006) and variance component models (Kang et al., 2010) are routinely applied to GWAS data. For rare variants, however, it is less clear how population stratification may affect association analyses. Several recent large-scale sequencing studies reveal that the vast majority of variants are rare and that the excess of rare variants is due to recent explosive human population growth (Keinan and Clark, 2012; Nelson et al., 2012). The departure from equilibrium caused by population growth skews the patterns of genetic variation and makes population genetic modeling more challenging. The excess of mutations introduced after the split of modern populations complicates association studies; for example, study samples regarded as homogeneous for GWAS may show differential patterns in the spectrum of rare variants. Simulation studies showed that the impact of population stratification on association mapping can be stronger for rare variants than for common variants (Mathieson and McVean, 2012). Commonly used approaches for correcting population stratification in GWAS, including methods based on PCA and variance component models, may not always be effective in correcting the population stratification of rare variants (Mathieson and McVean, 2012). Although PCA has been shown to be effective on some data (Zhang et al., 2013), definitive conclusions require more studies, and the development of rare variant analysis methods can clearly benefit from a better understanding of the genetic variation arising from recent explosive population growth.

Concluding remarks

In this Unit we have described various rare variant analysis methods, their statistical features and scope of application, and discussed the challenges of rare variant analysis from several aspects. We have also outlined a pipeline for sequencing analysis to identify rare variant associations. However, it is clear that there is not yet a consensus on standard approaches for aggregation analyses of rare variants. It is up to investigators to select appropriate analysis strategies tailored to their studies. Results from ongoing studies will further our understanding of the genetic architecture of complex traits, which will in turn help develop better analysis strategies. We hope that this Unit serves as a general introduction to this emerging field and provides useful guidelines for rare variant association analysis.

Acknowledgment

The authors would like to thank Dr. Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, for helpful discussions. This research is supported by an internal development fund from Vanderbilt University (B.L.) and National Institutes of Health grants MD005964 and HL102926 (S.M.L.).

Literature Cited

  1. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
  2. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nature methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
  3. Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, Doelle H, Ersoy B, Kryukov G, Schmidt S, Yosef N, Ruppin E, Sharan R, Vaisse C, Sunyaev S, Dent R, Cohen J, McPherson R, Pennacchio LA. Medical sequencing at the extremes of human body mass. American journal of human genetics. 2007;80:779–791. doi: 10.1086/513471.
  4. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Peltonen L, Dermitzakis E, Bonnen PE, Altshuler DM, Gibbs RA, de Bakker PI, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, Wheeler D, Gibbs RA, Muzny DM, Barnes C, Darvishi K, Hurles M, Korn JM, Kristiansson K, Lee C, McCarrol SA, Nemesh J, Dermitzakis E, Keinan A, Montgomery SB, Pollack S, Price AL, Soranzo N, Bonnen PE, Gibbs RA, Gonzaga-Jauregui C, Keinan A, Price AL, Yu F, Anttila V, Brodeur W, Daly MJ, Leslie S, McVean G, Moutsianas L, Nguyen H, Schaffner SF, Zhang Q, Ghori MJ, McGinnis R, McLaren W, Pollack S, Price AL, Schaffner SF, Takeuchi F, Grossman SR, Shlyakhter I, Hostetter EB, Sabeti PC, Adebamowo CA, Foster MW, Gordon DR, Licinio J, Manca MC, Marshall PA, Matsuda I, Ngare D, Wang VO, Reddy D, Rotimi CN, Royal CD, Sharp RR, Zeng C, Brooks LD, McEwen JE. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–58. doi: 10.1038/nature09298.
  5. Asimit J, Zeggini E. Rare variant association analysis methods for complex traits. Annual review of genetics. 2010;44:293–308. doi: 10.1146/annurev-genet-102209-163421.
  6. Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies involving rare variants. Nature reviews. Genetics. 2010;11:773–785. doi: 10.1038/nrg2867.
  7. Barnett IJ, Lee S, Lin X. Detecting Rare Variant Effects Using Extreme Phenotype Sampling in Sequencing Association Studies. Genetic epidemiology. 2012 doi: 10.1002/gepi.21699.
  8. Bhatia G, Bansal V, Harismendy O, Schork NJ, Topol EJ, Frazer K, Bafna V. A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS computational biology. 2010;6:e1000954. doi: 10.1371/journal.pcbi.1000954.
  9. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nature genetics. 2008;40:695–701. doi: 10.1038/ng.f.136.
  10. Bromberg Y, Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic acids research. 2007;35:3823–3835. doi: 10.1093/nar/gkm238.
  11. Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome research. 2013;23:142–151. doi: 10.1101/gr.142455.112.
  12. Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloglu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, Lifton RP. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences of the United States of America. 2009;106:19096–19101. doi: 10.1073/pnas.0910672106.
  13. Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome research. 2009;19:1553–1561. doi: 10.1101/gr.092619.109.
  14. Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nature reviews. Genetics. 2010;11:415–425. doi: 10.1038/nrg2779.
  15. Cohen J, Pertsemlidis A, Kotowski IK, Graham R, Garcia CK, Hobbs HH. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nature genetics. 2005;37:161–165. doi: 10.1038/ng1509.
  16. Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004;305:869–872. doi: 10.1126/science.1099870.
  17. Cohen JC, Pertsemlidis A, Fahmi S, Esmail S, Vega GL, Grundy SM, Hobbs HH. Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. Proceedings of the National Academy of Sciences of the United States of America. 2006;103:1810–1815. doi: 10.1073/pnas.0508483103.
  18. Consortium TGP. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
  19. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome research. 2005;15:901–913. doi: 10.1101/gr.3577405.
  20. Cox DR, Hinkley DV. Theoretical Statistics. Chapman and Hall; 1977.
  21. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
  22. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics. 2011;43:491–498. doi: 10.1038/ng.806.
  23. Dering C, Hemmelmann C, Pugh E, Ziegler A. Statistical analysis of rare sequence variants: an overview of collapsing methods. Genetic epidemiology. 2011;35(Suppl 1):S12–17. doi: 10.1002/gepi.20643.
  24. Do R, Kathiresan S, Abecasis GR. Exome sequencing and complex disease: practical aspects of rare variant association studies. Human molecular genetics. 2012;21:R1–9. doi: 10.1093/hmg/dds387.
  25. Dudbridge F, Gusnanto A. Estimation of significance thresholds for genomewide association scans. Genetic epidemiology. 2008;32:227–234. doi: 10.1002/gepi.20297.
  26. Emond MJ, Louie T, Emerson J, Zhao W, Mathias RA, Knowles MR, Wright FA, Rieder MJ, Tabor HK, Nickerson DA, Barnes KC, Gibson RL, Bamshad MJ. Exome sequencing of extreme phenotypes identifies DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis. Nature genetics. 2012;44:886–889. doi: 10.1038/ng.2344.
  27. Fang S, Sha Q, Zhang S. Two adaptive weighting methods to test for rare variant associations in family-based designs. Genetic epidemiology. 2012;36:499–507. doi: 10.1002/gepi.21646.
  28. Ferrer-Costa C, Gelpi JL, Zamakola L, Parraga I, de la Cruz X, Orozco M. PMUT: a web-based tool for the annotation of pathological mutations on proteins. Bioinformatics. 2005;21:3176–3178. doi: 10.1093/bioinformatics/bti486.
  29. Gorlov IP, Gorlova OY, Sunyaev SR, Spitz MR, Amos CI. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. American journal of human genetics. 2008;82:100–112. doi: 10.1016/j.ajhg.2007.09.006.
  30. Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Human heredity. 2010;70:42–54. doi: 10.1159/000288704.
  31. Hartl DL, Clark AG. Principles of population genetics. Fourth Edition. Sinauer Associates, Inc.; 2007.
  32. Hirschhorn JN, Daly MJ. Genome-wide association studies for common diseases and complex traits. Nature reviews. Genetics. 2005;6:95–108. doi: 10.1038/nrg1521.
  33. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature genetics. 2012;44:955–959. doi: 10.1038/ng.2354.
  34. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS genetics. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
  35. Huang BE, Lin DY. Efficient association mapping of quantitative trait loci with selective genotyping. American journal of human genetics. 2007;80:567–576. doi: 10.1086/512727.
  36. Iyengar SK, Elston RC. The genetic basis of complex traits: rare variants or “common gene, common disease”? Methods Mol Biol. 2007;376:71–84. doi: 10.1007/978-1-59745-389-9_6.
  37. Ji W, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, Newton-Cheh C, State MW, Levy D, Lifton RP. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nature genetics. 2008;40:592–599. doi: 10.1038/ng.118.
  38. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nature genetics. 2010;42:348–354. doi: 10.1038/ng.548.
  39. Keinan A, Clark AG. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science. 2012;336:740–743. doi: 10.1126/science.1217283.
  40. Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CM, Richards JB. The empirical power of rare variant association methods: results from sanger sequencing in 1,998 individuals. PLoS genetics. 2012;8:e1002496. doi: 10.1371/journal.pgen.1002496.
  41. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, Lin X. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. American journal of human genetics. 2012a;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
  42. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012b;13:762–775. doi: 10.1093/biostatistics/kxs014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Li B, Chen W, Zhan X, Busonero F, Sanna S, Sidore C, Cucca F, Kang HM, Abecasis GR. A likelihood-based framework for variant calling and de novo mutation detection in families. PLoS genetics. 2012a;8:e1002944. doi: 10.1371/journal.pgen.1002944. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. American journal of human genetics. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Li B, Leal SM. Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS genetics. 2009;5:e1000481. doi: 10.1371/journal.pgen.1000481. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Li B, Wang G, Leal SM. SimRare: a program to generate and analyze sequence-based data for association studies of quantitative and qualitative traits. Bioinformatics. 2012b;28:2703–2704. doi: 10.1093/bioinformatics/bts499. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. doi: 10.1093/bioinformatics/btr509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595. doi: 10.1093/bioinformatics/btp698. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Li Y, Byrnes AE, Li M. To identify associations with rare variants, just WHaIT: Weighted haplotype and imputation-based tests. American journal of human genetics. 2010a;87:728–735. doi: 10.1016/j.ajhg.2010.10.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: implications for design of complex trait association studies. Genome research. 2011;21:940–951. doi: 10.1101/gr.117259.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic epidemiology. 2010b;34:816–834. doi: 10.1002/gepi.20533. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. American journal of human genetics. 2011;89:354–367. doi: 10.1016/j.ajhg.2011.07.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS genetics. 2010a;6:e1001156. doi: 10.1371/journal.pgen.1001156. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Liu DJ, Leal SM. Replication strategies for rare variant complex trait association studies via next-generation sequencing. American journal of human genetics. 2010b;87:790–801. doi: 10.1016/j.ajhg.2010.10.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Liu DJ, Leal SM. Estimating genetic effects and quantifying missing heritability explained by identified rare-variant associations. American journal of human genetics. 2012a;91:585–596. doi: 10.1016/j.ajhg.2012.08.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Liu DJ, Leal SM. A unified framework for detecting rare variant quantitative trait associations in pedigree and unrelated individuals via sequence data. Human heredity. 2012b;73:105–122. doi: 10.1159/000336293. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Human mutation. 2011;32:894–899. doi: 10.1002/humu.21517.
  59. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a.
  60. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
  61. Mardis ER. Next-generation DNA sequencing methods. Annual review of genomics and human genetics. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
  62. Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nature genetics. 2012;44:243–246. doi: 10.1038/ng.1074.
  63. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS genetics. 2011;7:e1001322. doi: 10.1371/journal.pgen.1001322.
  64. Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D, Warren L, Aponte J, Zawistowski M, Liu X, Zhang H, Zhang Y, Li J, Li Y, Li L, Woollard P, Topp S, Hall MD, Nangle K, Wang J, Abecasis G, Cardon LR, Zollner S, Whittaker JC, Chissoe SL, Novembre J, Mooser V. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337:100–104. doi: 10.1126/science.1217876.
  65. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic acids research. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
  66. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250.
  67. Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nature reviews. Genetics. 2011;12:443–451. doi: 10.1038/nrg2986.
  68. Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genetic epidemiology. 2009;33:497–507. doi: 10.1002/gepi.20402.
  69. Pan W, Shen X. Adaptive tests for association analysis of rare variants. Genetic epidemiology. 2011;35:381–388. doi: 10.1002/gepi.20586.
  70. Peng B, Li B, Han Y, Amos CI. Power analysis for case-control association studies of samples with known family histories. Human genetics. 2010;127:699–704. doi: 10.1007/s00439-010-0824-5.
  71. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. American journal of human genetics. 2010;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
  72. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genomewide association studies. Nature genetics. 2006;38:904–909. doi: 10.1038/ng1847.
  73. Pritchard JK, Przeworski M. Linkage disequilibrium in humans: models and data. American journal of human genetics. 2001;69:1–14. doi: 10.1086/321275.
  74. Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic acids research. 2002;30:3894–3900. doi: 10.1093/nar/gkf493.
  75. Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH, Cohen JC. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nature genetics. 2007;39:513–516. doi: 10.1038/ng1984.
  76. San Lucas FA, Wang G, Scheet P, Peng B. Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools. Bioinformatics. 2012;28:421–422. doi: 10.1093/bioinformatics/btr667.
  77. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America. 1977;74:5463–5467. doi: 10.1073/pnas.74.12.5463.
  78. Schork NJ, Murray SS, Frazer KA, Topol EJ. Common vs. rare allele hypotheses for complex diseases. Current opinion in genetics & development. 2009;19:212–219. doi: 10.1016/j.gde.2009.04.010.
  79. Schwarz JM, Rodelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nature methods. 2010;7:575–576. doi: 10.1038/nmeth0810-575.
  80. Shendure J, Ji H. Next-generation DNA sequencing. Nature biotechnology. 2008;26:1135–1145. doi: 10.1038/nbt1486.
  81. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome research. 2005;15:1034–1050. doi: 10.1101/gr.3715005.
  82. Slager SL, Huang J, Vieland VJ. Effect of allelic heterogeneity on the power of the transmission disequilibrium test. Genetic epidemiology. 2000;18:143–156. doi: 10.1002/(SICI)1098-2272(200002)18:2<143::AID-GEPI4>3.0.CO;2-5.
  83. Smith DJ, Lusis AJ. The allelic structure of common disease. Human molecular genetics. 2002;11:2455–2461. doi: 10.1093/hmg/11.20.2455.
  84. Stitziel NO, Kiezun A, Sunyaev S. Computational and statistical approaches to analyzing variants identified by exome sequencing. Genome biology. 2011;12:227. doi: 10.1186/gb-2011-12-9-227.
  85. R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2008.
  86. Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240.
  87. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research. 2010;38:e164. doi: 10.1093/nar/gkq603.
  88. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. American journal of human genetics. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
  89. Zhang Y, Guan W, Pan W. Adjustment for population stratification via principal components in association analysis of rare variants. Genetic epidemiology. 2013;37:99–109. doi: 10.1002/gepi.21691.
  90. Zhu X, Feng T, Li Y, Lu Q, Elston RC. Detecting rare variants for complex traits using family and unrelated data. Genetic epidemiology. 2010;34:171–187. doi: 10.1002/gepi.20449.
  91. Zhu Y, Xiong M. Family-based association studies for next-generation sequencing. American journal of human genetics. 2012;90:1028–1045. doi: 10.1016/j.ajhg.2012.04.022.

RESOURCES