Abstract
Detecting uncommon causal variants (minor allele frequency (MAF) < 5%) is difficult with commercial single-nucleotide polymorphism (SNP) arrays that are designed to capture common variants (MAF > 5%). Haplotypes can provide insights into underlying linkage disequilibrium (LD) structure and can tag uncommon variants that are not well tagged by common variants. In this work, we propose a wei-SIMc-matching test that inversely weights haplotype similarities with the estimated standard deviation of haplotype counts, to boost the power of similarity-based approaches for detecting uncommon causal variants. We then compare the power of the wei-SIMc-matching test with that of several popular haplotype-based tests, including four other similarity-based tests, a global score test for haplotypes (global), a test based on the maximum score statistic over all haplotypes (max), and two newly proposed haplotype-based tests for rare variant detection. With systematic simulations under a wide range of LD patterns, the results show that wei-SIMc-matching and global are the two most powerful tests. Among these two tests, wei-SIMc-matching has reliable asymptotic P values, whereas global needs permutations to obtain reliable P values when the frequencies of some haplotype categories are low or when the trait is skewed. Therefore, we recommend wei-SIMc-matching for detecting uncommon causal variants with surrounding common SNPs, in light of its power and computational feasibility.
Keywords: Haplotype, Similarity, Linkage disequilibrium, Rare variants
Introduction
In the past few years, genome-wide association studies (GWAS) have identified hundreds of common genetic variants (minor allele frequency (MAF) > 5%) for complex human diseases. However, these common variants can only explain a small proportion of heritability [Manolio, et al. 2009]. Uncommon variants (MAF < 5%) are likely to play an important role in the missing heritability that cannot be explained by common variants. In this work, we call the variants with MAF < 5% ‘uncommon variants’, including the so-called ‘rare variants’ (MAF < 1%) and ‘low-frequency variants’ (MAF 1%–5%) [Zeggini 2011]. Searching for uncommon variants that are responsible for complex diseases is now attracting more attention [Zeggini 2011]. However, this topic remains challenging. When sequencing data are available, pooling signals of multiple uncommon variants and testing the association of this pooled set with the disease [Han and Pan 2010; Li and Leal 2008; Madsen and Browning 2009; Morris and Zeggini 2010; Price, et al. 2010] is an attractive strategy [Lin, et al. 2011]. However, due to the high cost of sequencing [Sboner, et al. 2011], GWAS data are still the most commonly available data in the current stage [Li, et al. 2010; WTCCC 2007]. For GWAS using commercial single-nucleotide polymorphism (SNP) arrays, the pooling methods are underpowered in detecting uncommon causal variants as they pool signals of common SNPs that cannot well represent the information of uncommon variants (this argument should be apparent and the pooling methods were mainly proposed for sequencing data with rare variants, but we still include them into the following comparisons). Similarly, conventional single-marker analysis is also underpowered because markers in commercial SNP arrays cannot be good surrogates for causal variants that are too rare [Gusev, et al. 2011].
Haplotypes can provide insights into underlying linkage disequilibrium (LD) structure and can tag uncommon causal variants that are not well tagged by common SNPs [Gusev, et al. 2011; Li, et al. 2010]. For some complex diseases such as hypertension, rare haplotypes have been shown to influence the disease susceptibility [Kitsios and Zintzaras 2010; Liu, et al. 2005; Zhu, et al. 2005]. A recent study has shown that identical-by-descent haplotype mapping is powerful for tagging rare variants [Gusev, et al. 2011]. In addition, similarity-based approach has been used in uncommon and common variant detection [Tzeng, et al. 2011]. These studies suggest that similarity-based approaches might be also useful, to some extent, in detecting uncommon causal variants using nearby common SNPs.
In this work, we propose a ‘wei-SIMc-matching’ test, to capture the signals of uncommon causal variants using haplotype information. We inversely weight haplotype similarities with the estimated standard deviation of haplotype counts, to boost the power of similarity-based approaches for detecting uncommon causal variants. We then compare the performance of wei-SIMc-matching with that of several popular haplotype methods. We show that although commercial SNP arrays are not designed to capture uncommon causal variants, some haplotype methods including the proposed wei-SIMc-matching test have a better ability to complement this.
Methods
(I) Similarity-based tests
Statistical model
Let yi be the phenotype of the ith subject (i = 1, …, N), and let Γ(hi) be a k×1 vector coding the frequencies of all ‘haplotype categories’ for the ith subject, where k is the number of ‘haplotype categories’ (i.e., unique haplotypes in the sample, two haplotypes are classified into a same category if all observed alleles on the two haplotypes are the same). For example, if there are three categories of haplotypes {h1,h2,h3} and both of the haplotypes of the ith subject are h1, then Γ(hi)T = [1 0 0], where Γ(hi)T is the transpose of Γ(hi). If one haplotype is h2 and the other is h3, then Γ(hi)T = [0 0.5 0.5]. When the haplotype phase is uncertain, statistical methods such as the expectation-maximization (EM) algorithm [Dempster, et al. 1977] can be used to infer the haplotype frequency vector, under the assumption of Hardy-Weinberg equilibrium (HWE) [Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long, et al. 1995].
We consider a generalized linear model:
(1) |
where g(·) is a link function, Y is an N-length vector of phenotypes, C is an N×(m+1) matrix with the ith row of ciT = [1 ci,1 ci,2 ⋯ ci,m] coding 1 (for the intercept term) and m covariates (e.g., age, gender, ethnicity, etc.) of the ith subject, α is the (m+1) -element vector of covariate effects including the intercept term, x is an N-length vector with the ith element of xi = γT · S · Γ(hi) coding the genetic information (regarding the region under investigation) of the ith subject, and β is the regression coefficient of the genetic information coded by x. The scalar xi = γT · S · Γ(hi) is a quantity comparing the ith subject’s haplotypes against haplotypes of all the other subjects, in which Γ(hi) is the haplotype frequency vector of the ith subject, γ is a specified vector aggregating the haplotype information of all the N subjects, and S is a k×k matrix whose (υ, ν) element is the similarity between the υth and νth categories of haplotypes. The canonical link is the logit function , the identity function (g(μ) = μ), and the log function (g(μ) = log μ) given binary traits, normally-distributed traits, and traits with Poisson distribution, respectively [Nelder and Wedderburn 1972].
Test statistics
Based on the model in Eq. (1) and under the assumption of gene-covariate independence, the score statistic is
(2) |
where μ̂i = ciT (CTC)−1 (CTY) is the fitted value of the ith subject according to the covariates; a (ϕ) is the mean square error (MSE) if the trait distribution is normal, and is 1 if the trait distribution is Bernoulli or Poisson [Nelder and Wedderburn 1972]. With different specification of γ, the score statistic can result in different tests.
When we specify , where p̂ is the vector of the average haplotype frequencies of all the N subjects, the resulting test is called SIMp with the test statistic
(3) |
where Ω̂ is the estimated variance-covariance matrix of [Lin, et al. 2012] and
Because TSIMP is the square of a standard normal variable, it has an asymptotic χ2 distribution with one degree of freedom.
When we specify , the resulting test is called SIMc with the test statistic
(4) |
By the theory of quadratic forms of normal variables [Scheffe 1959], TSIMc is asymptotically distributed as ’s are independent χ2 variables with one degree of freedom, and λ1 ≥ λ2 ≥ ⋯ ≥ λϖ are the ordered eigen values of the matrix Ω̂S (ϖ is the rank of the matrix Ω̂S). The distribution of TSIMc can be approximated by the three-moment approximation method [Allen and Satten 2007; Allen and Satten 2009; Imhof 1961; Tzeng, et al. 2009]. The P value of the observed SIMc test statistic is given by
where is the χ2 distribution with b degrees of freedom.
The similarity matrix S can be constructed based on metrics such as the counting measure or the matching measure [Tzeng, et al. 2003]. The counting measure calculates the percentage of alleles in common between any two haplotypes; the matching measure treats each haplotype as a distinct category and is defined as 1 if two haplotypes match and 0 otherwise. Therefore, the similarity matrix S is a matrix with all diagonal elements of 1 and all off-diagonal elements of 0, if the matching measure is employed. Both the two similarity-based tests (SIMp and SIMc) can be respectively equipped with the counting measure and the matching measure, resulting in four tests (SIMp-counting, SIMp-matching, SIMc-counting, and SIMc-matching).
Weighting similarities
Previous studies show that uncommon causal variants usually have larger effect sizes compared to common causal variants [Bodmer and Bonilla 2008]. Moreover, uncommon variants are more likely to be tagged by uncommon haplotypes than by common haplotypes. Therefore, up-weighting uncommon haplotypes may facilitate the discoveries of uncommon variants. Li et al. [2010] defined Sh = [Nct · fct,h · (1 − fct,h)]−½ (h=1, …, k, in which k is the number of haplotype categories), where Nct is the number of controls; fct,h is the adjusted frequency of haplotype h among controls and is quantified as , in which Cct,h is the number of haplotype h among controls. We let the k×k similarity matrix S be a diagonal matrix with the hth diagonal element of Sh = [Nct · fct,h · (1 − fct,h)]−½, where h=1, 2, …, k. When continuous traits are analyzed, we let Sh = [N · fct,h · (1 − fh)]−½, where N is the total number of subjects and , in which Ch is the number of haplotype h among all the N subjects.
We plug this similarity matrix S into Eq. (4), and the resulting test is referred to as the wei-SIMc-matching test. It is based on the SIMc test with the matching measure inversely weighted by the estimated standard deviation of haplotype counts. The weighting scheme given to haplotypes is inspired from Madsen and Browning’s weights for SNPs [Madsen and Browning 2009]. Using this weight in S implies that we up-weight the similarities contributed by uncommon haplotypes but down-weight the similarities contributed by common haplotypes. Presumably, wei-SIMc-matching can boost the power of similarity-based approaches for detecting uncommon causal variants. We will evaluate its performance with simulations.
(II) Standard haplotype regression tests
A global score test for haplotypes (global) and a test based on the maximum score statistic over all haplotypes (max) have been widely used for detecting common variants [Schaid, et al. 2002]. The global test is regarded as a standard haplotype regression and is usually compared with similarity-based tests [Lin and Schaid 2009; Lin, et al. 2012; Tzeng, et al. 2009; Tzeng, et al. 2011]. The global and max tests are based on a generalized linear model:
(5) |
where g(·) is a link function, ℑ is an N×k matrix with the ith row of Γ(hi)T (the transpose of the haplotype-frequency vector of the ith subject), η is the (m+1) -element vector of covariate effects including the intercept term, and ψ is the k-element vector of the regression coefficients for the k categories of haplotypes in the region. Let Uψ be the score vector of ψ, and Vψ be the variance-covariance matrix of Uψ. The global score statistic is , which has an asymptotic χ2 distribution with degrees of freedom equal to the rank of Vψ [Schaid, et al. 2002].
The maximum score statistic over all haplotypes is , where Uψ,κ is the kth element of Uψ and Vψ,κκ is the (k, k) element of Vψ. There is no analytic form for the distribution function of the max test statistic, so permutation P values are used in practice [Schaid, et al. 2002].
(III) Haplotype-based tests to detect rare variants
Recently, two haplotype-based tests were proposed for rare variant detection. Both the two tests split the data into a training set and a testing set. Zhu et al.’s haplotype grouping test (referred to as ‘HG’) classifies haplotypes as risk or non-risk with the training set (the co-classification stage), and then tests for associations by performing a Fisher’s exact test with the testing set (the association stage) [Zhu, et al. 2010]. This method has been applied to the Wellcome Trust Case Control Consortium (WTCCC) data [Feng and Zhu 2010]. Li et al.’s weighted haplotype test on genotyped SNPs (referred to as ‘WHG’) is based on a similar procedure. The WHG further boosts power to detect rare variants by weighting haplotypes according to their frequencies [Li, et al. 2010]. For both HG and WHG, we followed Li et al. [2010] to randomly select 30% of the sample as the training set and let the remaining 70% be the testing set.
Simulation study
Following Li et al.’s simulation [2010], we first generated 200 data sets each containing 10,000 chromosomes of 1 Mb regions with the Cosi program [Schaffner, et al. 2005]. The chromosomes were generated in consistency with the HapMap CEU (CEPH people from Utah, U.S.A., http://hapmap.ncbi.nlm.nih.gov/) samples. For each data set, we randomly picked an ~50 kb region as the causal region, within the 1 Mb region. Within each causal region, we randomly selected d variants with population MAF between 0.1% and 5% (d=5, 10, 20, 30, or 40), and we treated these variants as causal variants that might increase or decrease the disease risk (or the value of a continuous trait). Among the d causal variants, we let r% of them increase the disease risk while the remaining (100 − r)% decrease the disease risk (or the value of a continuous trait). The value of r was specified at 5, 20, 50, 80, and 100, respectively. In each data set, we randomly chose 120 from the 10,000 chromosomes to mimic the Phase II HapMap CEU data, and these 120 chromosomes were randomly paired to form 60 subjects. Based on the LD patterns of the 60 subjects, tag SNPs were selected according to the conventional criterion of r2 = 0.8 and MAF > 5% (many association studies for complex human diseases tend to use SNPs with MAF > 5% due to a power consideration [Barrett and Cardon 2006; Keating, et al. 2008]), with the H-clust method [Rinaldo, et al. 2005; Roeder, et al. 2005]. These tag SNPs were served as markers used in our simulations.
Binary traits
When evaluating the type-I error rates, the population attributable risk (PAR) was set at 0%. When evaluating the power, the PAR of each causal variant was set at 0.2%, 0.4%, 0.6%, 0.8%, and 1.0%, respectively. We follow previous studies [Li, et al. 2010; Madsen and Browning 2009] to assign larger genetic effects to rarer variants, because rare variants with a chance to be detected usually have larger effect sizes compared to common variants [Bodmer and Bonilla 2008]. The genotype relative risk (GRR) of a causal variant j with PAR of PARj and MAF of MAFj is
(6) |
where I(ξj = 1) is the indicator function with a value of 1 or 0 according to whether the causal variant j decreases the disease risk or not. Given a value of PAR, the relationship between MAF and GRR depicted by Eq. (6) is shown in our supporting information (Supplementary Figures S1 and S2). In addition, we also show the distributions of MAFs and GRRs of the causal variants in our 200 simulated data sets in Supplementary Figures S3 and S4, respectively.
To generate chromosomes of one individual, we randomly selected two chromosomes from the remaining 9,880 (= 10,000 − 120) chromosomes. The disease status of an individual possessing two chromosomes {H1,H2} was determined by
(7) |
in which f0 is the baseline penetrance, and aj is the rare allele of the causal variant j. Following Li et al. [2010], we fixed f0 at 10%. In each replication, we continued the sampling procedure until 1,000 cases and 1,000 controls were reached. After generating the disease status based on Eq. (7), the genotypes of all the causal variants were removed from our data sets. For each data set, we selected an analysis region with 20 tag SNPs, to encompass the d causal variants.
To account for the haplotype ambiguity, we first inferred haplotype phases from unphased multimarker genotypes with the EM algorithm, by using the ‘haplo.em’ function in the ‘haplo.stats’ package [Schaid, et al. 2002]. Note that all phasing algorithms assume Hardy-Weinberg equilibrium (HWE) [Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long, et al. 1995], including the ‘haplo.em’ function. Following Schaid et al. [2002], we assumed HWE in the pooled sample of cases and controls, and then phased cases and controls together (phasing cases and controls together provides a better control of type-I error rates than phasing cases and controls separately [Lin and Huang 2007]). All possible haplotype pairs were considered with their posterior probabilities by treating the posterior probabilities as weights.
In addition to the nine haplotype-based tests, we also used the variable-threshold (VT) test program (http://genetics.bwh.harvard.edu/rare_variants/) to perform four pooling tests, including the fixed-threshold test with two thresholds of 1% and 5% [Morris and Zeggini 2010], the weighted-sum test [Madsen and Browning 2009], and the VT test [Price, et al. 2010]. With a preliminary simulation, we found that the VT test was generally the most powerful test among the four pooling tests. Therefore, in the following comparisons, we only present the result of the VT test [Price, et al. 2010] as a representative of the four pooling tests.
Continuous traits
We further generated a continuous trait (Y) by
(8) |
where C1 was a continuous covariate generated from a standard normal distribution, C2 was a dichotomous covariate taking a value of 0 or 1 each with a probability of 0.5, gj was the number of causal allele on the jth causal variant (gj = 0, 1, or 2), βj was the effect size of the jth causal variant, and e was the random error. The random error, e, was assumed to have a normal distribution with a mean of zero and a variance of Ve. The effect sizes β’s and Ve were determined so that the heritability of each variant (we call it ‘marginal heritability’) was fixed at 0.05%, 0.1%, 0.15%, or 0.2% under the alternative hypothesis. The relationship between MAFs and β’s was shown by our Supplementary Figure S5. The total sample size was set at 2,000. After generating the traits, the genotypes of all the causal variants were removed from our data sets. HG [Zhu, et al. 2010] and WHG [Li, et al. 2010] were proposed for case-control studies and so they were not included in the comparisons for analyzing continuous traits.
In addition to specifying a normally distributed error term (e), we also studied the situation when the random error came from a Gamma distribution with a shape parameter of 1 and a scale parameter of . The Ve and the effect sizes β’s (see Eq. (8)) were determined so that the marginal heritability was fixed at 0.05%, 0.1%, 0.15%, or 0.2% under the alternative hypothesis.
Simulation results
Type-I error rates
By setting the PAR or the marginal heritability at exactly 0%, we used the 200 simulated data sets to evaluate the type-I error rates. For each data set, 200 replications were performed. In the package ‘haplo.stats’, the default of the minimum number of counts for a haplotype to be included in the model is 5. Therefore, by default, haplotypes with frequencies less than , where 2000 is the total number of subjects) would be lumped into a single baseline group when using the ‘haplo.stats’ package. To evaluate the influence of the choice of α0, we specified α0 = 0.125%, 0.25%, and 1%, respectively. The corresponding minimum numbers of counts for a haplotype to be included in the model were 5, 10, and 40, respectively.
For similarity-based tests, following a practical strategy to provide robustness to genotyping errors [Lin and Lee 2010; Lin and Schaid 2009; Sha, et al. 2007], we merged a haplotype with frequency less than a cutoff value α0 with its most similar haplotype with frequency larger than α0. This α0 is not necessary to be identical to the α0 used in ‘haplo.stats’. However, to have a parallel comparison, we here also let α0 = 0.125%, 0.25%, and 1% (where 1% is the cutoff value suggested by Sha et al. [2007]), respectively.
Figure 1 presents the type-I error rates under various nominal significance levels, based on the 40,000 replications across the 200 simulated data sets, for each trait distribution and each α0. When α0 = 0.125% or 0.25%, the asymptotic results of the global test are somewhat conservative for binary trait (panels A and D), but anticonservative for the continuous trait with a normally distributed error term (panels B and E) and the continuous trait with a Gamma-distributed error term (panels C and F). When α0 = 1%, the asymptotic results of the global test are valid for binary trait (panel G) and the continuous trait with a normally distributed error term (panel H), but still somewhat anticonservative for the continuous trait with a Gamma-distributed error term (panel I). All the other tests, including the global test based on permutation P values, are valid in the sense that their type-I error rates correspond to the nominal significance levels. To have a fair comparison in power, we use permutations to evaluate the statistical significance for the global test (permutations are also required for max and VT). In the following simulations, the significances of global and max are obtained with 1,000–20,000 permutations by a sequential Monte Carlo algorithm [Besag and Clifford 1991], according to the default of the package ‘haplo.stats’ [Schaid, et al. 2002]. Moreover, α0 is specified at 0.125% in the following simulations (as the default cutoff value in ‘haplo.stats’) for the global, max, and the similarity-based tests.
Power comparisons - binary traits
Figure 2 presents the power averaged over the 200 data sets representing a wide range of LD patterns, when the trait is binary. For each scenario (each combination of r, PAR, and d) within each simulated data set, we performed 100 replications. The results show that the pooling methods such as the VT test are underpowered because they pool signals of common SNPs that do not well represent the information of uncommon variants. The global, wei-SIMc-matching, and max tests are the three most powerful tests. Specifically, the max test is slightly more powerful than the other two competitors when the PAR of each causal variant is smaller than or equal to 0.4% (the middle column of Figure 2) or when the number of causal variants is smaller than 20 (the right column of Figure 2). When there are more causal variants (d ≥ 20), there are usually more categories of disease-contributing haplotypes. In this situation, global, an omnibus test of all haplotype categories, is more powerful than max.
Overall, global is slightly more powerful than wei-SIMc-matching. The test statistic of SIMc is a summed product of genomic similarities and covariate-adjusted phenotypes (see Eq. (4)). When the causal variants are all uncommon (MAF < 5%), SIMc is underpowered because few subjects have the causal variants and most subjects are similar by having no causal variants. By contrast, global lets each haplotype category (common or uncommon, as long as the frequency is larger than the cutoff α0) account for an equal ONE degree of freedom (see Eq. (3) of [Schaid, et al. 2002]). Therefore, the association of uncommon haplotypes is more likely to be detected by global, rather than by SIMc.
The wei-SIMc-matching test, a variant of SIMc, is thus developed to enhance the ability of similarity-based approach to detect uncommon causal variants. The weight used in the wei-SIMc-matching test is in the order of ½ from the binomial standard deviation viewpoint. Through this work, we see that the SIMc test with this weight on haplotypes still cannot compete with the global test, when all the causal variants are uncommon (MAF < 5%). A larger order of weight can further boost the power to detect uncommon causal variants, however it will inevitably suffer from power loss if there are some common causal variants in that region [Tzeng, et al. 2011].
Comparing SIMc with SIMp, the former is more powerful because it takes not only the within-group similarity but also the between-group similarity into considerations [Allen and Satten 2009; Lin, et al. 2012; Nolte, et al. 2007; Sha, et al. 2007]. SIMp has good power only when the causal variant was introduced at a common haplotype [Lin, et al. 2012]. In our simulations, the disease status was influenced by multiple variants that usually resulted in multiple disease-contributing haplotypes with low frequencies. Therefore, SIMp was underpowered in this situation. Comparing the counting measure with the matching measure, the latter is more powerful because it captures the information of identical-by-descent sharing more precisely [Lin and Lee 2010; Lin and Schaid 2009; Tzeng, et al. 2009].
HG [Zhu, et al. 2010] and WHG [Li, et al. 2010] were not as powerful as the global, wei-SIMc-matching, and max tests. A main reason is that the data were split into a training set and a testing set. Both HG and WHG can be improved by using the entire sample for the co-classification stage and the same entire sample for the association stage, with permutations to adjust for the statistical significance. This strategy is computationally feasible when handling only top genes [Feng and Zhu 2010]. However, it is computationally demanding for our comprehensive simulations.
Power comparisons - continuous traits
Figure 3 presents the power averaged over the 200 data sets when the trait is continuous (100 replications for each scenario within each data set), given the nominal significance level of 10−3. The result given the nominal significance level of 10−4 is presented in Supplementary Figure S7. When the error term is simulated from a normal distribution, the wei-SIMc-matching, global, and max tests are, again, the three most powerful tests. The global test is more robust to the percent of variants among the d causal variants that increase the trait value (the left upper panel of Figure 3). The global and max tests are slightly more powerful than the wei-SIMc-matching test when the marginal heritability of each causal variant is smaller than or equal to 0.1% (the middle upper panel of Figure 3) or when the number of causal variants is smaller than or equal to 20 (the right upper panel of Figure 3).
Note that different from other tests, the power of the VT test is not V-shaped, when the x-axis is the percent of variants among the d causal variants that increase the trait value (the first columns of Figure 3 and Supplementary Figure S7). This is because VT performs a right-tailed test in the program (http://genetics.bwh.harvard.edu/rare_variants/). Revising it to a two-tailed test can improve its power under a small r (the percent of variants among the d causal variants that increase the trait value).
When the error term is simulated from a Gamma distribution, wei-SIMc-matching is consistently the best method under all scenarios we evaluated (the bottom rows of Figure 3 and Supplementary Figure S7). In the package ‘haplo.stats’, the only choice of trait type for a continuous trait is ‘gaussian’. Therefore, we also specify ‘gaussian’ as the trait type, when analyzing the continuous trait with a Gamma-distributed error term. Because the trait is skewed and is not following the normal (gaussian) distribution, the global and max tests (both performed with the package ‘haplo.stats’) suffer from power loss. This problem can be remedied by taking a logarithmic transformation on the trait. However, the skewness of an error term is not always easy to be recognized from the observed trait values. By contrast, the performances of the similarity-based tests are robust to the distribution of the traits (comparing the top rows and the bottom rows of Figure 3 and Supplementary Figure S7).
We also present the power stratified by the marginal heritability (given d = 40, and r = 100%) and then sorted by the percent of rare causal variants with MAF < 0.5% (top rows of Supplementary Figures S10–S11). Given many rare causal variants (MAF < 0.5%), wei-SIMc-matching is underpowered because very few subjects have the causal variants and most subjects are similar by having no causal variants. Furthermore, we also sorted the power by the LD pattern between the causal variants and the surrounding markers (bottom rows of Supplementary Figures S10–S11). As expected, the power of all the tests improves as the average r2 increases. Generally speaking, global is more powerful than wei-SIMc-matching when the average r2 is smaller, whereas wei-SIMc-matching is more powerful when the average r2 is larger (d = 40, the average r2 was obtained by averaging the 40×20 r2’s of any one causal variant and each of the 20 surrounding markers).
Choice of the cutoff valueα0
In the above power comparisons, the cutoff value for haplotype frequencies was set at α0 = 0.125%, the default value used in ‘haplo.stats’. The haplotypes with frequencies less than α0 were pooled into a single baseline group when we used ‘haplo.stats’. Besides, in the similarity-based tests, a haplotype with frequency less than α0 was merged with its most similar haplotype with frequency larger than α0. Because the matching measure is a phase-dependent metric, the choice of α0 may affect the performance of the SIMp-matching, SIMc-matching, and wei-SIMc-matching tests. To evaluate the influence on power of the five tests (global, max, SIMp-matching, SIMc-matching, and wei-SIMc-matching tests) with a different α0, we further performed simulations with α0 = 1%. In Supplementary Figures S12–S14, we compare the result given α0 = 0.125% with that given α0 = 1%.
When analyzing binary traits (Supplementary Figures S12), all the five tests (global, max, SIMp-matching, SIMc-matching, and wei-SIMc-matching tests) became less powerful given an increased α0 of 1%. As shown by Supplementary Figures S3, among all the causal variants in our simulation, the percent of rare causal variants (MAF < 1%) is 73.7%, whereas the percent of extremely rare causal variants (MAF < 0.125%) is 18.5%. The power loss given an increased α0 of 1% is expected because haplotypes with frequencies less than 1% are more likely to tag the rare causal variants (MAF < 1%). However, they are lumped into a single baseline group when performing global and max, or merged with other commoner haplotypes when performing the SIMp-matching, SIMc-matching, and wei-SIMc-matching tests.
When analyzing continuous traits, again, similarity-based tests have a decrease in power when α0 is increased to 1%, especially for the wei-SIMc-matching test whose power is boosted from up-weighting the similarities contributed by uncommon haplotypes. For global and max, however, generally the power improves when α0 is increased to 1%. This contradicts the previous result for binary traits. Scoring many rare haplotypes (frequencies between 0.125% and 1%) in a model may weaken the power of the global and max tests, although we are unclear why this phenomenon only appears in analyzing continuous traits. On average, in each replication, ~18 haplotypes with frequencies larger than 1% and ~20 haplotypes with frequencies between 0.125% and 1% (see Supplementary Figures S15–S16). Therefore, compared with α0 = 1%, ~20 more haplotypes need to be scored in the model given α0 = 0.125%. The many rare haplotypes (frequencies between 0.125% and 1%) may cause unstable estimation of the score vector Uψ and/or the variance-covariance matrix Vψ.
Computational burden
The computational burden to perform the wei-SIMc-matching test is reasonable because no permutation is required. When analyzing binary traits given PAR = 0.2%, d = 20, r = 100%, and the cutoff values for haplotype frequencies α0 = 0.125%, the wei-SIMc-matching test on average takes respectively 0.9, 6.9, and 23.3 seconds for a 20-SNP multimarker set on 1000, 2000, and 3000 subjects, given an Intel Xeon workstation with 3.0 GHz of CPU and 2.0 GB of memory. The global test with 1,000–20,000 permutations on average takes 13.6, 61.3, and 145.9 seconds, for analyzing 1000, 2000, and 3000 subjects, respectively. The range of the required time for the global test is quite large (shown in Supplementary Figure S17), depending on the number of permutations (may range from 1,000 to 20,000) in each replication.
Summary of simulation results
The purposed wei-SIMc-matching test is among the most powerful tests for detecting uncommon causal variants (MAF < 5%), although it still cannot compete with the global test when most causal variants are very rare (MAF < 0.5%) or when the average r2 between the causal variants and the surrounding markers is extremely low (≤ 0.01), as clearly shown by Supplementary Figure S10. However, the performance of the wei-SIMc-matching test is more robust to the trait distributions and the cutoff values for haplotype frequencies (α0). Furthermore, it does not require permutations to obtain reliable statistical significance.
Application to a population-based resequencing study for the ANGPTL4 gene
We then applied the eight (for a continuous trait) or ten (for a binary trait) tests to a population-based resequencing study for the ANGPTL4 gene [Romeo, et al. 2007; Romeo, et al. 2009]. To understand the role of ANGPTL4 in lipid metabolism, Romeo et al. [2007; 2009] sequenced seven exons and the intron-exon boundaries of ANGPTL4. There were 3,551 subjects coming from a population-based probability sample of Dallas County residents, including 1,830 African Americans, 601 Hispanics, 1,045 European Americans, and 75 other ethnicities. In our analysis, we evaluated the performance of the various tests to detect associations between the plasma triglyceride levels and the uncommon variants in ANGPTL4, pretending that all the uncommon variants were not genotyped. We excluded the 75 subjects of other ethnicities from our analysis. Among the 93 variants, we kept two variants with MAF > 5% in the sample of the 3,476 (3,551 − 75 other ethnicities) subjects: P307P (MAF=6.6%) and P389P (MAF=6.5%). We deliberately excluded the variants with MAF less than 5% in order to mimic a commercial SNP array. To the best of our knowledge, E40K (MAF = 0.73%) and R278Q (MAF = 3.1%) are the only two variants reported to be associated with plasma triglyceride levels, based on the analyses for this resequencing data set [King, et al. 2010; Maxwell, et al. 2010; Romeo, et al. 2007; Yi, et al. 2011]. Our objective is to see whether the haplotype-based methods can detect the signal caused by the two reported uncommon variants (MAF < 5%), E40K and R278Q, which were both deliberately excluded from our analyses.
The log-transformed plasma triglyceride levels were first adjusted for age, sex, body-mass index (BMI), and ethnicity (including three levels: African American, Hispanic, and European American), by performing a linear regression of log-transformed plasma triglyceride levels on these four covariates. The residuals (yi − μ̂i) ’s were treated as continuous traits used in Eq. (3) and (4) to perform the similarity-based tests. Because there were 468 subjects missing in age or BMI, the actual number of subjects for the analysis of the continuous trait was 3,008 (= 3,551 − 75 − 468). Following Romeo et al. [2007], we also created a binary trait by coding subjects in the top and bottom quartiles of the residuals as 1 (755 subjects) and 0 (744 subjects), respectively. The remaining subjects were excluded from the analysis. Therefore, the number of subjects for the analysis of the binary trait was 1,499 (= 755+744). We then tested for the association between the continuous / binary trait and the haplotypes formed by the two variants (P307P and P389P). There were 93 variants, generating possible combinations of any two variants. We set the significance level at . The tests yielding significant results included SIMc-counting (P value of the analysis for the continuous / binary trait = 3.6×10−10 / 2.2×10−9), global (<10−6 / <10−6, with 106 permutations), max (6×10−6 / 5×10−6, with 106 permutations), and wei-SIMc-matching (6.3×10−6 / 1.8×10−5). The above four tests can detect the association between the plasma triglyceride levels and the uncommon variants in ANGPTL4 (E40K and R278Q [King, et al. 2010; Maxwell, et al. 2010; Romeo, et al. 2007; Yi, et al. 2011]), even when the uncommon variants were not genotyped. The results for the continuous trait and the binary trait were very similar.
Discussion
When performing haplotype-based tests, the question of how to choose the size of a multimarker set is still open [Schaid 2004]. Although we let 20 SNPs form a multimarker set in our simulations, we also performed simulations by using only 10 SNPs to form a set (Supplementary Figures S18–S20). The relative power performances were very similar to the results by using 20 SNPs, but generally each test was less powerful than that by using 20 SNPs in a multimarker set. This is because larger multimarker sets may allow for measuring sharing over longer genomic sequences and lead to more power gains [Allen and Satten 2009; Lin, et al. 2012].
Through systematic simulations while considering a wide range of LD patterns, we find that although wei-SIMc-matching cannot compete with global in some situations (especially when most causal variants are very rare (MAF < 0.5%) or when the average r2 between the causal variants and the surrounding markers is extremely low (≤ 0.01), as shown by Supplementary Figure S10), it is one of the best approaches for detecting uncommon causal variants (MAF < 5%) with surrounding common SNPs (MAF > 5%). In addition to the power, the merits of wei-SIMc-matching also include its robustness to the trait distributions and the cutoff values for haplotype frequencies (α0). Furthermore, it is computationally feasible in the sense that no permutation is required to obtain reliable P values.
Although max is slightly more powerful than global and wei-SIMc-matching under certain situations when analyzing binary traits, there is no analytic form for the distribution function of the max test statistic and permutation P values are required. Permutation procedure is also required to obtain more reliable P values for global, when the frequencies of some haplotype categories are low or when the trait is skewed (see our Figure 1 or [Schaid, et al. 2002]). When the significance level is much smaller than 0.05 as in whole-genome association studies, the estimation of P values with permutation procedures can be computationally challenging [Tong, et al. 2010]. By contrast, wei-SIMc-matching provides reliable asymptotic P values. As shown by Figure 1, its type-I error rates exactly correspond to the nominal significance levels.
The max test may not be very ideal because it evaluates the significance of a haplotype by assuming no effect of other haplotypes on the trait. Recall that the statistic for global is , where Uψ is the score vector evaluated at ψ = 0 (ψk = 0 for all k, in which ψk is the kth element of ψ). When performing max with the package ‘haplo.stats’, the test statistic is , where Uψ,κ is the kth element of Uψ that is evaluated at ψk = 0 for all k. It will be more precise to calculate the test statistic of max based on (instead of Uψ,κ), where is evaluated at ψk = 0 and ψl = ψ̂1 (l ≠ k, ψ̂l is the maximum likelihood estimate of the effect of haplotype l in the unconstrained model). That is, evaluates the significance of haplotype k while leaving the effects of the rest of haplotypes unconstrained.
In conclusion, compared with the other tests considered in this work, the wei-SIMc-matching test is to be recommended for the detection of uncommon causal variants with surrounding common SNPs, in light of its power and computational feasibility.
Supplementary Material
Acknowledgements
We thank the anonymous reviewers for their insightful and constructive comments; Drs. Jonathan C. Cohen and Helen H. Hobbs for kindly providing the Dallas Heart Study data. This work was supported in part by NIH grants GM081488 (NL), 5R01GM069430-07 (NY), R00 RR024163 (DZ), and R01GM074913 (KZ) from the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
The authors declare that they have no conflict of interest.
References
- Allen AS, Satten GA. Statistical models for haplotype sharing in case-parent trio data. Hum Hered. 2007;64(1):35–44. doi: 10.1159/000101421. [DOI] [PubMed] [Google Scholar]
- Allen AS, Satten GA. A novel haplotype-sharing approach for genome-wide case-control association studies implicates the calpastatin gene in Parkinson's disease. Genet Epidemiol. 2009;33(8):657–667. doi: 10.1002/gepi.20417. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat Genet. 2006;38(6):659–662. doi: 10.1038/ng1801. [DOI] [PubMed] [Google Scholar]
- Besag J, Clifford P. Sequential Monte Carlo p-values. Biometrika. 1991;78:301–304. [Google Scholar]
- Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008;40(6):695–701. doi: 10.1038/ng.f.136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood estimation from incomplete data via the EM algorithm. J R Stat Soc. 1977;39:1–38. [Google Scholar]
- Excoffier L, Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995;12(5):921–927. doi: 10.1093/oxfordjournals.molbev.a040269. [DOI] [PubMed] [Google Scholar]
- Feng T, Zhu X. Genome-wide searching of rare genetic variants in WTCCC data. Hum Genet. 2010;128(3):269–280. doi: 10.1007/s00439-010-0849-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gusev A, Kenny EE, Lowe JK, Salit J, Saxena R, Kathiresan S, Altshuler DM, Friedman JM, Breslow JL, Pe'er I. DASH: A Method for Identical-by-Descent Haplotype Mapping Uncovers Association with Recent Variation. Am J Hum Genet. 2011;88(6):706–717. doi: 10.1016/j.ajhg.2011.04.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70(1):42–54. doi: 10.1159/000288704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hawley ME, Kidd KK. HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J Hered. 1995;86(5):409–411. doi: 10.1093/oxfordjournals.jhered.a111613. [DOI] [PubMed] [Google Scholar]
- Imhof JP. Computing the distribution of quadratic forms in normal variables. Biometrika. 1961;48:419–426. [Google Scholar]
- Keating BJ, Tischfield S, Murray SS, Bhangale T, Price TS, Glessner JT, Galver L, Barrett JC, Grant SF, Farlow DN, Chandrupatla HR, Hansen M, Ajmal S, Papanicolaou GJ, Guo Y, Li M, Derohannessian S, de Bakker PI, Bailey SD, Montpetit A, Edmondson AC, Taylor K, Gai X, Wang SS, Fornage M, Shaikh T, Groop L, Boehnke M, Hall AS, Hattersley AT, Frackelton E, Patterson N, Chiang CW, Kim CE, Fabsitz RR, Ouwehand W, Price AL, Munroe P, Caulfield M, Drake T, Boerwinkle E, Reich D, Whitehead AS, Cappola TP, Samani NJ, Lusis AJ, Schadt E, Wilson JG, Koenig W, McCarthy MI, Kathiresan S, Gabriel SB, Hakonarson H, Anand SS, Reilly M, Engert JC, Nickerson DA, Rader DJ, Hirschhorn JN, Fitzgerald GA. Concept, design and implementation of a cardiovascular gene-centric 50 k SNP array for large-scale genomic association studies. PLoS One. 2008;3(10):e3583. doi: 10.1371/journal.pone.0003583. [DOI] [PMC free article] [PubMed] [Google Scholar]
- King CR, Rathouz PJ, Nicolae DL. An evolutionary framework for association testing in resequencing studies. PLoS Genet. 2010;6(11) doi: 10.1371/journal.pgen.1001202. e1001202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kitsios GD, Zintzaras E. An NOS3 Haplotype is Protective against Hypertension in a Caucasian Population. Int J Hypertens. 2010;2010 doi: 10.4061/2010/865031. 865031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–321. doi: 10.1016/j.ajhg.2008.06.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li Y, Byrnes AE, Li M. To identify associations with rare variants, just WHaIT: Weighted haplotype and imputation-based tests. Am J Hum Genet. 2010;87(5):728–735. doi: 10.1016/j.ajhg.2010.10.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin DY, Huang BE. The use of inferred haplotypes in downstream analyses. Am J Hum Genet. 2007;80(3):577–579. doi: 10.1086/512201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin WY, Lee WC. Discovering joint associations between disease and gene pairs with a novel similarity test. BMC Genet. 2010;11:86. doi: 10.1186/1471-2156-11-86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin WY, Schaid DJ. Power comparisons between similarity-based multilocus association methods, logistic regression, and score tests for haplotypes. Genet Epidemiol. 2009;33(3):183–197. doi: 10.1002/gepi.20364. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin WY, Tiwari HK, Gao G, Zhang K, Arcaroli JJ, Abraham E, Liu N. Similarity-based multimarker association tests for continuous traits. Annals of Human Genetics. 2012;76:246–260. doi: 10.1111/j.1469-1809.2012.00706.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin WY, Zhang B, Yi N, Gao G, Liu N. Evaluation of pooled association tests for rare variant identification. BMC Proceedings. 2011;5:S118. doi: 10.1186/1753-6561-5-S9-S118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu PY, Zhang YY, Lu Y, Long JR, Shen H, Zhao LJ, Xu FH, Xiao P, Xiong DH, Liu YJ, Recker RR, Deng HW. A survey of haplotype variants at several disease candidate genes: the importance of rare variants for complex diseases. J Med Genet. 2005;42(3):221–227. doi: 10.1136/jmg.2004.024752. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Long JC, Williams RC, Urbanek M. An E-M algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet. 1995;56(3):799–810. [PMC free article] [PubMed] [Google Scholar]
- Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2) doi: 10.1371/journal.pgen.1000384. e1000384. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maxwell TJ, Bendall ML, Staples J, Jarvis T, Crandall KA. Phylogenetics applied to genotype/phenotype association and selection analyses with sequence data from angptl4 in humans. Int J Mol Sci. 2010;11(1):370–385. doi: 10.3390/ijms11010370. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34(2):188–193. doi: 10.1002/gepi.20450. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nelder JA, Wedderburn RWM. Generalized Linear Models. Journal of the Royal Statistical Society. Series A. 1972;135:370–384. [Google Scholar]
- Nolte IM, de Vries AR, Spijker GT, Jansen RC, Brinza D, Zelikovsky A, Te Meerman GJ. Association testing by haplotype-sharing methods applicable to whole-genome analysis. BMC Proc. 2007;1(Suppl 1):S129. doi: 10.1186/1753-6561-1-s1-s129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–838. doi: 10.1016/j.ajhg.2010.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rinaldo A, Bacanu SA, Devlin B, Sonpar V, Wasserman L, Roeder K. Characterization of multilocus linkage disequilibrium. Genet Epidemiol. 2005;28(3):193–206. doi: 10.1002/gepi.20056. [DOI] [PubMed] [Google Scholar]
- Roeder K, Bacanu SA, Sonpar V, Zhang X, Devlin B. Analysis of single-locus tests to detect gene/disease associations. Genet Epidemiol. 2005;28(3):207–219. doi: 10.1002/gepi.20050. [DOI] [PubMed] [Google Scholar]
- Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH, Cohen JC. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet. 2007;39(4):513–516. doi: 10.1038/ng1984. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Romeo S, Yin W, Kozlitina J, Pennacchio LA, Boerwinkle E, Hobbs HH, Cohen JC. Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J Clin Invest. 2009;119(1):70–79. doi: 10.1172/JCI37118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sboner A, Mu XJ, Greenbaum D, Auerbach RK, Gerstein MB. The real cost of sequencing: higher than you think! Genome Biol. 2011;12(8):125. doi: 10.1186/gb-2011-12-8-125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15(11):1576–1583. doi: 10.1101/gr.3709305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schaid DJ. Evaluating associations of haplotypes with traits. Genet Epidemiol. 2004;27(4):348–364. doi: 10.1002/gepi.20037. [DOI] [PubMed] [Google Scholar]
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002;70(2):425–434. doi: 10.1086/338688. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Scheffe H. The Analysis of Variance. New York: Wiley; 1959. [Google Scholar]
- Sha Q, Chen HS, Zhang S. A new association test using haplotype similarity. Genet Epidemiol. 2007;31(6):577–593. doi: 10.1002/gepi.20230. [DOI] [PubMed] [Google Scholar]
- Tong L, Yang J, Cooper RS. Efficient calculation of P-value and power for quadratic form statistics in multilocus association testing. Ann Hum Genet. 2010;74(3):275–285. doi: 10.1111/j.1469-1809.2010.00574.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tzeng JY, Devlin B, Wasserman L, Roeder K. On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet. 2003;72(4):891–902. doi: 10.1086/373881. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M. Gene-trait similarity regression for multimarker-based association analysis. Biometrics. 2009;65(3):822–832. doi: 10.1111/j.1541-0420.2008.01176.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet. 2011;89(2):277–288. doi: 10.1016/j.ajhg.2011.07.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yi N, Liu N, Zhi D, Li J. Hierarchical generalized linear models for multiple groups of rare and common variants: jointly estimating group and individual-variant effects. PLoS Genet. 2011;7(12) doi: 10.1371/journal.pgen.1002382. e1002382. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zeggini E. Next-generation association studies for complex traits. Nat Genet. 2011;43(4):287–288. doi: 10.1038/ng0411-287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhu X, Fejerman L, Luke A, Adeyemo A, Cooper RS. Haplotypes produced from rare variants in the promoter and coding regions of angiotensinogen contribute to variation in angiotensinogen levels. Hum Mol Genet. 2005;14(5):639–643. doi: 10.1093/hmg/ddi060. [DOI] [PubMed] [Google Scholar]
- Zhu X, Feng T, Li Y, Lu Q, Elston RC. Detecting rare variants for complex traits using family and unrelated data. Genet Epidemiol. 2010;34(2):171–187. doi: 10.1002/gepi.20449. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.