Abstract
Recently, Steen et al.1 proposed a novel two-stage approach for family-based genome-wide association studies. In the first stage, a test based on between-family information is used to rank SNPs according to their p-values or conditional power of the test. In the second stage, the R most promising SNPs are tested using a family-based association test. We call this two-stage approach top R method. Ionita-Laza et al.2 proposed an exponential weighting method within a two-stage framework. In the second stage of this approach, instead of testing top R SNPs it tests all SNPs and weights the p-values of association test according to the information of the first stage. However, both of the top R and exponential weighting methods only use the information from the first stage to rank SNPs. It seems that the two methods do not use information from the first stage efficiently. Furthermore, it may be unreasonable for the exponential weighting method to use the same weight for all SNPs within a group when only one or a few SNPs are related to disease.
In this article, we propose a data-driven weighting scheme within a two-stage framework. In this method, we use the information from the first stage to determine a SNP specific weight for each SNP. We use simulation studies to evaluate the performance of our method. The simulation results showed that our proposed method is consistently more powerful than the top R method and the exponential weighting method regardless of LD structure, population structure and family structure.
Keywords: two-stage, data-driven weighting, linkage disequilibrium, population stratification
Introduction
Family-based genome-wide association studies have identified susceptibility loci for some complex human diseases.3 Currently, family-based association tests such as the TDT and its extensions4–8 are the most commonly used methods to detect disease susceptibility loci in genome-wide association studies. This kind of methods uses the within-family information, but not the between-family information. The reason is that the methods using between-family information may have the problem of population stratification. Recently, Steen et al.1 proposed a two-stage test for family-based genome-wide association studies. In the first stage, a test based on between-family information is used to screen SNPs, that is, choose R best SNPs (SNPs with the smallest p-values). In the second stage, a family-based test based on within-family information is used to test the R selected SNPs for association. The two-stage test is robust to population stratification because the association is determined by the family-based test in the second stage. Furthermore, since the statistic used in the first stage is statistically independent of that in the second stage, the overall significance level of the tests in the second stage does not need to be adjusted for the first stage. This two-stage test may be more powerful than family-based tests.1 Feng et al.9 is further extended this two-stage approach to deal with general pedigrees. We call the two-stage approaches, Steen et al.1 and Feng et al., 9 top R method. One problem of the top R method is how to choose the value of R . Steen et al.1 suggested R = 10 . Feng et al.9 pointed out that when the SNPs were independent, 5 to 20 were good choices for R and when there were LD between SNPs, the optimal value of R was between 100 and 500. In fact, the optimal value of R depends on the LD structure between SNPs and therefore it is difficult to determine the optimal value for R .
In order to avoid the problem of choosing the value of R in the top R method, Ionita-Laza et al.2 proposed an exponential weighting method within the two-stage framework. In this approach, SNPs are ordered according to their p-values of the test used in the first stage. Then, the SNPs are divided into groups with the first group containing r1 SNPs and having weight , the second group containing r2 = 2r1 SNPs and having weight , and so on. In the second stage, all SNPs are tested using a family-based test. For a SNP in the ith group with a p-value of pi , if pi ≤wi α , the SNP is declared to be significant at a significance level of α . Ionita-Laza et al.2 showed that the exponential weighting method is more powerful than the top R method. However, the optimal value for r1 (the number of SNPs in the first group) also depends on the LD structure between SNPs, though r1 is more robust to the LD structure than R in the top R method. Furthermore, it may be unreasonable to use the same weight for all SNPs within the same group when only one or a few SNPs are related to disease.
In this article, we propose a data-driven weighting scheme within a two-stage framework. In this method, we use the information from the test in the first stage to determine a SNP specific weight for each SNP. Our method has a similar idea with that of Rubin et al.10 and Roeder et al.11 which used information from a linkage study or an independent association study to determine a SNP specific weight for a case-control design. We use simulation studies to evaluate the performance of our method. The simulation results show that the proposed method is robust to LD structure and is more powerful than the top R method with the optimal choice of R and the exponential weighting method with the optimal choice of r1 .
Methods
Data-driven weighting method
In a two-stage approach, we call a test used in the first stage a screening test and call a test used in the second stage an association test. Within the two-stage framework, the data-driven weighting method has the following steps:
Test all SNPs using a screening test and order SNPs according to their p values of the test. In the following discussion, we assume that the SNPs have been ordered.
Like Ionita-Laza et al.2 did, we divide the SNPs into groups with the first group containing k1 SNPs, the second group containing k2 = 2k1 SNPs, the third group containing k3 = 2k2 = 22 k1 , and so on.
- Let denote the p value of the screening test at the j th SNP in the i th group. Within each group, we will give an importance measure for each SNP. The importance measure of the j th SNP in the i th group is given by
where ε is a small number to make the algorithm stable (in our simulation studies, we use ε = 10−6). Based on the importance measure, we define a weight for each SNP. The weight for the j th SNP in the i th group is given by . Test each SNP using an association test. Denote the p value of the association test at the j th SNP in the i th group. Then, we declare the j th SNP in the i th group is significant at a level of α if .
k1 is a parameter in our algorithm. We use k1 = 20 in our simulation studies. However, the results are robust to the choice of k1 since we use different weights for SNPs within each group. More discussion will be given later in the discussion section.
Statistics
We need two test statistics. One is for the screening test used in the first stage. The other is for the association test used in the second stage. The two test statistics used in this article are those proposed by Feng et al.9. These test statistics can be applied to general pedigree data and can corporate founder’s phenotype. Briefly, consider a sample containing n pedigrees. Suppose that the i th pedigree contains ni informative nuclear families (with both parents and at least one being heterozygous or with at most one parent and two or more children) and the j th informative nuclear family in the i th pedigree contains nij children. For the j th informative nuclear family in the i th pedigree, we use (YijF, YijM, Yij1, …, Yijnij) and (XijF, XijM, Xij1, …, Xijnij) to denote trait values and genotypic scores of the parents and children. We define the mean within-family genotypic score as if the genotypic information of both parents is available, and as if otherwise. In addition, we define the mean within-family trait value of the children as and the overall mean genotypic score and trait value across the whole sample as X̄ and Ȳ . Then, the screening test statistic is given by
Where and δijF = 1 (δijM = 1) if the father (mother) of the jth nuclear family in the ith pedigree is a founder, and = 0 otherwise.
The association test proposed by Feng et al.9 used in the second stage is the quantitative pedigree disequilibrium test.12 The test statistic is given by
Where . Under the null hypothesis of no association, both Tscreen and Ta asymptotically follow the standard normal distribution. The screenin g test uses the between-family and founder’s information. The association test uses t he within-family information and thus is robust to population stratification.
Simulation designs
We evaluate the type I error of our proposed data-driven weighting method and compare power of the method with that of the top R and exponential weighting methods by simulation studies. We carry out simulation studies under several scenarios which include different LD structures, family structures, and population structures. Under each scenario, we simulate M = 100,000 bi-allelic markers for each individual.
A homogeneous population
In a homogeneous population, the simulation studies include two types of family structures and two types of LD structures. The two types of family structures are trio structure and the CEPH family structure.9, 13 Each CEPH family contains three generations: 4 founders, 2 parents and 8 grandchildren. See Morley et al.13 for more details. The two types of LD structures include (1) no LD between SNPs and (2) with LD between SNPs.
For each family structure, we generate genotypes of sampled individuals by first generating genotypes of the founders and then generating genotypes of the children by Mendelian law. For the case of no LD, we generate founder’s genotypes at each SNP by assuming that the minor allele frequency follows a uniform distribution on interval [0.1, 0.5]. For the case of with LD, we generate the founder’s genotypes using the ms program by Hudson.14 In the ms program, we use a mutation rate of 2.5 × 1 0−8 per nucleotide per generation, a recombination rate of 10−8 per pair of nucleotides per generation, and an effective population size of 10,000. These choices were also adopted in Nordborg and Tavare 15, Kimmel and Shamir 16 and Feng et al. 9
Under each scenario, the sample sizes are 400 trios in the trio family structure and 200 CEPH families in the extended family structure. For power comparison, we suppose that there is one disease locus. After we generate the genotypes for all sampled individuals, we randomly choose one SNP at which the minor allele frequency among founders is between 0.1 and 0.4 as the disease locus.
To evaluate type I error rates, we follow Feng et al. 9 to generate trait values under null hypothesis. For a nuclear family with m children, let Y1 = (yF, yM) and Y2 = (y1, y2, ⋯, ym) denote the trait values of the parents and the m children. Assume that (Y1, Y2) follows a multivariate normal distribution with mean vector of zero and variance-covariance matrix of
This covariance structure means that the father and mother are independent, and parents with children and children with children are correlated with correlation coefficient of ρ (ρ = 0.2 is used in this study). The conditional distribution of Y2 = (y1,…, ym) given parental trait values Y1 = (yF, yM) is a multivariate normal distribution with mean vector of and variance-covariance matrix of . To generate trait values of all individuals in a pedigree, we first generate the trait value of each founder by using a standard normal distribution. The trait values of other members can be generated by a multivariate normal distribution with mean vector of µc and variance-covariance matrix of Σc, given the trait values of their parents.
For power comparisons, we generate trait values of a pedigree with B members from model yb = xbβ + εb (b = 1, 2, …, B), where xb is the additive genotypic score at the disease locus, β is a constant and ε1, …, εB are background trait values generated under the null hypothesis using aforementioned method. The value of β is determined by heritability h and is given by , where f is the minor allele frequency at the disease locus.
A structured population
Consider a structured population that consists of two distinct subpopulations with different allele frequencies and distinct phenotypic means. In this set of simulation studies, we consider two family structures as we did for the homogeneous population but only one LD structure, that is, no LD between SNPs. To generate genotypes of founders in a structured population, we follow Ionita-Laza et al. 2 For each SNP, randomly select a number between 0.1 and 0.9 as the ancestral population allele frequency p . Then, independently draw two values from a beta-distribution with parameters p(1 − Fst)/Fst and (1 − p)(1 − Fst)/Fst and scale them to the interval (0.1, 0.9) as allele frequencies for the two subpopulations, where Fst is Wright’s measure of population subdivision.17
The phenotype under null hypothesis is generated similar to that in a homogeneous population. The only difference is that, in the structured population, we generate trait values of the founders in subpopulation 1 from the standard normal distribution and those in subpopulation 2 from a normal distribution with mean 0.2 and variance 1. As argued by Ionita-Laza et al.,2 the differences in allele frequencies and phenotypic means together result in spurious associations. For power comparisons, trait values are generated in the same way as that in a homogeneous population.
Results
Type I error rates
Under each of the simulation scenarios, we generated T = 1,000 datasets to estimate type I error rates of the three approaches. For each approach, we estimate its type I error rate as
where for the tth dataset δ0t = 1 if one or more markers were claimed to be significant, and = 0 otherwise.
For 1,000 replications, the 95% confidence interval of type I error rates is (0.036, 0.064) for a nominal level of 0.05. Table 1 and Table 2 list the estimated type I error rates of the three approaches in the case of a homogeneous population and a structured population, respectively. From the two tables, we can say that, either in a homogeneous population or a structured population, almost all of the estimated type I error rates are within the 95% confidence interval, which means that the three approaches are robust to LD structure and population stratification.
Table 1.
400 tros | 200 CEPHs | |||
---|---|---|---|---|
Without LD | With LD | Without LD | With LD | |
Data-driven weighting | 0.051 | 0.024 | 0.041 | 0.029 |
Exponential weighting | 0.047 | 0.045 | 0.054 | 0.033 |
Top R | 0.048 | 0.052 | 0.049 | 0.042 |
Note: In the top R method, R = 20 ; in the exponential weighting method, r1 = 20.
Table 2.
400 tros | 200 CEPHs | |||||
---|---|---|---|---|---|---|
Fst | Fst | |||||
0.001 | 0.005 | 0.01 | 0.001 | 0.005 | 0.01 | |
Data-driven weighting | 0.041 | 0.053 | 0.037 | 0.052 | 0.051 | 0.059 |
Exponential weighting | 0.034 | 0.038 | 0.055 | 0.055 | 0.039 | 0.053 |
Top R | 0.052 | 0.047 | 0.054 | 0.051 | 0.049 | 0.056 |
Note: In the top R method, R = 20 ; in the exponential weighting method, r1 = 20. Fst is Wright’s measure of population subdivision.
Power comparisons
For power comparisons, we simulated T = 1,000 datasets under each of the simulation scenarios. Each dataset contains either 400 trios or 200 CEPH pedigrees. For a given approach, we assess its power as the proportion of the simulated replications at which the method successfully identified the disease locus. Precisely, we assess the power as
where for the t th dataset δ1t = 1 if the disease locus is detected, and = 0 otherwise.
With Parental Phenotypes
We assume that parental phenotypes are available. In the homogeneous population, we first consider the trio design. In the trio design, we compare the power of our data-driven weighting scheme with that of the top R method for different values of R and exponential weighting method for different values of r1 (Figure 1(a)). From Figure 1(a), we can see that our data-driven weighting method is consistently more powerful than the top R method and exponential weighting method regardless of marker LD and the values of R and r1 . Figure 1(b) gives power comparisons of the three methods for different values of heritability and different LD structures when R and r1 in the top R method and the exponential weighting method are chosen by their corresponding optimal values. Again, Figure 1(b) shows that our proposed method is consistently more powerful than the other two methods for different values of heritability and different LD structures.
For the CEPH family structure, we use the same simulation setup as that for the trio family structure. The pattern of power comparisons for the CEPH family structure (Figure 2) is very similar to that for the trio family structure. Summarizing the results mentioned above, we may conclude that our proposed weighting scheme is more powerful than the top R method and exponential weighting method regardless of LD structure, family structure, and heritability.
We also compare power of the three methods in a structured population. The results of power comparisons are summarized in Figure 3. From this figure, we can make the following two conclusions. One is that our data-driven weighting method is more powerful than the other two methods for different family structures and different values of Fst (which measures how difference of the two subpopulations). The other is that the power of all the three methods does not affect much by different values of Fst which means that the power of the three methods is relatively robust to population stratification. Ionita-Laza et al. 2 has pointed out that the power of the top R method will be affected by Fst if R is fixed, for example, R = 10. Our results do not contradict with that of Ionita-Laza et al. because our conclusion for the top R and exponential weighting methods is based on the fact that R and r1 in the two methods are chosen by their optimal values and the optimal values depend on the value of Fst .
Without Parental Phenotypes
In this set of simulations, we assume that parental phenotypes are not available. The simulation setup is the same as that in the section of “With Parental Phenotypes”, but the minor allele frequency at each SNP (in the case of no LD) is simulated from a beta distribution with parameters (scale them to the interval (0.1, 0.5)) instead of a uniform distribution. The power comparisons in this set of simulations are summarized in figure 4 to figure 6. From these figures, we can see that the patterns of power comparisons without parental phenotypes are very similar to that with parental phenotypes.
Discussions
In this article, we proposed a novel data-driven weighting scheme for family-based two-stage association studies. This scheme improves the exponential weighting method of Ionita-Laza et al.2 by allowing different weights for SNPs in the same group. Our simulation studies show that the proposed weighting scheme is consistently more powerful than the top R method with the optimal value of R and the exponential weighting scheme with the optimal value of r1 in all the cases we considered in the simulation studies.
The innovation of our new scheme is that it uses the between-family information to calculate marker-specific weights. In contrast, the classical top R and exponential weighting approaches only use the between-family information to rank the SNPs. Our proposed weighting scheme is not only applicable to two-stage family-based association studies but also applicable to other two-stage approaches as long as the statistics used in the two stages are independent or orthogonal. For example, Chung et al.18 investigated the orthogonal property between some linkage statistics and family-based association statistics. Our weighting scheme can be applied to a two-stage approach in which the first stage is a linkage test and the second stage is a family-based association test and the two tests are independent or orthogonal.
One thing we should mention is that when we performed the power comparison our proposed method used a constant value for parameter k1 (k1 = 20) and the top R method and exponential weighting method used the optimal value of R and r1 , respectively. In practice, it is difficult to know the optimal values for R or r1 . The optimal value of R or r1 depends on multiple factors, e.g., pedigree structure, marker LD, heritability and so on. The optimal value is small in the absence of LD between SNPs. In the presence of LD, the optimal value of R or r1 could be much larger. To evaluate the effect of parameter k1 in our proposed method, we have done simulation studies for k1 = 1, 5, 10, 20, 50, 100 . The simulation studies (results are not shown) showed that the results of our proposed method are very similar for different values of k1 which means that our proposed method is relatively robust to different choices of k1 .
In this study, we assume consistent genetic effects across all ages. We realize that this assumption may not be true for some diseases, e.g., childhood asthma versus adult asthma, childhood obesity versus adult obesity (see Lasky-Su et al.19). For the diseases where the genetic effects are age dependent, we may need to incorporate age of onset into association tests. However, further investigation on how to incorporate age of onset into testing is needed.
Table 3.
Scale on x axis |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
R | 1 | 5 | 10 | 20 | 50 | 100 | 200 | 500 | 10,000 |
r1 | 1 | 3 | 5 | 10 | 20 | 50 | 100 | 200 | 500 |
Acknowledgements
This work was supported by National Institute of Health grants R01 GM069940 and the Overseas-Returned Scholars Foundation of Department of Education of Heilongjiang Province (1152HZ01).
References
- 1.Steen KV, McQueen MB, Herbert A, et al. Genomic screening and replication using the same dataset in family-based association testing. Nature Genetics. 2005;37:683–691. doi: 10.1038/ng1582. [DOI] [PubMed] [Google Scholar]
- 2.Ionita-Laza I, McQueen MB, Laird NM, Lange C. Genome-wide weighted hypothesis testing in family-based association studies, with an application to a 100k scan. Am J Hum Genet. 2007;81:607–614. doi: 10.1086/519748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Herbert A, Gerry NP, McQueen MB, et al. A common genetic variant is associated with adult and childhood obesity. Science. 2006;312:279–283. doi: 10.1126/science.1124779. [DOI] [PubMed] [Google Scholar]
- 4.Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet. 1993;52(3):506–516. [PMC free article] [PubMed] [Google Scholar]
- 5.Claton D, Jones H. Transmission/Disequilibrium test for extended marker haplotype. Am J Hum Genet. 1999;65:1161–1169. doi: 10.1086/302566. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Schaid DJ, Rowland CM. Quantitative trait transmission disequilibrium test: allowance for missing parents. Genetic Epidemiol. 1999;17:S307–S312. doi: 10.1002/gepi.1370170752. [DOI] [PubMed] [Google Scholar]
- 7.Zhao H, Zhang S, et al. Transmission/Disequilibrium test using multiple tightly linked markers. Am J Hum Genet. 2000;67:936–946. doi: 10.1086/303073. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Selman H, Roeder K, Devlin B. Transmission/Disequilibrium test meets measured haplotype analysis: Family-based association guided by evolution of haplotype. Am J Hum Genet. 2001;68:1250–1263. doi: 10.1086/320110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Feng T, Zhang S, Sha Q. Two-stage association tests for genome-wide association studies based on family data with arbitrary family structure. Eu J Hum Genet. 2007;15:169–1175. doi: 10.1038/sj.ejhg.5201902. [DOI] [PubMed] [Google Scholar]
- 10.Rubin D, van der Laan M, Dudoit S. Multiple testing approaches which are optimal at a simple alternative. Collection of Biostatistics Research Archive. 2006. http://www.bepress.com/ucbbiostat/paper 171.
- 11.Roeder K, Devlin B, Wasserman L. Improving Power in Genome-Wide Association Studies: Weights Tip the Scale. Genet Epidemiol. 2007;31:741–747. doi: 10.1002/gepi.20237. [DOI] [PubMed] [Google Scholar]
- 12.Zhang S, Zhang K, Li J, Sun FZ, Zhao H. Test of linkage and association for quantitative traits in general pedigree: the quantitative pedigree disequilibrium test. Genetic Epi. 2001;18(Supp 1):370–375. doi: 10.1002/gepi.2001.21.s1.s370. [DOI] [PubMed] [Google Scholar]
- 13.Morley M, Molony CM, Weber T, et al. Genetic analysis of genome-wide variation in human gene expression. Nature. 2004;430:743–747. doi: 10.1038/nature02797. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformations. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
- 15.Nordborg M, Tavare S. Linkage disequilibrium: what history has to tell us? Trends Genet. 2002;18:83–90. doi: 10.1016/s0168-9525(02)02557-x. [DOI] [PubMed] [Google Scholar]
- 16.Kimmel G, Shamir R. A fast method for computing high significance disease association in large population-based studies. Am J Hum Genet. 2006;79:481–492. doi: 10.1086/507317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12. doi: 10.1007/BF01441146. [DOI] [PubMed] [Google Scholar]
- 18.Chung RH, Hauser ER, Martin ER. Interpretation of simultaneous linkage and family-based association tests in genome screens. Genet Epidemiol. 2007;31:134–142. doi: 10.1002/gepi.20196. [DOI] [PubMed] [Google Scholar]
- 19.Lasky-Su J, Lyon HN, Emilsson V, et al. On the replication of genetic associations: timing can be everything! Am J Hum Genet. 2008;82:849–858. doi: 10.1016/j.ajhg.2008.01.018. [DOI] [PMC free article] [PubMed] [Google Scholar]