Abstract
Motivation: A two-stage association study is the most commonly used method among multistage designs to efficiently identify disease susceptibility genes. Recently, some SNP studies have utilized more than two stages to detect disease genes. However, there are few available programs for calculating statistical powers and positive predictive values (PPVs) of arbitrary n-stage designs.
Results: We developed programs in the R language for multistage case–control association studies. In our programs, input parameters include numbers of samples and candidate loci, genome-wide false positive rate and proportions of samples and loci to be selected at the k-th stage (k=1,…, n). The programs output statistical powers, PPVs and numbers of typings in arbitrary n-stage designs. The programs can contribute to prior simulations under various conditions in planning a genome-wide association study.
Availability: The R programs are freely available for academic users and can be downloaded from http://www.med.niigata-u.ac.jp/eng/resources/informatics/gwa.html
Contact: nktmr@m12.alpha-net.ne.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Genome-wide association (GWA) case–control studies using genetic polymorphism data, including data on single nucleotide polymorphisms (SNPs) and microsatellite markers, aim to identify genes involved in common human diseases. Some multistage GWA case–control studies have been proposed as powerful and cost-effective study designs. The most commonly used method among the multistage designs is a two-stage design. In these designs, all markers are genotyped in the first stage and statistical tests are conducted for each marker. Only significant markers selected in the first stage remain for the following stages. At the end of the last stage, the overall evidence for association is evaluated by replication-based analysis (RBA) (Satagopan et al., 2002) or joint analysis (JA) (Skol et al., 2006).
Recently, some SNP-based studies have used multistage designs with more than two stages (Kathiresan et al., 2007; Prentice et al., 2006). However, there are few available programs for multistage designs. Skol et al. (2006) proposed JA, which uses a new statistic formed as a linear combination of the stage statistics weighted by the square root of the proportion of subjects, and presented power calculations for one- and two-stage designs on their website, but they did not address designs with more than two stages. In previous studies, statistical power was used to evaluate the performance of different designs. On the other hand, positive predictive values (PPVs) are more practical for identifying disease susceptibility genes implicated in complex traits. However, there is no program that calculates statistical powers, PPVs and numbers of typings of multistage designs for arbitrary n >2.
In this article, we present programs, written in the R language (Crawley, 2005), for calculating these indicators for arbitrary n-stage designs, using either disease model parameters (prevalence, disease-associated allele frequency and genetic relative risk) or 2×3 contingency table data as input parameters. The programs can output the indicators for allelic, additive, dominant and recessive models under various conditions.
2 METHODS
2.1 Statistical background for detecting risk SNPs
We extended the power-evaluation methods of Skol et al. (2006) for two-stage designs to generalized n-stage designs (n being an arbitrary natural number). Let πs,k be the proportion of the samples used at the k-th stage and let πm,k be the proportion of loci to be selected at the k-th stage (k=1,…, n). In RBA, statistical tests are based on the difference between the expected frequency of the disease-associated allele in cases (p′) and that in controls (p). The critical value for the k-th stage of RBA (Cm,k) is obtained as Φ−1[1−πm,k/2] under the null hypothesis, and the powers of k-stage designs of RBA are calculated by multiplying, across stages, the probabilities obtained from the cumulative distribution function of the normal distribution under the alternative hypothesis.
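The RBA calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R implementation: the function names are hypothetical, the stage statistic is assumed to be approximately normal with unit variance under the alternative hypothesis, and each stage is assumed to use 2 × N × πs,k alleles per group from disjoint samples.

```python
from math import prod, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def rba_critical_value(pi_m):
    """Two-sided critical value C_m,k = Phi^-1(1 - pi_m,k / 2)."""
    return STD_NORMAL.inv_cdf(1.0 - pi_m / 2.0)


def stage_power(p_case, p_ctrl, n_per_group, pi_s, pi_m):
    """Probability that one stage's statistic exceeds its critical value.

    Assumption (not from the paper): the statistic is roughly N(mu, 1),
    where mu is the allele-frequency difference divided by its standard
    error, with 2 * n_per_group * pi_s alleles sampled in each group.
    """
    alleles = 2.0 * n_per_group * pi_s
    se = sqrt((p_case * (1.0 - p_case) + p_ctrl * (1.0 - p_ctrl)) / alleles)
    mu = (p_case - p_ctrl) / se
    c = rba_critical_value(pi_m)
    # Two-sided rejection region: statistic above c or below -c.
    return (1.0 - STD_NORMAL.cdf(c - mu)) + STD_NORMAL.cdf(-c - mu)


def rba_power(p_case, p_ctrl, n_per_group, pi_s, pi_m):
    """Multiply the per-stage probabilities across all stages."""
    return prod(
        stage_power(p_case, p_ctrl, n_per_group, s, m)
        for s, m in zip(pi_s, pi_m)
    )
```

A call such as `rba_power(0.49, 0.39, 1000, [0.5, 0.5], [0.01, 0.01])` evaluates a two-stage design. When p′ = p, each stage passes a locus with probability πm,k, so the function returns the product of the πm,k, which is a convenient sanity check.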
Test statistics of JA are represented as the weighted sum of the statistics of each stage. Because the JA statistic is, by definition, not independent of the statistics of the previous stages, the critical values of JA (Cj,k) are calculated iteratively by solving an equation involving n-fold integrals under the null hypothesis using the Newton–Raphson method. The powers of JA are then calculated with Cj,k under the alternative hypothesis.
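The weighted sum can be written out explicitly. The following is a sketch of one plausible form, not the authors' stated definition. It assumes that the stage statistics z_k are independent and standard normal under the null hypothesis, that the proportions πs,k sum to one, and that the weights are the square roots of the sample proportions, as in the two-stage JA of Skol et al. (2006). The symbols z_k and z_J are introduced here for illustration only.

```latex
z_J = \sum_{k=1}^{n} \sqrt{\pi_{s,k}}\; z_k ,
\qquad
\operatorname{Var}(z_J) = \sum_{k=1}^{n} \pi_{s,k} = 1 ,
\qquad
\operatorname{Cov}(z_k, z_J) = \sqrt{\pi_{s,k}} .
```

For n = 2 this reduces to z_J = √πs,1 z_1 + √(1−πs,1) z_2. The nonzero covariance between z_k and z_J is why the critical values require a multivariate normal integral rather than a product of univariate probabilities.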
PPVs can be calculated from the number of true positive loci (tpk) and false positive loci (fpk). Let N be the number of samples in cases and controls, M be the number of candidate loci and m be the number of true loci. tpk is given as m × (power of RBA or JA), and fpk is given as $(M-m)\times\prod_{k=1}^{n}\pi_{m,k}$.
Therefore, PPV at the k-th stage is calculated as tpk/(tpk+fpk).
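The PPV formula is simple enough to check by hand. The sketch below implements it directly from the definitions of tpk and fpk given above. The function name and the numbers in the usage note are hypothetical and are not taken from the paper's simulation.

```python
from math import prod


def ppv(power, n_loci, n_true, pi_m):
    """PPV = tp / (tp + fp) for a multistage design.

    power  : power of RBA or JA for a true locus
    n_loci : number of candidate loci (M)
    n_true : number of true loci (m)
    pi_m   : per-stage proportions of loci selected (pi_m,k)
    """
    tp = n_true * power                    # expected true positives
    fp = (n_loci - n_true) * prod(pi_m)    # expected false positives
    return tp / (tp + fp)
```

With illustrative values of 1000 candidate loci, 10 true loci, a power of 0.8 and selection proportions of 0.1 and 0.01, `ppv(0.8, 1000, 10, [0.1, 0.01])` gives tp = 8 and fp = 0.99, so the PPV is 8/8.99.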
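The number of typings of a one-stage design is given as 2MN.
The number of typings of an n-stage design is obtained by summing, over the stages, the typings performed on the loci that remain at each stage.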
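The n-stage count can be sketched as follows. The formula used here is an assumption, not a formula taken from the paper: stage k is taken to genotype a fraction πs,k of the 2N samples on only those loci that survived stages 1 to k−1. The function name is hypothetical.

```python
def n_typings(n_samples, n_loci, pi_s, pi_m):
    """Total genotypings for an n-stage design (assumed formula).

    n_samples : N, so that a one-stage design needs 2 * M * N typings
    n_loci    : M, the number of candidate loci in stage 1
    pi_s      : per-stage proportions of samples (pi_s,k)
    pi_m      : per-stage proportions of loci selected (pi_m,k)
    """
    total = 0.0
    remaining = float(n_loci)  # loci genotyped in the current stage
    for share, selected in zip(pi_s, pi_m):
        total += 2.0 * n_samples * share * remaining
        remaining *= selected  # only selected loci reach the next stage
    return total
```

With a single stage the function returns 2MN, in agreement with the one-stage expression. The selection proportion of the final stage does not affect the count, because no genotyping follows it.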
2.2 Algorithm and programs for multistage designs
The algorithm for calculating the powers, PPVs and numbers of typings of multistage designs consists of the following three modules: a module for internal parameters, a module for RBA and a module for JA.
Let a candidate locus have two alleles, A and a, and let the penetrance of genotype aa be f0. The relative risks of Aa and AA with respect to aa in a human population are then f1/f0 and f2/f0, respectively. Using disease model parameters such as the prevalence (Prev) and the disease-associated allele rate (ppo) as input parameters, f0, f1 and f2 in the allelic model, in which the numbers of each allele in cases and controls are counted, are calculated as follows:
[Equation not available in the source]
Therefore, p′ and p are calculated as follows:
[Equation not available in the source]
In additive models, dominant models and recessive models, those genotype rates in cases and controls are calculated according to each definition (refer to Supplementary Material).
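For the allelic model, the penetrances and the allele frequencies p′ and p can be sketched as below. This is an illustration under assumptions that the paper does not state explicitly: Hardy–Weinberg genotype proportions in the population and a multiplicative model in which f1/f0 = GRR and f2/f0 = GRR². The function names are hypothetical.

```python
def allelic_model(prev, freq, grr):
    """Penetrances and allele frequencies under an assumed multiplicative model.

    prev : disease prevalence (Prev)
    freq : population frequency of the disease-associated allele A
    grr  : genetic relative risk per copy of A (assumed f1/f0)
    Returns (f0, f1, f2, p_case, p_ctrl).
    """
    # Prev = f0*(1-q)^2 + 2*f1*q*(1-q) + f2*q^2 with f1 = grr*f0, f2 = grr^2*f0
    f0 = prev / (1.0 - freq + grr * freq) ** 2
    f1 = grr * f0
    f2 = grr ** 2 * f0
    het = freq * (1.0 - freq)  # half of the heterozygote frequency
    # Frequency of allele A among cases (p') and among controls (p).
    p_case = (freq ** 2 * f2 + het * f1) / prev
    p_ctrl = (freq ** 2 * (1.0 - f2) + het * (1.0 - f1)) / (1.0 - prev)
    return f0, f1, f2, p_case, p_ctrl


def allelic_odds_ratio(p_case, p_ctrl):
    """Odds ratio of carrying allele A in cases versus controls."""
    return (p_case / (1.0 - p_case)) / (p_ctrl / (1.0 - p_ctrl))
```

As a consistency check, Prev=0.1, an allele frequency of 0.4 and GRR=1.44 give an allelic odds ratio of about 1.50 under these assumptions, which agrees with the OR quoted for the simulation in Section 2.3.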
The programs for calculating the indicators were written in the R language (version 2.7.1). In the programs, the n-fold integrals are evaluated with the ‘pmvnorm’ function, which computes the distribution function of the multivariate normal distribution for arbitrary limits and correlation matrices based on algorithms by Genz (1992). We provide the R package and a web user interface (WUI) for these calculations on our website (see Availability).
2.3 Simulation algorithm
We assumed the following conditions for calculating the powers, PPVs and numbers of typings in one- to five-stage designs: N=1000, M=500 000, m=5, genome-wide false positive rate (αgenome)=0.05 and odds ratio (OR)=1.5, which corresponds to Prev=0.1, a disease-associated allele frequency of 0.4 and a genetic relative risk (GRR) of 1.44, under an allelic model. The samples were divided equally among the stages. In addition, the πm,k were set to equal proportions whose product is 0.0001, so that the number of loci remaining at the final stage is the same for every design.
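The equal-split settings follow from simple arithmetic: with n stages, each stage uses 1/n of the samples and selects a proportion 0.0001^(1/n) of the loci. The sketch below builds these settings for one to five stages. The function and variable names are hypothetical.

```python
def equal_split_design(n_stages, pi_m_product=1e-4):
    """Equal sample shares and equal selection proportions per stage.

    Returns (pi_s, pi_m), where every pi_s,k is 1/n and every pi_m,k is
    the n-th root of pi_m_product, so the pi_m,k multiply to that product.
    """
    pi_s = [1.0 / n_stages] * n_stages
    pi_m = [pi_m_product ** (1.0 / n_stages)] * n_stages
    return pi_s, pi_m


# Settings for the one- to five-stage designs compared in the simulation.
designs = {n: equal_split_design(n) for n in range(1, 6)}
```

For example, the two-stage design selects a proportion 0.01 of the loci at each stage, and in every design the product of the πm,k multiplied by M=500 000 is 50 loci.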
3 RESULTS
Figure 1 shows the powers, PPVs and numbers of typings of RBA and JA in one- to five-stage designs. The powers of JA were always higher than those of RBA. In RBA, the numbers of typings and the powers decreased monotonically as the number of stages increased. In JA, however, the powers decreased more slowly than in RBA. The powers of the three-stage design were higher than those of the two-, four- and five-stage designs. On the other hand, the PPVs of both RBA and JA plateaued at high values.
Fig. 1.
Powers and PPVs and numbers of typings by RBA and JA in one- to five-stage designs.
4 DISCUSSION
Multistage case–control association designs have the distinct advantage of considerable reduction in the number of typings, while maintaining high powers for many practical projects to detect disease-related genes.
We wrote programs in the R language for multistage designs with an arbitrary number of stages. In this study, the properties of multistage designs under RBA and JA were investigated by comparing one- to five-stage designs. The powers of JA were larger than those of RBA for any number of stages, and three-stage designs were superior in power to two-, four- and five-stage designs under the condition that the samples are equally divided among the stages (refer to Supplementary Material).
Factors affecting powers include N, M, πs,k, πm,k and study designs. However, there is no simple way to predict the powers of various study designs. Therefore, our programs should be useful to the research community at the planning stage of GWA studies.
Funding: This research was supported by grants-in-aid for scientific research [Based research (B)] by the Japanese Ministry of Education, Culture, Sports, Science and Technology.
Conflict of Interest: none declared.
REFERENCES
- Crawley MJ. Statistics: An Introduction Using R. England: John Wiley & Sons; 2005.
- Genz A. Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1992;1:141–150.
- Kathiresan S, et al. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med. Genet. 2007;8 doi: 10.1186/1471-2350-8-S1-S17. http://www.biomedcentral.com/1471-2350/8/S1/S17 (last accessed on December 7, 2008)
- Prentice RL, Qi L. Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation. Biostatistics. 2006;7:339–354. doi: 10.1093/biostatistics/kxj020.
- Satagopan JM, et al. Two-stage designs for gene-disease association studies. Biometrics. 2002;58:163–170. doi: 10.1111/j.0006-341x.2002.00163.x.
- Skol AD, et al. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 2006;38:209–213. doi: 10.1038/ng1706.