Author manuscript; available in PMC: 2010 Mar 11.
Published in final edited form as: Ann Hum Genet. 2008 Jan 23;72(Pt 3):375–387. doi: 10.1111/j.1469-1809.2007.00419.x

Optimal two-stage design for case-control association analysis incorporating genotyping errors

Yijun Zuo 1,, Guohua Zou 2,3,, Jiexun Wang 2,4, Hongyu Zhao 5, Hua Liang 3,*
PMCID: PMC2836813  NIHMSID: NIHMS175592  PMID: 18215207

Abstract

Two-stage design is a cost-effective approach for identifying disease genes in genetic studies, and it has received much attention recently. In general, there are two types of two-stage designs, which differ in the methods and samples used to measure allele frequencies in the first stage: (1) individual genotyping is used in the first stage; (2) DNA pooling is used in the first stage. In this paper, we focus on the latter. Zuo et al. (2006) investigated the statistical power of such a design, among other things, but the cost of the study was not taken into account. The purpose of this paper is to study the optimal design under a given overall cost. We investigate how to allocate resources between the two stages. Note that in addition to the measurement errors associated with DNA pooling, genotyping errors are also unavoidable with individual genotyping. Therefore, we discuss the optimal design incorporating the genotyping errors associated with individual genotyping. The joint statistical distributions of the test statistics in the first and second stages are derived. For a fixed cost, our results show that the optimal design requires no additional samples in the second stage but only that the samples in the first stage be re-used. When the second stage uses an entirely independent sample, however, the optimal design under a given cost depends on the population allele frequency and the allele frequency difference between the case and control groups. For the current genotyping costs, we can roughly allocate 1/3 to 1/2 of the total sample size to the first stage for screening.

Keywords: DNA pooling, genotyping errors, individual genotyping, measurement errors, power, two-stage design

Introduction

In recent years, case-control association studies, in comparison to linkage analysis, have enjoyed much popularity as a potentially more effective strategy to identify disease susceptibility genes (Risch 2000). Association studies can be conducted by using either the candidate gene approach or the genome-wide association approach. Candidate-gene association studies, motivated by biological knowledge and hypotheses, aim to identify genes/markers that show different distributions between the case and control groups. But our current knowledge of biological mechanisms is rather limited, and it is extremely difficult to predict which specific genes play a role in disease etiology. Therefore, candidate-gene association studies will likely miss important genes. However, with the completion of the human genome sequencing, the identification of millions of genetic variations, and the advances in genotyping technologies, genome-wide association studies have become a viable and potentially powerful approach to identifying disease genes. Such studies allow a comprehensive scan of the whole genome and so increase the possibility of detecting disease-associated genes. Tens or hundreds of thousands of markers, however, are required for these studies. In addition, other factors, such as disease heterogeneity and gene-environment interactions, make disease gene identification a very challenging task, and so hundreds or even thousands of subjects may be needed to achieve sufficient statistical power. As a result, the total study cost would rise rapidly in spite of great improvements in genotyping technologies.

A cost-effective approach for detecting disease genes is to use a two-stage design, which has received much attention recently. There are two types of two-stage designs, depending on the methods/samples used to measure allele frequencies in the first stage: (1) individual genotyping is used in the first stage (Satagopan et al. 2002, 2004; Satagopan and Elston 2003; Lin 2006; Skol et al. 2006; Wang et al. 2006), hereafter referred to as the Two-Stage Design for Individual Genotyping (TSD-IG); (2) DNA pooling is used in the first stage (Zuo et al. 2006), hereafter referred to as the Two-Stage Design for DNA Pooling (TSD-P). In this paper, we focus on the TSD-P. Zuo et al. (2006) investigated the statistical power of such a design, among other things, but the study cost was not taken into account. The purpose of this paper is to study the optimal TSD-P with a focus on the overall study cost. We investigate how to allocate resources between the two stages. The corresponding issues for the TSD-IG were considered by Satagopan et al. (2002) and Satagopan and Elston (2003).

Note that accurate measurement of single-nucleotide polymorphism (SNP) frequencies in a sample is crucially important to successful association studies. Although individual genotyping accuracy has been greatly improved over the past decade, genotyping errors still inevitably occur (Buetow 1991; Shields et al. 1991; Akey et al. 2001; Gordon et al. 2001; Gordon et al. 2002). Many researchers have shown that genotyping errors can have large effects on the precision of allele frequency estimation and on the power of tests (Gordon et al. 1999; Rice and Holmans 2003; Ott 2004; Zou and Zhao 2004). Therefore, we will discuss the optimal two-stage design incorporating genotyping errors associated with individual genotyping, in addition to the measurement errors associated with DNA pooling (see, for example, Barratt et al. 2002; Jawaid et al. 2002) considered in Zuo et al. (2006) for the TSD-P. Note that genotyping errors associated with individual genotyping were not taken into account by Satagopan et al. (2002) or Satagopan and Elston (2003) for the TSD-IG.

This article is organized as follows. We will first derive the formulas for calculating the power of the TSD-P incorporating genotyping errors, and then discuss the methods for finding the optimal designs under a given cost. Based on these results, we will present our numerical results. The technical details are provided in the Appendix.

Methods

We consider two alleles, A and a, at a candidate marker, whose frequencies are p and q = 1 − p, respectively. For simplicity, we consider a case-control study with n cases and n controls. Let Xi denote the number of copies of allele A carried by the ith individual in the case group, and let Yi be defined similarly for the ith individual in the control group. Assuming Hardy-Weinberg Equilibrium (HWE), each Xi and Yi takes the value 2, 1, or 0 with respective probabilities p², 2pq and q² under the null hypothesis of no association between the candidate marker and disease. When the candidate marker is associated with disease, we assume that the penetrances are f2 for genotype AA, f1 for genotype Aa, and f0 for genotype aa. Note that these two alleles may be true functional alleles or may be in LD with true functional alleles. Under this genetic model, the probability of an individual having k copies of A among the cases, mk = P(Xi = k), and that among the controls, m′k = P(Yi = k), are given by

$$
\begin{aligned}
m_0 &= \frac{q^2 f_0}{p^2 f_2 + 2pq f_1 + q^2 f_0}, \quad
m_1 = \frac{2pq f_1}{p^2 f_2 + 2pq f_1 + q^2 f_0}, \quad
m_2 = \frac{p^2 f_2}{p^2 f_2 + 2pq f_1 + q^2 f_0}, \\
m'_0 &= \frac{q^2 (1 - f_0)}{p^2 (1 - f_2) + 2pq (1 - f_1) + q^2 (1 - f_0)}, \quad
m'_1 = \frac{2pq (1 - f_1)}{p^2 (1 - f_2) + 2pq (1 - f_1) + q^2 (1 - f_0)}, \\
m'_2 &= \frac{p^2 (1 - f_2)}{p^2 (1 - f_2) + 2pq (1 - f_1) + q^2 (1 - f_0)}.
\end{aligned}
$$
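As an illustration (not part of the original paper), the case and control genotype probabilities can be computed directly from p and the penetrances. The sketch below assumes HWE in the general population; the function name `genotype_distributions` and the example penetrances are hypothetical.

```python
def genotype_distributions(p, f2, f1, f0):
    """Return (m, m_prime), each a list indexed by k = 0, 1, 2 copies of allele A.

    m[k]       = P(X_i = k) among cases
    m_prime[k] = P(Y_i = k) among controls
    Assumes Hardy-Weinberg equilibrium in the general population.
    """
    q = 1.0 - p
    hwe = [q * q, 2.0 * p * q, p * p]          # population genotype frequencies
    pen = [f0, f1, f2]                         # penetrance for k copies of A
    case_w = [g * f for g, f in zip(hwe, pen)]
    ctrl_w = [g * (1.0 - f) for g, f in zip(hwe, pen)]
    prevalence = sum(case_w)                   # P(disease)
    m = [w / prevalence for w in case_w]
    m_prime = [w / (1.0 - prevalence) for w in ctrl_w]
    return m, m_prime


# Illustrative multiplicative model (f1^2 = f2 * f0); values are assumptions.
m, m_prime = genotype_distributions(p=0.2, f2=0.04, f1=0.02, f0=0.01)
p_case = m[2] + m[1] / 2                       # allele A frequency among cases
p_ctrl = m_prime[2] + m_prime[1] / 2           # allele A frequency among controls
```

With these illustrative inputs the case genotype probabilities are 4/9, 4/9 and 1/9 for k = 0, 1, 2, so the allele frequency among cases is 1/3, against roughly 0.198 among controls.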

One-stage designs

For convenience and useful reference, we first briefly summarize the test statistics and power expressions based on a one-stage design using either individual genotyping or DNA pooling. See Zou and Zhao (2004) or Zuo et al. (2006) for details.

(a) Individual genotyping

For individual genotyping, we assume that genotyping errors are introduced independently to each allele, and that the error rate from the true allele A to the erroneous allele a is e1 and from the true allele a to the erroneous allele A is e2. A more complete model for genotyping errors may be based on genotypes instead of alleles (Mote and Anderson 1965; Kang et al. 2004a; Kang et al. 2004b). Let $n_A^*$ and $n_U^*$ denote the observed numbers of allele A in the case group and control group, respectively. The statistic to test the association between the candidate marker and disease is

$$
t_{ind} = \frac{(n_A^* - n_U^*)/(2n)}{\sqrt{\hat{p}^*(1 - \hat{p}^*)/n}},
$$

where $\hat{p}^* = (n_A^* + n_U^*)/(4n)$.

Like Risch and Teng (1998) and Satagopan and Elston (2003), we consider a one-sided test. With a significance level of α, the power of the test statistic $t_{ind}$ is

$$
\Phi\left( \frac{-z_\alpha \sqrt{\tilde{p}^*(1 - \tilde{p}^*)}/(1 - e_1 - e_2) + \sqrt{n}\,\mu^*}{\sigma^*} \right), \tag{1}
$$

where Φ is the cumulative standard normal distribution function, $z_\alpha$ is the upper 100αth percentile of the standard normal distribution, $\tilde{p}^* = (p_A^* + p_U^*)/2$, $\mu^* = (p_A^* - p_U^*)/(1 - e_1 - e_2)$,

$$
\sigma^{*2} = \frac{[p_A^*(1 - p_A^*) - p_{A12}^*/4] + [p_U^*(1 - p_U^*) - p_{U12}^*/4]}{(1 - e_1 - e_2)^2},
$$

and

$$
p_A^* = p_{A11}^* + p_{A12}^*/2, \qquad p_U^* = p_{U11}^* + p_{U12}^*/2,
$$

with

$$
\begin{aligned}
p_{A11}^* &= (1 - e_1)^2 m_2 + (1 - e_1) e_2 m_1 + e_2^2 m_0, \\
p_{A12}^* &= 2(1 - e_1) e_1 m_2 + [(1 - e_1)(1 - e_2) + e_1 e_2] m_1 + 2(1 - e_2) e_2 m_0,
\end{aligned}
$$

and $p_{U11}^*$ and $p_{U12}^*$ are similarly defined for the control group; their values can be calculated by replacing $m_i$ with $m'_i$ (i = 0, 1, 2) in the above formulas for $p_{A11}^*$ and $p_{A12}^*$.
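A minimal sketch of how the power expression (1) might be evaluated numerically is given below. It is an illustration only: the function names are hypothetical, the inputs `m` and `m_prime` are the true genotype probabilities (for 0, 1, 2 copies of A) among cases and controls, and the code assumes 1 − e1 − e2 > 0.

```python
from math import sqrt
from statistics import NormalDist


def observed_genotype_probs(m, e1, e2):
    """Return (p11*, p12*): observed AA and Aa probabilities under allele errors."""
    m0, m1, m2 = m
    p11 = (1 - e1) ** 2 * m2 + (1 - e1) * e2 * m1 + e2 ** 2 * m0
    p12 = (2 * (1 - e1) * e1 * m2
           + ((1 - e1) * (1 - e2) + e1 * e2) * m1
           + 2 * (1 - e2) * e2 * m0)
    return p11, p12


def power_individual(n, m, m_prime, e1, e2, alpha):
    """One-sided power of t_ind with n cases and n controls, as in expression (1)."""
    a11, a12 = observed_genotype_probs(m, e1, e2)
    u11, u12 = observed_genotype_probs(m_prime, e1, e2)
    p_a, p_u = a11 + a12 / 2, u11 + u12 / 2         # observed allele frequencies
    shrink = 1 - e1 - e2                            # assumed positive
    p_tilde = (p_a + p_u) / 2
    mu = (p_a - p_u) / shrink
    sigma = sqrt((p_a * (1 - p_a) - a12 / 4) + (p_u * (1 - p_u) - u12 / 4)) / shrink
    z_alpha = NormalDist().inv_cdf(1 - alpha)       # upper 100*alpha-th percentile
    arg = (-z_alpha * sqrt(p_tilde * (1 - p_tilde)) / shrink + sqrt(n) * mu) / sigma
    return NormalDist().cdf(arg)
```

A quick sanity check on the formula: when cases and controls share the same HWE genotype distribution, the argument of Φ reduces to −zα, so the function returns α whatever the error rates are.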

(b) DNA pooling

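For DNA pooling, we consider m pools of cases and m pools of controls, each having size s such that n = ms, and assume the following models relating the observed frequencies to the true frequencies of allele A in the sample:

$$
\hat{p}_{Ai}^{pool} = p_{Ai} + u_i, \qquad \hat{p}_{Ui}^{pool} = p_{Ui} + v_i,
$$

where $p_{Ai}$ and $p_{Ui}$ denote the true frequencies of allele A in the ith case pool and the ith control pool, respectively, and $u_i$ and $v_i$ are deviations with mean 0 and variance ε² that are assumed to be independent and normally distributed. Define

$$
\hat{p}_A^{pool} = \frac{1}{m} \sum_{i=1}^{m} \hat{p}_{Ai}^{pool},
$$

and

$$
\hat{p}_U^{pool} = \frac{1}{m} \sum_{i=1}^{m} \hat{p}_{Ui}^{pool}.
$$

The following statistic can be used to test genetic association based on DNA pooling data:

$$
t_{pool} = \frac{\hat{p}_A^{pool} - \hat{p}_U^{pool}}{\sqrt{\dfrac{\hat{p}^{pool}(1 - \hat{p}^{pool})}{n} + \dfrac{2\varepsilon^2}{m}}},
$$

where $\hat{p}^{pool} = \frac{1}{2}(\hat{p}_A^{pool} + \hat{p}_U^{pool})$.

If we use a one-sided test and a significance level of α, then the power of the test statistic $t_{pool}$ is

$$
\Phi\left( \frac{-z_\alpha \sqrt{\dfrac{\tilde{p}(1 - \tilde{p})}{n} + \dfrac{2\varepsilon^2}{m}} + \mu}{\sqrt{\dfrac{\sigma^2}{n} + \dfrac{2\varepsilon^2}{m}}} \right),
$$

where $\tilde{p} = \mu/2 + m'_2 + m'_1/2$, $\mu = m_2 + \frac{1}{2} m_1 - m'_2 - \frac{1}{2} m'_1$, and $\sigma^2 = \frac{1}{4}\left[4m_2 + m_1 - (2m_2 + m_1)^2 + 4m'_2 + m'_1 - (2m'_2 + m'_1)^2\right]$.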
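The pooling power can be evaluated in the same way. The following is a sketch under the model above, with hypothetical names: `m_pools` is the number of pool pairs (m in the text), `eps` is the measurement error standard deviation ε, and `m`, `m_prime` are the genotype probabilities for 0, 1, 2 copies of A among cases and controls.

```python
from math import sqrt
from statistics import NormalDist


def pooling_moments(m, m_prime):
    """Return (mu, p_tilde, sigma2) for the pooled allele-frequency difference."""
    m0, m1, m2 = m
    c0, c1, c2 = m_prime
    mu = (m2 + m1 / 2) - (c2 + c1 / 2)             # true frequency difference
    p_tilde = mu / 2 + c2 + c1 / 2                 # average of case and control frequencies
    sigma2 = 0.25 * ((4 * m2 + m1 - (2 * m2 + m1) ** 2)
                     + (4 * c2 + c1 - (2 * c2 + c1) ** 2))
    return mu, p_tilde, sigma2


def power_pooling(n, m_pools, eps, m, m_prime, alpha):
    """One-sided power of t_pool with n cases and n controls in m_pools pool pairs."""
    mu, p_tilde, sigma2 = pooling_moments(m, m_prime)
    noise = 2 * eps ** 2 / m_pools                 # variance added by pooling measurement
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    num = -z_alpha * sqrt(p_tilde * (1 - p_tilde) / n + noise) + mu
    return NormalDist().cdf(num / sqrt(sigma2 / n + noise))
```

Under the null hypothesis with HWE, μ = 0 and σ² = pq, so the returned value equals α; this is a convenient check that the three moment formulas are mutually consistent.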

Two-stage designs

We consider the following two-stage design. In the first stage, we use pooled DNA data to test all M (a known number) markers, and assume the number of statistically significant markers is M1 which is unknown and needs to be estimated. In the second stage, we use individual genotype data to test the M1 promising markers. In practical genetic epidemiological studies, the sample in the first stage can be re-used in the second stage, and this introduces some correlation between the two test statistics, tpool and tind. In Zuo et al. (2006), this two-stage scheme was called the two-stage dependent design. On the other hand, we may use two separate samples in the two stages with one sample used for screening and the other used for confirmation. In this scenario, the two test statistics, tpool and tind are independent. Zuo et al. (2006) called such a two-stage scheme the two-stage independent design. In the following, we first consider the situation of the two-stage dependent design.

The joint distribution of test statistics in the first and second stages for the two-stage dependent design

In the first stage, i.e., the DNA pooling stage, we consider m pools of cases and m pools of controls each having size s such that n=ms. In the second stage, i.e., the individual genotyping stage based on a selected set of markers, in addition to the n cases and n controls used in the pooling stage, we also consider an additional case sample of size na and an additional control sample of size na.

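Under the null hypothesis H0, we can write approximately,

$$
t_{pool} = \frac{\sqrt{pq/n}\,\xi_0 + w}{\sqrt{pq/n + 2\varepsilon^2/m}},
$$

and

$$
t_{ind} = \sqrt{\frac{n}{n + n_a}}\,\xi_0^* + \sqrt{\frac{n_a}{n + n_a}}\,\eta_0^*,
$$

where $(\xi_0, \xi_0^*)^T \sim N(0, \Sigma_0)$ (T denotes transpose) with

$$
\Sigma_0 = \begin{pmatrix}
1 & \dfrac{(1 - e_1 - e_2)\sqrt{pq}}{\sqrt{p^*(1 - p^*)}} \\
\dfrac{(1 - e_1 - e_2)\sqrt{pq}}{\sqrt{p^*(1 - p^*)}} & 1
\end{pmatrix},
$$

and $p^* = (1 - e_1)p + e_2 q$, $\eta_0^* \sim N(0, 1)$, $w = \bar{u} - \bar{v} \sim N(0, 2\varepsilon^2/m)$, $\bar{u} = \frac{1}{m}\sum_{i=1}^{m} u_i$, $\bar{v} = \frac{1}{m}\sum_{i=1}^{m} v_i$, and $(\xi_0, \xi_0^*)^T$, $\eta_0^*$ and $w$ are mutually independent (see the Appendix for the rigorous proof under the alternative hypothesis H1). So under the null hypothesis of no association, $(t_{pool}, t_{ind})^T$ has an approximate bivariate normal distribution $N(0, \Sigma_0^*)$, where

$$
\Sigma_0^* = \begin{pmatrix}
1 & \dfrac{(1 - e_1 - e_2)\,pq}{\sqrt{(n + n_a)\,p^*(1 - p^*)} \cdot \sqrt{pq/n + 2\varepsilon^2/m}} \\
\dfrac{(1 - e_1 - e_2)\,pq}{\sqrt{(n + n_a)\,p^*(1 - p^*)} \cdot \sqrt{pq/n + 2\varepsilon^2/m}} & 1
\end{pmatrix}.
$$

This distribution involves the parameters p, e1, e2, and ε, which are either known or can be easily estimated (Bansal et al. 2002; Rice and Holmans 2003; Zou and Zhao 2003; Gordon et al. 2004; Lai et al. 2007; Tintle et al. 2007). Thus, the critical values for the two-stage dependent design can be determined.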
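To make the dependence between the two stages concrete, the off-diagonal entry of the null covariance matrix of (tpool, tind) can be computed as below. This is a sketch with a hypothetical function name; `m_pools` stands for the number of pool pairs m and `eps` for ε.

```python
from math import sqrt


def null_correlation(p, e1, e2, n, n_a, eps, m_pools):
    """Correlation between t_pool and t_ind under H0 for the dependent design."""
    q = 1.0 - p
    p_star = (1.0 - e1) * p + e2 * q               # observed frequency of allele A
    num = (1.0 - e1 - e2) * p * q
    den = (sqrt((n + n_a) * p_star * (1.0 - p_star))
           * sqrt(p * q / n + 2.0 * eps ** 2 / m_pools))
    return num / den
```

Two limiting cases help to read the formula. With no errors of either kind and no additional sample (e1 = e2 = ε = 0, na = 0), the two statistics use exactly the same information and the correlation is 1. Doubling the second-stage sample (na = n) in the same error-free setting lowers it to 1/√2.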

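Under the genetic model discussed above, we can write approximately,

$$
t_{pool} = \frac{\sqrt{\sigma^2/n}\,\xi_1 + w}{\sqrt{\tilde{p}(1 - \tilde{p})/n + 2\varepsilon^2/m}},
$$

and

$$
t_{ind} = \frac{\sqrt{\dfrac{n}{n + n_a}}\,\sigma^* \xi_1^* + \sqrt{\dfrac{n_a}{n + n_a}}\,\sigma^* \eta_1^*}{\sqrt{\tilde{p}^*(1 - \tilde{p}^*)}},
$$

where $(\xi_1, \xi_1^*)^T \sim N(\sqrt{n}\,\tilde{\mu}, \Sigma_1)$ with

$$
\tilde{\mu} = \left(\mu/\sigma,\ \mu^*(1 - e_1 - e_2)/\sigma^*\right)^T, \qquad
\Sigma_1 = \begin{pmatrix}
1 & \dfrac{\sigma_{12}^*}{4\sigma\sigma^*} \\
\dfrac{\sigma_{12}^*}{4\sigma\sigma^*} & (1 - e_1 - e_2)^2
\end{pmatrix},
$$

and

$$
\begin{aligned}
\sigma_{12}^* &= \sigma_{A12}^* + \sigma_{U12}^*, \\
\sigma_{A12}^* &= 4m_2(1 - e_1) + m_1(1 - e_1 + e_2) - (2m_2 + m_1)(2p_{A11}^* + p_{A12}^*), \\
\sigma_{U12}^* &= 4m'_2(1 - e_1) + m'_1(1 - e_1 + e_2) - (2m'_2 + m'_1)(2p_{U11}^* + p_{U12}^*),
\end{aligned}
$$

and $\eta_1^* \sim N\left(\sqrt{n_a}\,\mu^*(1 - e_1 - e_2)/\sigma^*,\ (1 - e_1 - e_2)^2\right)$, and $(\xi_1, \xi_1^*)^T$, $\eta_1^*$ and $w$ are mutually independent. Thus, under the alternative hypothesis H1, $(t_{pool}, t_{ind})^T$ has an approximate bivariate normal distribution $N(\tilde{\mu}^*, \Sigma_1^*)$ (see the Appendix for the details of the proof), where

$$
\tilde{\mu}^* = \left( \frac{\mu}{\sqrt{\tilde{p}(1 - \tilde{p})/n + 2\varepsilon^2/m}},\ \frac{\sqrt{n + n_a}\,\mu^*(1 - e_1 - e_2)}{\sqrt{\tilde{p}^*(1 - \tilde{p}^*)}} \right)^T,
$$

and

$$
\Sigma_1^* = \begin{pmatrix}
\dfrac{\sigma^2/n + 2\varepsilon^2/m}{\tilde{p}(1 - \tilde{p})/n + 2\varepsilon^2/m} &
\dfrac{\sigma_{12}^*}{4\sqrt{(n + n_a)\,\tilde{p}^*(1 - \tilde{p}^*)} \cdot \sqrt{\tilde{p}(1 - \tilde{p})/n + 2\varepsilon^2/m}} \\
\dfrac{\sigma_{12}^*}{4\sqrt{(n + n_a)\,\tilde{p}^*(1 - \tilde{p}^*)} \cdot \sqrt{\tilde{p}(1 - \tilde{p})/n + 2\varepsilon^2/m}} &
\dfrac{\sigma^{*2}(1 - e_1 - e_2)^2}{\tilde{p}^*(1 - \tilde{p}^*)}
\end{pmatrix}.
$$

This distribution involves the seven parameters p, e1, e2, ε, f2, f1, and f0. The first four are either known or can be readily estimated, as in the situation of H0. The last three, however, are generally unknown and not so easy to estimate.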

Power for two-stage designs

For the given sample size n in the first stage, if we assume that one of the significance level α1 and power 1 − β1 of the first stage is known, then we can determine a critical value k1 by solving α1 = P(tpool > k1 | H0) or 1 − β1 = P(tpool > k1 | H1). Thus, for the overall significance level α and additional sample size na, we can determine the critical value k2 in the second stage by solving

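$$
\alpha/M = P(t_{pool} > k_1,\ t_{ind} > k_2 \mid H_0) = \int_{k_1}^{\infty}\!\int_{k_2}^{\infty} h_0(x, y)\,dy\,dx,
$$

where $h_0(x, y)$ is the density function of $(t_{pool}, t_{ind})^T$ under H0, which is given by

$$
h_0(x, y) = \frac{1}{2\pi\sqrt{|\Sigma_0^*|}} \exp\left\{ -\frac{1}{2}(x, y)\,\Sigma_0^{*-1} \begin{pmatrix} x \\ y \end{pmatrix} \right\}
$$

with $|\Sigma_0^*|$ being the determinant of the matrix $\Sigma_0^*$, and $\Sigma_0^{*-1}$ being the inverse of $\Sigma_0^*$. Here, for M markers, we have used the Bonferroni correction for each marker test.

The power of the two-stage design for detecting a disease-associated marker is then given by

$$
1 - \beta = P(t_{pool} > k_1,\ t_{ind} > k_2 \mid H_1) = \int_{k_1}^{\infty}\!\int_{k_2}^{\infty} h_1(x, y)\,dy\,dx,
$$

where $h_1(x, y)$ is the density function of $(t_{pool}, t_{ind})^T$ under H1, which is given by

$$
h_1(x, y) = \frac{1}{2\pi\sqrt{|\Sigma_1^*|}} \exp\left\{ -\frac{1}{2}\left((x, y) - \tilde{\mu}^{*T}\right) \Sigma_1^{*-1} \left( \begin{pmatrix} x \\ y \end{pmatrix} - \tilde{\mu}^* \right) \right\}.
$$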

For the two-stage independent design, the Type I error rate and statistical power are simply the products of those in both stages, respectively. That is,

P(tpool>k1,tind>k2|H0)=P(tpool>k1|H0)·P(tind>k2|H0),

and

P(tpool>k1,tind>k2|H1)=P(tpool>k1|H1)·P(tind>k2|H1).
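The joint tail probabilities above have no closed form when the two statistics are correlated, so they must be evaluated numerically. The sketch below is one possible approach, not the authors' implementation: it integrates the marginal density of the first statistic against the conditional tail of the second using Simpson's rule, and then finds k2 by bisection. All function names and the integration settings are assumptions; the code requires the covariance to be strictly less than the product of the standard deviations.

```python
from math import erfc, exp, pi, sqrt


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2.0))


def orthant_prob(k1, k2, mean=(0.0, 0.0), var=(1.0, 1.0), cov=0.0, steps=4000):
    """P(X > k1, Y > k2) for a bivariate normal (steps must be even)."""
    mx, my = mean
    vx, vy = var
    sx = sqrt(vx)
    cond_sd = sqrt(vy - cov * cov / vx)            # sd of Y given X = x
    lo = k1
    hi = max(k1, mx) + 10.0 * sx                   # density beyond this is negligible
    h = (hi - lo) / steps

    def integrand(x):
        fx = exp(-0.5 * ((x - mx) / sx) ** 2) / (sx * sqrt(2.0 * pi))
        cond_mean = my + cov / vx * (x - mx)
        return fx * upper_tail((k2 - cond_mean) / cond_sd)

    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    return total * h / 3.0


def solve_k2(k1, rho, target, lo=-10.0, hi=10.0, iters=60):
    """Bisection for k2 such that P(t_pool > k1, t_ind > k2 | H0) = target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if orthant_prob(k1, mid, cov=rho) > target:
            lo = mid                               # probability too large: raise k2
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the dependent design one would pass the off-diagonal entry of the null covariance matrix as `rho` to obtain k2, and then call `orthant_prob` with the mean vector and covariance entries under H1 to obtain the power. With `cov=0.0` the function reduces to the product form used for the independent design.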

The optimal choice of the parameters in two-stage design for a fixed cost

In the above discussion, the sample sizes used in both stages are fixed in the design which implies that the cost is not our main concern. If the total cost is the major constraint which is often the situation in practical studies, how do we optimize the two-stage design? That is, given the total cost and overall significance level, how do we choose the sample size in the first stage for screening, the sample size in the second stage for confirmation, the significance levels in the two stages, and the power in the first stage so that the power of the two-stage design is maximized?

Now let C be the total cost, C1 be the cost of recruiting an individual, Cpool be the cost of measuring allele frequency at a single marker for a DNA pool, Cind be the cost of genotyping a single marker for an individual, and C0 be the other cost associated with a study, such as infrastructure cost that does not scale with the study size. Then we have

$$
C = C_0 + C_1 \cdot 2(n + n_a) + C_{pool} \cdot 2mM + C_{ind} \cdot 2(n + n_a) M_1 \tag{2}
$$

for the two-stage dependent design, and

$$
C = C_0 + C_1 \cdot 2(n + n_a) + C_{pool} \cdot 2mM + C_{ind} \cdot 2 n_a M_1 \tag{3}
$$

for the two-stage independent design with a sample size of na in the second stage, where M is the total number of markers, and M1 is the number of markers selected from the first stage as before. As in Satagopan and Elston (2003) and Satagopan et al. (2004), the expected number of markers selected from the first stage is

$$
\bar{M}_1 = (M - K)\alpha_1 + K(1 - \beta_1),
$$

where K is the number of disease-associated markers, which is unknown. Thus, the cost functions in (2) and (3) can be replaced by

$$
C = C_0 + C_1 \cdot 2(n + n_a) + C_{pool} \cdot 2mM + C_{ind} \cdot 2(n + n_a)\left[(M - K)\alpha_1 + K(1 - \beta_1)\right], \tag{4}
$$

and

$$
C = C_0 + C_1 \cdot 2(n + n_a) + C_{pool} \cdot 2mM + C_{ind} \cdot 2 n_a \left[(M - K)\alpha_1 + K(1 - \beta_1)\right], \tag{5}
$$

respectively.
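The two cost functions are simple enough to write down directly. The sketch below uses hypothetical function and argument names; the first-stage significance level is obtained from the critical value under the normal approximation, α1 = P(tpool > k1 | H0) = 1 − Φ(k1).

```python
from statistics import NormalDist


def first_stage_alpha(k1):
    """alpha_1 = P(t_pool > k1 | H0) under the normal approximation."""
    return 1.0 - NormalDist().cdf(k1)


def expected_selected(M, K, alpha1, beta1):
    """Expected number of markers carried into the second stage."""
    return (M - K) * alpha1 + K * (1.0 - beta1)


def total_cost(n, n_a, m_pools, M, K, alpha1, beta1,
               C0, C1, C_pool, C_ind, dependent=True):
    """Expected total cost: dependent design re-genotypes the first-stage sample."""
    genotyped = (n + n_a) if dependent else n_a    # individuals genotyped in stage 2
    return (C0
            + C1 * 2 * (n + n_a)                   # recruiting cases and controls
            + C_pool * 2 * m_pools * M             # pooled measurements, stage 1
            + C_ind * 2 * genotyped * expected_selected(M, K, alpha1, beta1))


# First column of Table 3 (dependent design): n = 478, na = 0, k1 = 2.975, beta1 = 0.610.
cost = total_cost(478, 0, 1, 10**6, 1, first_stage_alpha(2.975), 0.610,
                  C0=0, C1=2000, C_pool=0.03, C_ind=0.02, dependent=True)
```

Plugging in the first column of Table 3 gives a total of about 2.0×10⁶, that is, the reported optimum sits on the budget constraint up to rounding of the tabulated values. With these costs, recruitment dominates the budget while the second-stage genotyping is a small fraction of it.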

Denote the critical values in the first and second stages by k1 and k2, respectively. Then we have the following constraints on the unknown parameters α1 and k1, and 1 − β1, k1 and n:

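$$
\alpha_1 = 1 - \Phi(k_1), \tag{6}
$$

and

$$
1 - \beta_1 = \Phi\left( \frac{-k_1 \sqrt{\dfrac{\tilde{p}(1 - \tilde{p})}{n} + \dfrac{2\varepsilon^2}{m}} + \mu}{\sqrt{\dfrac{\sigma^2}{n} + \dfrac{2\varepsilon^2}{m}}} \right). \tag{7}
$$

Further, for an overall fixed significance level α,

$$
\alpha/M = \int_{k_1}^{\infty}\!\int_{k_2}^{\infty} h_0(x, y)\,dy\,dx \tag{8}
$$

for the two-stage dependent design, and

$$
\alpha/M = \alpha_1 \cdot [1 - \Phi(k_2)] \tag{9}
$$

for the two-stage independent design.

Our goal is to maximize the statistical power

$$
1 - \beta = \int_{k_1}^{\infty}\!\int_{k_2}^{\infty} h_1(x, y)\,dy\,dx
$$

for the two-stage dependent design, or

$$
1 - \beta = (1 - \beta_1) \cdot \Phi\left( \frac{-k_2 \sqrt{\tilde{p}^*(1 - \tilde{p}^*)}/(1 - e_1 - e_2) + \sqrt{n_a}\,\mu^*}{\sigma^*} \right)
$$

for the two-stage independent design, both as functions of the unknown parameters (n, na, k1, k2, α1, β1) for a given cost C and overall significance level α.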

The calculation procedure of optimal parameters

The above two maximization problems, denoted P1 and P2 for the two-stage dependent design and the two-stage independent design, respectively, are Mixed Integer Nonlinear Programming (MINLP) problems with discrete and continuous variables and with nonlinearity in the objective functions and constraints, and hence are difficult to solve. To obtain their global solutions, we adopt a simple and intuitive method. We illustrate our calculation procedure using the two-stage dependent design as follows (the situation for the two-stage independent design is similar):

  • Step 1: We convert the original problem (P1), with the parameters n, na, k1, k2, α1 and β1, into a problem (denoted by P1′) with the parameters n and β1 by using the constraints (4) and (6)-(8);

  • Step 2: We divide the problem (P1′) into n nonlinear optimization sub-problems {P1′(1), …, P1′(n)}, each with the single parameter β1;

  • Step 3: We solve each sub-problem (P1′(i)) by using the existing function “fminbnd” in Matlab, which finds the minimum of a single-variable function on a fixed interval, and record its minimal objective value f*(i) and the corresponding solution β1*(i);

  • Step 4: We obtain the global minimum of the objective function, f* = min {f*(i), i = 1, 2, …, n}, and the corresponding global solution $\beta_1^* = \arg\min_{1 \le i \le n}\{f^*(i)\}$.
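The four steps can be sketched in Python for the two-stage independent design, whose power has a closed form. This is an illustration under stated assumptions, not the authors' Matlab code: a plain grid scan over β1 replaces `fminbnd`, power is maximized directly rather than minimizing an objective, na is rounded down to an integer, and configurations in which the first stage alone is already stricter than the overall level are skipped. The function names and the dictionary keys (`mu`, `sigma2`, `p_tilde` for pooling; `mu_s`, `sigma_s`, `p_tilde_s` for individual genotyping with errors) are hypothetical.

```python
from math import floor, sqrt
from statistics import NormalDist

N01 = NormalDist()


def power_given(n, beta1, d, c):
    """Power of the independent design once n and beta1 are fixed (Step 1).

    d: model quantities (mu, sigma2, p_tilde, mu_s, sigma_s, p_tilde_s, e1, e2, eps, m)
    c: design constants (C, C0, C1, C_pool, C_ind, M, K, alpha_per_marker)
    Returns 0.0 when the budget or the constraints cannot be met.
    """
    noise = 2 * d["eps"] ** 2 / d["m"]
    a = sqrt(d["p_tilde"] * (1 - d["p_tilde"]) / n + noise)
    b = sqrt(d["sigma2"] / n + noise)
    k1 = (d["mu"] - b * N01.inv_cdf(1 - beta1)) / a        # invert constraint (7)
    alpha1 = N01.cdf(-k1)                                  # constraint (6)
    m1_bar = (c["M"] - c["K"]) * alpha1 + c["K"] * (1 - beta1)
    budget = c["C"] - c["C0"] - 2 * c["C1"] * n - 2 * c["C_pool"] * d["m"] * c["M"]
    n_a = floor(budget / (2 * c["C1"] + 2 * c["C_ind"] * m1_bar))   # cost (5) solved for n_a
    if n_a < 1 or alpha1 <= c["alpha_per_marker"]:
        return 0.0
    k2 = -N01.inv_cdf(c["alpha_per_marker"] / alpha1)      # constraint (9)
    shrink = 1 - d["e1"] - d["e2"]
    arg = (-k2 * sqrt(d["p_tilde_s"] * (1 - d["p_tilde_s"])) / shrink
           + sqrt(n_a) * d["mu_s"]) / d["sigma_s"]
    return (1 - beta1) * N01.cdf(arg)


def optimise_independent(d, c, n_step=1, grid=199):
    """Return (best power, n, beta1) by scanning n (Step 2) and beta1 (Step 3)."""
    n_max = int((c["C"] - c["C0"] - 2 * c["C_pool"] * d["m"] * c["M"]) // (2 * c["C1"]))
    best = (0.0, None, None)
    for n in range(1, n_max + 1, n_step):
        for i in range(1, grid + 1):
            beta1 = i / (grid + 1)
            pw = power_given(n, beta1, d, c)
            if pw > best[0]:
                best = (pw, n, beta1)                      # Step 4: keep the global best
    return best
```

A grid scan is slower than a bounded scalar minimizer but does not rely on the objective being unimodal in β1, which is convenient here because rounding na makes the power a step-like function. For the dependent design the same loop structure applies, with the closed-form power replaced by a numerical evaluation of the bivariate normal tail probability.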

Results

To investigate the statistical power of the two-stage design in the presence of genotyping errors, we set the sample size in the first stage to n = 500 and the supplemental sample size in the second stage to na = 500. Note that the main purpose of the first stage is to screen for the truly associated markers, and so we want the probability that the truly associated markers are retained to be large. Therefore, as in Zuo et al. (2006), we set the power to 95% in the pooling stage. The significance level of the two-stage design for a single marker test is taken to be α / M = 5×10⁻⁸, a level suggested by Risch and Merikangas (1996) for genome-wide association studies. Similar to the situation in the absence of genotyping errors, the power of the two-stage design in the presence of genotyping errors depends on the genetic model almost entirely through the population allele frequency and the allele frequency difference between the cases and controls (data not shown). So for specificity, we consider the multiplicative model (i.e., f1² = f2f0) and set f0 to 0.01 in the following calculations.

Table 1 gives the statistical power of the two-stage dependent design to detect a disease-associated marker under various error rates. In almost all situations, the impact of genotyping errors is larger when the allele frequency difference between the cases and controls is smaller. That is, when detecting a disease-associated marker with a large allele frequency difference between the cases and controls, the impact of genotyping errors is small. This observation is consistent with the results of Gordon and Finch (2005), who pointed out that one way to minimize the effect of genotyping errors is to design studies with higher power initially (see Figure 4 of their paper). The impact of unequal error rates (e1 = 0.01, e2 = 0.03) lies between the impacts of the two equal error rates considered. Similar patterns are observed for the two-stage independent design (Table 2). As in the situation of no genotyping errors in the second stage (Zuo et al. 2006), the effect of the measurement error rate ε on the power is not serious, especially for the two-stage dependent design (data not shown). The same phenomenon is observed when we use multiple pools: the use of multiple pools does not significantly increase the power of the two-stage design, especially for the two-stage dependent design (data not shown). This can be expected because the benefit of using multiple pools is to reduce the measurement errors, which have no large effect on the power, especially for the two-stage dependent design, as mentioned above. In Table 1 and Table 2, the power in the first stage is set to 0.95. We have also considered the effect of various values of 1 − β1. In general, the effect of 1 − β1 on the power of the two-stage design depends on the population allele frequency p and the allele frequency difference between the cases and controls, pA − pU. In most situations, higher power can be obtained by increasing 1 − β1, especially for large values of pA − pU (data not shown). As for the optimal value of 1 − β1, it can be found by maximizing the power of the two-stage design for the given cost or by minimizing the cost for the given power.

Table 1.

The power of the two-stage dependent design for fixed population allele frequency and allele frequency difference between the case and control groups in the presence of genotyping errors

| p | [e1, e2] | pA − pU = 0.03 | pA − pU = 0.05 | pA − pU = 0.07 | pA − pU = 0.10 |
|---|---|---|---|---|---|
| 0.05 | [0, 0] | 0.070 | 0.744 | 0.948 | 0.950 |
| 0.05 | [0.01, 0.01] | 0.039 | 0.613 | 0.943 | 0.950 |
| 0.05 | [0.03, 0.03] | 0.013 | 0.369 | 0.902 | 0.950 |
| 0.05 | [0.01, 0.03] | 0.015 | 0.395 | 0.910 | 0.950 |
| 0.20 | [0, 0] | 0.001 | 0.062 | 0.458 | 0.938 |
| 0.20 | [0.01, 0.01] | 0.001 | 0.050 | 0.403 | 0.930 |
| 0.20 | [0.03, 0.03] | 6.52×10⁻⁴ | 0.032 | 0.302 | 0.902 |
| 0.20 | [0.01, 0.03] | 7.26×10⁻⁴ | 0.036 | 0.327 | 0.911 |
| 0.70 | [0, 0] | 6.24×10⁻⁴ | 0.037 | 0.374 | 0.936 |
| 0.70 | [0.01, 0.01] | 5.21×10⁻⁴ | 0.030 | 0.327 | 0.927 |
| 0.70 | [0.03, 0.03] | 3.64×10⁻⁴ | 0.020 | 0.244 | 0.895 |
| 0.70 | [0.01, 0.03] | 4.69×10⁻⁴ | 0.027 | 0.302 | 0.920 |

* The power in the pooling stage is 1 − β1 = 95%, and the significance level for the two-stage design is α / M = 5×10⁻⁸.

** The sample size in the first stage is 500, the supplemental sample size in the second stage is also 500, the measurement error rate is ε = 0.01, and the number of pool pairs is m = 1.

Table 2.

The power of the two-stage independent design for fixed population allele frequency and allele frequency difference between the case and control groups in the presence of genotyping errors

| p | [e1, e2] | pA − pU = 0.03 | pA − pU = 0.05 | pA − pU = 0.07 | pA − pU = 0.10 |
|---|---|---|---|---|---|
| 0.05 | [0, 0] | 0.006 | 0.229 | 0.820 | 0.950 |
| 0.05 | [0.01, 0.01] | 0.004 | 0.153 | 0.734 | 0.950 |
| 0.05 | [0.03, 0.03] | 0.001 | 0.069 | 0.539 | 0.948 |
| 0.05 | [0.01, 0.03] | 0.001 | 0.076 | 0.564 | 0.949 |
| 0.20 | [0, 0] | 1.44×10⁻⁴ | 0.007 | 0.112 | 0.773 |
| 0.20 | [0.01, 0.01] | 1.20×10⁻⁴ | 0.006 | 0.094 | 0.735 |
| 0.20 | [0.03, 0.03] | 8.34×10⁻⁵ | 0.004 | 0.065 | 0.651 |
| 0.20 | [0.01, 0.03] | 9.10×10⁻⁵ | 0.004 | 0.072 | 0.675 |
| 0.70 | [0, 0] | 7.79×10⁻⁵ | 0.004 | 0.081 | 0.764 |
| 0.70 | [0.01, 0.01] | 6.74×10⁻⁵ | 0.003 | 0.069 | 0.726 |
| 0.70 | [0.03, 0.03] | 5.05×10⁻⁵ | 0.002 | 0.049 | 0.642 |
| 0.70 | [0.01, 0.03] | 6.19×10⁻⁵ | 0.003 | 0.063 | 0.705 |

* The power in the pooling stage is 1 − β1 = 95%, and the significance level for the two-stage design is α / M = 5×10⁻⁸.

** The sample sizes in the first and second stages are both 500, the measurement error rate is ε = 0.01, and the number of pool pairs is m = 1.

Comparing the two-stage dependent and two-stage independent designs, we see from Table 1 and Table 2 that the two-stage dependent design has much higher power, at the price of extra genotyping in the second stage. On the other hand, if we use a sample size of na = 1000 in the second stage for the two-stage independent design, which is the same as the total second-stage sample for the two-stage dependent design, then the two-stage independent design will have higher power (data not shown). Clearly, in this situation, an extra sample collection cost is incurred for the two-stage independent design. Because the extra genotyping cost for the two-stage dependent design is generally much lower than the extra sample collection cost for the two-stage independent design, the two-stage dependent design would generally have higher power. Of course, a fair comparison should be based on the same total cost (see below).

To see the optimal choice of the parameters in the two-stage design for a fixed cost, we assume the total number of markers is M = 10⁶, the number of truly disease-associated markers is K = 1, and the number of pool pairs is m = 1. Further, we let the total cost C = 2×10⁶ (unit: US$), the cost of recruiting an individual C1 = 2000, the cost of genotyping a single marker for an individual Cind = 0.02, the cost of measuring allele frequency at a single marker for a DNA pool Cpool = 0.03, the other cost C0 = 0, the genotyping error rates e1 = e2 = 0.001, and the measurement error rate ε = 0.01. The calculation results for various population allele frequencies and allele frequency differences between the cases and controls are summarized in Table 3 for the two-stage dependent design and Table 4 for the two-stage independent design. It is clear from Table 3 that under the given total cost, essentially no additional sample is needed at the second stage to obtain the optimal design with the highest power for the two-stage dependent design. This means that for the two-stage dependent design, all individuals should be used at both stages. We have also considered other cost settings, e.g. C = 5×10⁵, C1 = 200, C0 = 0, Cind = 0.02, Cpool = 0.03, and these confirmed our findings (Table 5). For the two-stage independent design, the optimal allocation of sample size depends on the cost and the allele frequency difference between the case and control groups. Roughly, for the current genotyping costs and error rates, we recommend allocating 1/3 to 1/2 of the total sample size to the first stage for the two-stage independent design.

Table 3.

The optimal power of the two-stage dependent design for a fixed total cost of C = 2×10⁶ when the number of pool pairs is m = 1

| | pA − pU = 0.05, p = 0.05 | pA − pU = 0.05, p = 0.20 | pA − pU = 0.05, p = 0.70 | pA − pU = 0.10, p = 0.05 | pA − pU = 0.10, p = 0.20 | pA − pU = 0.10, p = 0.70 |
|---|---|---|---|---|---|---|
| Power | 0.113 | 0.003 | 0.002 | 0.972 | 0.398 | 0.386 |
| n | 478 | 456 | 484 | 473 | 484 | 483 |
| na | 0 | 27 | 0 | 0 | 0 | 0 |
| k1 | 2.975 | 3.420 | 3.585 | 2.803 | 3.532 | 3.344 |
| k2 | 5.217 | 5.307 | 5.317 | 5.246 | 5.306 | 5.323 |
| β1 | 0.610 | 0.911 | 0.941 | 0.011 | 0.274 | 0.219 |

* The significance level for the two-stage design is α / M = 5×10⁻⁸, C0 = 0, C1 = 2000, Cind = 0.02, Cpool = 0.03, M = 10⁶, K = 1, and the error rates are e1 = e2 = 0.001 and ε = 0.01.

Table 4.

The optimal power of the two-stage independent design for a fixed total cost of C = 2×10⁶ when the number of pool pairs is m = 1

| | pA − pU = 0.05, p = 0.05 | pA − pU = 0.05, p = 0.20 | pA − pU = 0.05, p = 0.70 | pA − pU = 0.10, p = 0.05 | pA − pU = 0.10, p = 0.20 | pA − pU = 0.10, p = 0.70 |
|---|---|---|---|---|---|---|
| Power | 0.050 | 0.002 | 0.001 | 0.826 | 0.232 | 0.226 |
| n | 124 | 140 | 146 | 169 | 162 | 163 |
| na | 339 | 335 | 330 | 294 | 312 | 311 |
| k1 | 2.484 | 2.749 | 2.779 | 2.433 | 2.694 | 2.693 |
| k2 | 4.323 | 4.148 | 4.128 | 4.354 | 4.186 | 4.187 |
| β1 | 0.748 | 0.924 | 0.936 | 0.099 | 0.490 | 0.491 |

* The significance level for the two-stage design is α / M = 5×10⁻⁸, C0 = 0, C1 = 2000, Cind = 0.02, Cpool = 0.03, M = 10⁶, K = 1, and the error rates are e1 = e2 = 0.001 and ε = 0.01.

Table 5.

The optimal power of the two-stage dependent design for a fixed total cost of C = 5×10⁵ when the number of pool pairs is m = 1

| | pA − pU = 0.05, p = 0.05 | pA − pU = 0.05, p = 0.20 | pA − pU = 0.05, p = 0.70 | pA − pU = 0.10, p = 0.05 | pA − pU = 0.10, p = 0.20 | pA − pU = 0.10, p = 0.70 |
|---|---|---|---|---|---|---|
| Power | 0.501 (0.067) | 0.074 (0.012) | 0.046 (0.008) | 0.999 (0.545) | 0.971 (0.362) | 0.969 (0.356) |
| n | 950 (681) | 1110 (946) | 1116 (975) | 828 (379) | 1053 (667) | 1056 (674) |
| na | 0 (0) | 1 (1) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| k1 | 2.862 (2.463) | 3.389 (2.856) | 3.428 (2.912) | 2.662 (2.047) | 3.115 (2.445) | 3.125 (2.454) |
| k2 | 5.090 (4.798) | 5.216 (4.831) | 5.257 (4.875) | 5.170 (5.051) | 5.267 (5.065) | 5.296 (5.118) |
| β1 | 0.434 (0.906) | 0.771 (0.959) | 0.805 (0.964) | 0.001 (0.432) | 0.021 (0.600) | 0.022 (0.603) |

* The significance level for the two-stage design is α / M = 5×10⁻⁸, C0 = 0, C1 = 200, Cind = Cpool = 0.02, M = 10⁶, K = 1, and the error rates are e1 = e2 = 0.001 and ε = 0.01 (the values in brackets correspond to ε = 0.03).

Unlike the situation where the sample sizes are fixed in the two stages (which implies that the cost is not our main concern), when the cost is given, both the measurement error rate ε and the number of pool pairs m can have a large effect on the power of the two-stage design. For example, for the two-stage dependent design, Table 5 shows that increasing ε from 0.01 to 0.03 substantially reduces the power. This seems to contradict our previous observation for fixed sample sizes in the two stages. In fact, as Zuo et al. (2006) pointed out, for a two-stage design, measurement errors have a large impact only on the first stage. This means that the number of promising markers M1 will be seriously affected by ε. Because M1 is an important part of the cost function C, it is not difficult to understand why the effect of ε on the power could be large. Such an effect would become small when multiple pools are used (data not shown). As for the effect of multiple pools on the power, we observe that using multiple pools often leads to reduced power (Table 3 and Table 6), which may seem counterintuitive. An intuitive explanation is that the use of multiple pools greatly increases the cost of measuring allele frequencies in the first stage, and so reduces the sample sizes at both stages, although it can reduce the measurement error.

Table 6.

The optimal power of the two-stage dependent design for a fixed total cost of C = 2×10⁶ when the number of pool pairs is m = 10

| | pA − pU = 0.05, p = 0.05 | pA − pU = 0.05, p = 0.20 | pA − pU = 0.05, p = 0.70 | pA − pU = 0.10, p = 0.05 | pA − pU = 0.10, p = 0.20 | pA − pU = 0.10, p = 0.70 |
|---|---|---|---|---|---|---|
| Power | 0.038 | 0.001 | 0.001 | 0.830 | 0.154 | 0.152 |
| n | 349 | 342 | 343 | 349 | 349 | 349 |
| na | 0 | 0 | 0 | 0 | 0 | 0 |
| k1 | 4.860 | 2.829 | 2.872 | 4.872 | 4.978 | 4.225 |
| k2 | 5.245 | 5.266 | 5.255 | 5.240 | 5.312 | 5.311 |
| β1 | 0.930 | 0.744 | 0.798 | 0.120 | 0.770 | 0.500 |

* The significance level for the two-stage design is α / M = 5×10⁻⁸, C0 = 0, C1 = 2000, Cind = 0.02, Cpool = 0.03, M = 10⁶, K = 1, and the error rates are e1 = e2 = 0.001 and ε = 0.01.

For a fixed cost, it is now clear from Table 3 and Table 4 that the two-stage dependent design has higher power than the two-stage independent design. This is in fact not surprising because all of the sample information is used in the second stage for the two-stage dependent design but not for the two-stage independent design.

It should be noted that the error model for individual genotyping we considered here is a simple allele error model. Under such a model, loci that are in HWE before the errors are introduced will still be in HWE after the errors are introduced. However, departure from HWE is often used as a way to detect genotyping errors (Hosking et al. 2004; Leal 2005; Cox and Kraft 2006). So in the following, we investigate the robustness of our methods to the departure from HWE through simulations. We consider a general error model (Kang et al. 2004a; Kang et al. 2004b), which is defined on genotypes rather than alleles and is given in Table 7. Some special cases of the model can be found in Sobel et al. (2002) and Douglas et al. (2002). In our simulations, we consider two sets of error rates: “ ε21 = 0.01, ε20 = 0.02, ε12 = 0.03, ε10 = 0.02, ε02 = 0.01, ε01 = 0.02 ” and “ ε21 = 0.1, ε20 = 0.2, ε12 = 0.3, ε10 = 0.2, ε02 = 0.1, ε01 = 0.2 ”. We first generate a population of size 5×10^6, assuming the population allele frequency p = 0.2 under HWE. We then use the penetrances f2, f1 and f0 to simulate the disease status of each individual, and thus obtain the case subpopulation and the control subpopulation. We randomly draw n = na = 500 cases from the case subpopulation and n = na = 500 controls from the control subpopulation. Using the general genotyping error model in Table 7, we generate the genotype data with errors for the samples, from which we can estimate the allele error rates e1 and e2. By making use of our test statistics, the empirical Type I error rate and power can be calculated. Note that to see whether our method attains the nominal level, a very large number of simulations would be needed if we set α / M = 5×10^−8. So we consider the nominal level of α / M = 0.05 instead. The empirical Type I error rate is the proportion of significant replicates out of the total number of replicates under H0. The empirical power is the proportion of significant replicates out of the total number of replicates under H1. Based on 10,000 replicates, our results for the empirical Type I error rate and empirical power are summarized in Table 8 for the two-stage dependent design (the results are similar for the two-stage independent design). It can be seen that the empirical Type I error rates and empirical power are generally close to the nominal level of 0.05 and the theoretical power calculated by our asymptotic normal formulas, respectively. Although we observed an increase in the Type I error rate for extremely large genotyping error rates, in practice most SNP genotyping error rates are below 0.005. So the results of Table 8 indicate that when genotyping error rates are this low, our methods are robust to deviations from HWE.
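The choice of a 0.05 nominal level for the simulations can be motivated by a quick Monte Carlo precision check. The following is a minimal sketch, assuming independent replicates so that the count of significant replicates is binomial; the function names are illustrative and do not come from the paper.

```python
import math

def mc_standard_error(rate, replicates):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / replicates)

def expected_rejections(rate, replicates):
    """Expected number of significant replicates at a given true rate."""
    return rate * replicates

replicates = 10_000  # number of replicates reported for Table 8

# At the nominal level 0.05 the empirical rate is estimated fairly precisely.
se_005 = mc_standard_error(0.05, replicates)
print(f"SE at 0.05: {se_005:.4f}")

# At a level of 5e-8 almost no replicate would be significant, so the
# empirical rate could not be estimated from 10,000 replicates.
print(f"Expected rejections at 5e-8: {expected_rejections(5e-8, replicates):.4f}")

# Distance of the two Table 8 empirical rates from 0.05, in standard errors.
for observed in (0.049, 0.077):
    z = (observed - 0.05) / se_005
    print(f"observed={observed}: {z:+.1f} SE from nominal")
```

Under these assumptions the standard error at 0.05 is about 0.0022, so the empirical rate 0.049 lies within half a standard error of the nominal level, whereas 0.077 is roughly 12 standard errors above it.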

Table 7.

A general genotyping error model (Kang et al. 2004a, Kang et al. 2004b)

| Observed genotype | True genotype AA | True genotype Aa | True genotype aa |
|---|---|---|---|
| AA | 1 − ε21 − ε20 | ε12 | ε02 |
| Aa | ε21 | 1 − ε12 − ε10 | ε01 |
| aa | ε20 | ε10 | 1 − ε01 − ε02 |
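To illustrate how the genotype-level model in Table 7 can move a locus away from HWE, the sketch below applies the error matrix to HWE genotype frequencies. It reads the subscripts of Table 7 as εij = probability that true genotype i is observed as genotype j, with 2 = AA, 1 = Aa and 0 = aa (so each column of the table sums to one). The function names and the heterozygote-excess measure are illustrative assumptions, not the authors' code.

```python
def error_matrix(e21, e20, e12, e10, e02, e01):
    """Rows: observed genotype (AA, Aa, aa); columns: true genotype (AA, Aa, aa)."""
    return [
        [1 - e21 - e20, e12, e02],
        [e21, 1 - e12 - e10, e01],
        [e20, e10, 1 - e01 - e02],
    ]

def hwe_frequencies(p):
    """True genotype frequencies (AA, Aa, aa) under HWE with allele frequency p."""
    q = 1 - p
    return [p * p, 2 * p * q, q * q]

def observed_frequencies(matrix, true_freqs):
    """Observed genotype frequencies: error matrix times the true-frequency vector."""
    return [sum(row[j] * true_freqs[j] for j in range(3)) for row in matrix]

def hwe_deviation(freqs):
    """Heterozygote frequency minus its HWE expectation 2p(1 - p)."""
    p_obs = freqs[0] + freqs[1] / 2
    return freqs[1] - 2 * p_obs * (1 - p_obs)

# First set of error rates used in the simulations, with p = 0.2.
small = error_matrix(e21=0.01, e20=0.02, e12=0.03, e10=0.02, e02=0.01, e01=0.02)
true_freqs = hwe_frequencies(0.2)
obs_freqs = observed_frequencies(small, true_freqs)

print("true genotype frequencies:    ", true_freqs)
print("observed genotype frequencies:", obs_freqs)
print("HWE deviation before errors:", hwe_deviation(true_freqs))
print("HWE deviation after errors: ", hwe_deviation(obs_freqs))
```

The observed frequencies still sum to one, but the heterozygote frequency no longer matches 2p(1 − p) computed from the observed allele frequency, which is the kind of departure from HWE the simulations are designed to probe.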

Table 8.

Empirical Type I error rate for the prevalence of 0.05 and empirical power for the penetrances f2 = 0.0226, f1 = f0 = 0.0113 for the two-stage dependent design in the presence of genotyping errors

| Genotyping error rate | ε21 = 0.01, ε20 = 0.02, ε12 = 0.03, ε10 = 0.02, ε02 = 0.01, ε01 = 0.02 | ε21 = 0.1, ε20 = 0.2, ε12 = 0.3, ε10 = 0.2, ε02 = 0.1, ε01 = 0.2 |
|---|---|---|
| Empirical Type I error rate (nominal level) | 0.049 (0.050) | 0.077 (0.050) |
| Empirical power (theoretical power) | 0.712 (0.715) | 0.351 (0.349) |

* The theoretical power in the pooling stage is 1 − β1 = 95%, the population allele frequency is 0.2, the sample size in the first stage is 500, the supplemental sample size in the second stage is also 500, the measurement error rate is ε = 0.01, and the number of pool pairs is m = 1.

Discussion

In this paper, we have investigated the optimization of the TSD-P incorporating genotyping errors in genetic association studies. For the same total cost, it was observed that the two-stage dependent design is more powerful than the two-stage independent design. Further, under the fixed cost, for the two-stage dependent design, the optimal design that leads to the highest power is to use the whole sample in both stages. This contrasts with the TSD-IG in Skol et al. (2006) in which it is best to allocate 1/3 to 1/2 of the sample to the first stage. An intuitive explanation is that pooling costs are the same regardless of how many individuals are in the pool and hence more individuals can be pooled together in the first stage for the TSD-P. For the two-stage independent design, the optimal design under a given cost depends on the population allele frequency and allele frequency difference between the case and control groups. For the current genotyping costs and error rates, we can roughly allocate 1/3 to 1/2 of the total sample size to the first stage for screening.

The impact of genotyping errors on the two-stage design has also been considered. For both the two-stage dependent and independent designs, we found that genotyping errors have a larger impact on detecting a disease-associated marker when the allele frequency difference between the cases and controls is smaller. Also, the impact of unequal genotyping error rates for the two alleles lies between the impacts of equal error rates. When the sample sizes at both stages are fixed, i.e., when cost is not our main concern, the effects of both the measurement errors associated with pooled DNA and the number of pool pairs on the power of the two-stage design are not large, especially for the two-stage dependent design. When cost is our main concern, however, both the measurement errors and the number of pool pairs can have large effects on the power of the two-stage design: an increase in measurement errors can substantially decrease the power, and the use of multiple pools often leads to reduced power. Our suggestion is therefore that, for a given cost, we should reduce the measurement errors in the first stage but need not form many pools.

Although the error model we considered for individual genotyping is a simple allele error model, our simulations show that our methods are generally robust to the departure from HWE. On the other hand, we have used two different error models for the DNA pooling and individual genotyping technologies. An interesting relationship between them can be established; details are available from the authors upon request.

In this article, we have optimized the two-stage design based on a given cost. Clearly, optimizing the two-stage design for a given power is also an interesting topic. The related problem is our ongoing work.

Throughout the paper, we have considered single-marker tests of genetic association between diseases and genes and used the Bonferroni correction to handle the testing of multiple markers. However, the Bonferroni method is known to be conservative, especially when the markers under study are in LD (Malley et al. 2002), because it ignores the correlations among the loci; this may lead to the detection of only those loci with large marginal genetic effects and not those with small ones. On the other hand, multiple loci may contribute to disease susceptibility through complex interactions. Therefore, how to develop effective methods to test multiple markers simultaneously warrants further research.
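As a small worked example of the Bonferroni correction discussed above, the sketch below computes the per-marker level and the resulting family-wise error rate when all tests are independent. The values α = 0.05 and M = 10^6 are assumptions chosen only because they reproduce the per-marker level α / M = 5×10^−8 mentioned earlier; correlation among markers in LD, which is the main concern raised in the text, is not modeled here.

```python
import math

def bonferroni_threshold(alpha, num_markers):
    """Per-marker significance level under the Bonferroni correction."""
    return alpha / num_markers

def fwer_independent(per_test_level, num_markers):
    """Family-wise error rate 1 - (1 - a)^M for M independent tests at level a."""
    # log1p and expm1 avoid rounding problems when a is tiny and M is large.
    return -math.expm1(num_markers * math.log1p(-per_test_level))

alpha = 0.05             # assumed overall significance level
num_markers = 1_000_000  # assumed number of markers M

threshold = bonferroni_threshold(alpha, num_markers)
fwer = fwer_independent(threshold, num_markers)

print(f"per-marker level: {threshold:.1e}")
print(f"family-wise error rate under independence: {fwer:.4f}")
```

Even with independent markers the achieved family-wise error rate stays slightly below the target α, which is the sense in which the correction is conservative.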

In practical genetic epidemiological studies, missing genotype data are unavoidable. For example, the missing rates for individual genotyping are generally between 5% and 10% across different platforms. Two-stage designs in the presence of missing data are no doubt an interesting topic for future research.

In the above two-stage independent design, the second stage data is used alone. This is a traditional method which does not make full use of the data information from the first stage. Recently, for the TSD-IG, Skol et al. (2006) proposed a joint analysis strategy which combines the full data information from the first stage in the second stage analysis and so leads to higher power to detect genetic association. For the two-stage independent design considered here, how to conduct similar joint analysis is a future research topic.

Acknowledgments

The authors are grateful to the two reviewers for their constructive comments and suggestions which led to substantial improvements of the original paper. This work was supported in part by grants DMS0234078 from the National Science Foundation (to Y. Zuo), Nos. 70625004 and 70221001 from the National Natural Science Foundation of China (to G. Zou), GM59507 from the National Institutes of Health (to H. Zhao), and AI62247-01 and AI59773 from the National Institute of Allergy and Infectious Diseases (to H. Liang).

Appendix

The proof of approximate normality for the test statistic (tpool, tind)T of the two-stage dependent design

Clearly, we need only consider the situation under the alternative hypothesis. Let $X_i$ ($Y_i$) denote the number of copies of allele A carried by the ith individual in the case (control) group, i = 1, …, n. When the n cases (controls) are partitioned into m pools, each of size s, we let $X_{ij}$ ($Y_{ij}$) denote the number of copies of allele A carried by the jth individual in the ith case (control) pool, j = 1, …, s; i = 1, …, m. Also, let $X_i^*$ ($Y_i^*$) denote the observed number of copies of allele A for the ith individual in the case (control) group when individual genotyping is used. Then $p_{Ai} = \sum_{j=1}^{s} X_{ij}/(2s)$, $p_{Ui} = \sum_{j=1}^{s} Y_{ij}/(2s)$, $n_A^* = \sum_{i=1}^{n+n_a} X_i^*$, and $n_U^* = \sum_{i=1}^{n+n_a} Y_i^*$. Therefore,

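$$
t_{pool} = \frac{\frac{1}{2ms}\sum_{i=1}^{m}\sum_{j=1}^{s}(X_{ij}-Y_{ij}) + (\bar{u}-\bar{v})}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})/n + 2\varepsilon^2/m}}
= \frac{\sqrt{\sigma^2/n}\cdot\frac{1}{2n}\sum_{i=1}^{n}(X_i-Y_i)\Big/\sqrt{\sigma^2/n} + w}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})/n + 2\varepsilon^2/m}}
\equiv \frac{\sqrt{\sigma^2/n}\cdot\sqrt{n}\,U_n + w}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})/n + 2\varepsilon^2/m}}, \quad \text{(say)} \tag{A.1}
$$

and

$$
t_{ind} = \frac{\sqrt{\frac{n}{n+n_a}}\,\sigma^*\cdot\frac{1}{2n}\sum_{i=1}^{n}(X_i^*-Y_i^*)\Big/\frac{\sigma^*}{\sqrt{n}} + \sqrt{\frac{n_a}{n+n_a}}\,\sigma^*\cdot\frac{1}{2n_a}\sum_{i=n+1}^{n+n_a}(X_i^*-Y_i^*)\Big/\frac{\sigma^*}{\sqrt{n_a}}}{\sqrt{\hat{p}^*(1-\hat{p}^*)}}
\equiv \frac{\sqrt{\frac{n}{n+n_a}}\,\sigma^*\cdot\sqrt{n}\,V_n^{(1)} + \sqrt{\frac{n_a}{n+n_a}}\,\sigma^*\cdot\sqrt{n_a}\,V_{n_a}^{(2)}}{\sqrt{\hat{p}^*(1-\hat{p}^*)}}. \quad \text{(say)} \tag{A.2}
$$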
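The decomposition in (A.2) splits the second-stage numerator into a part from the n first-stage individuals and a part from the na supplemental individuals. The sketch below checks this identity numerically on arbitrary data. It assumes that the numerator of tind is √(n + na) times the difference of observed allele frequencies, (nA* − nU*) / (2(n + na)); the data, the value of σ* and all variable names are illustrative assumptions.

```python
import math
import random

random.seed(0)
n, n_a = 500, 500
total = n + n_a
sigma_star = 0.57  # arbitrary positive constant; it cancels in the identity

# Arbitrary observed allele counts (0, 1 or 2 copies of A) per individual.
x_star = [random.choice((0, 1, 2)) for _ in range(total)]  # cases
y_star = [random.choice((0, 1, 2)) for _ in range(total)]  # controls

# zeta_i^(2) = (X_i* - Y_i*) / (2 sigma*), averaged over each subsample.
zeta2 = [(x - y) / (2 * sigma_star) for x, y in zip(x_star, y_star)]
v1 = sum(zeta2[:n]) / n    # V_n^(1): first-stage individuals
v2 = sum(zeta2[n:]) / n_a  # V_na^(2): supplemental individuals

# Numerator of the decomposed form in (A.2).
decomposed = (math.sqrt(n / total) * sigma_star * math.sqrt(n) * v1
              + math.sqrt(n_a / total) * sigma_star * math.sqrt(n_a) * v2)

# Direct numerator: sqrt(n + na) times the observed allele frequency difference.
direct = math.sqrt(total) * (sum(x_star) - sum(y_star)) / (2 * total)

print(decomposed, direct)
```

Both quantities share the denominator √(p̂*(1 − p̂*)), so agreement of the numerators is enough to confirm that the two forms of the statistic coincide.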

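We suppose that $n/(n+n_a) \to \ell_1\ (<\infty)$. When ε is a non-zero constant, the proof is easy. In the following, we assume that $\sqrt{n}\,\varepsilon \to \ell_2\ (<\infty)$. Denote

$$
\zeta_i^{(1)} = \frac{X_i - Y_i}{2\sigma}, \qquad \zeta_i^{(2)} = \frac{X_i^* - Y_i^*}{2\sigma^*}, \qquad w_i = u_i - v_i.
$$

Then $U_n = \frac{1}{n}\sum_{i=1}^{n}\zeta_i^{(1)}$, $V_n^{(1)} = \frac{1}{n}\sum_{i=1}^{n}\zeta_i^{(2)}$, $V_{n_a}^{(2)} = \frac{1}{n_a}\sum_{i=n+1}^{n+n_a}\zeta_i^{(2)}$, and $w = \bar{u} - \bar{v} = \frac{1}{m}\sum_{i=1}^{m} w_i$, so that $\mathrm{Var}(w/\varepsilon) = 2/m$. Clearly, $(\zeta_i^{(1)}, \zeta_i^{(2)})^T$ (i = 1, …, n, …) are i.i.d. random vectors, whose mean is $\tilde{\mu} = (\mu/\sigma,\ \mu^*(1-e_1-e_2)/\sigma^*)^T$, and whose variance-covariance matrix is

$$
\Sigma_1 = \begin{pmatrix} 1 & \dfrac{\sigma_{12}^*}{4\sigma\sigma^*} \\[2ex] \dfrac{\sigma_{12}^*}{4\sigma\sigma^*} & (1-e_1-e_2)^2 \end{pmatrix}.
$$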

Thus, from the central limit theorem, we obtain

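$$
\sqrt{n}\left((U_n, V_n^{(1)})^T - \tilde{\mu}\right) \xrightarrow{d} N(0, \Sigma_1),
$$

where $\xrightarrow{d}$ means convergence in distribution. Further, we have

$$
\sqrt{n_a}\left(V_{n_a}^{(2)} - \frac{\mu^*(1-e_1-e_2)}{\sigma^*}\right) \xrightarrow{d} N\left(0, (1-e_1-e_2)^2\right).
$$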

Now, denote

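$$
W_{n,n_a} = \left(\sqrt{n}\,(U_n - \mu/\sigma),\ \sqrt{n}\,[V_n^{(1)} - \mu^*(1-e_1-e_2)/\sigma^*],\ \sqrt{n_a}\,[V_{n_a}^{(2)} - \mu^*(1-e_1-e_2)/\sigma^*],\ w/\varepsilon\right)^T,
$$

$$
\theta_{n,n_a} = \left(\sqrt{n}\,\mu/\sigma,\ \sqrt{n}\,\mu^*(1-e_1-e_2)/\sigma^*,\ \sqrt{n_a}\,\mu^*(1-e_1-e_2)/\sigma^*,\ 0\right)^T,
$$

and

$$
B_{n,n_a} = \begin{pmatrix}
\dfrac{\sqrt{\sigma^2/n}}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})/n + 2\varepsilon^2/m}} & 0 & 0 & \dfrac{\varepsilon}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})/n + 2\varepsilon^2/m}} \\[3ex]
0 & \dfrac{\sqrt{n/(n+n_a)}\,\sigma^*}{\sqrt{\hat{p}^*(1-\hat{p}^*)}} & \dfrac{\sqrt{n_a/(n+n_a)}\,\sigma^*}{\sqrt{\hat{p}^*(1-\hat{p}^*)}} & 0
\end{pmatrix}.
$$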

Then from (A.1) and (A.2), we have

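$$
(t_{pool}, t_{ind})^T = B_{n,n_a} W_{n,n_a} + B_{n,n_a}\theta_{n,n_a}. \tag{A.3}
$$

Note that $\hat{p}_{pool} \xrightarrow{p} \tilde{p}$ and $\hat{p}^* \xrightarrow{p} \tilde{p}^*$, where $\xrightarrow{p}$ means convergence in probability. So the limit of $B_{n,n_a}$ in probability is

$$
B = \begin{pmatrix}
\dfrac{\sigma}{\sqrt{\tilde{p}(1-\tilde{p}) + 2\ell_2^2/m}} & 0 & 0 & \dfrac{\ell_2}{\sqrt{\tilde{p}(1-\tilde{p}) + 2\ell_2^2/m}} \\[3ex]
0 & \dfrac{\sqrt{\ell_1}\,\sigma^*}{\sqrt{\tilde{p}^*(1-\tilde{p}^*)}} & \dfrac{\sqrt{1-\ell_1}\,\sigma^*}{\sqrt{\tilde{p}^*(1-\tilde{p}^*)}} & 0
\end{pmatrix}.
$$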

From (A.3), when n + na → ∞, we obtain

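$$
(t_{pool}, t_{ind})^T - \hat{\mu}_{n,n_a} \xrightarrow{d} N(0, \Sigma_1^{**}), \tag{A.4}
$$

where

$$
\hat{\mu}_{n,n_a} = B_{n,n_a}\theta_{n,n_a} = \left(\frac{\mu}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})/n + 2\varepsilon^2/m}},\ \frac{\sqrt{n+n_a}\,\mu^*(1-e_1-e_2)}{\sqrt{\hat{p}^*(1-\hat{p}^*)}}\right)^T,
$$

and

$$
\Sigma_1^{**} = B \begin{pmatrix} \Sigma_1 & 0 & 0 \\ 0 & (1-e_1-e_2)^2 & 0 \\ 0 & 0 & 2/m \end{pmatrix} B^T
= \begin{pmatrix}
\dfrac{\sigma^2 + 2\ell_2^2/m}{\tilde{p}(1-\tilde{p}) + 2\ell_2^2/m} & \dfrac{\sqrt{\ell_1}\,\sigma_{12}^*}{4\sqrt{\tilde{p}^*(1-\tilde{p}^*)}\cdot\sqrt{\tilde{p}(1-\tilde{p}) + 2\ell_2^2/m}} \\[3ex]
\dfrac{\sqrt{\ell_1}\,\sigma_{12}^*}{4\sqrt{\tilde{p}^*(1-\tilde{p}^*)}\cdot\sqrt{\tilde{p}(1-\tilde{p}) + 2\ell_2^2/m}} & \dfrac{\sigma^{*2}(1-e_1-e_2)^2}{\tilde{p}^*(1-\tilde{p}^*)}
\end{pmatrix}.
$$

Finally, it is readily seen that $\hat{\mu}_{n,n_a} = \tilde{\mu}^* + o_P(\sqrt{n+n_a})$ and $\Sigma_1^* \approx \Sigma_1^{**}$. Combining these with (A.4), we see that under the alternative hypothesis H1, $(t_{pool}, t_{ind})^T$ approximately follows a bivariate normal distribution $N(\tilde{\mu}^*, \Sigma_1^*)$.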
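The closed form of the limiting covariance matrix can be sanity-checked numerically by carrying out the matrix product B · diag(Σ1, (1 − e1 − e2)², 2/m) · Bᵀ directly. The sketch below does this in plain Python. All parameter values are arbitrary placeholders chosen for illustration (they are not taken from the paper), and the helper functions are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

# Arbitrary illustrative values (assumptions, not from the paper).
sigma, sigma_star, sigma12_star = 0.55, 0.57, 0.30
p_tilde, p_tilde_star = 0.21, 0.22
e1, e2 = 0.01, 0.02
l1, l2, m = 0.5, 0.4, 2
c = 1 - e1 - e2  # scaling factor from the allele error rates

d_pool = math.sqrt(p_tilde * (1 - p_tilde) + 2 * l2 ** 2 / m)
d_ind = math.sqrt(p_tilde_star * (1 - p_tilde_star))

# Limit matrix B (2 x 4).
B = [[sigma / d_pool, 0.0, 0.0, l2 / d_pool],
     [0.0, math.sqrt(l1) * sigma_star / d_ind,
      math.sqrt(1 - l1) * sigma_star / d_ind, 0.0]]

# Block-diagonal covariance of the limit of W: Sigma_1, c^2, 2/m.
rho = sigma12_star / (4 * sigma * sigma_star)
middle = [[1.0, rho, 0.0, 0.0],
          [rho, c ** 2, 0.0, 0.0],
          [0.0, 0.0, c ** 2, 0.0],
          [0.0, 0.0, 0.0, 2 / m]]

product = matmul(matmul(B, middle), transpose(B))

# Closed form stated for the limiting covariance matrix.
off_diag = math.sqrt(l1) * sigma12_star / (4 * d_ind * d_pool)
closed = [[(sigma ** 2 + 2 * l2 ** 2 / m) / d_pool ** 2, off_diag],
          [off_diag, sigma_star ** 2 * c ** 2 / d_ind ** 2]]

max_gap = max(abs(product[i][j] - closed[i][j])
              for i in range(2) for j in range(2))
print("largest entrywise difference:", max_gap)
```

The check is purely algebraic: it confirms that the two expressions agree for the chosen inputs, and says nothing about how good the normal approximation is for finite n.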

References

1. Akey JM, Zhang K, Xiong M, Doris P, Jin L. The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am J Hum Genet. 2001;68:1447–1456. doi: 10.1086/320607.
2. Bansal A, van den Boom D, Kammerer S, Honisch C, Adam G, Cantor CR, Kleyn P, Braun A. Association testing by DNA pooling: an effective initial screen. Proc Natl Acad Sci USA. 2002;99:16871–16874. doi: 10.1073/pnas.262671399.
3. Barratt BJ, Payne F, Rance HE, Nuland S, Todd JA, Clayton DG. Identification of the sources of error in allele frequency estimations from pooled DNA indicated an optimal experimental design. Ann Hum Genet. 2002;66:393–405. doi: 10.1017/S0003480002001252.
4. Buetow KH. Influence of aberrant observations on high-resolution linkage analysis outcomes. Am J Hum Genet. 1991;49:985–994.
5. Cox DG, Kraft P. Quantification of the power of Hardy-Weinberg equilibrium testing to detect genotyping error. Hum Hered. 2006;61:10–14. doi: 10.1159/000091787.
6. Douglas JA, Skol AD, Boehnke M. Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am J Hum Genet. 2002;70:487–495. doi: 10.1086/338919.
7. Gordon D, Finch SJ. Factors affecting statistical power in the detection of genetic association. J Clin Invest. 2005;115:1408–1418. doi: 10.1172/JCI24756.
8. Gordon D, Finch SJ, Nothnagel M, Ott J. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered. 2002;54:22–33. doi: 10.1159/000066696.
9. Gordon D, Heath SC, Liu X, Ott J. A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. Am J Hum Genet. 2001;69:371–380. doi: 10.1086/321981.
10. Gordon D, Matise TC, Heath SC, Ott J. Power loss for multiallelic transmission/disequilibrium test when errors introduced: GAW11 simulated data. Genet Epidemiol. 1999;17:S587–S592. doi: 10.1002/gepi.1370170795.
11. Gordon D, Yang Y, Haynes C, Finch SJ, Mendell NR, Brown AM, Haroutunian V. Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat Appl Genet Mol Biol. 2004;3:Article 26. doi: 10.2202/1544-6115.1085.
12. Hosking L, Lumsden S, Lewis K, Yeo A, McCarthy L, Bansal A, Riley J, Purvis I, Xu CF. Detection of genotyping errors by Hardy-Weinberg equilibrium testing. Eur J Hum Genet. 2004;12:395–399. doi: 10.1038/sj.ejhg.5201164.
13. Jawaid A, Bader JS, Purcell S, Cherny SS, Sham P. Optimal selection strategies for QTL mapping using pooled DNA samples. Eur J Hum Genet. 2002;10:125–132. doi: 10.1038/sj.ejhg.5200771.
14. Kang SJ, Finch SJ, Haynes C, Gordon D. Quantifying the percent increase in minimum sample size for SNP genotyping errors in genetic model-based association studies. Hum Hered. 2004a;58:139–144. doi: 10.1159/000083540.
15. Kang SJ, Gordon D, Finch SJ. What SNP genotyping errors are most costly for genetic association studies? Genet Epidemiol. 2004b;26:132–141. doi: 10.1002/gepi.10301.
16. Lai R, Zhang H, Yang Y. Repeated measurement sampling in genetic association analysis with genotyping errors. Genet Epidemiol. 2007;31:143–153. doi: 10.1002/gepi.20197.
17. Leal SM. Detection of genotyping errors and pseudo-SNPs via deviations from Hardy-Weinberg equilibrium. Genet Epidemiol. 2005;29:204–214. doi: 10.1002/gepi.20086.
18. Lin DY. Evaluating statistical significance in two-stage genomewide association studies. Am J Hum Genet. 2006;78:505–509. doi: 10.1086/500812.
19. Malley JD, Naiman DQ, Bailey-Wilson JE. A comprehensive method for genome scans. Hum Hered. 2002;54:174–185. doi: 10.1159/000070663.
20. Mote VL, Anderson RL. An investigation of the effect of misclassification on the properties of chi square-tests in the analysis of categorical data. Biometrika. 1965;52:95–109.
21. Ott J. Issues in Association Analysis: Error Control in Case-Control Association Studies for Disease Gene Discovery. Hum Hered. 2004;58:171–174. doi: 10.1159/000083544.
22. Rice KM, Holmans P. Allowing for genotyping error in analysis of unmatched cases and controls. Ann Hum Genet. 2003;67:165–174. doi: 10.1046/j.1469-1809.2003.00020.x.
23. Risch N. Searching for genetic determinants in the new millennium. Nature. 2000;405:847–856. doi: 10.1038/35015718.
24. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516.
25. Risch N, Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res. 1998;8:1273–1288. doi: 10.1101/gr.8.12.1273.
26. Satagopan JM, Elston RC. Optimal two-stage genotyping in population-based association studies. Genet Epidemiol. 2003;25:149–157. doi: 10.1002/gepi.10260.
27. Satagopan JM, Venkatraman ES, Begg CB. Two-stage designs for gene-disease association studies with sample size constraints. Biometrics. 2004;60:589–597. doi: 10.1111/j.0006-341X.2004.00207.x.
28. Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, Begg CB. Two-stage designs for gene-disease association studies. Biometrics. 2002;58:163–170. doi: 10.1111/j.0006-341x.2002.00163.x.
29. Shields DC, Collins A, Buetow KH, Morton NE. Error filtration, interference, and the human linkage map. Proc Natl Acad Sci USA. 1991;88:6501–6505. doi: 10.1073/pnas.88.15.6501.
30. Skol AD, Scott LJ, Abecasis GR, Boehnke M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet. 2006;38:209–213. doi: 10.1038/ng1706.
31. Sobel E, Papp JC, Lange K. Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. 2002;70:496–508. doi: 10.1086/338920.
32. Tintle NL, Gordon D, McMahon FJ, Finch SJ. Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Stat Appl Genet Mol Biol. 2007;6:Article 4. doi: 10.2202/1544-6115.1251.
33. Wang H, Thomas DC, Pe'er I, Stram DO. Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol. 2006;30:356–368. doi: 10.1002/gepi.20150.
34. Zou G, Zhao H. Haplotype frequency estimation in the presence of genotyping errors. Hum Hered. 2003;56:131–138. doi: 10.1159/000073741.
35. Zou G, Zhao H. The impacts of errors in individual genotyping and DNA pooling on association studies. Genet Epidemiol. 2004;26:1–10. doi: 10.1002/gepi.10277.
36. Zuo Y, Zou G, Zhao H. Two-stage designs in case-control association analysis. Genetics. 2006;173:1747–1760. doi: 10.1534/genetics.105.042648.
