Author manuscript; available in PMC: 2015 Nov 30.
Published in final edited form as: Ann Hum Genet. 2005 Jul;69(0 4):429–442. doi: 10.1046/j.1529-8817.2005.00164.x

Family-Based Association Tests for Different Family Structures Using Pooled DNA

Guohua Zou 1,2, Hongyu Zhao 1,*
PMCID: PMC4664084  NIHMSID: NIHMS4592  PMID: 15996171

Summary

DNA pooling is a cost-effective strategy for genome-wide association studies to identify disease genes. In the context of family-based association studies, Risch & Teng (1998) mainly considered families of identical structures to detect associations between genetic markers and disease, and suggested possible approaches to incorporating different family types without a thorough study of their properties. However, families collected in real genetic studies often have different structures and, more importantly, the informativeness of each family structure depends on the disease model, which is generally unknown. So there is a need to develop and investigate statistical methods to combine information from diverse family types. In this article, we propose a general strategy to incorporate different family types by assigning each family an “optimal” weight in association tests. In addition, we consider measurement errors in our analysis. When we evaluate our approach under different disease models and measurement errors, we find that our weighting scheme may lead to a substantial reduction in the required sample size compared with the approach suggested by Risch & Teng (1998), and that measurement errors may have a significant impact on the required sample size when the error rates are not negligible.

Keywords: family-based association study, family structure, DNA pooling, measurement error, sample size

Introduction

The genome-wide association study is a promising approach to identifying disease genes. However, it is still extremely expensive to genotype hundreds or thousands of individuals at hundreds of thousands of marker loci with current technologies. As a result, DNA pooling has received much attention recently due to its potential for saving genotyping cost (Michelmore et al. 1991; Lipkin et al. 1998; Risch & Teng, 1998; Xu et al. 1999; Bader et al. 2001; Jawaid et al. 2002; Ito et al. 2003; Wang et al. 2003, among others). Recent developments in quantitative assays and in the design and analysis of pooling studies were reviewed by Sham et al. (2002).

For quantitative phenotypes, Bader & Sham (2002) proposed statistical methods to use DNA pooling in family-based association designs. For qualitative phenotypes, Risch & Teng (1998) derived formulae for calculating power to detect associations for identical family structures when DNA pooling is used. Compared to the results of population-based case-control tests under both individual genotyping and DNA pooling (cf. Zou & Zhao, 2004), their research shows that when families with parents are used, pooling leads to higher power, especially as the number of affected children increases, although when case-parent trios are used the two approaches have similar power. Note that family-based designs are robust to population stratification, and family-based association tests are promising when pooled DNA is used. However, families often have different structures in practical genetic studies, and a study is more flexible if it does not constrain family types. For example, it may not be easy to collect only families with three affected children. More importantly, Risch & Teng’s (1998) research and our own suggest that the sample sizes required for various family structures depend on the disease model, which is often unknown to researchers. Therefore, there is a need to develop statistical methods to combine families of different types, for both practical and design considerations. As pointed out by Risch & Teng (1998), simple pooling of families of different structures, e.g., all affecteds are pooled together and all parents and unaffecteds are pooled together, is not a robust procedure. They proposed two ways to combine families of different structures. The first is to form pools using only families with identical structures; the second is to duplicate individuals for different family types, so that the ratio of the number of affecteds to the number of unaffecteds remains constant. However, they investigated neither the properties nor the power of their proposed methods.
In this article, we consider the first method of forming pools using families of the same structures. Through marker score distributions, we first derive formulae for the mean and variance of the test statistic using DNA pooling data from families of identical structures. Our general results cover those of Risch & Teng (1998) as special cases. Based on these results, we propose a weighting scheme to combine data of different family types. When our approach is applied to different disease models we find that it may lead to significant reduction in sample size requirements compared to those through Risch & Teng’s approach under certain disease models. We also consider errors in measuring allele frequencies, which are unavoidable for DNA pooling technology. Recent research suggests that for a given DNA pooling sample, the standard deviation of the estimated allele frequency is between 1% and 4% (cf., Buetow et al. 2001; Grupe et al. 2001; Le Hellard et al. 2002, and Sham et al. 2002). For example, Le Hellard et al. (2002) reported that using the SNaPshot™ Method, which is based on allele-specific extension or minisequencing from a primer adjacent to the site of the SNP, the standard deviations for estimating allele frequency are from 1% to 4% depending on the specific markers being tested. Therefore, we also incorporate measurement errors into our approach. Our numerical results show that the sample size required to attain the desired significance level and power using pooled DNA samples may be seriously affected when the error rates are not negligible.

Genetic Models and Measurement Error Models

Consider a disease locus with two alleles D and d, and a marker locus with two alleles A and a. Assume that the penetrance is $f_2$ for genotype DD, $f_1$ for genotype Dd, and $f_0$ for genotype dd. Denote the frequency of allele A in the families of structure (r, s, a) by $p_{(r,s,a)}$, where r (= 0, 1, 2) is the number of available parents in a family, s (= 1, 2, …) is the number of siblings, and a (= 1, …, s) is the number of affected siblings. Let $q_{(r,s,a)} = 1 - p_{(r,s,a)}$. In this paper, we assume Hardy-Weinberg equilibrium for parents. If r = 1, that is, for the family type with only one parent, we further assume random mating between the two parents, and that the parents are missing at random. This is a reasonable assumption if the parental missingness is not related to the phenotype. For the ith family of structure (r, s, a), where $i = 1, \ldots, n_{(r,s,a)}$ and $n_{(r,s,a)}$ is the number of families of type (r, s, a) in the sample, let $X_{(r,s,a)i}^{(j)}$ be the true (unobservable) number of copies of allele A at the marker locus for the jth (j = 1, …, a) affected sibling, $Y_{(r,s,a)i}^{(j)}$ be the true (unobservable) number of copies of allele A for the jth (j = 1, …, s − a) unaffected sibling, and $Z_{(r,s,a)i}^{(j)}$ be the true (unobservable) number of copies of allele A for the jth (j = 1, …, r) parent, who may be the father or mother. Here, to simplify our analysis, we treat the parental phenotypes as unknown. Let n be the total number of families drawn at random from the ascertainment subpopulation, which consists of all families with at least one affected sibling; then the $n_{(r,s,a)}$ (r = 0, 1, 2; s = 1, 2, …; a = 1, …, s) are random variables that satisfy $\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} n_{(r,s,a)} = n$. For each family type we form two pools, one consisting of affected siblings, and the other consisting of unaffected siblings and available parents. Here, we discard the families in which all children are affected and no parents are available, because the control group cannot be formed for such families.

For a genetic model, denote the mean and variance of $X_{(r,s,a)i}^{(j)}$ by $\mu_{(r,s,a)X}$ and $\sigma^2_{(r,s,a)X}$, and the covariance and mixed second moment of $X_{(r,s,a)i}^{(j)}$ and $X_{(r,s,a)i}^{(k)}$ ($j \neq k$) by $\gamma_{(r,s,a)XX}$ and $\Delta_{(r,s,a)XX}$, respectively. Other notations are defined similarly. The formulae for calculating these means, variances, covariances and mixed second moments are provided in Appendix A. To consider measurement errors, we assume the following measurement error models

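$$
\begin{cases}
\hat p_{(r,s,a)A} = \dfrac{\sum_{i=1}^{n_{(r,s,a)}} \sum_{j=1}^{a} X_{(r,s,a)i}^{(j)}}{2\, n_{(r,s,a)}\, a} + \xi_{(r,s,a)}, \\[2ex]
\hat p_{(r,s,a)U} = \dfrac{\sum_{i=1}^{n_{(r,s,a)}} \left[ \sum_{j=1}^{s-a} Y_{(r,s,a)i}^{(j)} + \sum_{j=1}^{r} Z_{(r,s,a)i}^{(j)} \right]}{2\, n_{(r,s,a)}\, (s-a+r)} + \eta_{(r,s,a)},
\end{cases}
\tag{1}
$$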

where $\hat p_{(r,s,a)A}$ is the sample frequency of allele A among the affected siblings of family type (r, s, a), and $\hat p_{(r,s,a)U}$ is the sample frequency of allele A among the unaffected siblings and available parents of family type (r, s, a). Given $X_{(r,s,a)i}^{(j)}$, $Y_{(r,s,a)i}^{(j)}$ and $Z_{(r,s,a)i}^{(j)}$, $\xi_{(r,s,a)}$ and $\eta_{(r,s,a)}$ are independent normal random variables with mean 0 and variance $\varepsilon^2$. Here we assume that for the DNA pooling technology, the standard deviation ε is not affected by family structures or true allele proportions. However, ε may be related to these factors in practice. In this case, we need to replace ε by $\varepsilon_{(r,s,a)}$ for the family type (r, s, a) in the formulae below.
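As a rough illustration of model (1), the following minimal Python sketch computes a pooled allele-frequency estimate as the true allele proportion in the pool plus a normal error with standard deviation ε. The function name `pooled_frequency`, the example counts and the seed are assumptions made for illustration only; they are not taken from the original analysis.

```python
import random

def pooled_frequency(allele_counts, epsilon, rng):
    """Pooled estimate of the allele A frequency for one pool.

    allele_counts: number of copies of allele A (0, 1 or 2) per pooled individual.
    epsilon: standard deviation of the measurement error, as in model (1).
    """
    true_freq = sum(allele_counts) / (2 * len(allele_counts))
    # Additive N(0, epsilon^2) measurement error, independent of the genotypes.
    return true_freq + rng.gauss(0.0, epsilon)

rng = random.Random(1)
counts = [2, 1, 0, 1, 1, 2, 0, 1]  # hypothetical pool of 8 individuals
exact = pooled_frequency(counts, 0.0, rng)   # no measurement error
noisy = pooled_frequency(counts, 0.01, rng)  # epsilon = 0.01
```

With ε = 0 the estimate equals the true pool proportion; with ε > 0 it scatters around that value.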

Statistical Tests Combining Families of Different Structures

To test the null hypothesis of H0: no association between the marker and disease, we form our pools using families with identical structures. To simplify our presentation, we consider the case of perfect linkage disequilibrium between the marker locus and disease locus, i.e., the two loci are identical. More general situations are discussed in the Discussion Section. As in Risch & Teng (1998), a one-sided test will be used. Consider the following general weighting scheme combining information from various family structures:

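$$
T = \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} w_{(r,s,a)} \left( \hat p_{(r,s,a)A} - \hat p_{(r,s,a)U} \right),
$$

where the $w_{(r,s,a)}$ are weights to be discussed below. It can be seen that under the null hypothesis H0, $\hat p_{(r,s,a)A} - \hat p_{(r,s,a)U}$ has mean 0, and variance

$$
V\left( \hat p_{(r,s,a)A} - \hat p_{(r,s,a)U} \right)
= \frac{1}{4} E_1\!\left[ \frac{1}{n_{(r,s,a)}} \right] p_{(r,s,a)}\, q_{(r,s,a)} \left[ \frac{1}{a} + \frac{s-a+2r-r^2}{(s-a+r)^2} \right] + 2\varepsilon^2
\equiv \frac{1}{4} E_1\!\left[ \frac{1}{n_{(r,s,a)}} \right] \Delta_{(r,s,a)0} + 2\varepsilon^2
\equiv \sigma^2_{(r,s,a)0},
$$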

which can be estimated by

$$
\hat\sigma^2_{(r,s,a)0} = \frac{\hat p_{(r,s,a)} \left( 1 - \hat p_{(r,s,a)} \right)}{4\, n_{(r,s,a)}} \left[ \frac{1}{a} + \frac{s-a+2r-r^2}{(s-a+r)^2} \right] + 2\varepsilon^2,
$$

where

$$
\hat p_{(r,s,a)} = \frac{a\, \hat p_{(r,s,a)A} + (s-a+r)\, \hat p_{(r,s,a)U}}{s+r}
$$

is the sample frequency of allele A for family type (r, s, a), and $E_1$ denotes the expectation over all possible values of $n_{(r,s,a)}$. Other methods of estimating the frequency of allele A for family type (r, s, a) under H0 are possible (cf. Risch & Teng, 1998). So under H0,

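$$
V(T) = \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} w_{(r,s,a)}^2\, \sigma^2_{(r,s,a)0}.
$$

The optimal value of $w_{(r,s,a)}$ in the sense of minimizing V(T) is given by

$$
w_{(r,s,a)} = \frac{1/\sigma^2_{(r,s,a)0}}{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} 1/\sigma^2_{(r,s,a)0}}.
$$

That is, the weights should be inversely proportional to the corresponding variances. The minimal variance is given by

$$
V(T) = \frac{1}{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} 1/\sigma^2_{(r,s,a)0}}.
$$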

Therefore, we propose the following test statistic for H0:

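$$
t = \frac{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \dfrac{1}{\hat\sigma^2_{(r,s,a)0}} \left( \hat p_{(r,s,a)A} - \hat p_{(r,s,a)U} \right)}{\sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \dfrac{1}{\hat\sigma^2_{(r,s,a)0}}}}.
$$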
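The inverse-variance weighting above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the statistic is the weighted difference divided by the square root of the sum of inverse variances (so that it has unit variance under H0) and that each pool is summarized by its two frequency estimates and its family count. The names `null_variance` and `combined_statistic` and the example numbers are hypothetical.

```python
import math

def null_variance(p_a, p_u, n, r, s, a, epsilon):
    """Estimated null variance of p_a - p_u for family type (r, s, a)."""
    c = s - a + r  # individuals per family contributing to the control pool
    p_bar = (a * p_a + c * p_u) / (s + r)  # pooled allele frequency under H0
    shape = 1.0 / a + (s - a + 2 * r - r * r) / (c * c)
    return p_bar * (1.0 - p_bar) / (4.0 * n) * shape + 2.0 * epsilon ** 2

def combined_statistic(pools, epsilon):
    """Inverse-variance weighted statistic over family types.

    pools: list of dicts with keys p_a, p_u, n, r, s, a (one dict per family type).
    """
    numerator = 0.0
    inverse_variance_sum = 0.0
    for pool in pools:
        v = null_variance(epsilon=epsilon, **pool)
        numerator += (pool["p_a"] - pool["p_u"]) / v
        inverse_variance_sum += 1.0 / v
    return numerator / math.sqrt(inverse_variance_sum)

# Hypothetical data: 100 families of type (0, 3, 1), no measurement error.
pool = {"p_a": 0.30, "p_u": 0.20, "n": 100, "r": 0, "s": 3, "a": 1}
t_single = combined_statistic([pool], epsilon=0.0)
t_double = combined_statistic([pool, pool], epsilon=0.0)
```

Family types with c = s − a + r = 0 have no control pool and must be excluded before calling these functions, in line with the families discarded in the text.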

Using a one-sided test and assuming asymptotic normality (a proof on the asymptotic normality of the test statistic t under H0 is provided in Appendix B), the power to reject the null hypothesis with significance level α can be approximated by

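$$
\Phi\left( \frac{ -z_\alpha \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} 1/\tilde\sigma^2_{(r,s,a)0}} + \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \mu_{(r,s,a)}/\tilde\sigma^2_{(r,s,a)0} }{ \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \sigma^2_{(r,s,a)}/\tilde\sigma^4_{(r,s,a)0}} } \right),
$$

where $\mu_{(r,s,a)}$ and $\sigma^2_{(r,s,a)}$ are the mean and variance of the difference between the two allele frequency estimates $\hat p_{(r,s,a)A}$ and $\hat p_{(r,s,a)U}$ under the alternative hypothesis H1, respectively, whose expressions are given by (A.22) and (A.24) in Appendix A, and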

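$$
\tilde\sigma^2_{(r,s,a)0} = E_1\!\left[ \frac{1}{n_{(r,s,a)}} \right] \frac{\tilde p_{(r,s,a)} \left( 1 - \tilde p_{(r,s,a)} \right)}{4} \left[ \frac{1}{a} + \frac{s-a+2r-r^2}{(s-a+r)^2} \right] + 2\varepsilon^2
\equiv \frac{1}{4} E_1\!\left[ \frac{1}{n_{(r,s,a)}} \right] \tilde\Delta_{(r,s,a)} + 2\varepsilon^2,
$$

with

$$
\tilde p_{(r,s,a)} = \frac{a}{s+r}\, \mu_{(r,s,a)} + \frac{(s-a)\, \mu_{(r,s,a)Y} + r\, \mu_{(r,s,a)Z}}{2(s-a+r)}
$$

being the expected frequency of allele A in family type (r, s, a) under H1, Φ the cumulative standard normal distribution function, and $z_\alpha$ the upper 100α percentile of the standard normal distribution. If the penetrances are low, then $\tilde p_{(r,s,a)}$ simplifies to

$$
\tilde p_{(r,s,a)} = \frac{a}{s+r}\, \mu_{(r,s,a)} + \sum_{u,v} \frac{u+v}{4}\, m_{uv}^{(r,s,a)},
$$

where $m_{uv}^{(r,s,a)}$ is the conditional probability of the mating type of parents, G = (u, v), given the family type being (r, s, a). The sample size necessary to obtain a power of 1 − β with a significance level of α satisfies

$$
z_\alpha \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \frac{1}{\tilde\sigma^2_{(r,s,a)0}}} - \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \frac{\mu_{(r,s,a)}}{\tilde\sigma^2_{(r,s,a)0}} = z_{1-\beta} \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \frac{\sigma^2_{(r,s,a)}}{\tilde\sigma^4_{(r,s,a)0}}}.
\tag{2}
$$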

In particular, when there is no measurement error, i.e., ε = 0, we have

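$$
n = \left[ \frac{ z_\alpha \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \dfrac{\lambda_{(r,s,a)}}{\tilde\Delta_{(r,s,a)}}} - z_{1-\beta} \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \dfrac{\lambda_{(r,s,a)}\, \Delta^*_{(r,s,a)}}{\tilde\Delta^2_{(r,s,a)}}} }{ 2 \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} \dfrac{\lambda_{(r,s,a)}\, \mu_{(r,s,a)}}{\tilde\Delta_{(r,s,a)}} } \right]^2,
$$

where $\Delta^*_{(r,s,a)}$ is given in (A.24), $\lambda_{(r,s,a)}$ is the proportion of families with type (r, s, a) in the ascertainment subpopulation, and we have used the first-order approximation of $E_1[1/n_{(r,s,a)}]$ (see (A.25) and (A.26) in Appendix A). Note that the resulting sample size n includes uninformative families, i.e. those families with type (0, a, a).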

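If we use the weights suggested by Risch & Teng (1998), i.e. $w_{(r,s,a)}$ is proportional to $a\, n_{(r,s,a)}$, then the test statistic for H0 is

$$
t_{RT} = \frac{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a\, n_{(r,s,a)} \left( \hat p_{(r,s,a)A} - \hat p_{(r,s,a)U} \right)}{\sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a^2\, n_{(r,s,a)}^2\, \hat\sigma^2_{(r,s,a)0}}}.
$$

The corresponding power to reject the null hypothesis with significance level α is given by

$$
\Phi\left( \frac{ -z_\alpha \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a^2 \lambda_{(r,s,a)}^2\, \tilde\sigma^2_{(r,s,a)0}} + \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a\, \lambda_{(r,s,a)}\, \mu_{(r,s,a)} }{ \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a^2 \lambda_{(r,s,a)}^2\, \sigma^2_{(r,s,a)}} } \right).
$$

The sample size necessary to obtain a power of 1 − β with a significance level of α satisfies

$$
z_\alpha \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a^2 \lambda_{(r,s,a)}^2\, \tilde\sigma^2_{(r,s,a)0}} - \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a\, \lambda_{(r,s,a)}\, \mu_{(r,s,a)} = z_{1-\beta} \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a^2 \lambda_{(r,s,a)}^2\, \sigma^2_{(r,s,a)}}.
\tag{3}
$$

For the case of ε = 0, the sample size required is

$$
n = \left[ \frac{ z_\alpha \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a^2 \lambda_{(r,s,a)}\, \tilde\Delta_{(r,s,a)}} - z_{1-\beta} \sqrt{\sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a^2 \lambda_{(r,s,a)}\, \Delta^*_{(r,s,a)}} }{ 2 \sum_{r=0}^{2}\sum_{s=1}^{\infty}\sum_{a=1}^{s} a\, \lambda_{(r,s,a)}\, \mu_{(r,s,a)} } \right]^2.
$$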
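The two closed-form sample sizes for ε = 0 can be evaluated side by side. The sketch below is only an illustration: it assumes the per-family-type quantities λ, μ, Δ̃ and Δ* have already been computed from Appendix A and are passed in as lists, and the function names and numerical inputs are hypothetical. Here $z_{1-\beta}$ is read as the upper 100(1 − β) percentile, which is negative for β < 0.5.

```python
import math
from statistics import NormalDist

def upper_percentile(tail):
    """z such that P(Z > z) = tail for a standard normal Z."""
    return -NormalDist().inv_cdf(tail)

def n_optimal(lam, mu, d_tilde, d_star, alpha=5e-8, beta=0.2):
    """Sample size under inverse-variance weights when epsilon = 0."""
    z_a, z_1b = upper_percentile(alpha), upper_percentile(1 - beta)
    a_term = math.sqrt(sum(l / dt for l, dt in zip(lam, d_tilde)))
    c_term = math.sqrt(sum(l * ds / dt ** 2
                           for l, ds, dt in zip(lam, d_star, d_tilde)))
    b_term = sum(l * m / dt for l, m, dt in zip(lam, mu, d_tilde))
    return ((z_a * a_term - z_1b * c_term) / (2 * b_term)) ** 2

def n_risch_teng(lam, mu, d_tilde, d_star, a, alpha=5e-8, beta=0.2):
    """Sample size under weights proportional to a * n when epsilon = 0."""
    z_a, z_1b = upper_percentile(alpha), upper_percentile(1 - beta)
    a_term = math.sqrt(sum(k * k * l * dt for k, l, dt in zip(a, lam, d_tilde)))
    c_term = math.sqrt(sum(k * k * l * ds for k, l, ds in zip(a, lam, d_star)))
    b_term = sum(k * l * m for k, l, m in zip(a, lam, mu))
    return ((z_a * a_term - z_1b * c_term) / (2 * b_term)) ** 2

# Hypothetical inputs for a single family type; both schemes must then agree.
n_opt = n_optimal([1.0], [0.05], [0.30], [0.28])
n_rt = n_risch_teng([1.0], [0.05], [0.30], [0.28], a=[2])
```

With a single family type the weights are irrelevant, so the two formulas return the same n, which gives a quick consistency check before mixing family types.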

Numerical Results and Simulation Study

Now we consider an example given by Risch & Teng (1998) to (i) compare the sample sizes required to detect association under our weighting design and the weighting scheme suggested by Risch & Teng (1998); (ii) illustrate the impact of measurement errors on sample size; and (iii) compare the sample sizes required to detect association only using families of the same structures and combining different family structures. In this regard, we should note that because combining different family types needs more pools, our method is slightly more expensive.

We consider two types of family structures: (a) (0, 3, 2), i.e., two affected children and one unaffected child, and no parents; and (b) (0, 3, 1), i.e., one affected and two unaffected children, and no parents. From formulae (2) and (3) we calculate the sample size necessary to attain the significance level of $\alpha = 5 \times 10^{-8}$ and power of 1 − β = 80%, the levels suggested by Risch & Merikangas (1996) for a genome scan, under various genetic models and measurement errors. The results are presented in Table 2 for low penetrances and Table 3 for high penetrances. Note that the sample sizes provided in the tables are the numbers of families required. It should be mentioned that the sample sizes obtained in our calculations do not include uninformative families with the structure (0, a, a) because there are no such families in the population we considered. Based on the results in these tables, we can see the following. (i) For low penetrances, the sample sizes required under our weighting scheme and that suggested by Risch & Teng (1998) are almost the same; both are close to the case of using only families of type (0, 3, 1). This can be easily understood by noting that for low penetrances, there are many more families with structure (0, 3, 1) than with structure (0, 3, 2). For high penetrances, the sample sizes required under our weighting scheme are generally smaller than those under that of Risch & Teng (1998). The difference is largest for dominant models, and smallest for recessive models. It can also be observed that for the recessive model, when the allele frequency is not large, the weighting method of Risch & Teng is even slightly better, although the difference is small (the largest relative difference is about 5%). This is not surprising because our weighting scheme will not necessarily result in a uniformly optimal power. (ii) The impact of measurement errors on sample size is generally large. Relative to the case of low penetrances, for high penetrances the impact is not very large when the error rates are small and the allele frequencies are not small. But the impact can be substantial for moderate error rates (ε = 0.01) or small allele frequencies, especially for low penetrances. In these cases, there is a dramatic increase in sample sizes.

Table 2.

Sample size required to detect genetic associations combining different family structures for low penetrances*

Model, p          ε = 0            ε = 0.005        ε = 0.01
Dominant
  p = 0.05        532 (533)        1702 (1701)      ∞** (∞)
  p = 0.20        355 (357)        456 (457)        2751 (2969)
  p = 0.70        4286 (4306)      ∞ (∞)            ∞ (∞)
Recessive
  p = 0.05        59117 (59092)    ∞ (∞)            ∞ (∞)
  p = 0.20        1494 (1495)      45486 (∞)        ∞ (∞)
  p = 0.70        270 (270)        353 (353)        3312 (4511)
Multiplicative
  p = 0.05        2026 (2030)      ∞ (∞)            ∞ (∞)
  p = 0.20        653 (654)        1135 (1135)      ∞ (∞)
  p = 0.70        639 (640)        1256 (1255)      ∞ (∞)
Additive
  p = 0.05        1209 (1212)      ∞ (∞)            ∞ (∞)
  p = 0.20        524 (525)        788 (789)        97207 (∞)
  p = 0.70        984 (986)        3547 (3571)      ∞ (∞)

* The values in brackets are based on the weighting scheme suggested by Risch & Teng (1998).

** ∞ means that 80% power cannot be attained or the sample size required is unrealistically large (greater than 100 000).

Significance level $\alpha = 5 \times 10^{-8}$; power 1 − β = 0.80; Dominant model: $f_2 = f_1 = 0.004$, $f_0 = 0.001$; Recessive model: $f_2 = 0.004$, $f_1 = f_0 = 0.001$; Multiplicative model: $f_2 = 0.004$, $f_1 = 0.002$, $f_0 = 0.001$; Additive model: $f_2 = 0.004$, $f_1 = 0.0025$, $f_0 = 0.001$.

Table 3.

Sample size required to detect genetic associations combining different family structures for high penetrances*

Model, p          ε = 0            ε = 0.005        ε = 0.01
Dominant
  p = 0.05        338 (373)        537 (500)        2124 (∞**)
  p = 0.20        185 (217)        202 (232)        269 (290)
  p = 0.70        1746 (2104)      4663 (5890)      ∞ (∞)
Recessive
  p = 0.05        48005 (45598)    ∞ (∞)            ∞ (∞)
  p = 0.20        1139 (1126)      2238 (2121)      ∞ (∞)
  p = 0.70        155 (159)        167 (170)        215 (216)
Multiplicative
  p = 0.05        1513 (1669)      ∞ (∞)            ∞ (∞)
  p = 0.20        443 (486)        564 (587)        1458 (1568)
  p = 0.70        333 (358)        385 (409)        704 (724)
Additive
  p = 0.05        869 (967)        4448 (9965)      ∞ (∞)
  p = 0.20        335 (376)        398 (430)        762 (752)
  p = 0.70        485 (534)        598 (650)        1804 (1867)

* The values in brackets are based on the weighting scheme suggested by Risch & Teng (1998).

** ∞ means that 80% power cannot be attained or the sample size required is unrealistically large (greater than 100 000).

Significance level $\alpha = 5 \times 10^{-8}$; power 1 − β = 0.80; Dominant model: $f_2 = f_1 = 0.4$, $f_0 = 0.1$; Recessive model: $f_2 = 0.4$, $f_1 = f_0 = 0.1$; Multiplicative model: $f_2 = 0.4$, $f_1 = 0.2$, $f_0 = 0.1$; Additive model: $f_2 = 0.4$, $f_1 = 0.25$, $f_0 = 0.1$.

To compare the sample sizes required by the design combining different family structures, and by the design only using families of the same structures, we further calculate the sample sizes by using the families with types (0, 3, 2) and (0, 3, 1) for the case of no measurement errors, respectively. The results for high penetrances are provided in Table 4. It is clear from this table that the sample sizes required for incorporating different family structures are between those separately using families with type (0, 3, 2) or (0, 3, 1). One family structure is not always preferable over the other, and the relative information for disease association depends on specific disease models. Similar conclusions can be drawn for low penetrances. Therefore, there is added benefit in incorporating various family structures when the mode of inheritance is unknown.

Table 4.

Sample sizes required to detect genetic associations combining different family structures and using only the same family structures for high penetrances* and no measurement errors

Model, p          Using both (0, 3, 2) and (0, 3, 1)    Using only (0, 3, 2)    Using only (0, 3, 1)
Dominant
  p = 0.05        338 (373)                             208                     355
  p = 0.20        185 (217)                             187                     184
  p = 0.70        1746 (2104)                           2539                    1417
Recessive
  p = 0.05        48005 (45598)                         17237                   55431
  p = 0.20        1139 (1126)                           534                     1275
  p = 0.70        155 (159)                             150                     152
Multiplicative
  p = 0.05        1513 (1669)                           1116                    1556
  p = 0.20        443 (486)                             359                     459
  p = 0.70        333 (358)                             348                     322
Additive
  p = 0.05        869 (967)                             615                     898
  p = 0.20        335 (376)                             299                     342
  p = 0.70        485 (534)                             534                     459

* The values in brackets are based on the weighting scheme suggested by Risch & Teng (1998). Significance level $\alpha = 5 \times 10^{-8}$; power 1 − β = 0.80; Dominant model: $f_2 = f_1 = 0.4$, $f_0 = 0.1$; Recessive model: $f_2 = 0.4$, $f_1 = f_0 = 0.1$; Multiplicative model: $f_2 = 0.4$, $f_1 = 0.2$, $f_0 = 0.1$; Additive model: $f_2 = 0.4$, $f_1 = 0.25$, $f_0 = 0.1$.

We conduct some simulation studies to confirm our large sample results. We first generate the genotypes of parents assuming Hardy-Weinberg equilibrium, and the genotypes of three children assuming Mendelian transmission. Using the penetrances $f_2$, $f_1$ and $f_0$ we simulate the disease status of each child. For a given sample size we confine ourselves to the families with one or two affected children. The test statistic t is used to calculate the empirical type I error rate and power. Note that to see whether our method leads to a correct type I error rate, a very large number of simulations would be needed at $\alpha = 5 \times 10^{-8}$. So we consider the nominal significance level of α = 0.05 instead. The empirical type I error rate is the proportion of significant replicates out of the total number of replicates under H0. By making use of the sample sizes suggested by the asymptotic power approximations (when the sample size suggested is ∞, we do not report power), we can calculate the empirical power and hence check whether a power of 80% can be attained. The empirical power is the proportion of significant replicates out of the total number of replicates under H1. Based on 500 replicates (100 replicates for the case of the recessive model and very low allele frequency, p = 0.05) our results are summarized in Table 5 for empirical type I error rate and in Table 6 for empirical power. It can be seen that the empirical type I error rates and empirical powers are generally close to the significance level of 0.05 and power of 80%, respectively.
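The family-generation step of such a simulation might look like the following minimal sketch. It assumes, as in the perfect linkage disequilibrium setting used for the numerical results, that the marker and disease loci coincide, so a single genotype drives both the marker score and the disease status. The function name `simulate_family`, the seed and the parameter values are hypothetical and are not the authors' simulation code.

```python
import random

def simulate_family(p, penetrance, rng, n_children=3):
    """Simulate one nuclear family with unobserved parents.

    p: frequency of the risk allele; parents are drawn under Hardy-Weinberg
       equilibrium and random mating.
    penetrance: (f0, f1, f2), indexed by the number of risk alleles carried.
    Returns (genotypes, affected) for the children.
    """
    def parent():
        # Two independently drawn alleles; True marks the risk allele.
        return (rng.random() < p, rng.random() < p)

    father, mother = parent(), parent()
    genotypes, affected = [], []
    for _ in range(n_children):
        # Mendelian transmission: each parent passes one allele at random.
        g = int(rng.choice(father)) + int(rng.choice(mother))
        genotypes.append(g)
        affected.append(rng.random() < penetrance[g])
    return genotypes, affected

rng = random.Random(0)
families = [simulate_family(0.2, (0.1, 0.4, 0.4), rng) for _ in range(1000)]
# Keep family types (0, 3, 1) and (0, 3, 2): one or two affected children.
kept = [fam for fam in families if sum(fam[1]) in (1, 2)]
```

The retained families would then be split into affected and unaffected pools per family type before the statistic t is computed.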

Table 5.

Empirical type I error rate for prevalence of 0.1*

Allele frequency    ε = 0      ε = 0.005    ε = 0.01
p = 0.05            0.064      0.049        0.057
p = 0.20            0.048      0.042        0.044
p = 0.70            0.060      0.070        0.068

* The critical value is 1.6449 (which corresponds to the significance level of 0.05 under normality), and the sample size is 200.

Table 6.

Empirical power using sample sizes obtained through asymptotic approximation for high penetrances*

Model, p          ε = 0      ε = 0.005    ε = 0.01
Dominant
  p = 0.05        0.840      0.842        0.866
  p = 0.20        0.840      0.846        0.840
  p = 0.70        0.896      0.896        –
Recessive
  p = 0.05        0.780      –            –
  p = 0.20        0.808      0.702        –
  p = 0.70        0.796      0.836        0.804
Multiplicative
  p = 0.05        0.872      –            –
  p = 0.20        0.796      0.830        0.852
  p = 0.70        0.790      0.818        0.814
Additive
  p = 0.05        0.774      0.796        –
  p = 0.20        0.800      0.764        0.890
  p = 0.70        0.780      0.730        0.816

* The critical value is 5.3267 (which corresponds to the significance level of $5 \times 10^{-8}$ under normality); a dash marks cells for which power is not reported because the suggested sample size in Table 3 is ∞; Dominant model: $f_2 = f_1 = 0.4$, $f_0 = 0.1$; Recessive model: $f_2 = 0.4$, $f_1 = f_0 = 0.1$; Multiplicative model: $f_2 = 0.4$, $f_1 = 0.2$, $f_0 = 0.1$; Additive model: $f_2 = 0.4$, $f_1 = 0.25$, $f_0 = 0.1$.
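The two critical values quoted in the table notes can be checked with the Python standard library. The helper name `one_sided_critical_value` is an assumption for illustration.

```python
from statistics import NormalDist

def one_sided_critical_value(alpha):
    """Upper 100*alpha percentile of the standard normal distribution."""
    # By symmetry, the upper-tail quantile is minus the lower-tail quantile.
    return -NormalDist().inv_cdf(alpha)

z_nominal = one_sided_critical_value(0.05)  # level used for Table 5
z_genome = one_sided_critical_value(5e-8)   # level used for Table 6
```

Using the symmetric lower-tail quantile avoids forming 1 − α in floating point, which matters little at 0.05 but is safer for very small α.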

Discussion

In this article, we have developed a general weighting scheme to combine families of different structures in the detection of genetic associations using DNA pooling through family-based association designs. In addition, we explicitly modelled the measurement errors in our approach. It is observed that our weighting scheme is usually better than that suggested by Risch & Teng (1998). In the example we considered, where the families have two different types of structures, the efficiency of the design combining families of different structures is always between those of the designs only using families with one of the two structures. However, because it is generally much easier to collect families of different structures in practice and, more importantly, the informativeness of each family structure depends on the disease model, which is often unknown, we advocate the use of a study design that maximizes the usage of available family data. We also studied the impact of measurement errors on the sample size required. Our numerical results showed that, similar to the case of pooled population data (Zou & Zhao, 2004), the sample size required to attain a desired significance level and power using pooled DNA may significantly increase as the measurement errors increase for family-based association tests. However, such impact can be reduced if multiple replicates of each pooled sample are measured. For example, if four replicate measurements are used and, accordingly, $\hat p_{(r,s,a)A}$ and $\hat p_{(r,s,a)U}$ are replaced by the averages of these four measurements, then the standard deviation of the measurement error is halved, from ε to ε/2, and all formulae in the paper still hold. Thus, if ε = 0.01, then the standard deviation after four replicate measurements will be 0.005. From Tables 2 and 3 we see that the sample sizes required are greatly reduced. As a result of measurement errors it is possible that a specified power, e.g. 80%, may never be achieved under certain disease models (see results in Tables 2 and 3). Therefore, our analysis emphasizes the importance of reducing measurement errors in DNA pooling studies. Note that in our discussion the standard deviation ε is assumed to be known. If ε is unknown then we can infer it from laboratory experiments or from the distributions of the test statistics (Jawaid et al. 2002). Although ε cannot be determined precisely, our findings based on the asymptotic results and simulation studies in the Numerical Results and Simulation Study section suggest that for high penetrances the effect of a minor misspecification of ε (for example, the estimate of ε is 0.005 but ε = 0.0075 in reality) on the association test is not very large. Relatively speaking, the effect is slightly larger for low allele frequencies. However, for low penetrances such an effect can be large (data not shown).
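The effect of replicate measurements follows from a one-line variance calculation. As an assumption made here for illustration, let the k replicate errors $\xi_1, \ldots, \xi_k$ of a pool be independent with common variance $\varepsilon^2$. Then

$$
\operatorname{Var}\left( \frac{1}{k} \sum_{m=1}^{k} \xi_m \right) = \frac{1}{k^2} \cdot k\, \varepsilon^2 = \frac{\varepsilon^2}{k},
\qquad\text{so the standard deviation is } \frac{\varepsilon}{\sqrt{k}}.
$$

With k = 4 and ε = 0.01 this gives 0.01/2 = 0.005. As a concrete reading of Table 2, for the dominant model with p = 0.20 and low penetrances the required number of families is 2751 at ε = 0.01 and 456 at ε = 0.005.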

It should be pointed out that we have assumed that the parental phenotypes are unknown in order to simplify our analysis. If the parental disease prevalence is low, then our results are close to the case of unaffected parents. Generally we can use separate pools for families with affected parents and for families with unaffected parents. More precisely we can consider the following family types separately: two affected parents, two unaffected parents, and one affected parent and one unaffected parent for the families with two parents; one affected parent, and one unaffected parent for the families with only one parent; and the families with no parents. Such consideration should provide additional information. The analytical details can be given along the line devised here. However, this will be more complicated and remains to be studied in our future work.

Throughout, we have assumed perfect linkage disequilibrium between the disease locus and marker locus. However, it is more likely that the marker being examined is in incomplete linkage disequilibrium with the genetic variant of interest. In this case it is necessary to derive the penetrances of the genotypes at the marker locus for each family structure (r, s, a). Then all the formulae obtained previously can be used. Risch & Teng’s (1998) results can serve this purpose, although the problem will be more difficult if several markers are considered together. In fact, let $p_{(r,s,a)}$ and $q_{(r,s,a)}$ be the frequencies of alleles D and d at the disease locus, and $f_2$, $f_1$ and $f_0$ still be the penetrances of genotypes DD, Dd and dd, respectively. Further, let $p'_{(r,s,a)}$ and $q'_{(r,s,a)}$ be the frequencies of alleles A and a at the marker locus, and $f'_2$, $f'_1$ and $f'_0$ be the penetrances of genotypes AA, Aa and aa, respectively. If we use Bengtsson & Thomson’s (1981) definition of the linkage disequilibrium parameter δ:

$$
\delta = \frac{P(A \mid D) - P(A)}{1 - P(A)},
$$

then from Risch & Teng (1998) we have

$$
P(D \mid A) = p_{(r,s,a)} + \frac{p_{(r,s,a)}\, q'_{(r,s,a)}}{p'_{(r,s,a)}}\, \delta_{(r,s,a)},
$$
$$
P(d \mid A) = q_{(r,s,a)} - \frac{p_{(r,s,a)}\, q'_{(r,s,a)}}{p'_{(r,s,a)}}\, \delta_{(r,s,a)},
$$
$$
P(D \mid a) = p_{(r,s,a)} - p_{(r,s,a)}\, \delta_{(r,s,a)},
$$

and

$$
P(d \mid a) = q_{(r,s,a)} + p_{(r,s,a)}\, \delta_{(r,s,a)},
$$

where $\delta_{(r,s,a)}$ is the linkage disequilibrium measure in the family structure (r, s, a). Therefore,

$$
f'_2 = f_2 P^2(D \mid A) + 2 f_1 P(D \mid A) P(d \mid A) + f_0 P^2(d \mid A),
$$
$$
f'_1 = f_2 P(D \mid A) P(D \mid a) + f_1 \left[ P(D \mid A) P(d \mid a) + P(d \mid A) P(D \mid a) \right] + f_0 P(d \mid A) P(d \mid a),
$$

and

$$
f'_0 = f_2 P^2(D \mid a) + 2 f_1 P(D \mid a) P(d \mid a) + f_0 P^2(d \mid a).
$$

Note that $f'_2$, $f'_1$ and $f'_0$ may depend on the family structure. But the formulae in this paper can still be used in this case as long as we substitute them for $f_2$, $f_1$ and $f_0$, respectively.
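The substitution can be made concrete with a short Python sketch that maps disease-locus penetrances to marker-locus penetrances for a given disequilibrium parameter δ. The function name `marker_penetrances`, its argument names and the example values are hypothetical; the arithmetic follows the conditional probabilities and penetrance expressions given above for one family structure.

```python
def marker_penetrances(f2, f1, f0, p, p_marker, delta):
    """Penetrances of marker genotypes AA, Aa, aa under incomplete LD.

    p: frequency of disease allele D; p_marker: frequency of marker allele A;
    delta: linkage disequilibrium parameter (Bengtsson & Thomson definition).
    """
    q = 1.0 - p
    q_marker = 1.0 - p_marker
    shift = p * q_marker * delta / p_marker
    pD_A, pd_A = p + shift, q - shift              # P(D|A), P(d|A)
    pD_a, pd_a = p - p * delta, q + p * delta      # P(D|a), P(d|a)
    f2_m = f2 * pD_A ** 2 + 2 * f1 * pD_A * pd_A + f0 * pd_A ** 2
    f1_m = (f2 * pD_A * pD_a
            + f1 * (pD_A * pd_a + pd_A * pD_a)
            + f0 * pd_A * pd_a)
    f0_m = f2 * pD_a ** 2 + 2 * f1 * pD_a * pd_a + f0 * pd_a ** 2
    return f2_m, f1_m, f0_m

# No disequilibrium: every marker genotype has the population prevalence.
no_ld = marker_penetrances(0.4, 0.2, 0.1, p=0.2, p_marker=0.3, delta=0.0)
# Complete disequilibrium with equal allele frequencies: marker = disease locus.
full_ld = marker_penetrances(0.4, 0.2, 0.1, p=0.2, p_marker=0.2, delta=1.0)
```

The two limiting cases behave as expected: δ = 0 returns the population prevalence for all three marker genotypes, and δ = 1 with equal allele frequencies returns the original penetrances.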

In this paper we have assumed the random missingness of parental genotypes so that the available and missing parents have the same marker score distributions. This is plausible if the parental missingness is not related to the phenotype. For example, the random missingness assumption holds if we are unable to locate the parents because of death from some other disease or accident. However, in the situation where the missingness of a parent is related to the phenotype, this assumption may not be reasonable. For instance, in a study of genetic factors in an aggressive form of cancer, it is more likely that parents carrying the disease-predisposing allele are missing. A detailed discussion can be found in Allen et al. (2003). The construction of appropriate test statistics under this scenario warrants further research.

Acknowledgments

This work was supported in part by grant GM59507 from the National Institutes of Health and by grant No. 70221001 from the National Natural Science Foundation of China. The authors thank two reviewers for their helpful comments.

Appendix A

In this appendix, we first derive the marginal distributions and joint distributions of the marker scores $X_{(r,s,a)i}^{(j)}$ for the affected siblings, $Y_{(r,s,a)i}^{(j)}$ for the unaffected siblings, and $Z_{(r,s,a)i}^{(j)}$ for the parents whose phenotypes are assumed to be unknown, then their means, variances and covariances, and finally the mean and variance of the difference between the two allele frequency estimates $\hat p_{(r,s,a)A}$ and $\hat p_{(r,s,a)U}$ for the families with identical structures under the null hypothesis H0 and alternative hypothesis H1. We give only the results under H1, as the distributions under H0 can be obtained by replacing the penetrances $f_2$, $f_1$ and $f_0$ by the disease prevalence for each family type (r, s, a).

Marginal marker score distributions

Let G = (u, v) be the mating type of parents and $m_{uv}^{(r,s,a)}$ be the conditional probability of G = (u, v) given the family type being (r, s, a). When the parents are missing at random, the values of $m_{uv}^{(r,s,a)}$ are given in Table 1, which reduces to Table 1 of Risch & Teng (1998) if the penetrances are low so that the unaffected individuals can be regarded as having unknown phenotypes. Denote $m_{(uv)}^{(r,s,a)} = m_{uv}^{(r,s,a)} + m_{vu}^{(r,s,a)}$ when $u \neq v$. Then the distribution of $X_{(r,s,a)i}^{(j)}$ is

$$
\begin{cases}
P\left(X_{(r,s,a)i}^{(j)} = 2\right) = m_{22}^{(r,s,a)} + \dfrac{f_2}{f_2+f_1}\, m_{(21)}^{(r,s,a)} + \dfrac{f_2}{f_2+2f_1+f_0}\, m_{11}^{(r,s,a)} \equiv \alpha_{(r,s,a)X}(2), \\[2ex]
P\left(X_{(r,s,a)i}^{(j)} = 1\right) = \dfrac{f_1}{f_2+f_1}\, m_{(21)}^{(r,s,a)} + m_{(20)}^{(r,s,a)} + \dfrac{2f_1}{f_2+2f_1+f_0}\, m_{11}^{(r,s,a)} + \dfrac{f_1}{f_1+f_0}\, m_{(10)}^{(r,s,a)} \equiv \alpha_{(r,s,a)X}(1), \\[2ex]
P\left(X_{(r,s,a)i}^{(j)} = 0\right) = \dfrac{f_0}{f_2+2f_1+f_0}\, m_{11}^{(r,s,a)} + \dfrac{f_0}{f_1+f_0}\, m_{(10)}^{(r,s,a)} + m_{00}^{(r,s,a)} \equiv \alpha_{(r,s,a)X}(0).
\end{cases}
\tag{A.1}
$$

The distribution of $Y_{(r,s,a)i}^{(j)}$ can be obtained by replacing $f_w$ by $1 - f_w$ in formula (A.1), where w = 0, 1 and 2, and the probabilities are denoted by $\alpha_{(r,s,a)Y}(w')$, where w′ = 0, 1 and 2. In the following discussion, u, v, w, and w′ always take a value of 0, 1, or 2.
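Formula (A.1) can also be evaluated generically, by weighting each mating type by the probability that an affected child of that mating carries w copies of the allele. The following sketch is one way to do this, assuming the mating-type probabilities are supplied as a dictionary over ordered pairs (u, v) and that every mating type with positive probability can produce an affected child. The function name `affected_sib_distribution` and the example values are hypothetical.

```python
def affected_sib_distribution(m, f):
    """Marker-score distribution of one affected sibling, as in (A.1).

    m: {(u, v): conditional probability of the ordered mating type}.
    f: (f0, f1, f2), penetrances indexed by the number of copies of the allele.
    Returns [P(X = 0), P(X = 1), P(X = 2)].
    """
    dist = [0.0, 0.0, 0.0]
    for (u, v), prob in m.items():
        tu, tv = u / 2, v / 2  # chance that each parent transmits the allele
        child = [(1 - tu) * (1 - tv),
                 tu * (1 - tv) + (1 - tu) * tv,
                 tu * tv]
        # Probability that a child of this mating type is affected.
        risk = sum(c * fw for c, fw in zip(child, f))
        for w in range(3):
            dist[w] += prob * child[w] * f[w] / risk
    return dist

# All parents heterozygous: compare with the m11 terms of (A.1).
alpha_x = affected_sib_distribution({(1, 1): 1.0}, (0.1, 0.2, 0.4))
```

Replacing each penetrance by its complement, i.e. passing `tuple(1 - fw for fw in f)`, gives the corresponding distribution for an unaffected sibling.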

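Table 1.

Conditional probability $m_{uv}^{(s,a)}$ of mating type given $a$ affected and $s-a$ unaffected children

| Mating type | Population frequency | $m_{uv}^{(s,a)}$ * |
|---|---|---|
| (2,2) | $g_{22}$ | $f_2^{a}(1-f_2)^{s-a}\,g_{22}/K_{s,a}$ |
| (2,1) | $g_{21}$ | $\left(\frac{f_2+f_1}{2}\right)^{a}\left(1-\frac{f_2+f_1}{2}\right)^{s-a}g_{21}/K_{s,a}$ |
| (2,0) | $g_{20}$ | $f_1^{a}(1-f_1)^{s-a}\,g_{20}/K_{s,a}$ |
| (1,2) | $g_{12}$ | $\left(\frac{f_2+f_1}{2}\right)^{a}\left(1-\frac{f_2+f_1}{2}\right)^{s-a}g_{12}/K_{s,a}$ |
| (1,1) | $g_{11}$ | $\left(\frac{f_2+2f_1+f_0}{4}\right)^{a}\left(1-\frac{f_2+2f_1+f_0}{4}\right)^{s-a}g_{11}/K_{s,a}$ |
| (1,0) | $g_{10}$ | $\left(\frac{f_1+f_0}{2}\right)^{a}\left(1-\frac{f_1+f_0}{2}\right)^{s-a}g_{10}/K_{s,a}$ |
| (0,2) | $g_{02}$ | $f_1^{a}(1-f_1)^{s-a}\,g_{02}/K_{s,a}$ |
| (0,1) | $g_{01}$ | $\left(\frac{f_1+f_0}{2}\right)^{a}\left(1-\frac{f_1+f_0}{2}\right)^{s-a}g_{01}/K_{s,a}$ |
| (0,0) | $g_{00}$ | $f_0^{a}(1-f_0)^{s-a}\,g_{00}/K_{s,a}$ |

* $K_{s,a}$ is the sum of all numerators in the third column.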
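The entries of Table 1 can be computed mechanically. The sketch below is one way to do so; the function name `mating_type_probs` is hypothetical, and the population mating-type frequencies are taken as $g_{uv}=g_ug_v$ with Hardy-Weinberg genotype frequencies, which is an assumption of this example rather than something Table 1 requires.

```python
from itertools import product


def mating_type_probs(f, p, s, a):
    """Conditional mating-type probabilities m_uv^(s,a) as laid out in Table 1.

    f: penetrances (f0, f1, f2); p: frequency of allele A;
    s: number of children; a: number of affected children.
    Assumption: g_uv = g_u * g_v with Hardy-Weinberg genotype frequencies.
    """
    f0, f1, f2 = f
    q = 1.0 - p
    geno = {2: p * p, 1: 2 * p * q, 0: q * q}

    def child_risk(u, v):
        # Each parent transmits allele A with probability (number of A alleles) / 2.
        tu, tv = u / 2, v / 2
        p2 = tu * tv
        p0 = (1 - tu) * (1 - tv)
        p1 = 1 - p2 - p0
        return p2 * f2 + p1 * f1 + p0 * f0

    numer = {}
    for u, v in product((2, 1, 0), repeat=2):
        risk = child_risk(u, v)
        numer[(u, v)] = risk ** a * (1 - risk) ** (s - a) * geno[u] * geno[v]
    k_sa = sum(numer.values())  # K_{s,a}: sum of all numerators
    return {uv: x / k_sa for uv, x in numer.items()}


# Hypothetical example: two affected children out of two.
m_alt = mating_type_probs((0.01, 0.02, 0.04), 0.3, s=2, a=2)
print(m_alt[(2, 2)], m_alt[(1, 1)])
```

When all three penetrances are equal, the child's risk no longer depends on the mating type, so the function returns the population frequencies $g_{uv}$ unchanged.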

Denote the numbers of allele A of the father and mother in the ith family of structure (r, s, a) by Z(r,s,a)i(f) and Z(r,s,a)i(m), respectively. Note that the notation Z(r,s,a)i(1) in the previous sections is not necessarily the same as Z(r,s,a)i(f) and can be equal to Z(r,s,a)i(m) depending on the observed results. The distribution of marker scores for the father is given by

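$$
\begin{cases}
P\big(Z_{(r,s,a)i}^{(f)}=2\big)=m_{22}^{(r,s,a)}+m_{21}^{(r,s,a)}+m_{20}^{(r,s,a)}\equiv\alpha_{(r,s,a)Z^{(f)}}(2),\\[1ex]
P\big(Z_{(r,s,a)i}^{(f)}=1\big)=m_{12}^{(r,s,a)}+m_{11}^{(r,s,a)}+m_{10}^{(r,s,a)}\equiv\alpha_{(r,s,a)Z^{(f)}}(1),\\[1ex]
P\big(Z_{(r,s,a)i}^{(f)}=0\big)=m_{02}^{(r,s,a)}+m_{01}^{(r,s,a)}+m_{00}^{(r,s,a)}\equiv\alpha_{(r,s,a)Z^{(f)}}(0),
\end{cases}
\tag{A.2}
$$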

and the distribution for the mother can be obtained by replacing muv(r,s,a) by mvu(r,s,a) in formula (A.2), and is denoted as α(r, s, a)Z(m) (w′).

Joint marker score distributions

Now we consider the joint distributions for marker scores. Let α(r, s, a) XY (u, v) denote the probability that one affected sibling and one unaffected sibling in the family with type (r, s, a) have u and v alleles A, respectively. Then it can be shown that

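$$
\alpha_{(r,s,a)XY}(2,2)=m_{22}^{(r,s,a)}+\frac{f_2(1-f_2)}{(f_2+f_1)[(1-f_2)+(1-f_1)]}\,m_{(21)}^{(r,s,a)}+\frac{f_2(1-f_2)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}, \tag{A.3}
$$

$$
\alpha_{(r,s,a)XY}(2,1)=\frac{f_2(1-f_1)}{(f_2+f_1)[(1-f_2)+(1-f_1)]}\,m_{(21)}^{(r,s,a)}+\frac{2f_2(1-f_1)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}, \tag{A.4}
$$

$$
\alpha_{(r,s,a)XY}(2,0)=\frac{f_2(1-f_0)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}, \tag{A.5}
$$

$$
\alpha_{(r,s,a)XY}(1,2)=\frac{f_1(1-f_2)}{(f_2+f_1)[(1-f_2)+(1-f_1)]}\,m_{(21)}^{(r,s,a)}+\frac{2f_1(1-f_2)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}, \tag{A.6}
$$

$$
\alpha_{(r,s,a)XY}(1,1)=\frac{f_1(1-f_1)}{(f_2+f_1)[(1-f_2)+(1-f_1)]}\,m_{(21)}^{(r,s,a)}+m_{(20)}^{(r,s,a)}+\frac{4f_1(1-f_1)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}+\frac{f_1(1-f_1)}{(f_1+f_0)[(1-f_1)+(1-f_0)]}\,m_{(10)}^{(r,s,a)}, \tag{A.7}
$$

$$
\alpha_{(r,s,a)XY}(1,0)=\frac{2f_1(1-f_0)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}+\frac{f_1(1-f_0)}{(f_1+f_0)[(1-f_1)+(1-f_0)]}\,m_{(10)}^{(r,s,a)}, \tag{A.8}
$$

$$
\alpha_{(r,s,a)XY}(0,2)=\frac{f_0(1-f_2)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}, \tag{A.9}
$$

$$
\alpha_{(r,s,a)XY}(0,1)=\frac{2f_0(1-f_1)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}+\frac{f_0(1-f_1)}{(f_1+f_0)[(1-f_1)+(1-f_0)]}\,m_{(10)}^{(r,s,a)}, \tag{A.10}
$$

and

$$
\alpha_{(r,s,a)XY}(0,0)=\frac{f_0(1-f_0)}{(f_2+2f_1+f_0)[(1-f_2)+2(1-f_1)+(1-f_0)]}\,m_{11}^{(r,s,a)}+\frac{f_0(1-f_0)}{(f_1+f_0)[(1-f_1)+(1-f_0)]}\,m_{(10)}^{(r,s,a)}+m_{00}^{(r,s,a)}. \tag{A.11}
$$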

The joint distribution for two affected siblings can be obtained by replacing $1-f_w$ by $f_w$ in formulas (A.3)–(A.11), and is denoted by $\alpha_{(r,s,a)XX}(u,v)$. The joint distribution for two unaffected siblings can be obtained by replacing $f_w$ by $1-f_w$ in the same formulas, with the terms that already read $1-f_w$ left unchanged, and is denoted by $\alpha_{(r,s,a)YY}(u,v)$.

Likewise, if we let α(r, s, a) XZ(f) (u, v) be the probability that one affected sibling and the father in the family with type (r, s, a) have u and v alleles A, respectively, then we obtain

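$$
\alpha_{(r,s,a)XZ^{(f)}}(2,2)=m_{22}^{(r,s,a)}+\frac{f_2}{f_2+f_1}\,m_{21}^{(r,s,a)}, \tag{A.12}
$$

$$
\alpha_{(r,s,a)XZ^{(f)}}(2,1)=\frac{f_2}{f_2+f_1}\,m_{12}^{(r,s,a)}+\frac{f_2}{f_2+2f_1+f_0}\,m_{11}^{(r,s,a)}, \tag{A.13}
$$

$$
\alpha_{(r,s,a)XZ^{(f)}}(2,0)=0, \tag{A.14}
$$

$$
\alpha_{(r,s,a)XZ^{(f)}}(1,2)=\frac{f_1}{f_2+f_1}\,m_{21}^{(r,s,a)}+m_{20}^{(r,s,a)}, \tag{A.15}
$$

$$
\alpha_{(r,s,a)XZ^{(f)}}(1,1)=\frac{f_1}{f_2+f_1}\,m_{12}^{(r,s,a)}+\frac{2f_1}{f_2+2f_1+f_0}\,m_{11}^{(r,s,a)}+\frac{f_1}{f_1+f_0}\,m_{10}^{(r,s,a)}, \tag{A.16}
$$

$$
\alpha_{(r,s,a)XZ^{(f)}}(1,0)=m_{02}^{(r,s,a)}+\frac{f_1}{f_1+f_0}\,m_{01}^{(r,s,a)}, \tag{A.17}
$$

$$
\alpha_{(r,s,a)XZ^{(f)}}(0,2)=0, \tag{A.18}
$$

$$
\alpha_{(r,s,a)XZ^{(f)}}(0,1)=\frac{f_0}{f_2+2f_1+f_0}\,m_{11}^{(r,s,a)}+\frac{f_0}{f_1+f_0}\,m_{10}^{(r,s,a)}, \tag{A.19}
$$

and

$$
\alpha_{(r,s,a)XZ^{(f)}}(0,0)=\frac{f_0}{f_1+f_0}\,m_{01}^{(r,s,a)}+m_{00}^{(r,s,a)}. \tag{A.20}
$$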

The joint distribution for one affected sibling and the mother can be obtained by replacing $m_{uv}^{(r,s,a)}$ by $m_{vu}^{(r,s,a)}$ in formulae (A.12)–(A.20), and is denoted by $\alpha_{(r,s,a)XZ^{(m)}}(u,v)$. The joint distribution for one unaffected sibling and the father can be obtained by replacing $f_w$ by $1-f_w$ in formulae (A.12)–(A.20), and is denoted by $\alpha_{(r,s,a)YZ^{(f)}}(u,v)$. The joint distribution for one unaffected sibling and the mother can be obtained by replacing $m_{uv}^{(r,s,a)}$ by $m_{vu}^{(r,s,a)}$ and $f_w$ by $1-f_w$ in formulae (A.12)–(A.20), and is denoted by $\alpha_{(r,s,a)YZ^{(m)}}(u,v)$.
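The sibling-father table (A.12)–(A.20) can be tabulated directly, which also makes it easy to check that it is a proper joint distribution. The following is a minimal sketch; the function name `sib_father_joint`, the dictionary layout, and the numerical inputs are assumptions made for illustration only.

```python
def sib_father_joint(m, f, affected=True):
    """Joint distribution of (sibling score u, father score v), per (A.12)-(A.20).

    m: dict mapping the ordered mating type (father, mother) to m_uv.
    f: penetrances (f0, f1, f2); replaced by 1 - f_w for an unaffected sibling.
    """
    f0, f1, f2 = f if affected else tuple(1.0 - x for x in f)
    s21, s11, s10 = f2 + f1, f2 + 2 * f1 + f0, f1 + f0
    return {
        (2, 2): m[(2, 2)] + f2 / s21 * m[(2, 1)],
        (2, 1): f2 / s21 * m[(1, 2)] + f2 / s11 * m[(1, 1)],
        (2, 0): 0.0,
        (1, 2): f1 / s21 * m[(2, 1)] + m[(2, 0)],
        (1, 1): f1 / s21 * m[(1, 2)] + 2 * f1 / s11 * m[(1, 1)] + f1 / s10 * m[(1, 0)],
        (1, 0): m[(0, 2)] + f1 / s10 * m[(0, 1)],
        (0, 2): 0.0,
        (0, 1): f0 / s11 * m[(1, 1)] + f0 / s10 * m[(1, 0)],
        (0, 0): f0 / s10 * m[(0, 1)] + m[(0, 0)],
    }


# Hypothetical, deliberately asymmetric mating-type probabilities.
weights = {(u, v): 1.0 + 3 * u + v for u in (0, 1, 2) for v in (0, 1, 2)}
total = sum(weights.values())
m_example = {k: w / total for k, w in weights.items()}

joint = sib_father_joint(m_example, (0.01, 0.02, 0.04))
print(sum(joint.values()))  # total probability
```

Summing the table over the sibling score recovers the father's marginal distribution in (A.2), and summing u·v over the table gives the cross-moment that appears in the covariance between a sibling and the father.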

Mean, variance, and covariance of marker scores

From the marker score distributions of the affected and unaffected siblings and their parents, and the joint distributions of the family members, we can obtain the corresponding expectations, variances, and covariances:

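$$
E\big[X_{(r,s,a)i}^{(j)}\big]=2\alpha_{(r,s,a)X}(2)+\alpha_{(r,s,a)X}(1)\equiv\mu_{(r,s,a)X};
$$

Similarly, we can obtain the expressions for the expectations of $Y_{(r,s,a)i}^{(j)}$, $Z_{(r,s,a)i}^{(f)}$ and $Z_{(r,s,a)i}^{(m)}$;

$$
V\big[X_{(r,s,a)i}^{(j)}\big]=4\alpha_{(r,s,a)X}(2)+\alpha_{(r,s,a)X}(1)-\big(\mu_{(r,s,a)X}\big)^2\equiv\sigma_{(r,s,a)X}^2;
$$

The variances of $Y_{(r,s,a)i}^{(j)}$, $Z_{(r,s,a)i}^{(f)}$ and $Z_{(r,s,a)i}^{(m)}$ have similar forms;

$$
\mathrm{Cov}\big(X_{(r,s,a)i}^{(j)},Y_{(r,s,a)i}^{(k)}\big)=4\alpha_{(r,s,a)XY}(2,2)+2\big[\alpha_{(r,s,a)XY}(2,1)+\alpha_{(r,s,a)XY}(1,2)\big]+\alpha_{(r,s,a)XY}(1,1)-\mu_{(r,s,a)X}\mu_{(r,s,a)Y}\equiv\Delta_{(r,s,a)XY}-\mu_{(r,s,a)X}\mu_{(r,s,a)Y}\equiv\gamma_{(r,s,a)XY};
$$

The expressions of the covariances between $X_{(r,s,a)i}^{(j)}$ and $X_{(r,s,a)i}^{(k)}$, and between $Y_{(r,s,a)i}^{(j)}$ and $Y_{(r,s,a)i}^{(k)}$, are similar;

$$
\mathrm{Cov}\big(Z_{(r,s,a)i}^{(f)},Z_{(r,s,a)i}^{(m)}\big)=4m_{22}^{(r,s,a)}+2m_{(21)}^{(r,s,a)}+m_{11}^{(r,s,a)}-\mu_{(r,s,a)Z^{(f)}}\mu_{(r,s,a)Z^{(m)}}\equiv\Delta_{(r,s,a)ZZ}-\mu_{(r,s,a)Z^{(f)}}\mu_{(r,s,a)Z^{(m)}}\equiv\gamma_{(r,s,a)ZZ};
$$

$$
\begin{aligned}
\mathrm{Cov}\big(X_{(r,s,a)i}^{(j)},Z_{(r,s,a)i}^{(f)}\big)
&=4m_{22}^{(r,s,a)}+\frac{2f_2+f_1}{f_2+f_1}\times\big(2m_{21}^{(r,s,a)}+m_{12}^{(r,s,a)}\big)+\frac{2(f_2+f_1)}{f_2+2f_1+f_0}\,m_{11}^{(r,s,a)}\\
&\quad+2m_{20}^{(r,s,a)}+\frac{f_1}{f_1+f_0}\,m_{10}^{(r,s,a)}-\mu_{(r,s,a)X}\mu_{(r,s,a)Z^{(f)}}\\
&\equiv\Delta_{(r,s,a)XZ^{(f)}}-\mu_{(r,s,a)X}\mu_{(r,s,a)Z^{(f)}}\equiv\gamma_{(r,s,a)XZ^{(f)}};
\end{aligned}
\tag{A.21}
$$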

The covariance $\mathrm{Cov}\big(X_{(r,s,a)i}^{(j)},Z_{(r,s,a)i}^{(m)}\big)$ has a similar form to $\mathrm{Cov}\big(X_{(r,s,a)i}^{(j)},Z_{(r,s,a)i}^{(f)}\big)$, except that $m_{vu}^{(r,s,a)}$ and $\mu_{(r,s,a)Z^{(m)}}$ take the place of $m_{uv}^{(r,s,a)}$ and $\mu_{(r,s,a)Z^{(f)}}$ in the above formula. Likewise, the expression for the covariance $\mathrm{Cov}\big(Y_{(r,s,a)i}^{(j)},Z_{(r,s,a)i}^{(f)}\big)$ can be obtained by replacing $f_w$ by $1-f_w$ in formula (A.21), and the covariance $\mathrm{Cov}\big(Y_{(r,s,a)i}^{(j)},Z_{(r,s,a)i}^{(m)}\big)$ can be obtained by replacing $f_w$ by $1-f_w$, $m_{uv}^{(r,s,a)}$ by $m_{vu}^{(r,s,a)}$ and $\mu_{(r,s,a)Z^{(f)}}$ by $\mu_{(r,s,a)Z^{(m)}}$ in formula (A.21).
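The moment formulas in this subsection all follow the same pattern: a marker score takes values 0, 1 or 2, so means, variances and covariances are short weighted sums of the tabulated probabilities. The sketch below illustrates that pattern with two hypothetical helper functions, `mean_var` and `covariance`; the example distribution is an assumption chosen only to make the numbers easy to check.

```python
def mean_var(alpha):
    """Mean and variance of a marker score with distribution alpha[w], w in {0, 1, 2}."""
    mu = 2 * alpha[2] + alpha[1]
    var = 4 * alpha[2] + alpha[1] - mu ** 2
    return mu, var


def covariance(joint, mu_first, mu_second):
    """Covariance of two marker scores from their joint table joint[(u, v)].

    The cross-moment (called Delta in the text) only involves the cells
    where both scores are non-zero.
    """
    delta = (4 * joint[(2, 2)]
             + 2 * (joint[(2, 1)] + joint[(1, 2)])
             + joint[(1, 1)])
    return delta - mu_first * mu_second


# Illustrative distribution (an assumption): Hardy-Weinberg with p = 0.3.
alpha = {2: 0.09, 1: 0.42, 0: 0.49}
mu, var = mean_var(alpha)

# Two independent scores with this distribution have zero covariance.
independent = {(u, v): alpha[u] * alpha[v] for u in alpha for v in alpha}
print(mu, var, covariance(independent, mu, mu))
```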

Mean and variance of the difference between the two allele frequency estimates p̂(r, s, a) A and p̂(r, s, a) U for families with identical structures.

In the following, we derive the mean and variance of $\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}$ under the alternative hypothesis. It can be shown that the means of the allele frequency estimates in the case group and control group are

$$
E\big(\hat p_{(r,s,a)A}\big)=\mu_{(r,s,a)X}/2,
$$

and

$$
E\big(\hat p_{(r,s,a)U}\big)=\frac{(s-a)\mu_{(r,s,a)Y}+\sum_{j=1}^{r}\mu_{(r,s,a)Z^{(j)}}}{2(s-a+r)},
$$

respectively. If we define

$$
\mu_{(r,s,a)Z}\equiv\frac12\big[\mu_{(r,s,a)Z^{(1)}}+\mu_{(r,s,a)Z^{(2)}}\big],
$$

and note that

$$
\sum_{j=1}^{r}\mu_{(r,s,a)Z^{(j)}}=r\mu_{(r,s,a)Z}
$$

since we have assumed random mating of parents when r = 1, then we have

$$
E\big(\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}\big)=\frac12\left[\mu_{(r,s,a)X}-\frac{(s-a)\mu_{(r,s,a)Y}+r\mu_{(r,s,a)Z}}{s-a+r}\right]\equiv\mu_{(r,s,a)}. \tag{A.22}
$$

Likewise, we can show that

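$$
V\big(\hat p_{(r,s,a)A}\big)=E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\frac{\sigma_{(r,s,a)X}^2+(a-1)\gamma_{(r,s,a)XX}}{4a}+\varepsilon^2,
$$

and

$$
\begin{aligned}
V\big(\hat p_{(r,s,a)U}\big)=E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\frac{1}{4(s-a+r)^2}\times\Big[&(s-a)\sigma_{(r,s,a)Y}^2+(s-a)(s-a-1)\gamma_{(r,s,a)YY}+\sum_{j=1}^{r}\sigma_{(r,s,a)Z^{(j)}}^2\\
&+r(r-1)\gamma_{(r,s,a)ZZ}+2(s-a)\sum_{j=1}^{r}\gamma_{(r,s,a)YZ^{(j)}}\Big]+\varepsilon^2,
\end{aligned}
$$

where $E_1$ denotes the expectation over all possible values of $n_{(r,s,a)}$. Now define

$$
\sigma_{(r,s,a)Z}^2\equiv\frac12\big[\sigma_{(r,s,a)Z^{(1)}}^2+\sigma_{(r,s,a)Z^{(2)}}^2\big],
$$

and

$$
\gamma_{(r,s,a)YZ}\equiv\frac12\big[\gamma_{(r,s,a)YZ^{(1)}}+\gamma_{(r,s,a)YZ^{(2)}}\big].
$$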

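Because $\sigma_{(r,s,a)Z}^2$ and $\gamma_{(r,s,a)YZ}$ depend only on the sums of the variances and covariances over both parents, we can regard the first parent as the father and the second parent as the mother. Then, from the formulae for $\sigma_{(r,s,a)Z^{(i)}}^2$ and $\gamma_{(r,s,a)YZ^{(i)}}$ (i = 1, 2) provided above, we have

$$
\begin{aligned}
\sigma_{(r,s,a)Z}^2&=2\big(2m_{22}^{(r,s,a)}+m_{(21)}^{(r,s,a)}+m_{(20)}^{(r,s,a)}\big)+\frac12\big(m_{(21)}^{(r,s,a)}+2m_{11}^{(r,s,a)}+m_{(10)}^{(r,s,a)}\big)-\frac12\Big[\big(\mu_{(r,s,a)Z^{(1)}}\big)^2+\big(\mu_{(r,s,a)Z^{(2)}}\big)^2\Big]\\
&\equiv\Delta_{(r,s,a)Z}-\frac12\Big[\big(\mu_{(r,s,a)Z^{(1)}}\big)^2+\big(\mu_{(r,s,a)Z^{(2)}}\big)^2\Big],
\end{aligned}
$$

and

$$
\begin{aligned}
\gamma_{(r,s,a)YZ}&=4m_{22}^{(r,s,a)}+\frac{3[2(1-f_2)+(1-f_1)]}{2[(1-f_2)+(1-f_1)]}\,m_{(21)}^{(r,s,a)}+m_{(20)}^{(r,s,a)}+\frac{2[(1-f_2)+(1-f_1)]}{(1-f_2)+2(1-f_1)+(1-f_0)}\,m_{11}^{(r,s,a)}\\
&\quad+\frac{1-f_1}{2[(1-f_1)+(1-f_0)]}\,m_{(10)}^{(r,s,a)}-\mu_{(r,s,a)Y}\cdot\frac12\big[\mu_{(r,s,a)Z^{(1)}}+\mu_{(r,s,a)Z^{(2)}}\big]\\
&\equiv\Delta_{(r,s,a)YZ}-\mu_{(r,s,a)Y}\mu_{(r,s,a)Z}.
\end{aligned}
\tag{A.23}
$$

Note that the mating of the parents is assumed to be random when r = 1. Hence

$$
\sum_{j=1}^{r}\sigma_{(r,s,a)Z^{(j)}}^2=r\sigma_{(r,s,a)Z}^2,
$$

and

$$
\sum_{j=1}^{r}\gamma_{(r,s,a)YZ^{(j)}}=r\gamma_{(r,s,a)YZ}.
$$

Therefore, the variance of $\hat p_{(r,s,a)U}$ can be expressed as

$$
\begin{aligned}
V\big(\hat p_{(r,s,a)U}\big)=E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\frac{1}{4(s-a+r)^2}\times\Big[&(s-a)\sigma_{(r,s,a)Y}^2+(s-a)(s-a-1)\gamma_{(r,s,a)YY}+r\sigma_{(r,s,a)Z}^2\\
&+r(r-1)\gamma_{(r,s,a)ZZ}+2(s-a)r\gamma_{(r,s,a)YZ}\Big]+\varepsilon^2.
\end{aligned}
$$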

Further, we can obtain

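$$
\mathrm{Cov}\big(\hat p_{(r,s,a)A},\hat p_{(r,s,a)U}\big)=E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\frac{(s-a)\gamma_{(r,s,a)XY}+r\gamma_{(r,s,a)XZ}}{4(s-a+r)},
$$

where

$$
\gamma_{(r,s,a)XZ}\equiv\frac12\sum_{j=1}^{2}\gamma_{(r,s,a)XZ^{(j)}}\equiv\Delta_{(r,s,a)XZ}-\mu_{(r,s,a)X}\mu_{(r,s,a)Z}
$$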

has similar meaning to γ(r, s, a) YZ given in (A.23) except we should replace 1 − fw by fw in the expression. Consequently,

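$$
\begin{aligned}
V\big(\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}\big)
&=\frac14E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\bigg\{\frac{\sigma_{(r,s,a)X}^2-\gamma_{(r,s,a)XX}}{a}+\gamma_{(r,s,a)XX}\\
&\qquad+\frac{1}{(s-a+r)^2}\Big[(s-a)\big(\sigma_{(r,s,a)Y}^2-\gamma_{(r,s,a)YY}\big)+(s-a)^2\gamma_{(r,s,a)YY}+r\sigma_{(r,s,a)Z}^2+r(r-1)\gamma_{(r,s,a)ZZ}+2(s-a)r\gamma_{(r,s,a)YZ}\Big]\\
&\qquad-2\cdot\frac{(s-a)\gamma_{(r,s,a)XY}+r\gamma_{(r,s,a)XZ}}{s-a+r}\bigg\}+2\varepsilon^2\\
&=\frac14E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\bigg\{\frac{\sigma_{(r,s,a)X}^2-\gamma_{(r,s,a)XX}}{a}+\Delta_{(r,s,a)XX}\\
&\qquad+\frac{1}{(s-a+r)^2}\Big[(s-a)\big(\sigma_{(r,s,a)Y}^2-\gamma_{(r,s,a)YY}\big)+(s-a)^2\Delta_{(r,s,a)YY}+r\Delta_{(r,s,a)Z}+r(r-1)\Delta_{(r,s,a)ZZ}+2(s-a)r\Delta_{(r,s,a)YZ}\Big]\\
&\qquad-2\cdot\frac{(s-a)\Delta_{(r,s,a)XY}+r\Delta_{(r,s,a)XZ}}{s-a+r}-4\big(\mu_{(r,s,a)}\big)^2\bigg\}+2\varepsilon^2\\
&\equiv\frac14E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\Delta_{(r,s,a)}^{*}+2\varepsilon^2\equiv\sigma_{(r,s,a)}^2.
\end{aligned}
\tag{A.24}
$$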

From Stephan (1945), we have

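$$
E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]=\frac{1}{n\lambda_{(r,s,a)}}+\frac{1-\lambda_{(r,s,a)}}{n^2\lambda_{(r,s,a)}^2}+o\!\left(\frac{1}{n^2}\right), \tag{A.25}
$$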

where as before, λ(r, s, a) is the proportion of families with type (r, s, a) in the ascertainment subpopulation, and can be accurately estimated when the sample size is large. If we use the first order approximation of E1[1n(r,s,a)], then we get

$$
\sigma_{(r,s,a)}^2\approx\frac{\Delta_{(r,s,a)}^{*}}{4n\lambda_{(r,s,a)}}+2\varepsilon^2. \tag{A.26}
$$
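The quality of the first and second order approximations in (A.25) can be checked numerically. The sketch below assumes, for illustration, that $n_{(r,s,a)}$ is binomial with parameters $n$ and $\lambda_{(r,s,a)}$ conditioned on being at least 1 (a positive binomial variate); the function names and the example values n = 200 and λ = 0.3 are hypothetical.

```python
from math import exp, lgamma, log


def exact_inverse_mean(n, lam):
    """E[1/N] for N ~ Binomial(n, lam) conditioned on N >= 1 (0 < lam < 1)."""
    total = 0.0
    for k in range(1, n + 1):
        # Binomial probability computed on the log scale to avoid overflow.
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(lam) + (n - k) * log(1.0 - lam))
        total += exp(log_pmf) / k
    return total / (1.0 - (1.0 - lam) ** n)


def first_order(n, lam):
    """Leading term of (A.25), the one used in (A.26)."""
    return 1.0 / (n * lam)


def second_order(n, lam):
    """First two terms of (A.25)."""
    return 1.0 / (n * lam) + (1.0 - lam) / (n * lam) ** 2


n, lam = 200, 0.3
exact = exact_inverse_mean(n, lam)
print(exact, first_order(n, lam), second_order(n, lam))
```

For inputs of this size the second order term is already a small correction to the first, which is in line with the remark that both approximations give almost identical power.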

We use formula (A.26) to estimate the sample size required to detect association in this study, although it is straightforward to approximate power and sample size using the second order approximation of $E_1[1/n_{(r,s,a)}]$. In fact, our numerical calculation for the example given in Section 4 shows that the first and second order approximations lead to almost identical power (data not shown). If we, like Risch & Teng (1998), assume that the penetrance is low, then the expectation and variance of the difference between the sample frequencies among the affecteds and controls, $\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}$, under the alternative hypothesis reduce to

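$$
\mu_{(r,s,a)}=\pi_{21}m_{(21)}^{(r,s,a)}+\pi_{11}m_{11}^{(r,s,a)}+\pi_{10}m_{(10)}^{(r,s,a)}, \tag{A.27}
$$

and

$$
\begin{aligned}
V\big(\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}\big)=E_1\!\left[\frac{1}{n_{(r,s,a)}}\right]\cdot\bigg\{&\pi_{21}^2m_{(21)}^{(r,s,a)}+\pi_{11}^2m_{11}^{(r,s,a)}+\pi_{10}^2m_{(10)}^{(r,s,a)}+\frac1a\Big(\psi_{21}^2m_{(21)}^{(r,s,a)}+\psi_{11}^2m_{11}^{(r,s,a)}+\psi_{10}^2m_{(10)}^{(r,s,a)}\Big)\\
&+\frac{1}{s-a+r}\Big(\frac{1}{16}m_{(21)}^{(r,s,a)}+\frac18m_{11}^{(r,s,a)}+\frac{1}{16}m_{(10)}^{(r,s,a)}\Big)\\
&+\frac{r}{4(s-a+r)^2}\Big[(1-r)\Big(\frac14m_{(21)}^{(r,s,a)}+m_{(20)}^{(r,s,a)}+\frac14m_{(10)}^{(r,s,a)}\Big)+\Big(m_{(20)}^{(r,s,a)}-\frac12m_{11}^{(r,s,a)}\Big)\Big]\\
&-\big(\mu_{(r,s,a)}\big)^2\bigg\}+2\varepsilon^2,
\end{aligned}
\tag{A.28}
$$

respectively, where

$$
\pi_{21}=\frac{f_2-f_1}{4(f_2+f_1)},\qquad\pi_{11}=\frac{f_2-f_0}{2(f_2+2f_1+f_0)},\qquad\pi_{10}=\frac{f_1-f_0}{4(f_1+f_0)},
$$

and

$$
\psi_{21}^2=\frac{f_2f_1}{4(f_2+f_1)^2},\qquad\psi_{11}^2=\frac{f_2f_1+2f_2f_0+f_1f_0}{2(f_2+2f_1+f_0)^2},\qquad\psi_{10}^2=\frac{f_1f_0}{4(f_1+f_0)^2}.
$$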

These results are the same as those in Risch & Teng (1998). Equations (A.27) and (A.28) give unified formulae for various family structures. By taking different values of r, s and a, we can obtain the corresponding results in Risch & Teng (1998).
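To make the low-penetrance formulas concrete, the sketch below evaluates (A.27) and (A.28) for a given family type. The function name `low_penetrance_moments`, its argument layout, and the example inputs (Hardy-Weinberg mating-type frequencies with p = 0.3 and hypothetical penetrances) are assumptions made for illustration; `e1` stands for $E_1[1/n_{(r,s,a)}]$ and `eps2` for $\varepsilon^2$.

```python
def low_penetrance_moments(m, f, r, s, a, e1, eps2=0.0):
    """Mean (A.27) and variance (A.28) of p_hat_A - p_hat_U under low penetrance.

    m: dict mapping the ordered mating type (u, v) to m_uv for this family type.
    f: penetrances (f0, f1, f2); r, s, a: family structure (requires s - a + r > 0).
    """
    f0, f1, f2 = f
    m21 = m[(2, 1)] + m[(1, 2)]
    m20 = m[(2, 0)] + m[(0, 2)]
    m10 = m[(1, 0)] + m[(0, 1)]
    m11 = m[(1, 1)]
    s11 = f2 + 2 * f1 + f0

    pi21 = (f2 - f1) / (4 * (f2 + f1))
    pi11 = (f2 - f0) / (2 * s11)
    pi10 = (f1 - f0) / (4 * (f1 + f0))
    psi21_sq = f2 * f1 / (4 * (f2 + f1) ** 2)
    psi11_sq = (f2 * f1 + 2 * f2 * f0 + f1 * f0) / (2 * s11 ** 2)
    psi10_sq = f1 * f0 / (4 * (f1 + f0) ** 2)

    n_ctrl = s - a + r  # number of individuals in the control pool per family
    mu = pi21 * m21 + pi11 * m11 + pi10 * m10
    braces = (pi21 ** 2 * m21 + pi11 ** 2 * m11 + pi10 ** 2 * m10
              + (psi21_sq * m21 + psi11_sq * m11 + psi10_sq * m10) / a
              + (m21 / 16 + m11 / 8 + m10 / 16) / n_ctrl
              + r / (4 * n_ctrl ** 2)
              * ((1 - r) * (m21 / 4 + m20 + m10 / 4) + (m20 - m11 / 2))
              - mu ** 2)
    return mu, e1 * braces + 2 * eps2


# Illustrative inputs (assumptions): p = 0.3, one affected and two unaffected sibs, both parents.
p = 0.3
g = {2: p * p, 1: 2 * p * (1 - p), 0: (1 - p) ** 2}
m_hw = {(u, v): g[u] * g[v] for u in g for v in g}
mu_alt, var_alt = low_penetrance_moments(m_hw, (0.01, 0.02, 0.04), r=2, s=3, a=1, e1=1 / 100)
print(mu_alt, var_alt)
```

With equal penetrances the mean is zero, and for this example input the expression in braces equals $\frac14 pq\,[1/a+(s-a+2r-r^2)/(s-a+r)^2]$.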

Appendix B

In this appendix, we prove the asymptotic normality of our proposed test statistic t under H0. For convenience we denote the variance of the measurement error by $\varepsilon_n^2$ when the sample size is n. When $\varepsilon_n$ is a nonzero constant, the proof is obvious. Now we assume that $\sqrt{n}\,\varepsilon_n\to\ell<\infty$, where ℓ is a constant. It can be seen that

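$$
\sqrt{n}\,\big(\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}\big)=\sqrt{n}\left(\frac{1}{n_{(r,s,a)}}\sum_{i=1}^{n_{(r,s,a)}}\zeta_{(r,s,a)i}+\xi_{(r,s,a)}-\eta_{(r,s,a)}\right),
$$

where

$$
\zeta_{(r,s,a)i}=\frac{\sum_{j=1}^{a}X_{(r,s,a)i}^{(j)}}{2a}-\frac{\sum_{j=1}^{s-a}Y_{(r,s,a)i}^{(j)}+\sum_{j=1}^{r}Z_{(r,s,a)i}^{(j)}}{2(s-a+r)}
$$

has mean zero and variance

$$
\sigma_{(r,s,a)0}^{*2}=\frac14\,p_{(r,s,a)}q_{(r,s,a)}\left[\frac1a+\frac{s-a+2r-r^2}{(s-a+r)^2}\right]
$$

under H0. Note that when n → ∞,

$$
\frac{n_{(r,s,a)}}{n}\xrightarrow{\;p\;}\lambda_{(r,s,a)},
$$

where $\xrightarrow{p}$ means convergence in probability. So from the central limit theorem and the assumptions given in Section 2, we have

$$
\sqrt{n}\,\big(\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}\big)\xrightarrow{\;d\;}N\!\left(0,\ \frac{\sigma_{(r,s,a)0}^{*2}}{\lambda_{(r,s,a)}}+2\ell^2\right),
$$

where $\xrightarrow{d}$ means convergence in distribution. On the other hand, under H0, $\hat p_{(r,s,a)}\xrightarrow{p}p_{(r,s,a)}$. Hence,

$$
\begin{aligned}
n\hat\sigma_{(r,s,a)0}^2&=\frac{n}{n_{(r,s,a)}}\cdot\frac14\,\hat p_{(r,s,a)}\big(1-\hat p_{(r,s,a)}\big)\times\left[\frac1a+\frac{s-a+2r-r^2}{(s-a+r)^2}\right]+2n\varepsilon_n^2\\
&\xrightarrow{\;p\;}\frac{1}{\lambda_{(r,s,a)}}\cdot\frac14\,p_{(r,s,a)}\big(1-p_{(r,s,a)}\big)\times\left[\frac1a+\frac{s-a+2r-r^2}{(s-a+r)^2}\right]+2\ell^2=\frac{\sigma_{(r,s,a)0}^{*2}}{\lambda_{(r,s,a)}}+2\ell^2.
\end{aligned}
$$

Thus,

$$
t=\frac{\displaystyle\sum_{r=0}^{2}\sum_{s\ge1}\sum_{a=1}^{s}\frac{1}{n\hat\sigma_{(r,s,a)0}^2}\cdot\sqrt{n}\,\big(\hat p_{(r,s,a)A}-\hat p_{(r,s,a)U}\big)}{\sqrt{\displaystyle\sum_{r=0}^{2}\sum_{s\ge1}\sum_{a=1}^{s}\frac{1}{n\hat\sigma_{(r,s,a)0}^2}}}\xrightarrow{\;d\;}N(0,1).
$$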
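The quantities used in this proof are simple to compute. The sketch below evaluates the null variance of ζ in closed form, rebuilds it from pairwise variances and covariances as a cross-check, and combines per-family-type differences into the statistic t with weights inversely proportional to the estimated null variances. All function names are hypothetical, and the cross-check assumes Hardy-Weinberg proportions and random mating under H0 (variance 2pq for an allele count, covariance pq for sib-sib and parent-child pairs, and zero covariance between parents).

```python
import math


def null_variance_closed_form(p, r, s, a):
    """Variance of zeta under H0: (1/4) p q [1/a + (s-a+2r-r^2)/(s-a+r)^2]."""
    q = 1.0 - p
    n_ctrl = s - a + r
    return 0.25 * p * q * (1.0 / a + (s - a + 2 * r - r * r) / n_ctrl ** 2)


def null_variance_from_moments(p, r, s, a):
    """Same variance, assembled from pairwise moments (assumed HWE, random mating)."""
    q = 1.0 - p
    var, cov_rel = 2 * p * q, p * q  # allele-count variance; sib-sib or parent-child covariance
    u = s - a                        # number of unaffected sibs
    n_ctrl = u + r
    var_aff = (var + (a - 1) * cov_rel) / a
    var_ctl = (u * var + u * (u - 1) * cov_rel + r * var + 2 * u * r * cov_rel) / n_ctrl ** 2
    cov_ac = (u * cov_rel + r * cov_rel) / n_ctrl
    return 0.25 * (var_aff + var_ctl - 2 * cov_ac)


def sigma0_sq_hat(p_hat, n_type, r, s, a, eps2=0.0):
    """Estimated null variance of p_hat_A - p_hat_U for one family type."""
    return null_variance_closed_form(p_hat, r, s, a) / n_type + 2 * eps2


def combined_t(diffs, variances):
    """Combine per-type differences using weights 1 / variance."""
    num = sum(d / v for d, v in zip(diffs, variances))
    den = math.sqrt(sum(1.0 / v for v in variances))
    return num / den


# Hypothetical pooled estimates for two family types.
v1 = sigma0_sq_hat(0.30, 100, r=2, s=3, a=1, eps2=1e-5)
v2 = sigma0_sq_hat(0.30, 150, r=0, s=2, a=1, eps2=1e-5)
print(combined_t([0.05, 0.04], [v1, v2]))
```

The common factor n in the numerator and denominator of t cancels, which is why `combined_t` works directly with the unscaled differences and variances.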

References

  1. Allen A, Rathouz P, Satten G. Informative missingness in genetic association studies: case-parent designs. Am J Hum Genet. 2003;72:671–680. doi: 10.1086/368276.
  2. Bader J, Bansal A, Sham P. Efficient SNP-based tests of association for quantitative phenotypes using pooled DNA. GeneScreen. 2001;1:143–150.
  3. Bader J, Sham P. Family-based association tests for quantitative traits using pooled DNA. Eur J Hum Genet. 2002;10:870–878. doi: 10.1038/sj.ejhg.5200893.
  4. Bengtsson BO, Thomson G. Measuring the strength of associations between HLA antigens and diseases. Tissue Antigens. 1981;18:356–363. doi: 10.1111/j.1399-0039.1981.tb01404.x.
  5. Buetow KH, Edmonson M, MacDonald R, Clifford P, Yip P, Kelley J, Little DP, Strausberg R, Koester H, Cantor CR, Braun A. High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc Natl Acad Sci USA. 2001;98:581–584. doi: 10.1073/pnas.021506298.
  6. Grupe A, Germer S, Usuka J, Aud D, Belknap JK, Klein RF, Ahluwalia MK, Higuchi R, Peltz G. In silico mapping of complex disease-related traits in mice. Science. 2001;292:1915–1918. doi: 10.1126/science.1058889.
  7. Ito T, Chiku S, Inoue E, Tomita M, Morisaki T, Morisaki H, Kamatani N. Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am J Hum Genet. 2003;72:384–398. doi: 10.1086/346116.
  8. Jawaid A, Bader J, Purcell S, Cherny S, Sham P. Optimal selection strategies for QTL mapping using pooled DNA samples. Eur J Hum Genet. 2002;10:125–132. doi: 10.1038/sj.ejhg.5200771.
  9. Le Hellard S, Ballereau SJ, Visscher PM, Torrance HS, Pinson J, Morris SW, Thomson ML, Semple CA, Muir WJ, Blackwood DH, Porteous DJ, Evans KL. SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Res. 2002;30:e74. doi: 10.1093/nar/gnf070.
  10. Lipkin E, Mosig MO, Darvasi A, Ezra E, Shalom A, Friedmann A, Soller M. Quantitative trait locus mapping in dairy cattle by means of selective milk DNA pooling using dinucleotide microsatellite markers: analysis of milk protein percentage. Genetics. 1998;149:1557–1567. doi: 10.1093/genetics/149.3.1557.
  11. Michelmore R, Paran I, Kesseli R. Identification of marker linked to disease resistance gene by bulk segregant analysis: a rapid method to detect markers in specific genomic regions using segregating populations. Proc Natl Acad Sci USA. 1991;88:9828–9832. doi: 10.1073/pnas.88.21.9828.
  12. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516.
  13. Risch N, Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. Genome Res. 1998;8:1273–1288. doi: 10.1101/gr.8.12.1273.
  14. Sham P, Bader J, Craig I, O’Donovan M, Owen M. DNA pooling: a tool for large-scale association studies. Nature Reviews Genetics. 2002;3:862–871. doi: 10.1038/nrg930.
  15. Stephan FF. The expected value and variance of the reciprocal and other negative powers of a positive Bernoulli variate. Ann Math Statist. 1945;16:50–61.
  16. Xu C, Donnelly C, Montgomery D, Allan C, Purvis I. Determination of SNP allele frequency by a DNA pooling method. Am J Hum Genet. 1999;65:2577.
  17. Wang S, Kidd KK, Zhao H. On the use of DNA pooling to estimate haplotype frequencies. Genet Epidemiol. 2003;24:74–82. doi: 10.1002/gepi.10195.
  18. Zou G, Zhao H. The impacts of errors in individual genotyping and DNA pooling on association studies. Genet Epidemiol. 2004;26:1–10. doi: 10.1002/gepi.10277.
