Summary
DNA pooling is a cost-effective strategy for genome-wide association studies to identify disease genes. In the context of family-based association studies, Risch & Teng (1998) mainly considered families of identical structures to detect associations between genetic markers and disease, and suggested possible approaches to incorporating different family types without a thorough study of their properties. However, families collected in real genetic studies often have different structures and, more importantly, the informativeness of each family structure depends on the disease model, which is generally unknown. There is therefore a need to develop and investigate statistical methods to combine information from diverse family types. In this article, we propose a general strategy to incorporate different family types by assigning each family an “optimal” weight in association tests. In addition, we consider measurement errors in our analysis. When we evaluate our approach under different disease models and measurement errors, we find that our weighting scheme may lead to a substantial reduction in the required sample size over the approach suggested by Risch & Teng (1998), and that measurement errors may have a significant impact on the required sample size when the error rates are not negligible.
Keywords: family-based association study, family structure, DNA pooling, measurement error, sample size
Introduction
The genome-wide association study is a promising approach to identifying disease genes. However, it is still extremely expensive to genotype hundreds or thousands of individuals at hundreds of thousands of marker loci with current technologies. As a result, DNA pooling has received much attention recently due to its potential for saving genotyping cost (Michelmore et al. 1991; Lipkin et al. 1998; Risch & Teng, 1998; Xu et al. 1999; Bader et al. 2001; Jawaid et al. 2002; Ito et al. 2003; Wang et al. 2003, among others). Recent developments in quantitative assays and in the design and analysis of pooling studies were reviewed by Sham et al. (2002).
For quantitative phenotypes, Bader & Sham (2002) proposed statistical methods to use DNA pooling in family-based association designs. For qualitative phenotypes, Risch & Teng (1998) derived formulae for calculating power to detect associations for identical family structures when DNA pooling is used. Compared with the results of population-based case-control tests under both individual genotyping and DNA pooling (cf. Zou & Zhao, 2004), their research shows that pooling leads to higher power when families with parents are used, especially as the number of affected children increases, although the power of the two designs is similar when case-parent trios are used. Note that family-based designs are robust to population stratification, and family-based association tests are promising when pooled DNA is used. However, families often have different structures in practical genetic studies, and a study is more flexible if it does not constrain family types. For example, it may not be easy to collect only families with three affected children. More importantly, Risch & Teng’s (1998) research and our own suggest that the sample sizes required for various family structures depend on the disease model, which is often unknown to researchers. Therefore, there is a need to develop statistical methods to combine families of different types for both practical and design considerations. As pointed out by Risch & Teng (1998), simple pooling of families of different structures, e.g., pooling all affecteds together and all parents and unaffecteds together, is not a robust procedure. They proposed two ways to combine families of different structures. The first is to form pools using only families with identical structures; the second is to duplicate individuals for different family types, so that the ratio of the number of affecteds to the number of unaffecteds remains constant. However, they investigated neither the properties nor the power of their proposed methods.
In this article, we consider the first method of forming pools using families of the same structures. Through marker score distributions, we first derive formulae for the mean and variance of the test statistic using DNA pooling data from families of identical structures. Our general results cover those of Risch & Teng (1998) as special cases. Based on these results, we propose a weighting scheme to combine data of different family types. When our approach is applied to different disease models we find that it may lead to significant reduction in sample size requirements compared to those through Risch & Teng’s approach under certain disease models. We also consider errors in measuring allele frequencies, which are unavoidable for DNA pooling technology. Recent research suggests that for a given DNA pooling sample, the standard deviation of the estimated allele frequency is between 1% and 4% (cf., Buetow et al. 2001; Grupe et al. 2001; Le Hellard et al. 2002, and Sham et al. 2002). For example, Le Hellard et al. (2002) reported that using the SNaPshot™ Method, which is based on allele-specific extension or minisequencing from a primer adjacent to the site of the SNP, the standard deviations for estimating allele frequency are from 1% to 4% depending on the specific markers being tested. Therefore, we also incorporate measurement errors into our approach. Our numerical results show that the sample size required to attain the desired significance level and power using pooled DNA samples may be seriously affected when the error rates are not negligible.
Genetic Models and Measurement Error Models
Consider a disease locus with two alleles D and d, and a marker locus with two alleles A and a. Assume that the penetrance is f2 for genotype DD, f1 for genotype Dd, and f0 for genotype dd. Denote the frequency of allele A in the families of structure (r, s, a) by p(r, s, a), where r (= 0, 1, 2) is the number of available parents in a family, s (= 1, 2, …) is the number of siblings, and a (= 1, …, s) is the number of affected siblings. Let q(r, s, a) = 1 − p(r, s, a). In this paper, we assume Hardy-Weinberg equilibrium for parents. If r = 1, that is, for the family type with only one parent, we further assume random mating between the two parents and that parents are missing at random. This is a reasonable assumption if the parental missingness is not related to the phenotype. For the ith family of structure (r, s, a), where i = 1, …, n(r, s, a) and n(r, s, a) is the number of families of type (r, s, a) in the sample, let X(r, s, a) ij be the true (unobservable) number of allele A at the marker locus for the jth (j = 1, …, a) affected sibling, Y(r, s, a) ij be the true (unobservable) number of allele A for the jth (j = 1, …, s − a) unaffected sibling, and Z(r, s, a) ij be the true (unobservable) number of allele A for the jth (j = 1, …, r) parent, who may be the father or mother. Here, to simplify our analysis, we treat the parental phenotypes as unknown. Let n be the total number of families drawn at random from the ascertainment subpopulation, which consists of all families with at least one affected sibling; then the n(r, s, a) (r = 0, 1, 2; s = 1, 2, …; a = 1, …, s) are random variables that satisfy Σ(r, s, a) n(r, s, a) = n. For each family type we form two pools, one consisting of affected siblings, and the other consisting of unaffected siblings and available parents. Here, we discard the families in which all children are affected and no parents are available, because the control group cannot be formed for such families.
For a genetic model, denote the mean and variance of the marker score X(r, s, a) ij of an affected sibling by μ(r, s, a) X and σ²(r, s, a) X, and the covariance and mixed second moment of the marker scores X(r, s, a) ij and X(r, s, a) ik (j ≠ k) of two affected siblings by γ(r, s, a) XX and Δ(r, s, a) XX, respectively. Other notations are defined similarly. The formulae for calculating these means, variances, covariances and mixed second moments are provided in Appendix A. To consider measurement errors, we assume the following measurement error models
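$$
\hat{p}^{(r,s,a)}_{A} = \frac{1}{2a\,n^{(r,s,a)}} \sum_{i=1}^{n^{(r,s,a)}} \sum_{j=1}^{a} X^{(r,s,a)}_{ij} + \xi^{(r,s,a)}, \qquad
\hat{p}^{(r,s,a)}_{U} = \frac{1}{2(s-a+r)\,n^{(r,s,a)}} \sum_{i=1}^{n^{(r,s,a)}} \Bigl( \sum_{j=1}^{s-a} Y^{(r,s,a)}_{ij} + \sum_{j=1}^{r} Z^{(r,s,a)}_{ij} \Bigr) + \eta^{(r,s,a)} \tag{1}
$$
where p̂(r, s, a) A is the sample frequency of allele A among the affected siblings of family type (r, s, a), and p̂(r, s, a) U is the sample frequency of allele A among the unaffected siblings and available parents of family type (r, s, a). Here X, Y and Z denote the true marker scores (numbers of allele A) of the affected siblings, the unaffected siblings and the available parents, so the first term on each right-hand side is the true allele frequency in the corresponding pool. Given these true frequencies, ξ(r, s, a) and η(r, s, a) are independent normal random variables with mean 0 and variance ε². Here we assume that for the DNA pooling technology, the standard deviation ε is not affected by family structures or true allele proportions. In practice, however, ε may be related to these factors. In that case, ε should be replaced by ε(r, s, a) for the family type (r, s, a) in the formulae below.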
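The additive error model described above can be illustrated with a small sketch. The function name `pooled_frequency` and the marker scores below are hypothetical and serve only as an illustration; the error is drawn from a normal distribution with mean 0 and standard deviation ε, as assumed in the text.

```python
import random

def pooled_frequency(marker_scores, eps, rng):
    """Observed allele A frequency of one pool under an additive normal error.

    marker_scores: true numbers of allele A (0, 1 or 2) of the pooled individuals.
    eps: standard deviation of the measurement error (assumed known).
    """
    true_freq = sum(marker_scores) / (2 * len(marker_scores))
    return true_freq + rng.gauss(0.0, eps)

rng = random.Random(1)
affected_scores = [2, 1, 1, 0, 1, 2, 0, 1]  # hypothetical marker scores
exact = pooled_frequency(affected_scores, 0.0, rng)   # no measurement error
noisy = pooled_frequency(affected_scores, 0.01, rng)  # eps = 0.01
```

With ε = 0 the function returns the true pool frequency, so any discrepancy between `exact` and `noisy` is due to measurement error alone.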
Statistical Tests Combining Families of Different Structures
To test the null hypothesis of H0: no association between the marker and disease, we form our pools using families with identical structures. To simplify our presentation, we consider the case of perfect linkage disequilibrium between the marker locus and disease locus, i.e., the two loci are identical. More general situations are discussed in the Discussion Section. As in Risch & Teng (1998), a one-sided test will be used. Consider the following general weighting scheme combining information from various family structures:
where the w(r, s, a) are weights to be discussed below. It can be seen that under the null hypothesis H0, p̂(r, s, a) A − p̂(r, s, a) U has mean 0, and variance
which can be estimated by
where
is the sample frequency of allele A for family type (r, s, a), and E1 denotes the expectation over all possible values of n(r, s, a). Other methods of estimating the frequency of allele A for family type (r, s, a) under H0 are possible (cf. Risch & Teng, 1998). So under H0,
The optimal value of w(r, s, a) in the sense of minimizing V(T) is given by
That is, the weights should be inversely proportional to the corresponding variances. The minimal variance is given by
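The inverse-variance rule stated above can be checked numerically. The sketch below assumes that the weights are normalized to sum to one and that the contributions of different family types are independent, so that V(T) is the sum of w² times the per-type variance; the variances used are hypothetical. Under these assumptions the minimal variance equals the reciprocal of the sum of the reciprocal variances.

```python
def optimal_weights(variances):
    """Weights proportional to 1/variance, normalized to sum to one."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [x / total for x in inverse]

def combined_variance(weights, variances):
    """Variance of a weighted sum of independent components."""
    return sum(w * w * v for w, v in zip(weights, variances))

# Hypothetical null variances of the frequency difference for two family types.
variances = [4.0e-4, 1.0e-4]
w_opt = optimal_weights(variances)
v_opt = combined_variance(w_opt, variances)
v_equal = combined_variance([0.5, 0.5], variances)
```

In this toy example the less variable family type receives four times the weight of the other, and the combined variance is smaller than under equal weighting.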
Therefore, we propose the following test statistic for H0:
Using a one-sided test and assuming asymptotic normality (a proof on the asymptotic normality of the test statistic t under H0 is provided in Appendix B), the power to reject the null hypothesis with significance level α can be approximated by
where μ(r, s, a) and are the mean and variance of the difference between the two allele frequency estimates p̂(r, s, a) A and p̂(r, s, a) U under the alternative hypothesis H1, respectively, whose expressions are given by (A.22) and (A.24) in Appendix A, and
with
being the expected frequency of allele A in family type (r, s, a) under H1, Φ is the cumulative standard normal distribution function, and zα the upper 100α percentile of the standard normal distribution. If the penetrances are low, then p̃(r, s, a) is simplified to
Where is the conditional probability of the mating type of parents, G = (u, v), given the family type being (r, s, a). The sample size necessary to obtain a power of 1 − β with a significance level of α satisfies
(2) |
In particular, when there is no measurement error, i.e., ε = 0, we have
where is given in (A.24), λ(r, s, a) is the proportion of families with type (r, s, a) in the ascertainment subpopulation, and we have used the first order approximation of (see (A.25) and (A.26) in Appendix A). Note that the resulting sample size n has included uninformative families, i.e. those families with type (0, a, a).
If we use the weights suggested by Risch & Teng (1998), i.e. w(r, s, a) is proportional to an(r, s, a), then the test statistic for H0 is
The corresponding power to reject the null hypothesis with significance level α is given by
The sample size necessary to obtain a power of 1 − β with a significance level of α satisfies
(3) |
For the case of ε = 0, the sample size required is
Numerical Results and Simulation Study
Now we consider an example given by Risch & Teng (1998) to (i) compare the sample sizes required to detect association under our weighting design and the weighting scheme suggested by Risch & Teng (1998); (ii) illustrate the impact of measurement errors on sample size; and (iii) compare the sample sizes required to detect association only using families of the same structures and combining different family structures. In this regard, we should note that because combining different family types needs more pools, our method is slightly more expensive.
We consider two types of family structures: (a) (0, 3, 2), i.e., two affected children and one unaffected child, and no parents; and (b) (0, 3, 1), i.e., one affected child and two unaffected children, and no parents. From formulae (2) and (3) we calculate the sample size necessary to attain the significance level of α = 5 × 10−8 and power of 1 − β = 80%, the levels suggested by Risch & Merikangas (1996) for a genome scan, under various genetic models and measurement errors. The results are presented in Table 2 for low penetrances and Table 3 for high penetrances. Note that the sample sizes provided in the tables are the numbers of families required. It should be mentioned that the sample sizes obtained in our calculations do not include uninformative families with the structure (0, a, a) because there are no such families in the population we considered. Based on the results in these tables, we can see the following. (i) For low penetrances, the sample sizes required under our weighting scheme and that suggested by Risch & Teng (1998) are almost the same; both are close to the case of using only families of type (0, 3, 1). This can be easily understood by noting that for low penetrances, there are many more families with structure (0, 3, 1) than with structure (0, 3, 2). For high penetrances, the sample sizes required under our weighting scheme are generally smaller than those under that of Risch & Teng (1998). The difference is largest for dominant models, and smallest for recessive models. It can also be observed that for the recessive model with allele frequencies that are not large, the weighting method of Risch & Teng is even slightly better, although the difference is small (the largest relative difference is about 5%). This is not surprising because our weighting scheme will not necessarily result in uniformly optimal power. (ii) The impact of measurement errors on sample size is generally large. Relative to the case of low penetrances, for high penetrances the impact is not very large when the error rates are small and the allele frequencies are not small. But the impact can be substantial for moderate error rates (ε = 0.01) or small allele frequencies, especially for low penetrances. In these cases, there is a dramatic increase in sample sizes.
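The remark that families of type (0, 3, 1) greatly outnumber those of type (0, 3, 2) under low penetrances can be illustrated with a short calculation. This is a sketch under the assumptions used in the paper (Hardy-Weinberg equilibrium, random mating, Mendelian transmission, marker identical to the disease locus); the function names are hypothetical. Given the parental genotypes, the three children are affected independently, so the number of affected children is binomial.

```python
from itertools import product

def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies, keyed by number of risk alleles."""
    q = 1.0 - p
    return {2: p * p, 1: 2 * p * q, 0: q * q}

def child_affected_prob(u, v, pen):
    """P(child affected | parents carry u and v risk alleles)."""
    tu, tv = u / 2.0, v / 2.0  # transmission probabilities of the risk allele
    child = {2: tu * tv, 0: (1 - tu) * (1 - tv)}
    child[1] = 1.0 - child[2] - child[0]
    return sum(child[w] * pen[w] for w in (0, 1, 2))

def ratio_one_vs_two_affected(p, pen):
    """Ratio of the frequencies of (0, 3, 1) and (0, 3, 2) families."""
    g = genotype_freqs(p)
    one = two = 0.0
    for u, v in product((0, 1, 2), repeat=2):
        pi = child_affected_prob(u, v, pen)
        one += g[u] * g[v] * 3 * pi * (1 - pi) ** 2
        two += g[u] * g[v] * 3 * pi ** 2 * (1 - pi)
    return one / two

# Dominant models of Tables 2 and 3, keyed by number of risk alleles.
low = {2: 0.004, 1: 0.004, 0: 0.001}
high = {2: 0.4, 1: 0.4, 0: 0.1}
ratio_low = ratio_one_vs_two_affected(0.20, low)
ratio_high = ratio_one_vs_two_affected(0.20, high)
```

For the low-penetrance model the ratio is in the hundreds, whereas for the high-penetrance model it stays below ten, which is consistent with the combined design behaving like the (0, 3, 1) design only when penetrances are low.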
Table 2.
| | ε = 0 | ε = 0.005 | ε = 0.01 |
---|---|---|---|
Dominant | |||
p = 0.05 | 532(533) | 1702(1701) | ∞ (∞) |
p = 0.20 | 355(357) | 456(457) | 2751(2969) |
p = 0.70 | 4286(4306) | ∞ (∞) | ∞ (∞) |
Recessive | |||
p = 0.05 | 59117(59092) | ∞ (∞) | ∞ (∞) |
p = 0.20 | 1494(1495) | 45486(∞) | ∞ (∞) |
p = 0.70 | 270(270) | 353(353) | 3312(4511) |
Multiplic. | |||
p = 0.05 | 2026(2030) | ∞ (∞) | ∞ (∞) |
p = 0.20 | 653(654) | 1135(1135) | ∞ (∞) |
p = 0.70 | 639(640) | 1256(1255) | ∞ (∞) |
Additive | |||
p = 0.05 | 1209(1212) | ∞ (∞) | ∞ (∞) |
p = 0.20 | 524(525) | 788(789) | 97207(∞) |
p = 0.70 | 984(986) | 3547(3571) | ∞ (∞) |
The values in brackets are based on the weighting scheme suggested by Risch & Teng (1998);
∞ means that 80% power cannot be attained or the sample size required is unrealistically large (greater than 100 000); Significance level α = 5 × 10−8; power 1 − β = 0.80; Dominant model: f2 = f1 = 0.004, f0 = 0.001; Recessive model: f2 = 0.004, f1 = f0 = 0.001; Multiplicative model: f2 = 0.004, f1 = 0.002, f0 = 0.001; Additive model: f2 = 0.004, f1 = 0.0025, f0 = 0.001.
Table 3.
| | ε = 0 | ε = 0.005 | ε = 0.01 |
---|---|---|---|
Dominant | |||
p = 0.05 | 338(373) | 537(500) | 2124(∞) |
p = 0.20 | 185(217) | 202(232) | 269(290) |
p = 0.70 | 1746(2104) | 4663(5890) | ∞ (∞) |
Recessive | |||
p = 0.05 | 48005(45598) | ∞ (∞) | ∞ (∞) |
p = 0.20 | 1139(1126) | 2238(2121) | ∞ (∞) |
p = 0.70 | 155(159) | 167(170) | 215(216) |
Multiplic. | |||
p = 0.05 | 1513(1669) | ∞ (∞) | ∞ (∞) |
p = 0.20 | 443(486) | 564(587) | 1458(1568) |
p = 0.70 | 333(358) | 385(409) | 704(724) |
Additive | |||
p = 0.05 | 869(967) | 4448(9965) | ∞ (∞) |
p = 0.20 | 335(376) | 398(430) | 762(752) |
p = 0.70 | 485(534) | 598(650) | 1804(1867) |
The values in brackets are based on the weighting scheme suggested by Risch & Teng (1998);
∞ means that 80% power cannot be attained or the sample size required is unrealistically large (greater than 100 000); Significance level α = 5 × 10−8; power 1 − β = 0.80; Dominant model: f2 = f1 = 0.4, f0 = 0.1; Recessive model: f2 = 0.4, f1 = f0 = 0.1; Multiplicative model: f2 = 0.4, f1 = 0.2, f0 = 0.1; Additive model: f2 = 0.4, f1 = 0.25, f0 = 0.1.
To compare the sample sizes required by the design combining different family structures, and by the design only using families of the same structures, we further calculate the sample sizes by using the families with types (0, 3, 2) and (0, 3, 1) for the case of no measurement errors, respectively. The results for high penetrances are provided in Table 4. It is clear from this table that the sample sizes required for incorporating different family structures are between those separately using families with type (0, 3, 2) or (0, 3, 1). One family structure is not always preferable over the other, and the relative information for disease association depends on specific disease models. Similar conclusions can be drawn for low penetrances. Therefore, there is added benefit in incorporating various family structures when the mode of inheritance is unknown.
Table 4.
| | Using both (0, 3, 2) and (0, 3, 1) | Using only (0, 3, 2) | Using only (0, 3, 1) |
---|---|---|---|
Dominant | |||
p = 0.05 | 338(373) | 208 | 355 |
p = 0.20 | 185(217) | 187 | 184 |
p = 0.70 | 1746(2104) | 2539 | 1417 |
Recessive | |||
p = 0.05 | 48005(45598) | 17237 | 55431 |
p = 0.20 | 1139(1126) | 534 | 1275 |
p = 0.70 | 155(159) | 150 | 152 |
Multiplic. | |||
p = 0.05 | 1513(1669) | 1116 | 1556 |
p = 0.20 | 443(486) | 359 | 459 |
p = 0.70 | 333(358) | 348 | 322 |
Additive | |||
p = 0.05 | 869(967) | 615 | 898 |
p = 0.20 | 335(376) | 299 | 342 |
p = 0.70 | 485(534) | 534 | 459 |
The values in brackets are based on the weighting scheme suggested by Risch & Teng (1998); Significance level α = 5 × 10−8; power 1 − β = 0.80; Dominant model: f2 = f1 = 0.4, f0 = 0.1; Recessive model: f2 = 0.4, f1 = f0 = 0.1; Multiplicative model: f2 = 0.4, f1 = 0.2, f0 = 0.1; Additive model: f2 = 0.4, f1 = 0.25, f0 = 0.1.
We conduct some simulation studies to confirm our large sample results. We first generate the genotypes of parents assuming Hardy-Weinberg Equilibrium, and the genotypes of three children assuming Mendelian transmission. Using the penetrances f2, f1 and f0 we simulate the disease status of each child. For a given sample size we confine ourselves to the families with one or two affected children. The test statistic t is used to calculate the empirical type I error rate and power. Note that to see whether our method leads to a correct type I error rate, a very large number of simulations is needed to consider α = 5 × 10−8. So we consider the nominal significance level of α = 0.05 instead. The empirical type I error rate is the proportion of significant replicates out of the total number of replicates under H0. By making use of the sample sizes suggested by the asymptotic power approximations (when the sample size suggested is ∞, we do not report power), we can calculate the empirical power and hence check whether a power of 80% can be attained. The empirical power is the proportion of significant replicates out of the total number of replicates under H1. Based on 500 replicates (100 replicates for the case of recessive model and very low allele frequency, p = 0.05) our results are summarized in Table 5 for empirical type I error rate and in Table 6 for empirical power. It can be seen that the empirical type I error rates and empirical powers are generally close to the significance level of 0.05 and power of 80%, respectively.
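The data-generating steps described above can be sketched as follows. This is a minimal illustration rather than the code used for Tables 5 and 6: the function names, the seed and the rejection-sampling loop are assumptions, and only the children are retained because the example uses families without parents.

```python
import random

def draw_genotype(p, rng):
    """Parental genotype (number of risk alleles) under Hardy-Weinberg equilibrium."""
    return sum(rng.random() < p for _ in range(2))

def transmit(parent, rng):
    """Mendelian transmission: the risk allele is passed on with probability parent/2."""
    return 1 if rng.random() < parent / 2.0 else 0

def simulate_family(p, pen, rng, n_children=3):
    """Return a list of (genotype, affected) pairs for the children of one family."""
    father, mother = draw_genotype(p, rng), draw_genotype(p, rng)
    children = []
    for _ in range(n_children):
        g = transmit(father, rng) + transmit(mother, rng)
        children.append((g, rng.random() < pen[g]))
    return children

def sample_families(n, p, pen, seed=0):
    """Keep drawing families until n families with one or two affected children are found."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        family = simulate_family(p, pen, rng)
        if sum(affected for _, affected in family) in (1, 2):
            kept.append(family)
    return kept

# High-penetrance dominant model, keyed by number of risk alleles.
families = sample_families(200, 0.20, {2: 0.4, 1: 0.4, 0: 0.1})
```

Pool frequencies for each family type would then be computed from the retained genotypes, with measurement error added where ε > 0.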
Table 5.
| | ε = 0 | ε = 0.005 | ε = 0.01 |
---|---|---|---|
p = 0.05 | 0.064 | 0.049 | 0.057 |
p = 0.20 | 0.048 | 0.042 | 0.044 |
p = 0.70 | 0.060 | 0.070 | 0.068 |
The critical value is 1.6449 (which corresponds to the significance level of 0.05 under normality), and the sample size is 200.
Table 6.
| | ε = 0 | ε = 0.005 | ε = 0.01 |
---|---|---|---|
Dominant | |||
p = 0.05 | 0.840 | 0.842 | 0.866 |
p = 0.20 | 0.840 | 0.846 | 0.840 |
p = 0.70 | 0.896 | 0.896 | |
Recessive | |||
p = 0.05 | 0.780 | ||
p = 0.20 | 0.808 | 0.702 | |
p = 0.70 | 0.796 | 0.836 | 0.804 |
Multiplic. | |||
p = 0.05 | 0.872 | ||
p = 0.20 | 0.796 | 0.830 | 0.852 |
p = 0.70 | 0.790 | 0.818 | 0.814 |
Additive | |||
p = 0.05 | 0.774 | 0.796 | |
p = 0.20 | 0.800 | 0.764 | 0.890 |
p = 0.70 | 0.780 | 0.730 | 0.816 |
The critical value is 5.3267 (which corresponds to the significance level of 5 × 10−8 under normality); Dominant model: f2 = f1 = 0.4, f0 = 0.1; Recessive model: f2 = 0.4, f1 = f0 = 0.1; Multiplicative model: f2 = 0.4, f1 = 0.2, f0 = 0.1; Additive model: f2 = 0.4, f1 = 0.25, f0 = 0.1.
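The two critical values quoted in the notes to Tables 5 and 6 are upper percentiles of the standard normal distribution for a one-sided test. As a quick check, they can be reproduced with Python's standard library; the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

def critical_value(alpha):
    """Upper 100*alpha percentile of the standard normal distribution."""
    # By symmetry, the upper-tail quantile is minus the lower-tail quantile.
    return -NormalDist().inv_cdf(alpha)

z_nominal = critical_value(0.05)  # level used for the type I error simulations
z_genome = critical_value(5e-8)   # genome-scan level used for the power simulations
```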
Discussion
In this article, we have developed a general weighting scheme to combine families of different structures in the detection of genetic associations using DNA pooling through family-based association designs. In addition, we explicitly modelled the measurement errors in our approach. It is observed that our weighting scheme is usually better than that suggested by Risch & Teng (1998). In the example we considered, where the families have two different types of structures, the efficiency of the design combining families of different structures is always between those of the designs only using families with one of the two structures. However, because it is generally much easier to collect families of different structures in practice and, more importantly, the informativeness of each family structure depends on the disease model, which is often unknown, we advocate the use of a study design that maximizes the usage of available family data. We also studied the impact of measurement errors on the sample size required. Our numerical results showed that, similar to the case of pooled population data (Zou & Zhao, 2004), the sample size required to attain a desired significance level and power using pooled DNA may significantly increase as the measurement errors increase for family-based association tests. However, such impact can be reduced if multiple replicates of each pooled sample are measured. For example, if four replicate measurements are used and, accordingly, p̂(r, s, a) A and p̂(r, s, a) U are replaced by the averages of these four measurements, then the standard deviation will be halved to ε/2, and all formulae in the paper still hold. Thus, if ε = 0.01, then the standard deviation after four replicate measurements will be 0.005. From Tables 2 and 3 we see that the sample sizes required are greatly reduced. As a result of measurement errors it is possible that a specified power, e.g. 80%, may never be achieved under certain disease models (see results in Tables 2 and 3). Therefore, our analysis emphasizes the importance of reducing measurement errors in DNA pooling studies. Note that in our discussion the standard deviation ε is assumed to be known. If ε is unknown then we can infer it from laboratory experiments or from the distributions of the test statistics (Jawaid et al. 2002). Although a precise value of ε cannot be obtained, our findings based on asymptotic results and simulation studies in Section 4 suggest that for high penetrances the effect of a minor misspecification of ε (for example, the estimate of ε is 0.005 but ε = 0.0075 in reality) on the association test is not very large. Relatively speaking, the effect is slightly larger for low allele frequencies. However, for low penetrances such an effect can be large (data not shown).
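The effect of replicate measurements follows from the usual rule for the standard deviation of a mean. The sketch below assumes that the errors of the replicate measurements of a pool are independent with a common standard deviation ε; the function name is hypothetical.

```python
import math

def replicate_sd(eps, k):
    """Standard deviation of the average of k independent replicate measurements."""
    return eps / math.sqrt(k)

sd_four = replicate_sd(0.01, 4)  # four replicates at eps = 0.01
```

With four replicates the effective error falls from 0.01 to 0.005. In Table 3, for instance, the dominant model with p = 0.05 requires 2124 families at ε = 0.01 but 537 at ε = 0.005 under our weighting scheme.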
It should be pointed out that we have assumed that the parental phenotypes are unknown in order to simplify our analysis. If the parental disease prevalence is low, then our results are close to the case of unaffected parents. Generally we can use separate pools for families with affected parents and for families with unaffected parents. More precisely we can consider the following family types separately: two affected parents, two unaffected parents, and one affected parent and one unaffected parent for the families with two parents; one affected parent, and one unaffected parent for the families with only one parent; and the families with no parents. Such consideration should provide additional information. The analytical details can be given along the line devised here. However, this will be more complicated and remains to be studied in our future work.
In this discussion we have assumed perfect linkage disequilibrium between the disease locus and marker locus. However, it is more likely that the marker being examined is in incomplete linkage disequilibrium with the genetic variant of interest. In this case it is necessary to derive the penetrances of the genotypes at a marker locus for each family structure (r, s, a). Then all the formulae obtained previously can be used. Risch & Teng’s (1998) results can serve this purpose, although the problem will be more difficult if several markers are considered together. In fact, let p(r, s, a) and q(r, s, a) be the frequencies of alleles D and d at the disease locus, and f2, f1 and f0 still be the penetrances of genotypes DD, Dd and dd, respectively. Further, let and be the frequencies of alleles A and a at the marker locus, and and be the penetrances of genotypes AA, Aa and aa, respectively. If we use Bengtsson & Thomson’s (1981) definition of the linkage disequilibrium parameter δ:
then from Risch & Teng (1998) we have
and
where δ(r, s, a) is the linkage disequilibrium measure in the family structure (r, s, a). Therefore,
and
Note that and may be dependent on the family structures. But the formulae in this paper can still be used for this case as long as we substitute them for f2, f1 and f0, respectively.
In this paper we have assumed the random missingness of parental genotypes so that the available and missing parents have the same marker score distributions. This is plausible if the parental missingness is not related to the phenotype. For example, the random missingness assumption holds if we are unable to locate the parents because of death from some other disease or accident. However, in the situation where the missingness of a parent is related to the phenotype, this assumption may not be reasonable. For instance, in a study of genetic factors in an aggressive form of cancer, it is more likely that parents carrying the disease-predisposing allele are missing. A detailed discussion can be found in Allen et al. (2003). The construction of appropriate test statistics under this scenario warrants further research.
Acknowledgments
This work was supported in part by grant GM59507 from the National Institutes of Health and by grant No. 70221001 from the National Natural Science Foundation of China. The authors thank two reviewers for their helpful comments.
Appendix A
In this appendix, we first derive the marginal distributions and joint distributions of the marker scores for the affected siblings, for the unaffected siblings, and for the parents whose phenotypes are assumed to be unknown, then their means, variances and covariances, and finally the mean and variance of the difference between the two allele frequency estimates p̂(r, s, a) A and p̂(r, s, a) U for the families with identical structures under the null hypothesis H0 and alternative hypothesis H1. We give only the results under H1 as the distributions under H0 can be obtained by replacing the penetrances f2, f1 and f0 by the disease prevalence for each family type (r, s, a).
Marginal marker score distributions
Let G = (u, v) be the mating type of parents and be the conditional probability of G = (u, v) given the family type being (r, s, a). When the parents are missing at random, the values of are given in Table 1 and this reduces to Table 1 of Risch & Teng (1998) if the penetrances are low so that the unaffected individuals can be regarded as having unknown phenotypes. Denote when u ≠ v. Then the distribution of is
(A.1) |
The distribution of can be obtained by replacing fw by 1 − fw in formula (A.1), where w = 0, 1 and 2 and the probabilities are denoted by α(r, s, a) Y (w′), where w′ = 0, 1 and 2. In the following discussion, u, v, w, and w′ always take a value of 0, 1, or 2.
Table 1.
Mating type | Population frequency | ||
---|---|---|---|
(2,2) | g22 | ||
(2,1) | g21 | ||
(2,0) | g20 | ||
(1,2) | g12 | ||
(1,1) | g11 | ||
(1,0) | g10 | ||
(0,2) | g02 | ||
(0,1) | g01 | ||
(0,0) | g00 |
Ks,a is the sum of all numerators in the third column.
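Since the third column of Table 1 is not reproduced here, the following sketch shows one way such conditional mating-type probabilities could be computed with Bayes' rule. It is an illustration under stated assumptions, not the authors' table: parents are assumed to be in Hardy-Weinberg equilibrium and to mate at random, so that the population frequency of mating type (u, v) is the product of the two genotype frequencies; the marker is taken to be the disease locus; and the function name is hypothetical.

```python
from itertools import product
from math import comb

def mating_type_posterior(p, pen, s, a):
    """P(G = (u, v) | a of s children affected), together with the normalizing sum."""
    q = 1.0 - p
    geno = {2: p * p, 1: 2 * p * q, 0: q * q}
    numerators = {}
    for u, v in product((2, 1, 0), repeat=2):
        tu, tv = u / 2.0, v / 2.0
        child = {2: tu * tv, 0: (1 - tu) * (1 - tv)}
        child[1] = 1.0 - child[2] - child[0]
        pi = sum(child[w] * pen[w] for w in (0, 1, 2))  # P(child affected | G)
        numerators[(u, v)] = (geno[u] * geno[v] * comb(s, a)
                              * pi ** a * (1 - pi) ** (s - a))
    k_sa = sum(numerators.values())  # plays the role of K s,a in the table note
    return {g: x / k_sa for g, x in numerators.items()}, k_sa

# High-penetrance recessive model, keyed by number of risk alleles.
posterior, k_sa = mating_type_posterior(0.20, {2: 0.4, 1: 0.1, 0: 0.1}, s=3, a=2)
```

In this example, conditioning on two affected children out of three raises the probability of the (2, 2) mating type above its population frequency, as expected for a recessive model.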
Denote the numbers of allele A of the father and mother in the ith family of structure (r, s, a) by and , respectively. Note that the notation in the previous sections is not necessarily the same as and can be equal to depending on the observed results. The distribution of marker scores for the father is given by
(A.2) |
and the distribution for the mother can be obtained by replacing by in formula (A.2), and is denoted as α(r, s, a)Z(m) (w′).
Joint marker score distributions
Now we consider the joint distributions for marker scores. Let α(r, s, a) XY (u, v) denote the probability that one affected sibling and one unaffected sibling in the family with type (r, s, a) have u and v alleles A, respectively. Then it can be shown that
(A.3) |
(A.4) |
(A.5) |
(A.6) |
(A.7) |
(A.8) |
(A.9) |
(A.10) |
and
(A.11) |
The joint distribution for two affected siblings, denoted by α(r, s, a) XX (u, v), can be obtained by replacing 1 − fw by fw in formulae (A.3)–(A.11). The joint distribution for two unaffected siblings, denoted by α(r, s, a) YY (u, v), can be obtained by replacing fw by 1 − fw in the same formulae; note that in this case the terms 1 − fw already present in the formulae remain unchanged.
Likewise, if we let α(r, s, a) XZ(f) (u, v) be the probability that one affected sibling and the father in the family with type (r, s, a) have u and v alleles A, respectively, then we obtain
(A.12) |
(A.13) |
(A.14) |
(A.15) |
(A.16) |
(A.17) |
(A.18) |
(A.19) |
and
(A.20) |
The joint distribution for one affected sibling and the mother can be obtained by replacing by in the formulae (A.12)–(A.20), and is denoted by α(r, s, a) XZ(m) (u, v), and the joint distribution for one unaffected sibling and the father can be obtained by replacing fw by 1 − fw in the formulae (A.12)–(A.20), and is denoted by α(r, s, a) YZ(f) (u, v). As for the joint distribution for one unaffected sibling and the mother, this can be obtained by replacing by and fw by 1 − fw in the formulae (A.12)–(A.20) and is denoted by α(r, s, a) YZ(m) (u, v).
Mean, variance, and covariance of marker scores
From the marker score distributions of the affected and unaffected siblings and their parents, and the joint distributions of the family members, we can obtain the corresponding expectations, variances, and covariances:
Similarly, we can obtain the expressions for the expectations of and ;
The variances of and have similar forms;
The expressions of the covariances between and and and are similar;
(A.21) |
The covariance has a similar form to except and μ(r, s, a)Z(m) are in place of and μ(r, s, a) Z(f) in the above formula. Likewise, the expressions for the covariance can be obtained by replacing fw by 1 − fw in formula (A.21), and the covariance can be obtained by replacing fw by 1 − fw, by and μ(r, s, a) Z(f) by μ(r, s, a) Z(m) in formula (A.21).
Mean and variance of the difference between the two allele frequency estimates p̂(r, s, a) A and p̂(r, s, a) U for families with identical structures.
In the following, we derive the mean and variance of p̂(r, s, a) A − p̂(r, s, a) U under the alternative hypothesis. It can be shown that the means of allele frequency estimates in the case group and control group are
and
respectively. If we define
and note that
since we have assumed random mating of parents when r = 1, then we have
(A.22) |
Likewise, we can show that
and
where E1 denotes the expectation over all possible values of n(r, s, a). Now define
and
Then from the formulae for and γ(r, s, a) YZ(i) (i = 1, 2) provided above (since etc., defined above, depend only on the sum of the variances etc. over both parents, we can regard the first parent as the father and the second parent as the mother), we have
and
(A.23) |
Note that the mating of the parents is assumed to be random when r = 1. Hence
and
Therefore, the variance of p̂(r, s, a) U can be expressed as
Further, we can obtain
where
has a meaning similar to that of γ(r, s, a) YZ given in (A.23), except that 1 − fw is replaced by fw in the expression. Consequently,
(A.24) |
From Stephan (1945), we have
(A.25) |
where as before, λ(r, s, a) is the proportion of families with type (r, s, a) in the ascertainment subpopulation, and can be accurately estimated when the sample size is large. If we use the first order approximation of , then we get
(A.26) |
We use formula (A.26) to estimate the sample size required to detect association in this study, although it is straightforward to approximate power and sample size using the second-order approximation instead. In fact, our numerical calculations for the example given in Section 4 show that the first- and second-order approximations lead to almost identical power (data not shown). If, like Risch & Teng (1998), we assume that the penetrance is low, then the expectation and variance of the difference between the sample frequencies among the affected individuals and the controls, p̂(r, s, a) A − p̂(r, s, a) U, under the alternative hypothesis reduce to
(A.27) |
and
(A.28) |
respectively, where
and
These results are the same as those in Risch & Teng (1998). Equations (A.27) and (A.28) give unified formulae for various family structures. By taking different values of r, s and a, we can obtain the corresponding results in Risch & Teng (1998).
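The claim above that first- and second-order approximations give almost identical results can be checked numerically for the reciprocal of a positive binomial count, the quantity treated by Stephan (1945). The sketch below assumes that the number of families of a given type is binomial with success probability λ(r, s, a), conditioned on being positive. It uses the standard Taylor (delta-method) expansion, which may differ in form from formula (A.25). The names `n_families` and `lam` are hypothetical.

```python
from math import comb


def exact_mean_reciprocal(n_families, lam):
    """E[1/X | X > 0] for X ~ Binomial(n_families, lam), by direct summation.

    Suitable for moderate n_families; very large values would need
    log-space arithmetic to avoid overflow in the binomial coefficients.
    """
    p_zero = (1.0 - lam) ** n_families
    total = sum(
        comb(n_families, k) * lam ** k * (1.0 - lam) ** (n_families - k) / k
        for k in range(1, n_families + 1)
    )
    return total / (1.0 - p_zero)


def first_order(n_families, lam):
    """First-order approximation: 1 / E[X]."""
    return 1.0 / (n_families * lam)


def second_order(n_families, lam):
    """Second-order Taylor approximation: 1/mu + Var(X)/mu^3."""
    mu = n_families * lam
    return 1.0 / mu + (1.0 - lam) / mu ** 2


if __name__ == "__main__":
    n_families, lam = 200, 0.3
    print("exact       ", exact_mean_reciprocal(n_families, lam))
    print("first order ", first_order(n_families, lam))
    print("second order", second_order(n_families, lam))
```

With a few hundred families and a family-type proportion that is not close to zero, the two approximations differ by about one percent, consistent with the observation that they lead to nearly the same power.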
Appendix B
In this appendix, we prove the asymptotic normality of our proposed test statistic t under H0. For convenience we denote the variance of the measurement error by when the sample size is n. When tn is a nonzero constant, the proof is obvious. Now we assume that , where ℓ is a constant. It can be seen that
where
has mean zero and variance
under H0. Note that when n → ∞,
where → p. means convergence in probability. So from the central limit theorem and the assumptions given in Section 2, we have
where → d. means convergence in distribution. On the other hand, under H0, p̂(r, s, a) → p. p(r, s, a). Hence,
Thus,
References
- Allen A, Rathouz P, Satten G. Informative missingness in genetic association studies: case-parent designs. Am J Hum Genet. 2003;72:671–680. doi: 10.1086/368276.
- Bader J, Bansal A, Sham P. Efficient SNP-based tests of association for quantitative phenotypes using pooled DNA. GeneScreen. 2001;1:143–150.
- Bader J, Sham P. Family-based association tests for quantitative traits using pooled DNA. Eur J Hum Genet. 2002;10:870–878. doi: 10.1038/sj.ejhg.5200893.
- Bengtsson BO, Thomson G. Measuring the strength of associations between HLA antigens and diseases. Tissue Antigens. 1981;18:356–363. doi: 10.1111/j.1399-0039.1981.tb01404.x.
- Buetow KH, Edmonson M, MacDonald R, Clifford P, Yip P, Kelley J, Little DP, Strausberg R, Koester H, Cantor CR, Braun A. High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc Natl Acad Sci USA. 2001;98:581–584. doi: 10.1073/pnas.021506298.
- Grupe A, Germer S, Usuka J, Aud D, Belknap JK, Klein RF, Ahluwalia MK, Higuchi R, Peltz G. In silico mapping of complex disease-related traits in mice. Science. 2001;292:1915–1918. doi: 10.1126/science.1058889.
- Ito T, Chiku S, Inoue E, Tomita M, Morisaki T, Morisaki H, Kamatani N. Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am J Hum Genet. 2003;72:384–398. doi: 10.1086/346116.
- Jawaid A, Bader J, Purcell S, Cherny S, Sham P. Optimal selection strategies for QTL mapping using pooled DNA samples. Eur J Hum Genet. 2002;10:125–132. doi: 10.1038/sj.ejhg.5200771.
- Le Hellard S, Ballereau SJ, Visscher PM, Torrance HS, Pinson J, Morris SW, Thomson ML, Semple CA, Muir WJ, Blackwood DH, Porteous DJ, Evans KL. SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Res. 2002;30:e74. doi: 10.1093/nar/gnf070.
- Lipkin E, Mosig MO, Darvasi A, Ezra E, Shalom A, Friedmann A, Soller M. Quantitative trait locus mapping in dairy cattle by means of selective milk DNA pooling using dinucleotide microsatellite markers: analysis of milk protein percentage. Genetics. 1998;149:1557–1567. doi: 10.1093/genetics/149.3.1557.
- Michelmore R, Paran I, Kesseli R. Identification of marker linked to disease resistance gene by bulk segregant analysis: a rapid method to detect markers in specific genomic regions using segregating populations. Proc Natl Acad Sci USA. 1991;88:9828–9832. doi: 10.1073/pnas.88.21.9828.
- Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516.
- Risch N, Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. Genome Res. 1998;8:1273–1288. doi: 10.1101/gr.8.12.1273.
- Sham P, Bader J, Craig I, O’Donovan M, Owen M. DNA pooling: a tool for large-scale association studies. Nature Reviews Genetics. 2002;3:862–871. doi: 10.1038/nrg930.
- Stephan FF. The expected value and variance of the reciprocal and other negative powers of a positive Bernoulli variate. Ann Math Statist. 1945;16:50–61.
- Xu C, Donnelly C, Montgomery D, Allan C, Purvis I. Determination of SNP allele frequency by a DNA pooling method. Am J Hum Genet. 1999;65:2577.
- Wang S, Kidd KK, Zhao H. On the use of DNA pooling to estimate haplotype frequencies. Genet Epidemiol. 2003;24:74–82. doi: 10.1002/gepi.10195.
- Zou G, Zhao H. The impacts of errors in individual genotyping and DNA pooling on association studies. Genet Epidemiol. 2004;26:1–10. doi: 10.1002/gepi.10277.