Abstract
Current common wisdom posits that association analyses using family-based designs have inflated type 1 error rates (if relationships are ignored) and independent controls are more powerful than familial controls. We explore these suppositions. We show theoretically that family-based designs can have deflated type 1 error rates. Through simulation, we examine the validity and power of family designs for several scenarios: cases from randomly or selectively ascertained pedigrees; and familial or independent controls. Family structures considered are: sibships, nuclear families, moderate-sized and extended pedigrees. Three methods were considered with the chi-squared test for trend: variance correction (VC), weighted (weights assigned to account for genetic similarity), and naïve (ignoring relatedness) as well as the Modified Quasi-likelihood Score (MQLS) test. Selectively ascertained pedigrees had similar levels of disease enrichment; random ascertainment had no such restriction. Data for 1000 cases and 1000 controls were created under the null and alternative models. The VC and MQLS methods were always valid. The naïve method was anti-conservative if independent controls were used and valid or conservative in designs with familial controls. The weighted association method was generally valid for independent controls, and was conservative for familial controls. With regard to power, independent controls were more powerful for small to moderate selectively ascertained pedigrees, but familial and independent controls were equivalent in the extended pedigrees and familial controls were consistently more powerful for all randomly ascertained pedigrees. These results suggest a more complex situation than previously assumed which has important implications for study design and analysis.
INTRODUCTION
The genetic relationships that exist between individuals in family-based sampling designs have been considered to universally violate statistical assumptions for standard statistical tests of association, hence violating the validity of these tests. This has led to a current common wisdom that suggests that applying standard population-based association tests to data from family-based designs will lead to anti-conservative tests if genetic relationships are not accounted for. An additional current common wisdom is that family-based controls (controls that reside in a pedigree that contains cases) are considered to be less powerful than independent controls in association testing. Certainly these suppositions have been shown to be correct in certain situations [Bourgain, et al. 2003; Browning, et al. 2005; McArdle, et al. 2007] and at first glance they appear entirely reasonable. However, there are a few studies that have shown contradictory evidence [Newman, et al. 2001; Slager, et al. 2003] and the suppositions have not been thoroughly tested across a variety of scenarios to assess their uniform applicability. There are a great many family-based resources that already exist across a vast variety of complex diseases, the majority previously ascertained for linkage analyses. These family-based resources are increasingly being used for association studies. It is therefore important to more thoroughly understand the validity and power of association testing in family-based resources to be able to best utilize already ascertained samples.
Several methods for correcting relatedness have been developed to enable valid association analyses in family-based sampling designs and many different statistical tests have also been suggested [Allen-Brady, et al. 2006; Browning, et al. 2005; Knight, et al. 2009; Rabinowitz and Laird 2000; Slager and Schaid 2001; Spielman, et al. 1993; Thornton and McPeek 2007]. Here we focus on a basic association test (chi-squared test for trend) and investigate differences that occur due to the method of correction for related subjects and the design of the study. For correction of relatedness, we will focus on two approaches: variance correction (VC) and individual weighting methods, as both can be used in a chi-square trend test framework. The VC approach was proposed by Slager and Schaid [Slager and Schaid 2001], where the variance used in the test statistic is adjusted by the covariance of the studied individuals (cases and controls). An individual weighting method was first proposed by Browning et al [Browning, et al. 2005]. In this approach, weights are assigned to the individuals in pedigrees to correct for the relationships between them, and these weights can be used in chi-square tests. Weights are calculated using an 'all-pairs' approach, that is, all pairs of related cases are considered when determining the weights. We also proposed a weighting method [Knight, et al. 2009], similar to that of Browning et al. but where all relationships between studied individuals are considered simultaneously when calculating weights. This second method was shown to have a slight power improvement over Browning's original method due to a tendency for the original 'all-pairs' approach to over-correct for the relationships [Knight, et al. 2009]. The validity of the VC method is well established [Bourgain, et al. 2003; Slager and Schaid 2001] and the validity of the weighting method has been shown using independent controls [Browning, et al. 2005].
The advantage of the VC and weighting methods over empirical approaches such as Genie [Allen-Brady et al 2005] is their utility in genome-wide association studies, where millions of genetic markers are tested and extreme critical probabilities are required for significance (situations in which an empirical approach is impractical).
A few studies have suggested that the use of a naïve approach (ignoring the known relationships) will always result in inflated type 1 error rates [Bourgain, et al. 2003; McArdle, et al. 2007]; however, other studies have shown that the naïve approach can result in conservative rates [Newman, et al. 2001; Slager, et al. 2003]. It is known that if relationships between studied subjects are ignored, point estimates, such as the mean, are unbiased. However, the variance is biased by the relationships between studied individuals. Slager and Schaid [2001] derived the equation for the unbiased, corrected variance of the Armitage test for trend when related individuals are considered (see below).
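One way to write this variance, reconstructed here from the symbol definitions and the signs described in the surrounding text rather than quoted from Slager and Schaid, starts from the trend score $U = X'\left[(1-R/N)\sum_i y_i - (R/N)\sum_j z_j\right]$ (the symbol $U$ is introduced for this sketch) and expands its variance term by term:

```latex
\begin{aligned}
\operatorname{Var}(U) = X' \Big\{
  & \left(1-\tfrac{R}{N}\right)^{2}
    \Big[\sum_{i=1}^{R}\operatorname{Var}(y_i)
       + 2\sum_{i<j}\operatorname{Cov}(y_i,y_j)\Big] \\
  & + \left(\tfrac{R}{N}\right)^{2}
    \Big[\sum_{i=1}^{N-R}\operatorname{Var}(z_i)
       + 2\sum_{i<j}\operatorname{Cov}(z_i,z_j)\Big] \\
  & - 2\,\tfrac{R}{N}\left(1-\tfrac{R}{N}\right)
    \sum_{i=1}^{R}\sum_{j=1}^{N-R}\operatorname{Cov}(y_i,z_j)
\Big\}\, X
\end{aligned}
```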
where: X is the score vector used for the trend test (usually (0,1,2)), R is the number of cases, N is the total sample size, yi is the genotype vector for the ith case, and zi is the genotype vector for the ith control.
As can be seen from the equation above, the corrected variance is constructed of components based on the variance of the cases (Var(yi)), variance of controls Var(zi), and covariances for case-case pairs (Cov(yi,yj)), control-control pairs (Cov(zi,zj)), and case-control pairs (Cov(yi,zj)). In the situation with unrelated individuals only the first two components exist because all covariances are zero. For related individuals the three covariance components change the variance estimate. If the combination of the covariance components is positive then the true variance will be larger than the naïve estimated variance. In this situation if the relationships are ignored, the variance is under-estimated, the test statistic over-estimated, resulting in inflated type 1 errors. However, it should be noted that the coefficient for the case-control pair covariance component is negative. Hence, provided the covariance components themselves are positive, it is theoretically possible that, if the case-control covariance dominates the other two covariance components, the true variance can be smaller than the naïve variance. In such a situation, ignoring relationships would result in an over-estimated variance and conservative type 1 errors.
It can be shown that all covariance components are positive. For example, for a sib-pair, the covariance component is p(1−p), where p is the disease minor allele frequency (MAF) and for all other (non inbred) relationships, the covariance component is π1p(1−p), where π1 is the probability that the pair shares one allele identical by descent (IBD) (see Appendix). Thus, if only cases are related (a family design with independent controls), then there is only one covariance (the case-case covariance) with a positive coefficient. Hence, the corrected variance is larger than the naïve variance and ignoring relationships will lead to inflated type 1 errors. However, when familial controls are studied, the balance between the case-case, control-control and case-control covariance components can lead to a corrected variance that can be larger than, equal to, or smaller than the naïve variance. For example with a sibship of two cases and two controls, there would be one case-case pair and one control-control pair, but four case-control pairs leading to a corrected variance that is smaller than the naïve variance. Theoretically, therefore, it is certainly possible for the naïve approach to lead to valid or conservative tests in familial data.
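The sibship example can be checked numerically. The sketch below is illustrative and is not the authors' code: it assumes an additive score X = (0,1,2), Hardy-Weinberg proportions with a known allele frequency, no inbreeding, and prior kinship coefficients φ, so that the covariance between two individuals' scores is 4φp(1−p) (p(1−p) for siblings, 2p(1−p) for an individual's own variance). The function name `trend_score_variances` is hypothetical.

```python
def trend_score_variances(kinship, is_case, p):
    """Return (naive, corrected) variance of the trend score.

    kinship : square list of lists of kinship coefficients (self-kinship 0.5)
    is_case : list of booleans, True for cases and False for controls
    p       : allele frequency (assumed known for this sketch)
    """
    n = len(is_case)
    r = sum(is_case)
    # Each case contributes with weight (1 - R/N), each control with -R/N.
    w = [(1 - r / n) if case else (-r / n) for case in is_case]
    pq = p * (1 - p)
    # Naive variance: individual variances only, all covariances set to zero.
    naive = sum(w[i] ** 2 * 4 * kinship[i][i] * pq for i in range(n))
    # Corrected variance: full quadratic form including every pairwise covariance.
    corrected = sum(
        w[i] * w[j] * 4 * kinship[i][j] * pq
        for i in range(n)
        for j in range(n)
    )
    return naive, corrected


# Sibship with two cases and two controls: every pair has kinship 0.25.
sib_kinship = [[0.5 if i == j else 0.25 for j in range(4)] for i in range(4)]
status = [True, True, False, False]
naive_fam, corrected_fam = trend_score_variances(sib_kinship, status, p=0.1)

# Same two sibling cases, but with two unrelated (independent) controls.
indep_kinship = [[0.5 if i == j else 0.0 for j in range(4)] for i in range(4)]
indep_kinship[0][1] = indep_kinship[1][0] = 0.25
naive_ind, corrected_ind = trend_score_variances(indep_kinship, status, p=0.1)

print("familial controls:   ", naive_fam, corrected_fam)
print("independent controls:", naive_ind, corrected_ind)
```

Under these assumptions the corrected variance is half the naïve variance when the controls are siblings of the cases, and larger than the naïve variance when the controls are unrelated.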
Variance correction, weighting, and naïve testing (simply ignoring relationships) methods have not been extensively examined for validity and power across varying pedigree structures and varying case and control selection strategies. As illustrated above, validity of some methods can vary based on the familial structure. The power of Slager and Schaid's VC method has been studied using cases from sibships [Tian, et al. 2007], and moderate-sized and extended pedigrees [Bourgain, et al. 2003; Thornton and McPeek 2007]. Power of the weighting method has been studied using moderate [Browning, et al. 2005] and extended pedigrees [Knight, et al. 2009]. These studies used independent controls and familial controls; however, none of these studies compared the power between these two types of controls. The selection of cases or pedigrees has been shown to impact power [Fingerlin, et al. 2004; Moore, et al. 2005]; however, the impact of pedigree selection on these methods has not been studied.
To enable genetic researchers to design family-based association studies that are valid and maximize power, we have examined the validity and power of three methods for managing familial subjects under a variety of scenarios based on different pedigree structures (from sibships to extended pedigrees), pedigree selection criteria (random versus selectively ascertained pedigrees), and control populations (familial controls versus independent controls). The three Cochran-Armitage trend test analysis approaches considered were 1) a VC approach, 2) a weighting method, and 3) a naïve method where the related cases and controls are treated as independent subjects. In addition, we also considered the MQLS test to assess the robustness of our findings beyond the basic trend test. The MQLS test is a quasi-likelihood score test for association that is also often used in pedigree-based resources and has been shown to have more power in some situations [Thornton and McPeek 2007].
METHODS
Analysis methods
Four analysis methods were considered. Three of the methods are based on the Cochran-Armitage trend test: Slager and Schaid's VC method, the weighting method, and the naïve method. The MQLS test was also considered.
Slager and Schaid's VC method adjusts the variance by using the covariance between related cases and controls which can be calculated based on the IBD probabilities of these individuals [Slager and Schaid 2001]. Posterior or prior IBD probabilities can be used. Posterior estimates use IBD probabilities as estimated from the observed genotypes (e.g. using GENEHUNTER [Kruglyak, et al. 1996]). Prior IBD probabilities can be derived from the kinship coefficients of the individuals considered. Given the limitation that posterior probabilities can only be easily and accurately estimated for small to moderately-sized pedigrees, we have used prior probabilities for all the covariance calculations.
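Prior kinship coefficients depend only on the pedigree structure. As an illustration (not the authors' implementation), the standard recursive definition of kinship can be coded in a few lines. The `pedigree` dictionary format and the function name `kinship_matrix` are assumptions made for this sketch, and individuals must be listed with both parents before their children.

```python
def kinship_matrix(pedigree):
    """Compute pairwise kinship coefficients by the standard recursion.

    pedigree : dict mapping id -> (father_id, mother_id), with (None, None)
               for founders; parents must appear before their children.
    Returns a dict keyed by (id_a, id_b) holding the kinship coefficient.
    """
    ids = list(pedigree)
    phi = {}
    for index, person in enumerate(ids):
        father, mother = pedigree[person]
        # Kinship with everyone listed earlier: average over the two parents.
        for other in ids[:index]:
            if father is None:
                value = 0.0  # founders are assumed unrelated to earlier entries
            else:
                value = 0.5 * (phi[(father, other)] + phi[(mother, other)])
            phi[(person, other)] = phi[(other, person)] = value
        # Self-kinship: 1/2 for founders, raised if the parents are related.
        if father is None:
            phi[(person, person)] = 0.5
        else:
            phi[(person, person)] = 0.5 * (1.0 + phi[(father, mother)])
    return phi


# Hypothetical three-generation pedigree: two siblings and their children.
example_pedigree = {
    "gf": (None, None), "gm": (None, None),
    "a": ("gf", "gm"), "b": ("gf", "gm"),
    "sa": (None, None), "sb": (None, None),
    "c1": ("a", "sa"), "c2": ("b", "sb"),
}
phi = kinship_matrix(example_pedigree)
print("siblings:     ", phi[("a", "b")])
print("parent-child: ", phi[("a", "gf")])
print("first cousins:", phi[("c1", "c2")])
```

For non-inbred pairs, the probability of sharing one or two alleles IBD can then be related to these coefficients, which is what makes a prior (genotype-free) covariance calculation feasible for pedigrees of any size.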
The weighting algorithm used was developed by Knight et al [2009] and uses gene-drop simulations to determine individual weights that account for the complete configuration of related individuals in a pedigree. The weighting algorithm assigns weights to subjects (both cases and controls) based on expected allele sharing in the pedigree, estimated by simulating inheritance vectors under the null (i.e. without regard to the affected status of the individuals). The advantage of our algorithm is that it can easily be applied to pedigrees of arbitrary sizes and structures. Also, while Browning's original method [Browning, et al. 2005] can result in weights that are negative, weights from this new weighting algorithm are always positive, resulting in larger effective sample sizes. Furthermore, a correction that considers all individuals simultaneously guards against a potential over-correction that is possible when weights are based on an all-pairs approach [Knight, et al. 2009]. Our algorithm has been implemented in a Java program and is available for download at our website (http://wwwgenepi.med.utah.edu/Genie/index.html). We used 10,000 gene-drops to calculate weights for each case or control in a pedigree. These weights were used in a weighted chi-square trend test.
A naïve trend analysis, that simply ignored any familial relationships, was also performed. All three of these trend test analyses were performed using R software.
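For reference, the naïve Cochran-Armitage trend statistic can be computed directly from genotype counts. The sketch below is in Python rather than R, since the analysis code itself is not shown in the paper; the function name `trend_test` and the example counts are made up for illustration. It uses the standard 1-degree-of-freedom form with scores (0, 1, 2) and treats all subjects as independent.

```python
import math


def trend_test(case_counts, control_counts, scores=(0, 1, 2)):
    """Naive Cochran-Armitage trend test on genotype counts.

    case_counts, control_counts : counts of genotypes with 0, 1, 2 risk alleles
    Returns (chi-square statistic with 1 df, p-value).
    """
    r_total = sum(case_counts)
    s_total = sum(control_counts)
    n_total = r_total + s_total
    column_totals = [r + s for r, s in zip(case_counts, control_counts)]

    sum_xr = sum(x * r for x, r in zip(scores, case_counts))
    sum_xn = sum(x * n for x, n in zip(scores, column_totals))
    sum_x2n = sum(x * x * n for x, n in zip(scores, column_totals))

    numerator = n_total * (n_total * sum_xr - r_total * sum_xn) ** 2
    denominator = r_total * s_total * (n_total * sum_x2n - sum_xn ** 2)
    chi2 = numerator / denominator  # undefined if every subject has the same genotype

    # Upper tail of a chi-square with 1 df, via the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for 1000 cases and 1000 controls.
chi2, p_value = trend_test((600, 350, 50), (700, 270, 30))
print(chi2, p_value)
```

The VC method keeps the same numerator but replaces the variance implied by this denominator with the corrected variance, which is why the two tests differ only when relatives are present.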
The MQLS method was also performed. Similarly to the VC method, MQLS accounts for the correlations among related individuals using a kinship matrix. However, MQLS additionally uses individual weights, based on the case-control configuration in the pedigree, in an attempt to increase power. For example, a case without any other case-relatives would be weighted differently from a case which is a member of an affected sib-pair. The rationale for this is that under the alternative hypothesis, cases that are clustered with other cases are more likely to be enriched for being risk variant carriers, and by weighting these more highly, more power is achieved.
Simulation Data
We simulated genotypes for four pedigree structures: sibships (containing five siblings), nuclear families (two parents and five offspring), moderately-sized pedigrees (three generation pedigrees with no genotyping in top founders and 20 descendants), and extended pedigrees (five generation pedigrees with no genotyping in top two generations and a total of 274 individuals). Singleton controls were also considered. Genotypes for SNPs were simulated using a gene-drop approach. Founders in the pedigree were randomly assigned alleles according to the MAF for the genetic model being considered and these were then segregated to offspring according to Mendelian inheritance. The simulated genotypes were created for families and singletons under alternative and null models. For the alternative data, case status was assigned using five models, each with a baseline sporadic rate of 10%; MAFs for the risk allele ranged from 0.01 to 0.4 and multiplicative allelic effects were assumed that ranged from 1.18 to 1.80 (Table 1). To match the phenotype clustering between the alternative and the null scenarios, for the null data, case status was assigned using an alternative model; however, genotypes were then removed and replaced with genotypes from a random gene drop. In this way, the phenotypic clustering appropriately represented a disease with a genetic component, but the genetic data analyzed were from the null.
Table 1.
Model | MAF | RR |
---|---|---|
1 | 1% | 1.80 |
2 | 5% | 1.40 |
3 | 10% | 1.30 |
4 | 20% | 1.20 |
5 | 40% | 1.18 |
Note: MAF, minor allele frequency (for disease); RR, relative risk (of disease).
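The gene-drop step described above can be sketched for the nuclear-family structure as follows. This is an illustrative Python sketch, not the simulation code used in the study: the function names, the use of Python's `random` module, and the choice to return risk-allele counts are all assumptions. Founders draw two alleles independently with probability equal to the MAF, and each offspring receives one randomly chosen allele from each parent.

```python
import random


def founder_genotype(maf, rng):
    """Draw two alleles independently; 1 denotes the risk (minor) allele."""
    return (int(rng.random() < maf), int(rng.random() < maf))


def transmit(parent, rng):
    """Mendelian transmission: pass on one of the parent's two alleles at random."""
    return parent[rng.randrange(2)]


def gene_drop_nuclear(maf, n_offspring=5, rng=None):
    """Simulate one nuclear family; return risk-allele counts (0, 1 or 2).

    The first two entries are the parents, followed by the offspring.
    """
    rng = rng or random.Random()
    father = founder_genotype(maf, rng)
    mother = founder_genotype(maf, rng)
    offspring = [
        (transmit(father, rng), transmit(mother, rng))
        for _ in range(n_offspring)
    ]
    return [sum(genotype) for genotype in [father, mother] + offspring]


# One family under a MAF of 10% (Model 3 in Table 1), with a fixed seed.
counts = gene_drop_nuclear(maf=0.10, rng=random.Random(1))
print(counts)
```

Null data in the sense used here would be obtained by assigning case status from one such drop and then replacing the genotypes with those from a second, independent drop on the same family.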
Pedigrees were either ascertained randomly or selectively. For random ascertainment, families were selected for analysis without regard to the number of cases in a family. This means that familial controls could come from pedigrees without cases and cases could come from pedigrees with no familial controls. For selective ascertainment, families were chosen based on a minimum number of cases which was chosen to represent an enrichment of cases in the pedigree (p-value~0.10). This resulted in a minimum of two cases for the sibships and nuclear families, four cases for the moderate pedigrees and 31 cases for the extended pedigrees. For both the randomly and selectively ascertained pedigrees, the pedigrees were generated until 1000 cases were obtained. Controls were chosen either by random selection from within the generated families (familial controls) or as independent individuals from singleton simulations (independent controls). For each design, 1000 cases and 1000 controls were used in the analyses. For example, for random ascertainment, we generated pedigrees, keeping each pedigree, until we had 1000 cases; hence, some pedigrees may not contain any cases. The 1000 familial controls were randomly sampled from within all generated pedigrees; hence, some controls may be selected from pedigrees without cases. For selective ascertainment, we generated pedigrees, keeping only those pedigrees that met the minimum number of cases required, until we had 1000 cases. The 1000 familial controls were randomly sampled from within these selected pedigrees.
To assess validity 5000 simulations of the null were used (1000 null simulations from each model). A trend test was performed for the VC, weighted and naïve methods, in addition to the MQLS test. For 5000 simulations, and considering a multiple testing correction for the 16 study design scenarios, the 95% confidence interval for a type 1 error of 0.05 is (0.041 to 0.060). Power was based on 1000 simulations for each of the 16 study design scenarios and genetic models.
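The text does not state how this interval was computed. As a rough check, a normal approximation to the binomial with a Bonferroni adjustment for the 16 scenarios gives bounds close to those quoted. The choice of approximation and the function name `type1_error_interval` are assumptions of this sketch, so a small difference in the last digit is to be expected.

```python
import math
from statistics import NormalDist


def type1_error_interval(alpha=0.05, n_sims=5000, n_scenarios=16, level=0.95):
    """Approximate interval for an observed type 1 error rate under the null.

    Uses a normal approximation to the binomial and splits the overall
    two-sided error (1 - level) equally across the scenarios (Bonferroni).
    """
    tail = (1.0 - level) / n_scenarios / 2.0
    z = NormalDist().inv_cdf(1.0 - tail)
    standard_error = math.sqrt(alpha * (1.0 - alpha) / n_sims)
    return alpha - z * standard_error, alpha + z * standard_error


lower, upper = type1_error_interval()
print(round(lower, 4), round(upper, 4))
```

Under these assumptions the bounds come out at roughly 0.041 and 0.059, in line with the quoted interval; an exact binomial calculation would shift them slightly.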
RESULTS
The type 1 error rate results are shown in Table 2. As expected, both the VC and MQLS methods were valid for all the scenarios considered. In line with the current supposition, the naïve method was found to be trending towards or significantly anti-conservative (inflated type 1 error) if independent controls were used. This was true whether the pedigrees were randomly or selectively ascertained. However, contrary to the common wisdom, the naïve method was found to be valid in designs that used randomly ascertained pedigrees and familial controls, and could be significantly conservative (reduced type 1 error rate) with familial controls if pedigrees were selectively ascertained. For both ascertainment criteria, the weighted association method was generally valid for independent controls, but was conservative when familial controls were used. To check that the pattern observed for our weighting method was representative of the Browning weighting method, we repeated the investigation using Browning's algorithm. As would be expected, an extremely similar pattern was observed when using the Browning weighting method with prior kinship coefficients (see supplementary table 1).
Table 2.
Ascertainment | Control | Method | Sibship | Nuclear | Moderate Pedigree | Extended Pedigree |
---|---|---|---|---|---|---|
Random | Familial | VC | 0.052 | 0.048 | 0.049 | 0.049 |
Random | Familial | Naïve | 0.053 | 0.049 | 0.048 | 0.050 |
Random | Familial | Weight | 0.035 | 0.024 | 0.031 | 0.028 |
Random | Familial | MQLS | 0.052 | 0.050 | 0.048 | 0.052 |
Random | Independent | VC | 0.051 | 0.045 | 0.045 | 0.049 |
Random | Independent | Naïve | 0.066 | 0.066 | 0.059 | 0.078 |
Random | Independent | Weight | 0.053 | 0.051 | 0.048 | 0.060 |
Random | Independent | MQLS | 0.053 | 0.040 | 0.046 | 0.045 |
Selective | Familial | VC | 0.058 | 0.052 | 0.051 | 0.055 |
Selective | Familial | Naïve | 0.016 | 0.024 | 0.039 | 0.054 |
Selective | Familial | Weight | 0.000 | 0.003 | 0.004 | 0.024 |
Selective | Familial | MQLS | 0.058 | 0.050 | 0.046 | 0.048 |
Selective | Independent | VC | 0.050 | 0.055 | 0.057 | 0.048 |
Selective | Independent | Naïve | 0.084 | 0.100 | 0.103 | 0.092 |
Selective | Independent | Weight | 0.057 | 0.059 | 0.062 | 0.064 |
Selective | Independent | MQLS | 0.051 | 0.048 | 0.047 | 0.050 |
Note: For 5000 simulations and considering a multiple testing correction to correct for the 16 scenarios, the 95% confidence interval for a type 1 error of 0.05 is (0.041 to 0.060). Bold values represent significantly inflated from a type 1 error of 0.05 (anti-conservative) and underlined values significantly reduced type 1 error of 0.05 (conservative). VC, variance correction.
Comparisons of power are shown in Table 3. Results are shown for familial versus independent controls, both ascertainment criteria (random, selective), and five genetic models. We only report the power for the VC and MQLS methods as they were valid for all scenarios and therefore power is meaningful. For the small to moderate selectively ascertained pedigrees, independent controls were consistently found to be significantly more powerful for all genetic models considered. This is consistent with what has been suggested previously. However, contrary to common wisdom, for randomly ascertained pedigrees of any size, familial controls consistently provided greater power than independent controls. Furthermore, for selectively ascertained extended pedigrees the power of the familial and independent controls was similar across the genetic models considered, with no consistent direction of one design out-performing the other.
Table 3.
Ascertainment | Model | Control | Sibship VC | Sibship MQLS | Nuclear VC | Nuclear MQLS | Moderate Pedigree VC | Moderate Pedigree MQLS | Extended Pedigree VC | Extended Pedigree MQLS |
---|---|---|---|---|---|---|---|---|---|---|
Random | MAF=1%; RR=1.80 | Familial | 0.685 | 0.731 | 0.668 | 0.717 | 0.700 | 0.729 | 0.674 | 0.551 |
Random | MAF=1%; RR=1.80 | Independent | 0.641 | 0.667 | 0.623 | 0.665 | 0.616 | 0.650 | 0.537 | 0.509 |
Random | MAF=5%; RR=1.40 | Familial | 0.784 | 0.837 | 0.805 | 0.839 | 0.813 | 0.836 | 0.792 | 0.693 |
Random | MAF=5%; RR=1.40 | Independent | 0.748 | 0.769 | 0.777 | 0.757 | 0.771 | 0.785 | 0.723 | 0.681 |
Random | MAF=10%; RR=1.30 | Familial | 0.845 | 0.872 | 0.825 | 0.856 | 0.848 | 0.858 | 0.851 | 0.746 |
Random | MAF=10%; RR=1.30 | Independent | 0.807 | 0.822 | 0.762 | 0.783 | 0.805 | 0.806 | 0.752 | 0.708 |
Random | MAF=20%; RR=1.20 | Familial | 0.757 | 0.804 | 0.756 | 0.790 | 0.763 | 0.773 | 0.757 | 0.660 |
Random | MAF=20%; RR=1.20 | Independent | 0.694 | 0.704 | 0.686 | 0.699 | 0.723 | 0.728 | 0.646 | 0.601 |
Random | MAF=40%; RR=1.18 | Familial | 0.834 | 0.861 | 0.834 | 0.873 | 0.814 | 0.819 | 0.818 | 0.721 |
Random | MAF=40%; RR=1.18 | Independent | 0.777 | 0.795 | 0.766 | 0.774 | 0.772 | 0.777 | 0.716 | 0.670 |
Selective | MAF=1%; RR=1.80 | Familial | 0.495 | 0.543 | 0.510 | 0.634 | 0.645 | 0.781 | 0.710 | 0.706 |
Selective | MAF=1%; RR=1.80 | Independent | 0.884 | 0.897 | 0.844 | 0.861 | 0.891 | 0.903 | 0.691 | 0.711 |
Selective | MAF=5%; RR=1.40 | Familial | 0.564 | 0.614 | 0.557 | 0.668 | 0.747 | 0.828 | 0.805 | 0.797 |
Selective | MAF=5%; RR=1.40 | Independent | 0.940 | 0.947 | 0.911 | 0.917 | 0.925 | 0.925 | 0.806 | 0.806 |
Selective | MAF=10%; RR=1.30 | Familial | 0.563 | 0.593 | 0.565 | 0.700 | 0.745 | 0.850 | 0.846 | 0.827 |
Selective | MAF=10%; RR=1.30 | Independent | 0.960 | 0.961 | 0.922 | 0.926 | 0.941 | 0.942 | 0.838 | 0.826 |
Selective | MAF=20%; RR=1.20 | Familial | 0.473 | 0.516 | 0.469 | 0.585 | 0.624 | 0.744 | 0.731 | 0.707 |
Selective | MAF=20%; RR=1.20 | Independent | 0.906 | 0.914 | 0.911 | 0.873 | 0.876 | 0.883 | 0.736 | 0.719 |
Selective | MAF=40%; RR=1.18 | Familial | 0.531 | 0.557 | 0.536 | 0.665 | 0.684 | 0.783 | 0.831 | 0.787 |
Selective | MAF=40%; RR=1.18 | Independent | 0.927 | 0.926 | 0.907 | 0.911 | 0.896 | 0.903 | 0.792 | 0.774 |
Note: VC, variance correction; MAF, minor allele frequency (for disease); RR, relative risk (of disease)
DISCUSSION
The general findings with regard to validity are as follows. First, the VC and MQLS methods were found to be valid for all scenarios. Second, methods using weighting to account for relatedness were only valid with independent controls, and otherwise were conservative (sometimes extremely so). Third, the naïve method was anti-conservative if independent controls were used, but was valid for randomly ascertained pedigrees with familial controls, and was conservative for selectively ascertained pedigrees with familial controls.
Our validity results for the VC and MQLS methods confirm prior studies [Bourgain, et al. 2003; Slager and Schaid 2001; Thornton and McPeek 2007; Tian, et al. 2007] and extend evidence of the validity to more scenarios, such as extended pedigrees with independent controls, sibships with familial controls and randomly ascertained pedigrees. The fact that these methods were valid for all pedigree structures and selection criteria makes them very appealing analysis choices with family-based samples. Furthermore, both are computationally feasible to be considered for use in a genomewide scan. It has previously been suggested that the MQLS is more powerful than the VC method [Thornton and McPeek 2007]. Our findings support this for small to moderately sized pedigrees. However, for the extended pedigrees we found the VC method to be slightly more powerful. We note, however, that in these analyses we used only data for the cases and controls, and our data did not include individuals of unknown phenotype. MQLS is also able to incorporate ungenotyped relatives as well as distinguishing between unaffected and unknown phenotype controls. This makes MQLS more desirable in many real scenarios.
The weighting methods were valid only when independent controls were used. This is because these methods ignore the case/control status of the related individuals being weighted. Related individuals are always down-weighted. This is appropriate when there are no relationships between cases and controls. However, it is clear from the corrected variance equation presented previously that the familial relationships between cases and controls affect the variance in the opposite direction and need to be considered separately. Therefore, in scenarios with familial controls, the weighting method can be extremely conservative because the case-control relationships are being corrected for in the wrong direction. The use of only independent controls was suggested by Browning [Browning, et al. 2005], citing a reduction in power because familial controls may be non-penetrant carriers of the risk variant. Our results indicate that the reduction in power is likely due to the (sometimes extremely) conservative nature of the weighting procedure and not the choice of familial controls, per se. Previously it was suggested that Browning's weighting method was more powerful than the VC approach [Browning, et al. 2005]. We investigated this with our data and found the power of Browning's method to be slightly lower than the VC method when these were adjusted for type 1 error rate (data not shown). As can be seen in Table 2, we did find a statistically significant increase in the type 1 error rate for the larger selectively ascertained pedigrees with independent controls (which appeared to be more severe when using Browning's weighting method). An increased rate was similarly reported by Browning using an allelic test for a moderate sized pedigree [Browning, et al. 2005]. It is unclear whether this increase is due to the size or structure of the pedigree, the selective ascertainment of the pedigree or a combination of these, or whether it is simply random fluctuation. However, it has been reported that non-random sampling of pedigrees can bias results [Bucher and Schrott 1982] and researchers should be aware of this possible bias.
In the situations where the weighting method is valid, one advantage of the technique is that it can easily be extended to other statistical tests that allow the incorporation of weights. To examine this further, we used the same pedigrees and genotypes but simulated a quantitative trait. We found that the weighting method continued to be valid for all family structures and independent controls. This suggests that if independent controls are used, weighting methods maintain their validity for other statistics, too.
Perhaps most unexpected, and contrary to common wisdom, is the suggestion that the naïve analysis may be valid or even conservative if familial controls are used. Although perhaps counter-intuitive at first glance, inspection of the corrected variance equation shows theoretically that valid or conservative tests can result, depending on the balance of case-case, control-control and case-control pairs in the resource being studied. We found that for randomly ascertained family-based resources of any pedigree size the naïve analysis was valid with familial controls. We note that random ascertainment of pedigrees indicates that a resource was selected and studied without respect to the disease status of individuals in the pedigrees. This means that pedigrees in these resources may contain only cases, only controls, or both cases and controls. In our study the percentage of related cases and controls varied by the pedigree structures for the random sample, with 60% related in the sibships, 50% in nuclear families, 90% in the moderate pedigrees and 80% in the extended pedigrees. While the moderate and extended pedigrees had more related individuals, the average kinship coefficient for these individuals was 0.15 and 0.03, respectively, compared to 0.24 for the sibships and nuclear families. Our results indicate that this random combination of cases and controls balances the corrected variance, leading to validity with a simple naïve approach. There are a few community or population based cohorts that could be considered random pedigree ascertainment. These include the Framingham Heart Study [Cupples, et al. 2007] and the Hutterites [Ober, et al. 1983]. Our results indicate that for these resources, if all or a random subset of cases and controls are studied, use of the naïve approach could be valid. Use of the naïve approach has the advantage that it allows for easy and quick analyses, a characteristic necessary for genomewide investigations. Furthermore, it appears plausible that this validity would extend to other more sophisticated statistical analyses.
For power, we present three main findings which hold for both the VC and the MQLS method. First, that the use of independent controls is a significantly more powerful approach if cases reside in small to moderate sized selectively ascertained pedigrees. Second, that use of familial controls is a reasonable study design for resources of selectively ascertained extended pedigrees. In the highly selected pedigrees studied here (containing at least 31 cases) independent and familial controls were essentially equivalent. In further investigations of less highly selected extended pedigrees, familial controls substantially outperformed independent controls (data not shown). Third, that familial controls are more powerful for randomly ascertained pedigrees of any size.
As commonly assumed, we found that for selectively ascertained small to moderate pedigrees, independent controls were most powerful. This finding supports other studies using different methods that found that independent controls increased the power [Li, et al. 2006; Witte, et al. 1999]. This conforms to the argument that familial controls in disease-ascertained families may be non-penetrant carriers of the disease allele, leading to reduced power, such that the use of unrelated independent controls is superior. However, one concern with using unrelated, independent controls is the possibility of inappropriate matching, which can lead to invalidity (population stratification) and also loss of power.
Contrary to common wisdom, however, in our simulations we found that for all randomly ascertained pedigrees familial controls resulted in increased power, and that for selectively ascertained extended pedigrees familial controls were also a reasonable design choice. The increase in power for the randomly ascertained pedigrees was relatively minor, but consistent. The intuition to avoid familial controls in the context of selectively ascertained pedigrees is due to the enrichment for disease alleles in the selected pedigrees and therefore the potential for an enrichment of non-penetrant familial controls, hence decreasing power. In randomly ascertained pedigrees, however, the enrichment for disease alleles does not exist and therefore neither does the enrichment for non-penetrant control carriers; hence, there is no detrimental effect due to that issue. Furthermore, the superior familial matching of the cases and controls may produce beneficial effects and explain the improved power. The similar, or superior, power for familial controls in selectively ascertained extended pedigrees may be largely due to the fact that controls are, on average, less related to the cases than in smaller pedigrees. This may allow the beneficial aspects of matching to play a greater role than the detrimental enrichment of non-penetrant controls. Hence, familial controls may be a very reasonable option for selectively ascertained extended pedigrees.
It is important to bear in mind that our findings are limited to the pedigree structures and disease models considered and therefore generalization should be made with caution. Nonetheless, our findings provide important insight for association study designs performed in family-based resources. For example, in randomly ascertained family-based resources of any pedigree size a naïve analysis may be valid and familial controls preferable. Furthermore, for a resource of selectively ascertained, large, extended pedigrees, familial controls may be more powerful and a naïve method appropriate. These observations shed new light on possible optimal study designs for resources such as the Framingham Heart Study and extended high-risk pedigree resources, respectively.
In conclusion, we have presented evidence that strongly suggests that the situation for validity and power of family-based sampling designs for association testing is not as simple as previously assumed. Our results indicate support both for and against the current common wisdom. The findings presented here may help researchers select valid analysis approaches and design more powerful studies involving family-based data. In particular, for randomly ascertained resources and extended pedigree resources that are being considered for association analyses, the potential impact of these findings is substantial.
Acknowledgement
We would like to acknowledge the National Cancer Institute grant number CA098364 and the Susan G. Komen Foundation (to NJC) for grant support of this research. Stacey Knight is supported by a National Library of Medicine fellowship T15 LM07124.
Appendix
The formula for the covariance of two related individuals, as outlined by Slager and Schaid [2001], is as follows:
Where p' is the vector of genotype probabilities ((1−p)², 2p(1−p), p²) for genotypes aa, aA, and AA, and p is the probability of the risk allele. P(gi,gj) is the matrix of joint genotype probabilities and is given by
Where P(gi) is a matrix with the genotype probabilities pi along the diagonal and 0 elsewhere, and πk are the IBD probabilities for sharing k alleles. I, T, and O are matrices [Li 1955], where I is the identity matrix, O has the genotype probabilities ((1−p)², 2p(1−p), p²) in each row, and T is given as follows.
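The displayed equations for the definitions above are not reproduced in this text. The following is a reconstruction from those definitions under the standard ITO formulation; it is an assumption rather than a copy of the original display. Here the bold vector denotes the column vector whose transpose is written p' above, and rows and columns are ordered aa, aA, AA.

```latex
\begin{align*}
\operatorname{Cov}(w_i, w_j) &= P(g_i, g_j) - \mathbf{p}\,\mathbf{p}' \\[4pt]
P(g_i, g_j) &= P(g_i)\,\bigl(\pi_2 I + \pi_1 T + \pi_0 O\bigr) \\[4pt]
T &= \begin{pmatrix}
  1-p & p & 0 \\
  \tfrac{1}{2}(1-p) & \tfrac{1}{2} & \tfrac{1}{2}p \\
  0 & 1-p & p
\end{pmatrix}
\end{align*}
```

Under this reading, siblings correspond to (π0, π1, π2) = (1/4, 1/2, 1/4) and parent-child pairs to (0, 1, 0).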
For siblings the Cov(wi,wj) is
This covariance matrix is multiplied by the score functions, X' = (0, 1, 2), to give the covariance component for the variance correction equation.
For non-sibling relationships (assuming no in-breeding) the covariance component is given as follows
Which for parent-child relationships also equals p(1−p).
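As a numerical check on the appendix, the sketch below builds the covariance matrix from the I, T, and O matrices and forms the quadratic form with the scores (0, 1, 2). It assumes the standard ITO construction with genotypes ordered aa, aA, AA; the function names `ito_covariance` and `score_covariance` are hypothetical and are not taken from the original software. Under these assumptions the score covariance reduces to (π1 + 2π2)p(1−p), which equals p(1−p) for both siblings and parent-child pairs.

```python
def ito_covariance(p, pi0, pi1, pi2):
    """Return the 3x3 matrix Cov(w_i, w_j) for genotypes ordered aa, aA, AA.

    p is the risk-allele frequency; pi0, pi1, pi2 are the probabilities
    that the two relatives share 0, 1, or 2 alleles IBD (no inbreeding).
    """
    q = 1.0 - p
    g = [q * q, 2.0 * p * q, p * p]  # marginal genotype probabilities
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # T: genotype of relative j given genotype of i when one allele is IBD
    t = [[q, p, 0.0], [q / 2.0, 0.5, p / 2.0], [0.0, q, p]]
    # O: every row holds the marginal genotype probabilities
    o = [g[:], g[:], g[:]]

    cov = [[0.0] * 3 for _ in range(3)]
    for a in range(3):
        for b in range(3):
            conditional = pi2 * identity[a][b] + pi1 * t[a][b] + pi0 * o[a][b]
            # joint probability P(g_i = a, g_j = b) minus product of marginals
            cov[a][b] = g[a] * conditional - g[a] * g[b]
    return cov


def score_covariance(p, pi0, pi1, pi2, scores=(0.0, 1.0, 2.0)):
    """Return X' Cov(w_i, w_j) X for the given genotype scores X."""
    cov = ito_covariance(p, pi0, pi1, pi2)
    return sum(scores[a] * cov[a][b] * scores[b]
               for a in range(3) for b in range(3))


if __name__ == "__main__":
    p = 0.2
    print("p(1-p):      ", p * (1.0 - p))
    print("siblings:    ", score_covariance(p, 0.25, 0.5, 0.25))
    print("parent-child:", score_covariance(p, 0.0, 1.0, 0.0))
    print("first cousin:", score_covariance(p, 0.75, 0.25, 0.0))
```

The sibling and parent-child values should agree with p(1−p) up to floating-point rounding, and the first-cousin value should be one quarter of it.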
REFERENCES
- Allen-Brady K, Wong J, Camp NJ. PedGenie: an analysis approach for genetic association testing in extended pedigrees and genealogies of arbitrary size. BMC Bioinformatics. 2006;7:209. doi: 10.1186/1471-2105-7-209.
- Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober C, McPeek MS. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet. 2003;73(3):612–26. doi: 10.1086/378208.
- Browning SR, Briley JD, Briley LP, Chandra G, Charnecki JH, Ehm MG, Johansson KA, Jones BJ, Karter AJ, Yarnall DP, et al. Case-control single-marker and haplotypic association analysis of pedigree data. Genet Epidemiol. 2005;28(2):110–22. doi: 10.1002/gepi.20051.
- Bucher K, Schrott H. The effects of nonrandom sampling of familial correlation. Biometrics. 1982;38:249–253.
- Cupples LA, Arruda HT, Benjamin EJ, D'Agostino RB Sr, Demissie S, DeStefano AL, Dupuis J, Falls KM, Fox CS, Gottlieb DJ, et al. The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Med Genet. 2007;8(Suppl 1):S1. doi: 10.1186/1471-2350-8-S1-S1.
- Fingerlin TE, Boehnke M, Abecasis GR. Increasing the power and efficiency of disease-marker case-control association studies through use of allele-sharing information. Am J Hum Genet. 2004;74(3):432–43. doi: 10.1086/381652.
- Knight S, Abo RP, Wong J, Thomas A, Camp NJ. Pedigree Association: Assigning individual weights to pedigree-members for genetic association analysis. BMC Proceedings. 2009;3(Suppl 7):S121. doi: 10.1186/1753-6561-3-s7-s121.
- Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet. 1996;58(6):1347–63.
- Li C, editor. Population genetics. University of Chicago Press; Chicago: 1955.
- Li M, Boehnke M, Abecasis GR. Efficient study designs for test of genetic association using sibship data and unrelated cases and controls. Am J Hum Genet. 2006;78(5):778–92. doi: 10.1086/503711.
- McArdle PF, O'Connell JR, Pollin TI, Baumgarten M, Shuldiner AR, Peyser PA, Mitchell BD. Accounting for relatedness in family based genetic association studies. Hum Hered. 2007;64(4):234–42. doi: 10.1159/000103861.
- Moore RM, Pinel T, Zhao JH, March R, Jawaid A. Selecting cases from nuclear families for case-control association analysis. BMC Genet. 2005;6(Suppl 1):S105. doi: 10.1186/1471-2156-6-S1-S105.
- Newman DL, Abney M, McPeek MS, Ober C, Cox NJ. The importance of genealogy in determining genetic associations with complex traits. Am J Hum Genet. 2001;69(5):1146–8. doi: 10.1086/323659.
- Ober CL, Martin AO, Simpson JL, Hauck WW, Amos DB, Kostyu DD, Fotino M, Allen FH Jr. Shared HLA antigens and reproductive performance among Hutterites. Am J Hum Genet. 1983;35(5):994–1004.
- Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000;50(4):211–23. doi: 10.1159/000022918.
- Slager SL, Schaid DJ. Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet. 2001;68(6):1457–62. doi: 10.1086/320608.
- Slager SL, Schaid DJ, Wang L, Thibodeau SN. Candidate-gene association studies with pedigree data: controlling for environmental covariates. Genet Epidemiol. 2003;24(4):273–83. doi: 10.1002/gepi.10228.
- Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52(3):506–16.
- Thornton T, McPeek MS. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am J Hum Genet. 2007;81(2):321–37. doi: 10.1086/519497.
- Tian X, Joo J, Wu CO, Lin JP. Comparing strategies for evaluation of candidate genes in case-control studies using family data. BMC Proc. 2007;1(Suppl 1):S31. doi: 10.1186/1753-6561-1-s1-s31.
- Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol. 1999;149(8):693–705. doi: 10.1093/oxfordjournals.aje.a009877.