BLUP Genotype Imputation for Case-Control Association Testing with Related Individuals and Missing Data

Mary Sara McPeek

doi:10.1089/cmb.2012.0024

2012 Jun;19(6):756–765. doi: 10.1089/cmb.2012.0024

PMCID: PMC3375641 PMID: 22697245

Abstract

We consider the problem of case-control association testing in samples that contain related individuals, where we assume the pedigree structure is known. Typically, for each marker tested, some individuals will have missing genotype data. The M_QLS method has been proposed for association testing in this situation. We show that the M_QLS method is equivalent to an approach in which missing genotypes are imputed using the best linear unbiased predictor (BLUP) based on relatives' genotype data. Viewed this way, the M_QLS exactly corrects for the imputation error and for the extra correlation due to imputation. We also investigate the amount of additional power for detecting association that is provided by this BLUP imputation approach.

Key words: GWAS, pedigrees, quasi-likelihood, score test

1. Introduction

Standard methods for case-control association testing typically assume that sampled individuals are unrelated. However, many ongoing genetic studies involve families, so genetic association analysis of such studies requires some way of dealing with related individuals. The simplest way is to select a subsample of unrelated individuals, but that clearly results in a loss of information. Family-based tests (Rabinowitz and Laird, 2000) can be used in some contexts. Such approaches have the benefit that they protect against population structure, but they can also be underpowered relative to case-control association tests in some circumstances (Risch and Teng, 1998; Thornton and McPeek, 2010). A third approach is to perform a standard χ² test with a post-hoc correction factor applied. Slager and Schaid (2001) and Bourgain et al. (2003) have derived explicit correction factors for χ² tests of association performed in samples of related individuals. Another way to obtain such a correction factor in the context of a genome screen would be to use the genomic control method of Devlin and Roeder (1999), which works well if the rate of missing genotypes is similar at all markers. A fourth approach is to try to make more efficient use of the available information to obtain a more powerful test than the one that results from correcting the standard χ² test. This approach has been taken by Bourgain et al. (2003) and by Thornton and McPeek (2007), who showed that there can be a nonnegligible power gain for their M_QLS method over the corrected χ².

One purpose of this article is to better understand the nature of this power gain. One way in which the M_QLS improves on the corrected χ² is that it uses a more efficient estimator of allele frequency, which is a nuisance parameter in the analysis. Both the M_QLS and corrected χ² could also make use of unphenotyped controls who are not related to phenotyped individuals, but who are genotyped, to improve the nuisance parameter estimation. Additionally, the M_QLS can make use of genotyped individuals who are unphenotyped but who have phenotyped relatives in the sample as well as phenotyped individuals who have missing genotypes but who have genotyped relatives in the sample. It is this last aspect of power gain, the gain from using information on related individuals with missing genotype or phenotype, that we study in this article. We derive a connection between the M_QLS and imputation of missing genotypes based on genotypes of relatives, and we consider the question of how much power can be gained by squeezing this information from the data.

2. Methods

2.1. Best linear unbiased prediction of genotypes

First, we ignore phenotype and focus on the problem of predicting missing genotypes based on the observed genotypes of relatives. Suppose our study consists of n + m individuals, where we assume, without loss of generality, that the first n of the n + m individuals have non-missing genotype data at the marker under consideration, while the last m have missing genotype data at the marker. The n + m individuals can be arbitrarily related, with the pedigree(s) that specify the relationships assumed to be known. The method we describe is feasible even for large complex pedigrees with inbreeding loops. Unrelated individuals can also be included in the sample. If their genotypes are known, they will contribute to the prediction of missing genotypes in others by contributing to the allele frequency estimate. However, if individuals with missing genotypes are unrelated to others in the sample, then their genotypes will not be predicted by our method.

For the moment, we assume that the marker under consideration is an autosomal binary marker (e.g., a SNP) with alleles labeled “0” and “1.” We extend to multiple alleles in the next subsection. Let $G = (G_1, \ldots, G_{n+m})^T$ be the vector of true genotypes at the marker under consideration, where G_i = 0, 0.5, or 1, according to whether individual i has, respectively, 0, 1, or 2 copies of allele 1 at the marker. We assume that the first n entries of G are observed, while the last m entries are unobserved and are to be predicted. We use the notation $G = (G_N^T, G_M^T)^T$ to denote the partition of the vector G into the n-vector G_N of observed genotypes and the m-vector G_M of unobserved genotypes.

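Let p represent the frequency of allele 1 in the population from which the pedigree founders are assumed to be drawn, where 0 < p < 1. Then, we have $E(G) = p1_{n+m}$, where $1_{n+m}$ is a column vector of length n + m with every entry equal to 1. If we further assume Hardy-Weinberg equilibrium (HWE) in the population from which the pedigree founders are drawn, then Var(G) = σ²Φ, where $\sigma^2 = p(1-p)/2$ and Φ is the kinship matrix given by

$$\Phi = \begin{pmatrix} 1+h_1 & 2\phi_{1,2} & \cdots & 2\phi_{1,n+m} \\ 2\phi_{2,1} & 1+h_2 & \cdots & 2\phi_{2,n+m} \\ \vdots & \vdots & \ddots & \vdots \\ 2\phi_{n+m,1} & 2\phi_{n+m,2} & \cdots & 1+h_{n+m} \end{pmatrix}, \tag{1}$$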

where φ_i,j is the kinship coefficient between individuals i and j, and h_i is the inbreeding coefficient of individual i. To make the approach more robust to deviations from HWE, we can remove the assumption that $\sigma^2 = p(1-p)/2$ and simply assume Var(G) = σ²Φ, where σ² is an unknown parameter that we estimate. Corresponding to the partition of G into G_N and G_M, we have the following partition of the Φ matrix:

$$\Phi = \begin{pmatrix} \Phi_N & \Phi_{N,M} \\ \Phi_{N,M}^T & \Phi_M \end{pmatrix}, \tag{2}$$

where Φ_N is the n × n kinship matrix for the individuals with observed genotypes at the marker, $\Phi_{N,M}$ is the n × m matrix of kinship coefficients for pairs of individuals in which one individual has observed genotype and the other has unobserved genotype, and Φ_M is the m × m kinship matrix for the individuals with unobserved genotypes at the marker. Provided that the set of individuals with observed genotypes does not contain both members of any monozygotic (MZ) twin pair, the matrix Φ_N is invertible (Thornton and McPeek, 2007).
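As an illustration of how a matrix of the form in equation (1) might be assembled, the sketch below uses the standard recursive (tabular) computation of kinship coefficients and then doubles them, so that the diagonal is 1 + h_i and the off-diagonal entries are 2φ_i,j. The function name `kinship_matrix` and the pedigree encoding (a list of (father, mother) index pairs, with `(None, None)` for founders and parents listed before their offspring) are assumptions made for this example only; founders are assumed outbred and mutually unrelated.

```python
def kinship_matrix(parents):
    """Return Phi with Phi[i][i] = 1 + h_i and Phi[i][j] = 2 * phi_ij.

    parents[i] is (father, mother) as indices into the list, or (None, None)
    for a founder. Individuals must be ordered so parents precede offspring.
    """
    n = len(parents)
    phi = [[0.0] * n for _ in range(n)]  # kinship coefficients phi_ij
    for i, (f, m) in enumerate(parents):
        if f is None:
            # Founder: outbred and unrelated to everyone listed earlier.
            phi[i][i] = 0.5
            continue
        for j in range(i):
            # Kinship with an earlier individual is the average over i's parents.
            phi[i][j] = phi[j][i] = 0.5 * (phi[f][j] + phi[m][j])
        # Self-kinship is (1 + h_i) / 2, with h_i the kinship of the parents.
        phi[i][i] = 0.5 * (1.0 + phi[f][m])
    return [[2.0 * x for x in row] for row in phi]


# Two founders (0, 1), their two children (2, 3), and an inbred child (4)
# of the sib mating 2 x 3, which creates an inbreeding loop.
Phi = kinship_matrix([(None, None), (None, None), (0, 1), (0, 1), (2, 3)])
```

In this toy pedigree the full sibs have Φ entry 1/2, and the inbred child has diagonal entry 1 + h = 1.25. Submatrices Φ_N, Φ_N,M and Φ_M are then obtained by selecting rows and columns for the genotyped and ungenotyped individuals at a given marker.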

Under these conditions, the best linear unbiased estimator (BLUE) of p is given by $\hat{p} = (1^T\Phi_N^{-1}1)^{-1}1^T\Phi_N^{-1}G_N$, and the variance of this estimator is $\mathrm{Var}(\hat{p}) = \sigma^2(1^T\Phi_N^{-1}1)^{-1}$. McPeek et al. (2004) show that this is the BLUE based on the allele indicators for the set of genotyped individuals in the situation when parental origin of allele is not known.
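The BLUE of p is a generalized least squares mean, (1^T Φ_N⁻¹ 1)⁻¹ 1^T Φ_N⁻¹ G_N, with variance σ²(1^T Φ_N⁻¹ 1)⁻¹. A minimal NumPy sketch follows; the function name `blue_allele_frequency` and the toy data are hypothetical, chosen only to show how relatedness changes the weights.

```python
import numpy as np


def blue_allele_frequency(Phi_N, G_N):
    """Return (p_hat, c), where Var(p_hat) = sigma^2 * c."""
    ones = np.ones(len(G_N))
    w = np.linalg.solve(Phi_N, ones)   # Phi_N^{-1} 1 (Phi_N is symmetric)
    denom = ones @ w                   # 1^T Phi_N^{-1} 1
    p_hat = (w @ G_N) / denom          # 1^T Phi_N^{-1} G_N / denom
    return p_hat, 1.0 / denom


# Toy data: two genotyped full sibs plus one unrelated individual, all outbred.
Phi_N = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
G_N = np.array([1.0, 0.5, 0.0])        # 2, 1 and 0 copies of allele 1
p_hat, var_factor = blue_allele_frequency(Phi_N, G_N)
```

Here each sib receives weight 2/7 and the unrelated individual 3/7, so correlated relatives are down-weighted relative to a simple average.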

Now we propose to predict G_M by finding its best linear unbiased predictor (BLUP) based on G_N. That is, among all fixed m × n real matrices R, we find the one that minimizes

$$E\{[c^T(G_M - RG_N)]^2\} \tag{3}$$

for every fixed m-vector c, subject to

$$E(G_M - RG_N) = 0. \tag{4}$$

Equivalently, we find the R satisfying condition (4) such that $\mathrm{Var}(G_M - \tilde{R}G_N) - \mathrm{Var}(G_M - RG_N)$ is a positive semidefinite matrix for all $\tilde{R}$ satisfying condition (4). In our case, condition (4) reduces to R1 = 1_m, where we economize on subscripts by letting 1 always denote a vector of length n with every entry equal to 1, while we let 1_m denote a vector of length m with every element equal to 1. In the Appendix, we show that the resulting R is given by

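$$R = \Phi_{N,M}^T\Phi_N^{-1} + (1_m - \Phi_{N,M}^T\Phi_N^{-1}1)(1^T\Phi_N^{-1}1)^{-1}1^T\Phi_N^{-1}, \tag{5}$$

so that the BLUP of G_M is given by

$$\hat{G}_M = RG_N = \hat{p}1_m + \Phi_{N,M}^T\Phi_N^{-1}(G_N - \hat{p}1), \tag{6}$$

where $\hat{p}$ is the BLUE of p.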

The variance of the BLUP is

(7)

If p were known, only the first term of equation (7) would be present; the second term of the variance results from estimation of p.
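To make the BLUP concrete, the sketch below builds a predictor matrix of the form R = Φ_N,M^T Φ_N⁻¹ + (1_m − Φ_N,M^T Φ_N⁻¹ 1)(1^T Φ_N⁻¹ 1)⁻¹ 1^T Φ_N⁻¹, which satisfies the unbiasedness constraint R1 = 1_m, and applies it to observed genotypes. The function name `blup_genotypes` and the toy pedigree are hypothetical. The last line computes R Φ_N R^T, which is the covariance matrix of the predictor RG_N in units of σ² (this follows directly from Var(G_N) = σ²Φ_N); it is offered as a numerical aid, not as a transcription of equation (7).

```python
import numpy as np


def blup_genotypes(Phi_N, Phi_NM, G_N):
    """Return (R, G_M_hat); Phi_NM is the n x m cross-kinship block."""
    n, m = Phi_NM.shape
    ones_n, ones_m = np.ones(n), np.ones(m)
    A = np.linalg.solve(Phi_N, Phi_NM).T   # Phi_NM^T Phi_N^{-1}, shape (m, n)
    w = np.linalg.solve(Phi_N, ones_n)     # Phi_N^{-1} 1
    blue_weights = w / (ones_n @ w)        # p_hat = blue_weights @ G_N
    R = A + np.outer(ones_m - A @ ones_n, blue_weights)
    return R, R @ G_N


# Toy data: two genotyped full sibs, one genotyped unrelated individual,
# and one ungenotyped third sib of the first two.
Phi_N = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
Phi_NM = np.array([[0.5], [0.5], [0.0]])
G_N = np.array([1.0, 0.5, 0.0])

R, G_M_hat = blup_genotypes(Phi_N, Phi_NM, G_N)
cov_factor = R @ Phi_N @ R.T   # Var(R G_N) / sigma^2
```

In this example the rows of R sum to 1, and the predicted genotype of the missing sib is pulled from the allele frequency estimate toward the genotypes of the two typed sibs.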

The above development holds for the case when the set of individuals with observed genotypes does not contain both members of any MZ twin pair. If the set of genotyped individuals contains both members of one or more MZ twin pairs, then Φ_N is not invertible, but the above results hold with $\Phi_N^{-1}$ replaced by the Moore-Penrose generalized inverse of Φ_N. Provided that the two members of any MZ twin pair have identical genotypes, use of the generalized inverse in place of $\Phi_N^{-1}$ in the above formulas is mathematically equivalent to setting the genotype of one of the two members of each MZ twin pair to be missing, in order to obtain invertible Φ_N.

Advantages of the BLUP, $\hat{G}_M$, in equation (6), as a predictor of G_M are that (1) it is extremely fast to calculate, even in large complex pedigrees with inbreeding loops, making it feasible to use in studies with large numbers of people and markers; and (2) its variance-covariance matrix, given in equation (7), is also very easy to calculate, making it relatively easy to incorporate the prediction uncertainty and correlation between predicted values into a genetic analysis. Note that for different markers, the sets N and M of individuals with, respectively, non-missing and missing genotype, will change, and so the BLUE weight vector and the matrix R will differ from marker to marker.

2.2. BLUP for multi-allelic markers

Now suppose the marker under consideration is observed to have a distinct alleles. Let $p = (p_1, \ldots, p_{a-1})^T$ be the vector of allele frequencies for alleles 1 through a − 1. (Note that $p_a = 1 - \sum_{j=1}^{a-1} p_j$, so p_a is redundant and can be dropped from p.) For this subsection only, we redefine $G = (G^{(1)T}, \ldots, G^{(a-1)T})^T$, where $G^{(j)} = (G_1^{(j)}, \ldots, G_{n+m}^{(j)})^T$, with $G_i^{(j)}$ equal to 0, 1/2, or 1, according to whether individual i has 0, 1, or 2 copies of the jth allele, 1 ≤ j ≤ a − 1. Thus, the vector G now has length (n + m)(a − 1). As before, the first n entries of each G^(j) are observed, while the last m entries are to be predicted. We write $G^{(j)} = (G_N^{(j)T}, G_M^{(j)T})^T$ to denote the partition of the vector G^(j) into the n-vector $G_N^{(j)}$ of observed genotypes and the m-vector $G_M^{(j)}$ of unobserved genotypes.

In the multi-allelic case, we have $E(G) = (I_{a-1} \otimes 1_{n+m})p$, with ⊗ denoting Kronecker product and $I_{a-1}$ denoting the identity matrix of dimension a − 1. Under HWE, we also have Var(G) = F ⊗ Φ, where the (a − 1) × (a − 1) matrix F has (i,j)th entry equal to $p_i(1-p_i)/2$ if i = j and $-p_ip_j/2$ if i ≠ j. As shown by McPeek et al. (2004), the BLUE of p is given by $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_{a-1})^T$, where $\hat{p}_j = (1^T\Phi_N^{-1}1)^{-1}1^T\Phi_N^{-1}G_N^{(j)}$, which is the BLUE for the frequency of allele j that would be calculated if the marker were treated as biallelic with one allele being j and all the other alleles being collapsed into a single “not-j” allele. Furthermore, $\mathrm{Var}(\hat{p}) = (1^T\Phi_N^{-1}1)^{-1}F$.

The BLUP of G_M based on G_N is found to be $\hat{G}_M = (\hat{G}_M^{(1)T}, \ldots, \hat{G}_M^{(a-1)T})^T$, where $\hat{G}_M^{(j)}$ is the BLUP of $G_M^{(j)}$ that would be calculated if the marker were treated as biallelic with one allele being j and all the other alleles being collapsed into a single “not-j” allele. We have

(8)

Equation (8) is similar to equation (7), with the scalar multiplier σ² replaced by the Kronecker multiplier F ⊗.

An important feature of the multiallelic BLUE and BLUP calculations is that they can be performed by, for example, calculating the BLUE weight vector and the matrix R only once for a given multiallelic marker and then taking the inner products of these with a − 1 different vectors.
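The reuse described here can be sketched as follows: the BLUE weights and the predictor matrix R depend only on the kinship blocks and on which individuals are genotyped, so they are computed once and then applied to all a − 1 allele-dosage columns at the same time. The helper name `blue_blup_operators` and the three-allele toy data are assumptions for illustration.

```python
import numpy as np


def blue_blup_operators(Phi_N, Phi_NM):
    """Return (blue_weights, R), both independent of the genotype values."""
    n, m = Phi_NM.shape
    w = np.linalg.solve(Phi_N, np.ones(n))      # Phi_N^{-1} 1
    blue_weights = w / w.sum()                  # weights defining the BLUE
    A = np.linalg.solve(Phi_N, Phi_NM).T        # Phi_NM^T Phi_N^{-1}
    R = A + np.outer(np.ones(m) - A.sum(axis=1), blue_weights)
    return blue_weights, R


# Two genotyped full sibs, one genotyped unrelated individual, and one
# ungenotyped third sib. The marker has a = 3 alleles, so each genotyped
# individual contributes a - 1 = 2 half-dosage columns (alleles 1 and 2).
Phi_N = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
Phi_NM = np.array([[0.5], [0.5], [0.0]])
G_N_cols = np.array([[0.5, 0.5],    # genotype 1/2
                     [1.0, 0.0],    # genotype 1/1
                     [0.0, 0.5]])   # genotype 2/3

blue_weights, R = blue_blup_operators(Phi_N, Phi_NM)
p_hat = blue_weights @ G_N_cols     # BLUEs for alleles 1..a-1, shape (2,)
G_M_hat = R @ G_N_cols              # BLUPs, shape (m, a-1)
```

Each column of the result equals what the biallelic calculation would give with that allele against all others collapsed, at the cost of one linear solve per marker.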

2.3. Overview of the M_QLS method for case-control association testing with related individuals

Now we suppose that, in addition to having genotype data, we also have case-control phenotype data, where we allow some phenotype values to be unknown. Let D denote the phenotype data on the n + m individuals, with each individual categorized as “affected,” “unaffected,” or “unknown phenotype.” Here, the designation “unknown phenotype” could be used to refer to, for example, an unphenotyped individual taken from a generic control panel. Alternatively, it could refer to an individual whose phenotype has not yet become apparent (e.g., for an age-related trait). For 1 ≤ i ≤ n + m, we define the ith component of D to be D_i = 1 if i is affected, D_i = 0 if i is unaffected, and D_i = k if i is of unknown phenotype, where k is an externally obtained estimate of the population prevalence of the trait. We write $D = (D_N^T, D_M^T)^T$ to denote the partition of D into the vectors D_N and D_M, corresponding to individuals with non-missing and missing genotype, respectively, at the marker being tested. For simplicity, we describe the M_QLS method for the case when the marker being tested is biallelic. The multiallelic case is given in Thornton and McPeek (2007).

We analyze the data retrospectively, i.e., we condition on D and treat G_N as random in the analysis. The retrospective approach is appropriate, for example, with either random or phenotype-based ascertainment. Our null hypothesis is that there is no association and no linkage between the marker being tested and the trait. Under the null hypothesis, we assume that E₀(G_N|D) = p1 and Var₀(G_N|D) = σ²Φ_N. The test statistic for the M_QLS method is given by

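$$M_{QLS} = \frac{(V^TG_N)^2}{\hat{\sigma}^2\,V^T\Phi_NV}, \tag{9}$$

with

$$V = \left[I_n - (1^T\Phi_N^{-1}1)^{-1}\Phi_N^{-1}11^T\right]\left[D_N - k1 + \Phi_N^{-1}\Phi_{N,M}(D_M - k1_m)\right], \tag{10}$$

where I_n is the n × n identity matrix and $\hat{\sigma}^2$ is an estimate of σ². We typically use the estimator of σ² proposed by Thornton and McPeek (2010). Under the null hypothesis of no linkage and no association between the marker and the trait, the M_QLS statistic given in equation (9) is asymptotically $\chi^2_1$ distributed.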
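A hedged sketch of the test statistic: it takes the weight vector to be V = [I_n − (1^T Φ_N⁻¹ 1)⁻¹ Φ_N⁻¹ 1 1^T][D_N − k1 + Φ_N⁻¹ Φ_N,M (D_M − k1_m)] and the statistic to be (V^T G_N)² / (σ̂² V^T Φ_N V), which is our reading of the M_QLS of Thornton and McPeek (2007). The function name `mqls_statistic`, the toy data, and the HWE-based plug-in p̂(1 − p̂)/2 used for σ̂² in the example are assumptions; the estimator of σ² preferred in the text may differ.

```python
import math

import numpy as np


def mqls_statistic(Phi_N, Phi_NM, G_N, D_N, D_M, k, sigma2_hat):
    """Return (statistic, p_value) using a chi-square(1) reference."""
    ones = np.ones(len(G_N))
    # u = D_N - k 1 + Phi_N^{-1} Phi_NM (D_M - k 1_m)
    u = (D_N - k) + np.linalg.solve(Phi_N, Phi_NM @ (D_M - k))
    w = np.linalg.solve(Phi_N, ones)           # Phi_N^{-1} 1
    V = u - w * (ones @ u) / (ones @ w)        # weights satisfy V^T 1 = 0
    stat = (V @ G_N) ** 2 / (sigma2_hat * (V @ Phi_N @ V))
    # Upper tail of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data: two genotyped sib pairs and one ungenotyped, affected third sib
# of the first pair; k is an assumed prevalence.
Phi_N = np.array([[1.0, 0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.5],
                  [0.0, 0.0, 0.5, 1.0]])
Phi_NM = np.array([[0.5], [0.5], [0.0], [0.0]])
G_N = np.array([1.0, 0.5, 0.0, 0.5])
D_N = np.array([1.0, 1.0, 0.0, 0.0])
D_M = np.array([1.0])
k = 0.1

w = np.linalg.solve(Phi_N, np.ones(4))
p_hat = (w @ G_N) / w.sum()                    # BLUE of allele frequency
sigma2_hat = p_hat * (1.0 - p_hat) / 2.0       # HWE plug-in (assumption)
stat, p_value = mqls_statistic(Phi_N, Phi_NM, G_N, D_N, D_M, k, sigma2_hat)
```

With so few individuals the asymptotic reference distribution is not reliable; the example only shows the order of operations.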

2.4. Previous interpretations of the M_QLS method

There are several possible ways of understanding the M_QLS statistic of equation (9). The original development of the M_QLS came from the fact that it is the quasi-likelihood score test of the null hypothesis H₀ : γ = 0 in the retrospective model E(G_N|D) = p1 + γ[Φ_N(D_N − k1) + Φ_N,M(D_M − k1_m)], where γ represents the association parameter. For a genotyped individual i, this conditional expectation can be written

$$E(G_i \mid D) = p + \gamma\sum_{j=1}^{n+m} 2\phi_{i,j}(D_j - k), \tag{11}$$

where we let 2φ_i,i = 1 + h_i. In expression (11), the association parameter, γ, is multiplied by a weighted sum of centered phenotype values, where the weight of individual j's centered phenotype is twice the kinship coefficient of individual j with individual i. This results in an enrichment effect, i.e., individuals with multiple affected relatives are assumed to have a higher chance of carrying a causal allele than individuals without affected relatives. For outbred individuals, it can be shown (Thornton and McPeek, 2007) that this retrospective model holds, up to terms of order o(γ), assuming any prospective, two-allele, disease model, when the effect size γ is close to 0.

A later development (Wang and McPeek, 2009) shows that the M_QLS is closely connected to a retrospective likelihood score test based on a prospective model for the phenotypes given the genotypes, whose form is given in that article. In this model, P₀(D) is the null model for the joint distribution of phenotype values in the absence of association with the marker being tested, where arbitrary dependence among phenotypes of related individuals is allowed; r is the association parameter; and c(G,r) is simply a normalizing constant. It is actually not necessary to specify the form of P₀(D). Logistic regression can be obtained as a special case of this prospective model. If we take into account incomplete data in deriving a retrospective likelihood score test for r = 0 under this class of models, then it is asymptotically equivalent to the M_QLS test, where the difference between them arises from the fact that the M_QLS test estimates the nuisance parameter p by the best linear unbiased estimator (BLUE), while the retrospective likelihood score test uses the maximum likelihood estimator of p under the null hypothesis.

2.5. A new interpretation of the M_QLS method

We describe a novel interpretation of the M_QLS statistic, which is somewhat different in flavor from previous interpretations and, therefore, can be illuminating. We note that the expression V^TG_N in the numerator of the M_QLS test statistic of equation (9) can be rewritten as

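$$\begin{aligned} V^TG_N &= (D_N - k1)^T(G_N - \hat{p}1) + (D_M - k1_m)^T(\hat{G}_M - \hat{p}1_m) \\ &= \sum_{j \in L \cap N}(D_j - k)(G_j - \hat{p}) + \sum_{j \in L \cap M}(D_j - k)(\hat{G}_j - \hat{p}), \end{aligned} \tag{12}$$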

where L is the set of phenotyped individuals, $\hat{p}$ is the BLUE of allele frequency, and $\hat{G}_j$ is the BLUP of G_j given by equation (6). The first sum is taken over all individuals who are both phenotyped and genotyped, and it represents the inner product of their genotypic and phenotypic residuals, where phenotype is centered around the externally derived prevalence estimate, k, and genotype is centered around the BLUE of allele frequency. The second sum is over all phenotyped individuals who have missing genotype at the marker being tested, in which case we replace the missing genotype G_j by its BLUP under the null hypothesis, $\hat{G}_j$. Thus, we can interpret the M_QLS as involving a form of imputation of missing genotypes by their BLUPs based on genotyped relatives. The main advantage of this form of imputation is that, under the null hypothesis, the uncertainty in imputation and dependence in imputation across individuals is exactly taken into account in the variance that appears in the denominator of equation (9).
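The imputation interpretation can be checked numerically. The sketch below computes V^T G_N directly from the weight vector V = [I_n − (1^T Φ_N⁻¹ 1)⁻¹ Φ_N⁻¹ 1 1^T][D_N − k1 + Φ_N⁻¹ Φ_N,M (D_M − k1_m)], and separately as a sum of phenotype-residual times genotype-residual products in which each missing genotype is replaced by its BLUP, p̂ + Φ_N,M^T Φ_N⁻¹ (G_N − p̂1). The data are hypothetical; individuals with unknown phenotype are coded D = k and so contribute zero to the sums.

```python
import numpy as np

# Toy data: two genotyped sib pairs (individuals 0-3) and two ungenotyped
# individuals, one a third sib of each pair.
Phi_N = np.array([[1.0, 0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.5],
                  [0.0, 0.0, 0.5, 1.0]])
Phi_NM = np.array([[0.5, 0.0],
                   [0.5, 0.0],
                   [0.0, 0.5],
                   [0.0, 0.5]])
G_N = np.array([1.0, 0.5, 0.0, 0.5])
k = 0.1
D_N = np.array([1.0, 1.0, 0.0, k])   # last genotyped individual unphenotyped
D_M = np.array([1.0, k])             # second ungenotyped individual unphenotyped

ones = np.ones(len(G_N))
w = np.linalg.solve(Phi_N, ones)                 # Phi_N^{-1} 1
p_hat = (w @ G_N) / (w @ ones)                   # BLUE of allele frequency
G_M_hat = p_hat + Phi_NM.T @ np.linalg.solve(Phi_N, G_N - p_hat)  # BLUPs

# Left-hand side: V^T G_N with V built from the phenotype data.
u = (D_N - k) + np.linalg.solve(Phi_N, Phi_NM @ (D_M - k))
V = u - w * (ones @ u) / (ones @ w)
lhs = V @ G_N

# Right-hand side: residual products, with BLUPs standing in for the
# missing genotypes of phenotyped individuals.
rhs = (D_N - k) @ (G_N - p_hat) + (D_M - k) @ (G_M_hat - p_hat)
```

The two quantities agree to floating-point precision, which is the identity the section describes.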

2.6. How much power does BLUP imputation of missing genotypes add to the analysis?

To assess the effect, on the analysis, of BLUP imputation of missing genotypes, we perform analytical power calculations using a noncentral $\chi^2_1$ approximation to the alternative distribution of the M_QLS statistic of equation (9). We illustrate this computation in two examples. In each case, we obtain a noncentrality parameter λ, and then calculate power at level 10⁻³ as 1 − R_λ,₁(K), where R_λ,₁ is the cumulative distribution function of a noncentral $\chi^2_1$ with noncentrality parameter λ, and K ≈ 10.82757 is the upper 10⁻³ quantile of a $\chi^2_1$ distribution, i.e., 1 − R₀,₁(K) = 10⁻³.
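With one degree of freedom the power calculation needs no special-function library: a noncentral χ² variable with 1 df and noncentrality λ has the distribution of (Z + √λ)² for standard normal Z, so its upper tail reduces to two normal tail probabilities. The sketch below uses only the Python standard library; the function names are hypothetical, and K is the rounded quantile quoted in the text.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def power_noncentral_chi2_1df(lam, K=10.82757):
    """P(X > K) for X ~ noncentral chi-square with 1 df, noncentrality lam."""
    root_k, root_lam = math.sqrt(K), math.sqrt(lam)
    # (Z + root_lam)^2 > K  iff  Z > root_k - root_lam  or  Z < -root_k - root_lam
    return normal_cdf(root_lam - root_k) + normal_cdf(-root_lam - root_k)


size_at_null = power_noncentral_chi2_1df(0.0)    # recovers the level, about 1e-3
power_example = power_noncentral_chi2_1df(15.0)  # power for a hypothetical lambda
```

At λ = 0 the function returns the nominal level, which is a quick consistency check on K.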

Example 1. Sib pairs

In this example, we assume that there are f sampled families, with each family consisting of a sib pair whose phenotypes are known. We assume that, in the resulting sample of size 2f, the proportion of affected individuals is μ, and the correlation of affection status between sibs in the study is ρ. (Note that both μ and ρ reflect ascertainment. For example, if discordant sib pairs were preferentially ascertained, then ρ could be negative in the sample, even if there were positive sib-sib correlation for the trait in the general population.) For all 3 cases below, we use the mean model given in equation (11). We define a value s, which we call the scaled genetic effect, by s = σ⁻²fμ(1 − μ)γ², and we assume s > 0. All the noncentrality parameters we calculate are proportional to s.

Case 1. Everyone genotyped

When all 2f individuals are both phenotyped and genotyped at the marker being tested, then we have n = 2f, m = 0, and

$$\Phi_N = I_f \otimes \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}. \tag{13}$$

Plugging into equation (10), we obtain v_complete = D_N − μ1_2f. Then the noncentrality parameter is

$$\lambda_{complete} = s(2 + \rho),$$

where s is defined in the previous paragraph.

Case 2. One sib in each pair genotyped, BLUP used

In this case, both sibs in each pair are phenotyped, and the sib who has missing genotype in each pair is chosen at random. Then n = m = f, Φ_N = I_f, $\Phi_{N,M} = \frac{1}{2}I_f$, and, plugging into equation (10), we obtain $v_{partial} = D_N - \mu 1_f + \frac{1}{2}(D_M - \mu 1_f)$. Then the noncentrality parameter that results is $\lambda_{partial} = s(5/4 + \rho)$.

Case 3. One sib in each pair genotyped, missing sib discarded

In this case, although both sibs' phenotypes are observed, in each family, the sib with missing genotype is discarded from the analysis. This is in contrast to Case 2, in which that individual is still included. The calculation of E(G_N|D) in case 3 is exactly the same as for case 2. However, for calculation of v_dropped, we use n = f, m = 0, and Φ_N = I_f. Then, plugging into equation (10), we obtain v_dropped = D_N − μ1_f. The resulting noncentrality parameter is $\lambda_{dropped} = s(1 + \rho/2)^2$.

By comparison of cases 1, 2, and 3 for sib pairs, we see that λ_complete > λ_partial ≥ λ_dropped, with equality between λ_partial and λ_dropped only when ρ = 1 or −1. The greatest difference between λ_partial and λ_dropped occurs when ρ = 0. This makes sense because ρ = 1 or −1 corresponds to the case when the phenotype information on the ungenotyped individuals is completely redundant and provides no new information, while ρ = 0 corresponds to the phenotype information on the ungenotyped individuals being maximally informative (or at least, not well-predicted based on the phenotypes of their sibs). Power plots examining these three cases are given in Results.

Example 2. Sib quartets

The reason for considering the sib quartets example is that it is similar to the sib pairs example in many respects, except that, with two typed sibs per family, there is more information available on the missing genotypes, so the BLUP imputation might be expected to be more informative. We assume that there are f sampled families with each family consisting of a sib quartet whose phenotypes are known. In the resulting sample of 4f individuals, we assume that the proportion of affected individuals is μ, and the correlation of affection status between sibs in the study is ρ. For the sake of comparison, we use the same scaled genetic effect, s = σ⁻²fμ(1 − μ)γ², that is used in the sib pair example.

Case 1. Everyone genotyped

When all 4f individuals are both phenotyped and genotyped at the marker being tested, then we have n = 4f, m = 0, and

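$$\Phi_N = I_f \otimes \begin{pmatrix} 1 & 1/2 & 1/2 & 1/2 \\ 1/2 & 1 & 1/2 & 1/2 \\ 1/2 & 1/2 & 1 & 1/2 \\ 1/2 & 1/2 & 1/2 & 1 \end{pmatrix}. \tag{14}$$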

We obtain v_complete = D_N − μ1_4f, and λ_complete = s(4 + 6ρ). Note that with four sibs, if every pair has correlation ρ, then we have the constraint −1/3 ≤ ρ ≤ 1, so λ_complete ≥ 2s > 0.

Case 2. Two sibs in each quartet genotyped, BLUP used

In this case, all four sibs in each quartet are phenotyped, and the 2 sibs who have missing genotype in each quartet are chosen at random. Then n = m = 2f, Φ_N has the same form as in equation (13), and

$$\Phi_{N,M} = I_f \otimes \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}. \tag{15}$$

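We obtain $v_{partial} = D_N - \mu 1_{2f} + \frac{1}{3}(D_M + \tilde{D}_M - 2\mu 1_{2f})$, where

$$\tilde{D}_M = \left[I_f \otimes \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\right]D_M \tag{16}$$

is a permutation of the vector D_M such that the two ungenotyped sibs in each family have their phenotypes interchanged. Then $\lambda_{partial} = s(8 + 17\rho)/3$.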

Case 3. Two sibs in each quartet genotyped, missing sibs discarded

In this case, although all four sibs' phenotypes are observed, in each family, the two sibs with missing genotypes are discarded from the analysis. We calculate E(G_N|D) in the same way as in case 2. However, for calculation of v_dropped, we use n = 2f, m = 0, and the same Φ_N as in equation (13). Then v_dropped = D_N − μ1_2f, and $\lambda_{dropped} = s(2 + 3\rho)^2/(2 + \rho)$.

By comparison of Cases 1, 2, and 3 for sib quartets, we see λ_complete > λ_partial ≥ λ_dropped, with equality between λ_partial and λ_dropped only when ρ = 1 (recall the constraint −1/3 ≤ ρ ≤ 1 for sib quartets). Power plots for all cases are given in Results.
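The noncentrality parameters for both examples can be recomputed numerically from the mean model (11) and the family kinship matrix. The sketch below returns λ/s for one family type under the assumptions used in the text: every family member is phenotyped, all within-family phenotype correlations equal ρ, and the proportion affected is μ in both the genotyped and ungenotyped subsets (so the weights reduce to centered phenotypes). The function name `lambda_over_s` and its argument layout are hypothetical.

```python
import numpy as np


def lambda_over_s(Phi_fam, genotyped, used_missing, rho):
    """Noncentrality parameter divided by s for one family type.

    Phi_fam: kinship matrix (as in equation (1)) of one family.
    genotyped: indices of family members with observed genotypes.
    used_missing: indices of ungenotyped members kept in the analysis.
    """
    size = Phi_fam.shape[0]
    # Correlation matrix of phenotypes within a family.
    C = (1.0 - rho) * np.eye(size) + rho * np.ones((size, size))
    N = list(genotyped)
    M = list(used_missing)
    Phi_N = Phi_fam[np.ix_(N, N)]
    # Mean model: E(G_N | D) - p1 = gamma * Phi_fam[N, :] (D - k1), using the
    # phenotypes of all relatives whether or not they are kept in V.
    mean_map = Phi_fam[N, :]
    # Weights V = B (D - mu 1): D_N plus Phi_N^{-1} Phi_NM D_M for kept sibs.
    B = np.zeros((len(N), size))
    B[np.arange(len(N)), N] = 1.0
    if M:
        B[:, M] = np.linalg.solve(Phi_N, Phi_fam[np.ix_(N, M)])
    num = np.trace(B.T @ mean_map @ C)    # E[V^T(E(G_N|D) - p1)] / (gamma f mu(1-mu))
    den = np.trace(B.T @ Phi_N @ B @ C)   # Var0(V^T G_N) / (sigma^2 f mu(1-mu))
    return num ** 2 / den


sib_pair = np.array([[1.0, 0.5], [0.5, 1.0]])
sib_quartet = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
rho = 0.2

pair = [lambda_over_s(sib_pair, [0, 1], [], rho),       # Case 1: complete
        lambda_over_s(sib_pair, [0], [1], rho),         # Case 2: BLUP used
        lambda_over_s(sib_pair, [0], [], rho)]          # Case 3: dropped
quartet = [lambda_over_s(sib_quartet, [0, 1, 2, 3], [], rho),
           lambda_over_s(sib_quartet, [0, 1], [2, 3], rho),
           lambda_over_s(sib_quartet, [0, 1], [], rho)]
```

For the complete sib-quartet design this reproduces λ_complete = s(4 + 6ρ), and in both designs the three values are ordered complete > partial ≥ dropped. Feeding s times these ratios into a noncentral χ² tail calculation gives the kind of power comparison plotted in Results.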

3. Results

Figure 1 shows power results, at significance level 10⁻³, for f sampled families, where each family is either a sib pair or a sib quartet. In each case, the solid line represents the situation when all members of the sibship are available to be analyzed (Case 1 above). The dotted line represents the situation when all individuals are phenotyped, but only half the members of the sibship are genotyped, and the individuals with missing genotype are simply dropped from the analysis (Case 3 above). The dashed line represents the situation when phenotype data are available on all members of the sibship, but genotype data are available on only half the members, and the M_QLS is used to analyze the data, which is equivalent to BLUP imputation of the missing genotypes for the other half of the sibship (Case 2 above). We expect the dashed line (BLUP imputation) to be intermediate between the dotted line (individuals with missing genotype discarded) and the solid line (complete information on the individuals), and it is of interest to get an idea of how much of the full information can be recovered by the BLUP, in the context of association testing.

FIG. 1. How much power for association is recovered by using BLUPs of missing genotypes? Approximate power at level 10⁻³, as calculated using a non-central chi-square approximation, vs. scaled genetic effect, given by σ⁻²fμ(1 − μ)γ², for sib pairs or sib quartets and for different values of the correlation between sampled sib phenotypes, where this correlation would depend on ascertainment. In each plot, the solid line represents the situation in which the phenotypes and genotypes of all sibs are observed, representing the gold standard of perfect recovery of missing genotypes. The dashed line represents the situation in which the genotypes of half of the sibs (1 sib in a sib pair or 2 sibs in a sib quartet) are not observed, and they are incorporated into the M_QLS statistic, which is equivalent to BLUP imputation. The dotted line represents the situation in which the genotypes of half the sibs are not observed, and the ungenotyped sibs are simply removed from the analysis.

The sib quartet study design represents a larger sample, so it is to be expected that power should be higher than for the sib pair study design. In addition, we can see that the power is higher overall when the study design is such that the correlation between sampled sibs' phenotypes is higher. One explanation for this is the enrichment effect: for complex traits, an affected individual with an affected sibling is more likely to carry a particular variant associated with the phenotype than is an affected individual with an unaffected sibling. By the same token, an unaffected individual with an unaffected sibling is more likely to carry a particular protective variant than is an unaffected individual with an affected sibling. Therefore, we might expect to have higher power to detect a genetic effect when we sample individuals who have relatives with similar trait values. By comparing the results from complete data on the sibship (solid lines) to the results that ignore half of the sibs (dotted lines) for different values of the correlation, we see that the higher the phenotype correlation between the sibs, the less important it is to have the missing sib(s) in the analysis (solid and dotted lines get closer). This also seems reasonable, because if the missing sibs' phenotype(s) are well-predicted by the observed sibs' phenotype(s), then there should be less new information gained by including the additional sib(s).

Finally, we can see that in all 6 cases, the use of the BLUP imputation for missing genotypes (i.e., the M_QLS test) can provide a moderate power increase over ignoring individuals with missing genotypes. In particular, the M_QLS method seems more effective with the sib quartet design, which is to be expected, because in that case there is more information available, on related individuals, to predict the missing genotypes.

4. Discussion

We describe an interesting connection between the M_QLS method, for case-control association testing in samples of related individuals, and the imputation of missing genotypes by the best linear unbiased predictor based on relatives' genotypes. In examples, we show that the use of BLUP to predict missing genotypes can add a reasonable amount of power to detect association. The amount of power added is higher when there are more typed relatives available to improve the prediction of missing genotypes.

The BLUP imputation described here is single-point. In contrast, most current genotype imputation methods (Scheet and Stephens, 2006; Browning and Browning, 2007) use information across many markers. However, with related individuals, imputed genotypes are dependent among relatives, where the dependence among imputed genotypes differs from the ordinary dependence among genotypes and is affected by the type and amount of information available for each individual. For association mapping, this complex dependence among imputed genotypes would need to be taken into account in the analysis in order to construct a valid test for association. A key feature of the single-point BLUP imputation we describe here is that the dependence among imputed genotypes is exactly taken into account in the construction of the M_QLS test, in a way that is fast and computationally feasible even for large, inbred pedigrees.

The BLUP that we use is constructed assuming that there is no population structure beyond that captured by Φ. Because the genotype prediction for an individual is based on genotypes of close relatives, one would expect it to be robust to mild population structure. The main difficulty for the BLUP would seem to be the possibility that the BLUE of allele frequency, $\hat{p}$, which is the centering value for the BLUP, could be inappropriate in cases of highly differentiated markers, when the allele frequencies are very different in different subpopulations represented in the sample. If information on population structure were available (e.g., in the form of structure-capturing vectors), then this information could be used to replace the vectors $\hat{p}1_m$ and $\hat{p}1$ of equation (6) by vectors in which the entry for the ith individual is an ancestry-specific estimated allele frequency. Alternatively, with mild population structure, in the context of case-control association testing, the ROADTRIPS method (Thornton and McPeek, 2010) could be used. ROADTRIPS is a more robust form of M_QLS in which an estimated structure matrix is used to correct the variance of the test statistic for misspecified relationships in Φ as well as for mild population structure.

5. Appendix

5.1. Proof that the BLUP is given by equations (5) and (6) with variance in equation (7)

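As mentioned in subsection 2.1, we need to find the m × n matrix $\tilde{R}$ minimizing $E\{[c^T(G_M - \tilde{R}G_N)]^2\}$ for every m-vector c, subject to condition (4), which is $\tilde{R}1 = 1_m$. Note that condition (4) implies $E[c^T(G_M - \tilde{R}G_N)] = 0$, so we have $E\{[c^T(G_M - \tilde{R}G_N)]^2\} = \mathrm{Var}[c^T(G_M - \tilde{R}G_N)]$. Define R by equation (5). We can trivially write $c^T(G_M - \tilde{R}G_N) = c^T(G_M - RG_N) - c^T(\tilde{R} - R)G_N$.

Claim

$\mathrm{Cov}[c^T(G_M - RG_N),\ c^T(\tilde{R} - R)G_N] = 0$.

Proof

We have $\mathrm{Cov}[c^T(G_M - RG_N),\ c^T(\tilde{R} - R)G_N] = \sigma^2c^T(\Phi_{N,M}^T - R\Phi_N)(\tilde{R} - R)^Tc$. Applying equation (5), we get $\Phi_{N,M}^T - R\Phi_N = -(1_m - \Phi_{N,M}^T\Phi_N^{-1}1)(1^T\Phi_N^{-1}1)^{-1}1^T$. Condition (4) and equation (5) imply $(\tilde{R} - R)1 = 0$, so we have $1^T(\tilde{R} - R)^T = 0$, which proves the claim.

Thus, $\mathrm{Var}[c^T(G_M - \tilde{R}G_N)] = \mathrm{Var}[c^T(G_M - RG_N)] + \mathrm{Var}[c^T(\tilde{R} - R)G_N]$. This is minimized for every m-vector c by $\tilde{R} = R$. When Φ_N is invertible, R is the unique minimizer, and the unique BLUP is given by RG_N. (When the set of individuals with observed genotypes contains both members of one or more MZ twin pairs, Φ_N is not invertible, and a minimizer R can be obtained by replacing $\Phi_N^{-1}$ with the Moore-Penrose generalized inverse of Φ_N in equation (5). In that case, R is not the unique minimizer, but any other minimizer R* satisfies R*G_N = RG_N, provided that the two members of any MZ twin pair have identical genotypes, so RG_N is still the unique BLUP.) Expression (6) follows immediately from expression (5), using $\hat{p} = (1^T\Phi_N^{-1}1)^{-1}1^T\Phi_N^{-1}G_N$, where both $1^T\Phi_N^{-1}1$ and $1^T\Phi_N^{-1}G_N$ are scalars. Expression (7) follows from the fact that Var(G) = σ²Φ.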

5.2. Proof of equation (12)

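We have $V^TG_N = (D_N - k1)^T(G_N - \hat{p}1) + (D_M - k1_m)^T\Phi_{N,M}^T\Phi_N^{-1}(G_N - \hat{p}1) = (D_N - k1)^T(G_N - \hat{p}1) + (D_M - k1_m)^T(\hat{G}_M - \hat{p}1_m)$, which is the first line of equation (12). The second line of equation (12) follows by noting that $D_j - k = 0$ for every genotyped individual j who is not in L, and similarly for every ungenotyped individual j who is not in L.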

Acknowledgments

This study was supported in part by the National Institutes of Health (grant R01 HG001645).

Disclosure Statement

No competing financial interests exist.

References

Bourgain C. Hoffjian S. Nicolae R., et al. Novel case-control test in a founder population identifies P-selectin as an atopy susceptibility locus. Am. J. Hum. Genet. 2003;73:612–626. doi: 10.1086/378208.
Browning S.R. Browning B.L. Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
Devlin B. Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
McPeek M.S. Wu X. Ober C. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004;60:359–367. doi: 10.1111/j.0006-341X.2004.00180.x.
Rabinowitz D. Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum. Hered. 2000;50:211–223. doi: 10.1159/000022918.
Risch N. Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res. 1998;8:1273–1288. doi: 10.1101/gr.8.12.1273.
Scheet P. Stephens M. A fast and flexible statistical method for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 2006;78:629–644. doi: 10.1086/502802.
Slager S. Schaid D.J. Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am. J. Hum. Genet. 2001;68:1457–1462. doi: 10.1086/320608.
Thornton T. McPeek M.S. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am. J. Hum. Genet. 2007;81:321–337. doi: 10.1086/519497.
Thornton T. McPeek M.S. ROADTRIPS: Case-control association testing with partially or completely unknown population and pedigree structure. Am. J. Hum. Genet. 2010;86:172–184. doi: 10.1016/j.ajhg.2010.01.001.
Wang Z. McPeek M.S. An incomplete-data quasi-likelihood approach to haplotype-based genetic association studies on related individuals. JASA. 2009;104:1251–1260. doi: 10.1198/jasa.2009.tm08507.
