Journal of Computational Biology. 2012 Jun;19(6):756–765. doi: 10.1089/cmb.2012.0024

BLUP Genotype Imputation for Case-Control Association Testing with Related Individuals and Missing Data

Mary Sara McPeek
PMCID: PMC3375641  PMID: 22697245

Abstract

We consider the problem of case-control association testing in samples that contain related individuals, where we assume the pedigree structure is known. Typically, for each marker tested, some individuals will have missing genotype data. The MQLS method has been proposed for association testing in this situation. We show that the MQLS method is equivalent to an approach in which missing genotypes are imputed using the best linear unbiased predictor (BLUP) based on relatives' genotype data. Viewed this way, the MQLS exactly corrects for the imputation error and for the extra correlation due to imputation. We also investigate the amount of additional power for detecting association that is provided by this BLUP imputation approach.

Key words: GWAS, pedigrees, quasi-likelihood, score test

1. Introduction

Standard methods for case-control association testing typically assume that sampled individuals are unrelated. However, many ongoing genetic studies involve families, so genetic association analysis of such studies requires some way of dealing with related individuals. The simplest way is to select a subsample of unrelated individuals, but that clearly results in a loss of information. Family-based tests (Rabinowitz and Laird, 2000) can be used in some contexts. Such approaches have the benefit that they protect against population structure, but they can also be underpowered relative to case-control association tests in some circumstances (Risch and Teng, 1998; Thornton and McPeek, 2010). A third approach is to perform a standard $\chi^2$ test with a post-hoc correction factor applied. Slager and Schaid (2001) and Bourgain et al. (2003) have derived explicit correction factors for $\chi^2$ tests of association performed in samples of related individuals. Another way to obtain such a correction factor in the context of a genome screen would be to use the genomic control method of Devlin and Roeder (1999), which works well if the rate of missing genotypes is similar at all markers. A fourth approach is to try to make more efficient use of the available information to obtain a more powerful test than the one that results from correcting the standard $\chi^2$ test. This approach has been taken by Bourgain et al. (2003) and by Thornton and McPeek (2007), who showed that there can be a nonnegligible power gain for their MQLS method over the corrected $\chi^2$.

One purpose of this article is to better understand the nature of this power gain. One way in which the MQLS improves on the corrected $\chi^2$ is that it uses a more efficient estimator of allele frequency, which is a nuisance parameter in the analysis. Both the MQLS and corrected $\chi^2$ could also make use of unphenotyped controls who are not related to phenotyped individuals, but who are genotyped, to improve the nuisance parameter estimation. Additionally, the MQLS can make use of genotyped individuals who are unphenotyped but who have phenotyped relatives in the sample as well as phenotyped individuals who have missing genotypes but who have genotyped relatives in the sample. It is this last aspect of power gain, the gain from using information on related individuals with missing genotype or phenotype, that we study in this article. We derive a connection between the MQLS and imputation of missing genotypes based on genotypes of relatives, and we consider the question of how much power can be gained by squeezing this information from the data.

2. Methods

2.1. Best linear unbiased prediction of genotypes

First, we ignore phenotype and focus on the problem of predicting missing genotypes based on the observed genotypes of relatives. Suppose our study consists of n + m individuals, where we assume, without loss of generality, that the first n of the n + m individuals have non-missing genotype data at the marker under consideration, while the last m have missing genotype data at the marker. The n + m individuals can be arbitrarily related, with the pedigree(s) that specify the relationships assumed to be known. The method we describe is feasible even for large complex pedigrees with inbreeding loops. Unrelated individuals can also be included in the sample. If their genotypes are known, they will contribute to the prediction of missing genotypes in others by contributing to the allele frequency estimate. However, if individuals with missing genotypes are unrelated to others in the sample, then their genotypes will not be predicted by our method.

For the moment, we assume that the marker under consideration is an autosomal binary marker (e.g., a SNP) with alleles labeled “0” and “1.” We extend to multiple alleles in the next subsection. Let $G = (G_1, \ldots, G_{n+m})^T$ be the vector of true genotypes at the marker under consideration, where $G_i = 0$, $1/2$, or $1$, according to whether individual $i$ has, respectively, 0, 1, or 2 copies of allele 1 at the marker. We assume that the first $n$ entries of $G$ are observed, while the last $m$ entries are unobserved and are to be predicted. We use the notation $G = (G_N^T, G_M^T)^T$ to denote the partition of the vector $G$ into the $n$-vector $G_N$ of observed genotypes and the $m$-vector $G_M$ of unobserved genotypes.

Let $p$ represent the frequency of allele 1 in the population from which the pedigree founders are assumed to be drawn, where $0 < p < 1$. Then, we have $E(G) = p\mathbf{1}_{n+m}$, where $\mathbf{1}_{n+m}$ is a column vector of length $n + m$ with every entry equal to 1. If we further assume Hardy-Weinberg equilibrium (HWE) in the population from which the pedigree founders are drawn, then $\mathrm{Var}(G) = \sigma^2\Phi$, where $\sigma^2 = p(1-p)/2$ and $\Phi$ is the kinship matrix given by

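$$\Phi = \begin{pmatrix} 1+h_1 & 2\phi_{1,2} & \cdots & 2\phi_{1,n+m} \\ 2\phi_{2,1} & 1+h_2 & \cdots & 2\phi_{2,n+m} \\ \vdots & \vdots & \ddots & \vdots \\ 2\phi_{n+m,1} & 2\phi_{n+m,2} & \cdots & 1+h_{n+m} \end{pmatrix} \tag{1}$$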

where $\phi_{i,j}$ is the kinship coefficient between individuals $i$ and $j$, and $h_i$ is the inbreeding coefficient of individual $i$. To make the approach more robust to deviations from HWE, we can remove the assumption that $\sigma^2 = p(1-p)/2$ and simply assume $\mathrm{Var}(G) = \sigma^2\Phi$, where $\sigma^2$ is an unknown parameter that we estimate. Corresponding to the partition of $G$ into $G_N$ and $G_M$, we have the following partition of the $\Phi$ matrix:

$$\Phi = \begin{pmatrix} \Phi_N & \Phi_{N,M} \\ \Phi_{N,M}^T & \Phi_M \end{pmatrix} \tag{2}$$

where $\Phi_N$ is the $n \times n$ kinship matrix for the individuals with observed genotypes at the marker, $\Phi_{N,M}$ is the $n \times m$ matrix of kinship coefficients for pairs of individuals in which one individual has observed genotype and the other has unobserved genotype, and $\Phi_M$ is the $m \times m$ kinship matrix for the individuals with unobserved genotypes at the marker. Provided that the set of individuals with observed genotypes does not contain both members of any monozygotic (MZ) twin pair, the matrix $\Phi_N$ is invertible (Thornton and McPeek, 2007).

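Under these conditions, the best linear unbiased estimator (BLUE) of $p$ is given by $\hat{p} = (\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_N^{-1}G_N$, and the variance of this estimator is $\mathrm{Var}(\hat{p}) = \sigma^2(\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}$, where $\mathbf{1}$ denotes a vector of length $n$ with every entry equal to 1. McPeek et al. (2004) show that this is the BLUE based on the allele indicators for the set of genotyped individuals in the situation when parental origin of allele is not known.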
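As an illustration, the BLUE can be computed with one linear solve, using the generalized-least-squares form $\hat{p} = (\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_N^{-1}G_N$. The following is a minimal sketch, not a reference implementation; the function name `blue_allele_frequency` and the three-person toy pedigree (two sibs and one unrelated individual) are assumptions made for the example.

```python
import numpy as np

def blue_allele_frequency(phi_n, g_n):
    """Return (p_hat, var_factor), where Var(p_hat) = sigma^2 * var_factor."""
    ones = np.ones(len(g_n))
    # Solve Phi_N x = 1 instead of forming the inverse explicitly.
    phi_inv_one = np.linalg.solve(phi_n, ones)
    denom = ones @ phi_inv_one              # 1' Phi_N^{-1} 1
    p_hat = (phi_inv_one @ g_n) / denom     # weighted mean of the genotypes
    return p_hat, 1.0 / denom

# Hypothetical toy data: individuals 1 and 2 are outbred sibs (2 * kinship = 1/2),
# individual 3 is unrelated to both. Genotypes are coded 0, 1/2, 1.
phi_n = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
g_n = np.array([1.0, 0.5, 0.0])

p_hat, var_factor = blue_allele_frequency(phi_n, g_n)
print(p_hat, var_factor)
```

In this toy example the two sibs are down-weighted relative to the unrelated individual, so the estimate differs from the simple average of the three genotypes.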

Now we propose to predict GM by finding its best linear unbiased predictor (BLUP) based on GN. That is, among all fixed m × n real matrices R, we find the one that minimizes

$$E\left\{\left[c^T(G_M - RG_N)\right]^2\right\} \tag{3}$$

for every fixed m-vector c, subject to

$$E(G_M - RG_N) = 0. \tag{4}$$

Equivalently, we find the $R$ satisfying condition (4) such that $\mathrm{Var}(G_M - \tilde{R}G_N) - \mathrm{Var}(G_M - RG_N)$ is a positive semidefinite matrix for all $\tilde{R}$ satisfying condition (4). In our case, condition (4) reduces to $R\mathbf{1} = \mathbf{1}_m$, where we economize on subscripts by letting $\mathbf{1}$ always denote a vector of length $n$ with every entry equal to 1, while we let $\mathbf{1}_m$ denote a vector of length $m$ with every element equal to 1. In the Appendix, we show that the resulting $R$ is given by

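$$R = \Phi_{N,M}^T\Phi_N^{-1} + (\mathbf{1}_m - \Phi_{N,M}^T\Phi_N^{-1}\mathbf{1})(\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_N^{-1}, \tag{5}$$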

so that the BLUP of GM is given by

$$\hat{G}_M = RG_N = \hat{p}\mathbf{1}_m + \Phi_{N,M}^T\Phi_N^{-1}(G_N - \hat{p}\mathbf{1}). \tag{6}$$

The variance of the BLUP is

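$$\mathrm{Var}(\hat{G}_M) = \sigma^2\left[\Phi_{N,M}^T\Phi_N^{-1}\Phi_{N,M} + (\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\left(\mathbf{1}_m\mathbf{1}_m^T - \Phi_{N,M}^T\Phi_N^{-1}\mathbf{1}\mathbf{1}^T\Phi_N^{-1}\Phi_{N,M}\right)\right]. \tag{7}$$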

If p were known, only the first term of equation (7) would be present; the second term of the variance results from estimation of p.
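The BLUP calculation can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the matrix $R$ is taken to have the usual generalized-least-squares form (a kinship-regression term plus a correction that makes each row sum to 1, as required by the unbiasedness condition), the variance is computed directly as $R\Phi_NR^T$ in units of $\sigma^2$, and the function name and toy pedigree are hypothetical.

```python
import numpy as np

def blup_genotypes(phi_n, phi_nm, g_n):
    """Return (R, G_hat_M, Var(G_hat_M) / sigma^2) for missing genotypes."""
    n, m = phi_nm.shape
    ones_n = np.ones(n)
    # A = Phi_NM' Phi_N^{-1}  (m x n); Phi_N is symmetric, so transpose the solve.
    a = np.linalg.solve(phi_n, phi_nm).T
    phi_inv_one = np.linalg.solve(phi_n, ones_n)
    w = phi_inv_one / (ones_n @ phi_inv_one)   # BLUE weights, so p_hat = w @ g_n
    b = np.ones(m) - a @ ones_n                # 1_m - A 1
    r = a + np.outer(b, w)                     # each row of R sums to 1
    g_hat = r @ g_n
    var_over_sigma2 = r @ phi_n @ r.T
    return r, g_hat, var_over_sigma2

# Hypothetical toy data: genotyped individuals are sib 1 of a pair and one
# unrelated person; the ungenotyped individual is sib 2 of the pair.
phi_n = np.eye(2)
phi_nm = np.array([[0.5],
                   [0.0]])
g_n = np.array([1.0, 0.0])

r, g_hat, var_over_sigma2 = blup_genotypes(phi_n, phi_nm, g_n)
print(r, g_hat, var_over_sigma2)
```

Here the estimated allele frequency is 0.5, and the missing sib's prediction is pulled halfway from 0.5 toward the genotyped sib's value of 1.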

The above development holds for the case when the set of individuals with observed genotypes does not contain both members of any MZ twin pair. If the set of genotyped individuals contains both members of one or more MZ twin pairs, then $\Phi_N$ is not invertible, but the above results hold with $\Phi_N^{-1}$ replaced by the Moore-Penrose generalized inverse, which we write as $\Phi_N^{-}$. Provided that the two members of any MZ twin pair have identical genotypes, use of $\Phi_N^{-}$ in place of $\Phi_N^{-1}$ in the above formulas is mathematically equivalent to setting the genotype of one of the two members of each MZ twin pair to be missing, in order to obtain invertible $\Phi_N$.

Advantages of the BLUP, $\hat{G}_M$, in equation (6), as a predictor of $G_M$ are that (1) it is extremely fast to calculate, even in large complex pedigrees with inbreeding loops, making it feasible to use in studies with large numbers of people and markers; and (2) its variance-covariance matrix, given in equation (7), is also very easy to calculate, making it relatively easy to incorporate the prediction uncertainty and correlation between predicted values into a genetic analysis. Note that for different markers, the sets $N$ and $M$ of individuals with, respectively, non-missing and missing genotype, will change, and so the BLUE weight vector $(\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_N^{-1}$ and the matrix $R$ will differ from marker to marker.

2.2. BLUP for multi-allelic markers

Now suppose the marker under consideration is observed to have $a$ distinct alleles. Let $p = (p_1, \ldots, p_{a-1})^T$ be the vector of allele frequencies for alleles 1 through $a - 1$. (Note that $p_a = 1 - \sum_{j=1}^{a-1}p_j$, so $p_a$ is redundant and can be dropped from $p$.) For this subsection only, we redefine $G = (G^{(1)T}, \ldots, G^{(a-1)T})^T$, where $G^{(j)} = (G_1^{(j)}, \ldots, G_{n+m}^{(j)})^T$, with $G_i^{(j)}$ equal to 0, 1/2 or 1, according to whether individual $i$ has 0, 1, or 2 copies of the $j$th allele, $1 \le j \le a - 1$. Thus, the vector $G$ now has length $(n + m)(a - 1)$. As before, the first $n$ entries of each $G^{(j)}$ are observed, while the last $m$ entries are to be predicted. We write $G^{(j)} = (G_N^{(j)T}, G_M^{(j)T})^T$ to denote the partition of the vector $G^{(j)}$ into the $n$-vector $G_N^{(j)}$ of observed genotypes and the $m$-vector $G_M^{(j)}$ of unobserved genotypes.

In the multi-allelic case, we have $E(G) = (I_{a-1} \otimes \mathbf{1}_{n+m})p$, with $\otimes$ denoting Kronecker product and $I_{a-1}$ denoting the identity matrix of dimension $a - 1$. Under HWE, we also have $\mathrm{Var}(G) = F \otimes \Phi$, where $F_{(a-1)\times(a-1)}$ has $(i,j)$th entry equal to $-p_ip_j/2$ for $i \ne j$ and $p_i(1-p_i)/2$ for $i = j$. As shown by McPeek et al. (2004), the BLUE of $p$ is given by $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_{a-1})^T$, where $\hat{p}_j = (\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_N^{-1}G_N^{(j)}$, which is the BLUE for the frequency of allele $j$ that would be calculated if the marker were treated as biallelic with one allele being $j$ and all the other alleles being collapsed into a single “not-$j$” allele. Furthermore, $\mathrm{Var}(\hat{p}) = (\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}F$.

The BLUP of $G_M$ based on $G_N$ is found to be $\hat{G}_M = (\hat{G}_M^{(1)T}, \ldots, \hat{G}_M^{(a-1)T})^T$, where $\hat{G}_M^{(j)}$ is the BLUP of $G_M^{(j)}$ that would be calculated if the marker were treated as biallelic with one allele being $j$ and all the other alleles being collapsed into a single “not-$j$” allele. We have

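$$\mathrm{Var}(\hat{G}_M) = F \otimes \left[\Phi_{N,M}^T\Phi_N^{-1}\Phi_{N,M} + (\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\left(\mathbf{1}_m\mathbf{1}_m^T - \Phi_{N,M}^T\Phi_N^{-1}\mathbf{1}\mathbf{1}^T\Phi_N^{-1}\Phi_{N,M}\right)\right]. \tag{8}$$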

Equation (8) has the same form as equation (7), with the scalar factor $\sigma^2$ replaced by the Kronecker factor $F\,\otimes$.

An important feature of the multiallelic BLUE and BLUP calculations is that they can be performed by, for example, calculating $(\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_N^{-1}$ and $R$ only once for a given multiallelic marker and then taking the inner products of these with $a - 1$ different vectors $G_N^{(j)}$, $1 \le j \le a - 1$.
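The reuse described above can be sketched as follows: the BLUE weight vector and the BLUP matrix depend only on the kinship structure, so they are computed once and then applied to every allele column. This is a minimal sketch under the same generalized-least-squares assumptions as the biallelic case; the function name, the three-allele marker, and the pedigree are hypothetical.

```python
import numpy as np

def blue_blup_operators(phi_n, phi_nm):
    """Return (w, R): BLUE weights (length n) and BLUP matrix (m x n)."""
    n, m = phi_nm.shape
    phi_inv_one = np.linalg.solve(phi_n, np.ones(n))
    w = phi_inv_one / phi_inv_one.sum()        # (1' Phi^{-1} 1)^{-1} Phi^{-1} 1
    a = np.linalg.solve(phi_n, phi_nm).T       # Phi_NM' Phi_N^{-1}
    r = a + np.outer(np.ones(m) - a.sum(axis=1), w)
    return w, r

# Genotyped: two sibs and one unrelated individual; missing: a third sib.
phi_n = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
phi_nm = np.array([[0.5], [0.5], [0.0]])

# Marker with a = 3 alleles; column j holds (copies of allele j) / 2 for
# alleles 1 and 2 (allele 3 is redundant). Genotypes: 1/1, 1/2, 2/3.
g_n = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 0.5]])

w, r = blue_blup_operators(phi_n, phi_nm)   # computed once per marker
p_hat = w @ g_n                             # BLUE for alleles 1 and 2
g_hat_m = r @ g_n                           # BLUP, one column per allele
print(p_hat, g_hat_m)
```

Each column of the result coincides with the biallelic calculation applied to that allele against the collapsed "not-j" allele.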

2.3. Overview of the MQLS method for case-control association testing with related individuals

Now we suppose that, in addition to having genotype data, we also have case-control phenotype data, where we allow some phenotype values to be unknown. Let $D$ denote the phenotype data on the $n + m$ individuals, with each individual categorized as “affected,” “unaffected,” or “unknown phenotype.” Here, the designation “unknown phenotype” could be used to refer to, for example, an unphenotyped individual taken from a generic control panel. Alternatively, it could refer to an individual whose phenotype has not yet become apparent (e.g., for an age-related trait). For $1 \le i \le n + m$, we define the $i$th component of $D$ to be $D_i = 1$ if $i$ is affected, $D_i = 0$ if $i$ is unaffected, and $D_i = k$ if $i$ is of unknown phenotype, where $k$ is an externally obtained estimate of the population prevalence of the trait. We write $D = (D_N^T, D_M^T)^T$ to denote the partition of $D$ into the vectors $D_N$ and $D_M$, corresponding to individuals with non-missing and missing genotype, respectively, at the marker being tested. For simplicity, we describe the MQLS method for the case when the marker being tested is biallelic. The multiallelic case is given in Thornton and McPeek (2007).

We analyze the data retrospectively, i.e., we condition on $D$ and treat $G_N$ as random in the analysis. The retrospective approach is appropriate, for example, with either random or phenotype-based ascertainment. Our null hypothesis is that there is no association and no linkage between the marker being tested and the trait. Under the null hypothesis, we assume that $E_0(G_N|D) = p\mathbf{1}$ and $\mathrm{Var}_0(G_N|D) = \sigma^2\Phi_N$. The test statistic for the MQLS method is given by

$$M_{QLS} = \frac{(V^TG_N)^2}{\hat{\sigma}^2\,V^T\Phi_NV}, \tag{9}$$

with

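$$V = \left[I_n - \Phi_N^{-1}\mathbf{1}(\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\right]\left[(D_N - k\mathbf{1}) + \Phi_N^{-1}\Phi_{N,M}(D_M - k\mathbf{1}_m)\right], \tag{10}$$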

where $I_n$ is the $n \times n$ identity matrix. For $\hat{\sigma}^2$, we typically use the estimator of $\sigma^2$ proposed by Thornton and McPeek (2010). Under the null hypothesis of no linkage and no association between the marker and the trait, the MQLS statistic given in equation (9) is asymptotically $\chi^2_1$ distributed.
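A minimal numerical sketch of the statistic follows, written as the quasi-likelihood score test for the retrospective mean model $E(G_N|D) = p\mathbf{1} + \gamma[\Phi_N(D_N - k\mathbf{1}) + \Phi_{N,M}(D_M - k\mathbf{1}_m)]$ with $\mathrm{Var}(G_N|D) = \sigma^2\Phi_N$. Two points are assumptions of the sketch rather than statements about the published method: the weight vector `v` is derived from that score, and the variance estimate is the HWE plug-in $\hat{p}(1-\hat{p})/2$, which is not necessarily the estimator of Thornton and McPeek (2010). The function name and data are hypothetical.

```python
import numpy as np

def mqls_statistic(phi_n, phi_nm, g_n, d_n, d_m, k):
    """Return (statistic, v, p_hat) for one biallelic marker."""
    ones = np.ones(len(g_n))
    phi_inv_one = np.linalg.solve(phi_n, ones)
    s = ones @ phi_inv_one                     # 1' Phi_N^{-1} 1
    p_hat = (phi_inv_one @ g_n) / s            # BLUE of allele frequency
    # Phenotype residuals of genotyped individuals, plus the contribution of
    # ungenotyped relatives' residuals passed through the kinship matrices.
    t = (d_n - k) + np.linalg.solve(phi_n, phi_nm @ (d_m - k))
    # Remove the component along Phi_N^{-1} 1, so the entries of v sum to 0.
    v = t - phi_inv_one * (ones @ t) / s
    sigma2_hat = 0.5 * p_hat * (1.0 - p_hat)   # HWE plug-in (assumption)
    stat = (v @ g_n) ** 2 / (sigma2_hat * (v @ phi_n @ v))
    return stat, v, p_hat

# Hypothetical data: four unrelated genotyped individuals; the first two each
# have one ungenotyped, phenotyped sib. Prevalence estimate k = 0.1.
phi_n = np.eye(4)
phi_nm = np.array([[0.5, 0.0],
                   [0.0, 0.5],
                   [0.0, 0.0],
                   [0.0, 0.0]])
g_n = np.array([1.0, 0.5, 0.0, 0.5])
d_n = np.array([1.0, 1.0, 0.0, 0.0])
d_m = np.array([1.0, 0.0])

stat, v, p_hat = mqls_statistic(phi_n, phi_nm, g_n, d_n, d_m, k=0.1)
print(stat, v, p_hat)
```

Because the entries of `v` sum to zero, the numerator is unchanged if a constant is added to every genotype, which is why the allele frequency drops out of the numerator.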

2.4. Previous interpretations of the MQLS method

There are several possible ways of understanding the MQLS statistic of equation (9). The original development of the MQLS came from the fact that it is the quasi-likelihood score test of the null hypothesis $H_0: \gamma = 0$ in the retrospective model $E(G_N|D) = p\mathbf{1} + \gamma[\Phi_N(D_N - k\mathbf{1}) + \Phi_{N,M}(D_M - k\mathbf{1}_m)]$, where $\gamma$ represents the association parameter. For a genotyped individual $i$, this conditional expectation can be written

$$E(G_i|D) = p + \gamma\sum_{j=1}^{n+m}2\phi_{i,j}(D_j - k), \tag{11}$$

where we let $2\phi_{i,i} = 1 + h_i$. In expression (11), the association parameter, $\gamma$, is multiplied by a weighted sum of centered phenotype values, where the weight of individual $j$'s centered phenotype is twice the kinship coefficient of individual $j$ with individual $i$. This results in an enrichment effect, i.e., individuals with multiple affected relatives are assumed to have a higher chance of carrying a causal allele than individuals without affected relatives. For outbred individuals, it can be shown (Thornton and McPeek, 2007) that this retrospective model holds, up to terms of order $o(\gamma)$, assuming any prospective, two-allele, disease model, when the effect size $\gamma$ is close to 0.

A later development (Wang and McPeek, 2009) shows that the MQLS is closely connected to a retrospective likelihood score test based on the following prospective model: Inline graphic. Here P0(D) is the null model for the joint distribution of phenotype values in the absence of association with the marker being tested, where arbitrary dependence among phenotypes of related individuals is allowed. It is actually not necessary to specify the form of P0(D). c(G,r) is simply a normalizing constant. Logistic regression can be obtained as a special case of this prospective model. If we take into account incomplete data in deriving a retrospective likelihood score test for r = 0 under this class of models, then it is asymptotically equivalent to the MQLS test, where the difference between them arises from the fact that the MQLS test estimates the nuisance parameter p by the best linear unbiased estimator (BLUE), while the retrospective likelihood score test uses the maximum likelihood estimator of p under the null hypothesis.

2.5. A new interpretation of the MQLS method

We describe a novel interpretation of the MQLS statistic, which is somewhat different in flavor from previous interpretations and, therefore, can be illuminating. We note that the expression $V^TG_N$ in the numerator of the MQLS test statistic of equation (9) can be rewritten as

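$$\begin{aligned} V^TG_N &= (D_N - k\mathbf{1})^T(G_N - \hat{p}\mathbf{1}) + (D_M - k\mathbf{1}_m)^T(\hat{G}_M - \hat{p}\mathbf{1}_m) \\ &= \sum_{j \in L \cap N}(D_j - k)(G_j - \hat{p}) + \sum_{j \in L \cap M}(D_j - k)(\hat{G}_j - \hat{p}), \end{aligned} \tag{12}$$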

where $L$ is the set of phenotyped individuals, $\hat{p}$ is the BLUE of allele frequency, and $\hat{G}_j$ is the BLUP of $G_j$ given by equation (6). The first sum is taken over all individuals who are both phenotyped and genotyped, and it represents the inner product of their genotypic and phenotypic residuals, where phenotype is centered around the externally derived prevalence estimate, $k$, and genotype is centered around the BLUE of allele frequency. The second sum is over all phenotyped individuals who have missing genotype at the marker being tested, in which case we replace the missing genotype $G_j$ by its BLUP under the null hypothesis, $\hat{G}_j$. Thus, we can interpret the MQLS as involving a form of imputation of missing genotypes by their BLUPs based on genotyped relatives. The main advantage of this form of imputation is that, under the null hypothesis, the uncertainty in imputation and dependence in imputation across individuals is exactly taken into account in the variance that appears in the denominator of equation (9).
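The imputation identity can be checked numerically. The sketch below computes the score numerator in two ways: once from a weight vector applied to the observed genotypes, and once as a sum of phenotype residuals times genotype residuals in which missing genotypes are replaced by their BLUPs. The form of the weight vector is an assumption (the quasi-likelihood score for the retrospective mean model of Section 2.4), and the pedigree, data, and variable names are hypothetical.

```python
import numpy as np

# Genotyped: sibs 1 and 2 of one family, plus an unrelated parent (individual 3).
# Ungenotyped: a third sib of family 1, and a child of individual 3.
phi_n = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
phi_nm = np.array([[0.5, 0.0],
                   [0.5, 0.0],
                   [0.0, 0.5]])
g_n = np.array([1.0, 0.5, 0.0])
d_n = np.array([1.0, 0.0, 0.0])
d_m = np.array([1.0, 0.05])    # second entry: unknown phenotype, coded as k
k = 0.05

ones = np.ones(3)
phi_inv_one = np.linalg.solve(phi_n, ones)
s = ones @ phi_inv_one
p_hat = (phi_inv_one @ g_n) / s

# Left-hand side: weight vector applied to observed genotypes.
t = (d_n - k) + np.linalg.solve(phi_n, phi_nm @ (d_m - k))
v = t - phi_inv_one * (ones @ t) / s
lhs = v @ g_n

# Right-hand side: observed residual products plus BLUP-imputed ones.
g_hat_m = p_hat + phi_nm.T @ np.linalg.solve(phi_n, g_n - p_hat * ones)
rhs = (d_n - k) @ (g_n - p_hat) + (d_m - k) @ (g_hat_m - p_hat)

print(lhs, rhs)
```

The individual with unknown phenotype contributes nothing to either side, since the centered phenotype is zero.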

2.6. How much power does BLUP imputation of missing genotypes add to the analysis?

To assess the effect, on the analysis, of BLUP imputation of missing genotypes, we perform analytical power calculations using a noncentral $\chi^2_1$ approximation to the alternative distribution of the MQLS statistic of equation (9). We illustrate this computation in two examples. In each case, we obtain a noncentrality parameter $\lambda$, and then calculate power at level $10^{-3}$ as $1 - R_{\lambda,1}(K)$, where $R_{\lambda,1}$ is the cumulative distribution function of a noncentral $\chi^2_1$ with noncentrality parameter $\lambda$, and $K \approx 10.82757$ is the upper $10^{-3}$ quantile of a $\chi^2_1$ distribution, i.e., $1 - R_{0,1}(K) = 10^{-3}$.
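The power calculation can be sketched with the standard library alone. A noncentral chi-square variable with 1 degree of freedom and noncentrality $\lambda$ has the distribution of $(Z + \sqrt{\lambda})^2$ for standard normal $Z$, so its upper tail reduces to two normal tail probabilities. The function names below are hypothetical; the critical value is the $K$ quoted above.

```python
from math import erf, sqrt

def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_ncx2_1df(lam, crit=10.82757):
    """P(X > crit) for X noncentral chi-square, 1 df, noncentrality lam."""
    root_k, root_lam = sqrt(crit), sqrt(lam)
    # X = (Z + sqrt(lam))^2 exceeds crit when Z + sqrt(lam) is above sqrt(crit)
    # or below -sqrt(crit).
    upper = 1.0 - std_normal_cdf(root_k - root_lam)
    lower = std_normal_cdf(-root_k - root_lam)
    return upper + lower

for lam in (0.0, 5.0, 10.82757, 20.0):
    print(lam, power_ncx2_1df(lam))
```

At $\lambda = 0$ the function returns the nominal level, and at $\lambda = K$ the power is essentially one half, which gives a quick sanity check on any noncentrality value.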

Example 1. Sib pairs

In this example, we assume that there are $f$ sampled families, with each family consisting of a sib pair whose phenotypes are known. We assume that, in the resulting sample of size $2f$, the proportion of affected individuals is $\mu$, and the correlation of affection status between sibs in the study is $\rho$. (Note that both $\mu$ and $\rho$ reflect ascertainment. For example, if discordant sib pairs were preferentially ascertained, then $\rho$ could be negative in the sample, even if there were positive sib-sib correlation for the trait in the general population.) For all 3 cases below, we use the mean model given in equation (11). We define a value $s$, which we call the scaled genetic effect, by $s = \sigma^{-2}(1 - \mu)\gamma^2$, and we assume $s > 0$. All the noncentrality parameters we calculate are proportional to $s$.

Case 1. Everyone genotyped

When all 2f individuals are both phenotyped and genotyped at the marker being tested, then we have n = 2f, m = 0, and

$$\Phi_N = I_f \otimes \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}. \tag{13}$$

Plugging into equation (10), we obtain $v_{\mathrm{complete}} = D_N - \mu\mathbf{1}_{2f}$. Then the noncentrality parameter is

$$\lambda_{\mathrm{complete}} = s(2 + \rho),$$

where s is defined in the previous paragraph.

Case 2. One sib in each pair genotyped, BLUP used

In this case, both sibs in each pair are phenotyped, and the sib who has missing genotype in each pair is chosen at random. Then $n = m = f$, $\Phi_N = I_f$, $\Phi_{N,M} = \frac{1}{2}I_f$, and, plugging into equation (10), we obtain $v_{\mathrm{partial}} = D_N - \mu\mathbf{1}_f + \frac{1}{2}(D_M - \mu\mathbf{1}_f)$. Then the noncentrality parameter that results is

$$\lambda_{\mathrm{partial}} = s\left(\tfrac{5}{4} + \rho\right).$$

Case 3. One sib in each pair genotyped, missing sib discarded

In this case, although both sibs' phenotypes are observed, in each family, the sib with missing genotype is discarded from the analysis. This is in contrast to Case 2, in which that individual is still included. The calculation of $E(G_N|D)$ in case 3 is exactly the same as for case 2. However, for calculation of $v_{\mathrm{dropped}}$, we use $n = f$, $m = 0$, and $\Phi_N = I_f$. Then, plugging into equation (10), we obtain $v_{\mathrm{dropped}} = D_N - \mu\mathbf{1}_f$. The resulting noncentrality parameter is $\lambda_{\mathrm{dropped}} = s(1 + \rho/2)^2$.

By comparison of cases 1, 2, and 3 for sib pairs, we see that $\lambda_{\mathrm{complete}} > \lambda_{\mathrm{partial}} \ge \lambda_{\mathrm{dropped}}$, with equality between $\lambda_{\mathrm{partial}}$ and $\lambda_{\mathrm{dropped}}$ only when $\rho = 1$ or $-1$. The greatest difference between $\lambda_{\mathrm{partial}}$ and $\lambda_{\mathrm{dropped}}$ occurs when $\rho = 0$. This makes sense because $\rho = 1$ or $-1$ corresponds to the case when the phenotype information on the ungenotyped individuals is completely redundant and provides no new information, while $\rho = 0$ corresponds to the phenotype information on the ungenotyped individuals being maximally informative (or at least, not well-predicted based on the phenotypes of their sibs). Power plots examining these three cases are given in Results.

Example 2. Sib quartets

The reason for considering the sib quartets example is that it is similar to the sib pairs example in many respects, except that, with two typed sibs per family, there is more information available on the missing genotypes, so the BLUP imputation might be expected to be more informative. We assume that there are $f$ sampled families with each family consisting of a sib quartet whose phenotypes are known. In the resulting sample of $4f$ individuals, we assume that the proportion of affected individuals is $\mu$, and the correlation of affection status between sibs in the study is $\rho$. For the sake of comparison, we use the same scaled genetic effect, $s = \sigma^{-2}(1 - \mu)\gamma^2$, that is used in the sib pair example.

Case 1. Everyone genotyped

When all 4f individuals are both phenotyped and genotyped at the marker being tested, then we have n = 4f, m = 0, and

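$$\Phi_N = I_f \otimes \begin{pmatrix} 1 & 1/2 & 1/2 & 1/2 \\ 1/2 & 1 & 1/2 & 1/2 \\ 1/2 & 1/2 & 1 & 1/2 \\ 1/2 & 1/2 & 1/2 & 1 \end{pmatrix}. \tag{14}$$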

We obtain $v_{\mathrm{complete}} = D_N - \mu\mathbf{1}_{4f}$, and $\lambda_{\mathrm{complete}} = s(4 + 6\rho)$. Note that with four sibs, if every pair has correlation $\rho$, then we have the constraint $-1/3 \le \rho \le 1$, so $\lambda_{\mathrm{complete}} \ge 2s > 0$.

Case 2. Two sibs in each quartet genotyped, BLUP used

In this case, all four sibs in each quartet are phenotyped, and the 2 sibs who have missing genotype in each quartet are chosen at random. Then n = m = 2f, ΦN has the same form as in equation (13), and

$$\Phi_{N,M} = I_f \otimes \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}. \tag{15}$$

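We obtain $v_{\mathrm{partial}} = D_N - \mu\mathbf{1}_{2f} + \frac{1}{3}(D_M + D_M^* - 2\mu\mathbf{1}_{2f})$, where

$$D_M^* = (D_{M,2}, D_{M,1}, D_{M,4}, D_{M,3}, \ldots, D_{M,2f}, D_{M,2f-1})^T \tag{16}$$

is a permutation of the vector $D_M$ such that the two ungenotyped sibs in each family have their phenotypes interchanged. Then $\lambda_{\mathrm{partial}} = s(8 + 17\rho)/3$.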

Case 3. Two sibs in each quartet genotyped, missing sibs discarded

In this case, although all four sibs' phenotypes are observed, in each family, the two sibs with missing genotypes are discarded from the analysis. We calculate $E(G_N|D)$ in the same way as in case 2. However, for calculation of $v_{\mathrm{dropped}}$, we use $n = 2f$, $m = 0$, and the same $\Phi_N$ as in equation (13). Then $v_{\mathrm{dropped}} = D_N - \mu\mathbf{1}_{2f}$, and $\lambda_{\mathrm{dropped}} = s(2 + 3\rho)^2/(2 + \rho)$.

By comparison of Cases 1, 2, and 3 for sib quartets, we see $\lambda_{\mathrm{complete}} > \lambda_{\mathrm{partial}} \ge \lambda_{\mathrm{dropped}}$, with equality between $\lambda_{\mathrm{partial}}$ and $\lambda_{\mathrm{dropped}}$ only when $\rho = 1$ (recall the constraint $-1/3 \le \rho \le 1$ for sib quartets). Power plots for all cases are given in Results.
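The ordering of the three cases can be explored numerically with a moment-based sketch. It rests on assumptions that are not spelled out in the text: sample moments of the centered phenotypes are replaced by their expectations (unit variance, pairwise correlation $\rho$, after scaling), the noncentrality is approximated by $(E[V^Ta])^2/E[V^T\Phi_NV]$ with $a$ the kinship-weighted phenotype residual of the mean model (11), and all constants common to the designs are absorbed into $s$. The function name is hypothetical; the returned value is the factor multiplying $s$.

```python
import numpy as np

def scaled_noncentrality(size, genotyped, rho, use_blup=True):
    """Approximate lambda / s for a sibship of `size` with `genotyped` members."""
    idx = list(genotyped)
    kin = 0.5 * (np.eye(size) + np.ones((size, size)))            # 1 on diagonal, 1/2 between sibs
    cov = (1.0 - rho) * np.eye(size) + rho * np.ones((size, size))  # phenotype correlation
    phi_n = kin[np.ix_(idx, idx)]
    mean_dir = kin[idx, :]            # a = mean_dir @ x, x = centered phenotypes of all sibs
    if use_blup:
        weights = np.linalg.solve(phi_n, mean_dir)   # ungenotyped sibs kept in the weights
    else:
        weights = np.eye(size)[idx, :]               # ungenotyped sibs discarded
    num = np.trace(weights.T @ mean_dir @ cov)            # E[V' a]
    den = np.trace(weights.T @ phi_n @ weights @ cov)     # E[V' Phi_N V]
    return num ** 2 / den

rho = 0.2
quartet = (scaled_noncentrality(4, range(4), rho),
           scaled_noncentrality(4, [0, 1], rho, use_blup=True),
           scaled_noncentrality(4, [0, 1], rho, use_blup=False))
pair = (scaled_noncentrality(2, [0, 1], rho),
        scaled_noncentrality(2, [0], rho, use_blup=True),
        scaled_noncentrality(2, [0], rho, use_blup=False))
print(quartet, pair)
```

For quartets with everyone genotyped, the sketch returns $4 + 6\rho$, in agreement with Case 1 above, and for both designs the BLUP-based value lies between the complete-data and dropped values.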

3. Results

Figure 1 shows power results, at significance level $10^{-3}$, for $f$ sampled families, where each family is either a sib pair or a sib quartet. In each case, the solid line represents the situation when all members of the sibship are available to be analyzed (Case 1 above). The dotted line represents the situation when all individuals are phenotyped, but only half the members of the sibship are genotyped, and the individuals with missing genotype are simply dropped from the analysis (Case 3 above). The dashed line represents the situation when phenotype data are available on all members of the sibship, but genotype data are available on only half the members, and the MQLS is used to analyze the data, which is equivalent to BLUP imputation of the missing genotypes for the other half of the sibship (Case 2 above). We expect the dashed line (BLUP imputation) to be intermediate between the dotted line (individuals with missing genotype discarded) and the solid line (complete information on the individuals), and it is of interest to get an idea of how much of the full information can be recovered by the BLUP, in the context of association testing.

FIG. 1.

How much power for association is recovered by using BLUPs of missing genotypes? Approximate power at level $10^{-3}$, as calculated using a non-central chi-square approximation, vs. scaled genetic effect, given by $\sigma^{-2}(1 - \mu)\gamma^2$, for sib pairs or sib quartets and for different values of the correlation between sampled sib phenotypes, where this correlation would depend on ascertainment. In each plot, the solid line represents the situation in which the phenotypes and genotypes of all sibs are observed, representing the gold standard of perfect recovery of missing genotypes. The dashed line represents the situation in which the genotypes of half of the sibs (1 sib in a sib pair or 2 sibs in a sib quartet) are not observed, and they are incorporated into the MQLS statistic, which is equivalent to BLUP imputation. The dotted line represents the situation in which the genotypes of half the sibs are not observed, and the ungenotyped sibs are simply removed from the analysis.

The sib quartet study design represents a larger sample, so it is to be expected that power should be higher than for the sib pair study design. In addition, we can see that the power is higher overall when the study design is such that the correlation between sampled sibs' phenotypes is higher. One explanation for this is the enrichment effect: for complex traits, an affected individual with an affected sibling is more likely to carry a particular variant associated with the phenotype than is an affected individual with an unaffected sibling. By the same token, an unaffected individual with an unaffected sibling is more likely to carry a particular protective variant than is an unaffected individual with an affected sibling. Therefore, we might expect to have higher power to detect a genetic effect when we sample individuals who have relatives with similar trait values. By comparing the results from complete data on the sibship (solid lines) to the results that ignore half of the sibs (dotted lines) for different values of the correlation, we see that the higher the phenotype correlation between the sibs, the less important it is to have the missing sib(s) in the analysis (solid and dotted lines get closer). This also seems reasonable, because if the missing sibs' phenotype(s) are well-predicted by the observed sibs' phenotype(s), then there should be less new information gained by including the additional sib(s).

Finally, we can see that in all 6 cases, the use of the BLUP imputation for missing genotypes (i.e., MQLS test) can provide a moderate power increase over ignoring individuals with missing genotypes. In particular, the MQLS method seems more effective with the sib quartet design, which is to be expected, because in that case there is more information available, on related individuals, to predict the missing genotypes.

4. Discussion

We describe an interesting connection between the MQLS method, for case-control association testing in samples of related individuals, and the imputation of missing genotypes by the best linear unbiased predictor based on relatives' genotypes. In examples, we show that the use of BLUP to predict missing genotypes can add a reasonable amount of power to detect association. The amount of power added is higher when there are more typed relatives available to improve the prediction of missing genotypes.

The BLUP imputation described here is single-point. In contrast, most current genotype imputation methods (Scheet and Stephens, 2006; Browning and Browning, 2007) use information across many markers. However, with related individuals, imputed genotypes are dependent among relatives, where the dependence among imputed genotypes differs from the ordinary dependence among genotypes and is affected by the type and amount of information available for each individual. For association mapping, this complex dependence among imputed genotypes would need to be taken into account in the analysis in order to construct a valid test for association. A key feature of the single-point BLUP imputation we describe here is that the dependence among imputed genotypes is exactly taken into account in the construction of the MQLS test, in a way that is fast and computationally feasible even for large, inbred pedigrees.

The BLUP that we use is constructed assuming that there is no population structure beyond that captured by $\Phi$. Because the genotype prediction for an individual is based on genotypes of close relatives, one would expect it to be robust to mild population structure. The main difficulty for the BLUP would seem to be the possibility that the BLUE of allele frequency, $\hat{p}$, which is the centering value for the BLUP, could be inappropriate in cases of highly differentiated markers, when the allele frequencies are very different in different subpopulations represented in the sample. If information on population structure were available (e.g., in the form of structure-capturing vectors), then this information could be used to replace the vectors $\hat{p}\mathbf{1}_m$ and $\hat{p}\mathbf{1}$ of equation (6) by vectors in which the entry for the $i$th individual is an ancestry-specific estimated allele frequency. Alternatively, with mild population structure, in the context of case-control association testing, the ROADTRIPS method (Thornton and McPeek, 2010) could be used. ROADTRIPS is a more robust form of MQLS in which an estimated structure matrix $\hat{\Psi}$ is used to correct the variance of the test statistic for misspecified relationships in $\Phi$ as well as for mild population structure.

5. Appendix

5.1. Proof that the BLUP is given by equations (5) and (6) with variance in equation (7)

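As mentioned in subsection 2.1, we need to find $\tilde{R}$ minimizing $E\{[c^T(G_M - \tilde{R}G_N)]^2\}$ for every $m$-vector $c$, subject to condition (4), which is $E(G_M - \tilde{R}G_N) = 0$. Note that condition (4) implies $E[c^T(G_M - \tilde{R}G_N)] = 0$, so we have $E\{[c^T(G_M - \tilde{R}G_N)]^2\} = \mathrm{Var}[c^T(G_M - \tilde{R}G_N)]$. Define $R$ by equation (5). We can trivially write $c^T(G_M - \tilde{R}G_N) = c^T(G_M - RG_N) + c^T(R - \tilde{R})G_N$.

Claim

$\mathrm{Cov}[c^T(G_M - RG_N),\ c^T(R - \tilde{R})G_N] = 0$.

Proof

We have $\mathrm{Cov}[c^T(G_M - RG_N),\ c^T(R - \tilde{R})G_N] = \sigma^2c^T(\Phi_{N,M}^T - R\Phi_N)(R - \tilde{R})^Tc$. Applying equation (5), we get $\Phi_{N,M}^T - R\Phi_N = -(\mathbf{1}_m - \Phi_{N,M}^T\Phi_N^{-1}\mathbf{1})(\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T$. Condition (4) and equation (5) imply $(R - \tilde{R})\mathbf{1} = 0$, so we have $\mathbf{1}^T(R - \tilde{R})^Tc = 0$, which proves the claim.

Thus, $\mathrm{Var}[c^T(G_M - \tilde{R}G_N)] = \mathrm{Var}[c^T(G_M - RG_N)] + \mathrm{Var}[c^T(R - \tilde{R})G_N]$. This is minimized for every $m$-vector $c$ by $\tilde{R} = R$. When $\Phi_N$ is invertible, $R$ is the unique minimizer, and the unique BLUP is given by $RG_N$. (When the set of individuals with observed genotypes contains both members of one or more MZ twin pairs, $\Phi_N$ is not invertible, and a minimizer $R$ can be obtained by replacing $\Phi_N^{-1}$ with $\Phi_N^{-}$ in equation (5). In that case, $R$ is not the unique minimizer, but any other minimizer $R^*$ satisfies $R^*G_N = RG_N$, provided that the two members of any MZ twin pair have identical genotypes, so $RG_N$ is still the unique BLUP.) Expression (6) follows immediately from expression (5), using $(\mathbf{1}^T\Phi_N^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_N^{-1}G_N = \hat{p}$, where both $\hat{p}$ and $\mathbf{1}^T\Phi_N^{-1}\mathbf{1}$ are scalars. Expression (7) follows from the fact that $\mathrm{Var}(\hat{G}_M) = R\,\mathrm{Var}(G_N)R^T = \sigma^2R\Phi_NR^T$.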

5.2. Proof of equation (12)

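We have $V^TG_N = [(D_N - k\mathbf{1}) + \Phi_N^{-1}\Phi_{N,M}(D_M - k\mathbf{1}_m)]^T(G_N - \hat{p}\mathbf{1}) = (D_N - k\mathbf{1})^T(G_N - \hat{p}\mathbf{1}) + (D_M - k\mathbf{1}_m)^T(\hat{G}_M - \hat{p}\mathbf{1}_m)$, which is the first line of equation (12). The second line of equation (12) follows by noting that $D_j - k = 0$ for every unphenotyped individual $j$, so that $(D_N - k\mathbf{1})^T(G_N - \hat{p}\mathbf{1}) = \sum_{j \in L \cap N}(D_j - k)(G_j - \hat{p})$, and similarly for $(D_M - k\mathbf{1}_m)^T(\hat{G}_M - \hat{p}\mathbf{1}_m)$.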

Acknowledgments

This study was supported in part by the National Institutes of Health (grant R01 HG001645).

Disclosure Statement

No competing financial interests exist.

References

  1. Bourgain C., Hoffjian S., Nicolae R., et al. Novel case-control test in a founder population identifies P-selectin as an atopy susceptibility locus. Am. J. Hum. Genet. 2003;73:612–626. doi: 10.1086/378208.
  2. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
  3. Devlin B., Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
  4. McPeek M.S., Wu X., Ober C. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004;60:359–367. doi: 10.1111/j.0006-341X.2004.00180.x.
  5. Rabinowitz D., Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum. Hered. 2000;50:211–223. doi: 10.1159/000022918.
  6. Risch N., Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res. 1998;8:1273–1288. doi: 10.1101/gr.8.12.1273.
  7. Scheet P., Stephens M. A fast and flexible statistical method for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 2006;78:629–644. doi: 10.1086/502802.
  8. Slager S., Schaid D.J. Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am. J. Hum. Genet. 2001;68:1457–1462. doi: 10.1086/320608.
  9. Thornton T., McPeek M.S. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am. J. Hum. Genet. 2007;81:321–337. doi: 10.1086/519497.
  10. Thornton T., McPeek M.S. ROADTRIPS: Case-control association testing with partially or completely unknown population and pedigree structure. Am. J. Hum. Genet. 2010;86:172–184. doi: 10.1016/j.ajhg.2010.01.001.
  11. Wang Z., McPeek M.S. An incomplete-data quasi-likelihood approach to haplotype-based genetic association studies on related individuals. JASA. 2009;104:1251–1260. doi: 10.1198/jasa.2009.tm08507.

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.
