Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2012 Jun 1.
Published in final edited form as: Biometrics. 2010 Jun 16;67(2):331–343. doi: 10.1111/j.1541-0420.2010.01454.x

Unbiased and Locally Efficient Estimation of Genetic Effect on Quantitative Trait in the Presence of Population Admixture

Yuanjia Wang 1,, Qiong Yang 2, Daniel Rabinowitz 3
PMCID: PMC2948587  NIHMSID: NIHMS205250  PMID: 20560930

Abstract

Population admixture can be a confounding factor in genetic association studies. Family-based methods (Rabinowitz and Laird 2000) have been proposed in both testing and estimation settings to adjust for this confounding, especially in case-only association studies. The family-based methods rely on conditioning on the observed parental genotypes or on the minimal sufficient statistic for the genetic model under the null hypothesis. In some cases these methods do not capture all the available information due to the conditioning strategy being too stringent. General efficient methods to adjust for population admixture that use all the available information have been proposed (Rabinowitz 2002). However these approaches may not be easy to implement in some situations. A previously developed easy-to-compute approach adjusts for admixture by adding supplemental covariates to linear models (Yang et al. 2000). Here is shown that this augmenting linear model with appropriate covariates strategy can be combined with the general efficient methods in Rabinowitz (2000) to provide computationally tractable and locally efficient adjustment. After deriving the optimal covariates, the adjusted analysis can be carried out using standard statistical software packages such as SAS or R. The proposed methods enjoy a local efficiency in a neighborhood of the true model. The simulation studies show that non-trivial efficiency gains can be obtained by using information not accessible to the methods that rely on conditioning on the minimal sufficient statistics. The approaches are illustrated through an analysis of the influence of apolipoprotein E (APOE) genotype on plasma low density lipoprotein (LDL) concentration in children.

Keywords: Genetic association study, Family-based study, Population stratification

1 Introduction

Association studies are used to locate genetic locus influencing a trait of interest by evaluating association between allelic variability at a candidate locus and a trait. These association-based methods are especially useful for evaluating genetic factors with small to moderate effects (Risch and Merikangas 1996). However, it is well known that population admixture can be a confounding factor for association-based methods (see for example, Elston 1998). When the study sample consists subjects drawn from subpopulations with different allele frequencies and trait distributions, spurious association may be detected even when the trait and the gene are not biologically linked.

Family-based association studies have been proposed to adjust for population admixture. The transmission disequilibrium test (TDT) and its extensions (Falk and Rubinstein 1987; Terwilliger and Ott 1992; Spielman, McGinnis, and Ewens 1993; Spielman and Ewens 1998; Lazzeroni and Lange 1998) examine the transmission of parental alleles to the offsprings in a family given the parental genotypes. By conditioning on the parental genotypes, the bias due to population admixture is avoided. However when the parental genotypes are not all observed, these tests cannot be applied. Applying TDT tests to restricted data sets of families with complete parental genotypes can result in bias (Curtis and Sham 1995). Rabinowitz and Laird (2000) proposed a general approach to adjust for population admixture by comparing the test statistics to their conditional distributions given the minimal sufficient statistics for the genetic model under the null hypothesis which does not require complete parental genotype information.

When the association between a trait and a locus has been established by testing, it may be desirable to estimate the form and strength of the trait-genotype association and to evaluate the interaction between genotypes and other environmental factors. Yang et al. (2000) proposed an approach to estimate candidate gene effect in a linear model that is not affected by spurious association. In this approach, the paradigm of conditioning on minimal sufficient statistics is achieved through augmenting the standard regression models with appropriate additional covariates. The approach is computationally convenient: after obtaining additional covariates, the analysis can be carried out by standard statistical packages such as SAS or R.

The approaches proposed in Rabinowitz and Laird (2000) for testing and in Yang et al. (2000) for estimation do not capture all of the available information (the minimal sufficient statistic is not always complete). Rabinowitz (2002) proposed a general framework to develop efficient test statistic that exploits all of the available information that is not potentially confounded by population stratification. Whittemore (2004) proposed a general framework for efficient estimation functions which protects against population stratification. Allen et al. (2005) developed locally efficient estimation of haplotype-disease association in case-parent trio designs that is robust to confounding.

Here is shown that the approach of eliminating bias by adding supplemental covariates to a linear model in Yang et al. (2000) can be combined with the method of deriving efficient estimating equations in Whittemore (2004) or Rabinowitz (2002) to obtain a computationally tractable estimation approach that is efficient but not confounded by population admixture. The optimal supplemental covariates are obtained through matrix algebra calculations. In the cases where the family-specific effects are absent, the additional covariates have closed form expressions. The proposed methods enjoy a local efficiency in a neighborhood of the true model. We use simulation studies to investigate unbiasedness and efficiency of the methods under conditions including violation of assumptions and departure from the true model. The simulation results show that non-trivial efficiency gains can be obtained by using information not accessible to methods that rely on conditioning on the minimal sufficient statistics. The approaches are illustrated through an analysis of the influence of apolipoprotein E (APOE) genotype on plasma low density lipoprotein (LDL) concentration in children.

2 Methods

In this section linear models for quantitative traits are introduced and optimal additional covariates to be included in the models are derived. Two models are considered: one does not involve family-specific terms and the other includes random family-specific effects.

2.1 Model without family-specific effects

Let Yij denote a quantitative trait of the jth individual in the ith family, and let Gij denote the genotype of the same individual at a candidate locus. A simple linear model relating Yij to Gij is

Yij=X(Gij)β+εij, (1)

where X(Gij) is a coding for the genotype, and β is the effect of the genotype on the trait. For example, for a recessive trait, X(Gij) can take value one for subjects carrying two copies of the disease allele and value zero for subjects carrying zero or one copy of the disease allele. The εij are residual effects other than the genotypes under examination which may be environmental factors that are independent of the genotypes or genetic factors at unlinked loci. When there is no population admixture, εij are independent of Gij and the usual least square estimate of β is unbiased. However, when there presents population admixture, the membership of subpopulation is part of the residual effects εij. Since the subpopulation membership influences the genotype distribution in the subpopulation, εij and Gij are correlated so that the ordinary least square estimate of β is biased.

To motivate the derivation of efficient estimator of β when the genotypes and the residual effects are not independent, it is useful to review the method proposed in Yang et al. (2000). The validity of Yang et al. (2000) relies on the fact that even though Gij and εij in model (1) may be marginally correlated, they are conditionally independent, given the minimal sufficient statistic for the genetic model under the null hypothesis. In Yang et al. (2000), after adding the conditional expectation of X(Gij) given the minimal sufficient statistics as additional covariates to the model, the least square estimate for β is unbiased even when population admixture is present. In the current work, the form of the optimal additional covariates that leads to efficient estimator of β is unknown, and is derived from the a constrained optimization problem.

Similar to Yang et al. (2000), the key assumption underlies the proposed approaches is that although the genotype-related covariates and the residuals are correlated due to population admixture, they are conditionally independent given the founder genotypes. As noted in Yang et al. (2000), this assumption corresponds to the transmission of parental alleles to the offspring generation being independent to any other factors that influence the trait, given the parental genotypes. While this assumption is surely an approximation of the biological truth, it is the basis for other methods adjusting for population admixture in family-based association studies (for example TDT and FBAT).

Let Uij denote the unknown additional covariate for the jth individual from the ith family, let n denote the total number of families, let ni denote the number of subjects in the ith family, and let N denote the total number of subjects. A linear model relating a trait to a genotype with additional covariates is

Yij=X(Gij)β+Uijγ+εij.

Without loss of generality, the above model can be written as

Yij=(X(Gij)Uij)β+Uijγ+εij, (2)

where γ̃ = −β + γ. In standard multiple regression analysis, the least square estimates of a subset of regression coefficients can be acquired by first regressing the covariates of these coefficients on the covariates of the remaining coefficients, taking the residuals, and then regressing the response variable on the residuals. We apply this observation to model (2), where we are interested in estimating β. When the covariates X(Gij) − Uij are uncorrelated with Uij (in terms of expectation), the residuals of regressing X(Gij) − Uij on Uij are X(Gij) − Uij themselves. Therefore the least square estimate of β can then be written as

β^=[(XU)T(XU)]1(XU)TY, (3)

where the unknown covariates U satisfy

E(XU)TU=0. (4)

Here Xi = (X(Gi1), ···, X(Gini))T, Ui = (Ui1, ···, Uini)T, X = (X1, ···, Xn)T, U = (U1, ···, Un)T, and Y = (Y1, ···, Yn)T.

Let Gi denote the genotypes of founders in the ith family. Note that Gi may not be completely observed. An important assumption ensuring the validity of the proposed methods analogous to that in Yang et al. (2000) is that even though Gij and εij may be marginally dependent, they are conditionally independent given Gi. It is shown in the appendix that under this assumption, the expectation of β̂ in (3) is

β+E{[(XU)T(XU)]1iE[(XiUi)T1niGi]E[εiGi]}, (5)

and the variance of β̂ is

σ2E[(XU)T(XU)1], (6)

where 1ni is the ni × 1 vector of one. From (5), the condition for obtaining unbiased estimate of β is then

E[(XiUi)T1niGi]=0,i=1,,n. (7)

It follows that the optimal unbiased estimator for β can be obtained by minimizing (6) under the constraints (4) and (7).

It is convenient to introduce some notations to describe the solution to the above constrained maximization problem. Let ϑi denote all possible founder genotypes compatible with that observed in the ith family, and let ϑi denote all possible combinations of offspring genotypes in the ith family. Let ci denote the dimension of ϑi and let di denote the dimension of ϑi. Let Wi denote the ci × ci diagonal matrix with the diagonal entries given by the (possibly misspecified) probabilities of founder genotypes, P(g). In reality, Wi are usually computed from observed founder genotypes. Let Zi denote the ci × di matrix with the (g,g) entry given by the conditional probability of an offspring genotype given the founder genotypes, P(g|g). Let Xi denote the ni × ci matrix with rows index individuals in a family and columns index components in ϑi; that is, the kth row in the matrix is (Xik(g1), ···, Xik(gci)). Let Vi denote the ni × ci matrix with the (m,g)th component being Uim(g). The corresponding matrices Wi, Zi, Xi and Vi for an example pedigree are given in the appendix. With these notations, it is shown in the appendix that the solution to the constrained optimization problem of minimizing (6) subject to (4) and (7) is

Vi=XiZi(ZiTWi1Zi)1ZiTWi1. (8)

The additional covariates Ui can be picked from the rows of Vi that correspond to the observed genotypes in members of the ith family. A simple example illustrating the computations is presented in the appendix.

We use family-specific residual sums to estimate the variance of β̂. By the conditions (4) and (7), the estimating equation for β̂,

i(XiUi)T[Yi(XiUi)β],

has expectation zero. Since the family-specific residual terms are independent, the variance of the solution to this estimating equation is (Cox and Hinkley 1979)

i[(XiUi)T(Yi(XiUi)β^)]2[i(XiUi)T(XiUi)]2. (9)

One advantage of the proposed estimator is that it is unbiased even when the founder genotype distributions Wi are misspecified. To see this, note that the condition for obtaining unbiasedness (7) holds regardless of whether the marginal probabilities Wi are correctly specified. To be specific, denote the misspecified Wi as Wi. As derived in the appendix, the solutions for the corresponding Ui are the components of

XiZi(ZiTWi1Zi)1ZiTWi1.

By the introduced notations, taking the conditional expectation of a random variable amounts to multiplying it by components of Zi. Therefore, from

E[(XiUi)T1niGi]=1niTZiTXiT1ni1niTZiTWi1Zi(ZiTWi1Zi)1ZiTXiT1ni=1niTZiTXiT1ni1niTZiTXiT1ni=0,

it follows that the unbiasedness condition (7) is satisfied with misspecified Wi (denoted as Wi) and misspecified Ui (denoted as Ui). Although in this case the adjusted estimator remains unbiased, it is not efficient. We study the magnitude of efficiency loss due to misspecification of W by simulations in section 3.

2.2 Including family-specific effects

In some situations there may exist family-specific effects influencing a trait. The linear model in this case can be expressed as:

Yij=X(Gij)β+αi+εij, (10)

where αi is a random family-specific factor. Due to population stratification, αi may not be independent of Gij, but may be independent of the subject-specific residual effects εij. An assumption ensuring validity of the proposed methods in this case is that even though Gij and αi may be marginally correlated, they are conditionally independent given Gi.

When there are family-specific effects, to obtain the optimal estimator, weighted least square

β^=[(XU)T1(XU)]1(XU)T1Y (11)

should be used, where Σ is the conditional covariance matrix of Y given G. From the appendix, the expectation of β̂′ is

β+E{[(XU)T1(XU)]1iE[(XiUi)Ti11niGi]E[αiGi]}+E{[(XU)T1(XU)]1iE[(XiUi)Ti11niGi]E[εijGi]}, (12)

where Σi is the conditional covariance of the observations from the ith family given the founder genotypes, cov(YiYiTGi). The unbiasedness condition analogous to (7) should be modified as

E[(XiUi))Ti11niGi]=0,i=1,,n. (13)

The constraint analogous to (4) should be modified as

E(XU)T1U=0. (14)

In addition, we show in the appendix that the variance to be minimized is

E(XU)T1(XU). (15)

Now define

Xi=i12Xi,Ui=i12Ui,Yi=i12Yi,and1ni=i121ni.

With these notations, the constraint (13) becomes (7), the constraint (14) becomes (4) and the minimization term (6) becomes (15) with Xi, Ui and 1ni replaced as Xi,Ui, and 1ni. Consequently, carrying out the same constrained optimization procedure for model (1) using the newly defined variables leads to the solution

Vi=XiZi(ZiTWi1Zi)1ZiTWi1. (16)

In practice, we can fit a linear mixed effects model that has a random family-specific effect and include the original supplemental covariates without considering the family-specific effects (the β̂ obtained from fitting a linear mixed effects model is estimated by weighted least square).

It is worthy to mention that population admixture behaves as a source of family-specific effects (Fulker et al. 1999; Abecasis et al. 2000). The weighted least square approach therefore provides efficiency gain over the ordinary least square even when there are no additional family-specific effects, αi. We compare the two approaches by simulations. For the weighted least square, the variance can be estimated by

i[(XiUi)T^i1(Yi(XiUi)β^)]2[i(XiUi)T^i1(XiUi)]2. (17)

3 Simulation Studies

In this section, we use extensive simulation studies to evaluate properties of the proposed methods. We examine the effect of misspecifying marginal probabilities in W and the influence of the family-specific effects. We also compare the ordinary least square with the weighted least square, and the proposed methods with Yang et al. (2000).

We generated 100 nuclear families each with two children. To simulate population stratification, we drew parental genotypes from a 50:50 admixture of two populations. Parents in the same family were drawn from the same population. The disease allele frequency in each subpopulation was 0.1 and 0.3, so that the marginal disease allele frequency in the whole population was 0.2. The parental genotypes within each subpopulation were simulated based on the Hardy-Weinberg equilibrium. The offspring genotypes were simulated based on Mendelian transmission probabilities. We assumed a dominant effect of the disease allele. We simulated a linear model with different intercept for each population. To investigate the impact of varying severity of population admixture, we simulated several combinations of the intercepts. The intercept in the first population was 5, while in the second population was 10 or 20. The genetic effect was chosen to be 10, 20, 50 or 100. In the models with random family-specific effects, the variances of these effects were 5, 15 or 25. There were 1000 replications in each set of the simulations. We simulated residuals from a normal distribution with mean zero and standard deviation five.

We first examine performance of the proposed methods under the null hypothesis (β = 0). Table 1 shows the type I error rates of test statistics computed using various methods. When the population admixture is moderate, the type I error rates of the unadjusted ordinary least square and unadjusted weighted least square are clearly much higher than the nominal level. For example, for α = 0.05, the type I error is 0.43 for the former and 0.34 for the latter. In contrast, the the proposed adjusted ordinary and weighted least square methods have maintained the desirable error rates. We also investigate the method in Yang et al. (2000) and find its type I error rate to be close to the nominal level. When the population admixture is more severe, the type I error rates of the two unadjusted analyses are substantially higher than the nominal level while the proposed approach and Yang et al. (2000) have maintained the correct α-level.

Table 1.

Type I error rates of various methods

α level Unadjusted OLS Adjusted OLS Unadjusted WLS Adjusted WLS Adjusted Yang
μ1 = 5, μ2 = 10
0.01 0.208 0.007 0.138 0.008 0.007
0.05 0.432 0.041 0.343 0.05 0.043
0.1 0.57 0.091 0.477 0.099 0.09

μ1 = 5, μ2 = 20
0.01 0.83 0.007 0.201 0.012 0.007
0.05 0.952 0.043 0.425 0.053 0.046
0.1 0.975 0.103 0.571 0.105 0.106

Adjusted by method in Yang et al. (2000)

Next we examine performance of various methods under the alternative hypothesis (β ≠ 0). The first set of simulations corresponds to model (1) where there are no family-specific effects. Table 2 summarizes results for the ordinary least square and the weighted least square method. We first examined the properties of the proposed estimator when there was no genetic effect, i.e., β = 0. Under this null model, the unadjusted estimator reported a large spurious genetic effect, that is, the mean β̂ = 1.54 for the ordinary least square and the mean β̂ = 1.39 for the weighted least square analysis. In contrast, the adjusted estimators were very close to zero: the mean β̂ was 0.02 (average empirical SE=1.38) for the ordinary least square and the mean β̂ = 0.02 (average empirical SE=1.37) for the weighted least square. Next we examined the estimator where the true genetic effect was greater than zero (β = 10, 20, 50 or 100). It can be seen that the unadjusted estimates had large bias while the adjusted estimates had negligible bias: the bias of the former ranged from 1.48 to 4.53, while for the latter it ranged from zero to 0.07. The magnitude of the bias for the unadjusted methods increased with the severity of population stratification. When the population admixture was moderate (μ1=5, μ2=10), the bias of the unadjusted least square estimator was around 1.5. When the population admixture was more severe (μ1=5, μ2=20), the bias increased to around 4.5. The bias for the unadjusted weighted least square estimator in each scenario of the population admixture was 1.3 and 2.3, respectively. The magnitude of the bias was similar across all values of the genetic effect for both estimators.

Table 2.

Estimates with correctly specified W: no family-specific effects

Ordinary least square estimates (OLS)
μ1=5, μ2=10 μ1 =5, μ2 =20

True β Unadjusted Mean β̂ Adjusted Mean β̂ Empirical S.E. Estimated S.E. Unadjusted Mean Adjusted Mean β̂ Empirical S.E. Estimated S.E.
0 1.54 0.02 1.38 1.31 4.48 0.02 1.89 1.88
10 11.48 9.98 1.32 1.31 14.53 9.98 1.90 1.87
20 21.53 20.01 1.35 1.31 24.48 19.93 1.88 1.87
50 51.51 50.00 1.33 1.31 54.50 49.99 1.89 1.89
100 101.51 100.05 1.34 1.31 104.43 100.00 1.89 1.87

Weighted least square estimates (WLS)
0 1.39 0.02 1.37 1.32 2.27 0.02 1.45 1.47
10 11.32 9.98 1.32 1.32 12.27 10.02 1.49 1.47
20 21.31 20.01 1.33 1.32 22.26 19.95 1.48 1.47
50 51.35 49.99 1.33 1.32 52.30 50.02 1.48 1.46
100 101.38 100.05 1.32 1.32 102.22 100.05 1.47 1.46

Comparing to Yang et al. (2000)
True β Yang Mean β̂ Empirical S.E. Eff. gain OLS Eff. gain WLS Yang Mean β̂ Empirical S.E. Eff. gain OLS Eff. gain WLS

10 10.32 2.20 40% 40% 10.88 2.91 35% 49%
20 20.33 2.25 40% 41% 20.85 2.88 35% 49%
50 50.3 2.25 41% 41% 50.78 2.93 35% 49%
100 100.2 2.20 39% 40% 100.84 2.83 33% 48%

[SE(Yang)-SE(OLS)]/SE(Yang);

[SE(Yang)-SE(WLS)]/SE(Yang)

Note that when the population admixture was moderate, the standard errors of the ordinary and weighted least square estimators were similar. However, when the population admixture was more substantial (μ1=5, μ2=20), the weighted least square method was more efficient even when there were no family-specific effects. This is because population admixture acts as a source of family-specific effects (Fulker et al. 1999; Abecasis et al. 2000) in which case the weighted least square is more efficient. The efficiency gains of the weighted least square increased with the severity of admixture. When the difference between the intercepts of the two populations was 15, the reduction of the empirical standard error of the estimator was up to 24%. The estimated standard errors were close to the empirical ones.

We compare the efficiency of the proposed methods with Yang et al. (2000), i.e., adding conditional expectations of the genotypes given the minimal sufficient statistics of the null model as additional covariates in the linear model. We see from the bottom panel of Table 2 that the efficiency gains of the proposed methods ranged from 33% to 49%, which were non-trivial. Note that the efficiency gains of the ordinary least square versus the weighted least square were similar when the admixture was moderate (the left panel in Table 2). When the admixture was more severe (the right panel in Table 2), the efficiency gains increased from about 35% in the ordinary least square to about 49% in the weighted least square.

The second set of simulations corresponds to model (10) where there are random family-specific effects. The variance of these effects was fifteen. Again we examined the estimator both under a null model (β = 0) and under several alternative models (β > 0). The same phenomenon of the unadjusted estimators reporting spurious genetic effect when the true β was zero while the adjusted estimators were very close to zero was also observed for this set of simulations. From Table 3, we also see that both the ordinary least square and the weighted least square methods provided unbiased estimates. As expected, the weighted least square estimates were more efficient. When the admixture was moderate, using weighted least square instead of ordinary least square reduced the empirical standard error by up to 10%. When the population admixture was more severe, the corresponding reduction was up to 29%. We noticed larger efficiency gains of the proposed methods over Yang et al. (2000) in this set of simulations (the bottom panel of Table 3). The efficiency gains ranged from 40% to 60%, which are again substantial.

Table 3.

Estimates with correctly specified W: with family-specific effects, var(αi) = 15

Ordinary least square estimates (OLS)
μ1=5, μ2=10 μ1 =5, μ2 =20

True β Unadjusted Mean β̂ Adjusted Mean β̂ Empirical S.E. Estimated S.E. Unadjusted Mean Adjusted Mean β̂ Empirical S.E. Estimated S.E.
0 1.53 0.05 1.57 1.51 4.51 0.02 2.02 2.01
10 11.54 10.03 1.53 1.53 14.53 9.92 2.04 2.01
20 21.44 19.95 1.51 1.52 24.45 20.02 2.04 2.02
50 51.50 49.99 1.51 1.52 54.50 49.92 2.05 2.02
100 101.49 99.91 1.54 1.52 104.49 99.98 2.09 2.02

Weighted least square estimates (WLS)
0 1.07 0.04 1.42 1.40 1.96 0.03 1.42 1.48
10 11.05 10.04 1.41 1.41 11.96 9.96 1.53 1.49
20 21.07 20.0 1.36 1.40 22.04 20.02 1.45 1.48
50 51.07 50.0 1.36 1.40 51.96 49.95 1.50 1.48
100 101.1 99.9 1.40 1.41 102.02 100.08 1.50 1.48

Comparing to Yang et al. (2000)
True β Yang Mean β̂ Empirical S.E. Eff. gain OLS Eff. gain WLS Yang Mean β̂ Empirical S.E. Eff. gain OLS Eff. gain WLS

10 10.50 3.01 49% 53% 10.81 3.59 43% 57%
20 20.23 2.99 49% 55% 20.78 3.65 44% 60%
50 50.24 3.02 50% 55% 50.80 3.61 43% 58%
100 100.23 3.09 50% 55% 100.80 3.49 40% 57%

[SE(Yang)-SE(OLS)]/SE(Yang);

[SE(Yang)-SE(WLS)]/SE(Yang)

In the third set of simulations, we investigate the unbiasedness of β̂ when Wi is misspecified. We analyzed simulated data with correctly specified allele frequency (0.2), moderately misspecified frequency (0.4) and substantially misspecified frequency (0.9). Tables 3 and 4 summarize results under different severity of admixture for models (1) and (10). We see that the unbiasedness holds with misspecified W. The mean bias of both methods ranged from 0.01 to 0.3. There appears to be a small sample bias for the proposed methods when the allele frequency was severely misspecified as 0.4 and 0.9. However, the bias went away when we increased the sample size to 200 families. Specifically, for the ordinary least square, the mean bias decreased from approximately 0.2 (when p=0.4) to 0.04 and from approximately 0.3 (when p=0.9) to 0.05. For the weighted least square, the mean bias decreased from approximately 0.1 (when p=0.4) to 0.02 and from approximately 0.1 (when p=0.9) to 0.03.

Table 4.

Unbiasedness under misspecification of W: no family-specific effects, μ1 = 5, μ2 = 20

Ordinary least square estimates
p=0.2 p=0.4 p=0.9

True β Adjusted Mean β̂ Empirical S.E. Estimated S.E. Adjusted Mean β̂ Empirical S.E. Estimated S.E. Adjusted Mean β̂ Empirical S.E. Estimated S.E.
10 9.94 1.89 1.88 9.73 1.95 1.86 9.80 1.93 1.83
20 19.97 1.94 1.87 19.68 1.90 1.87 19.81 1.94 1.85
50 49.90 1.92 1.87 49.75 1.94 1.88 49.74 1.91 1.84
100 99.99 1.91 1.87 99.70 1.91 1.87 99.70 1.9 1.82

Weighted least square estimates
10 9.94 1.51 1.47 9.93 1.52 1.48 9.94 1.55 1.51
20 20.00 1.49 1.47 49.90 1.51 1.48 49.88 1.53 1.51
50 49.98 1.48 1.47 49.90 1.51 1.48 49.9 1.56 1.5
100 99.93 1.47 1.47 99.90 1.51 1.48 99.90 1.47 1.51

In the fourth set of simulations, we investigate efficiency loss due to misspecification of W. In Tables 5 and 6, we present the empirical standard errors of the point estimates with different scenarios of misspecification. For model (1) where there are no family-specific effects, when the allele frequency was moderately misspecified (p=0.4), the efficiency loss ranged from 1% to 5%, which was moderate. The efficiency loss for the ordinary least square and weighted least square was similar. When the allele frequency was severely misspecified (p=0.9), the efficiency loss ranged from 7% to 13 %. For model (10) where there are family-specific effects, the efficiency loss ranged from 0% to 6%. The magnitudes of the loss were comparable for the moderately and severely misspecified allele frequency.

Table 5.

Unbiasedness under misspecification of W: with family-specific effects, μ1 = 5, μ2 = 20, var(αi) = 25

Ordinary least square estimates
p=0.2 p=0.4 p=0.9

True β Adjusted Mean β̂ Empirical S.E. Estimated S.E. Adjusted Mean β̂ Empirical S.E. Estimated S.E. Adjusted Mean β̂ Empirical S.E. Estimated S.E.
10 9.95 2.20 2.10 9.80 2.10 2.10 9.70 2.10 2.00
20 19.98 2.15 2.11 19.80 2.20 2.10 19.70 2.05 2.00
50 50.00 2.18 2.11 49.80 2.15 2.10 49.80 2.14 2.02
100 99.95 2.20 2.10 99.70 2.18 2.10 99.70 2.10 2.00

Weighted least square estimates
10 10.05 1.56 1.50 9.97 1.52 1.51 9.99 1.52 1.51
20 19.99 1.50 1.48 20.01 1.49 1.50 19.90 1.49 1.52
50 50.02 1.53 1.49 50.02 1.50 1.49 49.90 1.50 1.52
100 99.95 1.52 1.49 99.90 1.53 1..50 99.90 1.48 1.51

Table 6.

Efficiency loss due to misspecification of W: no family-specific effects, μ1 = μ2 = 5

Ordinary least square estimates
p=0.2 p=0.4 p=0.9

True β Adjusted Mean β̂ Empirical S.E. Adjusted Mean β̂ Empirical S.E. Efficiency Loss Adjusted Mean β̂ Empirical S.E. Efficiency loss
10 10.01 1.25 10.04 1.30 4% 10.05 1.34 7%
20 20.01 1.26 19.97 1.27 1% 19.97 1.35 8%
50 50.01 1.27 49.99 1.30 2% 49.99 1.36 7%
100 99.99 1.27 100.03 1.28 1% 100.02 1.40 10%

Weighted least square estimates
10 9.90 1.25 10.04 1.31 5% 10.05 1.35 8%
20 20.01 1.24 19.95 1.27 2% 19.99 1.35 8%
50 50.02 1.24 49.99 1.30 5% 49.99 1.37 10%
100 99.99 1.25 100.02 1.29 3% 100.00 1.41 13%

In the fifth set of simulations, we examine the efficiency loss due to adjusting for population admixture when it is in fact absent. We considered both the cases when the family-specific effects were present and absent. Tables 7 and 8 summarize results under these two scenarios. When there was no admixture and the family-specific effects were absent, the efficiency loss of the adjusted analysis compared to the unadjusted ranged from 34% to 38%. The magnitude of loss was similar for the ordinary and weighted least squares. When there were family-specific effects, the efficiency loss was around 28%. The efficiency loss of the ordinary and the weighted least square was also similar.

Table 7.

Efficiency loss due to misspecification of W: with family-specific effects, μ1 = μ2 = 5, var(αi) = 25

Ordinary least square estimates
p=0.2 p=0.4 p=0.9

True β Adjusted Mean β̂ Empirical S.E. Adjusted Mean β̂ Empirical S.E. Efficiency Loss Adjusted Mean β̂ Empirical S.E. Efficiency loss
10 10.02 1.61 9.99 1.67 4% 10.00 1.64 2%
20 19.96 1.59 19.99 1.66 4% 20.05 1.69 6%
50 50.03 1.63 49.99 1.66 2% 49.94 1.67 2%
100 99.99 1.60 99.95 1.60 0% 100.00 1.64 2%

Weighted least square estimates
10 10.03 1.42 9.99 1.46 3% 10.00 1.49 5%
20 19.98 1.43 20.02 1.49 4% 20.05 1.45 1%
50 50.00 1.43 50.01 1.50 5% 49.96 1.51 6%
100 99.98 1.43 99.95 1.44 1% 100.00 1.50 5%

Table 8.

Efficiency loss due to adjusting for admixture when no admixture is present: no family-specific effects, μ1 = μ2 = 5

Ordinary least square estimates
True β Unadjusted Adjusted Efficiency Loss
Mean β̂ Emp. S.E. Mean β̂ Emp. S.E.
10 10.00 0.73 9.99 1.11 34%
20 19.95 0.79 19.95 1.19 34%
50 49.99 0.77 50.02 1.23 37%
100 100.01 0.71 100.03 1.15 38%

Weighted least square estimates
10 10.00 0.73 9.99 1.11 34%
20 19.95 0.79 19.95 1.19 34%
50 49.99 0.78 50.02 1.23 37%
100 100.01 0.71 100.03 1.15 38%

[SE (Adjusted)−SE (Unadjusted)]/SE (Adjusted)

In the sixth set of simulations, we investigate the influence of departure from constant family-specific effects model. In Table 10, we compare the empirical standard errors of β̂ under varying values of the variance of the family-specific effects. The variance of αi in each model was 0, 15 or 25. For the ordinary least square method, the standard error of β̂ increased with increasing variance of the family-specific effects, and the loss of efficiency ranged from 8% to 17%. For the weighted least square method, the standard errors of the estimators were similar regardless of the value of the family-specific variance (efficiency loss up to 3%). In other words, the family-specific effects have little influence on the efficiency of the estimates obtained by the weighted least square.

Table 10.

Effect of departure from constant family-specific effects model (μ1 = 5, μ2 = 20)

Ordinary least square estimates
Var(αi)=0 Var(αi)=15 Var(αi)=25

True β Adjusted Mean β̂ Empirical S.E. Adjusted Mean β̂ Empirical S.E. Increase of S.E. Adjusted Mean β̂ Empirical S.E. Increase of S.E.
10 9.98 1.90 9.87 2.13 12% 9.95 2.22 17%
20 19.93 1.88 19.83 2.03 8% 19.98 2.15 14%
50 49.99 1.89 49.90 2.10 11% 50.00 2.18 15%
100 100.00 1.89 99.97 2.11 12% 99.95 2.20 16%

Weighted least square estimates
10 10.02 1.49 9.93 1.51 1% 10.05 1.49 0%
20 19.95 1.48 19.87 1.51 2% 19.99 1.50 1%
50 50.02 1.48 49.93 1.50 1% 50.02 1.49 1%
100 100.05 1.47 99.99 1.48 1% 99.95 1.52 3%

4 Data Analysis

In this section, the proposed approaches are applied to an analysis of the influence of APOE genotype on plasma LDL concentrations in young children. There are three common alleles at the APOE locus (ε2, ε3, ε4). The apo ε3 is the most prevalent allele in the general population, with a frequency of 75% to 80%. The frequency of apo ε4 allele varies with ethnicity (Howard et al. 1998). Previous studies of adults have shown that apo ε4 allele is associated with higher LDL cholesterol levels compared to ε3 (Davignon et al. 1988), while the role of ε2 is more complicated.

The effect of APOE gene was found to be larger in younger people (Hixson et al. 1991). Children included in this data analysis were recruited through the Columbia University BioMarkers Study, a cross-sectional study of children and their parents conducted from 1994 to 1998 (Isasi et al. 2000, Shea et al. 1999). Families were recruited from lists of cardiac patients generated through the Presbyterian Hospital Clinical Information System, private cardiology practices, lipid clinics, pediatric practices at Columbia-Presbyterian Medical Center, and fliers posted within the medical center. Families with at least one healthy child, 4 to 25 years of age, were eligible for participation. Healthy was defined as not having any chronic medical condition under treatment by a pediatrician, other than high blood pressure or high lipids (referral criteria to the Childrens Cardiovascular Health Center). Some subjects were recruited through family members other than the children. Around 75% of the children were Hispanic and the remaining 25% were non-Hispanic White.

There were 621 children recruited for the study, among whom 55 did not have data on the APOE genotype. Among the children with genotype data, 10 were excluded because of Mendelian genotyping errors. There were 13 children without LDL concentration data. These children contributed to the computation of the additional covariates, but were not included in the subsequent regression analysis relating LDL to genotype because of missing LDL levels. The mean LDL of the 534 children was 103.8 and the standard deviation was 43.5. The frequencies of children with APOE genotypes ε2ε2, ε3ε2, ε3ε3, ε4ε2, ε4ε3, and ε4ε4 were 9.1%, 7.9%, 61.8%, 1.8%, 25.6%, and 2.0%. The mean LDL concentrations for children in each genotype group were 49.8, 84.9, 106.1, 107.2, 105.4, and 105.9.

Among the 547 children from 322 families with genotype data, 153 in 78 families had complete parental genotypes, 389 in 241 families had parental genotype available in one of the parents, and 5 of them in 3 families had no genotype information on any of the parents.

Three sets of analyses were presented. First, the unadjusted analysis was carried out. Then the analysis was repeated with two approaches to adjust for admixture: the first was the methods proposed in Yang et al. (2000), which computed the additional covariates as conditional expectation of the founder genotypes given the minimal sufficient statistics of the genetic model under the null; the second was the methods proposed here, which computed the additional covariate by (8) or (16). Results from the three analyses were compared.

In each set of the analyses, models with and without a family-specific random effect were fit to the data. The ordinary least square estimates for model (1) were obtained by fitting a simple linear model, while weighted least square estimates for model (10) were obtained by fitting a mixed effects model with random family-specific effects. Standard errors were computed as in (9) or (17). The APOE genotype was coded as the number of each of the three APOE alleles carried by a subject. In this data analysis example, Yij is the LDL concentration for the jth child in the ith family, X(Gij) = (Xε2(Gij), Xε3(Gij), Xε4(Gij))T are the numbers of ε2, ε3, and ε4 allele carried by the child, β = (β2, β3, β4)T are the effect of each of the three alleles, and αi is the family-specific random effect.

Results of the unadjusted analysis were summarized in Table 11. The significant contrasts were β3β2 and β4β2. The estimated differences were 19.9 (SE: 5.3) and 20.7 (SE: 6.1). The interpretation for this analysis was that children carrying the apo ε2 allele had a significantly lower LDL concentration than children with the ε3 or ε4 allele.

Table 11.

Real data example: the unadjusted analysis

Ordinary least square estimates
Parameter Estimate Standard error p value
ε2 33.0 5.2 < 0.001
ε3 52.9 1.2 < 0.001
ε4 53.7 3.2 < 0.001
ε3 − ε2 19.9 5.4 < 0.001
ε4 − ε3 0.8 3.7 0.83
ε4 − ε2 20.7 6.2 < 0.001

Weighted least square estimates
ε2 33.9 6.3 < 0.001
ε3 52.7 1.3 < 0.001
ε4 56.6 2.8 < 0.001
var(αi2)
721.5 145.4 < 0.001
var(σi2)
1159.7 115.1 < 0.001
ε3 − ε2 18.8 6.5 < 0.001
ε4 − ε3 3.9 3.4 0.24
ε4 − ε2 22.7 6.7 < 0.001

For the first adjusted analysis, the FBAT (Rabinowitz and Laird 2000, Horvath et al. 2001) was used to compute the conditional expectation of X(Gi) given the minimal sufficient statistics of parental genotypes. Then a linear model was fit using these conditional expectations as additional covariates as in Yang et al. (2000). The results were summarized in Table 12. In this analysis the parameter estimates for β were not identifiable, but the contrasts remained identifiable. The significant contrasts were still β3β2 and β4β2 as in the unadjusted analysis. The values of the contrasts were 24.5 (SE: 11.1) and 31.7 (SE: 11.0) for the ordinary least square, and 22.0 (SE: 9.8) and 30.1 (SE: 10.1) for the weighted least square. Note that when using the weighted least square method, the effect for β3β2 changed from 18.8 (unadjusted) to 22.0 (adjusted), and the effect for β4β2 changed from 22.7 (unadjusted) to 30.1 (adjusted). Similar magnitude of increase was observed for the ordinary least square estimates.

Table 12.

Real data example: adjusting by Yang et al. (2000)

Ordinary least square estimates
Parameter Estimate Standard error p value
ε3 − ε2 24.5 11.1 0.03
ε4 − ε3 7.2 4.7 0.12
ε4 − ε2 31.7 11.0 0.004

Weighted least square estimates
var(αi2)
718.8 146.1 < 0.001
var(σi2)
1165.0 116.0 < 0.001
ε3 − ε2 22.0 9.8 0.03
ε4 − ε3 8.2 4.3 0.06
ε4 − ε2 30.1 10.1 0.003

For the second adjusted analysis using the methods developed here, the matrices Wi and Zi are required. The genotype frequencies in Wi were computed using observed founder genotypes. We have shown that misspecification of Wi does not affect the unbiasedness of β. To illustrate the computation of Zi and the additional covariates, the calculation was carried out for an example pedigree with two children and one observed heterozygous parent in the appendix.

The results of these analyses were summarized in Table 13. The significant estimated contrasts were β3β2 and β4β2, with values 18.6 (SE: 6.9) and 19.9 (SE: 6.9) for the ordinary least square, and 17.6 (SE: 7.3) and 22.0 (7.3) for the weighted least square. The changes of the contrasts in the proposed adjustment were smaller compared to the Yang adjustment, and the standard errors were also smaller. These comparisons suggest that applying Yang et al. (2000) may have over-corrected for population admixture. Furthermore, the larger standard errors for the contrasts in the Yang analysis compared to the proposed reflected the loss of information by conditioning on the minimal sufficient statistics in the Yang analysis. We can also see that the weighted least square estimates had smaller standard error than the ordinary least squares estimates.

Table 13.

Real data example: adjusting by the proposed method

Ordinary least square estimates
Parameter Estimate Standard error p value
ε3 − ε2 18.6 6.9 0.007
ε4 − ε3 1.3 3.6 0.72
ε4 − ε2 19.9 6.9 0.004

Weighted least square estimates
var(αi2)
734.7 146.6 < 0.001
var(σi2)
1157.2 114.6 < 0.001
ε3 − ε2 17.6 7.3 0.02
ε4 − ε3 4.4 3.4 0.20
ε4 − ε2 22.0 7.3 0.003

5 Discussion

Here a locally efficient approach to adjusting for population admixture when estimating genetic effect on a quantitative trait is proposed. The main step is to augment a linear regression model with supplemental covariates which provides unbiased minimal variance estimator for the genetic parameter of interest. The form of the additional covariates is similar in spirit to that proposed in Rabinowitz (2002) and Whittemore (2004). The models in (1) and (10) can be extended to include environmental factors.

In the testing context, it was observed by Whittemore and Halpern (2003) that both Rabniowitz and Laird (2000) and Rabinowitz (2002) can be formulated as solutions to a constrained optimization problem: the coefficient of variation of the test statistic is to be maximized under some constraint. In Rabinowitz and Laird (2000) the constraint is that the conditional expectation of the test statistic given the minimal sufficient statistic of the genetic model under the null hypothesis is zero; while in Rabinowitz (2002) the constraint is that the conditional expectation of the test statistic given the founder genotypes is zero. The column space of the constraints in the former contains the corresponding column space of the constraints of the latter: the vectors in the latter space can be expressed as a linear combination of vectors in the former space. This observation implies that the constraints in Rabinowtiz and Laird (2000) are too restrictive and there is potential loss of information incurred by conditioning on a larger space. For example, families where all children have the same genotype do not contribute to the analysis. In contrast, the methods in Rabinowtiz (2002) capture all the available information.

In the estimation context, the analogous comparison of efficiency is between Yang et al. (2000), which was based on Rabinowitz and Laird (2000), and the proposed approach, which has a similar form to Rabinowitz (2002) and Whittemore (2004). Yang et al. (2000) corresponds to projecting covariates involving genotypes subject to population admixture onto the space of the minimal sufficient statistics, while the proposed approach corresponds to projecting the genotype-related covariates onto an appropriate smaller space. The larger the space of projection, the more information is lost. Our simulation results suggest non-trivial efficiency gains of the proposed methods over Yang et al. (2000). It can also be seen from the real data analysis example that the standard errors of the proposed methods were smaller which suggests over-adjustment in Yang et al. (2000) because conditioning on the minimal sufficient statistic may be too restrictive.

It is shown in the appendix that the proposed approach is optimal when there is no family-specific effect or when the family-specific effect is a constant. When such effect is not a constant, the proposed approach is locally optimal because the family-specific effect is approximately a constant (the variance of such effect is zero) when considered locally. We used simulation studies to investigate the efficiency of the estimators when the family-specific effect is not a constant. The efficiency of the ordinary least square method was reduced to up to 17% (Table 10), but the efficiency of the weighted least square method was not greatly influenced by the departure from the constant family-specific effect model (efficiency loss up to 3%, see Table 10).

The added covariates Ui involve marginal probabilities Wi, which are estimated (possibly incorrectly) from the data. The estimators depending on these estimated values would normally introduce extra variability. However, since the additional covariates XiUi can be viewed as residuals from the projection of Xi onto to the space spanned by Wi, they are orthogonal to Wi. By the orthogonality, there is no additional variability introduced by estimating Wi.

Since population admixture acts as a source of family-specific effects, the weighted least square is more efficient than the ordinary least square even when there are no additional family-specific effects. For the real data analysis, the weighted least square should be used.

The proposed methods are designed for the single-locus model or the multi-locus model without interaction. When there are multiple loci predisposing a disease and there is no interaction between the loci, we compute a set of optimal covariates for each locus using founder genotypes at this locus. All the optimal covariates will then be included in the linear model analyses. When there is interaction between the loci, the current approaches need to be modified to a haplotype based method to account for this effect because haplotype association analysis may be more powerful than genotype association analysis (Morris and Kaplan 2002). However, one complication faced by a haplotype analysis is that the phase of a haplotype is usually not observed. In a haplotype analysis, when the phase is known, one uses functions of haplotypes as predictors. When the phase is unknown, one uses the conditional distribution of the haplotypes given the genotypes to compute the conditional expectation of phase-unknown haplotype scores. The conditional distribution depends on the marginal distribution of haplotypes which may be subject to population stratification or may be estimated using only approximated assumptions (e.g. Hardy-Weinberg equilibrium). To extend the methods developed here in this context, the covariates Xi in the equation (8) should be replaced by the estimated conditional expectation of the functions of haplotypes. The Wi should be replaced by the marginal distribution used in the calculations of Xi, and Zi should be replaced by the matrix of conditional probability of all possible haplotypes given the parental haplotypes. Further research along this direction is underway.

The proposed methods are easy to implement and the computational cost is low. The extra computation involved other than fitting a linear model is to compute the additional optimal covariates. The form of these covariates (see (8)) suggests that they are constructed using marginal distribution of founder genotypes and conditional distribution of the offspring genotypes and involve some matrix algebra. These computations do not entail iterations and can be completed in seconds. In our simulations, it took half a minute to compute the optimal covariates for 1000 repetitions on a Dell Workstation with 2.00GHz CPU. A link to the code to compute the optimal covariates can be found at www.columbia.edu/~yw2016.

Here the methods are developed in the context of random sampling. When subjects with extreme values of a quantitative trait are over-sampled or when certain outcomes are over-sampled, these methods are generally biased. In such settings, the more general estimating equation conditioning on the outcomes proposed in Whittemore (2004) may be applicable.

Table 9.

Efficiency loss due to adjusting for admixture when no admixture is present: with family-specific effects, μ1 = μ2 = 5, var(αi) = 15

Ordinary least square estimates
Trueβ Unadjusted Adjusted Efficiency Loss
Mean β̂ Emp. S.E. Mean β̂ Emp. S.E.
10 9.99 1.01 9.97 1.41 28%
20 20.01 1.02 19.98 1.42 28%
50 50.02 1.01 50.02 1.39 27%
100 99.98 1.02 99.98 1.41 28%

Weighted least square estimates
10 9.99 0.97 9.97 1.34 28%
20 19.99 0.97 19.98 1.32 27%
50 50.01 0.95 50.02 1.31 27%
100 99.99 0.95 99.98 1.32 28%

[SE (Adjusted)−SE (Unadjusted)]/SE (Adjusted)

Acknowledgments

Yuanjia Wang’s research is supported by NIH AG031113-01A2. We thank Dr. Steve Shea for providing the LDL data used to illustrate the proposed methods.

Appendix

In this section the expectation and the variance of β̂ were computed, the solution to the constrained optimization problem was derived, and the efficient additional covariates for an example pedigree was computed.

From the expression β̂ = [(XU)T(XU)]−1(XU)TY and the model (2), the expectation of β̂ is

Eβ^=E{[(XU)T(XU)]1(XU)TY}=β+E{[(XU)T(XU)]1(XU)TUγ}+E{[(XU)T(XU)]1(XU)Tε}=β+E{[(XU)T(XU)]1iE[(XiUi)T1niGi]E[εijGi]},

where 1ni denotes the ni × 1 vector of 1. Here the third equality follows from the constraint (4) and the fact that given the founder genotypes, X(G) is conditionally independent of ε. It follows that the condition to ensure the unbiasedness of β̂ is that

E[(XiUi)T1niGi]=0,i=1,,n.

Now turn to the computation of the variance. We have the expression

var(β^)=var(E(β^G))+E(var(β^G)).

Under the conditions (4) and (7), we have E(β̂|G) = β. Therefore the first term on the right hand side of the expression is zero. Let Ai denote the vector X(Gi) − Ui, and let A denote the matrix (A1T,,AnT)T. The second term can be calculated as

var(β^G)=E{[ATA]1iAiTεiεiTAi[ATA]1G}=iE(εij2Gi)E{[ATA]1AiTAi[ATA]1G}=σ2E{[ATA]1G}.

Here σ2 is the variance of the residuals. The second equality follows from the conditional independence of εij and Gij given Gi. Taking the expectation we obtain

var(β^)=σ2E[ATA]1. (18)

Similarly, when there are family-specific effects, expectation of the weighted least square β̂′ as in (11) is

Eβ^=E{[(XU)T1(XU)]1(XU)T1Y}=E{[(XU)T1(XU)]1(XU)T1[(XU)β+Uγ+α+ε]}=β+E{[(XU)T1(XU)]1iE[(XiUi)Ti11niGi]E[αiGi]}+E{[(XU)T1(XU)]1iE[(XiUi)Ti11niGi]E[εijGi]}.

The conditional variance is

var(β^G)=E{[AT1A]1iAiTi1(αi21ni1niT+εiεiT)i1Ai[AT1A]1G}=E{[AT1A]1iAiTi1E[αi21ni1niT+εiεiTGi)]i1Ai[AT1A]1G}=iE{[AT1A]1AiTi1Ai[AT1A]1G}=E{[AT1A]1G}.

Here the second equality follows from the conditional independence of αi and Gij given Gi, and the third equality follows from i=var(YiYiTGi). Taking expectation, we have

var(β^)=E[AT1A]1. (19)

Therefore minimizing the variance of weighted least square when there are family-specific effects amounts to minimizing (XU)TΣ−1(XU).

Since by a linear transformation, solving U for the weighted least square estimator can be converted to solving U for the ordinary least square, we solve the constrained optimization problem for the latter. To minimize the variance (18) subject to constraints (4) and (7), we introduce Lagrange equations. Recall that ϑi denotes all possible founder genotypes compatible with that observed in the ith family, and ϑi denotes all possible combination of offspring genotypes in the ith family. The object function is

i=1ngϑij=1ni(Xij(g)Uij(g))2P(g)i=1ngϑigϑij=1niλi,g(Xij(g)Uij(g))P(gg)ηi=1ngϑij=1ni(Xij(g)Uij(g))Uij(g)P(g).

Here λi,g and η are Lagrange multipliers. The term Uij(g) is the additional covariate to add when the observed genotypes in the offspring is g.

It is convenient to write the objective function in a matrix form and do the calculation in matrix algebra. Recall the notations for Wi, Xi and Vi defined in section 2, and let λi denote the di × 1 vector of λi,g. The objective function can be written as

i1niT(XiVi)Wi(XiVi)T1niiλiTZiT(XiVi)T1niηi1niT(XiVi)WiViT1ni,

and the Lagrange equations can be written as

2Wi(XiVi)T1niZiλiηWi(Xi2Vi)T1ni=0 (20)
ZiT(XiVi)T1ni=0 (21)
i1niT(XiVi)WiViT1ni=0. (22)

Multiplying the both sides of (20) on the left by ZiTWi1 and using the condition (21) results in

λi=η(ZiTWi1Zi)1ZiTXiT1ni. (23)

Plug (23) into (20) to get

(2η)Wi(XiVi)T1niηZi(ZiTWi1Zi)1ZiTXiT1ni+ηWiViT1ni=0.

Solve this equation to arrive at

ViT1ni=Wi1Zi(ZiTWi1Zi)1ZiTXiT1ni, (24)

and η = 2. Here Vi is identifiable up to the sum of its components. Adding and subtracting a constant from the elements of Vi does not change the sum of all the elements. Nevertheless, the expectation of the estimator is the same. We simply pick

Vi=XiZi(ZiTWi1Zi)1ZiTWi1.

The desirable additional covariates Ui can be picked from the row of Vi which corresponds to the observed genotypes in family members of the ith family. The marginal probabilities Wi can be computed as P(g)=gϑi(g)P(gg)P(gg). Here g** index the observed founder genotypes in ϑi. Note that the solution under the constraints (20) and (21) satisfies (22) automatically.

Finally the computation for an illustrative example pedigree with two children is presented. Suppose that the example pedigree has two children with genotypes (DD, Dd), one parent with observed genotype Dd, and the other parent with no genotype information. The parental genotypes compatible with the observed genotypes are ϑi={(Dd,DD),(Dd,Dd),(DD,dd)}. The 9 possible genotype configurations for the children are listed as the rows in Table A1. Here ci = 9, and di = 3. The entries of matrix Zi are presented in Table A1.

Table A1.

The entries of Zi for the example pedigree

Offspring genotypes Parental genotypes
(Dd, DD) (Dd, Dd) (Dd, dd)
(DD, DD) 1/4 1/16 0
(DD, Dd) 1/4 1/8 0
(DD, dd) 0 1/16 0
(Dd, DD) 1/4 1/8 0
(Dd, Dd) 1/4 1/4 1/4
(Dd, dd) 0 1/8 1/4
(dd, DD) 0 1/16 0
(dd, Dd) 0 1/8 1/4
(dd, dd) 0 1/16 1/4

Code Xi as the number of D alleles, then from the equation (24), the matrix Vi can be calculated. The results under different assumptions of the allele frequency are recorded in Table A2. It can be seen that there is no big difference in Vi when we change the allele frequency.

Table A2.

The matrix Xi and Vi for the example pedigree

Offspring genotypes
XiT
ViT
p = 0.1 p = 0.2 p = 0.4
(DD, DD) (2, 2) (2.04, 2.04) (1.98, 1.98) (1.86, 1.86)
(DD, Dd) (2, 1) (1.68, 1.68) (1.66, 1.66) (1.62, 1.62)
(DD, dd) (2, 0) (1.24, 1.24) (1.18, 1.18) (1.06, 1.06)
(Dd, DD) (1, 2) (1.68, 1.68) (1.66, 1.66) (1.62, 1.62)
(Dd, Dd) (1, 1) (0.6, 0.6) (0.7, 0.7) (0.9, 0.9)
(Dd, dd) (1, 0) (0.48, 0.48) (0.46, 0.46) (0.42, 0.42)
(dd, DD) (0, 2) (1.24, 1.24) (1.18, 1.18) (1.06, 1.06)
(dd, Dd) (0, 1) (0.48, 0.48) (0.46, 0.46) (0.42, 0.42)
(dd, dd) (0, 0) (0.44, 0.44) (0.38, 0.38) (0.26, 0.26)

The observed genotypes for children in the example pedigree is (DD, Dd), which correspond to the second entry in Table A2. Therefore the additional covariate Ui for this family when the allele frequency is 0.1 is (1.68, 1.68). When the allele frequency is 0.2 or 0.4, the additional covariates for this family are (1.66, 1.66) and (1.62, 1.62), respectively. These covariates are not substantially affected by the specification of allele frequency.

Contributor Information

Yuanjia Wang, Email: yw2016@columbia.edu, Department of Biostatistics, Mailman School of Public Health, Columbia University 722 W168th St., New York, NY 10032, U.S.A.

Qiong Yang, Department of Biostatistics, Boston University.

Daniel Rabinowitz, Department of Statistics, Columbia University.

References

  1. Abecasis GR, Cardon LR, Cookson WOC. A general test of association for quantitative traits in nuclear families. American Journal of Human Genetics. 2000;66:279–292. doi: 10.1086/302698. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Allen AS, Satten GA, Tsiatis AA. Locally-efficient robust estimation of haplotype-disease association in family-based studies. Biometrika. 2005;92:559–571. [Google Scholar]
  3. Cox DR, Hinkley CV. Theoretical statistics. London: Chapman & Hall; 1979. [Google Scholar]
  4. Curtis D, Sham PC. A note on the application of the transmission disequilibrium test when a parent is missing. American Journal of Human Genetics. 1995;56:811–812. [PMC free article] [PubMed] [Google Scholar]
  5. Davignon J, Gregg RE, Sing CF. Apolipoprotein E polymorphism and atherosclerosis. Arteriosclerosis. 1988;8:121. doi: 10.1161/01.atv.8.1.1. [DOI] [PubMed] [Google Scholar]
  6. Elston RC. Linkage and association. Genet Epidemiol. 1998:565–576. doi: 10.1002/(SICI)1098-2272(1998)15:6<565::AID-GEPI2>3.0.CO;2-J. [DOI] [PubMed] [Google Scholar]
  7. Falk CT, Rubinstein P. Haplotype relative risks: An easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet. 1987;51:227233. doi: 10.1111/j.1469-1809.1987.tb00875.x. [DOI] [PubMed] [Google Scholar]
  8. Fulker DW, Cherny SS, Sham PC, Hewitt JK. Combined linkage and association sib-air analysis for quantitative traits. American Journal ofHuman Genetics. 1999;64:259–267. doi: 10.1086/302193. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Hardin JW, Hilbe JM. Generalized Estimating Equations. Chapman and Hall; London, UK: 2003. [Google Scholar]
  10. Hixson JE. Apolipoprotein E Polymorphisms Affect Atherosclerosis in Young Males: Pathobiological Determinants of Atherosclerosis in Youth (PDAY) Research Group. Arterioscler Thromb. 1991;11:237–244. doi: 10.1161/01.atv.11.5.1237. [DOI] [PubMed] [Google Scholar]
  11. Horvath S, Xu X, Laird N. The family based association test method: strategies for studying general genotype-phenotype associations. Euro J Hum Gen. 2001;9:301–306. doi: 10.1038/sj.ejhg.5200625. [DOI] [PubMed] [Google Scholar]
  12. Howard BV, Gidding SS, Liu K. Association of apolipoprotein E phenotype with plasma lipoproteins in African-American and white young adults. Am J Epidemiol. 1998;148:859868. doi: 10.1093/oxfordjournals.aje.a009711. [DOI] [PubMed] [Google Scholar]
  13. Isasi CR, Shea S, Deckelbaum RJ, Couch SC, Starc TJ, Otvos JD, Berglund L. Apolipoprotein ε2 allele is associated with an anti-atherogenic Lipoprotein profile in children: the Columbia University Biomarker Study. Pediatrics. 2000;106:568–575. doi: 10.1542/peds.106.3.568. [DOI] [PubMed] [Google Scholar]
  14. Lazzeroni LC, Lange K. A conditional interference framework for extending the transmission/disequilibrium test. Hum Hered. 1998;48:6781. doi: 10.1159/000022784. [DOI] [PubMed] [Google Scholar]
  15. Morris RW, Kaplan NL. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genetic Epidemiology. 2002;23:221–233. doi: 10.1002/gepi.10200. [DOI] [PubMed] [Google Scholar]
  16. Rabinowitz D, Larid N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Human Hered. 2000;50:211–223. doi: 10.1159/000022918. [DOI] [PubMed] [Google Scholar]
  17. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516. [DOI] [PubMed] [Google Scholar]
  18. Rabinowtiz D. Adjusting for population heterogeneity and misspecified haplotype frequencies when testing nonparametric null hypotheses in statistical genetics. Journal of the American Statistical Association. 2002;92:742–758. [Google Scholar]
  19. Shea S, Isasi CR, Couch S, Starc TJ, Tracy RP, Deckelbaum R, Talmud P, Berglund L, Humphries SE. Relations of plasma fibrinogen level in children to measures of obesity, the (G-455->A) mutation in the beta-fibrinogen promoter gene, and family history of ischemic heart disease: the Columbia University BioMarkers Study. Am J Epidemiol. 1999;150(7):737–46. doi: 10.1093/oxfordjournals.aje.a010076. [DOI] [PubMed] [Google Scholar]
  20. Spielman RS, Ewens WJ. A sib-ship test for linkage in the presence of association: The sib transmission/disequilibrium test. Am J Hum Genet. 1998;62:450458. doi: 10.1086/301714. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet. 1993;52:506516. [PMC free article] [PubMed] [Google Scholar]
  22. Terwilliger JD, Ott J. A haplotype-based haplotype relative risk approach to detecting allelic associations. Hum Hered. 1992;42:337346. doi: 10.1159/000154096. [DOI] [PubMed] [Google Scholar]
  23. Yang Q, Rabinowitz D, Isasi C, Shea S. Adjusting for confounding due to population admixture when estimating the effect of candidate genes on quantitative traits. Human Heredity. 2000;50:227–233. doi: 10.1159/000022920. [DOI] [PubMed] [Google Scholar]
  24. Whittemore A. Estimating genetic association parameter from family data. Biometrika. 2004;91:219–225. [Google Scholar]
  25. Whittemore A, Halpern J. Genetic association tests for family data with missing parental genotypes: a comparison. Genetic Epidemiology. 2003;25:80–91. doi: 10.1002/gepi.10247. [DOI] [PubMed] [Google Scholar]

RESOURCES