Author manuscript; available in PMC: 2011 Jul 1.
Published in final edited form as: Ann Hum Genet. 2010 May 31;74(4):351–360. doi: 10.1111/j.1469-1809.2010.00588.x

Influence of population stratification on population-based marker-disease association analysis

TENGFEI LI 1, ZHAOHAI LI 2, ZHILIANG YING 3, HONG ZHANG 4,*
PMCID: PMC2897957  NIHMSID: NIHMS203793  PMID: 20529080

Summary

Population-based genetic association analysis may suffer from the failure to control for confounders such as population stratification (PS). There has been extensive study of the influence of PS on candidate gene-disease association analysis, but much less attention has been paid to its influence on marker-disease association analysis. In this paper, we focus on the Pearson chi-square test and the trend test for marker-disease association analysis. The mean and variance of the test statistics are derived in the presence of PS, so that the power and the inflated type I error rate can be evaluated. It is shown that the bias and the variance distortion are not zero in the presence of both PS and penetrance heterogeneity (PH). Unlike in candidate gene-disease association analysis, when PS is present the bias is in general not zero whether or not PH is present. This work generalizes the results of Ewens and Spielman (1995), where only the fully recessive penetrance model is considered and only the bias is calculated. It is shown that candidate gene-disease association analysis can be treated as a special case of marker-disease association analysis. Consequently, our results extend previous studies on candidate gene-disease association analysis. A simulation study confirms the theoretical findings.

Keywords: bias, marker-disease association, penetrance heterogeneity, population stratification, variance distortion

INTRODUCTION

Population-based gene-disease association analysis is the most commonly used statistical method for detecting genetic variants underlying human diseases (Risch and Merikangas, 1996; Risch, 2000). Such an approach makes use of the case-control design, which is easy to carry out and cost-effective. However, case-control studies often suffer from a failure to account for confounders such as population stratification (PS), resulting in spurious associations (Knowler et al., 1988; Lander and Schork, 1994; Cardon and Palmer, 2003; Campbell et al., 2005). When parental genotypes of affected individuals are available, the transmission disequilibrium test (TDT) can be used to control for false positives due to PS. However, for diseases with a late age of onset, the parental genotypes are generally unavailable and, therefore, the TDT is not applicable. There have been studies in recent years on the impact of PS on gene-disease association analysis, particularly with respect to the bias and/or variance distortion of the test statistic (Ewens and Spielman, 1995; Gorroochurn et al., 2004; Heiman et al., 2004; Qin et al., 2006; Whittemore, 2006; Li et al., 2009; Zheng et al., 2009). Most of the existing studies focus on a candidate locus, where the null hypothesis states that the penetrance does not depend on genotype in any subpopulation. Markers are widely used in preliminary association analyses for detecting disease genes, especially in genome-wide association analyses. However, the impact of PS on marker-disease association has not been studied, with the exception of the work of Ewens and Spielman (1995), where the bias of the test statistic for marker-disease association was obtained under a very special disease model, namely a fully recessive penetrance model, and the variance distortion and power function were not given.

In this paper, we extend the results of Ewens and Spielman (1995) to a more general class of models, without assuming any mode of inheritance. Besides the bias, we also derive the variance distortion and power function under both the null hypothesis and the alternative hypothesis. The null hypothesis in the marker locus case states that the linkage disequilibrium (LD) measures are zero in any subpopulation. It is shown that the bias and variance distortion under the null hypothesis are not zero in the presence of both PS and penetrance heterogeneity (PH). In addition, the bias is not zero when PS is present, even if PH is not, in contrast to the result for candidate gene-disease association analysis, where the bias is equal to zero when PH is absent. We demonstrate that candidate gene-disease association analysis can be treated as a special case of marker-disease association analysis, so that our results are extensions of the previous work on candidate gene-disease association analysis. Because the null hypothesis in the marker locus case is different from that in the candidate locus case, the existing results under the null hypothesis for a candidate locus cannot be transformed through simple reparameterization to yield our results.

Our contributions consist of the following: 1) we extend the existing results to the general case of marker-disease association analysis; 2) we find that the presence of PS can lead to bias of the marker-disease association test statistic even when the PH is absent, while in the candidate locus case the bias is always zero when PH is absent; 3) we derive the power functions for the Pearson chi-square test and the trend test so that one can study the impact of PS and PH on both the type I and type II errors of the two tests.

The rest of the paper is organized as follows. Some notation and definitions are given in the next section. The subsequent sections give the mean and variance of the Pearson chi-square test statistic and the trend test statistic and their power functions. A small-scale simulation study is conducted to verify the theoretical results. This is followed by some concluding remarks.

NOTATION

Suppose that in a case-control study n1 cases and n2 controls are sampled from their respective populations, where n = n1+ n2. A marker with alleles M and m is then genotyped, with the counts of genotypes and alleles given in Tables 1 and 2, respectively.

Table 1.

Genotype counts

MM Mm mm Sum
Cases D2 D1 D0 n1
Controls C2 C1 C0 n2
Sum r2 r1 r0 n

Table 2.

Allele counts

M m Sum
Cases 2D2+D1 2D0+D1 2n1
Controls 2C2+C1 2C0+C1 2n2
Sum 2r2+r1 2r0+r1 2n

Let the proportions of allele M among cases and controls be denoted by $\hat q_D = (2D_2 + D_1)/(2n_1)$ and $\hat q_C = (2C_2 + C_1)/(2n_2)$, respectively. In addition, let $\hat q = (2D_2 + D_1 + 2C_2 + C_1)/(2n)$ be the proportion of allele M in the pooled sample. The commonly used Pearson chi-square test statistic based on allele counts is the square of the following test statistic:

$$T = \frac{\hat q_D - \hat q_C}{V^{1/2}}, \tag{1}$$

where

$$V = \hat q(1 - \hat q)\left(\frac{1}{2n_1} + \frac{1}{2n_2}\right) \tag{2}$$

is an estimate of the variance of $\hat q_D - \hat q_C$. The estimate $V$ is used when Hardy-Weinberg equilibrium (HWE) holds in the overall population. In the “variance adjustment” section, we will present a variance estimate that is valid even when HWE does not hold.
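As a minimal sketch of how (1) and (2) are evaluated from the genotype counts of Table 1, the following Python function may help. The function name `allele_test` and the example counts are illustrative assumptions, not taken from the paper.

```python
import math

def allele_test(D2, D1, D0, C2, C1, C0):
    """Return (T, V) for the allele-based statistic in (1)-(2).

    D2, D1, D0: case counts of genotypes MM, Mm, mm (Table 1).
    C2, C1, C0: control counts of genotypes MM, Mm, mm.
    """
    n1 = D2 + D1 + D0
    n2 = C2 + C1 + C0
    q_d = (2 * D2 + D1) / (2 * n1)          # allele M proportion in cases
    q_c = (2 * C2 + C1) / (2 * n2)          # allele M proportion in controls
    q_pool = (2 * D2 + D1 + 2 * C2 + C1) / (2 * (n1 + n2))
    V = q_pool * (1 - q_pool) * (1 / (2 * n1) + 1 / (2 * n2))
    return (q_d - q_c) / math.sqrt(V), V

# Hypothetical counts for 1000 cases and 1000 controls.
T, V = allele_test(120, 460, 420, 90, 420, 490)
print(T, V)
```

The square of the returned `T` is the Pearson chi-square statistic on the allele counts of Table 2.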

We assume that the total population consists of $K$ subpopulations, with HWE holding within each subpopulation. Throughout this paper, we shall use $S_i$ to denote the event that a randomly selected individual is from subpopulation $i$ and $w_i$ to denote the proportion of the total population that belongs to subpopulation $i$. Assume that only one locus, with alleles A and a, is responsible for the disease. For subpopulation $i$, let $p_i$ and $q_i$ denote the frequencies of alleles A and M, respectively. Thus, the frequencies of alleles A and M in the overall population are $p = \sum_{i=1}^{K} w_i p_i$ and $q = \sum_{i=1}^{K} w_i q_i$, respectively. Furthermore, let $x_{i1}$, $x_{i2}$, $x_{i3}$ and $x_{i4}$ denote the frequencies of gametes MA, Ma, mA and ma, respectively, and $\delta_i = x_{i1}x_{i4} - x_{i2}x_{i3}$ the LD measure between the marker locus and the disease locus. Finally, denote by $f_{2i}$, $f_{1i}$ and $f_{0i}$ the penetrances of genotypes AA, Aa and aa, respectively. Under HWE, the frequencies of genotypes AA, Aa and aa at the disease locus for subpopulation $i$ are $p_{2i} = p_i^2$, $p_{1i} = 2p_i(1-p_i)$ and $p_{0i} = (1-p_i)^2$, respectively. The null hypothesis of linkage equilibrium becomes

$$H_0: \delta_1 = \cdots = \delta_K = 0. \tag{3}$$

Under the null hypothesis H0, all LD measures δi, i = 1,…,K, are equal to 0, while under the alternative hypothesis, at least one LD measure is not equal to 0. It is clear that the null hypothesis implies that the marker is not associated with the disease.

Definition 1

PS is said to be present if the allele frequencies at the marker locus are heterogeneous, i.e., the qi vary with i.

EXPECTATION OF FREQUENCY DIFFERENCE

In this section, we calculate the expectations of $\hat q_D$ and $\hat q_C$ under both the null and the alternative hypotheses. We study the null expectation of $\hat q_D - \hat q_C$, which is termed bias. Hereafter, let $Y = 1$ denote the event that a randomly chosen individual is a case, and $Y = 2$ the event that a randomly chosen individual is a control.

By definition, the expectations of $\hat q_D$ and $\hat q_C$ are equal to

$$E(\hat q_D) = P(MM \mid Y=1) + \tfrac{1}{2}P(Mm \mid Y=1) \tag{4}$$

and

$$E(\hat q_C) = P(MM \mid Y=2) + \tfrac{1}{2}P(Mm \mid Y=2), \tag{5}$$

respectively. The disease prevalence, which we denote by $B$, satisfies $B = \sum_{i=1}^{K} w_i \sum_{j=0}^{2} f_{ji}p_{ji}$ by the Law of Total Probability. In Appendix I, we show that

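$$P(MM \mid Y=1) = \frac{1}{B}\left\{\sum_{i=1}^{K} w_i q_i^2 \sum_{j=0}^{2} f_{ji}p_{ji} + \sum_{i=1}^{K} w_i\delta_i^2 (f_{2i}+f_{0i}-2f_{1i}) + 2\sum_{i=1}^{K} w_i q_i\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\} \tag{6}$$

and

$$P(Mm \mid Y=1) = \frac{2}{B}\left\{\sum_{i=1}^{K} w_i q_i(1-q_i) \sum_{j=0}^{2} f_{ji}p_{ji} - \sum_{i=1}^{K} w_i\delta_i^2 (f_{2i}+f_{0i}-2f_{1i}) + \sum_{i=1}^{K} w_i (1-2q_i)\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\}. \tag{7}$$

Similarly, we have

$$P(MM \mid Y=2) = \frac{1}{1-B}\left\{\sum_{i=1}^{K} w_i q_i^2 \sum_{j=0}^{2} (1-f_{ji})p_{ji} - \sum_{i=1}^{K} w_i\delta_i^2 (f_{2i}+f_{0i}-2f_{1i}) - 2\sum_{i=1}^{K} w_i q_i\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\} \tag{8}$$

and

$$P(Mm \mid Y=2) = \frac{2}{1-B}\left\{\sum_{i=1}^{K} w_i q_i(1-q_i) \sum_{j=0}^{2} (1-f_{ji})p_{ji} + \sum_{i=1}^{K} w_i\delta_i^2 (f_{2i}+f_{0i}-2f_{1i}) - \sum_{i=1}^{K} w_i (1-2q_i)\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\}. \tag{9}$$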

Substituting the above four probabilities for those in (4) and (5) gives

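$$\Delta = E(\hat q_D - \hat q_C) = \frac{A_1 + A_2}{B(1-B)}, \tag{10}$$

where

$$A_1 = \sum_{j=0}^{2}\left[\sum_{i=1}^{K} w_i f_{ji}p_{ji}q_i - q\sum_{i=1}^{K} w_i f_{ji}p_{ji}\right] \tag{11}$$

and

$$A_2 = \sum_{i=1}^{K} w_i\delta_i\left[(f_{2i}-f_{1i})p_i + (f_{1i}-f_{0i})(1-p_i)\right]. \tag{12}$$

Since $A_2 = 0$ under the null hypothesis, the bias is $A_1/[B(1-B)]$. Define a random variable $Z$ with probability function $P(Z = i) = w_i$, $i = 1, \ldots, K$. Then $A_1 = \mathrm{Cov}\left(q_Z, \sum_{j=0}^{2} f_{jZ}p_{jZ}\right)$, where $q_Z$, $f_{jZ}$ and $p_{jZ}$ are equal to $q_i$, $f_{ji}$ and $p_{ji}$, respectively, conditional on $Z = i$.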

The following are some scenarios that occur in practice.

Scenario 1

If PS is absent, then random variable qZ degenerates to a constant. In this scenario, A1 = 0 and the bias is zero.

Scenario 2

If PS is present but PH is absent (i.e., fji = fj1, j = 0,1,2, i = 1,···,K), then the random variables fjZ, j = 0,1,2, degenerate to constants. However, qZ and pjZ, j = 0,1,2, are not necessarily constant, so A1, and hence the bias, is not zero in general.

Scenario 3

If both PS and PH are present, then the bias is not zero in general.
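The three scenarios can be checked numerically from (10)-(12). The sketch below is illustrative only: the function name `bias` is an assumption, and the parameter values are borrowed from the additive model of the simulation section (two equally sized subpopulations) with all LD measures set to zero.

```python
def bias(w, p, q, f, delta):
    """Delta = (A1 + A2) / (B (1 - B)) from (10)-(12).

    f[i] = (f0, f1, f2): penetrances of aa, Aa, AA in subpopulation i.
    """
    K = len(w)
    F = []  # F[i] = sum_j f_ji p_ji under HWE in subpopulation i
    for i in range(K):
        f0, f1, f2 = f[i]
        F.append(f0 * (1 - p[i]) ** 2 + 2 * f1 * p[i] * (1 - p[i]) + f2 * p[i] ** 2)
    B = sum(w[i] * F[i] for i in range(K))              # disease prevalence
    q_all = sum(w[i] * q[i] for i in range(K))          # overall frequency of M
    A1 = sum(w[i] * F[i] * q[i] for i in range(K)) - q_all * B
    A2 = sum(
        w[i] * delta[i]
        * ((f[i][2] - f[i][1]) * p[i] + (f[i][1] - f[i][0]) * (1 - p[i]))
        for i in range(K)
    )
    return (A1 + A2) / (B * (1 - B))

w = [0.5, 0.5]
no_ld = [0.0, 0.0]
f_same = [(0.1, 0.2, 0.3), (0.1, 0.2, 0.3)]   # PH absent
f_diff = [(0.1, 0.2, 0.3), (0.2, 0.3, 0.4)]   # PH present

s1 = bias(w, [0.2, 0.2], [0.3, 0.3], f_diff, no_ld)  # Scenario 1: PS absent
s2 = bias(w, [0.1, 0.3], [0.4, 0.2], f_same, no_ld)  # Scenario 2: PS only
s3 = bias(w, [0.1, 0.3], [0.4, 0.2], f_diff, no_ld)  # Scenario 3: PS and PH
print(s1, s2, s3)
```

With these inputs the first value is zero up to rounding and the other two are negative, in line with Scenarios 1 to 3. Note that in Scenario 2 the bias arises because both the marker and the disease allele frequencies differ between subpopulations.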

Remark 1

When f2i = 1, f0i = f1i = 0, i = 1, ···,K, the model degenerates to the so-called fully recessive penetrance model and the expectation becomes

$$\frac{\sum_{i=1}^{K} w_i q_i p_i^2 - q\sum_{i=1}^{K} w_i p_i^2 + \sum_{i=1}^{K} w_i\delta_i p_i}{B(1-B)}. \tag{13}$$

The above expression is almost identical to expression (5) in Ewens and Spielman (1995).

Remark 2

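If the marker locus and the disease locus coincide, so that $p_i = q_i$ and the LD measures are $\delta_i = p_i - p_i^2$, then the marker locus becomes a candidate locus. In this case, (6)–(9) become

$$P(MM \mid Y=1) = \frac{\sum_{i=1}^{K} w_i f_{2i}p_{2i}}{\sum_{i=1}^{K} w_i \sum_{j=0}^{2} f_{ji}p_{ji}}, \quad P(Mm \mid Y=1) = \frac{\sum_{i=1}^{K} w_i f_{1i}p_{1i}}{\sum_{i=1}^{K} w_i \sum_{j=0}^{2} f_{ji}p_{ji}}, \quad P(MM \mid Y=2) = \frac{\sum_{i=1}^{K} w_i (1-f_{2i})p_{2i}}{\sum_{i=1}^{K} w_i \sum_{j=0}^{2} (1-f_{ji})p_{ji}}$$

and

$$P(Mm \mid Y=2) = \frac{\sum_{i=1}^{K} w_i (1-f_{1i})p_{1i}}{\sum_{i=1}^{K} w_i \sum_{j=0}^{2} (1-f_{ji})p_{ji}}.$$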

The resulting expectation corresponds to the candidate locus case studied by Li et al. (2009). In the candidate locus case, the null hypothesis is f0i = f1i = f2i for i = 1,…, K, and the bias is equal to zero if either PS or PH is absent. In the marker locus case that we study in the current paper, however, the bias is generally not zero if PH is absent but PS is present.
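The reduction in Remark 2 can be verified numerically. The following sketch evaluates the case probabilities in (6) and (7) with $q_i = p_i$ and $\delta_i = p_i - p_i^2$ and compares them with the candidate-locus expressions. The helper name `case_probs` and the parameter values are illustrative assumptions.

```python
def case_probs(w, p, q, f, delta):
    """Return (P(MM | Y=1), P(Mm | Y=1)) from (6) and (7)."""
    B = mm = mh = 0.0
    for wi, pi, qi, (f0, f1, f2), di in zip(w, p, q, f, delta):
        F = f0 * (1 - pi) ** 2 + 2 * f1 * pi * (1 - pi) + f2 * pi ** 2
        H = f2 + f0 - 2 * f1
        G = pi * (f2 - f1) + (1 - pi) * (f1 - f0)
        B += wi * F
        mm += wi * (qi ** 2 * F + di ** 2 * H + 2 * qi * di * G)
        mh += 2 * wi * (qi * (1 - qi) * F - di ** 2 * H + (1 - 2 * qi) * di * G)
    return mm / B, mh / B

# Hypothetical candidate-locus setting: marker and disease loci coincide.
w = [0.5, 0.5]
p = [0.3, 0.1]
f = [(0.1, 0.1, 0.2), (0.2, 0.2, 0.3)]          # (f0, f1, f2) per subpopulation
delta = [pi - pi ** 2 for pi in p]

general_mm, general_mh = case_probs(w, p, p, f, delta)

# Candidate-locus expressions of Remark 2.
B = sum(wi * (f0 * (1 - pi) ** 2 + 2 * f1 * pi * (1 - pi) + f2 * pi ** 2)
        for wi, pi, (f0, f1, f2) in zip(w, p, f))
direct_mm = sum(wi * f2 * pi ** 2 for wi, pi, (f0, f1, f2) in zip(w, p, f)) / B
direct_mh = sum(wi * f1 * 2 * pi * (1 - pi) for wi, pi, (f0, f1, f2) in zip(w, p, f)) / B
print(general_mm - direct_mm, general_mh - direct_mh)
```

Both printed differences should vanish up to rounding error.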

VARIANCE OF THE FREQUENCY DIFFERENCE

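In Appendix II, we derive the following variance formula for $\hat q_D$:

$$\mathrm{Var}(\hat q_D) = \frac{1}{4n_1}\left[4P(MM \mid Y=1)\bigl(1-P(MM \mid Y=1)\bigr) + P(Mm \mid Y=1)\bigl(1-P(Mm \mid Y=1)\bigr) - 4P(MM \mid Y=1)P(Mm \mid Y=1)\right]. \tag{14}$$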

Under the null hypothesis H0, the conditional probabilities P(MM|Y = 1) and P(Mm|Y = 1) given by (6) and (7) are equal to

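$$P_{H_0}(MM \mid Y=1) = \frac{1}{B}\sum_{i=1}^{K} w_i q_i^2 \sum_{j=0}^{2} f_{ji}p_{ji} = \sum_{i=1}^{K} q_i^2\alpha_i = \bar q_D^2 + \sigma_D^2 \tag{15}$$

and

$$P_{H_0}(Mm \mid Y=1) = \frac{2}{B}\sum_{i=1}^{K} w_i q_i(1-q_i) \sum_{j=0}^{2} f_{ji}p_{ji} = \sum_{i=1}^{K} 2q_i(1-q_i)\alpha_i = 2\bar q_D(1-\bar q_D) - 2\sigma_D^2,$$

respectively, where

$$\bar q_D = \sum_{i=1}^{K}\alpha_i q_i, \quad \sigma_D^2 = \sum_{i=1}^{K}\alpha_i(q_i - \bar q_D)^2 \quad \text{and} \quad \alpha_i = \frac{w_i}{B}\sum_{j=0}^{2} f_{ji}p_{ji}.$$

It follows from (14), (15) and the expression for $P_{H_0}(Mm \mid Y=1)$ above that the null variance of $\hat q_D$ is

$$\mathrm{Var}_{H_0}(\hat q_D) = \frac{\bar q_D(1-\bar q_D) + \sigma_D^2}{2n_1}.$$

Similarly,

$$\mathrm{Var}_{H_0}(\hat q_C) = \frac{\bar q_C(1-\bar q_C) + \sigma_C^2}{2n_2},$$

where

$$\bar q_C = \sum_{i=1}^{K}\gamma_i q_i, \quad \sigma_C^2 = \sum_{i=1}^{K}\gamma_i(q_i - \bar q_C)^2 \quad \text{and} \quad \gamma_i = \frac{w_i}{1-B}\sum_{j=0}^{2}(1-f_{ji})p_{ji}.$$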

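Under the alternative hypothesis, the variance of $\hat q_D$ can be expressed as

$$\mathrm{Var}(\hat q_D) = \frac{\bar q_D(1-\bar q_D) + \sigma_D^2}{2n_1} + \frac{1}{2n_1B}\sum_{i=1}^{K} w_i\delta_i^2(f_{2i}+f_{0i}-2f_{1i}) - \frac{1}{n_1B^2}\left\{\sum_{i=1}^{K} w_i\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\}^2 + \frac{1}{2n_1B}\sum_{i=1}^{K}\delta_i w_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right](1 + 2q_i - 4\bar q_D). \tag{16}$$

We refer to Appendix III for its detailed derivation. Similarly, the variance of $\hat q_C$ is equal to

$$\mathrm{Var}(\hat q_C) = \frac{\bar q_C(1-\bar q_C) + \sigma_C^2}{2n_2} - \frac{1}{2n_2(1-B)}\sum_{i} w_i\delta_i^2(f_{2i}+f_{0i}-2f_{1i}) - \frac{1}{n_2(1-B)^2}\left\{\sum_{i} w_i\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\}^2 - \frac{1}{2n_2(1-B)}\sum_{i} w_i\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right](1 + 2q_i - 4\bar q_C). \tag{17}$$

By virtue of the independence between the cases and controls, the variance of $\hat q_D - \hat q_C$ is

$$\sigma^2 \equiv \mathrm{Var}(\hat q_D) + \mathrm{Var}(\hat q_C). \tag{18}$$

In particular, the null variance of $\hat q_D - \hat q_C$ is

$$\sigma_0^2 \equiv \frac{\bar q_D(1-\bar q_D) + \sigma_D^2}{2n_1} + \frac{\bar q_C(1-\bar q_C) + \sigma_C^2}{2n_2}. \tag{19}$$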

Definition 2

We say variance distortion exists if, under the null hypothesis, the variance estimator $V$ for $\hat q_D - \hat q_C$ given by (2) is not asymptotically equivalent to the true variance, that is, if $V/\sigma_0^2$ does not converge to 1 with probability 1 under the null hypothesis.

By the Law of Large Numbers, $\hat q_D \to \bar q_D$ and $\hat q_C \to \bar q_C$, which imply that $\hat q$ converges to $c_1\bar q_D + c_2\bar q_C$ with probability 1, where $c_j = \lim_{n\to\infty} n_j/n$, $j = 1, 2$. It follows that under the null hypothesis $V$ is asymptotically equivalent to

$$\bar\sigma^2 \equiv \left[(c_1\bar q_D + c_2\bar q_C)(1 - c_1\bar q_D - c_2\bar q_C)\right]\left(\frac{1}{2n_1} + \frac{1}{2n_2}\right). \tag{20}$$

Remark 3

If PS is absent, then under the null hypothesis $\bar q_D = \bar q_C = q$ and $\sigma_D^2 = \sigma_C^2 = 0$. Hence, $\sigma_0^2 = \bar\sigma^2$ and the variance distortion vanishes under the null hypothesis and HWE. Otherwise, the variance distortion is present in general.
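To illustrate Definition 2 and Remark 3, the sketch below evaluates $\sigma_0^2$ from (19) and $\bar\sigma^2$ from (20) under the null hypothesis and reports their ratio. The function name `null_variances` is an assumption, the limits $c_j$ are replaced by the finite-sample ratios $n_j/n$, and the parameter values are those of the additive model in the simulation section.

```python
def null_variances(w, p, q, f, n1, n2):
    """Return (sigma0_sq, sigma_bar_sq) from (19) and (20); all delta_i = 0."""
    F = [f0 * (1 - pi) ** 2 + 2 * f1 * pi * (1 - pi) + f2 * pi ** 2
         for pi, (f0, f1, f2) in zip(p, f)]
    B = sum(wi * Fi for wi, Fi in zip(w, F))
    alpha = [wi * Fi / B for wi, Fi in zip(w, F)]              # case weights
    gamma = [wi * (1 - Fi) / (1 - B) for wi, Fi in zip(w, F)]  # control weights

    def mean_and_var(weights):
        m = sum(a * qi for a, qi in zip(weights, q))
        v = sum(a * (qi - m) ** 2 for a, qi in zip(weights, q))
        return m, v

    q_d, s_d = mean_and_var(alpha)
    q_c, s_c = mean_and_var(gamma)
    sigma0_sq = (q_d * (1 - q_d) + s_d) / (2 * n1) + (q_c * (1 - q_c) + s_c) / (2 * n2)
    c1, c2 = n1 / (n1 + n2), n2 / (n1 + n2)   # stand-ins for the limits c_j
    pooled = c1 * q_d + c2 * q_c
    sigma_bar_sq = pooled * (1 - pooled) * (1 / (2 * n1) + 1 / (2 * n2))
    return sigma0_sq, sigma_bar_sq

w = [0.5, 0.5]
f = [(0.1, 0.2, 0.3), (0.1, 0.2, 0.3)]
s0_abs, sb_abs = null_variances(w, [0.2, 0.2], [0.3, 0.3], f, 1000, 1000)
s0_ps, sb_ps = null_variances(w, [0.1, 0.3], [0.4, 0.2], f, 1000, 1000)
ratio_absent = sb_abs / s0_abs     # PS absent
ratio_present = sb_ps / s0_ps      # PS present
print(ratio_absent, ratio_present)
```

A ratio of 1 means no variance distortion. In the PS-present setting used here the ratio falls below 1, so $V$ understates the null variance.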

VARIANCE ADJUSTMENT

In this section, we derive the power function of the test statistic $T$, which is given by (1). By the Central Limit Theorem, $T_A = (\hat q_D - \hat q_C - \Delta)/\sigma$ converges in distribution to the standard normal distribution, since the mean and variance of $\hat q_D - \hat q_C$ are $\Delta$ and $\sigma^2$. The two-sided $T$ test at level of significance $\alpha$ is determined by the rejection region $\{|T| > u_{\alpha/2}\}$, where $u_{\alpha/2}$ is the upper $\alpha/2$-quantile of the standard normal distribution. The corresponding power function is therefore approximated by

$$1 - \Phi\left(\frac{u_{\alpha/2}\bar\sigma - \Delta}{\sigma}\right) + \Phi\left(\frac{-u_{\alpha/2}\bar\sigma - \Delta}{\sigma}\right), \tag{21}$$

where $\Phi$ is the standard normal distribution function and $\bar\sigma^2$ is defined by (20).

As we mentioned earlier, variance distortion exists in the presence of PS. Therefore, it is necessary to use a consistent estimate of the variance $\sigma^2$. Notice that under the null hypothesis, $\sigma^2$ becomes $\sigma_0^2 = (\bar q_D(1-\bar q_D) + \sigma_D^2)/(2n_1) + (\bar q_C(1-\bar q_C) + \sigma_C^2)/(2n_2)$. We can estimate it with

$$V^* = \frac{\hat q_D(1-\hat q_D) + \hat\sigma_D^2}{2n_1} + \frac{\hat q_C(1-\hat q_C) + \hat\sigma_C^2}{2n_2}, \tag{22}$$

where $\hat\sigma_D^2 = D_2/n_1 - \hat q_D^2$ and $\hat\sigma_C^2 = C_2/n_2 - \hat q_C^2$ are consistent estimates of $\sigma_D^2$ and $\sigma_C^2$, respectively. The estimator $V^*$ was used by Li et al. (2009) for the candidate locus. In the marker locus case, we can show that $V^*$ is asymptotically equivalent to $\sigma^2$ under both the null hypothesis and the alternative hypothesis. In fact, $V^*$ is a special case of the variance estimate of the trend test statistic that will be studied in the next section, and it will be shown that $V^*$ is asymptotically equivalent to $\sigma^2$ even when HWE does not hold.

Now, a modification of $T$ takes the form

$$T^* = \frac{\hat q_D - \hat q_C}{(V^*)^{1/2}}. \tag{23}$$

The $T^*$ test with rejection region $\{|T^*| > u_{\alpha/2}\}$ has an approximate power function

$$1 - \Phi(u_{\alpha/2} - \Delta/\sigma) + \Phi(-u_{\alpha/2} - \Delta/\sigma). \tag{24}$$
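The power function (24) can be evaluated directly from the model parameters. The sketch below is one possible implementation, not the authors' code: it computes the genotype probabilities from (6)-(9), obtains $\Delta$ from (4)-(5) and $\sigma^2$ from (14) and (18), and then applies (24). The function names are assumptions, and the parameter values are those of the additive model in the simulation section.

```python
from statistics import NormalDist

def genotype_probs(w, p, q, f, delta):
    """Return (g21, g11, g22, g12): P(MM|Y=1), P(Mm|Y=1), P(MM|Y=2), P(Mm|Y=2)."""
    B = mm1 = mh1 = mm2 = mh2 = 0.0
    for wi, pi, qi, (f0, f1, f2), di in zip(w, p, q, f, delta):
        F = f0 * (1 - pi) ** 2 + 2 * f1 * pi * (1 - pi) + f2 * pi ** 2
        H = f2 + f0 - 2 * f1
        G = pi * (f2 - f1) + (1 - pi) * (f1 - f0)
        B += wi * F
        mm1 += wi * (qi ** 2 * F + di ** 2 * H + 2 * qi * di * G)
        mh1 += 2 * wi * (qi * (1 - qi) * F - di ** 2 * H + (1 - 2 * qi) * di * G)
        mm2 += wi * (qi ** 2 * (1 - F) - di ** 2 * H - 2 * qi * di * G)
        mh2 += 2 * wi * (qi * (1 - qi) * (1 - F) + di ** 2 * H - (1 - 2 * qi) * di * G)
    return mm1 / B, mh1 / B, mm2 / (1 - B), mh2 / (1 - B)

def var14(g2, g1, n):
    """Variance of an allele-frequency estimate, formula (14)."""
    return (4 * g2 * (1 - g2) + g1 * (1 - g1) - 4 * g2 * g1) / (4 * n)

def power_t_star(w, p, q, f, delta, n1, n2, alpha=0.05):
    """Approximate power of the T* test, formula (24)."""
    g21, g11, g22, g12 = genotype_probs(w, p, q, f, delta)
    Delta = (g21 + g11 / 2) - (g22 + g12 / 2)
    sigma = (var14(g21, g11, n1) + var14(g22, g12, n2)) ** 0.5
    u = NormalDist().inv_cdf(1 - alpha / 2)
    Phi = NormalDist().cdf
    return 1 - Phi(u - Delta / sigma) + Phi(-u - Delta / sigma)

w = [0.5, 0.5]
f = [(0.1, 0.2, 0.3), (0.1, 0.2, 0.3)]   # additive model, PH absent
# Null hypothesis with PS present: inflated type I error rate.
size_ps = power_t_star(w, [0.1, 0.3], [0.4, 0.2], f, [0.0, 0.0], 1000, 1000)
# Alternative hypothesis with PS absent: power.
power_no_ps = power_t_star(w, [0.2, 0.2], [0.3, 0.3], f, [0.05, 0.05], 1000, 1000)
print(round(size_ps, 3), round(power_no_ps, 3))
```

Under these assumptions the two values should be close to the asymptotic entries 0.203 and 0.809 reported in Table 3 for the same settings.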

EXTENSION TO TREND TEST

The trend test statistic is defined as

$$T_x = \frac{(D_2/n_1 - C_2/n_2) + x(D_1/n_1 - C_1/n_2)}{V_x^{1/2}},$$

where $x$ is a given real number between 0 and 1 and $V_x$ is an estimator of the variance of the numerator. From (6)–(9), it follows that the expectation of $(D_2/n_1 - C_2/n_2) + x(D_1/n_1 - C_1/n_2)$ is

$$\Delta_x = \frac{1}{B(1-B)}\left\{\sum_{i=1}^{K} w_i\left[q_i^2 + 2xq_i(1-q_i)\right]\sum_{j=0}^{2} f_{ji}p_{ji} - \sum_{i=1}^{K} w_i\sum_{j=0}^{2} f_{ji}p_{ji}\sum_{i=1}^{K} w_i\left[q_i^2 + 2xq_i(1-q_i)\right] + (1-2x)\sum_{i=1}^{K} w_i\delta_i^2(f_{2i}+f_{0i}-2f_{1i}) + 2\sum_{i=1}^{K} w_i\delta_i\left[q_i + x(1-2q_i)\right]\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\}.$$

Under the null hypothesis, this expectation is equal to $\mathrm{Cov}\left(\sum_{j=0}^{2} p_{jZ}f_{jZ},\ q_Z^2 + 2xq_Z(1-q_Z)\right)/[B(1-B)]$, where $Z$ is the random variable defined below (12). In the absence of PS, the random variable $q_Z$ degenerates to a constant, making the null expectation 0. Otherwise, the expectation is nonzero in general. Furthermore, under the assumptions in Remark 2, the expression $\Delta_x$ reduces to that given by Zheng et al. (2009).

For notational simplicity, we use g21 = P(MM|Y = 1), g11 = P(Mm|Y = 1), g22 = P(MM|Y = 2) and g12 = P(Mm|Y = 2) for the expressions given by (6)–(9). Using the facts that (D2, D1, D0) and (C2,C1,C0) follow trinomial distributions, we obtain the following formula

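$$\sigma_x^2 = \mathrm{Var}\left[\left(\frac{D_2}{n_1} - \frac{C_2}{n_2}\right) + x\left(\frac{D_1}{n_1} - \frac{C_1}{n_2}\right)\right] = \frac{1}{n_1}\left[g_{21}(1-g_{21}) + x^2g_{11}(1-g_{11}) - 2xg_{11}g_{21}\right] + \frac{1}{n_2}\left[g_{22}(1-g_{22}) + x^2g_{12}(1-g_{12}) - 2xg_{12}g_{22}\right].$$

Replacing the $g_{ij}$ by their consistent estimators, we get the following estimate of the variance of $(D_2/n_1 - C_2/n_2) + x(D_1/n_1 - C_1/n_2)$:

$$V_x = \frac{1}{n_1}\left[\frac{D_2}{n_1}\left(1-\frac{D_2}{n_1}\right) + x^2\frac{D_1}{n_1}\left(1-\frac{D_1}{n_1}\right) - 2x\frac{D_1}{n_1}\frac{D_2}{n_1}\right] + \frac{1}{n_2}\left[\frac{C_2}{n_2}\left(1-\frac{C_2}{n_2}\right) + x^2\frac{C_1}{n_2}\left(1-\frac{C_1}{n_2}\right) - 2x\frac{C_1}{n_2}\frac{C_2}{n_2}\right]. \tag{25}$$

By the Law of Large Numbers, $V_x$ is a consistent estimate under both the null hypothesis and the alternative hypothesis, even if HWE does not hold in any subpopulation. When $x = 0.5$, we have that $V_{0.5} = V^*$ and $T_{0.5} = T^*$. This shows that $V^*$ is a consistent estimate of the variance of $\hat q_D - \hat q_C$.

The asymptotic power function of the $T_x$ test with rejection region $\{|T_x| > u_{\alpha/2}\}$ is

$$1 - \Phi(u_{\alpha/2} - \Delta_x/\sigma_x) + \Phi(-u_{\alpha/2} - \Delta_x/\sigma_x). \tag{26}$$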
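A short sketch of the trend test computed from the genotype counts of Table 1 is given below, together with a numerical check that $x = 0.5$ reproduces $T^*$ and $V^*$ from (22)-(23). The function names and the example counts are illustrative assumptions.

```python
import math

def trend_test(case, control, x=0.5):
    """Return (T_x, V_x) using (25); case = (D2, D1, D0), control = (C2, C1, C0)."""
    n1, n2 = sum(case), sum(control)
    a2, a1 = case[0] / n1, case[1] / n1          # MM and Mm proportions in cases
    b2, b1 = control[0] / n2, control[1] / n2    # MM and Mm proportions in controls
    numerator = (a2 - b2) + x * (a1 - b1)
    v_x = ((a2 * (1 - a2) + x * x * a1 * (1 - a1) - 2 * x * a1 * a2) / n1
           + (b2 * (1 - b2) + x * x * b1 * (1 - b1) - 2 * x * b1 * b2) / n2)
    return numerator / math.sqrt(v_x), v_x

def t_star(case, control):
    """Return (T*, V*) using (22) and (23)."""
    n1, n2 = sum(case), sum(control)
    q_d = (2 * case[0] + case[1]) / (2 * n1)
    q_c = (2 * control[0] + control[1]) / (2 * n2)
    s_d = case[0] / n1 - q_d ** 2        # estimate of sigma_D^2
    s_c = control[0] / n2 - q_c ** 2     # estimate of sigma_C^2
    v = (q_d * (1 - q_d) + s_d) / (2 * n1) + (q_c * (1 - q_c) + s_c) / (2 * n2)
    return (q_d - q_c) / math.sqrt(v), v

# Hypothetical counts for 1000 cases and 1000 controls.
case, control = (120, 460, 420), (90, 420, 490)
t_half, v_half = trend_test(case, control, x=0.5)
t_s, v_s = t_star(case, control)
print(t_half, t_s)
```

The two statistics agree up to rounding, which mirrors the identity $T_{0.5} = T^*$ stated above.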

A SIMULATION STUDY

To study the finite sample performance of the mentioned tests, we conducted some simulations. We studied the impact of PS on the powers and type I error rates of the T test (defined in (1)) and the T* test (defined in (23)). In the simulations, we assumed the study population consisted of 2 subpopulations of equal sizes.

First we considered an additive mode of inheritance in each subpopulation. The underlying models were specified as follows. The frequencies of allele A at the disease locus were 0.2 and 0.2 (PS is absent) or 0.1 and 0.3 (PS is present) for the two subpopulations. The frequencies of allele M at the marker locus were 0.3 and 0.3 (PS is absent) or 0.4 and 0.2 (PS is present) for the two subpopulations. HWE was assumed to hold in the 2 subpopulations at both the marker and the disease loci. The penetrances of genotypes aa, Aa and AA in subpopulation 1 were 0.1, 0.2 and 0.3, respectively, and they were either 0.2, 0.3, 0.4 for subpopulation 2 (PH is present) or the same as those in subpopulation 1 (PH is absent). The LD measures in the two subpopulations were the same, that is, either 0 (null hypothesis) or 0.05 (alternative hypothesis).

We randomly generated the genotypes of 1000 cases and 1000 controls. The empirical type I error rates/powers at a 0.05 level of significance were estimated based on 5,000,000 replications. The asymptotic type I error rates/powers of the T* test were calculated using formula (24). The resulting powers are presented in Table 3.

Table 3.

Type I error rates/powers for marker locus under additive mode of inheritance

| Hypothesis (1) | PS (2) | PH (3) | T test: empirical | T* test: asymptotic | T* test: empirical |
|---|---|---|---|---|---|
| Null | Absent | Absent | 0.050 | 0.050 | 0.050 |
| Null | Absent | Present | 0.050 | 0.050 | 0.050 |
| Null | Present | Absent | 0.217 | 0.203 | 0.204 |
| Null | Present | Present | 0.885 | 0.876 | 0.876 |
| Alternative | Absent | Absent | 0.808 | 0.809 | 0.809 |
| Alternative | Absent | Present | 0.604 | 0.604 | 0.605 |
| Alternative | Present | Absent | 0.402 | 0.385 | 0.386 |
| Alternative | Present | Present | 0.153 | 0.142 | 0.143 |

(1) “Null”: δ1 = δ2 = 0; “Alternative”: δ1 = δ2 = 0.05.

(2) “Absent”: p1 = p2 = 0.2 and m1 = m2 = 0.3; “Present”: p1 = 0.1, p2 = 0.3 and m1 = 0.4, m2 = 0.2.

(3) “Absent”: the penetrances of genotypes aa, Aa, AA are 0.1, 0.2 and 0.3, respectively, in both of the subpopulations; “Present”: the penetrances are 0.1, 0.2 and 0.3 in subpopulation 1 and 0.2, 0.3 and 0.4 in subpopulation 2.

For the T* test, it is seen that the asymptotic type I error rates/powers and the empirical type I error rates/powers are very close to each other, with differences of no more than 0.001, showing an accurate approximation of the power function. As expected, when PS is absent, the type I errors are virtually equal to the nominal level 0.05; when PS is present, the type I error could be inflated a great deal, especially when PH is also present (0.876). The power is also influenced by the presence of PS and PH. For example, when both PS and PH are present, the power is only 0.142, compared with 0.809 for the case where neither PS nor PH is present.

The T test has type I error rates/powers close to those of the T* test in the absence of PS, with differences of no more than 0.001. In the presence of PS, there are minor differences that vary from 0.009 to 0.017.

The above simulations assumed that HWE held in any subpopulation. Our further simulations without an assumption of HWE (results not shown) showed that the T* test had type I error rates close to the nominal levels in the presence of PS, but the T test could distort the type I error rate, with its magnitude depending upon the strength of Hardy-Weinberg disequilibrium.
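The first simulation setting (additive model, Table 3) can be sketched as follows. This is not the authors' simulation code: it assumes that the marker genotype counts of cases and controls are drawn as trinomial samples with the probabilities in (6)-(9), uses far fewer replications than the study, and relies on hypothetical function names.

```python
import math
import random

def genotype_probs(w, p, q, f, delta):
    """Return (g21, g11, g22, g12) from (6)-(9)."""
    B = mm1 = mh1 = mm2 = mh2 = 0.0
    for wi, pi, qi, (f0, f1, f2), di in zip(w, p, q, f, delta):
        F = f0 * (1 - pi) ** 2 + 2 * f1 * pi * (1 - pi) + f2 * pi ** 2
        H = f2 + f0 - 2 * f1
        G = pi * (f2 - f1) + (1 - pi) * (f1 - f0)
        B += wi * F
        mm1 += wi * (qi ** 2 * F + di ** 2 * H + 2 * qi * di * G)
        mh1 += 2 * wi * (qi * (1 - qi) * F - di ** 2 * H + (1 - 2 * qi) * di * G)
        mm2 += wi * (qi ** 2 * (1 - F) - di ** 2 * H - 2 * qi * di * G)
        mh2 += 2 * wi * (qi * (1 - qi) * (1 - F) + di ** 2 * H - (1 - 2 * qi) * di * G)
    return mm1 / B, mh1 / B, mm2 / (1 - B), mh2 / (1 - B)

def t_star(case, control):
    """T* from (22)-(23); case = (D2, D1, D0), control = (C2, C1, C0)."""
    n1, n2 = sum(case), sum(control)
    q_d = (2 * case[0] + case[1]) / (2 * n1)
    q_c = (2 * control[0] + control[1]) / (2 * n2)
    v = ((q_d * (1 - q_d) + case[0] / n1 - q_d ** 2) / (2 * n1)
         + (q_c * (1 - q_c) + control[0] / n2 - q_c ** 2) / (2 * n2))
    return (q_d - q_c) / math.sqrt(v)

def rejection_rate(w, p, q, f, delta, n1=1000, n2=1000, reps=400, seed=1):
    """Empirical rejection rate of the two-sided T* test at level 0.05."""
    rng = random.Random(seed)
    g21, g11, g22, g12 = genotype_probs(w, p, q, f, delta)
    case_w = [g21, g11, 1 - g21 - g11]       # MM, Mm, mm among cases
    ctrl_w = [g22, g12, 1 - g22 - g12]       # MM, Mm, mm among controls
    u = 1.959964                              # upper 0.025 normal quantile
    rejections = 0
    for _ in range(reps):
        cases = rng.choices((2, 1, 0), weights=case_w, k=n1)
        ctrls = rng.choices((2, 1, 0), weights=ctrl_w, k=n2)
        D = [cases.count(g) for g in (2, 1, 0)]
        C = [ctrls.count(g) for g in (2, 1, 0)]
        if abs(t_star(D, C)) > u:
            rejections += 1
    return rejections / reps

w = [0.5, 0.5]
f = [(0.1, 0.2, 0.3), (0.1, 0.2, 0.3)]
rate_ps = rejection_rate(w, [0.1, 0.3], [0.4, 0.2], f, [0.0, 0.0])
print(rate_ps)
```

With only 400 replications the estimate is noisy, but under the null hypothesis with PS present and PH absent it should fall in the neighbourhood of the asymptotic value 0.203 in Table 3.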

Second we considered a fully penetrant recessive model in each subpopulation, with the penetrance being 1 for AA and 0 for Aa and aa. The other parameters are the same as those in Table 3, except that the LD measures under the alternative hypothesis are 0.01. The simulation results are reported in Table 4. For this mode of inheritance, the impact of PS on the type I error rates and the powers follows the same trend as for the additive mode of inheritance.

Table 4.

Type I error rates/powers for marker locus under fully recessive mode of inheritance (1)

| Hypothesis (2) | PS (3) | T test: empirical | T* test: asymptotic | T* test: empirical |
|---|---|---|---|---|
| Null | Absent | 0.050 | 0.050 | 0.050 |
| Null | Present | 1.000 | 1.000 | 1.000 |
| Alternative | Absent | 0.941 | 0.941 | 0.941 |
| Alternative | Present | 0.838 | 0.828 | 0.828 |

(1) In both subpopulations, the penetrances of genotypes aa, Aa and AA are 0, 0 and 1, respectively.

(2) “Null”: δ1 = δ2 = 0; “Alternative”: δ1 = δ2 = 0.01.

(3) “Absent”: p1 = p2 = 0.2 and m1 = m2 = 0.3; “Present”: p1 = 0.1, p2 = 0.3 and m1 = 0.4, m2 = 0.2.

Third we considered a special case, where the marker locus and the disease locus coincide, with a common allele frequency $p_i$ for the $i$th subpopulation, $i = 1, 2$, and where the LD measures are $\delta_i = p_i - p_i^2$. The other parameters are the same as those in Table 3 except that the null hypothesis (in each subpopulation the penetrances are independent of the genotypes) is different and the mode of inheritance under the alternative hypothesis is recessive. The detailed parameter settings are described and the simulation results are presented in Table 5. As expected, the T* test has type I error rates controlled at the nominal level when either PS or PH is absent (this is different from the marker locus case), but the type I error rate is inflated when both PS and PH are present. Furthermore, the presence of PS and/or PH also has an impact on the powers of the T* test, with a trend similar to that for the marker locus case. There are only minor differences between the T test and the T* test; under the null hypothesis with PS absent, the two tests give the same error rates.

Table 5.

Type I error rates/powers for candidate locus (1)

| Hypothesis (2) | PS (3) | PH (4) | T test: empirical | T* test: asymptotic | T* test: empirical |
|---|---|---|---|---|---|
| Null | Absent | Absent | 0.050 | 0.050 | 0.050 |
| Null | Absent | Present | 0.050 | 0.050 | 0.050 |
| Null | Present | Absent | 0.057 | 0.050 | 0.050 |
| Null | Present | Present | 0.850 | 0.838 | 0.838 |
| Alternative | Absent | Absent | 0.749 | 0.730 | 0.731 |
| Alternative | Absent | Present | 0.482 | 0.467 | 0.467 |
| Alternative | Present | Absent | 0.891 | 0.867 | 0.868 |
| Alternative | Present | Present | 0.108 | 0.089 | 0.090 |

(1) The marker locus and the disease locus coincide, and the LD measure in the $i$th subpopulation is $\delta_i = p_i - p_i^2$, with $p_i$ being the allele frequency of both marker and disease loci in the $i$th subpopulation.

(2) “Null”: the penetrances of the genotypes aa, Aa and AA are the same in each subpopulation; “Alternative”: the penetrances are different for the genotypes aa, Aa and AA in each subpopulation.

(3) “Absent”: p1 = p2 = 0.2; “Present”: p1 = 0.3, p2 = 0.1.

(4) “Absent”: under the null hypothesis, the penetrances of genotypes aa, Aa and AA are 0.2 in each subpopulation, and under the alternative hypothesis, the penetrances of genotypes aa, Aa and AA are 0.1, 0.1 and 0.2, respectively, in both of the two subpopulations; “Present”: under the null hypothesis, the penetrances of genotypes aa, Aa and AA are 0.2 in subpopulation 1 and 0.1 in subpopulation 2, and under the alternative hypothesis, the penetrances of genotypes aa, Aa and AA are 0.1, 0.1 and 0.2 in subpopulation 1 and 0.2, 0.2 and 0.3 in subpopulation 2.

DISCUSSION

Population-based marker-disease association analysis is a powerful tool but may suffer from PS. Our work provides closed forms for the expectation and variance of two commonly used test statistics, which enable us to study the type I error rate and power under various scenarios. We extend the work of Ewens and Spielman (1995) by relaxing the assumption of fully recessive penetrance and studying bias and variance distortion simultaneously. Our simulation results are in agreement with those from asymptotic approximations, supporting the theoretical findings. Both the analysis and the simulation results show that the presence of PS can inflate the type I error rate and decrease the power dramatically in marker-disease association analysis. Therefore, it is necessary to modify the test statistics to accommodate PS. Methods have been proposed in the literature for correcting bias and/or variance distortion in candidate gene-disease association analysis, including genomic control (Devlin and Roeder, 1999; Devlin et al., 2001), structured association (Pritchard et al., 2000; Satten et al., 2001; Pritchard and Donnelly, 2001) and delta centralization (Gorroochurn et al., 2006). Whittemore (2006) suggested sensitivity analysis. However, the performance of these methods in marker-disease association analysis is unclear and needs further investigation.

Acknowledgments

We would like to thank the Managing Editor, the Handling Editor and two reviewers for their helpful comments and suggestions leading to an improvement of the paper. We are grateful to Dr. B. J. Stone for editorial help. This research was supported in part by the National Natural Science Foundation of China 10701067 (HZ), the Outstanding Overseas Chinese Scholars Fund of Chinese Academy of Sciences (ZL), and NIH grant 5R37GM047845 (ZY).

APPENDIX I. Proof of (6) and (7)

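By the definition of the linkage disequilibrium measures $\delta_i$, the probabilities of gametes MA, Ma, mA and ma are

$$P(MA \mid S_i) = p_iq_i + \delta_i, \tag{27}$$
$$P(Ma \mid S_i) = (1-p_i)q_i - \delta_i, \tag{28}$$
$$P(mA \mid S_i) = (1-q_i)p_i - \delta_i \tag{29}$$

and

$$P(ma \mid S_i) = (1-q_i)(1-p_i) + \delta_i, \tag{30}$$

respectively. Random mating gives

$$P((MM,AA) \mid S_i) = [P(MA \mid S_i)]^2 = [p_iq_i + \delta_i]^2, \tag{31}$$
$$P((MM,Aa) \mid S_i) = 2[p_iq_i + \delta_i][(1-p_i)q_i - \delta_i], \tag{32}$$
$$P((MM,aa) \mid S_i) = [(1-p_i)q_i - \delta_i]^2, \tag{33}$$
$$P((Mm,AA) \mid S_i) = 2[p_iq_i + \delta_i][p_i(1-q_i) - \delta_i], \tag{34}$$
$$P((Mm,Aa) \mid S_i) = 2[p_iq_i + \delta_i][(1-p_i)(1-q_i) + \delta_i] + 2[(1-p_i)q_i - \delta_i][p_i(1-q_i) - \delta_i], \tag{35}$$
$$P((Mm,aa) \mid S_i) = 2[(1-p_i)q_i - \delta_i][(1-p_i)(1-q_i) + \delta_i]. \tag{36}$$

Here (MM, AA) is the joint genotype at the marker locus (MM) and the disease locus (AA), and similarly for the other five pairs. It follows from (31)–(36) that

$$P(MM \mid Y=1) = \frac{\sum_{i=1}^{K} w_i\left[P((MM,AA) \mid S_i)f_{2i} + P((MM,Aa) \mid S_i)f_{1i} + P((MM,aa) \mid S_i)f_{0i}\right]}{\sum_{i=1}^{K} w_i\sum_{j=0}^{2} f_{ji}p_{ji}} = \frac{1}{B}\left\{\sum_{i=1}^{K} w_iq_i^2\sum_{j=0}^{2} f_{ji}p_{ji} + \sum_{i=1}^{K} w_i\delta_i^2(f_{2i}+f_{0i}-2f_{1i}) + 2\sum_{i=1}^{K} w_iq_i\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\}$$

and

$$P(Mm \mid Y=1) = \frac{\sum_{i=1}^{K} w_i\left[P((Mm,AA) \mid S_i)f_{2i} + P((Mm,Aa) \mid S_i)f_{1i} + P((Mm,aa) \mid S_i)f_{0i}\right]}{\sum_{i=1}^{K} w_i\sum_{j=0}^{2} f_{ji}p_{ji}} = \frac{2}{B}\left\{\sum_{i=1}^{K} w_iq_i(1-q_i)\sum_{j=0}^{2} f_{ji}p_{ji} - \sum_{i=1}^{K} w_i\delta_i^2(f_{2i}+f_{0i}-2f_{1i}) + \sum_{i=1}^{K} w_i(1-2q_i)\delta_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right]\right\}.$$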
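The algebra leading from (31)-(36) to (6) and (7) can be checked numerically within a single subpopulation. The sketch below enumerates all ordered pairs of gametes under random mating and compares the penetrance-weighted sums with the closed-form terms inside the braces of (6) and (7). The function names and the parameter values are illustrative assumptions.

```python
from itertools import product

def brute_force(p, q, d, f):
    """Penetrance-weighted P(MM) and P(Mm) in one subpopulation by enumeration.

    f = (f0, f1, f2) is indexed by the number of A alleles.
    """
    # (carries M, carries A, frequency) for gametes MA, Ma, mA, ma, from (27)-(30)
    gametes = [
        (1, 1, p * q + d),
        (1, 0, (1 - p) * q - d),
        (0, 1, (1 - q) * p - d),
        (0, 0, (1 - q) * (1 - p) + d),
    ]
    mm = mh = 0.0
    for (m1, a1, x1), (m2, a2, x2) in product(gametes, repeat=2):
        weight = x1 * x2 * f[a1 + a2]   # random mating, then penetrance
        if m1 + m2 == 2:
            mm += weight
        elif m1 + m2 == 1:
            mh += weight
    return mm, mh

def closed_form(p, q, d, f):
    """Per-subpopulation terms inside the braces of (6) and (7)."""
    f0, f1, f2 = f
    F = f0 * (1 - p) ** 2 + 2 * f1 * p * (1 - p) + f2 * p ** 2
    H = f2 + f0 - 2 * f1
    G = p * (f2 - f1) + (1 - p) * (f1 - f0)
    mm = q * q * F + d * d * H + 2 * q * d * G
    mh = 2 * (q * (1 - q) * F - d * d * H + (1 - 2 * q) * d * G)
    return mm, mh

# Hypothetical parameter sets (p, q, delta, penetrances) with valid gamete frequencies.
settings = [
    (0.1, 0.4, 0.03, (0.05, 0.10, 0.40)),
    (0.3, 0.2, 0.05, (0.10, 0.20, 0.30)),
    (0.2, 0.3, 0.00, (0.00, 0.00, 1.00)),
]
max_gap = max(
    abs(b - c)
    for s in settings
    for b, c in zip(brute_force(*s), closed_form(*s))
)
print(max_gap)
```

The largest discrepancy should be at the level of floating-point rounding.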

APPENDIX II. Proof of (14)

Define two indicator functions

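$$I_{MM} = \begin{cases} 1, & \text{if the marker genotype of a randomly selected case is } MM; \\ 0, & \text{otherwise,} \end{cases} \qquad I_{Mm} = \begin{cases} 1, & \text{if the marker genotype of a randomly selected case is } Mm; \\ 0, & \text{otherwise.} \end{cases}$$

Then,

$$\hat q_D = \frac{\sum (2I_{MM} + I_{Mm})}{2n_1},$$

where the summation is taken over all cases. Since

$$\mathrm{Var}(I_{MM}) = P(MM \mid Y=1)\bigl(1 - P(MM \mid Y=1)\bigr), \quad \mathrm{Var}(I_{Mm}) = P(Mm \mid Y=1)\bigl(1 - P(Mm \mid Y=1)\bigr), \quad \mathrm{Cov}(I_{MM}, I_{Mm}) = -P(MM \mid Y=1)P(Mm \mid Y=1),$$

the variance of $\hat q_D$ is

$$\mathrm{Var}(\hat q_D) = \mathrm{Var}\left(\frac{\sum (2I_{MM} + I_{Mm})}{2n_1}\right) = \frac{1}{(2n_1)^2}\left[4n_1\mathrm{Var}(I_{MM}) + n_1\mathrm{Var}(I_{Mm}) + 4n_1\mathrm{Cov}(I_{MM}, I_{Mm})\right] = \frac{1}{4n_1}\left[4P(MM \mid Y=1)\bigl(1 - P(MM \mid Y=1)\bigr) + P(Mm \mid Y=1)\bigl(1 - P(Mm \mid Y=1)\bigr) - 4P(MM \mid Y=1)P(Mm \mid Y=1)\right].$$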

APPENDIX III. Proof of (16)

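Define $c_1 \equiv \sum_i q_i^2\alpha_i$, $c_2 \equiv \sum_i 2q_i(1-q_i)\alpha_i$, $x_1 \equiv P(MM \mid Y=1) - c_1$ and $x_2 \equiv P(Mm \mid Y=1) - c_2$. Substituting $c_1$, $c_2$, $x_1$ and $x_2$ into (14), we have

$$\mathrm{Var}(\hat q_D) = \frac{\bar q_D(1-\bar q_D) + \sigma_D^2}{2n_1} - \frac{(2x_1 + x_2)^2}{4n_1} + \frac{4(1 - 2c_1 - c_2)x_1 + (1 - 2c_2 - 4c_1)x_2}{4n_1}. \tag{37}$$

By the definitions,

$$2x_1 + x_2 = \frac{2}{B}\sum_i \delta_iw_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right], \tag{38}$$
$$2c_1 + c_2 = \sum_i\left[2q_i^2\alpha_i + 2q_i(1-q_i)\alpha_i\right] = 2\bar q_D, \tag{39}$$
$$4x_1 + x_2 = \frac{2}{B}\sum_i \delta_iw_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right](1 + 2q_i) + \frac{2}{B}\sum_i w_i\delta_i^2(f_{2i}+f_{0i}-2f_{1i}). \tag{40}$$

It follows from (38), (39) and (40) that

$$\frac{4(1 - 2c_1 - c_2)x_1 + (1 - 2c_2 - 4c_1)x_2}{4n_1} = \frac{4(1 - 2\bar q_D)x_1 + (1 - 4\bar q_D)x_2}{4n_1} = \frac{4x_1 + x_2 - 4(2x_1 + x_2)\bar q_D}{4n_1} = \frac{1}{2n_1B}\sum_i \delta_iw_i\left[p_i(f_{2i}-f_{1i}) + (1-p_i)(f_{1i}-f_{0i})\right](1 + 2q_i - 4\bar q_D) + \frac{1}{2n_1B}\sum_i w_i\delta_i^2(f_{2i}+f_{0i}-2f_{1i}). \tag{41}$$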

Equation (16) follows from (37), (38) and (41).
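As a numerical cross-check of this derivation, the sketch below evaluates the variance of the case allele-frequency estimate in two ways: directly from (14) with the probabilities in (6) and (7), and from the expanded form (16). The function name `var_case_two_ways` and the parameter values are illustrative assumptions, chosen so that the dominance term $f_{2i} + f_{0i} - 2f_{1i}$ is nonzero.

```python
def var_case_two_ways(w, p, q, f, delta, n1):
    """Return (variance via (14), variance via (16)) for the case estimate."""
    K = len(w)
    F, H, G = [], [], []
    for i in range(K):
        f0, f1, f2 = f[i]
        F.append(f0 * (1 - p[i]) ** 2 + 2 * f1 * p[i] * (1 - p[i]) + f2 * p[i] ** 2)
        H.append(f2 + f0 - 2 * f1)
        G.append(p[i] * (f2 - f1) + (1 - p[i]) * (f1 - f0))
    B = sum(w[i] * F[i] for i in range(K))

    # Genotype probabilities among cases, formulas (6) and (7).
    g21 = sum(w[i] * (q[i] ** 2 * F[i] + delta[i] ** 2 * H[i]
                      + 2 * q[i] * delta[i] * G[i]) for i in range(K)) / B
    g11 = 2 * sum(w[i] * (q[i] * (1 - q[i]) * F[i] - delta[i] ** 2 * H[i]
                          + (1 - 2 * q[i]) * delta[i] * G[i]) for i in range(K)) / B
    v14 = (4 * g21 * (1 - g21) + g11 * (1 - g11) - 4 * g21 * g11) / (4 * n1)

    # Expanded form (16).
    alpha = [w[i] * F[i] / B for i in range(K)]
    q_bar = sum(alpha[i] * q[i] for i in range(K))
    s2 = sum(alpha[i] * (q[i] - q_bar) ** 2 for i in range(K))
    sum_h = sum(w[i] * delta[i] ** 2 * H[i] for i in range(K))
    sum_g = sum(w[i] * delta[i] * G[i] for i in range(K))
    sum_gq = sum(w[i] * delta[i] * G[i] * (1 + 2 * q[i] - 4 * q_bar) for i in range(K))
    v16 = ((q_bar * (1 - q_bar) + s2) / (2 * n1)
           + sum_h / (2 * n1 * B)
           - sum_g ** 2 / (n1 * B ** 2)
           + sum_gq / (2 * n1 * B))
    return v14, v16

v14, v16 = var_case_two_ways(
    w=[0.5, 0.5],
    p=[0.1, 0.3],
    q=[0.4, 0.2],
    f=[(0.05, 0.10, 0.40), (0.10, 0.15, 0.50)],   # hypothetical penetrances
    delta=[0.03, 0.05],
    n1=1000,
)
print(v14, v16)
```

The two values should coincide up to rounding error.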

References

  1. Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386.
  2. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN. Demonstrating stratification in a European American population. Nat Genet. 2005;37:868–872. doi: 10.1038/ng1607.
  3. Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet. 2003;361:598–604. doi: 10.1016/S0140-6736(03)12520-2.
  4. Ewens WJ, Spielman RS. The transmission/disequilibrium test: history, subdivision, and admixture. Am J Hum Genet. 1995;57:455–464.
  5. Gorroochurn P, Heiman GA, Hodge SE, Greenberg DA. Centralizing the non-central chi-square: A new method to correct for population stratification in genetic case-control association studies. Genet Epi. 2006;30:277–289. doi: 10.1002/gepi.20143.
  6. Gorroochurn P, Hodge SE, Heiman G, Greenberg DA. Effect of population stratification on case-control association studies. II. False-positive rates and their limiting behavior as number of subpopulations increases. Hum Hered. 2004;58:40–48. doi: 10.1159/000081455.
  7. Heiman GA, Hodge SE, Gorroochurn P, Zhang J, Greenberg DA. Effect of population stratification on case-control association studies. Hum Hered. 2004;58:30–39. doi: 10.1159/000081454.
  8. Knowler WC, Williams RC, Pettitt DJ, Steinberg AG. Gm 3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet. 1988;43:520–526.
  9. Lander ES, Schork NJ. Genetic dissection of complex traits. Science. 1994;265:2037–2048. doi: 10.1126/science.8091226.
  10. Li CC. Population subdivision with respect to multiple alleles. Ann Hum Genet. 1969;33:23–29. doi: 10.1111/j.1469-1809.1969.tb01625.x.
  11. Li CC. Genetics of subdivided populations and its relationships with certain measures of association. Genet Epi. 1991;8:1–11. doi: 10.1002/gepi.1370080102.
  12. Li Z, Zhang H, Zheng G, Gastwirth JL, Gail MH. Excess false positive rate caused by population stratification and disease rate heterogeneity in case-control association studies. Comput Statist Data Anal. 2009;53:1767–1781.
  13. Qin H, Zhang H, Li Z. The impact of population stratification on commonly used statistical procedures in population-based QTL association studies. In: Hsiung A, Zhang C, Ying Z, editors. Random Walk, Sequential Analysis and Related Topics-A Festschrift in Honor of Yuan-Shih Chow. Singapore: World Scientific Publisher; 2006. pp. 311–333.
  14. Risch N. Searching for genetic determinants in the new millennium. Nature. 2000;405:847–856. doi: 10.1038/35015718.
  15. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516.
  16. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52(3):506–516.
  17. Whittemore AS. Population structure in genetic association studies. Proceedings of the American Statistical Association, Statistics in Epidemiology Section [CD-ROM]; Alexandria, VA: ASA; 2006.
  18. Zheng G, Li Z, Gail MH, Gastwirth JL. Impact of population substructure on trend tests for genetic case-control association studies. Biometrics. 2009. doi: 10.1111/j.1541–0420.2009.01264.x.
