Abstract
Genomewide association studies have become the primary tool for discovering the genetic basis of complex human diseases. Such studies are susceptible to the confounding effects of population stratification, in that the combination of allele-frequency heterogeneity with disease-risk heterogeneity among different ancestral subpopulations can induce spurious associations between genetic variants and disease. This article provides a statistically rigorous and computationally feasible solution to this challenging problem of unmeasured confounders. We show that the odds ratio of disease with a genetic variant is identifiable if and only if the genotype is independent of the unknown population substructure conditional on a set of observed ancestry-informative markers in the disease-free population. Under this condition, the odds ratio of interest can be estimated by fitting a semiparametric logistic regression model with an arbitrary function of a propensity score relating the genotype probability to ancestry-informative markers. Approximating the unknown function of the propensity score by B-splines, we derive a consistent and asymptotically normal estimator for the odds ratio of interest with a consistent variance estimator. Simulation studies demonstrate that the proposed inference procedures perform well in realistic settings. An application to the well-known Wellcome Trust Case-Control Study is presented. Supplemental materials are available online.
Keywords: B-spline, Case-control study, Principal components, Propensity score, Semiparametric logistic regression, Single nucleotide polymorphism
1. INTRODUCTION
One and a half decades ago, Risch and Merikangas (1996) showed that genomewide association (GWA) studies interrogating hundreds of thousands or more genetic variants have far greater power than traditional linkage studies in detecting the genetic factors for complex human diseases (e.g., cancer, diabetes, cardiovascular diseases, mental disorders). The catalog of single nucleotide polymorphisms (SNPs) (International HapMap Consortium 2005) and the advances in genotyping technology have made GWA studies a reality. Indeed, the number of these studies has grown exponentially in recent years, and their success in identifying the genetic basis of complex human diseases was declared “breakthrough of the year” by the journal Science in 2007. Most GWA studies use a case-control design, which compares the allele frequencies between disease subjects (i.e., cases) and disease-free subjects (i.e., controls). Because of the need to adjust for multiple testing, large samples are required to attain sufficient power in detecting the modest genetic effects expected for complex human diseases.
It is widely recognized that unobserved population substructure or stratification can generate spurious associations, that is, associations between disease and genetic variants that are not linked to any causative loci (Lander and Schork 1994). This problem of unmeasured confounders arises when the population under study is composed of several latent ancestral subpopulations that have different allele frequencies and also different risks of disease. The effects of stratification increase with sample size, so that even subtle population substructure can yield grossly inflated type I error for large GWA studies. Case-control studies are particularly prone to population stratification, especially when cases and controls are obtained from different sources. Indeed, the concern over population stratification led many to advocate family-based studies, which are more costly and less powerful than case-control studies.
Devlin and Roeder (1999) proposed a simple approach, known as “genomic control,” to correcting for population stratification in case-control studies. In this approach, the chi-squared statistics are calculated at a set of null markers across the genome, and their empirical median is divided by the median of the chi-squared distribution to produce an inflation factor, which is then used to adjust the test statistics at all candidate SNPs. This adjustment is crude, especially when the degrees of stratification vary substantially among the SNPs.
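The genomic control adjustment described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes 1-degree-of-freedom chi-squared statistics, and the function names and the flooring of the inflation factor at 1 are assumptions made here for illustration.

```python
from statistics import NormalDist, median

# Median of the chi-squared distribution with 1 degree of freedom, obtained as
# the square of the 0.75 quantile of the standard normal (about 0.455).
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(null_chi2):
    """Empirical median of the null-marker statistics divided by the
    theoretical chi-squared(1) median."""
    return median(null_chi2) / CHI2_1_MEDIAN


def genomic_control_adjust(candidate_chi2, lam):
    """Divide every candidate statistic by the inflation factor.

    Assumption: lam is floored at 1 so that statistics are never inflated.
    """
    lam = max(lam, 1.0)
    return [x / lam for x in candidate_chi2]
```

Because a single factor rescales every statistic, SNPs with different degrees of stratification all receive the same correction, which is the crudeness noted in the text.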
Pritchard et al. (2000) developed the so-called “structured association,” which assigns subjects to discrete subpopulations (with possibly fractional memberships) based on a set of unlinked genetic markers and tests for associations within subpopulations. This approach is computationally intractable for large numbers of markers. It is difficult to correctly estimate the population substructure or to correctly assign individuals to subpopulations, especially when the population under study is a continuous mixture of ancestral subpopulations.
The prevailing approach to adjusting for population stratification is principal components analysis, which infers continuous axes of genetic variation from genomic markers and includes those axes as covariates in the association analysis (Chen et al. 2003; Zhang, Zhu, and Zhao 2003; Price et al. 2006). It is unclear how many axes should be used or how their effects on disease should be modeled. Price et al. (2006) suggested including the 10 axes with the highest eigenvalues as covariates with linear effects, and their method, EIGENSTRAT, has been used in virtually all recent GWA studies. The assumption of linear effects is likely violated in practice. Zhang, Zhu, and Zhao (2003) and Chen et al. (2003) modeled the effects semiparametrically through a kernel function, which can be computationally unstable when the dimension is greater than two.
Epstein, Allen, and Satten (2007) calculated a stratification score, which is the individual’s estimated odds of disease as a function of his or her ancestry-informative markers, and tested for association within strata defined by the stratification score. The use of the stratification score tends to attenuate the estimates of the genetic effects at candidate SNPs and thus reduce statistical power, especially when the set of ancestry-informative markers is large (Lee et al. 2008). Recently, Zhao, Rebbeck, and Mitra (2009) calculated a propensity score (Rosenbaum and Rubin 1983) by regressing the genotype of each candidate SNP on ancestry-informative markers, and used the resulting propensity score as a linear term in the logistic regression model for the disease association; they did not theoretically justify their method.
This article provides a rigorous statistical treatment of the problem of population stratification. We formulate the confounding due to population stratification through appropriate statistical models, identify necessary and sufficient conditions for the use of genomic markers to correct for this confounding, and develop inference procedures that properly correct for the confounding. Our methods avoid the aforementioned limitations of the existing methods. Indeed, our theory yields valuable insights into the behavior of the existing methods. We demonstrate the advantages of the new methods through simulated and real data.
2. THEORY AND METHODS
2.1 Statistical Framework
We consider case-control GWA studies. Let D indicate, by the values 1 versus 0, whether or not the subject has the disease of interest, and let G denote the genotype at a candidate locus, which is defined as simply the number of minor alleles. We are interested in making inference about the effect of G on D in the presence of population stratification. We represent the unknown population substructure by an unobserved, possibly vector-valued variable U. We collect data on a set of potentially correlated genomic markers, denoted by Z, which is informative about the latent population substructure U.
It is natural to assume the following odds model:
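$$\frac{\Pr(D = 1 \mid G = g, U = u, Z = z)}{\Pr(D = 0 \mid G = g, U = u, Z = z)} = e^{\beta(g) + \gamma(u, z)}, \tag{1}$$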
where β is a parametric function, and γ is a completely unknown function. (The usual intercept term is absorbed in the arbitrary function γ.) Under the additive mode of inheritance, β(g) = βg; under the dominant or recessive mode of inheritance, β (g) = β I(g ≥ 1) or β I(g = 2); under the co-dominant mode of inheritance, β (g) = β1I(g = 1) + β2I(g = 2). The additive mode of inheritance is used almost exclusively in practice.
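The four codings of β(g) listed above can be written out as follows. This is a small illustrative sketch; the function names are hypothetical and the genotype is assumed to be coded as the number of minor alleles (0, 1, or 2).

```python
def inheritance_covariates(g, mode):
    """Covariates that multiply the beta coefficient(s) for genotype g."""
    if g not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    if mode == "additive":      # beta(g) = beta * g
        return (g,)
    if mode == "dominant":      # beta(g) = beta * I(g >= 1)
        return (int(g >= 1),)
    if mode == "recessive":     # beta(g) = beta * I(g == 2)
        return (int(g == 2),)
    if mode == "codominant":    # beta(g) = beta1 * I(g == 1) + beta2 * I(g == 2)
        return (int(g == 1), int(g == 2))
    raise ValueError("unknown mode of inheritance")


def log_odds_effect(g, mode, betas):
    """Value of beta(g) for a tuple of coefficients matching the mode."""
    return sum(b * x for b, x in zip(betas, inheritance_covariates(g, mode)))
```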
Remark 1
Because of the arbitrariness of the function γ, model (1) is semiparametric. Indeed, model (1) is completely nonparametric under the null hypothesis that β (g) = 0. This feature is important because proper control of type I error is a prominent task in GWA studies.
Remark 2
Model (1) allows certain ancestry-informative markers to be associated with disease (even after conditioning on the true ancestry) and includes the special case where ancestry-informative markers are independent of disease conditional on the true ancestry such that
$$\frac{\Pr(D = 1 \mid G = g, U = u, Z = z)}{\Pr(D = 0 \mid G = g, U = u, Z = z)} = e^{\beta(g) + \gamma(u)}.$$
We normally set γ (u, z) in (1) to γ (u) so as to estimate the marginal effect of G, although the general formulation allows estimation of the effect of G conditional on some other SNPs.
2.2 Identifiability
It is well known that odds ratios are identifiable from case-control data (Prentice and Pyke 1979). In our setting, however, the odds ratio parameter β might not be identifiable, because U is never observed. To be able to estimate β, we require that Z be sufficiently informative about U such that β can be correctly identified by stratifying on Z instead of on U. In other words, the following model must hold:
$$\frac{\Pr(D = 1 \mid G = g, Z = z)}{\Pr(D = 0 \mid G = g, Z = z)} = e^{\beta(g) + \xi(z)}, \tag{2}$$
where ξ is some unknown function different from γ. Theorem 1 and Proposition 1 provide necessary and sufficient conditions under which model (1) indeed implies model (2).
Theorem 1
A necessary and sufficient condition for model (2) to hold is that U is independent of G conditional on Z in the control population, that is,
$$\Pr(U = u \mid G = g, Z = z, D = 0) = \Pr(U = u \mid Z = z, D = 0). \tag{3}$$
Proposition 1
Remark 3
The foregoing results imply that it is possible to correct for the confounding effects of U if Z dissolves the link between U and G such that U is independent of G for each stratum of Z in the control population (or, equivalently, in the case population). Note that the conditional independence does not hold in the whole population or in the combined sample of cases and controls if there is population stratification and β is nonzero.
Remark 4
Conditions (3) and (4) are reasonable in that they require the aspect of the unknown population substructure U that needs to be controlled for when estimating the effect of G to be fully contained in the observed marker data Z. These conditions hold trivially if U is completely determined by Z. They also hold even if U is not completely determined by Z but the part of U not captured by Z is unrelated to G.
The proofs of Theorem 1 and Proposition 1 are given in the Appendix. Hereinafter, we assume that condition (3) or, equivalently, (4) holds. Our goal is to make inference on β.
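The sufficiency direction of Theorem 1 can be sketched in a few lines. This is an informal outline under stated assumptions, not the Appendix proof: it takes the odds in model (1) to be exp{β(g) + γ(u, z)}, reads condition (3) as saying that the conditional distribution of U given (G, Z) among controls does not depend on G, and writes F(u | ·) for the conditional distribution function of U (a notation introduced here).

```latex
\begin{align*}
\frac{\Pr(D=1 \mid G=g, Z=z)}{\Pr(D=0 \mid G=g, Z=z)}
 &= \int \frac{\Pr(D=1 \mid G=g, U=u, Z=z)}{\Pr(D=0 \mid G=g, U=u, Z=z)}
    \, dF(u \mid G=g, Z=z, D=0) \\
 &= e^{\beta(g)} \int e^{\gamma(u,z)} \, dF(u \mid G=g, Z=z, D=0) \\
 &= e^{\beta(g)} \int e^{\gamma(u,z)} \, dF(u \mid Z=z, D=0)
  \;=\; e^{\beta(g) + \xi(z)},
\qquad
\xi(z) = \log \int e^{\gamma(u,z)} \, dF(u \mid Z=z, D=0).
\end{align*}
```

The first equality follows from Bayes' rule, since Pr(D = 0 | g, u, z) dF(u | g, z) = Pr(D = 0 | g, z) dF(u | g, z, D = 0). The third equality is the only step that uses the conditional independence of U and G among controls.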
2.3 Inference Procedures
The observed data consist of (Di, Gi, Zi), i = 1, …, n, where n is the total number of study subjects. Under model (2), the likelihood takes the form
$$\prod_{i=1}^{n} \frac{\exp[D_i\{\beta(G_i) + \xi(Z_i)\}]}{1 + \exp\{\beta(G_i) + \xi(Z_i)\}}.$$
In general, it is difficult to maximize this likelihood, because the function ξ (·) is unknown and can be highly nonlinear. If Z is discrete with a small number of values, then ξ (Z) can be easily parameterized. If Z is continuous but very low-dimensional, we may apply a nonparametric smoothing method. If Z is discrete with many values or continuous with dimension higher than three, then it is desirable to find a low-dimensional representation of Z, say ψ(Z), such that ψ (Z) can replace Z in model (2). Such a ψ(Z) exists and is given in the following theorem.
Theorem 2
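If
$$\Pr(G = g \mid Z = z, D = 0) = \Pr\{G = g \mid \psi(Z) = \psi(z), D = 0\},$$
then
$$\frac{\Pr(D = 1 \mid G = g, Z = z)}{\Pr(D = 0 \mid G = g, Z = z)} = e^{\beta(g) + \xi\{\psi(z)\}}$$
for some function ξ.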
The proof is given in the Appendix. This result shows that we can simply identify a low-dimensional function ψ (Z) such that G and Z are independent given ψ (Z) in the control population. Clearly, ψ (Z) can be chosen to be Pr(G = g|Z = z, D = 0), which is at most two-dimensional. We refer to this function as the genotype-associated propensity score (GAPS) because it pertains to the propensity of having genotype g for a subject with markers Z.
There is an important difference between GAPS and the traditional propensity score of Rosenbaum and Rubin (1983). The latter is used to mitigate the nonrandomness of treatment assignment such that within each score stratum, the treatment assignment is independent of the potential outcome under the assumption of no unmeasured confounders. In contrast, our propensity score is used to remove the confounding of latent population substructure such that within each score stratum, the genotype of the candidate SNP is independent of the unmeasured confounders U. Under the odds ratio model, our propensity scores are based on the relationship between G and Z in the control population rather than in the entire population.
We may determine Pr(G = g|Z = z, D = 0) by supervised learning methods, such as discriminant analysis, classification trees, and support vector machines. It is computationally simpler to adopt a parametric model. For example, we may assume the proportional odds model acknowledging the ordinal nature of G. It is biologically more sensible and numerically more stable to formulate the effects of Z on G at the nucleotide level. Let G(1) (or G(2)) indicate, by the values 1 versus 0, whether the nucleotide inherited from the father (or the mother) is the minor allele, such that G = G(1) + G(2). It is natural to specify the following logistic model:
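$$\Pr(G^{(k)} = 1 \mid Z, D = 0) = \frac{e^{\eta^{T} Z}}{1 + e^{\eta^{T} Z}}, \quad k = 1, 2. \tag{5}$$
(We have expanded Z to include the unit component and absorbed the intercept term in η.) We set ψ(Z) = $\eta^{T} Z$, which is only one-dimensional. Without loss of generality, we focus on the additive mode of inheritance such that β(G) = βG. Then model (2) becomes
$$\frac{\Pr(D = 1 \mid G, Z)}{\Pr(D = 0 \mid G, Z)} = e^{\beta G + \xi(\eta^{T} Z)}. \tag{6}$$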
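The likelihood for η based on the observed data (Di, Gi, Zi), i = 1, …, n, is
$$\prod_{i=1}^{n} \left\{ \frac{e^{G_i \eta^{T} Z_i}}{(1 + e^{\eta^{T} Z_i})^{2}} \right\}^{1 - D_i}.$$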
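The resulting maximum likelihood estimator is denoted by η̂. Given η̂, we estimate β by maximizing the following pseudo-likelihood function:
$$\prod_{i=1}^{n} \frac{\exp[D_i\{\beta G_i + \xi(\hat{\eta}^{T} Z_i)\}]}{1 + \exp\{\beta G_i + \xi(\hat{\eta}^{T} Z_i)\}}.$$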
Because ξ(·) is an unknown function, adopting a nonparametric approach is desirable. The simplest approach is to stratify the subjects into K strata based on the (estimated) propensity scores {$\hat{\eta}^{T} Z_i$; i = 1, …, n}. Then β can be estimated by the standard logistic regression of D on G and (K – 1) dummy variables. The discontinuity of the stratification makes it difficult to account for the variation in estimating η. (Unlike the traditional propensity score, our propensity score is conditional on the disease status, so its influence function is not orthogonal to that of β; thus the variation in estimating the propensity score cannot be ignored.)
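The first stage and the simple stratified analysis described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes that, given Z, the genotype of a control is binomial with two trials and success probability expit(ηᵀZ), so that the score is Σ(Gi − 2pi)Zi and the information is Σ2pi(1 − pi)ZiZiᵀ. The function names, the fixed number of Newton iterations, and the rank-based stratum assignment are choices made here for illustration.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fit_eta(G, Z, D, use_status=0, n_iter=25):
    """Newton-Raphson estimate of eta using only subjects with D == use_status.

    Each row of Z must already include the unit component.
    """
    rows = [(g, z) for g, z, d in zip(G, Z, D) if d == use_status]
    p = len(Z[0])
    eta = [0.0] * p
    for _ in range(n_iter):
        score = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for g, z in rows:
            lin = sum(e * zj for e, zj in zip(eta, z))
            pr = 1.0 / (1.0 + math.exp(-lin))
            for j in range(p):
                score[j] += (g - 2.0 * pr) * z[j]
                for k in range(p):
                    info[j][k] += 2.0 * pr * (1.0 - pr) * z[j] * z[k]
        step = solve(info, score)
        eta = [e + s for e, s in zip(eta, step)]
    return eta


def quantile_strata(scores, K):
    """Assign each score to one of K strata of nearly equal size, by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * K // len(scores)
    return strata
```

Calling `fit_eta` with `use_status=0` uses the control sample; the resulting scores are then computed for all subjects, cases included, before stratifying.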
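We propose to use B-splines to approximate ξ. Specifically, let B1(·), …, BKn(·) be the B-spline basis functions of order m (m ≥ 3) with the interior knots being the k/(Kn – 2m) (k = 0, 1, …, Kn – 2m) quantiles of {$\hat{\eta}^{T} Z_i$; i = 1, …, n}. Then ξ(x) can be approximated by $\sum_{j=1}^{K_n} \theta_j B_j(x)$, where θ1, …, θKn are unknown coefficients. We estimate β and $\theta \equiv (\theta_1, \ldots, \theta_{K_n})^{T}$ by maximizing the following approximate log-likelihood function:
$$\sum_{i=1}^{n} \left[ D_i \Big\{ \beta G_i + \sum_{j=1}^{K_n} \theta_j B_j(\hat{\eta}^{T} Z_i) \Big\} - \log\Big\{ 1 + \exp\Big( \beta G_i + \sum_{j=1}^{K_n} \theta_j B_j(\hat{\eta}^{T} Z_i) \Big) \Big\} \right].$$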
This maximization can be done using the standard Newton–Raphson algorithm. The resulting estimators are denoted by β̂ and θ̂.
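The B-spline approximation and the approximate log-likelihood can be illustrated with the Cox-de Boor recursion. This is a minimal sketch under assumptions made here: the boundary knots `lo` and `hi` are repeated `order` times, the interior knots lie strictly between them, and the function names are hypothetical. The maximization itself is not shown.

```python
import math


def bspline_basis(x, interior_knots, lo, hi, order=3):
    """Evaluate all B-spline basis functions of the given order at x.

    Returns len(interior_knots) + order values (Cox-de Boor recursion).
    """
    t = [lo] * order + sorted(interior_knots) + [hi] * order
    n_basis = len(t) - order
    # Order-1 (piecewise constant) functions on half-open knot intervals.
    B = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if x == hi:
        B[n_basis - 1] = 1.0  # close the last interval on the right
    for k in range(2, order + 1):
        new_B = [0.0] * (len(t) - k)
        for i in range(len(t) - k):
            d1 = t[i + k - 1] - t[i]
            d2 = t[i + k] - t[i + 1]
            left = (x - t[i]) / d1 * B[i] if d1 > 0 else 0.0
            right = (t[i + k] - x) / d2 * B[i + 1] if d2 > 0 else 0.0
            new_B[i] = left + right
        B = new_B
    return B


def approx_loglik(beta, theta, D, G, scores, interior_knots, lo, hi, order=3):
    """Logistic log-likelihood with xi replaced by its B-spline expansion."""
    total = 0.0
    for d, g, s in zip(D, G, scores):
        basis = bspline_basis(s, interior_knots, lo, hi, order)
        lin = beta * g + sum(th * b for th, b in zip(theta, basis))
        total += d * lin - math.log1p(math.exp(lin))
    return total
```

Because the basis functions sum to 1 at every point, a constant ξ is represented exactly by equal coefficients, which is why no separate intercept is needed.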
The asymptotic properties of β̂ are stated in the following theorem. In this theorem and in its proof, given in the Appendix, ζ0 denotes the true value of any parameter or function ζ.
Theorem 3
Assume that the following regularity conditions hold:
(C.1) If $c^{T} Z = 0$ with probability 1 for some constant vector c, then c = 0. In addition, with positive probability.
(C.2) The support of Z is bounded, and has a bounded continuous support.
(C.3) The function γ0(u, z) and the joint density of (U, Z) have bounded fourth derivatives and are bounded away from 0 in the support of (U, Z).
(C.4) The number of the B-spline basis functions Kn → ∞, , and .
Then $n^{1/2}(\hat{\beta} - \beta_0)$ converges in distribution to a mean-0 normal random variable.
Remark 5
The first part of condition (C.1) is equivalent to the linear independence of Z, and the second part ensures that G is not completely determined by . This condition is required for the consistency of the parameter estimators. Conditions (C.2) and (C.3) impose some smoothness on the distribution of (U, Z). Condition (C.4) is satisfied by any $K_n = n^{\nu}$ with ν ∈ (1/14, 1/12).
Remark 6
The number of basis functions Kn plays the role of the smoothing parameter in B-splines. The value of Kn should be large enough so that the B-spline basis is flexible, but not too large, to avoid overfitting. We typically set Kn to be between 2 and 10. If the effects of population stratification are fully accounted for, then the genomic control λ (Devlin and Roeder 1999), which is the ratio of the median of the observed chi-squared test statistics to the theoretical median of the chi-squared distribution, will be equal to 1. Thus we may choose a value of Kn such that the resulting λ is close to 1.
The proof of Theorem 3 is given in the Appendix. The main challenge is that the dimension of the parameter vector θ increases with increasing sample size. In the proof, we show that the asymptotic variance of β̂ can be consistently estimated by treating model (6) as a parametric model with parameters β, θ, and η. A sandwich variance estimator is used to account for the variation due to the plug-in estimation of η.
It follows from Proposition 1(a) that the conclusion of Theorem 2 also holds if Pr(G = g|Z = z, D = 1) = Pr(G = g|ψ(Z) = ψ(z), D = 1). Thus Theorem 3 continues to hold if we replace D = 0 in model (5) by D = 1. We refer to our methods as GAPS(C) if η is estimated from the control sample and GAPS(D) if η is estimated from the disease (i.e., case) sample. Estimating η from the combined sample of cases and controls is generally invalid.
Although both are theoretically valid, GAPS(C) and GAPS(D) may yield different results in actual data analysis. In our approach, the propensity scores are first calculated from the control or case sample and then matched into the other sample. The scores calculated from the sample with a finer latent substructure might not be well matched in the sample with a coarser latent substructure, and the resulting GAPS method might not be fully effective. For this reason, we suggest using both GAPS(C) and GAPS(D) in practice. If the latent substructures are balanced between the case and control samples, then the two methods will yield similar results; otherwise, the mismatch of the two samples may warrant further investigation.
2.4 Dimension Reduction
GWA studies survey the entire genome with high-density genotyping platforms containing approximately 1 million SNPs. It is customary to infer ancestry from the dense SNP data themselves rather than collect additional data on ancestry-informative markers. For such high-dimensional markers, however, estimating ψ(Z) of Section 2.3 directly is a challenge. We suggest reducing the dimension of Z before applying the methods described in Section 2.3. In particular, we find a low-dimensional representation of Z by performing principal components analysis on a distance matrix among study subjects.
Define the n × p matrix X, whose ith row is the 1 × p vector of genotypes for the ith subject. We standardize each column of X to have mean 0 and variance 1. The resulting matrix is still denoted by X. We define an n × n distance matrix whose (i, j)th element is a distance measure between the ith and jth subjects (i.e., the ith and jth rows of X). The choice of the correlation of genotypes as the distance yields EIGENSTRAT, whereas setting the distance to the square root of the maximum of the correlation of genotypes and 0 yields Laplacian eigenmaps (Zhang, Niyogi, and McPeek 2009; Lee et al. 2010). The latter distance is less sensitive to outliers (i.e., subjects with markedly different ancestry) than the former. We perform an eigenvalue decomposition of the distance matrix. The first few principal components (PCs) are the low-dimensional representation of X and will replace Z in the inference procedures of Section 2.3. According to the multidimensional scaling theory, the space spanned by the first few PCs is the optimal low-dimensional embedding, which tends to preserve the original pairwise distances.
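The standardization and eigen-decomposition steps can be sketched on a toy genotype matrix. The sketch below is only illustrative: it uses the correlation-type matrix (the inner product of standardized rows divided by p), extracts only the leading PC by power iteration, and assumes every SNP column is polymorphic. The function names are hypothetical; a real analysis would use a numerical linear algebra library and extract several PCs.

```python
import math


def standardize_columns(X):
    """Center each column and scale it to variance 1 (assumes sd > 0)."""
    n, p = len(X), len(X[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            out[i][j] = (col[i] - mean) / sd
    return out


def similarity_matrix(Xs):
    """n x n matrix of average cross-products between subjects' rows."""
    n, p = len(Xs), len(Xs[0])
    return [[sum(a * b for a, b in zip(Xs[i], Xs[j])) / p for j in range(n)]
            for i in range(n)]


def top_pc(S, n_iter=500):
    """Leading eigenvector of a symmetric nonnegative-definite matrix S,
    computed by power iteration from a fixed starting vector."""
    n = len(S)
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

For example, with four subjects forming two genetically similar pairs, the entries of the leading PC take one sign for the first pair and the opposite sign for the second.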
Remark 7
By replacing the original set of markers Z with the PCs, we now require that condition (3) or equivalent condition (4) hold, with Z being the PCs. Note that the original Z might contain certain information that is not part of the PCs as long as that information is not associated with U or G. Although a larger number of PCs yields a more accurate low-dimensional representation, including a very large number of PCs in association analysis may cause numerical instability. For most studies, population substructures are well captured by the first few (e.g., 10) PCs.
2.5 Comparisons With Other Approaches
Our approach is in the same vein as EIGENSTRAT, the stratification score of Epstein et al. (EAS), and the propensity score of Zhao et al. (ZRM), in that all four approaches adjust for population stratification by including ancestry-related covariates in the logistic regression model for the SNP–disease association. For EIGENSTRAT, the covariates are the top 10 PCs. For EAS, the covariate pertains to the probability of disease given ancestry-informative markers. ZRM uses the propensity score based on the entire sample, whereas our approach uses the propensity score based on the control sample or case sample. In this section, we examine the differences between our approach and EIGENSTRAT, EAS, and ZRM.
Both our approach and EIGENSTRAT use principal components analysis for dimension reduction. In our approach, the ancestry-related covariate used in the SNP–disease association model is a scalar function of the PCs, and its effect on the disease status is modeled nonparametrically. In EIGENSTRAT, the PCs enter into the SNP–disease association model as linear terms. As shown in the proof of Theorem 2,
$$\xi(z) = \log E\{ e^{\gamma(U, z)} \mid Z = z, D = 0 \},$$
so the effects of PCs on the odds ratio of disease can be highly nonlinear. For example, if U is dichotomous with $\Pr(U = u \mid Z = z, D = 0) = e^{u\alpha z}/(1 + e^{\alpha z})$ and γ(u, z) = γu, then $\xi(z) = \log\{(1 + e^{\gamma + \alpha z})/(1 + e^{\alpha z})\}$. Thus, EIGENSTRAT can yield biased estimation of the SNP–disease association, although it is optimal when the assumption of linear effects holds. If there is only a single PC, then the kernel method of Chen et al. (2003) is essentially equivalent to our spline method. If there are more than two PCs, then this kernel method is unstable, whereas our method is unaffected, because the propensity score is one-dimensional.
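The dichotomous-U example above can be checked numerically. The short sketch below evaluates the closed form for ξ(z) and compares it with the logarithm of the conditional mean of exp(γU) among controls; the values γ = −1.2 and α = 1 used in the check are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def xi_closed_form(z, gamma, alpha):
    """xi(z) = log{(1 + e^(gamma + alpha z)) / (1 + e^(alpha z))}."""
    return math.log((1.0 + math.exp(gamma + alpha * z))
                    / (1.0 + math.exp(alpha * z)))


def xi_by_expectation(z, gamma, alpha):
    """log E[exp(gamma * U) | Z = z, D = 0] for U taking values 0 and 1."""
    p1 = math.exp(alpha * z) / (1.0 + math.exp(alpha * z))
    return math.log((1.0 - p1) + p1 * math.exp(gamma))


def second_difference(f, z, h):
    """f(z + h) + f(z - h) - 2 f(z); zero for any linear function f."""
    return f(z + h) + f(z - h) - 2.0 * f(z)
```

A nonzero second difference of ξ, as obtained with these illustrative values, shows that a linear term in z cannot reproduce ξ exactly.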
The stratification score of EAS is an estimate of ξ in model (2) with β = 0, and thus removes the bias due to population stratification under β = 0. To effectively remove the bias, the stratification score must be calculated correctly. EAS calculates the stratification score as a linear combination of ancestry-informative markers, even though ξ may be a highly nonlinear function. Because the stratification score of EAS involves the disease status, the variation due to estimation of the score may not be ignored in the association testing. When β ≠ 0, the stratification score is not an appropriate estimate of ξ, so the resulting estimator of β may be biased. Originally, EAS selected ancestry-informative markers via partial least squares regression, which can result in substantial overfitting and loss of power when there are a large number of markers (Lee et al. 2008). Recently, Sarasua et al. (2009) suggested calculating the stratification score based on the top 10 PCs instead of on individual markers.
Both our approach and ZRM rely on propensity scores. However, our propensity scores are first calculated from the control or case sample and then matched to the other sample, whereas those of ZRM are calculated from the combined sample of cases and controls. As shown in Proposition 1(b), the propensity scores based on the entire sample are inappropriate when β ≠ 0, and thus ZRM does not yield consistent estimators of nonzero β. Even under the null hypothesis that β = 0, ZRM still may produce biased estimation of β because it enters the propensity score into the SNP–disease association model as a linear term, which is likely a faulty functional form. ZRM does not account for variation due to estimation of the propensity scores, and thus may underestimate the variance of the estimated genetic effect. ZRM has some other limitations, including requiring an arbitrary dichotomization of G and accommodating only low-dimensional markers.
3. SIMULATION STUDIES
We conducted a series of simulation studies to assess the performance of the proposed methods in realistic settings. We used a case-control design with 1000 cases and 1000 controls. We generated data for controls as follows. We first simulated 1 million markers in the same manner as done by Price et al. (2006). We then calculated the PCs for these markers and set Z = (1, Z1, …, Z10)T, where Z1, …, Z10 are the top PCs. Given Z, we generated G from model (5) with . We generated data for cases under model (1) with β(g) = βg and γ(u, z) = γu using acceptance–rejection sampling. Specifically, we generated Z and G as before and generated U from the logistic probability function
where ε is an independent normal variable with mean 0 and variance σ². (Note that U ranges between 0 and 1 and can be understood as an ancestry proportion, such as the fraction of European ancestry. If σ² = 0, then U is determined by Z without error; if σ² ≠ 0, then U is not fully captured by Z.) We accepted the observation (G, Z) if a uniform [0, 1] random variable was less than $M^{-1} e^{\beta G + \gamma U}$, where M is an upper bound for $e^{\beta G + \gamma U}$, γ = −1.2, and β = −0.4, 0, or 0.4.
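The acceptance–rejection step for generating cases can be sketched as follows. The sketch assumes G takes values 0, 1, 2 and U lies in [0, 1], so that M can be taken as exp{max(0, 2β) + max(0, γ)}. The generator `toy_draw_control` is a hypothetical stand-in for the control-population distribution of (G, Z, U) and does not reproduce the simulation design of the paper.

```python
import math
import random


def envelope_constant(beta, gamma):
    """Upper bound M for exp(beta*g + gamma*u) over g in {0,1,2}, u in [0,1]."""
    return math.exp(max(0.0, 2.0 * beta) + max(0.0, gamma))


def acceptance_probability(g, u, beta, gamma):
    """Probability of keeping a draw from the control distribution as a case."""
    return math.exp(beta * g + gamma * u) / envelope_constant(beta, gamma)


def sample_cases(n_cases, draw_control, beta, gamma, rng):
    """Draw (g, z, u) from the control distribution until n_cases are accepted."""
    cases = []
    while len(cases) < n_cases:
        g, z, u = draw_control(rng)
        if rng.random() < acceptance_probability(g, u, beta, gamma):
            cases.append((g, z, u))
    return cases


def toy_draw_control(rng):
    """Hypothetical control generator used only to exercise the sampler."""
    return rng.choice([0, 1, 2]), None, rng.random()
```

A call such as `sample_cases(1000, toy_draw_control, 0.4, -1.2, random.Random(0))` returns accepted triples whose distribution is the control distribution reweighted by the odds of disease.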
We used B-splines of order 3 on the 10 quantiles of the propensity scores. For comparisons, we included the logistic regression of D on G and Z, as well as the logistic regression of D on G and U. The former pertains to EIGENSTRAT with Z being the PCs, and the latter pertains to the ideal situation of known population substructure. We also included EAS and ZRM. Table 1 summarizes the results of the simulation studies for making inference on β.
Table 1.
Method | Statistic | σ² = 0, β = −0.4 | σ² = 0, β = 0 | σ² = 0, β = 0.4 | σ² = 0.25, β = −0.4 | σ² = 0.25, β = 0 | σ² = 0.25, β = 0.4
---|---|---|---|---|---|---|---
Known structure | Bias | −0.0034 | −0.0008 | 0.0023 | −0.0029 | −0.0002 | 0.0023
 | SE | 0.0807 | 0.0800 | 0.0846 | 0.0791 | 0.0788 | 0.0828
 | SEE | 0.0804 | 0.0802 | 0.0832 | 0.0794 | 0.0791 | 0.0820
 | Power | 0.9609 | 0.00098 | 0.9356 | 0.9667 | 0.00093 | 0.9443
GAPS | Bias | 0.0008 | 0.0041 | 0.0069 | 0.0005 | 0.0039 | 0.0067
 | SE | 0.0814 | 0.0803 | 0.0844 | 0.0812 | 0.0802 | 0.0841
 | SEE | 0.0813 | 0.0805 | 0.0835 | 0.0813 | 0.0804 | 0.0834
 | Power | 0.9471 | 0.00098 | 0.9399 | 0.9481 | 0.00093 | 0.9387
EIGENSTRAT | Bias | 0.0389 | 0.0403 | 0.0431 | 0.0345 | 0.0374 | 0.0404
 | SE | 0.0767 | 0.0762 | 0.0808 | 0.0766 | 0.0761 | 0.0805
 | SEE | 0.0776 | 0.0777 | 0.0810 | 0.0777 | 0.0777 | 0.0809
 | Power | 0.9220 | 0.0023 | 0.9863 | 0.9267 | 0.0020 | 0.9857
EAS | Bias | 0.1085 | 0.0302 | 0.0236 | 0.1109 | 0.0297 | 0.0230
 | SE | 0.0688 | 0.0747 | 0.0799 | 0.0682 | 0.0745 | 0.0797
 | SEE | 0.0705 | 0.0771 | 0.0805 | 0.0700 | 0.0770 | 0.0804
 | Power | 0.8258 | 0.0014 | 0.9773 | 0.8270 | 0.0014 | 0.9766
ZRM | Bias | 0.0479 | 0.0381 | 0.0352 | 0.0446 | 0.0352 | 0.0326
 | SE | 0.0759 | 0.0733 | 0.0807 | 0.0758 | 0.0731 | 0.0804
 | SEE | 0.0774 | 0.0775 | 0.0807 | 0.0774 | 0.0775 | 0.0806
 | Power | 0.9036 | 0.0018 | 0.9822 | 0.9106 | 0.0016 | 0.9816
NOTE: Bias is the mean of the estimator of β minus the true value. SE is the standard error of the parameter estimator. SEE is the mean of the standard error estimator. Power is the empirical type I error (under β = 0) or power (under β = −0.4 or 0.4) for testing the null hypothesis H0 : β = 0 at the 10⁻³ nominal significance level. Each entry is based on 1 million replicates. The results for GAPS pertain to GAPS(C); those of GAPS(D) are nearly identical.
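The four summary statistics defined in the note can be computed from simulation replicates as sketched below. The sketch assumes that each replicate supplies an estimate of β and its estimated standard error and that the test of H0 is a two-sided Wald test; the source does not state the form of the test, so that choice and the function name are assumptions.

```python
from statistics import NormalDist, mean, stdev


def summarize_replicates(estimates, std_errors, true_beta, alpha=1e-3):
    """Bias, empirical SE, mean estimated SE, and rejection rate."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    rejections = sum(abs(b / s) > crit for b, s in zip(estimates, std_errors))
    return {
        "bias": mean(estimates) - true_beta,
        "se": stdev(estimates),
        "see": mean(std_errors),
        "power": rejections / len(estimates),
    }
```

Under β = 0 the rejection rate estimates the type I error, and under β ≠ 0 it estimates power, matching the dual use of the "Power" rows in Table 1.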
For GAPS, the parameter estimator is virtually unbiased, the standard error estimator is accurate, and the type I error is close to the nominal significance level. Indeed, GAPS does nearly as well as the logistic regression with known substructure in terms of bias, standard error, type I error, and power. EIGENSTRAT, EAS and ZRM all yield biased parameter estimation and inflated type I error. Because the biases are positive whether the true β is positive or negative, these three methods are more powerful than GAPS under β = 0.4 and less powerful than GAPS under β = −0.4.
We also assessed the performance of GAPS under misspecified models. For example, we generated G from model (5) with , but ignored the interaction in calculating the propensity scores. Under β = 0, the bias of β̂ was approximately 0.004 with standard error 0.08, and the type I error was approximately 0.0008 at the 10⁻³ nominal significance level, whether σ² = 0 or 0.25. In addition, when the propensity scores were generated from the proportional odds model but estimated under model (5), the simulation results remained virtually the same.
Finally, we considered the simulation setting of Price et al. (2006), for which EIGENSTRAT is optimal. We found that GAPS had accurate control of the type I error and was nearly as powerful as EIGENSTRAT. For discrete subpopulations with moderate ancestry differences between cases and controls, the power to detect the causal SNPs was approximately 0.460 for GAPS and 0.483 for EIGENSTRAT; for an admixed population with ancestry differences between cases and controls based on ancestry risk r = 2, the power to detect the causal SNPs was approximately 0.468 for GAPS and 0.485 for EIGENSTRAT.
4. REAL EXAMPLE
The Wellcome Trust Case Control Consortium (WTCCC) (2007) conducted a landmark GWA study in the British population using the Affymetrix GeneChip 500K Mapping Array Set. The study examined approximately 2000 individuals for each of seven major diseases—bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D)—and a shared set of approximately 3000 controls. The seven conditions are all common familial diseases of major public health significance. The control individuals came from two sources: 1500 individuals from the 1958 British Birth Cohort and 1500 individuals from blood donors. Even after exclusion of individuals with evidence of recent non-European ancestry, the British population is heterogeneous, having been shaped by several waves of immigration from southern and northern Europe. The case and control samples might differ in the distribution of ancestry, due either to bias in ascertaining controls or to overrepresentation of sub-populations with higher disease risk in cases.
We wished to compare the effectiveness of the proposed and existing methods in correcting for population stratification in the WTCCC data. After downloading the data from the WTCCC Web site, we excluded the SNPs and subjects on the WTCCC’s exclusion lists. In addition, we removed SNPs on the sex chromosomes, as well as any SNP with missing rate > 5%, minor allele frequency < 1%, or Hardy–Weinberg p-value < 0.001. The resulting data contain approximately 395,000 SNPs with slightly less than 5000 subjects for each case-control comparison; see Table 2 for details. For principal components analysis, we removed two small regions of known extensive local LD, namely the MHC region on chromosome 6 (25–35 Mb) and the chromosome 8 inversion region (7–13 Mb), excluded SNPs with minor allele frequencies < 5%, and generated a pruned subset of SNPs with pairwise correlations r² < 0.25 through two iterations of the PLINK (Purcell et al. 2007) command “indep-pairwise 200 100 0.25.” For each disease, the pruned subset contains approximately 81,000 SNPs. We performed principal components analysis on the pruned data sets by iteratively deleting any SNPs that are strongly associated with the top 10 PCs (p-values < 10⁻⁸). We used the top 10 PCs for all methods. Nonparametric estimation showed evidence of nonlinear effects of the PCs on the odds ratios of disease. For GAPS, we used B-splines of order 3 on the 10 quantiles of the propensity scores.
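The per-SNP quality-control thresholds quoted above (missing rate > 5%, minor allele frequency < 1%, Hardy–Weinberg p-value < 0.001) can be illustrated with a small filter. This is a sketch under assumptions made here: genotypes are coded 0, 1, 2 with `None` for a missing call, and the Hardy–Weinberg p-value comes from a 1-degree-of-freedom chi-squared goodness-of-fit test, which may differ from the test actually used in the analysis. The function name is hypothetical.

```python
import math


def snp_passes_qc(genotypes, max_missing=0.05, min_maf=0.01, min_hwe_p=0.001):
    """Return True if the SNP survives the three filters."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return False
    missing_rate = 1.0 - n / len(genotypes)
    freq = sum(called) / (2.0 * n)          # frequency of the counted allele
    maf = min(freq, 1.0 - freq)
    if missing_rate > max_missing or maf < min_maf:
        return False
    # Chi-squared test of Hardy-Weinberg proportions (assumed form of the test).
    observed = [called.count(k) for k in (0, 1, 2)]
    expected = [n * (1 - freq) ** 2, 2 * n * freq * (1 - freq), n * freq ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-squared(1)
    return p_value >= min_hwe_p
```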
Table 2.
| | BD | CAD | CD | HT | RA | T1D | T2D |
|---|---|---|---|---|---|---|---|
| # subjects | 4806 | 4864 | 4686 | 4890 | 4798 | 4901 | 4862 |
| # analyzed SNPs | 394,661 | 394,668 | 394,849 | 394,749 | 394,703 | 394,677 | 394,641 |
| # SNPs used in PCs | 81,348 | 81,246 | 80,971 | 81,256 | 81,281 | 81,289 | 81,309 |
| **Genomic control λ** | | | | | | | |
| Unadjusted | 1.118 | 1.072 | 1.111 | 1.068 | 1.039 | 1.058 | 1.083 |
| EIGENSTRAT | 1.080 | 1.067 | 1.078 | 1.070 | 1.038 | 1.057 | 1.066 |
| EAS | 1.080 | 1.062 | 1.088 | 1.069 | 1.036 | 1.053 | 1.061 |
| ZRM | 1.048 | 1.064 | 1.072 | 1.074 | 1.040 | 1.061 | 1.059 |
| GAPS(C) | 1.057 | 1.012 | 1.013 | 1.005 | 0.980 | 0.998 | 1.014 |
| GAPS(D) | 1.036 | 0.999 | 1.005 | 0.997 | 0.970 | 0.989 | 1.004 |
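For readers unfamiliar with the genomic control λ reported in Table 2, the sketch below uses the common definition (Devlin and Roeder 1999): the median of the observed one-degree-of-freedom chi-square statistics divided by the median of the null distribution, approximately 0.4549. That this is the exact variant behind the table is an assumption; the function names and the simulated statistics are purely illustrative.

```python
import random
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Ratio of the observed median test statistic to its null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def simulate_null_stats(n_snps, inflation=1.0, seed=0):
    """Hypothetical data: squared standard normals, optionally scaled up."""
    rng = random.Random(seed)
    return [inflation * rng.gauss(0.0, 1.0) ** 2 for _ in range(n_snps)]


if __name__ == "__main__":
    for factor in (1.0, 1.1):
        chi2_stats = simulate_null_stats(20000, inflation=factor)
        print(factor, round(genomic_control_lambda(chi2_stats), 3))
```

A value near 1 indicates that the bulk of the test statistics follows the null distribution, whereas values such as 1.11 indicate systematic inflation.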
Figure 1 provides quantile–quantile plots for Crohn’s disease. Without adjustment for population stratification, the observed test statistics are noticeably higher than the expected values, even in the left tail, and λ is approximately 1.11. EIGENSTRAT, EAS, and ZRM change the results only slightly, with values of λ of 1.08, 1.09, and 1.07, respectively. For both GAPS(C) and GAPS(D), the observed statistics agree well with the expected values except in the extreme right tail, which corresponds to significant association signals, and the values of λ are approximately 1.01.
Figure 2 compares the unadjusted analysis, EIGENSTRAT, and GAPS(C) for Crohn’s disease. The first two methods yield highly similar results, whereas GAPS(C) reduces the number of significant SNPs considerably and yields quite different rankings of the top SNPs. The numbers of SNPs reaching the Bonferroni threshold of genomewide significance are 70, 72, 68, 70, 58, and 63 for the unadjusted analysis, EIGENSTRAT, EAS, ZRM, GAPS(C), and GAPS(D), respectively. For the top 100 SNPs picked by the unadjusted analysis, the Pearson correlation coefficients of the unadjusted analysis with EIGENSTRAT, EAS, ZRM, GAPS(C), and GAPS(D) are 0.967, 0.981, 0.985, 0.935, and 0.918, respectively, and the corresponding Spearman rank correlation coefficients are 0.966, 0.986, 0.968, 0.915, and 0.938.
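The comparisons in this paragraph rest on three simple quantities: a Bonferroni threshold, a Pearson correlation, and a Spearman rank correlation. The sketch below shows one way to compute them. The significance level of 0.05 is an assumption (the text does not state it), and which per-SNP quantity was correlated (test statistics or p-values) is likewise not specified, so the inputs here are generic numeric vectors.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level; alpha = 0.05 is an assumed family-wise level."""
    return alpha / n_tests


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Number of analyzed SNPs for Crohn's disease, taken from Table 2.
    print(bonferroni_threshold(394849))
    # Hypothetical statistics from two methods for five SNPs.
    unadjusted = [1.0, 2.0, 3.0, 4.0, 5.0]
    adjusted = [1.0, 4.0, 9.0, 16.0, 25.0]
    print(pearson(unadjusted, adjusted), spearman(unadjusted, adjusted))
```

The Spearman coefficient depends only on the ordering of the SNPs, so it reflects changes in ranking directly, while the Pearson coefficient is also sensitive to the magnitudes.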
The results for the other six diseases are similar to those for Crohn’s disease. As shown in Table 2, the λ values of GAPS(C) and GAPS(D) are very close to 1, except for BD and RA. For all seven diseases, the λ values produced by GAPS(C) and GAPS(D) are closer to 1 than those of EIGENSTRAT, EAS, and ZRM, the only exception being GAPS(C) for BD. When we increased the number of knots from 10 to 12 for BD, the λ values for GAPS(C) and GAPS(D) became 1.015 and 0.993. Incidentally, replacing the standard PCs with those of Laplacian eigenmaps did not produce markedly different results; the details are omitted.
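To make the role of the knots concrete, the following sketch builds a B-spline basis of order 3 (piecewise quadratic) on knots placed at quantiles of a set of propensity scores, using the Cox–de Boor recursion. It is an illustration under stated assumptions rather than the implementation used for the analysis: the interior knots are taken to be equally spaced quantiles, the boundary knots are repeated `order` times, and all function names are hypothetical.

```python
import math


def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list."""
    pos = p * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def quantile_knots(scores, n_interior, order):
    """Knot vector: repeated boundary knots plus interior quantile knots."""
    s = sorted(scores)
    interior = [quantile(s, j / (n_interior + 1)) for j in range(1, n_interior + 1)]
    return [s[0]] * order + interior + [s[-1]] * order


def bspline_basis(x, knots, order):
    """Values at x of all len(knots) - order B-spline basis functions."""
    n_intervals = len(knots) - 1
    # Order 1: indicator functions of the half-open knot intervals.
    b = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n_intervals)]
    if x == knots[-1]:
        # Close the right end by assigning x to the last non-empty interval.
        for i in range(n_intervals - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                b[i] = 1.0
                break
    # Cox-de Boor recursion from order k - 1 to order k.
    for k in range(2, order + 1):
        new = [0.0] * (len(knots) - k)
        for i in range(len(knots) - k):
            left_den = knots[i + k - 1] - knots[i]
            right_den = knots[i + k] - knots[i + 1]
            left = (x - knots[i]) / left_den * b[i] if left_den > 0 else 0.0
            right = (knots[i + k] - x) / right_den * b[i + 1] if right_den > 0 else 0.0
            new[i] = left + right
        b = new
    return b


if __name__ == "__main__":
    scores = [i / 100 for i in range(101)]  # hypothetical propensity scores
    knots = quantile_knots(scores, n_interior=10, order=3)
    basis = bspline_basis(0.37, knots, order=3)
    print(len(basis), sum(basis))
```

With 10 interior knots and order 3 the basis has 13 functions; moving to 12 interior knots, as was done for BD, adds two more and allows a more flexible function of the propensity score.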
5. DISCUSSION
We have developed a theoretically justifiable and computationally feasible approach to correcting for population stratification in GWA studies. Our methods maintain proper control of type I error and provide consistent estimators of odds ratios with proper variance estimators. The computing time for GAPS is longer than that for EIGENSTRAT, but less than twice as long. The relevant software will be posted on our Web site soon.
In the WTCCC example, we used the top 10 PCs for all methods. This approach, originally suggested by Price et al. (2006), seems to work well empirically in terms of controlling λ. It is possible to pick the PCs based on the Tracy–Widom (1994) test; however, that strategy tends to identify a very large number of PCs, causing numerical instability in downstream association analysis. We can determine which PCs to use by a variable selection procedure, at the cost of additional computation time.
The estimation of the odds ratio of disease with a candidate SNP requires that condition (3) or, equivalently, condition (4) holds. The validity of this condition, which is required not only by our methods but also by all others, is not directly testable. Violation of this assumption would be a potential source of the deviation of the observed test statistics from the theoretical values in the quantile-quantile plot. In the Supplemental Materials, we describe a method for conducting sensitivity analysis.
The only other possible reason for our methods to lose control of type I error would be miscalculation of the propensity scores. To reduce the computational burden, we adopt a parametric approach to calculating the propensity scores. Our experience shows that GAPS is insensitive to this modeling assumption. In our simulation studies, GAPS continued to have correct type I error under various misspecified models for propensity scores. In the analysis of the WTCCC data, the results were virtually unchanged when we replaced model (5) by the proportional odds model. The robustness of GAPS to the model misspecification for the propensity score is attributable to the fact that the effect of the propensity score on the risk of disease is modeled nonparametrically. It is possible to avoid parametric modeling as well as dimension reduction by relating the genotype of a candidate SNP directly to genomic markers through a supervised learning technique (e.g., random forest, support vector machine). Then the selected markers will be specific and predictive for that particular SNP. It would be computationally intensive to do this for every candidate SNP in a GWA study. One compromise might be to use parametric modeling for the initial screening and then apply supervised learning to a small set of top SNPs.
In some studies, population stratification is due to cryptic relatedness or family structure. The identifiability conditions presented in Section 2.2 still apply to such situations, because eqs. (1)–(4) pertain to the marginal distribution of (D, G, U, Z). The proposed estimator of β continues to be valid, but the variance estimation should account for the within-cluster dependence through the generalized estimating equation approach.
We have focused on case-control studies. However, our theoretical development does not rely on the case-control design, so our methods can be readily applied to cross-sectional studies with binary traits. Similar methods can be developed for continuous traits in cross-sectional studies and potentially censored age-at-onset data from cohort studies. Furthermore, it is possible to expand our methods to make likelihood-based inference on haplotype–disease associations along the lines of Lin and Zeng (2006).
Acknowledgments
This work was supported by National Institutes of Health Grant R01 CA082659 and a Gillings Innovation Laboratory award.
APPENDIX: PROOFS OF THEORETICAL RESULTS
Proof of Theorem 1
For notational simplicity, we assume that all variables are discrete. Clearly,
Under model (1),
(A.1)
where π0= Pr(D = 0) and π1= 1 − π0. Thus,
As a result,
Because γ is arbitrary, we conclude that
if and only if Pr(U = u|G = g, Z = z, D = 0) does not depend on g. The theorem is proved.
Proof of Proposition 1
- It follows from (A.1) that
  (A.2)
  If U and G are independent conditional on Z in the control population, then Pr(U = u, G = g|Z = z, D = 0) can be factorized into a product of two functions, one involving (u, z) and one involving (g, z). Thus, the right side of (A.2) is also a product of a function of (u, z) and a function of (g, z), which implies that U and G are independent conditional on Z in the case population. The other direction can be argued in a similar manner.
- Clearly,
  where p0 is the sample proportion of controls, and p1 = 1 − p0. It follows from (A.1) that
  If β(g) = 0, then Pr(U = u, G = g|Z = z) = Pr(U = u, G = g, Z = z|D = 0)ζ(u, z) for some function ζ(u, z). Using the arguments of the previous paragraph, we obtain the equivalence between the two conditional independence conditions. Note that the equivalence does not hold if β(g) ≠ 0 and γ(u, z) depends on u.
Proof of Theorem 2
Note that
Under model (1) and condition (3),
By the assumption of the theorem, Pr(G = g|Z = z, D = 0) = Pr(G = g|ψ (Z) = x, D = 0) for z satisfying ψ (z) = x. Thus,
where
The conclusion holds.
Proof of Theorem 3
The proof of this theorem pertains to three major results: consistency, asymptotic normality, and variance estimation.
Consistency
Let $P_n$ and $P$ denote the empirical and true probability measures, respectively. By the classical likelihood theory for parametric models (e.g., Lehmann and Casella 1998, chap. 6), η̂ converges to η0 in probability, and
(A.3)
where
and $a^{\otimes 2} = aa^{T}$ for any column vector a. It follows from the proof of Theorem 2 and condition (C.3) that ξ0(·) has a bounded fourth derivative. By theorem 6.25 of Schumaker (2007), we can find a B-spline approximation to ξ0(x), denoted by ξ̃, such that and , where $\mathbb{R}$ denotes the real line and $\|f\|_{W^{1,\infty}(\mathbb{R})} = \|f\|_{L^{\infty}(\mathbb{R})} + \|f'\|_{L^{\infty}(\mathbb{R})}$ for any differentiable function f with derivative f′.
Define $l(\beta, \xi; \eta) = D\{\beta G + \xi(\eta^{T}Z)\} - \log\{1 + e^{\beta G + \xi(\eta^{T}Z)}\}$. Also, write . Because l(β, ξ; η̂) is maximized at (β̂, ξ̂), we have l(β̂, ξ̂; η̂) ≥ l(β0, ξ̃; η̂). Let . The concavity of l(β, ξ; η) for fixed η yields
As a result,
(A.4)
Note that
uniformly in G and Z, where . Thus, (A.4) yields
(A.5)
By lemma 2.6.15 of van der Vaart and Wellner (1996), the function
belongs to a bounded Vapnik–Červonenkis class with index $(K_n + 3)$ and has a bounded covering function. Thus, the left side of (A.4) can be further bounded by $O_p(K_n n^{-1/2})$. The second-order Taylor expansion of the right side of (A.5) at (β0, ξ0) yields
By similar arguments, there exists some positive constant c0 such that
It then follows from (A.5) and condition (C.1) that
where denotes the support of . Because for some constant c1 (de Boor 1978, p. 155), we have
Thus,
(A.6)
Condition (C.4) implies that for sufficiently large n. It then follows from (A.6) that
Thus,
which implies that
By the choice of $K_n$ in condition (C.4), we obtain |β̂ − β0| → 0 and $\|\hat{\xi} - \xi_0\|_{W^{1,\infty}(\mathbb{R})} \to 0$ in probability. This proves the consistency.
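As a concrete reading of the per-subject log-likelihood l(β, ξ; η) used in the consistency argument, the sketch below evaluates it for one subject and checks numerically the concavity in β on which the proof relies. The function name, the argument layout, and the example inputs are hypothetical; ξ is passed as an arbitrary Python callable standing in for the spline.

```python
import math


def log_lik(d, g, z, beta, xi, eta):
    """l(beta, xi; eta) = D{beta G + xi(eta'Z)} - log{1 + exp(beta G + xi(eta'Z))}.

    d: disease indicator (0 or 1); g: genotype score; z: list of covariates;
    xi: callable applied to the linear score eta'Z; eta: list of coefficients.
    """
    score = sum(e * zj for e, zj in zip(eta, z))  # eta^T Z
    lin = beta * g + xi(score)
    # Numerically stable evaluation of log(1 + exp(lin)).
    if lin > 0:
        log1pexp = lin + math.log1p(math.exp(-lin))
    else:
        log1pexp = math.log1p(math.exp(lin))
    return d * lin - log1pexp


if __name__ == "__main__":
    # Hypothetical subject: a case carrying one risk allele, one covariate.
    args = dict(d=1, g=1, z=[0.5], xi=math.sin, eta=[2.0])
    b1, b2 = -1.0, 3.0
    midpoint = log_lik(beta=(b1 + b2) / 2, **args)
    average = 0.5 * (log_lik(beta=b1, **args) + log_lik(beta=b2, **args))
    # Concavity in beta: the value at the midpoint is at least the average.
    print(midpoint >= average)
```

The same midpoint inequality holds along any line in (β, ξ) for fixed η, because the function is a linear term minus the convex function log(1 + e^x) applied to a quantity that is linear in (β, ξ).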
To determine the convergence rate, we reexamine the inequality given in (A.4) by setting εn to 1. Because ξ̂ converges to ξ0 in $W^{1,\infty}$-norm, the left side of (A.4) is $o_p(n^{-1/2})$, whereas the right side is bounded from below by for some positive constant c2. Thus,
Asymptotic normality
We first derive the least favorable direction for β0; that is, we obtain a function h0(·) such that
where lβ is the score for β0, lξ[h0] is the score function for ξ0 along the submodel ξ0 + εh0, and is the dual operator of lξ. By the semi-parametric efficiency theory (Bickel et al. 1993),
where . Clearly, h0(·) has a bounded fourth derivative. We approximate h0 by the B-spline sequence h̃ and show that and . Because β̂ solves the equation
we obtain lβ (β̂, ξ̂; η̂) + lξ(β̂, ξ̂; η̂) [h̃] = 0. Thus,
(A.7)
Because ξ̂ and h̃ have bounded first derivatives and lβ and lξ are differentiable, the Donsker theorem (van der Vaart and Wellner 1996, thm. 2.5.2) implies that the left side of (A.7) is equal to
Given the consistency of (β̂, ξ̂, η̂), the Taylor expansion of the right side of (A.7) yields
(A.8)
where lββ (β, ξ; η) denotes the second derivative of l(β, ξ; η) with respect to β, lβξ (β, ξ; η)[h] denotes the further derivative of lξ(β, ξ; η) [h] with respect to β, and lξξ (β, ξ, η)[h, h̃] denotes the further derivative of lξ [h] along the submodel ξ + εh̃.
The choice of h0 implies that the second term of (A.8) is 0. The convergence rate of (β̂, ξ̂), together with condition (C.4), implies that the last term of (A.8) is $o_p(n^{-1/2})$. Thus,
(A.9)
We claim that {lββ (β0, ξ0; η0) + lβξ (β0, ξ0; η0)[h0]} is nonzero. If this claim does not hold, then lβ (β0, ξ0; η0) + lξ(β0, ξ0; η0)[h0] has zero variance and thus must be equal to 0 with probability 1. That is, or with probability 1, which contradicts condition (C.1). The asymptotic normality of β̂ then follows from (A.9) and (A.3).
Variance estimation
From the proof of asymptotic normality, the influence function for β̂ is
If we can find a consistent estimator of h0 in the $L_2(P)$ space, denoted by ĥ, then a consistent estimator of the limiting variance of $n^{1/2}(\hat{\beta} - \beta_0)$ is
(A.10)
where
To find a consistent estimator for h0, we solve the following equation:
(A.11)
where the parameters (β, ξ, η) are evaluated at (β̂, ξ̂, η̂), and h is on the tangent space of the sieve space, that is, . As a linear operator, the left side of (A.11) converges to with the parameters taking the true values. Because the information operator is invertible, so is . In addition, the inverse of the latter converges to the inverse of the former. It then follows from the continuity of linear operators that the solution to (A.11), denoted by ĥ, converges to h0.
Note that eq. (A.11) is essentially a system of linear equations. Algebraically, the calculation of (A.10) is equivalent to the variance estimation by treating model (6) as a parametric model with parameters β, θ, and η. Thus, the limiting covariance matrix of $n^{1/2}(\hat{\beta} - \beta_0, (\hat{\theta} - \theta_0)^{T})^{T}$ can be consistently estimated by the sandwich estimator $H^{-1}VH^{-1}$, where
Footnotes
SUPPLEMENTARY MATERIALS
Method for sensitivity analysis: A method for conducting sensitivity analysis on condition (3) is described in supplementary materials. (supp.pdf)
Contributor Information
D. Y. Lin, Email: lin@bios.unc.edu.
D. Zeng, Email: dzeng@bios.unc.edu.
References
- Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press; 1993.
- Chen H-S, Zhu X, Zhao H, Zhang S. Qualitative Semi-Parametric Test for Genetic Associations in Case-Control Designs Under Structured Populations. Annals of Human Genetics. 2003;67:250–264. doi: 10.1046/j.1469-1809.2003.00036.x.
- de Boor C. A Practical Guide to Splines. New York: Springer; 1978.
- Devlin B, Roeder K. Genomic Control for Association Studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
- Epstein MP, Allen AS, Satten GA. A Simple and Improved Correction for Population Stratification in Case-Control Studies. American Journal of Human Genetics. 2007;80:921–930. doi: 10.1086/516842.
- Lander E, Schork N. Genetic Dissection of Complex Traits. Science. 1994;265:2037–2048. doi: 10.1126/science.8091226.
- Lee AB, Luca D, Klei L, Devlin B, Roeder K. Discovering Genetic Ancestry Using Spectral Graph Theory. Genetic Epidemiology. 2010;34:51–59. doi: 10.1002/gepi.20434.
- Lee S, Sullivan PF, Zou F, Wright FA. Comment on a Simple and Improved Correction for Population Stratification. American Journal of Human Genetics. 2008;80:524–531. doi: 10.1016/j.ajhg.2007.10.014.
- Lehmann EL, Casella G. Theory of Point Estimation. New York: Springer; 1998.
- Lin DY, Zeng D. “Likelihood-Based Inference on Haplotype Effects in Genetic Association Studies” (with discussion). Journal of the American Statistical Association. 2006;101:89–118.
- Prentice RL, Pyke R. Logistic Disease Incidence Models and Case-Control Studies. Biometrika. 1979;66:403–411.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies. Nature Genetics. 2006;38:904–909. doi: 10.1038/ng1847.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association Mapping in Structured Populations. American Journal of Human Genetics. 2000;67:170–181. doi: 10.1086/302959.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. American Journal of Human Genetics. 2007;81:559–575. doi: 10.1086/519795.
- Risch N, Merikangas K. The Future of Genetic Studies of Complex Human Diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516.
- Rosenbaum PR, Rubin DB. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika. 1983;70:41–55.
- Sarasua SM, Collins JS, Williamson DM, Satten GA, Allen AS. Effect of Population Stratification on the Identification of Significant Single-Nucleotide Polymorphisms in Genome-Wide Association Studies. BMC Proceedings. 2009;3(Suppl 7):S13. doi: 10.1186/1753-6561-3-s7-s13.
- Schumaker LL. Spline Functions: Basic Theory. Cambridge: Cambridge University Press; 2007.
- The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226.
- The Wellcome Trust Case Control Consortium. Genome-Wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
- Tracy C, Widom H. Level-Spacing Distributions and the Airy Kernel. Communications in Mathematical Physics. 1994;159:151–174.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer; 1996.
- Zhang J, Niyogi P, McPeek MS. Laplacian Eigenfunctions Learn Population Structure. PLoS ONE. 2009;4(12):e7928. doi: 10.1371/journal.pone.0007928.
- Zhang S, Zhu X, Zhao H. On a Semiparametric Test to Detect Associations Between Quantitative Traits and Candidate Genes Using Unrelated Individuals. Genetic Epidemiology. 2003;24:44–56. doi: 10.1002/gepi.10196.
- Zhao H, Rebbeck TR, Mitra N. A Propensity Score Approach to Correction for Bias Due to Population Stratification Using Genetic and Non-Genetic Factors. Genetic Epidemiology. 2009;33:679–690. doi: 10.1002/gepi.20419.