Abstract
Confounding due to population stratification (PS) arises when differences in both allele and disease frequencies exist in a population of mixed racial/ethnic subpopulations. Genomic control, structured association, principal components analysis (PCA), and multidimensional scaling (MDS) approaches have been proposed to address this bias using genetic markers. However, confounding due to PS can also be due to non-genetic factors. Propensity scores are widely used to address confounding in observational studies but have not been adapted to deal with PS in genetic association studies. We propose a genomic propensity score (GPS) approach to correct for bias due to PS that considers both genetic and non-genetic factors. We compare the GPS method with PCA and MDS using simulation studies. Our results show that GPS can adequately adjust and consistently correct for bias due to PS. Under no/mild, moderate, and severe PS, GPS yielded estimates with bias close to 0 (mean = −0.0044, standard error = 0.0087). Under moderate or severe PS, the GPS method consistently outperforms the PCA method in terms of bias, coverage probability, and type I error. Under moderate PS, the GPS method consistently outperforms the MDS method in terms of coverage probability. PCA maintains relatively high power compared to both MDS and GPS methods under the simulated situations. GPS and MDS are comparable in terms of statistical properties such as bias, type I error, and power. The GPS method provides a novel and robust tool for obtaining less-biased estimates of genetic associations that can consider both genetic and non-genetic factors.
Keywords: population-based genetic association, propensity scores, population stratification, genetic and non-genetic covariates
INTRODUCTION
It has become increasingly common in genetic association studies to recognize and adjust for characteristics of the study population that may lead to confounded associations, including the effects of ancestry and admixture that lead to population stratification (PS). PS is a type of confounding that can arise when differences in allele frequencies and differences in baseline disease rates in a population of mixed racial/ethnic groups confound association of genes with disease. Consequences of PS include deviations from Hardy-Weinberg equilibrium and bias in the estimate of genetic associations, which can lead to incorrect inferences as well as inconsistency across reports [Deng, 2001; Marchini et al., 2004]. In order for bias due to PS to exist, both of the following must be true: (a) the frequency of the marker genotype of interest varies significantly by race/ethnicity, and (b) the background disease prevalence varies significantly by race/ethnicity [Wacholder et al., 2000, 2002].
Traditional epidemiological methods for dealing with confounding include matching, stratification, and model-based covariate adjustment [Hennekens and Buring, 1987]. Despite their value and wide applicability, these methods all have shortcomings in genetic association studies. Matching fails when relevant matching covariates (such as ancestry) cannot be adequately captured. Stratification may fail when the number of strata is so large that not all strata contain subjects with both the genotype of interest and those with no copy of the genotype of interest. Because the number of strata increases exponentially with the number of covariates, there is a risk of having such noninformative strata even when the number of covariates is quite small. Traditional model-based adjustment such as logistic regression may avoid these problems but depends on correct specification of the model relating the genotype and covariates to the outcome. This method can involve a perilous extrapolation if the distributions of the covariates in the comparison groups do not overlap substantially.
The propensity score, defined as the conditional probability of assignment to a particular group given a vector of observed covariates, provides an alternative approach to address confounding [Rosenbaum and Rubin, 1983]. If subjects are assigned to strata based on their propensity scores, then the comparison groups within the strata are balanced with respect to the observed potential confounders. Because exact adjustment using the propensity score will on average remove all of the bias in group effect estimates [Rosenbaum and Rubin, 1985], propensity scores can be used as a basis for matching, stratification, regression adjustment, and data reduction. The method offers distinct advantages because it can control for numerous covariates simultaneously by matching or stratifying on a single scalar variable; this greatly simplifies model building and estimation. The use of propensity scores to adjust for confounding and hence to increase the comparability of exposure groups has become widely accepted in observational studies [Rosenbaum, 1995; Joffe and Rosenbaum, 1999; Lunceford and Davidian, 2004]. However, this approach has not been adopted in genetic or molecular epidemiology to control for confounding or to reduce bias due to PS in genetic association studies.
It has long been argued that association analysis is more powerful than family-based linkage analysis [Risch and Merikangas, 1996]. However, population-based association studies may be subject to confounding due to PS. Family-based association studies, on the other hand, avoid confounding due to PS but are often not feasible when studying late-onset diseases and may be cost prohibitive. Hence, several population-based methods have been developed to test for and/or adjust for PS in case-control studies. However, no consensus has been reached as to which method is best. These methods all utilize genotype information either from a set of random (null) markers or from a set of selected ancestry informative markers (AIMs). These methods for testing and/or adjusting for PS can be broadly classified into three classes [Balding, 2006; Barnholtz-Sloan et al., 2008]: (1) genomic control (GC) [Devlin and Roeder, 1999; Bacanu et al., 2000; Devlin et al., 2001a], (2) structured association (SA) [Pritchard and Rosenberg, 1999; Pritchard et al., 2000; Hoggart et al., 2003; Pritchard and Donnelly, 2007] and (3) other methods, such as principal component methods [Price et al., 2006] and multidimensional scaling (MDS) [Li and Yu, 2008]. The principal components analysis (PCA) approach uses genotype data to estimate “axes of variation,” which can be used to adjust for association attributable to ancestry along each axis. MDS is an extension of PCA. It uses techniques from multidimensional scaling [Mardia et al., 2003] and clustering [Kaufman and Rousseeuw, 1990; Tibshirani et al., 2001] analysis to identify both clustered and continuous patterns of genetic variation; it then corrects for their potential confounding effects by adjusting simultaneously for each subject’s positions along identified axes of genomic variation and his/her memberships in detected clusters.
Despite the existence of these methods, the optimal approach for correcting bias due to population stratification remains unresolved, and additional evaluations of the existing methods are needed. Recently, Zhang et al. [2008] evaluated the relative merits of structured association, genomic control and PCA under various PS conditions. They found that the performance of PCA was very stable under various stratification levels in terms of power, type I error rates, accuracy and positive prediction value. Most recently, Tiwari et al. [2008] provided a comprehensive review of methods developed in recent years to correct for PS, and evaluated and synthesized these methods based on statistical principles. However, many of the existing approaches may be inadequate because they consider only genetic factors, and not environmental factors or patient characteristics, that may lead to bias. Some of the methods are also computationally intensive (e.g., SA), and some do not provide parameter estimates of the gene effect such as odds ratios (e.g., GC). Therefore, additional research in this area is needed.
The primary goal of this research was to develop a propensity score framework to optimally estimate and correct for the effect of genetic and non-genetic covariates in population-based association studies, and thereby reduce the bias introduced by PS. We compare data analysis approaches using a naïve regression method, PCA, MDS, and our propensity score approach using simulated data.
METHODS
NOTATION
All notations for formulas and simulations used in this work appear in Table I. We use G to denote the candidate genotype of interest, D to denote disease status, Z to denote a vector of genetic covariates such as null markers and AIMs, and X to denote a vector of non-genetic covariates which may include patient demographics and disease specific variables such as tumor characteristics. For all terms, the indices i, k, and g represent individual, subpopulation, and test-locus genotype, respectively. For example, pgik represents the probability of having a high-risk genotype g for individual i in subpopulation k.
TABLE I.
Notation for Formulas and Simulations
| G = test-locus genotype (G=1 is for genotypes AA and Aa; G=0 is for genotype aa) |
| D = disease status for the ith subject (1=case, 0=control) |
| X = vector of non-genetic covariates for ith subject, such as age, education, gender, and race |
| Z = vector of additional genotypes for the ith subject, such as null markers or AIMs |
| pgi = probability of having a high-risk genotype attributable to X, where logit (pgi) = αx+βx X |
| αx = baseline odds-ratio between X and G |
| βx = log odds-ratio between X and G |
| k = subpopulation (k=1 or 2) |
| p = minor-allele frequency |
| Fst = Wright’s coefficient of inbreeding |
| pk = allele frequency for the kth subpopulation, where pk ~ Beta [(1−Fst)p/Fst, (1−Fst)(1−p)/Fst] |
| pgk = pk² + 2pk(1−pk), probability of having the high-risk genotype attributable to Z |
| pgik = (1−λ) pgi + λ pgk, probability of having the high-risk genotype for individual i in subpopulation k, where λ = pgk/(pgk + pgi) |
| pdi = probability of having disease according to logit (pdi) = αd+βd X + β G |
| αd = baseline odds-ratio between (X, G) and D |
| βd = log odds-ratio between X|G and D |
| β = log odds-ratio between G|X and D |
| GPS(Z, X) = P(G=1|Z, X), genomic propensity scores based on (Z, X) |
| Ndk = number of cases in subpopulation k, k=1, 2 |
| Nck = number of controls in subpopulation k, k=1, 2 |
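Two consequences of the definitions in Table I may help when reading the simulation design. They are worked out here from the stated formulas only, as a reader's aid rather than as part of the original derivation.

```latex
% Moments of the subpopulation allele frequency p_k
p_k \sim \mathrm{Beta}\!\left(\frac{(1-F_{st})\,p}{F_{st}},\ \frac{(1-F_{st})(1-p)}{F_{st}}\right)
\;\Longrightarrow\;
\mathrm{E}[p_k] = p, \qquad \mathrm{Var}[p_k] = F_{st}\,p\,(1-p).

% Closed form of the mixed genotype probability
\lambda = \frac{p_{gk}}{p_{gk}+p_{gi}}
\;\Longrightarrow\;
p_{gik} = (1-\lambda)\,p_{gi} + \lambda\,p_{gk}
        = \frac{p_{gi}^{2}+p_{gk}^{2}}{p_{gi}+p_{gk}}.
```

Thus Fst directly scales the between-subpopulation variance of the allele frequency, and pgik always lies between pgi and pgk, closer to the larger of the two.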
PROPENSITY SCORES METHOD
We define the genomic propensity score (GPS) to be the likelihood of an individual having a particular test-locus genotype based on that individual’s covariate makeup. This can be stated explicitly as
$$\mathrm{GPS}_i(z_i, x_i) = \Pr(G_i = 1 \mid Z_i = z_i, X_i = x_i) \tag{1}$$
where GPSi(zi, xi) is the genomic propensity score for subject i calculated from that subject’s zi and xi which represent that individual’s vector of genetic and non-genetic covariates and where Gi is that subject’s test-locus genotype. For example, for a dominant disease risk allele A and reference allele a, G = 1 may represent the high-risk genotypes AA and Aa, and G = 0 may represent the homozygote wild-type genotype aa. It is assumed that, given the observed covariates, the Gi (i = 1, …,N) are independent and identically distributed where N is the total size of the study cohort:
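$$\Pr(G_1 = g_1, \ldots, G_N = g_N \mid z_1, x_1, \ldots, z_N, x_N) = \prod_{i=1}^{N} \mathrm{GPS}_i(z_i, x_i)^{g_i}\,\left[1 - \mathrm{GPS}_i(z_i, x_i)\right]^{1 - g_i} \tag{2}$$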
In the development below, we suppress an index i corresponding to the ith individual. We assume that given GPS(z, x), (Z, X) and G are conditionally independent, i.e., Pr ((Z, X), G|GPS(z,x)) = Pr ((Z,X)|GPS(z,x)) Pr(G|GPS (z,x)), which balances measured covariates across test-locus genotype groups.
We define a general class of models that specify the potential relation among disease, test-locus genotype, genetic and non-genetic covariates.
$$f\left(E(D \mid Z, X)\right) = \eta(Z, X) \tag{3}$$
where f (.) is a link function, such as the logit function, that determines the relationship between the outcome variable D and predictor variables Z and X. E (D | Z, X) denotes the conditional mean of D given Z and/or X; and η is a function of covariates, usually a linear function. For example, we specify
$$\eta(Z, X) = \alpha_d + \beta_d X \tag{4}$$
where αd and βd are specified as shown in Table II.
Table II.
Parameter Settings for the Simulations
| Parameter | Description | Low | Medium | High |
|---|---|---|---|---|
| k | Subpopulation | 1 | 2 | |
| Nd1 | Number of cases in subpopulation 1 | 125 | 200 | 250 |
| Nd2 | Number of cases in subpopulation 2 | 250 | 300 | 375 |
| Nc1 | Number of controls in subpopulation 1 | 250 | 300 | 375 |
| Nc2 | Number of controls in subpopulation 2 | 125 | 200 | 250 |
| p | Minor-allele frequency | 0.1 | 0.3 | 0.5 |
| Fst | Wright’s coefficient of inbreeding | 0.01 | 0.05 | 0.1 |
| αx | Baseline odds-ratio of X and G | −log(4) | 0 | log(3) |
| βx | Log odds-ratio of X and G | log(0.75) | log(0.85), log(1.22) | log(1.35) |
| αd | Baseline odds-ratio between (X, G) and D | −log(99) | −log(99) | |
| βd | Log odds-ratio between X|G and D | log(0.67) | log(0.75), log(1.35) | log(1.5) |
| β | Log odds-ratio between G|X and D | log(1) | log(1.5) | log(2) |
We assumed that disease prevalence follows a logit link in our simulations:
$$\operatorname{logit}(p_{di}) = \alpha_d + \beta_d X + \beta G \tag{5}$$
where logit(t)=log(t/(1−t)) is the logit function. The probability of having the high-risk genotype, attributed to X, follows the logistic model logit (pgi) = αx+βx X for our simulations. Within each subpopulation, we assume that Hardy-Weinberg equilibrium (HWE) conditions are met for null markers Z. Then the probability of having the high-risk genotype, attributed to Z, is defined as pgk = pk² + 2pk(1−pk), where pk is the allele frequency for the kth subpopulation and pk ~ Beta [(1−Fst)p/Fst, (1−Fst)(1−p)/Fst]. Fst is Wright’s coefficient of inbreeding [Wright, 1951]. Therefore the probability of having the high-risk genotype for individual i in subpopulation k, attributed to Z and X, is defined as pgik = (1−λ) pgi + λ pgk, where λ = pgk/(pgk + pgi). The retrospective interpretation of λ is that when λ is close to 0, the probability of having the high-risk genotype is mainly determined by non-genetic factors. On the other hand, when λ is close to 1, the probability of having the high-risk genotype is mainly determined by genetic factors. When λ is close to 0.5, the probability of having the high-risk genotype is determined equally by non-genetic and genetic factors. Finally, we specify our proposed GPS model:
$$\eta\left(G, \mathrm{GPS}(Z, X)\right) = \alpha_d + \beta G + \gamma\,\mathrm{GPS}(Z, X) \tag{6}$$
where αd and β are specified as shown in Table II and γ is a nuisance parameter.
We fit the disease prevalence model assuming a logit link to estimate the effect of risk genotype(s) β:
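$$\operatorname{logit}\Pr\left(D = 1 \mid G, \mathrm{GPS}(Z, X)\right) = \alpha_d + \beta G + \gamma\,\mathrm{GPS}(Z, X) \tag{7}$$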
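The two-stage procedure described above (first estimating GPS(Z, X) = P(G=1 | Z, X), then fitting a logit model for D on G and the estimated GPS) can be sketched in plain Python. This is only an illustrative sketch: the function names (`fit_logistic`, `gps_adjusted_log_or`) are hypothetical, the Newton-Raphson fitter is a generic textbook routine rather than the software used in the study, and the first-stage model is assumed to be a main-effects logistic regression.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def sigmoid(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(rows, y, n_iter=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b . row; returns [b0, b1, ...]."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = mu * (1.0 - mu)
            for a in range(p):
                grad[a] += (yi - mu) * xi[a]
                for c in range(p):
                    hess[a][c] += w * xi[a] * xi[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def predict(beta, rows):
    """Fitted probabilities for each covariate row."""
    return [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], r))) for r in rows]

def gps_adjusted_log_or(G, D, covariates):
    """Stage 1: GPS = P(G=1 | Z, X). Stage 2: logit P(D=1) = a + beta*G + gamma*GPS."""
    gps = predict(fit_logistic(covariates, G), covariates)
    fit = fit_logistic([[g, s] for g, s in zip(G, gps)], D)
    return fit[1], gps  # fit[1] estimates beta, the log OR for G
```

Here `covariates` would hold one row per subject containing that subject's genetic (Z) and non-genetic (X) values. Adjusting for a single scalar score in the second stage, rather than for every covariate, is the data-reduction property the propensity score is used for.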
SIMULATION STUDIES
DESCRIPTION OF SIMULATION
We conducted simulation studies to evaluate the performance of the proposed GPS method and to compare it with other existing methods, including a logistic regression model with or without adjustment for non-genetic and/or genetic covariates, PCA [EIGENSTRAT, Price et al., 2006], and MDS [MDS.R, Li and Yu, 2008]. To fully evaluate the proposed GPS method, we calculated propensity scores based on non-genetic and genetic covariates to examine the relationship between D and G. We fitted ‘reduced’ models (Table III, models 1–6) that included 1) no adjustment for PS, 2) direct adjustment for non-genetic covariates only, 3) adjustment for propensity scores based on non-genetic covariates, 4) direct adjustment for genetic covariates only, 5) adjustment for propensity scores based on genetic covariates, and 6) direct adjustment for both non-genetic and genetic covariates. These ‘reduced’ models were compared to a ‘full’ model (Table III, model 7) that adjusted for propensity scores based on both non-genetic and genetic covariates. Because the PCA and MDS methods can incorporate only genetic covariates, when comparing these two methods to GPS, the non-genetic covariates were directly adjusted for in the final disease association models.
Table III.
Estimated Odds-ratio (OR), 95% Confidence Intervals of OR (CI), Bias and Standard Error (SE) of the log OR and the Coverage Probability (CP) for Comparing Models with and without GPS by Population Stratification
| Model | OR (95% CI), true OR=1 | Bias, true OR=1 | SE, true OR=1 | CP, true OR=1 | OR (95% CI), true OR=1.5 | Bias, true OR=1.5 | SE, true OR=1.5 | CP, true OR=1.5 | OR (95% CI), true OR=2 | Bias, true OR=2 | SE, true OR=2 | CP, true OR=2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No or Mild Population Stratification | ||||||||||||
| 1. G | 1.102 (0.862–1.366) | 0.090 | 0.129 | 0.916 | 1.517 (1.218–1.882) | 0.005 | 0.128 | 0.972 | 1.886 (1.527–2.301) | −0.065 | 0.129 | 0.960 |
| 2. X+G | 1.041 (0.794–1.335) | 0.032 | 0.140 | 0.964 | 1.345 (1.049–1.711) | −0.117 | 0.141 | 0.903 | 1.681 (1.314–2.119) | −0.182 | 0.141 | 0.774 |
| 3. GPS(X)+G | 1.036 (0.817–1.289) | 0.028 | 0.132 | 0.975 | 1.312 (1.046–1.627) | −0.141 | 0.134 | 0.851 | 1.619 (1.282–1.997) | −0.218 | 0.136 | 0.660 |
| 4. Z+G | 1.122 (0.816–1.459) | 0.104 | 0.147 | 0.881 | 1.847 (1.332–2.532) | 0.194 | 0.159 | 0.765 | 2.490 (1.799–3.421) | 0.205 | 0.161 | 0.743 |
| 5. GPS(Z)+G | 1.116 (0.828–1.436) | 0.099 | 0.144 | 0.885 | 1.792 (1.319–2.419) | 0.166 | 0.155 | 0.821 | 2.384 (1.742–3.234) | 0.163 | 0.157 | 0.826 |
| 6. Z+X+G | 1.047 (0.738–1.413) | 0.032 | 0.161 | 0.936 | 1.540 (1.070–2.178) | 0.009 | 0.176 | 0.942 | 2.095 (1.429–2.950) | 0.030 | 0.176 | 0.938 |
| 7. GPS(Z, X)+G | 1.035 (0.777–1.323) | 0.025 | 0.145 | 0.972 | 1.403 (1.059–1.840) | −0.078 | 0.156 | 0.940 | 1.789 (1.342–2.329) | −0.121 | 0.156 | 0.890 |
| Moderate Population Stratification | ||||||||||||
| 1. G | 1.525 (1.202–1.913) | 0.415 | 0.130 | 0.095 | 2.336 (1.884–2.918) | 0.436 | 0.130 | 0.043 | 2.886 (2.287–3.595) | 0.360 | 0.132 | 0.178 |
| 2. X+G | 1.444 (1.100–1.865) | 0.359 | 0.141 | 0.271 | 2.085 (1.616–2.672) | 0.322 | 0.142 | 0.383 | 2.597 (1.990–3.295) | 0.253 | 0.144 | 0.590 |
| 3. GPS(X)+G | 1.395 (1.089–1.753) | 0.326 | 0.134 | 0.295 | 1.996 (1.563–2.501) | 0.279 | 0.137 | 0.464 | 2.470 (1.932–3.076) | 0.204 | 0.139 | 0.714 |
| 4. Z+G | 1.169 (0.855–1.570) | 0.144 | 0.150 | 0.831 | 1.898 (1.356–2.643) | 0.220 | 0.161 | 0.708 | 2.555 (1.813–3.502) | 0.231 | 0.161 | 0.691 |
| 5. GPS(Z)+G | 1.154 (0.861–1.519) | 0.132 | 0.146 | 0.853 | 1.821 (1.325–2.502) | 0.181 | 0.157 | 0.769 | 2.426 (1.757–3.276) | 0.180 | 0.158 | 0.787 |
| 6. Z+X+G | 1.088 (0.783–1.497) | 0.071 | 0.163 | 0.929 | 1.576 (1.078–2.235) | 0.032 | 0.178 | 0.934 | 2.144 (1.437–3.028) | 0.052 | 0.177 | 0.932 |
| 7. GPS(Z, X)+G | 1.066 (0.814–1.388) | 0.054 | 0.147 | 0.948 | 1.442 (1.066–1.919) | −0.051 | 0.160 | 0.957 | 1.863 (1.347–2.445) | −0.082 | 0.160 | 0.932 |
| Severe Population Stratification | ||||||||||||
| 1. G | 2.528 (1.968–3.241) | 0.919 | 0.133 | 0.000 | 4.679 (3.640–5.807) | 1.130 | 0.137 | 0.000 | 5.756 (4.495–7.204) | 1.050 | 0.140 | 0.000 |
| 2. X+G | 2.397 (1.815–3.107) | 0.865 | 0.144 | 0.000 | 4.218 (3.192–5.412) | 1.025 | 0.149 | 0.000 | 5.235 (4.002–6.813) | 0.953 | 0.151 | 0.000 |
| 3. GPS(X)+G | 2.267 (1.738–2.906) | 0.810 | 0.138 | 0.000 | 3.987 (3.077–5.031) | 0.969 | 0.145 | 0.000 | 4.939 (3.816–6.319) | 0.896 | 0.147 | 0.000 |
| 4. Z+G | 1.257 (0.903–1.717) | 0.214 | 0.167 | 0.751 | 1.990 (1.345–2.777) | 0.265 | 0.178 | 0.666 | 2.670 (1.833–3.717) | 0.273 | 0.177 | 0.654 |
| 5. GPS(Z)+G | 1.219 (0.910–1.604) | 0.187 | 0.156 | 0.793 | 1.830 (1.275–2.482) | 0.185 | 0.170 | 0.797 | 2.409 (1.719–3.264) | 0.173 | 0.169 | 0.834 |
| 6. Z+X+G | 1.169 (0.800–1.645) | 0.139 | 0.180 | 0.871 | 1.659 (1.044–2.403) | 0.079 | 0.196 | 0.907 | 2.247 (1.447–3.272) | 0.097 | 0.194 | 0.914 |
| 7. GPS(Z, X)+G | 1.129 (0.851–1.482) | 0.111 | 0.159 | 0.914 | 1.502 (1.020–2.033) | −0.013 | 0.176 | 0.954 | 1.941 (1.386–2.636) | −0.043 | 0.176 | 0.959 |
OR=odds ratio between disease (D) and gene (G); estimated OR, bias and SE are averages over 1,000 replications of the estimated OR and of the bias and SE of the estimated log OR; CI=empirical 95% confidence interval; bias=log(estimated OR) −log(true OR); the empirical CP is the proportion of the 1,000 replications in which the confidence interval of the estimated OR includes the true OR. In all models, Fst=0.05, p1=0.1 and p2=0.5.
All parameter settings for the simulations used in this work are presented in Table II. First, data were generated for a population consisting of K subpopulations, each of size Nk=100,000, with k=1, …, K. Samples of 500 cases and 500 controls were then drawn from a population consisting of two subpopulations; from the kth subpopulation, Ndk cases and Nck controls were independently simulated. The prior probability of sampling individuals from each subpopulation was 0.5. The prevalence of the disease varied from 0.25 to 0.75 in subpopulations 1 and 2. The numbers of cases and controls drawn from the two subpopulations were chosen as follows. Under no PS, the prevalence of the disease was the same in each subpopulation. Thus, the expected numbers of cases and controls from each subpopulation were (250, 250) and (250, 250), respectively. Under moderate PS, the prevalence of the disease was 1.5-fold higher in subpopulation 1; thus, the expected numbers of cases and controls for each subpopulation were (300, 200) and (200, 300), respectively. Finally, under severe PS, the prevalence of the disease was 3-fold higher in subpopulation 1; thus, the expected numbers of cases and controls from each subpopulation were (375, 125) and (125, 375), respectively. When comparing the GPS to PCA and MDS methods, only moderate and severe PS situations were investigated.
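The expected case and control counts quoted above follow from the prevalence ratio when both subpopulations are equally likely to be sampled. The sketch below reproduces that arithmetic. The specific prevalence pairs used in the usage note (0.6 and 0.4 for moderate PS, 0.75 and 0.25 for severe PS) are assumptions chosen to match the stated 1.5-fold and 3-fold ratios and the stated 0.25 to 0.75 range; the source does not list them explicitly, and the function name is hypothetical.

```python
def expected_counts(prev1, prev2, n_cases=500, n_controls=500):
    """Expected (subpopulation 1, subpopulation 2) split of cases and of controls,
    assuming each subpopulation has prior sampling probability 0.5."""
    # P(subpop 1 | case) is proportional to the prevalence in subpop 1.
    cases1 = round(n_cases * prev1 / (prev1 + prev2))
    # P(subpop 1 | control) is proportional to the non-disease rate in subpop 1.
    controls1 = round(n_controls * (1 - prev1) / ((1 - prev1) + (1 - prev2)))
    return (cases1, n_cases - cases1), (controls1, n_controls - controls1)
```

For example, `expected_counts(0.6, 0.4)` gives cases (300, 200) and controls (200, 300), and `expected_counts(0.75, 0.25)` gives cases (375, 125) and controls (125, 375).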
SIMULATION ALGORITHM
The steps used for our simulation studies are as follows:
Step 1. Generate X non-genetic covariates X1–X6~N(0, 1) and X7–X10~Bernoulli(0.5) for subpopulation k with Nk=100,000, where k=1, …, K; obtain pgi for each ith subject based on model logit (pgi) = αx+βx X, where αx and βx are specified as in Table II.
Step 2. Simulate allele frequency pk for the kth subpopulation from pk ~ Beta [(1−Fst)p/Fst, (1−Fst)(1−p)/Fst], where p=0.1, 0.3 or 0.5 and Fst=0.01, 0.05 or 0.1 (Table II). Obtain pgk = pk² + 2pk(1−pk), where pgk is the probability of having the high-risk genotype attributable to Z.
Step 3. With λ = pgk/(pgk + pgi), calculate pgik = (1−λ) pgi + λ pgk, where pgik is the probability of having the high-risk genotype for individual i in subpopulation k. Generate test-locus genotype G~Binomial(pgik, 2) and recode to G=1 if G=1 or 2 (Aa or AA) and keep G=0 if G=0 (aa).
Step 4. Generate Z null markers ZG1, …, ZG50~Binomial(pk*, 2), where MAF pk* in the ancestral population at each locus is generated from the uniform distribution U(pk−0.05, pk+0.05). If pk−0.05 is less than 0, then 0 is used; if pk+0.05 is greater than 1, then 1 is used.
Step 5. Assuming odds ratio (OR)=exp(β)=1, 1.5 or 2, calculate pdi, the probability of having the disease, according to logit (pdi) = αd+βd X + β G for subject i=1, …,Nk, where αd and βd are specified as shown in Table II. Simulate D~Bernoulli(pdi) and designate the ith subject a case if D=1 and a control if D=0.
Step 6. Randomly select (Nd1, Nc1)=(250, 250), (200, 300), or (125, 375) from subpopulation 1 and (Nd2, Nc2)=(250, 250), (300, 200), or (375, 125) from subpopulation 2. Combine subpopulations 1 and 2 to create a dataset with 500 cases and 500 controls.
Step 7. Calculate the genomic propensity scores GPS(Z, X) = P(G=1 | Z, X) based on (Z, X) using the dataset from step 6. Adjust for GPS(Z, X) in the model Logit[D=1 | G, GPS(Z, X)] = αd + β G + γ GPS(Z, X), where γ is a nuisance parameter.
Calculate PCA scores, PCA(Z), using the dataset from step 6 with EIGENSTRAT. Adjust for the non-genetic covariates and PCA(Z) in the model Logit[D=1 | X, G, PCA(Z)] = αd + βd X + β G + γ PCA(Z).
Finally, calculate MDS scores and clusters, MDS(Z) and MDS(k) respectively, using the dataset from step 6 with the MDS R package. Adjust for the non-genetic covariates, MDS(Z), and MDS(k) in the model Logit[D=1 | X, G, MDS(Z), MDS(k)] = αd + βd X + β G + γ1 MDS(Z) + γ2 MDS(k), where γ1 and γ2 are nuisance parameters.
Step 8. In fitting the models specified above, including reduced models (Table III models 1 to 6), obtain estimates of OR, p-value of OR, bias, and SE of log OR.
Step 9. Repeat steps 6 to 8 1,000 times, then calculate the average OR, bias, SE, coverage probability, empirical 95% confidence interval of the OR, type I error, and power.
Step 10. Compare reduced models (Table III models 1 to 6) with full GPS model (Table III model 7) to evaluate the performance of the GPS method in terms of OR, bias, SE, and coverage probability (Table III).
Step 11. Compare the GPS method with no adjustment, PCA and MDS methods in terms of OR, bias, SE, coverage probability, type I error, and power (Tables IV to VI).
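Steps 1 to 6 of the algorithm above can be sketched with the Python standard library alone. This is a simplified illustration, not the authors' simulation code: the function names are hypothetical, a single coefficient is applied to every non-genetic covariate (Table II lists several values without saying which covariate receives which), and the genotype draw follows Step 3 literally by using pgik as the per-copy probability in a Binomial with two trials.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def simulate_subpopulation(n, p, fst, true_or, rng, n_markers=50):
    """Steps 1-5 for one subpopulation; returns (pk, list of subject dicts)."""
    # Assumed coefficients (one value per role, taken from the Table II ranges).
    alpha_x, beta_x = 0.0, math.log(1.22)
    alpha_d, beta_d = -math.log(99), math.log(1.35)
    beta = math.log(true_or)

    # Step 2: subpopulation allele frequency and genotype probability due to Z.
    a = (1 - fst) * p / fst
    b = (1 - fst) * (1 - p) / fst
    pk = rng.betavariate(a, b)
    pgk = pk ** 2 + 2 * pk * (1 - pk)

    # Step 4 (locus level): marker frequencies from U(pk-0.05, pk+0.05), clipped to [0, 1].
    lo, hi = max(0.0, pk - 0.05), min(1.0, pk + 0.05)
    marker_freqs = [rng.uniform(lo, hi) for _ in range(n_markers)]

    subjects = []
    for _ in range(n):
        # Step 1: six standard normal and four Bernoulli(0.5) covariates.
        x = [rng.gauss(0, 1) for _ in range(6)]
        x += [1.0 if rng.random() < 0.5 else 0.0 for _ in range(4)]
        pgi = expit(alpha_x + beta_x * sum(x))
        # Step 3: mix the two genotype probabilities, draw and recode G.
        lam = pgk / (pgk + pgi)
        pgik = (1 - lam) * pgi + lam * pgk
        copies = sum(rng.random() < pgik for _ in range(2))
        g = 1 if copies > 0 else 0
        # Step 4 (subject level): null-marker genotypes coded 0, 1 or 2.
        z = [sum(rng.random() < f for _ in range(2)) for f in marker_freqs]
        # Step 5: disease status from the logit model.
        pdi = expit(alpha_d + beta_d * sum(x) + beta * g)
        d = 1 if rng.random() < pdi else 0
        subjects.append({"X": x, "Z": z, "G": g, "D": d})
    return pk, subjects

def draw_case_control(subjects, n_cases, n_controls, rng):
    """Step 6: sample cases and controls without replacement from one subpopulation."""
    cases = [s for s in subjects if s["D"] == 1]
    controls = [s for s in subjects if s["D"] == 0]
    return rng.sample(cases, n_cases) + rng.sample(controls, n_controls)
```

With a baseline log odds of −log(99), disease is rare, so a large subpopulation (the text uses Nk=100,000) is needed before `draw_case_control` can find several hundred cases; `rng.sample` raises `ValueError` if too few are available.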
Table IV.
Estimated Odds-ratio (OR), Bias and Standard Error (SE) of the Estimated log OR and the Coverage Probability (CP) for Comparing no Adjustment, PCA, MDS and GPS Methods with True OR=1
| Fst | p1 | p2 | Estimated OR, no adjustment | Estimated OR, PCA | Estimated OR, MDS | Estimated OR, GPS | Bias, no adjustment | Bias, PCA | Bias, MDS | Bias, GPS | SE, no adjustment | SE, PCA | SE, MDS | SE, GPS | CP, no adjustment | CP, PCA | CP, MDS | CP, GPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Moderate Population Stratification | ||||||||||||||||||
| 0.01 | 0.1 | 0.3 | 1.474 | 1.009 | 0.976 | 1.009 | 0.380 | −0.003 | −0.035 | −0.000 | 0.130 | 0.152 | 0.152 | 0.144 | 0.165 | 0.945 | 0.937 | 0.956 |
| 0.01 | 0.1 | 0.5 | 1.860 | 1.092 | 1.074 | 1.076 | 0.615 | 0.073 | 0.057 | 0.061 | 0.128 | 0.167 | 0.167 | 0.159 | 0.000 | 0.923 | 0.930 | 0.941 |
| 0.05 | 0.1 | 0.3 | 1.381 | 1.150 | 1.147 | 1.123 | 0.314 | 0.128 | 0.126 | 0.107 | 0.128 | 0.143 | 0.141 | 0.135 | 0.303 | 0.838 | 0.844 | 0.883 |
| 0.05 | 0.1 | 0.5 | 1.525 | 1.081 | 1.043 | 1.066 | 0.415 | 0.066 | 0.030 | 0.054 | 0.130 | 0.155 | 0.156 | 0.147 | 0.095 | 0.937 | 0.945 | 0.948 |
| 0.1 | 0.1 | 0.3 | 1.763 | 1.022 | 1.011 | 1.014 | 0.560 | 0.008 | −0.003 | 0.003 | 0.129 | 0.162 | 0.162 | 0.155 | 0.002 | 0.938 | 0.944 | 0.958 |
| 0.1 | 0.1 | 0.5 | 2.081 | 1.059 | 1.049 | 1.037 | 0.728 | 0.039 | 0.030 | 0.021 | 0.129 | 0.192 | 0.191 | 0.185 | 0.000 | 0.947 | 0.954 | 0.958 |
| Severe Population Stratification | ||||||||||||||||||
| 0.01 | 0.1 | 0.3 | 2.088 | 1.117 | 1.029 | 1.100 | 0.728 | 0.097 | 0.015 | 0.086 | 0.132 | 0.165 | 0.167 | 0.153 | 0.001 | 0.917 | 0.953 | 0.938 |
| 0.01 | 0.1 | 0.5 | 3.652 | 1.167 | 1.091 | 1.136 | 1.288 | 0.137 | 0.070 | 0.113 | 0.133 | 0.184 | 0.186 | 0.175 | 0.000 | 0.880 | 0.927 | 0.906 |
| 0.05 | 0.1 | 0.3 | 1.528 | 1.195 | 1.184 | 1.147 | 0.415 | 0.166 | 0.157 | 0.129 | 0.128 | 0.153 | 0.153 | 0.137 | 0.100 | 0.810 | 0.800 | 0.855 |
| 0.05 | 0.1 | 0.5 | 2.528 | 1.156 | 1.047 | 1.129 | 0.919 | 0.131 | 0.031 | 0.111 | 0.133 | 0.171 | 0.174 | 0.159 | 0.000 | 0.874 | 0.953 | 0.914 |
| 0.1 | 0.1 | 0.3 | 3.188 | 1.054 | 1.018 | 1.037 | 1.152 | 0.034 | −0.000 | 0.022 | 0.133 | 0.179 | 0.180 | 0.172 | 0.000 | 0.934 | 0.936 | 0.950 |
| 0.1 | 0.1 | 0.5 | 4.981 | 1.079 | 1.055 | 1.052 | 1.600 | 0.053 | 0.030 | 0.031 | 0.138 | 0.212 | 0.211 | 0.207 | 0.000 | 0.939 | 0.945 | 0.956 |
OR=odds ratio between disease (D) and gene (G); estimated OR, bias and SE are the average over 1,000 replications of the estimated OR, bias and SE of the estimated log OR; bias=log(estimated OR) −log(true OR); the empirical coverage probability is the proportion of the confidence interval of the estimated OR including the true OR among 1,000 replications. In all models, true OR=1.
Table VI.
Empirical Type I Error and Power of Simulations for Comparing no Adjustment, PCA, MDS and GPS Methods
| Fst | p1 | p2 | Type I error, no adjustment | Type I error, PCA | Type I error, MDS | Type I error, GPS | Power, no adjustment | Power, PCA | Power, MDS | Power, GPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Moderate Population Stratification | ||||||||||
| 0.01 | 0.1 | 0.3 | 0.835 | 0.055 | 0.063 | 0.044 | 1.000 | 0.892 | 0.874 | 0.862 |
| 0.01 | 0.1 | 0.5 | 1.000 | 0.077 | 0.070 | 0.059 | 1.000 | 0.877 | 0.847 | 0.825 |
| 0.05 | 0.1 | 0.3 | 0.697 | 0.162 | 0.157 | 0.117 | 1.000 | 0.859 | 0.815 | 0.802 |
| 0.05 | 0.1 | 0.5 | 0.905 | 0.063 | 0.055 | 0.052 | 1.000 | 0.691 | 0.617 | 0.606 |
| 0.1 | 0.1 | 0.3 | 0.998 | 0.062 | 0.056 | 0.042 | 1.000 | 0.810 | 0.779 | 0.752 |
| 0.1 | 0.1 | 0.5 | 1.000 | 0.053 | 0.046 | 0.042 | 1.000 | 0.756 | 0.695 | 0.674 |
| Severe Population Stratification | ||||||||||
| 0.01 | 0.1 | 0.3 | 0.999 | 0.083 | 0.047 | 0.062 | 1.000 | 0.951 | 0.900 | 0.921 |
| 0.01 | 0.1 | 0.5 | 1.000 | 0.120 | 0.073 | 0.094 | 1.000 | 0.852 | 0.752 | 0.789 |
| 0.05 | 0.1 | 0.3 | 0.900 | 0.190 | 0.213 | 0.145 | 1.000 | 0.851 | 0.744 | 0.819 |
| 0.05 | 0.1 | 0.5 | 1.000 | 0.126 | 0.047 | 0.086 | 1.000 | 0.691 | 0.552 | 0.617 |
| 0.1 | 0.1 | 0.3 | 1.000 | 0.066 | 0.064 | 0.050 | 1.000 | 0.762 | 0.670 | 0.694 |
| 0.1 | 0.1 | 0.5 | 1.000 | 0.061 | 0.055 | 0.044 | 1.000 | 0.804 | 0.625 | 0.738 |
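The summary statistics reported in the tables (average OR, bias and SE of the log OR, coverage probability, type I error, and power) can be computed from per-replicate estimates as sketched below. This is a hedged illustration: the function name is hypothetical, and the use of Wald-type 95% intervals and a two-sided Wald test at the 0.05 level is an assumption, since the text does not state how the intervals and p-values were formed.

```python
import math

def summarize(log_or_hats, ses, true_or, z=1.96):
    """Summaries over simulation replicates of an estimated log OR and its SE."""
    n = len(log_or_hats)
    truth = math.log(true_or)
    # Bias is defined on the log scale: log(estimated OR) - log(true OR).
    bias = sum(b - truth for b in log_or_hats) / n
    # Coverage: share of replicates whose interval contains the true value.
    covered = sum(1 for b, s in zip(log_or_hats, ses)
                  if b - z * s <= truth <= b + z * s)
    # Rejection of H0: OR = 1. This is type I error if true_or == 1, power otherwise.
    rejected = sum(1 for b, s in zip(log_or_hats, ses) if abs(b / s) > z)
    return {
        "mean_or": sum(math.exp(b) for b in log_or_hats) / n,
        "bias": bias,
        "mean_se": sum(ses) / n,
        "coverage": covered / n,
        "rejection_rate": rejected / n,
    }
```

When the true OR is 1, coverage and the rejection rate computed this way sum to 1, which matches the pattern seen across the CP and type I error columns of the tables.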
RESULTS
SIMULATION RESULTS OF THE GPS METHOD
Table III presents results under various models with and without GPS used to correct for PS with Fst=0.05, p1=0.1 and p2=0.5. Under no or mild PS, as expected, the model without any adjustment performed well when compared to models with adjustment. For example, when the true OR=1.5, the model without adjustment gave the best estimated OR=1.517 with the smallest bias of 0.005, the smallest SE of 0.128, and the highest coverage probability of 0.972 among all models under no or mild PS. As expected, the models with adjustment for (Z, X) performed consistently well under no or mild PS. Surprisingly, the model with GPS(Z, X) gave the smallest bias of 0.025 when OR=1, with a coverage probability of 0.972, and performed reasonably well when OR=1.5 or 2 under no or mild PS. Under no or mild PS, the confidence intervals around the ORs overlap across models 1–7 when OR=1, 1.5 or 2. However, the GPS(Z, X)-adjusted model (model 7) yields the narrowest confidence interval when compared to reduced models 1 to 6.
Under moderate PS, all ORs were overestimated for all reduced models. This was particularly true when no adjustment was made, as in the G-only model. For example, when the true OR was 1, the estimated OR was 1.525 with coverage probability less than 10% in model 1 under moderate PS. On the other hand, the GPS(Z, X)-adjusted model gave the best estimate of OR (1.066), the smallest bias of 0.054, and the highest coverage probability of 0.948 under moderate PS. When the true OR=1.5 or 2, the GPS(Z, X)-adjusted model also showed a good estimate and the best coverage probability. Under moderate PS, the confidence intervals around the ORs overlap across models 1–7 when OR=1, 1.5 or 2. However, the GPS(Z, X)-adjusted model (model 7) yields the narrowest confidence interval when compared to reduced models 1 to 6.
Finally, under severe PS, the GPS(Z, X)-adjusted models performed consistently well when the true OR=1, 1.5 or 2. For example, when the true OR=1.5, the GPS(Z, X)-adjusted model gave a nearly unbiased OR of 1.502, the smallest bias of −0.013 and the best coverage probability of 0.954 under severe PS. Comparing the GPS(Z, X)-adjusted models to (Z+X)-adjusted models, adjusting for the estimated propensity score was even better for removing bias than directly adjusting for all non-genetic and genetic covariates. This can be explained by the concept that if adjusting for individual non-genetic and genetic covariates is sufficient, it is also sufficient to adjust for the propensity scores. In addition, adjusting for the estimated propensity score removed both systematic and chance imbalances. The standard errors of the estimated log odds-ratio tended to be inflated in (Z+X)-adjusted models compared to the GPS(Z, X)-adjusted models. Under severe PS, the confidence intervals around the ORs overlap across models 1–3, and across models 4–7 when OR=1, 1.5 or 2. However, the confidence intervals do not overlap between the model subgroups 1–3 and 4–7. In addition, the GPS(Z, X)-adjusted model (model 7) yields the narrowest confidence interval when compared to reduced models 1 to 6.
Figures 1–2 and Supplementary Figures S1–S3 illustrate results of estimated OR, empirical power, bias, SE, and coverage probability by MAF, true OR, and PS for Fst = 0.01, 0.05, or 0.1 using GPS to correct for PS. Figure 1 and Supplementary Figure S1 show the estimated OR and bias with GPS. They indicate that under each scenario considered, GPS performs consistently well in terms of removing bias due to PS and hence yielding an unbiased estimator. Under no/mild, moderate, and severe PS, GPS yielded estimates with bias close to 0 (mean = −0.0044, standard error = 0.0087). This is also true for other statistics such as empirical power (Figure 2), SE (Supplementary Figure S2), and coverage probability (Supplementary Figure S3). Because of the robust results observed with our GPS method and the demanding computation time of MDS, we compared our GPS method to the PCA and MDS methods only for the parameters Fst=0.01, 0.05 or 0.1 with (p1=0.1, p2=0.3) or (p1=0.1, p2=0.5).
Fig. 1. Estimated OR with GPS by MAF and OR. Dotted line is no/mild PS. Short-dashed line is moderate PS. Long-dashed line is severe PS.
Fig. 2. Bias with GPS by MAF and OR. Dotted line is no/mild PS. Short-dashed line is moderate PS. Long-dashed line is severe PS. Positive values indicate an overestimation of the true effect of G on D. Negative values indicate an underestimation of the true effect of G on D.
SIMULATION RESULTS OF THE GPS METHOD COMPARED TO PCA AND MDS
Table IV shows the results comparing the no adjustment, PCA, MDS and GPS methods with OR=1 in terms of estimation, bias, SE, and coverage probability. Under moderate PS, all three adjusted methods performed well in terms of estimated OR and bias, with evidence of overestimation when Fst=0.05, p1=0.1 and p2=0.3. Our proposed GPS method consistently outperformed the PCA and MDS methods in terms of SE and coverage probability under moderate PS. Under severe PS, all three adjusted methods performed reasonably well, but the GPS method consistently outperformed the PCA method in terms of estimation, bias, SE, and coverage probability. Both the GPS and PCA methods always outperformed the MDS method in terms of SE under severe PS. MDS sometimes outperformed GPS and PCA in terms of bias and coverage probability under severe PS, though inconsistently. As expected, when OR=1 the basic no adjustment method yielded an overestimation of the OR, high bias and very low coverage probability under both moderate and severe PS.
Table V shows the results of comparing the no adjustment, PCA, MDS and GPS methods with OR=1.5 in terms of estimation, bias, SE and coverage probability. As expected, when OR=1.5 the basic no adjustment method yielded an overestimation of the OR, high bias and very low coverage probability under both moderate and severe PS. Under moderate PS, all three adjusted methods performed reasonably well in terms of estimation and bias, with evidence of overestimation by the PCA and MDS methods when Fst=0.01 and underestimation by the GPS method when Fst=0.1. The proposed GPS method consistently outperformed the PCA and MDS methods in terms of SE regardless of Fst and MAF. Under moderate PS, the GPS method outperformed both the PCA and MDS methods in terms of coverage probability. Under severe PS, the GPS and MDS methods consistently outperformed PCA in terms of estimation, bias, and coverage probability. The PCA and MDS methods provided similar SE under severe PS. Finally, the proposed GPS method consistently outperformed both the PCA and MDS methods in terms of SE and coverage probability under severe PS.
Table V.
Estimated Odds-ratio (OR), Bias and Standard Error (SE) of the Estimated log OR and the Coverage Probability for Comparing no Adjustment, PCA, MDS and GPS Methods with True OR=1.5
| Fst | p1 | p2 | Estimated OR: No Adjustment | Estimated OR: PCA | Estimated OR: MDS | Estimated OR: GPS | Bias: No Adjustment | Bias: PCA | Bias: MDS | Bias: GPS | SE: No Adjustment | SE: PCA | SE: MDS | SE: GPS | CP: No Adjustment | CP: PCA | CP: MDS | CP: GPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Moderate Population Stratification | ||||||||||||||||||
| 0.01 | 0.1 | 0.3 | 2.319 | 1.643 | 1.607 | 1.550 | 0.427 | 0.079 | 0.057 | 0.023 | 0.131 | 0.149 | 0.149 | 0.143 | 0.101 | 0.914 | 0.932 | 0.957 |
| 0.01 | 0.1 | 0.5 | 2.514 | 1.699 | 1.661 | 1.572 | 0.510 | 0.111 | 0.088 | 0.036 | 0.131 | 0.166 | 0.166 | 0.158 | 0.014 | 0.902 | 0.926 | 0.965 |
| 0.05 | 0.1 | 0.3 | 2.234 | 1.598 | 1.560 | 1.495 | 0.391 | 0.052 | 0.028 | −0.012 | 0.130 | 0.153 | 0.153 | 0.145 | 0.119 | 0.942 | 0.955 | 0.962 |
| 0.05 | 0.1 | 0.5 | 2.336 | 1.535 | 1.575 | 1.442 | 0.436 | 0.009 | 0.034 | −0.051 | 0.130 | 0.170 | 0.168 | 0.160 | 0.043 | 0.953 | 0.945 | 0.957 |
| 0.1 | 0.1 | 0.3 | 2.266 | 1.589 | 1.556 | 1.488 | 0.406 | 0.045 | 0.025 | −0.017 | 0.131 | 0.162 | 0.161 | 0.153 | 0.092 | 0.958 | 0.955 | 0.976 |
| 0.1 | 0.1 | 0.5 | 2.227 | 1.548 | 1.499 | 1.455 | 0.126 | 0.019 | −0.014 | −0.040 | 0.131 | 0.159 | 0.159 | 0.151 | 0.126 | 0.948 | 0.948 | 0.953 |
| Severe Population Stratification | ||||||||||||||||||
| 0.01 | 0.1 | 0.3 | 3.206 | 1.826 | 1.730 | 1.686 | 0.750 | 0.183 | 0.129 | 0.106 | 0.133 | 0.161 | 0.161 | 0.152 | 0.000 | 0.795 | 0.865 | 0.897 |
| 0.01 | 0.1 | 0.5 | 4.870 | 1.783 | 1.665 | 1.639 | 1.169 | 0.155 | 0.086 | 0.074 | 0.138 | 0.182 | 0.183 | 0.173 | 0.000 | 0.854 | 0.907 | 0.929 |
| 0.05 | 0.1 | 0.3 | 3.462 | 1.701 | 1.575 | 1.573 | 0.828 | 0.112 | 0.034 | 0.037 | 0.133 | 0.167 | 0.168 | 0.157 | 0.000 | 0.900 | 0.945 | 0.958 |
| 0.05 | 0.1 | 0.5 | 4.679 | 1.601 | 1.498 | 1.502 | 1.130 | 0.047 | −0.020 | −0.013 | 0.137 | 0.185 | 0.186 | 0.176 | 0.000 | 0.927 | 0.936 | 0.954 |
| 0.1 | 0.1 | 0.3 | 4.072 | 1.634 | 1.560 | 1.522 | 0.991 | 0.070 | 0.024 | 0.003 | 0.137 | 0.178 | 0.178 | 0.168 | 0.000 | 0.940 | 0.959 | 0.973 |
| 0.1 | 0.1 | 0.5 | 3.917 | 1.658 | 1.509 | 1.538 | 0.952 | 0.085 | −0.010 | 0.013 | 0.136 | 0.174 | 0.176 | 0.165 | 0.000 | 0.926 | 0.946 | 0.968 |
OR=odds ratio between disease (D) and gene (G); CP=coverage probability; estimated OR, bias and SE are the averages over 1,000 replications of the estimated OR, bias and SE of the estimated log OR; bias=log(estimated OR) − log(true OR); the empirical coverage probability is the proportion of the 1,000 replications in which the confidence interval for the estimated OR includes the true OR. In all models, true OR=1.5.
Results for the simulated type I errors and power at a significance level of 0.05 are presented in Table VI. Under moderate PS, the basic no adjustment method yielded an inflated type I error rate ≥ 0.7. Similarly, under severe PS, the basic no adjustment method yielded a type I error rate > 0.9. However, under both moderate and severe PS, the no adjustment method provided 100% empirical power. All three adjusted methods performed reasonably well under moderate PS, with evidence of inflated type I error when Fst=0.05, p1=0.1 and p2=0.3. The PCA method had an inflated type I error rate under severe PS. For example, the PCA method yielded an empirical type I error rate more than two times the nominal level when Fst=0.01, p1=0.1 and p2=0.5 or Fst=0.05 under severe PS. The MDS and GPS methods also yielded an inflated empirical type I error rate when Fst=0.05, p1=0.1 and p2=0.3. The MDS method performed well when Fst=0.01 or Fst=0.05, p1=0.1 and p2=0.5, and GPS performed extremely well when Fst=0.1 under severe PS. Table VI also shows the simulated empirical power results. Under moderate PS, PCA consistently outperformed the MDS and GPS methods in terms of power. MDS and GPS had similar and reasonable power under moderate PS, with MDS yielding slightly higher power than GPS. Under severe PS, PCA yielded higher power than the MDS and GPS methods. Again, MDS and GPS yielded reasonable power under severe PS, with GPS yielding consistently higher power than MDS.
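The summary measures reported in the tables (bias, coverage probability, type I error and power) can be computed from per-replicate output along the following lines. This is a minimal Python sketch rather than the code used for the simulations: the function name, the Wald-type 95% interval, and the two-sided Wald test are assumptions made for illustration, and the toy inputs are made-up numbers, not results from this study.

```python
import math

def summarize_replicates(log_or_hats, ses, true_or, z_crit=1.96):
    """Summarize simulation replicates (hypothetical helper for illustration).

    log_or_hats: estimated log ORs, one per replicate
    ses:         standard errors of the estimated log ORs
    true_or:     the OR used to generate the data
    z_crit:      normal critical value (1.96 for a two-sided 0.05 level)
    """
    true_log = math.log(true_or)
    n = len(log_or_hats)
    # bias = log(estimated OR) - log(true OR), averaged over replicates
    bias = sum(b - true_log for b in log_or_hats) / n
    # coverage: share of Wald intervals that contain the true log OR
    covered = sum(1 for b, s in zip(log_or_hats, ses)
                  if b - z_crit * s <= true_log <= b + z_crit * s)
    # rejection rate: share of replicates whose Wald statistic exceeds z_crit
    # (type I error when true_or == 1, empirical power otherwise)
    rejected = sum(1 for b, s in zip(log_or_hats, ses) if abs(b / s) > z_crit)
    return {"bias": bias, "coverage": covered / n, "rejection_rate": rejected / n}

# Toy input (made-up numbers): two replicates generated under a true OR of 1.5
summary = summarize_replicates([math.log(1.5), math.log(3.0)], [0.15, 0.15], 1.5)
```

With the true OR set to 1 the rejection rate is the empirical type I error; with a true OR of 1.5 or 2 it is the empirical power.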
DISCUSSION
PS has been raised as a concern in population-based genetic association studies. Currently, popular methods such as PCA and MDS exist to correct for PS. However, both of these commonly applied methods control bias due to PS using genetic markers alone. We have proposed a new method to correct for potential confounding due to PS using both genetic and non-genetic factors. Our GPS method, which utilizes the estimated propensity score, allows us to represent each subject’s genetic and non-genetic background with a single variable and captures much of the variation due to genetic as well as non-genetic factors. As illustrated in our simulation studies, the GPS method can adjust adequately and consistently for PS when there are moderate or severe ancestral differences between cases and controls. Our simulations demonstrate that GPS consistently outperforms PCA in terms of bias, SE, coverage probability and type I error under moderate or severe PS. The PCA method maintains relatively high power compared to both the MDS and GPS methods under the simulated situations. Under moderate PS, the GPS method consistently outperforms the MDS method in terms of SE and coverage probability. The GPS and MDS methods are comparable in terms of statistical properties such as bias, type I error, and power. However, the proposed GPS method is computationally less burdensome and easier to implement than both the PCA and MDS methods. In addition, in the absence of null markers or AIMs (which is often the case in candidate gene studies), the propensity score approach can still help to reduce confounding due to non-genetic factors, whereas PCA and MDS cannot. As expected, the basic no adjustment method yielded an overestimated OR, high bias, very low coverage probability and inflated type I error under both moderate and severe PS.
To estimate the genomic propensity score GPS(Z, X), we use a logistic regression to obtain the predicted probability of having the high-risk genotype based on (Z, X). The dependent variable is the test-locus genotype rather than the disease status, and the independent variables are the genetic and/or non-genetic covariates. Note that the disease variable is not used in the calculation of the genomic propensity score. To estimate the effect of the genotype on disease, we use a second logistic regression model. In this case, the dependent variable is the disease status, and the independent variables are the test-locus genotype and the estimated genomic propensity score.
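The two regression steps described above can be sketched as follows. This is a toy, dependency-free Python illustration and not the SAS implementation: a single binary covariate z and a single continuous covariate x stand in for the genetic and non-genetic covariates, the logistic regressions are fitted with a hand-written Newton-Raphson routine, and every name and simulation parameter (sample size, coefficients, seed) is an assumption chosen for illustration.

```python
import math
import random

def expit(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    sol = [0.0] * k
    for c in reversed(range(k)):
        sol[c] = (m[c][k] - sum(m[c][j] * sol[j] for j in range(c + 1, k))) / m[c][c]
    return sol

def fit_logistic(rows, y, iters=15):
    """Logistic regression by Newton-Raphson; an intercept is prepended to each row."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(design, y):
            p = expit(sum(b * v for b, v in zip(beta, xi)))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Toy data (assumed parameters): z and x influence both genotype G and disease D
random.seed(2024)
n = 3000
true_log_or = math.log(1.5)
z = [int(random.random() < 0.5) for _ in range(n)]
x = [random.gauss(0.0, 1.0) for _ in range(n)]
lp = [-2.0 + 2.0 * zi + 0.5 * xi for zi, xi in zip(z, x)]
g = [int(random.random() < expit(t)) for t in lp]
d = [int(random.random() < expit(t + true_log_or * gi)) for t, gi in zip(lp, g)]

# Step 1: propensity model, genotype regressed on covariates (disease is not used)
b0, bz, bx = fit_logistic(list(zip(z, x)), g)
gps = [expit(b0 + bz * zi + bx * xi) for zi, xi in zip(z, x)]

# Step 2: outcome model, disease regressed on genotype plus the estimated score
adjusted = fit_logistic(list(zip(g, gps)), d)[1]   # GPS-adjusted log OR
crude = fit_logistic([[gi] for gi in g], d)[1]     # unadjusted log OR
```

In this toy setting the unadjusted log OR is inflated by confounding, while the GPS-adjusted log OR should lie much closer to the generating value of log(1.5). With real data, the rows passed to the first model would hold the full covariate vector (Z, X), and standard regression software would replace the hand-written fitting routine.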
There are several advantages to using propensity scores rather than adjusting covariates directly in genetic association studies. If it is sufficient to adjust for individual covariates X, then it is sufficient to adjust for the propensity score [Rosenbaum and Rubin, 1983]. The direct consequence of using a propensity score is a dramatically reduced data dimensionality. Furthermore, it has been shown that adjusting for the estimated propensity score is better at removing bias than adjusting for the true propensity score. This is because adjusting for the estimated propensity score can remove both systematic and chance imbalances, while adjusting for the true propensity score removes only systematic imbalances [Cepeda et al., 2003]. Finally, because PCA and MDS methods are primarily focused on controlling PS by use of genetic markers alone, any non-genetic factors have to be controlled separately in an additional regression step. This is not the case in the GPS method because genetic and non-genetic covariates are adjusted simultaneously.
In our simulation studies, non-genetic and genetic factors were generated independently. Therefore, there were no directly specified relationships between genetic and non-genetic factors. Both genetic and non-genetic factors were considered to be associated with the gene of interest (G). However, no causal relation was specified between non-genetic factors (E) and G; therefore, E was not a determinant of G. E was associated with G and was also independently related to D; thus E was considered to be a potential confounder. Therefore, it is appropriate to adjust for E when estimating the effect of G on disease. As suggested by the first reviewer, we have added the basic unadjusted model and compared it with the fully adjusted models. Our results demonstrate that the adjustment for confounders affected the association between G and D substantially. The vector of non-genetic covariates is randomly generated, but we have specified distributions for these random variables to mimic actual non-genetic factors. For example, six non-genetic covariates are normally distributed to represent continuous risk factors such as age. The additional four non-genetic factors were specified to have Bernoulli distributions in order to represent categorical risk factors such as family history of disease. We originally proposed Poisson distributions for ordered non-genetic factors. However, with our extensive simulation settings (1000 datasets with 1000 subjects), we believe that the normally distributed non-genetic covariates are a good approximation for Poisson distributed non-genetic risk factors.
We have chosen to limit our simulation studies to two subpopulations based on simulations conducted in previous papers that found that the magnitude of bias due to PS is ultimately bounded by two subpopulations [Wacholder, 2000; Wang, 2004; Wang, 2006]. As the number of subpopulations increases, confounding from each group will tend to cancel as some groups contribute positive confounding and others contribute negative confounding; thus, the effect of the additional subpopulations is likely to dampen the bias [Wacholder, 2000]. Since the main goal of our research was to develop a method to reduce the bias introduced by PS, we adopted the most conservative scenario by using K=2 subpopulations.
The categories of no/mild, moderate and severe PS were determined by varying the prevalence of disease in subpopulations 1 and 2. We introduced different PS scenarios in our simulations by sampling cases in different proportions from the two subpopulations. To induce no/mild PS, we considered a situation of no confounding by sampling cases in the same proportions (0.5, 0.5). To induce moderate stratification, we sampled cases in the proportions 0.4 and 0.6 from the two subpopulations, respectively. Finally, to induce severe PS, we sampled cases in the proportions 0.25 and 0.75. Similar prevalence ratios were used by other authors who designed the same three levels of PS [Zheng, 2006; Epstein, 2007]. The relatively wide ranges of genotype frequencies (0.1, 0.3, 0.5) and Fst values (0.01, 0.05, 0.1) in our simulation studies were chosen for two reasons. First, several recently published papers on methods for population stratification have used similar frequencies and Fst values in their simulation studies [Zheng, 2006; Epstein, 2007]; we feel that using similar parameter ranges allows easier comparison of results. Second, the larger genotype frequencies and Fst values were intentionally chosen to introduce greater bias due to PS. We have demonstrated that our new approach is able to reduce biases due to PS that are relatively large in magnitude, ensuring that this method is capable of handling situations with lower allele frequencies and Fst values.
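To see why unequal case sampling induces a spurious association, a stylized calculation may help. The Python sketch below computes the marginal OR that stratification alone would produce when the true OR is 1, for high-risk genotype frequencies of 0.1 and 0.5 in the two subpopulations. It assumes that controls are drawn equally from both subpopulations and it ignores the non-genetic covariates, so it is an illustration of the mechanism only and is not expected to reproduce the unadjusted ORs in our tables; the function and variable names are hypothetical.

```python
def spurious_or(case_props, control_props, genotype_freqs):
    """Marginal OR induced by stratification alone when the true OR is 1."""
    # Genotype frequency among cases and controls is a mixture over subpopulations
    p_case = sum(w * p for w, p in zip(case_props, genotype_freqs))
    p_ctrl = sum(w * p for w, p in zip(control_props, genotype_freqs))
    return (p_case / (1 - p_case)) / (p_ctrl / (1 - p_ctrl))

freqs = (0.1, 0.5)     # high-risk genotype frequency in subpopulations 1 and 2
controls = (0.5, 0.5)  # assumption: controls sampled equally from both subpopulations
scenarios = {"no/mild": (0.5, 0.5), "moderate": (0.4, 0.6), "severe": (0.25, 0.75)}

induced = {name: spurious_or(props, controls, freqs)
           for name, props in scenarios.items()}
# induced is roughly {"no/mild": 1.00, "moderate": 1.20, "severe": 1.56}
```

Under these assumptions the induced OR grows from 1.00 to about 1.20 and 1.56 as the case proportions move from (0.5, 0.5) to (0.4, 0.6) and (0.25, 0.75), even though the genotype has no effect on disease.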
In our simulation studies, our choice of 10 non-genetic covariates was based on the decision to include a large enough number of covariates to demonstrate the utility of propensity score adjustment over other methods (if we chose only 2 or 3 covariates, then other methods could be applied within covariate strata) but not so large a number as to depart from the reality of current genetic association studies. In our experience conducting genetic association studies, we generally have access to or have collected around this number of patient and/or disease characteristics. It should be emphasized that one of the distinct advantages of the propensity score approach is its data reduction capability: it adjusts for numerous covariates simultaneously. If a study collected and adjusted for more than 10 non-genetic factors, then our approach should be able to further reduce bias by adding the additional non-genetic factors to obtain a more precisely estimated propensity score. Our choice of 50 null markers was based on the current literature. A number of authors have shown, via simulation studies, that 50 to 100 AIMs are needed to accurately assign an individual’s ancestry [Risch, 2002; Tsai, 2005; Choudhry, 2006; Barnholtz-Sloan, 2008]. Moreover, for reliable estimation using genomic information to address the problem of bias due to PS, Devlin [2001b] states that one requires 50 or more null markers. We used the lower bound of the recommended number (50 rather than 100) because we wanted to establish a method that has good properties using minimal genomic information. With larger numbers of null markers, our approach should be able to further reduce bias due to PS.
When evaluating the performance of the GPS method, we found that GPS always slightly underestimates the OR when the true OR is 1.5 or 2, regardless of the level of PS. GPS also tends to slightly underestimate the OR when the true OR is 1.5 in comparison to the PCA and MDS methods under moderate or severe PS. This phenomenon has both advantages and disadvantages when applying the GPS method. An advantage is that GPS may tend to reduce the number of false positive findings, even though this underestimation may reduce the power to detect a true association under certain conditions. In many situations, it is preferable to avoid false positive associations, which may lead to unnecessary follow-up or validation studies. In our simulation studies, the power was comparable between GPS and MDS, and only slightly reduced compared to PCA. We recognize that underestimation is as undesirable as overestimation, and that bias can arise in either direction. False-positive and false-negative results arising from over- or underestimation remain a problem for these studies, and other approaches to avoid these errors should be considered.
We did not compare our GPS approach to other methods such as GC and SA. GC has the advantage that an underlying model of the population structure does not have to be specified, but it does not provide parameter estimates [Satten et al., 2001]. Also, the GC approach uses only genotype information. SA methods are more attractive for studies with more than two potential subpopulations, which is not the case in our simulation studies (K=2 only). Furthermore, SA methods are computationally demanding, and because the notion of subpopulations is a theoretical construct that only imperfectly reflects reality, the question of the correct number of subpopulations can never be fully resolved [Balding, 2006].
We present our method in the context of a binary trait. It is straightforward to extend the proposed approach to other traits, such as quantitative traits or multi-categorical traits, by using a generalized linear model approach [McCullagh and Nelder, 1989]. In addition, we are currently extending our methods to allow for three-category genotypes (rather than dichotomizing the risk genotype into high and low risk categories) and haplotypes using an extended propensity score [Imai and Van Dyk, 2004].
The main focus of this research was to best estimate the effect of a candidate gene by correcting for confounding effects of genetic and non-genetic factors. Our method can be used to handle multiple SNPs, but is not easily applied to GWAS or other large-scale studies using hundreds or thousands of SNPs, because each test requires the estimation of a propensity score first and then the fitting of a logistic regression model. Further research is needed to extend the GPS method to genome-wide association studies. Because the main goal of our method is to obtain the best estimate of the measure of association (OR and 95% confidence interval), it is better suited to candidate gene studies than to GWAS, where signal detection is of greatest importance.
Our proposed method can be implemented by using standard software such as R and SAS. We have developed SAS code for implementing our GPS approach which is available upon request from the author. In addition, our proposed method requires much less computing time than both PCA and MDS. For example, to run 1000 simulations under a single scenario takes less than 1 hour for the GPS method, about 3 to 4 hours for the PCA method, and about 100 hours for the MDS method on an Intel® Core™ Duo CPU at 1.6GHz.
Supplementary Material
Fig. 3. SE with GPS by MAF and OR. Dotted line is no/mild PS. Short-dashed line is moderate PS. Long-dashed line is severe PS.
Fig. 4. Coverage Probability with GPS by MAF and OR. Dotted line is no/mild PS. Short-dashed line is moderate PS. Long-dashed line is severe PS.
Fig. 5. Power with GPS by MAF and OR. Dotted line is no/mild PS. Short-dashed line is moderate PS. Long-dashed line is severe PS.
Acknowledgments
We thank Drs. Mingyao Li, Jinbo Chen and Wensheng Guo for their helpful comments on the simulation studies. This work was supported by grants P50-CA105641 and R01-CA08574 from NIH (to T.R.R.); and grants from the Pennsylvania Department of Health (to N.M.) and an Abramson Cancer Center Pilot Grant (to N.M.). The Department specifically disclaims responsibility for any analysis, interpretations or conclusions.
References
- Bacanu SA, Devlin B, Roeder K. The power of genomic control. Am J Hum Genet. 2000;66:1933–1944. doi: 10.1086/302929.
- Balding DJ. A tutorial of statistical methods for population association studies. Nat Rev Genet. 2006;7(10):781–791. doi: 10.1038/nrg1916.
- Barnholtz-Sloan JS, McEvoy B, Shriver MD, Rebbeck TR. Ancestry estimation and correction for population stratification in molecular epidemiologic association studies. Cancer Epidemiol Biomarkers Prev. 2008;17(3):471–477. doi: 10.1158/1055-9965.EPI-07-0491.
- Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol. 2003;158:280–287. doi: 10.1093/aje/kwg115.
- Choudhry S, Coyle NE, Tang H, et al. Population stratification confounds genetic association studies among Latinos. Hum Genet. 2006;118:652–664. doi: 10.1007/s00439-005-0071-3.
- Deng HW. Population admixture may appear to mask, change or reverse genetic effects of genes underlying complex traits. Genetics. 2001;159:1319–1323. doi: 10.1093/genetics/159.3.1319.
- Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
- Devlin B, Roeder K, Wasserman L. Genomic control, a new approach to genetic-based association studies. Theor Popul Biol. 2001a;60:155–166. doi: 10.1006/tpbi.2001.1542.
- Devlin B, Roeder K, Bacanu SA. Unbiased methods for population-based association studies. Genet Epidemiol. 2001b;21:273–284. doi: 10.1002/gepi.1034.
- Epstein MP, Allen AS, Satten GA. A Simple and Improved Correction for Population Stratification in Case-Control Studies. Am J Hum Genet. 2007;80:921–930. doi: 10.1086/516842.
- Hennekens CH, Buring JE. Epidemiology in Medicine. Philadelphia: Lippincott Williams & Wilkins; 1987.
- Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003;72:1492–1504. doi: 10.1086/375613.
- Imai K, Van Dyk DA. Causal inference with general treatment regimes: generalizing the propensity score. JASA. 2004;99:854–866.
- Joffe MM, Rosenbaum PR. Invited commentary: propensity scores. Am J Epidemiol. 1999;150:327–333. doi: 10.1093/oxfordjournals.aje.a010011.
- Kaufman L, Rousseeuw PJ. Finding Groups in Data. New York: Wiley; 1990.
- Li QZ, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008;32:215–226. doi: 10.1002/gepi.20296.
- Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23:2937–2960. doi: 10.1002/sim.1903.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. New York: Chapman & Hall/CRC Press; 1989.
- Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004;36:512–517. doi: 10.1038/ng1337.
- Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. New York: Academic Press; 2003.
- Nievergelt CM, Libiger O, Schork NJ. Generalized analysis of molecular variance. PLoS Genet. 2007;3(4):e51. doi: 10.1371/journal.pgen.0030051.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
- Pritchard JK, Donnelly P. Case-control studies of association in structured or admixed populations. Theor Popul Biol. 2001;60:227–237. doi: 10.1006/tpbi.2001.1543.
- Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999;65:220–228. doi: 10.1086/302449.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet. 2000;67:170–181. doi: 10.1086/302959.
- Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516.
- Risch N, Burchard E, Ziv E, Tang H. Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 2002;3:1–12. doi: 10.1186/gb-2002-3-7-comment2007.
- Rosenbaum PR. In: Observational Studies. Rosenbaum PR, editor. New York: Springer-Verlag; 1995. pp. 1–12.
- Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
- Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33–38.
- Satten GA, Flanders WD, Yang Q. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001;68:466–477. doi: 10.1086/318195.
- Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B. 2001;63(2):411–423.
- Tiwari HK, Barnholtz-Sloan J, Wineinger N, Padilla MA, Vaughan LK, Allison DB. Review and evaluation of methods correcting for population stratification with a focus on underlying statistical principles. Human Heredity. 2008;66:67–86. doi: 10.1159/000119107.
- Tsai H, Peng J, Wang P, Risch NJ. Comparison of three methods to estimate genetic ancestry and control for stratification in genetic association studies among admixed populations. Hum Genet. 2005;118:424–433. doi: 10.1007/s00439-005-0067-z.
- Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst. 2000;92(14):1151–1158. doi: 10.1093/jnci/92.14.1151.
- Wacholder S, Rothman N, Caporaso N. Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol Biomarkers Prev. 2002;11:513–520.
- Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in case-control association studies of admixed populations. Genet Epidemiol. 2004;27:14–20. doi: 10.1002/gepi.20003.
- Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in epidemiologic studies of gene-gene or gene-environment interactions. Cancer Epidemiol Biomarkers Prev. 2006;15(1):124–132. doi: 10.1158/1055-9965.EPI-05-0304.
- Wright S. The genetical structure of populations. Ann Eugen. 1951;15:323–354. doi: 10.1111/j.1469-1809.1949.tb02451.x.
- Zhang F, Wang YP, Deng HW. Comparison of population-based association study methods correcting for population stratification. PLoS ONE. 2008;3(10):e3392. doi: 10.1371/journal.pone.0003392.
- Zheng G, Freidlin B, Gastwirth JL. Robust genomic control for association studies. Am J Hum Genet. 2006;78:350–356. doi: 10.1086/500054.