To the Editor: Recently, Lee et al.1 presented a method to estimate the proportion of phenotypic variation explained by common SNPs for case-control phenotypes. This extends the work of Yang et al.2 for estimating the proportion of phenotypic variation that can be explained by common SNPs for quantitative traits. Yang et al.2 found that 45% of variation in height in Australian individuals of European descent can be explained by common SNPs. Lee et al. showed that a high proportion (22%–38%) of the variation in liability for Crohn disease, bipolar disorder, and type 1 diabetes in the Wellcome Trust Case-Control Consortium (WTCCC) data can be explained by common SNPs.1
Under the Lee et al.1 and Yang et al.2 framework, common SNPs explain a proportion of the variation of the trait if individuals who are more genetically similar at the common SNPs are also more phenotypically similar. Genetic similarity is estimated with genome-wide allele sharing, and the proportion of variation explained is estimated with a variance components model.1,2 These analyses are susceptible to confounding by population structure. Confounding can occur because individuals who are more genetically similar also tend to be geographically proximal or in the same social class and thus have a shared environment,3 so that it is unclear whether phenotypic similarity is caused by shared genetic factors or by shared environment. Additionally, in case-control studies, ascertainment of a larger proportion of cases than of controls from a geographic region can inflate estimates of the proportion of variance explained by SNPs because relatedness within regions tends to be higher than relatedness across regions. For example, the WTCCC bipolar disorder study ascertained a disproportionate number of cases from Wales, and the participants from Wales and Scotland had much higher rates of recent identity by descent than participants from all other regions within the UK.4 This excess relatedness in cases due to ascertainment would be expected to inflate estimates of phenotypic variance explained by the genetic data.
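To make the notion of genome-wide allele sharing concrete, the following is a minimal sketch of a genetic relationship matrix computed from standardized genotypes. It assumes that genotypes are coded as allele counts (0, 1, or 2), that every SNP is polymorphic in the sample, and that the same formula is used on and off the diagonal, which is a simplification relative to the estimator described by Yang et al.2 The function name `relationship_matrix` is illustrative and is not part of GCTA.

```python
def relationship_matrix(genotypes):
    """Return an n x n matrix of average standardized allele sharing.

    genotypes[j][i] is the allele count (0, 1, or 2) of individual j at SNP i.
    Assumes every SNP is polymorphic in the sample (frequency not 0 or 1).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Sample allele frequency at each SNP.
    freqs = [sum(g[i] for g in genotypes) / (2 * n) for i in range(m)]
    # Center by 2p and scale by sqrt(2p(1-p)) so each SNP has unit variance.
    z = [
        [(g[i] - 2 * freqs[i]) / (2 * freqs[i] * (1 - freqs[i])) ** 0.5
         for i in range(m)]
        for g in genotypes
    ]
    # Relatedness of j and k is the mean product of standardized genotypes.
    return [
        [sum(a * b for a, b in zip(z[j], z[k])) / m for k in range(n)]
        for j in range(n)
    ]
```

Pairs of individuals with large positive off-diagonal entries share more alleles than expected from the sample allele frequencies; the variance components model then asks whether such pairs are also more similar in phenotype.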
In this letter, we illustrate the problem of confounding due to population structure in estimating the proportion of phenotypic variation explained by common SNPs, and we show that inclusion of principal components (PCs) as fixed effects does not fully correct for population structure in this setting. We use real genetic data from the WTCCC5 controls with simulated phenotypes as well as simulated genotype and phenotype data.
The WTCCC control data consist of 2938 population controls, each of which has a reported geographic region, which is one of 12 regions within the UK.5 In addition to the WTCCC's quality control filters, we required a 0.99 posterior probability to call a genotype and removed SNPs with more than 0.5% missing data. Following Lee et al.,1 we also removed SNPs with minor allele frequency <0.05 and SNPs showing differential missingness between the two control cohorts (NBS and 58C) or deviation from Hardy-Weinberg equilibrium at a 0.05 significance level; 206,103 SNPs remained after application of these filters. We removed individuals with more than 1% missing data and both individuals in pairs with a GCTA relatedness value > 0.05. This left 2861 individuals.
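As an illustration of two of the SNP-level filters described above, the following sketch applies the missingness and minor allele frequency thresholds to a single SNP. The genotype coding (allele counts with `None` for a missing call) and the function name `passes_qc` are assumptions made for the example; the tests for differential missingness and Hardy-Weinberg equilibrium are omitted.

```python
MAX_MISSING_RATE = 0.005  # SNPs with more than 0.5% missing data are removed
MIN_MAF = 0.05            # SNPs with minor allele frequency < 0.05 are removed


def passes_qc(calls):
    """Return True if a SNP passes the missingness and MAF filters.

    calls holds one entry per individual: an allele count (0, 1, or 2),
    or None when the genotype could not be called.
    """
    observed = [c for c in calls if c is not None]
    if not observed:
        return False
    missing_rate = 1 - len(observed) / len(calls)
    if missing_rate > MAX_MISSING_RATE:
        return False
    # Frequency of the counted allele among called genotypes.
    freq = sum(observed) / (2 * len(observed))
    maf = min(freq, 1 - freq)
    return maf >= MIN_MAF
```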
We used the reported geographic region to simulate the phenotype for each individual without consideration of the genetic data. The relationship between geographic region and simulated trait distribution could potentially be explained by environmental factors such as diet or occupation that vary with geographic region, or for case-control data by ascertainment scheme. By construction, each phenotype is independent of genotype within each geographical region, and consequently the phenotypes have no heritability.
We used the GCTA software6 version 0.91.0 to analyze the data; the simulated phenotypes are described below. For each simulated phenotype, we used the software to estimate the proportion of phenotypic variability explained by the autosomal SNPs without adjustment for PCs and with adjustment for 20 PCs that were calculated by GCTA.
We simulated case-control phenotype data for the WTCCC control individuals. For simplicity, we combined all regions except Scotland and Wales, and we refer to the combined regions as “England.” A randomly selected 90% of individuals from Scotland and Wales were assigned to be “cases,” and the remaining 10% “controls,” whereas in England 10% of individuals were assigned to be “cases” and the remaining 90% “controls.” We assumed a prevalence of 1% for the disease and used this to convert case-control variation onto the liability scale, which accounts for case-control ascertainment.1 The genotype data explain 51% (standard error [SE] = 7%) of the case-control variation on the liability scale without adjustment for PCs and 31% (SE = 7%) of the case-control variation on the liability scale when adjusting for 20 PCs. With adjustment for imperfect linkage disequilibrium (LD),2,6 common SNPs are estimated to explain 60% (SE = 9%) of the case-control variation on the liability scale without PCs and 37% (SE = 9%) with 20 PCs. We used 20 PCs to follow Lee et al.,1 but Table S1, available online, demonstrates that 10 PCs provide the same level of correction for the WTCCC data. Results for other case-control ascertainment schemes and for other relationship cut-off values are shown in Table 1 and Table S2.
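The conversion to the liability scale can be sketched as follows, using the transformation given by Lee et al.1 in which the estimate on the observed (0/1) scale is multiplied by K²(1 − K)² / (z² P(1 − P)), where K is the population prevalence, P is the proportion of cases in the sample, and z is the height of the standard normal density at the liability threshold. The function name and the example case proportion below are illustrative assumptions and are not taken from the analyses reported here.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert variance explained on the observed scale to the liability scale.

    prevalence is the population prevalence K; case_fraction is the
    proportion of cases P in the ascertained sample.
    """
    std_normal = NormalDist()
    # Liability threshold above which an individual is affected.
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    # Height of the standard normal density at the threshold.
    z = std_normal.pdf(threshold)
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1 - k)) ** 2 / (z ** 2 * p * (1 - p))


# Multiplicative factor for a 1% prevalence and a hypothetical sample
# containing equal numbers of cases and controls.
factor = observed_to_liability(1.0, prevalence=0.01, case_fraction=0.5)
```

Under these illustrative inputs the factor is roughly 0.55, so the assumed prevalence and the sampled case proportion both have a direct effect on the reported liability-scale estimate.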
Table 1.
Average Estimated Inflation in Estimates of Variation Explained for Several Case-Control Ascertainment Schemes in WTCCC Control Data
Ascertainment scheme | A (80% and 20%)a | B (75% and 30%)b | Null Model (50% and 50%)c |
---|---|---|---|
Average estimated variance explained | 8.1% | 3.9% | 2.1% |
Standard error | 0.6% | 0.4% | 0.3% |
We computed the mean estimated variance of phenotype explained for 100 realizations of each ascertainment scheme. The estimated percentage of variance explained is constrained to be nonnegative, so even the null model gives a small average estimated variance explained; however, the difference between the null results and the results from schemes A and B is statistically significant. The ascertainment in scheme B is similar to ascertainment in the WTCCC Bipolar study, in which 72% of Welsh were cases, whereas 34% of English were cases. All results were generated with 20 PCs, a 0.025 cut-off on relatedness, and no adjustment for incomplete LD, and are reported on the liability scale with 1% prevalence. We note that although scheme B is designed to give similar regional proportions to those in the WTCCC Bipolar data, the regional information for the WTCCC data is crude and does not fully capture the population structure in the data. Thus, we are not able to fully model the geographical (and possibly socio-economic) ascertainment differences between the cases and controls, which limits the effect sizes that we can observe in our simulations.
a. Ascertainment scheme A has 80% of Welsh and Scots designated as cases and 20% of English designated as cases.
b. Ascertainment scheme B has 75% of Welsh and Scots designated as cases and 30% of English designated as cases.
c. The null scheme has equal ascertainment of cases and controls across all subpopulations (50% cases). A scheme with 30% as cases in all subpopulations gave identical results (data not shown).
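The comparison of schemes A and B with the null scheme can be checked approximately from the values in Table 1. The sketch below assumes that the tabulated standard errors are standard errors of the mean over the 100 realizations and that the schemes are independent, and it uses a simple two-sample z statistic; the test actually used for the table note is not specified, so this is only an illustration.

```python
from math import sqrt
from statistics import NormalDist

# (mean estimated variance explained, standard error), in percent, from Table 1.
TABLE_1 = {"A": (8.1, 0.6), "B": (3.9, 0.4), "null": (2.1, 0.3)}


def z_versus_null(scheme):
    """Two-sample z statistic comparing a scheme's mean with the null mean."""
    mean, se = TABLE_1[scheme]
    null_mean, null_se = TABLE_1["null"]
    return (mean - null_mean) / sqrt(se ** 2 + null_se ** 2)


z_a = z_versus_null("A")
z_b = z_versus_null("B")
# One-sided p value for scheme B under a normal approximation.
p_b = 1.0 - NormalDist().cdf(z_b)
```

Under these assumptions the z statistic is about 8.9 for scheme A and 3.6 for scheme B, which is consistent with the statement that both differ significantly from the null scheme.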
We also simulated quantitative trait phenotype data for these individuals from a normal distribution in which the mean depends on the geographic region. The mean simulated trait values in Scotland, England, and Wales were 1, 2, and 3, respectively, with a standard deviation of 0.4 within each region. For this phenotype, the genotype data explain 50% (SE = 9%) of the phenotypic variation without adjustment for PCs and 42% (SE = 9%) with adjustment for 20 PCs. With adjustment for imperfect LD, common SNPs are estimated to explain 60% (SE = 11%) of the phenotypic variation without adjustment for PCs and 50% (SE = 11%) with 20 PCs.
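The quantitative trait simulation can be sketched as below. The regional means and the within-region standard deviation follow the description above; the region sizes in the usage example are hypothetical, and the helper `between_region_fraction` is an illustrative addition that reports how much of the simulated trait variance is attributable to region alone.

```python
import random
from statistics import mean, pvariance

REGION_MEANS = {"Scotland": 1.0, "England": 2.0, "Wales": 3.0}
WITHIN_REGION_SD = 0.4


def simulate_trait(regions, rng):
    """Draw one trait value per individual; genotype plays no role."""
    return [rng.gauss(REGION_MEANS[r], WITHIN_REGION_SD) for r in regions]


def between_region_fraction(regions, trait):
    """Fraction of total trait variance explained by region membership."""
    grand_mean = mean(trait)
    by_region = {}
    for region, value in zip(regions, trait):
        by_region.setdefault(region, []).append(value)
    between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in by_region.values()
    ) / len(trait)
    return between / pvariance(trait)


# Hypothetical region sizes, chosen only for illustration.
rng = random.Random(1)
regions = ["Scotland"] * 200 + ["England"] * 2000 + ["Wales"] * 200
trait = simulate_trait(regions, rng)
fraction = between_region_fraction(regions, trait)
```

Because the trait depends on region only, any variance attributed to SNPs in a subsequent analysis reflects the correlation between region and genome-wide allele sharing rather than a genetic effect.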
The results of analyzing the WTCCC control genotypes with simulated phenotypes show that estimates of phenotypic variance explained by genotypes can be biased upward in the presence of confounding with population structure. Comparison of the analysis with and without PCs shows that PCs are absorbing some of the confounding due to population structure but are unable to fully adjust for population structure in these data. The geographic region information for the WTCCC data is a crude surrogate for population structure, and our simulations might understate the potential inflation in the estimates of phenotypic variance explained by common SNPs in these data.
We also simulated genotype data for individuals located on a 5 × 5 grid. Each grid point can be thought of as a geographic region such as a county. We simulated a population of 2 million individuals, initially unrelated, and simulated forward for 200 generations with a constant population size in each generation. We generated crossovers according to Haldane's model7 on 20 chromosomes each 150 cM long. The initial geographic region of each founder was chosen at random. At each generation, an individual from the previous generation was chosen at random to be parent 1, and then another individual from the same region was chosen to be parent 2. The probability that the child would be placed in the same region as the parents was 0.5; children not placed in the same region as the parents were placed in a randomly chosen region located one horizontal or vertical grid unit away. Genotype data were generated for 300,000 SNPs with minor allele frequency uniformly distributed between 0.05 and 0.5. The SNPs were simulated to be in linkage equilibrium in the founder generation.
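Two components of this forward simulation can be sketched as follows: the placement of a child on the grid and the generation of crossover positions under Haldane's model, in which crossovers occur without interference as a Poisson process with an average of one crossover per Morgan. The handling of grid edges (choosing only among neighbors that lie on the grid) is an assumption, because the description above does not specify it, and all function names are illustrative.

```python
import math
import random

GRID_SIZE = 5
STAY_PROBABILITY = 0.5
CHROMOSOME_LENGTH_CM = 150.0


def neighbors(cell):
    """Cells one horizontal or vertical unit away that lie on the grid."""
    x, y = cell
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates
            if 0 <= a < GRID_SIZE and 0 <= b < GRID_SIZE]


def place_child(parent_cell, rng):
    """Keep the child in the parents' region with probability 0.5."""
    if rng.random() < STAY_PROBABILITY:
        return parent_cell
    # Assumption: edge cells migrate only to on-grid neighbors.
    return rng.choice(neighbors(parent_cell))


def crossover_positions(rng, length_cm=CHROMOSOME_LENGTH_CM):
    """Crossover positions (cM) on one chromosome under Haldane's model."""
    positions = []
    # Gaps between crossovers are exponential with mean 1 Morgan (100 cM).
    position = rng.expovariate(1.0) * 100.0
    while position < length_cm:
        positions.append(position)
        position += rng.expovariate(1.0) * 100.0
    return positions


def haldane_recombination_fraction(distance_cm):
    """Haldane's map function: r = (1 - exp(-2d)) / 2, with d in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))
```

With a stay probability of 0.5 and migration limited to adjacent cells, allele frequencies drift apart between distant cells over the 200 generations, which produces fine-scale spatial structure.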
We sampled 10,000 individuals at the final generation, of which 9340 individuals remained after removing closely related individuals (with a relatedness cut-off of 0.05). We generated simulated phenotypic traits that were based on the geographic origin of the individuals without consideration of the genotype so that the traits have zero heritability. We simulated a case-control trait with ascertained case proportions 0, 0.25, 0.50, 0.75, or 1 according to the value of the first dimension of the individual's location. When assuming a population prevalence of 1%, with or without the use of 20 PCs, 27.2% (SE = 4.5%) of the variance was explained on the liability scale (without adjustment for imperfect LD). We also simulated a quantitative trait with a mean equal to 0, 1, 2, 3, or 4 according to the value of the first dimension of the individual's location and with a standard deviation of 1 in each region. Without the use of PCs, 53.6% (SE = 8.1%) of variance of the trait was explained by the genotype data. With 20 PCs, 52.0% (SE = 8.1%) of the variance was explained. Thus, in these simulated data not only do the PCs not correct for population structure, but a comparison of estimates with and without PC adjustment does not give a clear indication of the presence of the population structure as it did in the WTCCC control data. Additionally, a test for interchromosomal correlation in relatedness described in Yang et al.2 did not yield evidence of population structure in these data.
Other methods for adjusting for population structure used in association analysis are not necessarily applicable in this context. For example, in association testing, one tests a single locus at a time and assumes that the rest of the genome has essentially no effect (on average). However, in estimating the proportion of variance explained one uses either the whole genome or a very large portion of the genome. Thus, one does not have a large number of other similar units that mostly have no effect to use as a control. Clearly, genomic control is not applicable here. As another example, mixed linear models that are very similar to the model used by Yang et al.2 and Lee et al.1 have been applied to association testing, with the relationship between estimated kinship and phenotypic similarity being used to correct for population structure and cryptic relatedness8 rather than to estimate the proportion of variance explained. Unfortunately, one cannot simultaneously use this relationship for both purposes.
In this study, we considered the effect of population structure in the absence of a genetic effect. For heritable traits, estimated phenotypic variance explained by common SNPs could include both true genetic effect and inflation due to confounding factors. However, if the data have been ascertained to avoid biases and ensure homogeneity, this inflation should be a very small part of the estimate.
The role of common variants in contributing to phenotype variability is an important question with crucial implications for study design. The results presented here suggest that confounding with fine-scale population structure is a serious concern when estimating phenotypic variability explained by common SNPs and that there is a need for more sensitive methods that can detect and quantify population structure in this context.
Acknowledgments
This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of investigators who contributed to the generation of Wellcome Trust Case-Control Consortium data is available from www.wtccc.org.uk. Funding for the Wellcome Trust Case-Control Consortium project was provided by the Wellcome Trust under award 076113. This work was supported by National Institutes of Health (NIH) awards R01HG004960 and R01HG005701. The content of this study is the sole responsibility of the authors and does not necessarily reflect the views of the NIH or the Wellcome Trust.
Supplemental Data
Web Resources
The URLs for data presented herein are as follows:
GCTA, a Tool for Genome-wide Complex Trait Analysis, http://gump.qimr.edu.au/gcta/
Wellcome Trust Case-Control Consortium, www.wtccc.org.uk
References
- 1. Lee S.H., Wray N.R., Goddard M.E., Visscher P.M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 2011;88:294–305. doi: 10.1016/j.ajhg.2011.02.002.
- 2. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
- 3. Risch N., Burchard E., Ziv E., Tang H. Categorization of humans in biomedical research: Genes, race and disease. Genome Biol. 2002;3:comment2007.1–comment2007.12. doi: 10.1186/gb-2002-3-7-comment2007.
- 4. Browning B.L., Browning S.R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 2011;88:173–182. doi: 10.1016/j.ajhg.2011.01.010.
- 5. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
- 6. Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011.
- 7. Haldane J.B.S. The combination of linkage values and the calculation of distances between the loci of linked factors. J. Genet. 1919;8:299–309.
- 8. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548.