Author manuscript; available in PMC: 2014 Jul 15.

Published in final edited form as: Nat Rev Genet. 2013 Jul;14(7):507–515. doi: 10.1038/nrg3457

a) Human: A high R² can be achieved by chance, particularly when the sample size is small.

We simulated GWAS data based on real human genotype data under the null hypothesis of no association. We used data from 11,586 unrelated European Americans genotyped on 563,212 SNPs⁷¹⁻⁷³. We randomly sampled N individuals and selected the top SNPs for height at p < 10⁻⁵ (red bar) and p < 10⁻⁴ (blue bar) to predict the phenotype in the same data. We also performed an association analysis for the real height phenotype in 10,000 individuals and selected the top SNPs at p < 10⁻⁵ (green bar) and p < 10⁻⁴ (purple bar) to predict the height phenotype in the same sample. The graph shows the mean prediction R² over 100 simulation replicates. Error bars show the standard error of the mean. The number on top of each column is the mean number of selected SNPs over 100 simulation replicates.
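The null simulation in panel a can be approximated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' pipeline: genotypes are drawn as independent binomial SNPs instead of real genotype data, the SNP count and threshold are scaled down (2,000 SNPs at p < 0.01) so that it runs quickly, p-values use a normal approximation to the t-statistic, and the in-sample predictor is an ordinary least-squares fit on the selected SNPs. The function name `in_sample_r2` is hypothetical.

```python
import math
import numpy as np

def in_sample_r2(n, m, p_threshold, rng):
    """Null simulation: select SNPs by marginal p-value, then predict in the same sample."""
    # Assumed genotype model: independent SNPs coded 0/1/2 with random allele frequencies.
    freqs = rng.uniform(0.1, 0.9, size=m)
    geno = rng.binomial(2, freqs, size=(n, m)).astype(float)
    pheno = rng.standard_normal(n)  # phenotype is independent of every SNP

    # Marginal correlation of each SNP with the phenotype.
    g = geno - geno.mean(axis=0)
    y = pheno - pheno.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (y @ y))
    r = np.divide(g.T @ y, denom, out=np.zeros(m), where=denom > 0)

    # Two-sided p-values from a normal approximation to the t-statistic (an assumption).
    z = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    selected = np.flatnonzero(pvals < p_threshold)
    if selected.size == 0:
        return 0.0, 0

    # Fit the selected SNPs jointly and score the fit in the SAME individuals.
    design = np.column_stack([np.ones(n), geno[:, selected]])
    beta, *_ = np.linalg.lstsq(design, pheno, rcond=None)
    resid = pheno - design @ beta
    return float(1.0 - (resid @ resid) / (y @ y)), int(selected.size)

rng = np.random.default_rng(0)
results = {}
for n in (100, 1000):
    runs = [in_sample_r2(n, 2000, 0.01, rng) for _ in range(3)]
    results[n] = (float(np.mean([r2 for r2, _ in runs])),
                  float(np.mean([k for _, k in runs])))
    print(f"N={n}: mean in-sample R2={results[n][0]:.2f}, "
          f"mean SNPs selected={results[n][1]:.1f}")
```

In this sketch a similar number of SNPs pass the threshold at either sample size, but each selected SNP explains a larger share of variance by chance when N is small, so the in-sample R² is higher at N = 100 than at N = 1,000.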

b) Drosophila: An example illustrating the bias that arises when selecting the top SNPs. We downloaded genotype data of the Drosophila Genetics Reference Panel and simulated phenotypes under the null hypothesis, i.e., random association between each of the > 1 million SNPs and the phenotype. We repeated the GWAS analysis reported in ref. 54, selecting the top 10 independently associated SNPs, and predicted the phenotypes of the lines using these 10 SNPs. Because the simulated data contain only random associations between SNPs and phenotype, any apparent predictive power is spurious and the result of over-fitting. By chance, the top SNPs (in terms of test statistic) explain 57% of the phenotypic variance between the inbred lines (R² = 0.57), from a linear regression of phenotype on predictor. Both phenotype and predictor have been standardized to z-scores (mean of zero and standard deviation of one).
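The selection bias in panel b can be reproduced in miniature. The sketch below rests on stated assumptions and does not use the panel's data: 150 hypothetical inbred lines, 20,000 independently simulated SNPs coded 0 or 2, and a null phenotype. Because the SNPs are simulated independently, the top 10 by marginal correlation stand in for "independently associated" SNPs without any pruning step. The helper names are hypothetical. With both variables standardized, the regression slope of phenotype on predictor equals their correlation, so R² is the squared slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n_lines, n_snps, k = 150, 20000, 10  # hypothetical, scaled-down sizes

# Inbred lines are assumed fully homozygous, so genotypes are coded 0 or 2.
freqs = rng.uniform(0.1, 0.9, size=n_snps)
geno = 2.0 * rng.binomial(1, freqs, size=(n_lines, n_snps))
pheno = rng.standard_normal(n_lines)  # null phenotype: no SNP has a true effect

def marginal_abs_corr(geno, pheno):
    """Absolute correlation of every SNP with the phenotype (0 for monomorphic SNPs)."""
    g = geno - geno.mean(axis=0)
    y = pheno - pheno.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (y @ y))
    r = np.divide(g.T @ y, denom, out=np.zeros(geno.shape[1]), where=denom > 0)
    return np.abs(r)

def fitted_predictor(geno_subset, pheno):
    """In-sample predictor: least-squares fit of the phenotype on the chosen SNPs."""
    design = np.column_stack([np.ones(len(pheno)), geno_subset])
    beta, *_ = np.linalg.lstsq(design, pheno, rcond=None)
    return design @ beta

def standardized_slope(pheno, predictor):
    """Slope of z-scored phenotype on z-scored predictor (equals their correlation)."""
    zy = (pheno - pheno.mean()) / pheno.std()
    zp = (predictor - predictor.mean()) / predictor.std()
    return float(np.mean(zy * zp))

top = np.argsort(-marginal_abs_corr(geno, pheno))[:k]    # selected on the test statistic
random_pick = rng.choice(n_snps, size=k, replace=False)  # no selection, for comparison

r2_top = standardized_slope(pheno, fitted_predictor(geno[:, top], pheno)) ** 2
r2_random = standardized_slope(pheno, fitted_predictor(geno[:, random_pick], pheno)) ** 2
print(f"R2 with top {k} SNPs: {r2_top:.2f}; with {k} random SNPs: {r2_random:.2f}")
```

The comparison with 10 randomly chosen SNPs separates the two sources of inflation: fitting 10 predictors in the same sample inflates R² a little, while choosing them by their test statistic from thousands of candidates inflates it far more.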

c) Dairy Cattle: The impact of leaving the validation cohort in the discovery set, either at both the SNP selection (GWAS) and SNP effect estimation stages, or at the effect estimation stage only. Data shown are from 2,732 bulls genotyped for ~500K SNPs and phenotyped for the average milk yield of their daughters. The bulls were split into a discovery sample (bulls born during or before 2003) of N_d = 2,458 and a validation sample (bulls born after 2003) of N_v = 274.
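The three designs compared in panel c can be sketched on simulated data. This is an illustration under assumptions, not a reanalysis of the cattle data: the phenotype here is pure noise (the real milk-yield phenotype carries genuine signal), the sizes are scaled down (240 discovery and 60 validation individuals, 5,000 SNPs, top 20 selected), and all function names are hypothetical. Under the null, any validation R² well above zero reflects leakage from the validation cohort.

```python
import numpy as np

def top_k_snps(geno, pheno, k):
    """Indices of the k SNPs with the largest absolute marginal correlation."""
    g = geno - geno.mean(axis=0)
    y = pheno - pheno.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (y @ y))
    r = np.divide(g.T @ y, denom, out=np.zeros(geno.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(r))[:k]

def estimate_effects(geno, pheno):
    """Joint least-squares effects (intercept first) for the supplied SNP columns."""
    design = np.column_stack([np.ones(len(pheno)), geno])
    beta, *_ = np.linalg.lstsq(design, pheno, rcond=None)
    return beta

def validation_r2(geno_v, pheno_v, beta):
    """Squared correlation between predicted and observed validation phenotypes."""
    pred = beta[0] + geno_v @ beta[1:]
    return float(np.corrcoef(pred, pheno_v)[0, 1] ** 2)

def one_replicate(rng, n_d=240, n_v=60, n_snps=5000, k=20):
    n = n_d + n_v
    freqs = rng.uniform(0.1, 0.9, size=n_snps)
    geno = rng.binomial(2, freqs, size=(n, n_snps)).astype(float)
    pheno = rng.standard_normal(n)  # assumed null phenotype
    disc, val = slice(0, n_d), slice(n_d, n)
    out = {}

    # (1) Validation cohort used for both SNP selection and effect estimation.
    snps = top_k_snps(geno, pheno, k)
    beta = estimate_effects(geno[:, snps], pheno)
    out["both"] = validation_r2(geno[val][:, snps], pheno[val], beta)

    # (2) SNPs selected in discovery only; effects estimated with validation included.
    snps = top_k_snps(geno[disc], pheno[disc], k)
    beta = estimate_effects(geno[:, snps], pheno)
    out["estimation_only"] = validation_r2(geno[val][:, snps], pheno[val], beta)

    # (3) Validation cohort excluded from both stages.
    beta = estimate_effects(geno[disc][:, snps], pheno[disc])
    out["clean"] = validation_r2(geno[val][:, snps], pheno[val], beta)
    return out

rng = np.random.default_rng(2)
reps = [one_replicate(rng) for _ in range(5)]
mean_r2 = {key: float(np.mean([rep[key] for rep in reps])) for key in reps[0]}
print(mean_r2)
```

Only design (3) gives an honest estimate of prediction accuracy. Design (1) leaks at both stages and shows the largest spurious R² in this sketch.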
