Author manuscript; available in PMC: 2015 Jan 15.
Published in final edited form as: Nat Genet. 2011 May 8;43(6):519–525. doi: 10.1038/ng.823

Genome-partitioning of genetic variation for complex traits using common SNPs

Jian Yang 1, Teri A Manolio 2, Louis R Pasquale 3, Eric Boerwinkle 4, Neil Caporaso 5, Julie M Cunningham 6, Mariza de Andrade 7, Bjarke Feenstra 8, Eleanor Feingold 9, M Geoffrey Hayes 10, William G Hill 11, Maria Teresa Landi 12, Alvaro Alonso 13, Guillaume Lettre 14, Peng Lin 15, Hua Ling 16, William Lowe 17, Rasika A Mathias 18, Mads Melbye 8, Elizabeth Pugh 16, Marilyn C Cornelis 19, Bruce S Weir 20, Michael E Goddard 21,22, Peter M Visscher 1
PMCID: PMC4295936  NIHMSID: NIHMS650238  PMID: 21552263

Abstract

Recently, we reported a method to estimate the proportion of phenotypic variance explained by all SNPs from genome-wide association studies, and estimated that half of the heritability for human height was captured by common SNPs. Here we partition genetic variation for height, body mass index (BMI), von Willebrand factor (vWF) and QT interval (QTi) onto chromosomes and chromosome segments, using 586,898 SNPs genotyped on 11,586 unrelated individuals. We estimate that ~45%, ~17%, ~25% and ~21% of variance in height, BMI, vWF and QTi, respectively, can be explained by considering all autosomal SNPs simultaneously, and a further ~0.5–1% by X-chromosome SNPs for height, BMI and vWF. We show that variance explained by each chromosome for height and QTi is proportional to the total gene length on that chromosome. In genome-wide analyses, common SNPs in or near genes explain more variation than SNPs between genes. We propose a novel approach to estimate variation due to cryptic relatedness and population stratification. Our results provide further evidence that a substantial proportion of heritability is accounted for by causal variants in linkage disequilibrium with common SNPs; that height, BMI and QTi are highly polygenic traits; and that the additive variation explained by a part of the genome is approximately proportional to the total length of DNA contained within genes therein.


Genome-wide association studies (GWAS) have led to the discovery of hundreds of marker loci that are associated with complex traits, including disease and quantitative phenotypes1, yet for most traits the associated variants cumulatively explain only a small fraction of total heritability2. GWAS have provided insight into biology via the discovery of pathways that were previously not known to be involved in the trait and the discovery of genes and pathways that are common to two or more complex traits3. As an experimental design, GWAS are hypothesis generating, and typically very stringent statistical thresholds are set to control false positive rates. This approach is at the expense of the false negative rate, i.e. failure to detect loci that are associated with the trait but whose effect sizes are too small to reach genome-wide statistical significance. In addition, GWAS typically employ common SNP markers. If ungenotyped causal variants have a lower allele frequency than the SNPs in GWAS, then they will be in low linkage disequilibrium (LD) with common SNPs and the effect estimated at the SNPs will be proportionally attenuated. That is, the proportion of heritability that can be captured with common SNPs depends on how well causal variants are tagged by these SNPs. For these reasons, the cumulative genetic variation accounted for by SNPs that reach genome-wide statistical significance is certain to be smaller than the total genetic variance.

An alternative to hypothesis testing is to focus on the estimation of the variance explained by all SNPs together. Recently we demonstrated how this may be done and estimated that ~45% of phenotypic variation for human height is accounted for by common SNPs from a sample of ~4000 Australians with ancestry in the British Isles4. In a separate study we partitioned additive variance for height onto chromosomes using within-family segregation, which captures the effects of all causal variants, and concluded that variance was explained in proportion to chromosome length5. Here we take these studies further, using a much larger sample of 11,586 unrelated European Americans and considering a range of traits. We partition additive genetic variation for height, body mass index (BMI), von Willebrand factor (vWF) and QT interval (QTi) onto the autosomes, the X-chromosome and genomic segments. vWF is a large adhesive glycoprotein that circulates in plasma and is essential in hemostasis, whereas QTi is an important electrocardiographic parameter related to ventricular arrhythmias and sudden death. We find that genetic variation explained by a genomic segment is proportional to the length of DNA contained within genes in that segment. We estimate the proportion of variation due to population structure and report empirical results for the X-chromosome that are consistent with full dosage compensation (X-inactivation) in females of genes that affect these traits.

RESULTS

Variance explained by all autosomal SNPs for height, BMI, vWF and QTi

We selected 14,347 individuals from three population-based GWAS, i.e. the Health Professionals Follow-up Study (HPFS), the Nurses’ Health Study (NHS) and the Atherosclerosis Risk in Communities (ARIC) study6–8, and estimated the genetic relationship matrix (GRM) of all the individuals using 565,040 autosomal SNPs that passed quality control (Online Methods). We excluded one of each pair of individuals with an estimated genetic relationship > 0.025 (i.e. more related than third- to fourth-cousins) and retained a subset of 11,586 unrelated individuals. The reason for excluding related pairs is to avoid the possibility that the phenotypic resemblance between close relatives could be due to non-genetic effects (e.g. shared environment) and causal variants not tagged by SNPs but captured by pedigree10,11. We then fitted the GRM in a mixed linear model (MLM) to estimate the proportion of variance explained by all the autosomal SNPs ($h_G^2$) for height, BMI, vWF and QTi in each cohort and the combined data where applicable (Online Methods, Table 1 and Supplementary Table 1). The traits vWF and QTi were available from the ARIC sample only. We show that 44.8% (s.e. = 2.9%) of phenotypic variance for height can be explained by all the autosomal SNPs, in line with an estimate of 44.5% (s.e. = 8.3%) in a similar analysis of an Australian cohort (3,925 unrelated individuals genotyped by 294,831 SNPs on Illumina arrays, in contrast to the Affymetrix arrays in the present study)4. We show for the first time that 16.5% (s.e. = 2.9%), 25.2% (s.e. = 5.1%) and 20.9% (s.e. = 5.0%) of the variance for BMI, vWF and QTi can be explained by all the autosomal SNPs, approximately 10-fold, 2-fold and 3-fold larger than the variance explained by all known validated loci found by GWAS for BMI12–15, vWF16 and QTi17, respectively.
We note that the ABO blood group locus on chromosome 9 is known to explain approximately 10% of phenotypic variation for vWF16, through modification of the amount of H antigen expression on the circulating vWF glycoprotein18,19. The estimate of $h_G^2$ for weight is 18.6% (s.e. = 2.8%). Because of the high phenotypic correlation between BMI and weight (r = 0.92), results for these two traits are very similar. We therefore report results for BMI in the following sections and, for completeness, give all results for weight in the supplementary online material.
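The relatedness filter described above can be sketched as follows. This is a minimal illustration, not the authors' software. It assumes the GRM is the average over SNPs of products of standardized allele counts (a common definition for this kind of analysis; the estimator actually used is described in the Online Methods), and it uses a simple greedy rule to drop one member of each pair above the 0.025 cutoff. The function names `grm` and `drop_related` and the toy matrix are hypothetical.

```python
import numpy as np

def grm(genotypes):
    """Genetic relationship matrix from an (individuals x SNPs) array of 0/1/2 allele counts.

    Assumed estimator: A_jk = mean over SNPs i of
    (x_ij - 2 p_i)(x_ik - 2 p_i) / (2 p_i (1 - p_i)), with p_i the sample allele frequency.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)  # monomorphic SNPs carry no information
    z = (x[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    return z @ z.T / keep.sum()

def drop_related(a, cutoff=0.025):
    """Greedily remove individuals until no off-diagonal entry exceeds the cutoff.

    Returns the indices of the individuals that are kept.
    """
    a = np.asarray(a, dtype=float)
    kept = list(range(a.shape[0]))
    while True:
        sub = a[np.ix_(kept, kept)].copy()
        np.fill_diagonal(sub, -np.inf)          # ignore self-relationships
        n_related = (sub > cutoff).sum(axis=1)  # relatives above cutoff, per person
        if n_related.max() == 0:
            return kept
        kept.pop(int(n_related.argmax()))       # drop the person with most relatives

# Toy relationship matrix: individuals 0 and 1 are close relatives.
toy = np.array([[1.00, 0.30, 0.00],
                [0.30, 1.00, 0.01],
                [0.00, 0.01, 1.00]])
print(drop_related(toy))
```

Removing the individual with the most relatives first tends to retain more people than removing pairs in arbitrary order, but which member of a pair is dropped is a design choice that the text does not specify.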

Table 1.

Estimates of the variance explained by all autosomal SNPs for height, BMI, vWF and QTi. The traits vWF and QTi were available in the ARIC cohort only.

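| Trait | n | $h_G^2$ (s.e.), no PC (a, c) | P value, no PC | $h_G^2$ (s.e.), 10 PCs (b, c) | P value, 10 PCs | Heritability (d) | GWAS (e) |
|---|---|---|---|---|---|---|---|
| Height | 11,576 | 0.448 (0.029) | $4.5 \times 10^{-69}$ | 0.419 (0.030) | $7.9 \times 10^{-48}$ | 80%–90% (ref. 41) | ~10% (ref. 26) |
| BMI | 11,558 | 0.165 (0.029) | $3.0 \times 10^{-10}$ | 0.159 (0.029) | $5.3 \times 10^{-9}$ | 42%–80% (refs. 30,31) | ~1.5% (ref. 15) |
| vWF | 6,641 | 0.252 (0.051) | $1.6 \times 10^{-7}$ | 0.254 (0.051) | $2.0 \times 10^{-7}$ | 66%–75% (refs. 42,43) | ~13% (ref. 16) |
| QTi | 6,567 | 0.209 (0.050) | $3.1 \times 10^{-6}$ | 0.168 (0.052) | $5.0 \times 10^{-4}$ | 37%–60% (refs. 44,45) | ~7% (ref. 17) |

(a) Without PC adjustment.
(b) Adjusted for the first 10 PCs from PCA.
(c) Estimate of the variance explained by all autosomal SNPs.
(d) Narrow-sense heritability estimate from family or twin studies in the literature.
(e) Variance explained by GWAS-associated loci from the literature.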

Genome-partitioning of genetic variation

Next, we estimated the GRM from the SNPs on each autosome and partitioned the total genetic variance onto individual chromosomes by fitting the GRMs of all the chromosomes simultaneously in a joint analysis (Online Methods). We observed a strong linear relationship between the estimate of variance explained by each chromosome ($h_C^2$) and chromosome length ($L_C$, in Mb units) for height ($P = 1.4 \times 10^{-6}$ and $R^2 = 0.695$) and QTi ($P = 1.1 \times 10^{-3}$ and $R^2 = 0.422$) (Fig. 1 and Supplementary Tables 2 and 3). We mapped SNPs to 17,787 genes according to positions on the UCSC Genome Browser hg18 assembly (http://genome.ucsc.edu)20, 17,652 of which had at least one SNP within ± 50 Kb of the 5′ and 3′ untranslated regions (UTRs). There was also a significant correlation between the estimate of $h_C^2$ and the number of genes on each chromosome ($N_{g(C)}$) for height and QTi (Supplementary Table 3). Since $L_C$ and $N_{g(C)}$ are correlated (r = 0.628), we performed a multiple regression analysis of the estimate of $h_C^2$ on $L_C$ and $N_{g(C)}$, and fitted models in which chromosome length was fitted after the number of genes and vice versa. When including both $L_C$ and $N_{g(C)}$ in the regression model, $N_{g(C)}$ was not significant and $L_C$ was still significant for height and QTi (Supplementary Table 3). The regression of the estimate of $h_C^2$ on either $L_C$ or $N_{g(C)}$ was not significant for BMI and vWF. These results are consistent with variance explained by each chromosome for height and QTi (but less so for BMI and vWF) being proportional to the fraction of the genome being considered. Although longer chromosomes harbour more genes that are implicated in abnormal growth or skeletal development, the relationship between variance explained for height and chromosome length remains significant after fitting the number of such genes (Supplementary Fig. 1).
We provide evidence that the linear relationship between the estimate of $h_C^2$ and $L_C$ cannot be attributed to the fact that longer chromosomes have more SNPs and thereby smaller sampling errors when estimating genetic relationships between individuals (Supplementary Note and Supplementary Figs. 2 and 3).

Figure 1.

Estimate of the variance explained by each chromosome for height, BMI, vWF and QTi by the joint analysis using unrelated individuals against chromosome length. The numbers in the circles/squares are the chromosome numbers.

However, genes vary greatly in size and when we considered the length of genes we observed that the estimate of $h_C^2$ for height and QTi was also proportional to the total length of genes on each chromosome ($L_{g(C)}$), where gene length is defined as the physical distance between the beginning and end of the UTRs (Supplementary Fig. 4). Since the correlation between $L_C$ and $L_{g(C)}$ is extremely high (r = 0.97), we were unable to discriminate by multiple regression whether $L_C$ or $L_{g(C)}$ is causative, i.e. the regression of $h_C^2$ on $L_C$ is not significant when fitted after $L_{g(C)}$ and vice versa (Supplementary Table 3). Therefore, a different analysis was required. We asked whether we could still observe a significant regression of $h_C^2$ on $L_{g(C)}$ when chromosome length was held constant. We implemented this by dividing the genome into segments of equal length (50 or 30 Mb) and then estimated the variance explained by each segment ($h_S^2$) in a joint analysis (Online Methods). We found that the regression of $h_S^2$ on the total gene length per segment ($L_{g(S)}$) remained significant for height, with $P = 1.7 \times 10^{-3}$ for 50 Mb segments and $P = 1.2 \times 10^{-4}$ for 30 Mb segments (Supplementary Fig. 5). The regressions of $h_S^2$ on the number of genes, the total length of exons and the number of exons on each segment were also significant in some cases, but none of them was significant when fitted after $L_{g(S)}$, whereas $L_{g(S)}$ was always significant when fitted after any of them (Supplementary Table 4). These results suggest that, at least for height, genomic regions explain variation in proportion to their genic content.

To quantify these effects genome-wide, we partitioned the variance explained by all the SNPs onto genic ($h_{Gg}^2$) and intergenic ($h_{Gi}^2$) regions of the whole genome (Online Methods). We defined the gene boundaries as ± 0 Kb, ± 20 Kb and ± 50 Kb of the 3′ and 5′ UTRs. A total of 213,509, 282,058, and 336,127 SNPs were located within the boundaries of 13,406, 17,277 and 17,652 protein-coding genes for the three definitions (± 0 Kb, ± 20 Kb and ± 50 Kb) respectively, which covered 35.8%, 49.4% and 58.7% of the genome. Some genes did not have any SNPs within them, especially if the most stringent definition of gene boundary (± 0 Kb) was used. We tested the estimates of $h_{Gg}^2$ and $h_{Gi}^2$ against the expected values from the genic and intergenic coverages by a goodness-of-fit test. We found strong evidence for height and vWF, and less so for BMI and QTi, that genic regions proportionally explain more variation than intergenic regions (legends of Fig. 2 and Supplementary Fig. 6). As an example we consider the case of genes ± 20 Kb where genic and intergenic coverages are roughly equal (49.4% vs. 50.6%). The estimates of $h_{Gg}^2$ vs. $h_{Gi}^2$ are 32.8% vs. 12.6% ($P = 2.1 \times 10^{-10}$) for height, 22.7% vs. 4.0% ($P = 5.1 \times 10^{-4}$) for vWF, 11.7% vs. 4.7% (P = 0.022) for BMI and 13.5% vs. 7.5% (P = 0.251) for QTi. We further partitioned the genetic variance onto the genic and intergenic regions of each chromosome (Online Methods). In general, the results agree with those of the whole-genome partitioning analysis, in that the genic regions proportionally explained more variation (Fig. 2 and Supplementary Fig. 6). The variance attributable to chromosome 9 for vWF is dominated by the genic regions, which is expected because the ABO gene on this chromosome explains ~10% of the variance for vWF16. However, there appear to be exceptions: for example, the intergenic regions of chromosome 2 and chromosome 5 seemed to be more important for BMI and QTi, respectively.
These results are not conclusive because the standard errors of the estimates are large. Despite these special cases, overall the results are consistent with causal variants being more likely to be located in the vicinity of functional genes.
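As a rough check on the ± 20 Kb example above, the genic share of the SNP-explained variance can be compared with the genic coverage of the genome. The sketch below only reproduces this arithmetic from the point estimates quoted in the text. It does not reproduce the goodness-of-fit test itself, which would also need the sampling variances of the estimates; the helper name `genic_share` is hypothetical.

```python
def genic_share(h2_genic, h2_intergenic):
    """Fraction of the SNP-explained variance attributed to genic SNPs."""
    return h2_genic / (h2_genic + h2_intergenic)

# Point estimates quoted in the text for genes +/- 20 Kb (% of phenotypic variance).
estimates = {
    "height": (32.8, 12.6),
    "vWF": (22.7, 4.0),
    "BMI": (11.7, 4.7),
    "QTi": (13.5, 7.5),
}
expected_share = 0.494  # genic coverage of the genome under this definition

for trait, (genic, intergenic) in estimates.items():
    share = genic_share(genic, intergenic)
    fold = share / expected_share
    print(f"{trait}: genic share {share:.2f}, {fold:.2f}-fold over coverage")
```

For all four traits the genic share exceeds the 49.4% expected from coverage alone, although the text notes that the excess is statistically well supported only for height and vWF.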

Figure 2.

Estimates of the variance explained by genic and intergenic regions on each chromosome for height by the joint analysis using 11,586 unrelated individuals. The genic region is defined as ± 0 Kb (a), ± 20 Kb (b) and ± 50 Kb (c) of the 3′ and 5′ UTRs. Error bars are the standard errors of the estimates. $h_{Gg}^2$ and $h_{Gi}^2$ are the variances explained by all the genic and intergenic SNPs across the whole genome. P(observed vs. expected): goodness-of-fit test of the estimated $h_{Gg}^2/h_G^2$ against that expected from the coverage of genic regions.

Variance attributed to population structure

Mixed linear model methods are useful to control for population structure in GWAS21,22. Population structure in the data causes correlations of SNPs on different chromosomes. Consequently, fitting only one chromosome in the model (separate analysis) also captures some of the variance due to other chromosomes, so that the estimate of variance explained by each chromosome from the separate analysis ($h_{C(sep)}^2$) is biased upwards (Online Methods). The joint analysis has the advantage of protecting against such inter-chromosomal correlations because the estimate of each $h_C^2$ is conditional on the other chromosomes in the model so that the estimates of variance explained by different chromosomes are independent of each other. We therefore can calculate the variance attributable to population structure by comparing the estimates of $h_{C(sep)}^2$ and $h_C^2$. The inter-chromosomal SNP correlations occur for two reasons: 1) cryptic relatedness (e.g. unexpected cousins) because closely related individuals will share SNPs identical by descent on more than one chromosome; 2) systematic difference in allele frequencies between subpopulations (population stratification). We modelled the variance attributed to these two forms of population structure as $h_{C(sep)}^2 - h_C^2 = b_0 + b_1 L_C + \varepsilon$, where the slope $b_1$ allows the possibility that longer chromosomes track population structure better than smaller chromosomes. To illustrate the effect of population structure, we also estimated $h_{C(sep)}^2$ and $h_C^2$ in the entire sample of 14,347 individuals (i.e. without removing close relatives). The intercept $b_0$ appears to be due to cryptic relatedness because when we eliminate relatives with a relationship > 0.025, $b_0$ declines to zero (Fig. 3). We therefore predicted that cryptic relatedness accounted for 1.5%, 0.084%, 0.22% and 0.065% (not significant) of the phenotypic variance for height, BMI, vWF and QTi, respectively, in the entire sample.
The variance attributed to cryptic relatedness is irrespective of chromosome length because it does not require many SNPs per chromosome to detect close relatives. Conversely, the regression slope $b_1$ appears to be due to population stratification because longer chromosomes are likely to have more ancestry informative markers (AIMs), assuming that the AIMs are randomly distributed across the genome. We then predicted that population stratification accounted for $6.9 \times 10^{-5} L_C$, $7.2 \times 10^{-6} L_C$, $-1.92 \times 10^{-6} L_C$ (not significantly different from zero) and $2.3 \times 10^{-5} L_C$ of variance for height, BMI, vWF and QTi, respectively, in the entire sample and a similar amount in the data set of unrelated individuals (Fig. 3). The difference between $h_{C(sep)}^2$ and $h_C^2$ represents the overall effect of all the other 21 chromosomes on one chromosome. Therefore, the proportion of variance attributed to population structure (cryptic relatedness and population stratification) across the whole genome is approximately equal to $22 b_0 / 21 + b_1 \sum_{C=1}^{22} L_C / 21$, which is (1.6% + 0.91%), (0.088% + 0.095%), (0.23% + 0.0%) and (0.068% + 0.30%) for height, BMI, vWF and QTi, respectively, in the entire sample. Hence, we provide a simple approach to estimate and partition the variance attributed to population structure for complex traits. The variances due to cryptic relatedness and population stratification depend on the data structure in the sample. Therefore, the estimates we present above are specific for the data in this study.
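The genome-wide totals quoted above follow from the regression intercept and slope by simple arithmetic, sketched below for height. The values of b0 and b1 are those reported in the text. The total autosomal length of 2,770 Mb is an assumption chosen to be roughly consistent with the reported totals; it is not a value given in the text, and the function name is hypothetical.

```python
def structure_variance(b0, b1, total_length_mb, n_chrom=22):
    """Genome-wide variance attributed to population structure.

    b0: intercept of (separate - joint) estimates on chromosome length,
        interpreted as cryptic relatedness
    b1: slope per Mb, interpreted as population stratification
    total_length_mb: sum of chromosome lengths L_C over the autosomes

    Each per-chromosome difference reflects the other n_chrom - 1
    chromosomes, hence the division by n_chrom - 1.
    """
    cryptic = b0 * n_chrom / (n_chrom - 1)
    stratification = b1 * total_length_mb / (n_chrom - 1)
    return cryptic, stratification

# Height, entire sample: b0 = 1.5% and b1 = 6.9e-5 per Mb (from the text).
# total_length_mb is an assumed value, not taken from the text.
cryptic, strat = structure_variance(b0=0.015, b1=6.9e-5, total_length_mb=2770.0)
print(f"cryptic relatedness: {cryptic:.2%}, stratification: {strat:.2%}")
```

With these inputs the two components come out close to the 1.6% and 0.91% reported for height; small differences for other traits would be expected from rounding of b0 and b1 and from the assumed total length.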

Figure 3.

Difference between the estimates of variance explained by each chromosome by the separate ($h_{C(sep)}^2$) and joint ($h_C^2$) analyses for height, BMI, vWF and QTi against chromosome length. All: using all the individuals in the entire sample. Unrelated: using unrelated individuals after excluding one of each pair of individuals with an estimate of genetic relationship > 0.025.

It is common to fit eigenvectors (PCs) from principal component analysis (PCA) in single SNP association studies to correct for possible population structure23,24. We show that fitting the first 10 PCs and one chromosome at a time or fitting all chromosomes simultaneously without fitting PCs, led to similar estimates of the variance explained by each chromosome (Supplementary Fig. 7), which suggests that the majority of variance attributed to population structure is well captured by the first 10 PCs in these data.

Estimation of variance explained by the X-chromosome

We estimated the GRM for the X-chromosome and parameterized it under three assumptions of dosage compensation10: 1) equal X-linked variance for males and females; 2) no dosage compensation i.e. both X-chromosomes are active for females; 3) full dosage compensation i.e. one of the X-chromosomes is completely inactive for females. We fitted the parameterized GRMs for the X-chromosome in an MLM whilst simultaneously estimating $h_G^2$ in the model to capture the genetic variation on the autosomes and variation due to possible population structure. For all the traits, the full-dosage compensation model fits the data best and the no-dosage compensation model is the worst with the equal-variance model in between (Supplementary Table 5). However, the differences in estimates are relatively small and none of them is statistically significant. Larger datasets will be required to distinguish such small differences. Under the assumption of full dosage compensation, the variance attributable to the X-chromosome for females was 0.61% (s.e. = 0.32%), 0.82% (s.e. = 0.35%), 0.57% (s.e. = 0.52%) and 0.0% (s.e. = 0.48%) for height, BMI, vWF and QTi, respectively. To verify that we detect heterogeneous variances on the X-chromosome rather than autosomal variance differences between males and females, we fitted the same dosage compensation models for the autosomes. As expected, the equal variance model fitted the data best and the full dosage compensation model was the worst fit for all the traits (Supplementary Table 6). Therefore, the data are consistent with twice as much additive genetic variation for height, BMI, and vWF on the X-chromosome in males as in females, which is predicted from theory under the assumption of random X-inactivation25.
While there are syndromic examples illustrating the phenotypic effect of the Lyon hypothesis (e.g., Turner’s syndrome and Klinefelter syndrome), to our knowledge, this is the first empirical evidence from genotype-phenotype associations on complex traits that the amount of genetic variation on the X-chromosome appears consistent with X-chromosome inactivation. However, the evidence is indirect and not overwhelming. Larger sample sizes and the detection of multiple associated loci on the X-chromosome will be necessary to investigate the expression of genes on the X-chromosome that affect the traits studied.

Comparison with variants known to be associated with BMI, vWF and height

To quantify the effect of known associated variants on the results, we included the FTO SNP rs9939609 on chromosome 16 for BMI and the ABO SNP rs612169 on chromosome 9 for vWF as a covariate when estimating $h_C^2$ by the joint analysis of all autosomes. FTO was the first locus to be detected through GWAS that is associated with BMI14 and ABO is a major determinant of vWF19. When compared to the result without adjustment, the estimate of variance due to chromosome 16 ($h_{16}^2$) for BMI decreased from 1.19% to 0.61%, in line with an estimate of ~0.34% to ~1% of variance explained by the FTO locus for BMI in previous GWAS12,14,15 and an estimate of ~0.46% from association analysis in the present study; the estimate of $h_9^2$ for vWF decreased by 11.8%, consistent with an estimate of ~10% of variance for vWF explained by the ABO locus in GWAS16; and the estimates for the other chromosomes remained the same (Supplementary Fig. 8).

The meta-analysis of ~133,000 individuals by the GIANT consortium has identified 180 independent loci associated with genetic variation of height26. The estimate of $h_C^2$ by joint analysis in our study shows a high correlation (r = 0.715 and $P = 1.8 \times 10^{-4}$) with the sum of the variance explained at the associated loci on each chromosome from the GIANT meta-analysis (Fig. 4).

Figure 4.

The sum of variance explained by the GWAS associated SNPs on each chromosome in the GIANT meta-analysis (MA) of height26 against the estimate of variance explained by each chromosome for height by the joint analysis using the combined data of 11,586 unrelated individuals in the present study. The variance explained by GWAS loci in the GIANT MA was calculated based on the result of its replication study.

Additional models

We fitted a number of other models to quantify the effect of having multiple phenotypic observations per individual, to test for genotype-sex interaction effects and for the effect of sample ascertainment. We also estimated the genetic correlation between height and weight.

As both height and BMI have repeated measures in the ARIC cohort and BMI also has repeated measures in the HPFS and NHS cohorts, we fitted a repeatability model for these repeated records, assuming that the genetic correlation between repeated observations is unity. We estimated the repeatability of height as ~0.99 in the ARIC study and the repeatabilities of weight and BMI as > 0.93 in all cohorts (Supplementary Table 7). The estimates of $h_G^2$ by repeatability model analyses are similar to those using the mean of the repeated measures as in all other analyses (Table 1 and Supplementary Table 7). The ratio of an estimate of $h_G^2$ based upon m repeated records to that of a single observation is 1/[(1 − ρ)/m + ρ] with ρ the repeatability27. In this study, ρ > 0.93 and m is small, so that our inference based upon the mean of 3 or 4 observations is very similar to that from a single observation.
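The size of this effect can be checked directly from the ratio given above. The short sketch below evaluates 1/[(1 − ρ)/m + ρ] at the lower bound ρ = 0.93 for one, three and four records; the function name is hypothetical.

```python
def ratio_mean_vs_single(rho, m):
    """Ratio of the SNP-explained proportion for the mean of m repeated
    records to that for a single record, given repeatability rho."""
    return 1.0 / ((1.0 - rho) / m + rho)

for m in (1, 3, 4):
    print(m, round(ratio_mean_vs_single(0.93, m), 3))
```

Even at ρ = 0.93 the ratio stays within a few percent of 1 for m = 3 or 4, which is why averaging the repeated measures changes the estimates so little.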

We fitted a genotype-sex interaction effect in the model, and did not find a significant interaction effect for any of the traits (Supplementary Table 8). We also estimated the variance explained by all the autosomal SNPs in each gender group in each cohort separately, and did not observe clear evidence of genetic heterogeneity between males and females (Supplementary Table 9). There are possible exceptions: for example, the P value for a test of heterogeneity between males and females for vWF was 0.007 (0.024 after PC adjustment), which may indicate a real genetic difference between males and females for this trait. However, when multiple testing is taken into account, the test was not significant after a Bonferroni correction. For weight and BMI, the P values only verge on significance even when $h_G^2$ is estimated to be zero in males, suggesting a lack of power to detect an interaction.

The HPFS and NHS cohorts were ascertained by type 2 diabetes (T2D), which might affect the estimate of genetic variance because BMI is a risk factor for T2D28. We performed analyses of the variance explained by all the autosomal SNPs separately in T2D cases and controls of the HPFS and NHS cohorts, and did not observe significant differences in estimated genetic variance between cases and controls for any of the three traits (Supplementary Table 10). Moreover, we estimated $h_G^2$ in the combined dataset excluding the T2D cases, and the estimate was not different from that using all samples (Supplementary Table 11). These two additional analyses show little, if any, impact of the ascertainment of T2D cases on the estimate of variance explained by the SNPs.

All analyses we have performed are univariate, i.e. a single phenotype at a time. However, multivariate models fit easily in the same analysis framework and the only limitation is computational. As an example, we approximated a full bivariate analysis of height and weight by using logarithms of the phenotypes, exploiting the relationship log(BMI) = log(Weight) − 2log(Height), so that from three univariate analyses we can estimate the genetic correlation between log(Weight) and log(Height). We estimated a genetic correlation of 0.45 (s.e. = 0.17) between log(Weight) and log(Height) (Supplementary Table 12). Although this is on the logarithm scale, the estimate on the observed scale is unlikely to be very different. This estimate indicates that for genetic variation tagged by common SNPs there is a substantial overlap in genome-wide additive factors for height and weight.
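The step from three univariate analyses to a genetic correlation can be made explicit. Because log(BMI) = log(Weight) − 2log(Height), the genetic variance of log(BMI) equals the genetic variance of log(Weight) plus four times that of log(Height) minus four times their genetic covariance, so the covariance can be solved for. The sketch below assumes the inputs are genetic variances on the log scale (not proportions of phenotypic variance); the function name and the example numbers are hypothetical and are not estimates from this study.

```python
import math

def genetic_correlation(vg_log_weight, vg_log_height, vg_log_bmi):
    """Genetic correlation between log(Weight) and log(Height).

    Uses Vg(logBMI) = Vg(logW) + 4 Vg(logH) - 4 Cov_g(logW, logH),
    which follows from log(BMI) = log(W) - 2 log(H).
    """
    cov = (vg_log_weight + 4.0 * vg_log_height - vg_log_bmi) / 4.0
    return cov / math.sqrt(vg_log_weight * vg_log_height)

# Hypothetical genetic variances, chosen so that the true correlation is 0.5.
print(genetic_correlation(vg_log_weight=4.0, vg_log_height=1.0, vg_log_bmi=4.0))
```

The standard error of such an estimate depends on the sampling covariances among the three variance estimates, which this sketch does not attempt to model.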

DISCUSSION

In this study we estimate that ~45%, ~17%, ~25% and ~21% of phenotypic variation for height, BMI, vWF and QTi, respectively, is tagged by common SNPs and we partition this variation onto autosomes, chromosome segments and the X-chromosome. We find that chromosome segments explain variation in approximate proportion to the total length of genes contained therein. While this suggests that there are very many polymorphisms affecting these traits, the linear relationship between the estimate of variance explained and genomic length is not perfect, especially for BMI and vWF. Chromosomes with similar (genic) lengths can explain different amounts of variation (Fig. 1 and Supplementary Fig. 4) and the estimates of variance explained by genomic segments with equal length also show large variability (Supplementary Fig. 5), suggesting some granularity in distribution of causal variants. The genetic architecture of vWF is distinct from the other traits we analysed, because a large proportion of variance is explained by a common SNP in a single gene (ABO). We show that the variance attributed to a single major gene can be captured by all the SNPs on that chromosome or the whole genome, demonstrating that our whole-genome and chromosome estimation approach is independent of the distribution of effect sizes. Our results provide further evidence for the highly polygenic nature of complex trait variation and that a substantial proportion of genetic variation is tagged by common SNPs4,29. These results have implications for the experimental design to detect additional variation and are informative with respect to the nature of complex trait variation.

Of the four traits studied, the largest proportion of phenotypic variance explained by the SNPs is for height and the smallest is for BMI. Why are the results for height and BMI so different? Heritability of height is approximately 80% and we estimate that more than half of this variation (45/80 = 0.56) is tagged by common SNPs. Estimates of the narrow sense heritability of BMI appear to be more variable, ranging from 42% to 58% when estimated from the correlation of full brothers and fathers and sons30 to 60–80% from twin studies31. Nevertheless, even if we assume that the narrow sense heritability for BMI is 50% then only 17/50 = 0.34 of additive genetic variation is explained by common SNPs. Given these assumptions and the standard errors in Table 1, the standard error of the difference in the proportion of genetic variance explained for height and BMI is approximately 0.07, so that the observed difference of 0.22 appears statistically significant. These results are consistent with the proportion of phenotypic variation for height and BMI explained by genome-wide significant SNPs, in that for height about 10% of phenotypic variance is explained yet for BMI less than 2%15,26, despite similar and large experimental sample sizes. These results imply that causal variants for BMI are in less LD with common SNPs than causal variants for height, possibly because, on average, causal variants for BMI have lower minor allele frequency than causal variants for height. Both observations from GWAS and our analyses are consistent with the allelic architecture for BMI being different from that for height. Different evolutionary pressures on obesity (or leanness) and height could account for such differences because natural selection will result in low frequencies of alleles that are correlated with fitness32. However, we do not provide direct evidence to support this hypothesis.
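The comparison of height and BMI above can be retraced from the Table 1 estimates. The sketch below treats the assumed heritabilities (80% and 50%) as known constants and, as a simplifying assumption, treats the two SNP-based estimates as independent even though they come from largely the same individuals. The function name is hypothetical.

```python
import math

def share_of_heritability(h2_snp, se_snp, h2_total):
    """Proportion of heritability tagged by SNPs and its s.e.,
    treating the total heritability as a known constant."""
    return h2_snp / h2_total, se_snp / h2_total

# SNP-based estimates and s.e. from Table 1; heritabilities as assumed in the text.
height, se_height = share_of_heritability(0.448, 0.029, 0.80)
bmi, se_bmi = share_of_heritability(0.165, 0.029, 0.50)

diff = height - bmi
se_diff = math.sqrt(se_height ** 2 + se_bmi ** 2)  # assumes independent estimates
print(f"difference = {diff:.2f}, s.e. = {se_diff:.2f}, ratio = {diff / se_diff:.1f}")
```

Using the unrounded Table 1 values gives a difference slightly above the 0.22 quoted in the text, which is based on the rounded proportions 0.56 and 0.34; the standard error is about 0.07 either way.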

If genetic variation is a function of the length of a chromosome segment occupied by genes then this implies that causal variants are more likely to occur in the vicinity of the genes than in intergenic regions. These causal variants could either change the protein structure or regulate the expression of the gene in cis. However, regulatory elements sometimes occur a long distance away from the gene they regulate and our results show that SNPs situated > 50 Kb from any gene still explain some of the variance although less than SNPs nearer to a gene. These results are consistent with analyses of published genome-wide significant SNPs for complex traits, in that a substantial proportion is found in intergenic regions1.

GWAS for height, BMI, vWF and QTi to date have identified individual genetic variants that cumulatively explain about 10%, 1.5%, 13% and 7% of phenotypic variation, respectively15–17,26. In contrast, we show that 45%, 17%, 25% and 21% of the variance is explained by common SNPs (Table 1). The difference between these two sets of figures is due to SNPs that are associated with the traits but do not reach genome-wide significance. The proportion of variance explained by all the SNPs is less than the heritability because of incomplete LD between the causal polymorphisms and the SNPs. Therefore, experiments to find SNPs that pass the genome-wide significance threshold can focus on the proportion of variation that is tagged by common SNPs by increasing sample size, or focus on the proportion of variation that is not tagged, for example, by considering less common variants. The former approach has been pursued successfully by the GIANT consortium, which reported that 10% and 1.5% of variation for height and BMI, respectively, can be accounted for by common SNPs using sample sizes of more than 100,000 individuals15,26. The latter will be facilitated by the 1000 Genomes Project33 and independently by efforts to sequence exomes and whole genomes. Experimental designs to discover causal variants that are in LD with common SNPs and those that interrogate less common or rare variants are complementary, and recent publications suggesting that all or most variation for disease is to be found in less common or rare (coding) variants34,35 are not consistent with empirical data, at least for a range of complex traits, including height, BMI, lipids and schizophrenia15,26,29,36. For those causal variants that are rare in the population (say, with a frequency of less than 1%), an important but unanswered question is whether their effect sizes are large enough to be detected through conventional association analysis.
The power of detection for a rare variant is proportional to the product of its frequency (which is small) and the square of its effect size. Hence rare variants will be detected only if their effect sizes are large enough given their low frequency. Our results imply that there are many chromosomal regions that contain causal variants and so most must explain a small proportion of total variance. Such small contributions can be due to loci with very low MAF and large effect sizes but our ability to detect them by association is limited by the amount of variance explained.
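The dependence on frequency and effect size can be made explicit with a standard single-locus result, given here as a sketch rather than as part of the analyses reported in this paper. It assumes an additive biallelic variant in Hardy-Weinberg equilibrium; the symbols $p$ (allele frequency), $a$ (per-allele effect), $n$ (sample size), $q^2$ (proportion of variance explained) and $\lambda$ (non-centrality parameter of the 1-d.f. association test) are introduced for illustration only.

```latex
\begin{align}
q^2 &= \frac{2p(1-p)\,a^2}{\sigma_P^2} \;\approx\; \frac{2p\,a^2}{\sigma_P^2}
  \quad (p \ll 1), \\
\lambda &= \frac{n\,q^2}{1-q^2} \;\approx\; n\,q^2
  \quad (q^2 \ll 1).
\end{align}
```

Under these assumptions, a variant with $p = 0.005$ would need a per-allele effect roughly seven times that of a variant with $p = 0.5$ to explain the same variance, because $2p(1-p)$ is about 50 times smaller.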

Genome-partitioning methods such as those applied here help us further understand the genetic architecture of complex traits. With ever larger sample sizes, the methods we have used and those that are based upon traditional GWAS analyses will converge in inference, in that we will be able to partition variation to individual loci.

ONLINE METHODS

GWAS samples and quality control

Details of the HPFS, NHS and ARIC cohorts have been described previously6–8. The GWAS data in terms of study design, sample selection and genotyping have been detailed by Qi et al.37 for the HPFS and NHS cohorts and by Psaty et al.8 for the ARIC cohort. All three cohorts have been studied as part of the GENEVA (The Gene, Environment Association Studies) project38 and this study has benefitted from using data from this consortium that have been generated and cleaned using a common protocol. We selected 6,293 individuals (2,745 T2D cases and 3,148 controls) from the NHS and HPFS cohorts and 15,792 individuals from the ARIC cohort. All of these selected individuals were genotyped using the Affymetrix Genome-Wide Human 6.0 array.

Of the 909,622 SNP probes, 874,517 (HPFS), 879,071 (NHS) and 841,820 (ARIC) passed quality control analysis by the Broad Institute and the GENEVA Coordinating Center (excluding SNPs with missing call rate ≥ 5% or plate association P < 1×10⁻¹⁰)9. We further excluded SNPs with missing rate ≥ 2%, > 1 discordance in the duplicated samples, Hardy-Weinberg equilibrium P < 1×10⁻³ or minor allele frequency < 0.01. A total of 687,398 (27,578), 665,163 (24,108) and 593,521 (23,664) autosomal (X-chromosome) SNPs were retained for the HPFS, NHS and ARIC cohorts, respectively, 565,040 (21,858) of which were in common across the three cohorts.
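The second-stage SNP filter and the intersection across cohorts can be sketched as follows. This is a minimal illustration and not the pipeline actually used: the record type `SnpQc`, its field names and the toy values are hypothetical, and only the four thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP QC summary for one cohort (hypothetical field names)."""
    snp_id: str
    missing_rate: float   # proportion of missing genotype calls
    n_discordant: int     # discordant calls among duplicated samples
    hwe_p: float          # Hardy-Weinberg equilibrium test P value
    maf: float            # minor allele frequency


def passes_qc(s: SnpQc) -> bool:
    """Keep a SNP unless it meets any of the four exclusion criteria."""
    return (
        s.missing_rate < 0.02      # exclude missing rate >= 2%
        and s.n_discordant <= 1    # exclude > 1 discordance
        and s.hwe_p >= 1e-3        # exclude HWE P < 1e-3
        and s.maf >= 0.01          # exclude MAF < 0.01
    )


def shared_snps(*cohorts):
    """Return the IDs of SNPs that pass QC in every cohort."""
    kept = [{s.snp_id for s in cohort if passes_qc(s)} for cohort in cohorts]
    return set.intersection(*kept)


# Toy data: rs2 fails on missingness in cohort A, rs3 fails on MAF in cohort B.
cohort_a = [
    SnpQc("rs1", 0.001, 0, 0.40, 0.25),
    SnpQc("rs2", 0.030, 0, 0.50, 0.10),
    SnpQc("rs3", 0.000, 0, 0.20, 0.05),
]
cohort_b = [
    SnpQc("rs1", 0.005, 1, 0.20, 0.24),
    SnpQc("rs3", 0.000, 0, 0.90, 0.005),
]
common = shared_snps(cohort_a, cohort_b)
```

Filtering within each cohort before intersecting mirrors the order described above, in which per-cohort counts are reported first and the shared set second.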

We included only one of each set of duplicated samples and one of each pair of samples which were identified as full siblings by an initial scan of relatedness in PLINK39. We investigated population structure by PCA of all the autosomal SNPs that passed QC, and included only samples of European ancestry (Supplementary Fig. 9). We excluded samples with gender misidentification by examining the mean of the intensities of SNP probes on the X and Y chromosomes. We also excluded samples with missing call rate ≥ 2% and samples on two plates which showed extremely high levels of mean inbreeding coefficients. A total of 2,400 (HPFS), 3,265 (NHS) and 8,682 (ARIC) samples were retained for analysis, giving a combined set of 14,347 samples.

Phenotypes

Summary statistics of the phenotypes of height, weight, BMI, vWF and QTi are shown in Supplementary Table 13. There are three measures of weight and a single measure of height in both HPFS and NHS cohorts, four measures of weight and three measures of height in the ARIC cohort, and single measures of vWF and QTi in the ARIC cohort. For height, weight and BMI, we used the mean of repeated measures in all the analyses except for the analysis of the repeatability model. We adjusted the phenotypes (or the mean phenotype) for age and standardized it to a z-score in each gender group in each of the three cohorts separately.
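The adjustment and standardization step might look like the sketch below. The text does not say how the age adjustment was done, so a simple linear regression on age within each cohort-by-gender group is assumed here; the function name, the record keys and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def age_adjusted_z(records):
    """Return one z-score per record, computed within (cohort, sex) groups.

    records: list of dicts with keys "cohort", "sex", "age", "value",
    where "value" is the phenotype (or the mean of repeated measures).
    Assumes a linear age effect and at least two records per group.
    """
    groups = defaultdict(list)
    for idx, r in enumerate(records):
        groups[(r["cohort"], r["sex"])].append(idx)

    z = [None] * len(records)
    for idxs in groups.values():
        ages = [records[i]["age"] for i in idxs]
        vals = [records[i]["value"] for i in idxs]
        a_bar, v_bar = mean(ages), mean(vals)
        sxx = sum((a - a_bar) ** 2 for a in ages)
        sxy = sum((a - a_bar) * (v - v_bar) for a, v in zip(ages, vals))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Residuals from the within-group regression of phenotype on age.
        resid = [v - v_bar - slope * (a - a_bar) for a, v in zip(ages, vals)]
        sd = stdev(resid)
        for i, e in zip(idxs, resid):
            z[i] = e / sd
    return z


# Toy records for two cohort-by-gender groups.
records = [
    {"cohort": "ARIC", "sex": "F", "age": 50, "value": 160.0},
    {"cohort": "ARIC", "sex": "F", "age": 55, "value": 162.0},
    {"cohort": "ARIC", "sex": "F", "age": 60, "value": 158.0},
    {"cohort": "ARIC", "sex": "F", "age": 65, "value": 157.0},
    {"cohort": "NHS", "sex": "F", "age": 45, "value": 165.0},
    {"cohort": "NHS", "sex": "F", "age": 50, "value": 161.0},
    {"cohort": "NHS", "sex": "F", "age": 58, "value": 163.0},
]
z_scores = age_adjusted_z(records)
```

Standardizing within each group removes mean and variance differences between cohorts and genders before the data are pooled.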

Statistical analysis

We estimated the GRM of all individuals in the combined data from all the autosomal SNPs using the method we recently developed4,10, and excluded one of each pair of individuals with an estimated genetic relationship > 0.025. We then estimated the variance explained by all autosomal SNPs by restricted maximum likelihood analysis of an MLM $y = X\beta + g_G + \varepsilon$, where $y$ is a vector of phenotypes, $\beta$ is a vector of fixed effects (e.g. the first 10 PCs) with incidence matrix $X$, $g_G$ is a vector of aggregate effects of all autosomal SNPs with $\operatorname{var}(g_G) = A_G\sigma_G^2$, and $A_G$ is the GRM estimated from all autosomal SNPs. The proportion of variance explained by all autosomal SNPs is defined as $h_G^2 = \sigma_G^2/\sigma_P^2$, with $\sigma_P^2$ being the phenotypic variance.
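The GRM estimation and the relatedness cut-off can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the off-diagonal estimator is assumed to be the average over SNPs of standardized genotype products, the same expression is used on the diagonal for simplicity (which may differ from the published estimator), and the greedy pruning order is a hypothetical choice. Function names and toy data are likewise illustrative.

```python
def grm(genotypes, freqs):
    """Genetic relationship matrix from autosomal SNP genotypes.

    genotypes[j][i]: reference allele count (0, 1 or 2) of individual j at SNP i.
    freqs[i]: reference allele frequency of SNP i (must be strictly in (0, 1)).
    """
    n_ind, n_snp = len(genotypes), len(freqs)
    a = [[0.0] * n_ind for _ in range(n_ind)]
    for j in range(n_ind):
        for k in range(j, n_ind):
            total = 0.0
            for i, p in enumerate(freqs):
                total += ((genotypes[j][i] - 2 * p)
                          * (genotypes[k][i] - 2 * p)
                          / (2 * p * (1 - p)))
            a[j][k] = a[k][j] = total / n_snp  # average over SNPs
    return a


def prune_related(a, cutoff=0.025):
    """Greedily keep individuals so that no kept pair exceeds the cutoff."""
    keep = []
    for j in range(len(a)):
        if all(a[j][k] <= cutoff for k in keep):
            keep.append(j)
    return keep


# Toy data: individuals 0 and 1 carry identical genotypes, individual 2 differs.
genotypes = [
    [2, 0, 1, 2],
    [2, 0, 1, 2],
    [0, 2, 1, 0],
]
freqs = [0.5, 0.5, 0.5, 0.5]
a_g = grm(genotypes, freqs)
unrelated = prune_related(a_g)
```

With real data the allele frequencies would be estimated from the sample and the double loop replaced by matrix operations; the nested loops are kept here only to make the formula visible.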

Furthermore, we estimated the GRM from the SNPs on each chromosome ($A_C$) and estimated the variance attributable to each chromosome by fitting the GRMs of all the chromosomes simultaneously in the model $y = X\beta + \sum_{C=1}^{22} g_C + \varepsilon$, where $g_C$ is a vector of genetic effects attributable to each chromosome and $\operatorname{var}(g_C) = A_C\sigma_C^2$ (joint analysis). The proportion of variance explained by each chromosome is defined as $h_C^2 = \sigma_C^2/\sigma_P^2$. We also fitted one chromosome at a time in the model $y = X\beta + g_C + \varepsilon$ (separate analysis). If there is an effect of population structure, SNPs on one chromosome will be correlated with the SNPs on the other chromosomes such that $h_C^2(\mathrm{sep})$ will be overestimated in the separate analysis.
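Written out, the joint and separate models imply the following phenotypic covariance structures. This is a sketch under the usual assumptions, not stated explicitly in the text, that the chromosome effects are mutually independent and that residuals are independent with variance $\sigma_\varepsilon^2$; $I$ denotes the identity matrix and $\sigma_{C(\mathrm{sep})}^2$ is a label introduced here for the variance component of the separate analysis.

```latex
\begin{align}
\operatorname{var}(y) &= \sum_{C=1}^{22} A_C\,\sigma_C^2 + I\,\sigma_\varepsilon^2
  && \text{(joint analysis)}, \\
\operatorname{var}(y) &= A_C\,\sigma_{C(\mathrm{sep})}^2 + I\,\sigma_\varepsilon^2
  && \text{(separate analysis, one chromosome at a time)}.
\end{align}
```

In the separate model the other 21 chromosomes have no term of their own, so any covariance between $A_C$ and their GRMs, as arises under population structure, can only be attributed to $\sigma_{C(\mathrm{sep})}^2$ or to the residual.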

We extended the joint analysis of chromosomes to that of genomic segments. We divided the genome evenly into $N_S$ segments, each of $d_S$ Mb in length, and then estimated the GRM using the SNPs on each segment. We estimated the variance explained by each segment ($h_S^2$) by fitting the GRMs of all the segments in an MLM $y = X\beta + \sum_{S=1}^{N_S} g_S + \varepsilon$, where $g_S$ is a vector of genetic effects attributable to each segment.
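Assigning SNPs to segments can be sketched as below. The text does not say how chromosome ends are handled, so this illustration assumes that segments are consecutive windows within each chromosome and never span two chromosomes; the function name, the tuple layout and the 30 Mb window are arbitrary choices for the example.

```python
from collections import defaultdict


def segment_snps(snps, seg_mb):
    """Group SNP IDs into consecutive windows of seg_mb megabases.

    snps: iterable of (snp_id, chrom, pos_bp) tuples.
    Returns a dict mapping (chrom, window_index) to a list of SNP IDs;
    one GRM would then be estimated from each list.
    """
    seg_bp = int(seg_mb * 1_000_000)
    segments = defaultdict(list)
    for snp_id, chrom, pos_bp in snps:
        segments[(chrom, pos_bp // seg_bp)].append(snp_id)
    return dict(segments)


# Toy data: rs2 and rs3 fall on either side of a window boundary.
snps = [
    ("rs1", 1, 1_200_000),
    ("rs2", 1, 29_999_999),
    ("rs3", 1, 30_000_000),
    ("rs4", 2, 5_000_000),
]
segs = segment_snps(snps, seg_mb=30)
n_segments = len(segs)
```

Only windows that contain at least one SNP appear in the result, so the number of fitted variance components equals the number of non-empty windows.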

We further partitioned the variance explained by all the SNPs onto genic and intergenic regions of the whole genome ($h_{Gg}^2$ and $h_{Gi}^2$) as well as that of each chromosome ($h_{Cg}^2$ and $h_{Ci}^2$). The gene boundaries were defined as $\pm d_g$ kb away from the 3′ and 5′ UTRs. We estimated $h_{Gg}^2$ and $h_{Gi}^2$ by fitting all the genic and intergenic SNPs in an MLM $y = X\beta + g_{Gg} + g_{Gi} + \varepsilon$, and estimated $h_{Cg}^2$ and $h_{Ci}^2$ by fitting the genic and intergenic SNPs on individual chromosomes in the model $y = X\beta + \sum_{C=1}^{22} g_{Cg} + \sum_{C=1}^{22} g_{Ci} + \varepsilon$.
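The genic versus intergenic classification can be sketched as follows. The gene coordinates, function names and toy positions are hypothetical; a SNP is treated as genic when it lies within $d_g$ kb of any gene's UTR-to-UTR interval on the same chromosome, which is one reading of the boundary definition above.

```python
def is_genic(pos_bp, genes, d_g_kb):
    """True if pos_bp lies within d_g_kb kilobases of any gene interval.

    genes: list of (start_bp, end_bp) pairs on one chromosome, UTR to UTR.
    """
    pad = int(d_g_kb * 1000)
    return any(start - pad <= pos_bp <= end + pad for start, end in genes)


def split_genic(snps, genes_by_chrom, d_g_kb):
    """Split SNP IDs into (genic, intergenic) lists.

    snps: iterable of (snp_id, chrom, pos_bp) tuples.
    genes_by_chrom: dict mapping chromosome to its list of gene intervals.
    """
    genic, intergenic = [], []
    for snp_id, chrom, pos_bp in snps:
        if is_genic(pos_bp, genes_by_chrom.get(chrom, []), d_g_kb):
            genic.append(snp_id)
        else:
            intergenic.append(snp_id)
    return genic, intergenic


# Toy data: one gene on chromosome 1 spanning 100-150 kb.
genes_by_chrom = {1: [(100_000, 150_000)]}
snps = [
    ("rs1", 1, 85_000),    # 15 kb upstream of the gene
    ("rs2", 1, 175_000),   # 25 kb downstream of the gene
    ("rs3", 2, 120_000),   # chromosome without annotated genes
]
genic_20, intergenic_20 = split_genic(snps, genes_by_chrom, d_g_kb=20)
genic_50, intergenic_50 = split_genic(snps, genes_by_chrom, d_g_kb=50)
```

Each list would then supply the SNPs for its own GRM, so widening $d_g$ moves SNPs from the intergenic component to the genic one.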

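We estimated the variance attributable to the X chromosome using the method we recently developed10. In brief, we estimated the GRM for the X-chromosome ($A_X$) using the following equations:

  • $\hat{A}_{jk}^{M} = \sum_{i}^{N} \frac{(x_{ij}^{M} - p_i)(x_{ik}^{M} - p_i)}{p_i(1 - p_i)}$ for a male-male pair,

  • $\hat{A}_{jk}^{F} = \sum_{i}^{N} \frac{(x_{ij}^{F} - 2p_i)(x_{ik}^{F} - 2p_i)}{2p_i(1 - p_i)}$ for a female-female pair, and

  • $\hat{A}_{jk}^{MF} = \sum_{i}^{N} \frac{(x_{ij}^{M} - p_i)(x_{ik}^{F} - 2p_i)}{\sqrt{2}\,p_i(1 - p_i)}$ for a male-female pair,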

where $x_{ij}^{M}$ and $x_{ij}^{F}$ are the number of copies of the reference allele for an X-chromosome SNP for a male and a female, respectively, $p_i$ is the frequency of the reference allele and $N$ is the number of SNPs. Assuming the male-female genetic correlation to be 1, the X-linked phenotypic covariance is $\operatorname{cov}_X(y_j^M, y_k^M) = A_{jk}^{M}\sigma_{X(M)}^2$ for a male-male pair, $\operatorname{cov}_X(y_j^F, y_k^F) = A_{jk}^{F}\sigma_{X(F)}^2$ for a female-female pair, or $\operatorname{cov}_X(y_j^M, y_k^F) = A_{jk}^{MF}\sigma_{X(M)}\sigma_{X(F)}$ for a male-female pair25,40, where $\sigma_{X(M)}^2$ and $\sigma_{X(F)}^2$ are X-linked genetic variances for males and females, respectively. Assumptions about inactivity of the X-chromosome (dosage compensation) impose a relationship between $\sigma_{X(M)}^2$ and $\sigma_{X(F)}^2$, which allows a single variance component $\sigma_{X(F)}^2$ to account for the X-linked genetic variance for both sexes. Therefore, we can express the X-linked phenotypic covariances as $\operatorname{cov}_X(y_j^M, y_k^M) = d^2 A_{jk}^{M}\sigma_{X(F)}^2$, $\operatorname{cov}_X(y_j^F, y_k^F) = A_{jk}^{F}\sigma_{X(F)}^2$, and $\operatorname{cov}_X(y_j^M, y_k^F) = d A_{jk}^{MF}\sigma_{X(F)}^2$, where $d$ is the lyonization coefficient, defined by $\sigma_{X(M)} = d\,\sigma_{X(F)}$, which takes the value 1 under the hypothesis of equal X-linked genetic variance for both sexes, $\sqrt{1/2}$ under the hypothesis of no dosage compensation (both X-chromosomes are active for females) and $\sqrt{2}$ under the hypothesis of full dosage compensation (complete inactivity of one X-chromosome for females) (Supplementary Note). In the MLM analysis, we took the lyonization coefficient into account by parameterizing the raw $A_X$ matrix, i.e. $A_X^P = d^2 A_X$ for male pairs, $A_X^P = A_X$ for female pairs and $A_X^P = d A_X$ for male-female pairs. We estimated $\sigma_{X(F)}^2$ under the three hypotheses by fitting the parameterized GRM for the X-chromosome ($A_X^P$) conditional on the GRM estimated from all autosomal SNPs in an MLM $y = X\beta + g_X + g_G + \varepsilon$, where $g_X$ is a vector of X-linked genetic effects with $\operatorname{var}(g_X) = A_X^P\sigma_{X(F)}^2$.
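The X-chromosome GRM and its parameterization by the lyonization coefficient can be sketched as follows. Several choices here are assumptions rather than statements of the method: the sum over SNPs is divided by $N$ so that entries are per-SNP averages, as is conventional for a GRM; the male-female denominator is taken as $\sqrt{2}\,p_i(1-p_i)$, the product of the male and female genotype standard deviations; and the function names, dictionary keys and toy genotypes are hypothetical.

```python
import math


def x_grm_entry(x_j, x_k, sex_j, sex_k, freqs):
    """Raw X-chromosome relationship for one pair of individuals.

    x_j, x_k: reference allele counts per SNP (0/1 for males, 0/1/2 for females).
    sex_j, sex_k: "M" or "F".
    freqs: reference allele frequencies, strictly between 0 and 1.
    """
    total = 0.0
    for a, b, p in zip(x_j, x_k, freqs):
        dev_j = a - p if sex_j == "M" else a - 2 * p
        dev_k = b - p if sex_k == "M" else b - 2 * p
        if sex_j == "M" and sex_k == "M":
            denom = p * (1 - p)
        elif sex_j == "F" and sex_k == "F":
            denom = 2 * p * (1 - p)
        else:
            denom = math.sqrt(2) * p * (1 - p)
        total += dev_j * dev_k / denom
    return total / len(freqs)  # assumed averaging over SNPs


def parameterize(a_raw, sex_j, sex_k, d):
    """Scale a raw entry by d^2 (male pair), 1 (female pair) or d (mixed pair)."""
    if sex_j == "M" and sex_k == "M":
        return d * d * a_raw
    if sex_j == "F" and sex_k == "F":
        return a_raw
    return d * a_raw


# Values of d under the three hypotheses described in the text.
LYONIZATION = {
    "equal_variance": 1.0,
    "no_dosage_compensation": math.sqrt(0.5),
    "full_dosage_compensation": math.sqrt(2.0),
}

freqs = [0.5, 0.5]
a_mm = x_grm_entry([1, 0], [1, 0], "M", "M", freqs)
a_ff = x_grm_entry([2, 0], [2, 0], "F", "F", freqs)
a_mf = x_grm_entry([1, 0], [2, 0], "M", "F", freqs)
a_mm_full = parameterize(a_mm, "M", "M", LYONIZATION["full_dosage_compensation"])
```

Because only the scaling of the matrix changes between hypotheses, the raw matrix needs to be computed once and can then be re-parameterized for each value of $d$.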

Supplementary Material

Genome-partitioning_SD information

Acknowledgments

Funding support for the GENEVA project has been provided through the NIH Genes, Environment and Health Initiative. For the ARIC project support is from U01 HG 004402 (PI Eric A. Boerwinkle). For the NHS and HPFS support is from U01 HG 004399 and U01 HG 004728 (PIs: Frank B. Hu and Louis R. Pasquale). Genotyping for the ARIC, NHS and HPFS studies was performed at the Broad Institute of MIT and Harvard, with funding support from U01 HG04424 (PI Stacey Gabriel). The GENEVA Coordinating Center receives support from U01 HG 004446 (PI Bruce S. Weir). Assistance with GENEVA data cleaning was provided by the National Center for Biotechnology Information. David Crosslin and Cathy Laurie of the GENEVA project assisted in making the data available for analysis. A Physician Scientist Award from Research to Prevent Blindness in New York City also supports L.R.P. M.C.C. is a recipient of a Canadian Institutes of Health Research Fellowship. We acknowledge funding from the Australian National Health and Medical Research Council (NHMRC grants 389892, 613672) and the Australian Research Council (ARC grants DP0770096 and DP1093900). We thank Danielle Posthuma for discussions and the referees for constructive comments.

Footnotes

AUTHOR CONTRIBUTIONS

P.M.V., M.E.G., B.S.W. and T.A.M. designed the study. J.Y. performed all statistical analyses. J.Y. and P.M.V. wrote the first draft of the paper. All authors contributed by providing genotype and phenotype data, by giving advice on analyses and interpretation of results and/or by giving advice on the contents of the paper.

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

References

  • 1. Hindorff LA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106:9362–7. doi: 10.1073/pnas.0903103106.
  • 2. Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
  • 3. Manolio TA. Genomewide association studies and assessment of the risk of disease. N Engl J Med. 2010;363:166–76. doi: 10.1056/NEJMra0905980.
  • 4. Yang J, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42:565–9. doi: 10.1038/ng.608.
  • 5. Visscher PM, et al. Genome partitioning of genetic variation for height from 11,214 sibling pairs. Am J Hum Genet. 2007;81:1104–10. doi: 10.1086/522934.
  • 6. Rimm EB, et al. Prospective study of alcohol consumption and risk of coronary disease in men. Lancet. 1991;338:464–8. doi: 10.1016/0140-6736(91)90542-w.
  • 7. Colditz GA, Hankinson SE. The Nurses’ Health Study: lifestyle and health among women. Nat Rev Cancer. 2005;5:388–96. doi: 10.1038/nrc1608.
  • 8. Psaty BM, et al. Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium: Design of prospective meta-analyses of genome-wide association studies from 5 cohorts. Circ Cardiovasc Genet. 2009;2:73–80. doi: 10.1161/CIRCGENETICS.108.829747.
  • 9. Laurie CC, et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010;34:591–602. doi: 10.1002/gepi.20516.
  • 10. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011.
  • 11. Visscher PM, Yang J, Goddard ME. A commentary on ‘common SNPs explain a large proportion of the heritability for human height’ by Yang et al. (2010). Twin Res Hum Genet. 2010;13:517–24. doi: 10.1375/twin.13.6.517.
  • 12. Willer CJ, et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet. 2009;41:25–34. doi: 10.1038/ng.287.
  • 13. Thorleifsson G, et al. Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet. 2009;41:18–24. doi: 10.1038/ng.274.
  • 14. Frayling TM, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science. 2007;316:889–894. doi: 10.1126/science.1141634.
  • 15. Speliotes EK, et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet. 2010;42:937–948. doi: 10.1038/ng.686.
  • 16. Smith NL, et al. Novel associations of multiple genetic loci with plasma levels of factor VII, factor VIII, and von Willebrand factor: The CHARGE (Cohorts for Heart and Aging Research in Genome Epidemiology) Consortium. Circulation. 2010;121:1382–92. doi: 10.1161/CIRCULATIONAHA.109.869156.
  • 17. Shah SH, Pitt GS. Genetics of cardiac repolarization. Nat Genet. 2009;41:388–389. doi: 10.1038/ng0409-388.
  • 18. Preston AE, Barr A. The Plasma Concentration of Factor VIII in the Normal Population. II. The Effects of Age, Sex and Blood Group. Br J Haematol. 1964;10:238–45. doi: 10.1111/j.1365-2141.1964.tb00698.x.
  • 19. O’Donnell J, Boulton FE, Manning RA, Laffan MA. Amount of H antigen expressed on circulating von Willebrand factor is modified by ABO blood group genotype and is a major determinant of plasma von Willebrand factor antigen levels. Arterioscler Thromb Vasc Biol. 2002;22:335–41. doi: 10.1161/hq0202.103997.
  • 20. Liu JZ, et al. A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010;87:139–45. doi: 10.1016/j.ajhg.2010.06.009.
  • 21. Zhang Z, et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 2010;42:355–360. doi: 10.1038/ng.546.
  • 22. Kang HM, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42:348–354. doi: 10.1038/ng.548.
  • 23. Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–9. doi: 10.1038/ng1847.
  • 24. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11:459–463. doi: 10.1038/nrg2813.
  • 25. Bulmer MG. The mathematical theory of quantitative genetics. Oxford University Press; New York: 1985.
  • 26. Lango Allen H, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467:832–838. doi: 10.1038/nature09410.
  • 27. Falconer DS, Mackay TFC. Introduction to quantitative genetics. England: Longman; 1996.
  • 28. Colditz GA, et al. Weight as a risk factor for clinical diabetes in women. Am J Epidemiol. 1990;132:501–13. doi: 10.1093/oxfordjournals.aje.a115686.
  • 29. Purcell SM, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–52. doi: 10.1038/nature08185.
  • 30. Magnusson PK, Rasmussen F. Familial resemblance of body mass index and familial risk of high and low body mass index. A study of young men in Sweden. Int J Obes Relat Metab Disord. 2002;26:1225–31. doi: 10.1038/sj.ijo.0802041.
  • 31. Schousboe K, et al. Sex differences in heritability of BMI: a comparative study of results from twin studies in eight countries. Twin Res. 2003;6:409–21. doi: 10.1375/136905203770326411.
  • 32. Eyre-Walker A. Genetic architecture of a complex trait and its implications for fitness and genome-wide association studies. Proc Natl Acad Sci USA. 2010;107:1752–1756. doi: 10.1073/pnas.0906182107.
  • 33. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
  • 34. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB. Rare variants create synthetic genome-wide associations. PLoS Biol. 2010;8:e1000294. doi: 10.1371/journal.pbio.1000294.
  • 35. McClellan J, King MC. Genetic heterogeneity in human disease. Cell. 2010;141:210–217. doi: 10.1016/j.cell.2010.03.032.
  • 36. Teslovich TM, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466:707–713. doi: 10.1038/nature09270.
  • 37. Qi L, et al. Genetic variants at 2q24 are associated with susceptibility to type 2 diabetes. Hum Mol Genet. 2010;19:2706–15. doi: 10.1093/hmg/ddq156.
  • 38. Cornelis MC, et al. The gene, environment association studies consortium (GENEVA): maximizing the knowledge obtained from GWAS by collaboration across studies of multiple conditions. Genet Epidemiol. 2010;34:364–372. doi: 10.1002/gepi.20492.
  • 39. Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75. doi: 10.1086/519795.
  • 40. Kent JW Jr, Dyer TD, Blangero J. Estimating the additive genetic effect of the X chromosome. Genet Epidemiol. 2005;29:377–88. doi: 10.1002/gepi.20093.
  • 41. Visscher PM, Hill WG, Wray NR. Heritability in the genomics era: concepts and misconceptions. Nat Rev Genet. 2008;9:255–66. doi: 10.1038/nrg2322.
  • 42. Orstavik KH, et al. Factor VIII and factor IX in a twin population. Evidence for a major effect of ABO locus on factor VIII level. Am J Hum Genet. 1985;37:89–101.
  • 43. de Lange M, Snieder H, Ariens RA, Spector TD, Grant PJ. The genetics of haemostasis: a twin study. Lancet. 2001;357:101–5. doi: 10.1016/S0140-6736(00)03541-8.
  • 44. Dalageorgou C, et al. Heritability of QT interval: how much is explained by genes for resting heart rate? J Cardiovasc Electrophysiol. 2008;19:386–91. doi: 10.1111/j.1540-8167.2007.01030.x.
  • 45. Russell MW, Law I, Sholinsky P, Fabsitz RR. Heritability of ECG measurements in adult male twins. J Electrocardiol. 1998;30(Suppl):64–8. doi: 10.1016/s0022-0736(98)80034-4.
