Significance
The genetic contribution to a phenotype is frequently measured by heritability, the fraction of trait variation explained by genetic differences. Hundreds of publications have found DNA polymorphisms that are statistically associated with diseases or quantitative traits [genome-wide association studies (GWASs)]. Genome-wide complex trait analysis (GCTA), a recent method of analyzing such data, finds high heritabilities for such phenotypes. We analyze GCTA and show that the heritability estimates it produces are highly sensitive to the structure of the genetic relatedness matrix, to the sampling of phenotypes and subjects, and to the accuracy of phenotype measurements. Plausible modifications of the method aimed at increasing stability yield much smaller heritabilities. It is essential to reevaluate the many published heritability estimates based on GCTA.
Keywords: GCTA, GWAS, heritability, SNP, singular value decomposition
Abstract
Genome-wide association studies (GWASs) seek to understand the relationship between complex phenotype(s) (e.g., height) and up to millions of single-nucleotide polymorphisms (SNPs). Early analyses of GWASs are commonly believed to have “missed” much of the additive genetic variance estimated from correlations between relatives. A more recent method, genome-wide complex trait analysis (GCTA), obtains much higher estimates of heritability using a model of random SNP effects correlated between genotypically similar individuals. GCTA has now been applied to many phenotypes from schizophrenia to scholastic achievement. However, recent studies question GCTA’s estimates of heritability. Here, we show that GCTA applied to current SNP data cannot produce reliable or stable estimates of heritability. We show first that GCTA depends sensitively on all singular values of a high-dimensional genetic relatedness matrix (GRM). When the assumptions in GCTA are satisfied exactly, we show that the heritability estimates produced by GCTA will be biased and the standard errors will likely be inaccurate. When the population is stratified, we find that GRMs typically have highly skewed singular values, and we prove that the many small singular values cannot be estimated reliably. Hence, GWAS data are necessarily overfit by GCTA which, as a result, produces high estimates of heritability. We also show that GCTA’s heritability estimates are sensitive to the chosen sample and to measurement errors in the phenotype. We illustrate our results using the Framingham dataset. Our analysis suggests that results obtained using GCTA, and the results’ qualitative interpretations, should be interpreted with great caution.
In recent years, genome-wide association studies (GWASs) have become an important tool for investigating the genetic contribution to complex phenotypes. These studies use statistical techniques to find associations between single nucleotide polymorphisms (SNPs) and phenotype(s) (e.g., continuous traits such as height or discrete traits such as presence/absence of a disease). A widely used measure of genetic influence on a phenotype is the (narrow-sense) heritability, defined as the ratio of the additive genetic variance to the total phenotypic variance. A major conundrum revealed by many analyses of GWAS data has been that the small number of significant associations explain much less of the heritability than is estimated from correlations between relatives [i.e., much heritability is “missing” (1–3)]. To address this problem, Yang et al. (4) posited that heritability is not missing but is “hidden.” The authors developed a statistical framework [genome-wide complex trait analysis (GCTA)] in which each SNP makes a random contribution to the phenotype, and these contributions are correlated between individuals who have similar genotypes. Applied to many GWASs, GCTA yields estimates of heritability far larger than those obtained using earlier analyses. GCTA has been used to estimate the heritability of many phenotypes from schizophrenia (5) to scholastic achievement (6). Despite its current wide use, recent studies (7, 8) have questioned the reliability of GCTA estimates.
We show here that the results produced using GCTA hinge on accurate estimation of a high-dimensional genetic relatedness matrix (GRM). We show that even when the assumptions in GCTA are satisfied exactly, heritability estimates produced by GCTA will be biased, and it is unlikely that the confidence intervals will be accurate. When there is genetic stratification in the population, we show that GCTA’s heritability estimates are guaranteed to be unstable and unreliable, which is especially relevant because stratification is common in human GWASs.
Our analysis has two other important consequences: (i) the heritability estimate produced by GCTA is sensitive to the choice of the sample used; and (ii) the estimate is sensitive to measurement errors in the phenotype. We argue that this instability and sensitivity are attributable to the fact that GCTA necessarily overfits typical GWASs. We show that a direct approach to eliminating this overfitting leads back to the small SNP heritability estimates derived previously from association studies. We illustrate our results using the Framingham dataset (9, 10) comprising information on 49,214 SNPs in 2,698 unrelated individuals.
We conclude that application of GCTA to GWAS data may not reliably improve our understanding of the genomic basis of phenotypic variability. Even when the assumptions for GCTA all hold, we recommend the use of diagnostic tests, and we describe one such test. We also discuss several ways of moving toward better methods.
The Data and the GCTA Model
The Data.
A typical GWAS takes phenotypic values for N individuals and assays the individuals’ genotypes at P single nucleotide sites (SNPs). Typically, $P \gg N$ [e.g., our illustrations use the Framingham data on 2,698 ($N$) unrelated individuals at 49,214 ($P$) SNPs]. The genetic data can be represented as an $N \times P$ matrix whose $(i, j)$ entry takes the value 0, 1, or 2, corresponding to the number of copies of the reference allele of the $j$th SNP in the $i$th individual.
Mixed-Effect Models.
To quantify how genes influence phenotypes, mixed linear models (11, 12) of the form
$$y = X\beta + Zu + \varepsilon \qquad [1]$$
are used in the animal-breeding literature. Here, $y$ is a vector of phenotype values for the N individuals, $\beta$ is a vector of “fixed effects,” $u$ is a vector of random effects of an individual’s genotype, $\varepsilon$ is a vector of residuals, and $Z$ is a matrix describing how genetic effects are correlated between individuals. In animal-breeding studies and some human studies, the entries of $Z$ are estimated from pedigrees.
What GCTA Assumes and Does.
GCTA applies Eq. 1 to GWASs, where genetic and phenotypic information are usually measured on unrelated individuals. Each individual’s genotype is given by P numbers, so the vector of random effects is $u \in \mathbb{R}^{P}$ and the matrix is $Z \in \mathbb{R}^{N \times P}$. GCTA estimates $Z$ by centering and scaling the data matrix using Hardy–Weinberg assumptions and assumes that $u \sim N(0, \sigma_u^2 I)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$.
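The centering and scaling step can be sketched on toy data. The snippet below is a minimal illustration, not GCTA's implementation: the function names `standardize_genotypes` and `relatedness_matrix`, the toy allele counts, and the specific standardization (subtract twice the sample allele frequency, divide by the Hardy–Weinberg standard deviation, then average cross-products over SNPs) are assumptions made here for illustration.

```python
from math import sqrt

def standardize_genotypes(G):
    """Center and scale an N x P matrix of 0/1/2 allele counts (list of rows)."""
    n, p = len(G), len(G[0])
    Z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        freq = sum(G[i][j] for i in range(n)) / (2.0 * n)  # sample allele frequency
        if freq <= 0.0 or freq >= 1.0:
            continue  # monomorphic SNP: leave the column at zero
        sd = sqrt(2.0 * freq * (1.0 - freq))  # Hardy-Weinberg standard deviation
        for i in range(n):
            Z[i][j] = (G[i][j] - 2.0 * freq) / sd
    return Z

def relatedness_matrix(Z):
    """Return the N x N matrix Z Z^T / P for a standardized N x P matrix Z."""
    n, p = len(Z), len(Z[0])
    return [[sum(Z[i][k] * Z[j][k] for k in range(p)) / p for j in range(n)]
            for i in range(n)]

# Toy data: 3 individuals, 4 SNPs (hypothetical counts).
G = [[0, 1, 2, 1],
     [1, 1, 0, 2],
     [2, 0, 1, 1]]
Z = standardize_genotypes(G)
A = relatedness_matrix(Z)
```

Because each column of `Z` is centered, the rows of `A` sum to zero, so relatedness here is measured relative to the sample average rather than to a pedigree baseline.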
The fixed effects term is commonly dropped from the analysis and the GCTA model written as
$$y = Zu + \varepsilon, \qquad [2]$$
which we use henceforth. This model can be rewritten more clearly, with the same $\varepsilon$, as
$$y = g + \varepsilon, \quad \operatorname{var}(g) = \sigma_g^2 A, \qquad [3]$$
where $g = Zu$ is a random vector of genetic contributions to the phenotypes, $\sigma_g^2 = P\sigma_u^2$ is the variance of total additive genetic effects, and $A = ZZ^T/P$ is the GRM between pairs of individuals.
GCTA obtains maximum-likelihood estimates (MLEs) of the parameters $\sigma_u^2$ and $\sigma_\varepsilon^2$ and then estimates the heritability as the ratio $P\hat{\sigma}_u^2/\sigma_y^2$, where $\sigma_y^2$ is the observed variance in the phenotype (4).
The relatively simple structure of Eq. 3 (linear, additive, and no environmental or epigenetic effects) relies on assumptions that GCTA has in common with the animal-breeding literature. GCTA assumes that the SNPs used are in linkage equilibrium; in practice, SNPs in linkage disequilibrium are avoided. In our analysis and example, we assume that this selection has been made. However, GCTA makes important additional assumptions: assumption 1, each SNP makes a random contribution to the phenotype independent of the others; assumption 2, the distribution of these random contributions is identical for all SNPs; and assumption 3, there is no genetic stratification in the population.
In this paper, we use a rigorous mathematical analysis of GCTA to answer two key questions. How reliable are the GCTA estimates when the assumptions in the model are satisfied exactly? How robust are the GCTA estimates to violations of assumption 3 (with or without a correction)?
We begin with the key point that any observed realization of the matrix $Z$ in Eq. 2 is a sample from some underlying distribution of possible data. In fact, $Z$ is a random matrix. Hence, the data (the entries of $Z$) and the resulting GRM ($A = ZZ^T/P$) will have sampling errors. As a consequence, the MLEs produced by GCTA are statistical estimates of the parameters in Eq. 3. To analyze the precision and stability of these MLEs, we now develop the connection between the MLEs and the geometry of $Z$ (in terms of the matrix’s singular values and singular vectors).
Singular Values and GCTA.
The MLEs produced by GCTA depend on the properties of the GRM ($A = ZZ^T/P$). We prove here that these MLEs can be expressed in terms of the singular values and associated singular vectors (the spectral properties) of the data matrix $Z$. Readers will be familiar with spectral properties in the context of principal component analysis (PCA). In PCA, we rank the N eigenvalues (and associated eigenvectors) of the symmetric matrix $ZZ^T$. PCA is equivalent to a singular value decomposition (SVD) of the matrix $Z$, which produces a set of real singular values ($w_i$ for $i = 1, \ldots, N$), a set of left singular vectors ($u_i$ for $i = 1, \ldots, N$) of dimension $N \times 1$, and a corresponding set of right singular vectors ($v_i$ for $i = 1, \ldots, N$) of dimension $P \times 1$. The eigenvalues of $ZZ^T$ in a PCA are the squares of the singular values of $Z$ (the PCA eigenvectors are the left singular vectors of $Z$).
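The link between PCA and the SVD stated above can be written out in one line. This is a sketch in which the factor names $U$, $W$, and $V$ (left singular vectors, diagonal matrix of singular values $w_i$, right singular vectors) are notation assumed here for illustration.

```latex
Z = U W V^{T}, \quad U^{T}U = I, \quad V^{T}V = I
\;\Longrightarrow\;
Z Z^{T} = U W V^{T} V W^{T} U^{T} = U \,\bigl(W W^{T}\bigr)\, U^{T},
\qquad\text{so}\qquad
Z Z^{T} u_i = w_i^{2}\, u_i .
```

Each left singular vector $u_i$ is therefore an eigenvector of $ZZ^T$ with eigenvalue $w_i^2$, which is why errors in small singular values carry over directly to small eigenvalues of the relatedness matrix.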
We find (Appendix A) that the MLEs computed by GCTA are explicit functions of the singular values and singular vectors of . We write the MLEs for and as the sum of three terms, the second of which (Eqs. A3 and A8) is a function of
[4] |
note the terms in . Using to represent a realization of , the third term (Eqs. A3 and A9) is a function of and of the left singular vectors ( for ) of . We use these results to determine how sampling errors influence the estimates produced by GCTA in two cases.
Case 1: GCTA’s Assumptions Are Satisfied Exactly.
When assumptions 1 through 3 hold, the matrix $ZZ^T$ (asymptotically) has an N variate Wishart distribution with P degrees of freedom. Marčenko and Pastur (13) (also see refs. 14 and 15 for more readable expositions) show that for samples from a Wishart distribution, the empirical distribution of the eigenvalues of the GRM (Fig. 1A) converges to a limiting form when $N \to \infty$, $P \to \infty$, and $N/P$ is finite. This limit distribution depends only on the value of $N/P$ and, although asymptotic, accurately describes the eigenvalue distribution of the GRM even for the finite sample sizes shown in Fig. 1B. Note from Fig. 1A that for all values of $N/P$, almost all eigenvalues of the GRM lie within the interval 0–4, and that as $N/P$ approaches 1, the spectrum of the GRM becomes ill-conditioned (i.e., the ratio of its largest to its smallest eigenvalue becomes large).
In all available GWASs, there are more SNPs than people, so $N/P < 1$. For such cases, the known results above imply that the eigenvalues of the GRM are well-conditioned, so that the errors in the second term of the MLE (i.e., in the quantity in Eq. 4) will be small.
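For orientation, the limiting Marčenko–Pastur law for a unit-variance sample covariance matrix has support between $(1 - \sqrt{N/P})^2$ and $(1 + \sqrt{N/P})^2$ when $N/P \le 1$. The sketch below computes these edges and the implied ratio of largest to smallest eigenvalue. The function names are hypothetical, and the result applies only under the idealized assumptions of this section (independent, standardized SNPs).

```python
from math import sqrt

def mp_support(ratio):
    """Edges of the Marchenko-Pastur support for ratio = N/P in (0, 1]."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio N/P must lie in (0, 1]")
    root = sqrt(ratio)
    return (1.0 - root) ** 2, (1.0 + root) ** 2

def mp_condition_number(ratio):
    """Ratio of the upper to the lower edge; infinite when N/P = 1."""
    lower, upper = mp_support(ratio)
    return float("inf") if lower == 0.0 else upper / lower

edges_quarter = mp_support(0.25)                     # N/P = 1/4
edges_unit = mp_support(1.0)                         # N/P = 1 spans the interval 0 to 4
cond_framingham = mp_condition_number(2698 / 49214)  # sizes quoted for the Framingham data
```

For N/P = 1/4 the edges are 0.25 and 2.25, and for the Framingham dimensions the idealized edge ratio is below 3, which is the sense in which the eigenvalues are well-conditioned when the model assumptions hold.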
In addition to the eigenvalues, the third term in the MLE expression depends on the eigenvectors of the GRM. Because the eigenvectors of the true GRM are unknown, GCTA proceeds by approximating the eigenvectors of the true GRM by the eigenvectors of the sample GRM. This approximation will be valid if the eigenvectors of the true GRM are “similar” to the eigenvectors of the sample GRM. Standard results from perturbation theory show that the eigenvectors will be similar only if the eigenvalues of the sample GRM are not packed close to one another. However, because the eigenvalues of the GRM are packed in the interval 0–4, the eigenvalues of the sample GRM are guaranteed to be packed close to one another (for our simulations in Fig. 1B, the two closest eigenvalues have magnitudes 0.2412 and 0.2417, respectively). As a result of this close packing, the eigenvectors of the sample GRM can be drastically different from the eigenvectors of the true GRM, and these differences bias the MLEs of $\sigma_u^2$ and $\sigma_\varepsilon^2$ by amplifying the sampling errors associated with the GRM (see Appendix B for details). If these biases in the MLEs are large, the heritability estimates produced by GCTA will not be representative of the true underlying heritability of the phenotype. Furthermore, because GCTA estimates the SE of the heritability as a function of the MLEs, large biases would make this SE meaningless.
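The dependence of eigenvector stability on eigenvalue spacing can be seen in the smallest possible case. The sketch below uses a 2 × 2 symmetric matrix, for which the eigenvector direction has a closed form; the function names and the perturbation size are assumptions for illustration, and the gap of 0.0005 simply mirrors the spacing between the two closest eigenvalues quoted above.

```python
from math import atan2, degrees

def top_eigvec_angle(a, b, d):
    """Angle (radians) of the eigenvector for the larger eigenvalue of [[a, b], [b, d]]."""
    return 0.5 * atan2(2.0 * b, a - d)

def rotation_from_perturbation(gap, eps, base=0.2412):
    """Degrees by which the top eigenvector of diag(base + gap, base) rotates
    when a symmetric off-diagonal error eps is added."""
    before = top_eigvec_angle(base + gap, 0.0, base)
    after = top_eigvec_angle(base + gap, eps, base)
    return degrees(abs(after - before))

# Same error size, two different eigenvalue gaps.
tight = rotation_from_perturbation(gap=0.0005, eps=0.0005)  # closely packed eigenvalues
wide = rotation_from_perturbation(gap=1.0, eps=0.0005)      # well-separated eigenvalues
```

With the same error, the eigenvector rotates by roughly 30 degrees when the gap is comparable to the error, and by a small fraction of a degree when the gap is large; the rotation scales like the error divided by the gap.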
To demonstrate this bias, we simulated a dataset comprising 50,000 SNPs in linkage equilibrium for 2,000 people [using PLINK software (16)] and a phenotype with a heritability of 0.75 (using GCTA). The simulation assumes that the entire additive genetic contribution to the phenotype comes from 45,000 out of the total 50,000 SNPs (the causal SNPs) whose effect sizes are normally distributed with mean 0 and variance 1.
Using GCTA on this dataset, we estimate the genotypic variance as $P\hat{\sigma}_u^2 = 0.685$ with a SE of 0.151. This result, along with standard results from large sample theory (17), implies that the MLE of $\sigma_u^2$ is approximately normally distributed with mean $0.685/50{,}000 = 1.37 \times 10^{-5}$ and SD $0.151/50{,}000 \approx 3.0 \times 10^{-6}$, which forms GCTA’s null hypothesis.
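The conversion from the total genotypic variance to the per-SNP scale is simple division by the number of SNPs. A minimal sketch, using the estimate 0.685 and SE 0.151 stated above; the helper names and the use of a normal 95% interval (z = 1.96) are assumptions for illustration.

```python
def per_snp_null(total_estimate, total_se, n_snps, z=1.96):
    """Mean, SD, and normal-theory interval for the per-SNP variance."""
    mean = total_estimate / n_snps
    sd = total_se / n_snps
    return mean, sd, (mean - z * sd, mean + z * sd)

def sds_from_mean(value, mean, sd):
    """How many SDs a per-SNP estimate lies from the null mean."""
    return abs(value - mean) / sd

mean, sd, ci95 = per_snp_null(0.685, 0.151, 50_000)
# mean is about 1.37e-5 and sd about 3.02e-6
```

An estimate from a SNP subsample can then be passed to `sds_from_mean` to see how far it falls from what the null predicts.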
To test this hypothesis, we construct 500 genotype matrices, each comprising all 2,000 people but only 5,000 SNPs randomly chosen from the initial 50,000. We ran GCTA using each of these genotype matrices and, in each case, estimated $\sigma_u^2$ (Fig. 2). More than half of these estimates lie outside the 95% confidence interval (marked by red arrows), with the largest of these estimates being more than 10 SDs away from the mean; these results suggest that GCTA’s null is almost certainly being violated.
Although our results guarantee that each estimate of $\sigma_u^2$ in Fig. 2 will be biased, our results do not provide any general information about the magnitude of the bias. It is possible that the bias in some of these estimates is small, and some resampling procedure might resolve problems raised in this section; we do not pursue this here.
Case 2: There Is Genetic Stratification.
Assumption 3 is typically violated: stratification is widely observed in humans (18, 19) and animals (20, 21) and is a major reason for the high number of false discoveries in GWASs (22). GCTA claims to address this by incorporating eigenvectors of the GRM as fixed effects [as columns of $X$ in Eq. 1, as suggested by Eigenstrat software (23)]. Surprisingly, GCTA finds that in most cases, the fixed-effect term has little influence (i.e., the heritability estimate from GCTA is nearly independent of the stratification). We will show that the analysis provided by GCTA is flawed and that stratification will induce large errors in the MLEs. We illustrate our points using the Framingham dataset (9, 10), which is known to be stratified.
We begin by analyzing how genetic stratification influences the spectral properties of $Z$. A plot of the singular values of $Z$ for the Framingham dataset (Fig. 3) reveals that these values are extremely skewed. The first four singular values are greater than 1,000, and the next 20 or so are between 100 and 1,000, whereas thousands of singular values are close to 0, so the largest singular value is orders of magnitude larger than the smallest.
It is well known (see theorem 3 in ref. 24 and section 4.3 in ref. 25) that such skews must occur in stratified populations. In essence, these studies show that if N is large, and the markers are sampled from K different populations, roughly the first $K$ singular values of $Z$ will be much larger than the remaining ones. Because in most cases, $K \ll N$, we expect most of the singular values of $Z$ to be close to 0, as in Fig. 3.
This skew, and the long tail of near-zero singular values, have serious implications for the MLEs from GCTA. Recall that the second term in the MLE (Eq. A8) is sensitive to the near-zero singular values of $Z$. The third term (Eq. A9) is a function of the observed phenotype vector and of the left singular vectors of $Z$, and so is also sensitive to the near-zero singular values when $Z$ is ill-conditioned (i.e., the ratio of the largest to the smallest singular value of $Z$ is large). Therefore, the accuracy of both terms in the MLE expression hinges on the precise estimation of the near-zero singular values of $Z$. Stewart (26) shows that in the presence of noise, the estimation errors of a singular value whose “true” magnitude is 0 will be larger than the noise itself (i.e., the estimates of the near-zero singular values are extremely imprecise). We now illustrate several problems with the GCTA estimates in genetically stratified populations, all of which stem from the imprecision of the near-zero singular values.
Sensitivity to the SNPs used in the study.
The heritability estimates from GCTA will be sensitive to the SNPs used in the study because the errors associated with the near-zero singular values (and in turn the MLEs) for datasets constructed using different sets of SNPs will be different. To demonstrate this, we construct 2,500 genotype matrices, $Z_k$ for $k = 1, \ldots, 2{,}500$, each comprising 5,000 randomly sampled SNPs (of the total 49,214 SNPs), and use GCTA to estimate the heritability of systolic blood pressure (BP) (27) using each of these $Z_k$. Contrary to the claim in ref. 4, these heritability estimates show high variability (Fig. 4A), because of sampling errors associated with one or more of the near-zero singular values (Fig. 4B).
Sensitivity to the measurement errors in the phenotype.
Because $Z$ is ill-conditioned, small changes in the phenotype vector can cause large changes in the heritability estimates from GCTA. Hence, GCTA violates the “unspoken assumption that imprecision of measurement of phenotype will not have large systematic effects on the location of significant associations in GWAS” (28). To demonstrate this violation, we generate 2,500 noisy samples of the BP phenotype vector [for each sample, the $i$th entry of the vector is drawn uniformly between the minimum and maximum of the four BP readings available to us for person i; in general, BP readings are much noisier (29)] and use GCTA to estimate the heritability using each of these vectors. Even for the modest errors in this case, GCTA shows high variability in its heritability estimates (Fig. 5).
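The mechanism behind this sensitivity is the generic one for ill-conditioned problems: any step that divides by a near-zero singular value magnifies errors in the input. The sketch below is a deliberately simplified illustration with a diagonal system, not GCTA's estimator; the function name and the numbers are hypothetical.

```python
def amplification(singular_values, delta_y):
    """Change in x = W^{-1} y caused by perturbing y by delta_y, for diagonal W."""
    return [dy / w for dy, w in zip(delta_y, singular_values)]

w = [1000.0, 0.001]        # one large and one near-zero singular value
delta_y = [0.001, 0.001]   # the same small measurement error in each coordinate
delta_x = amplification(w, delta_y)
# The error along the large singular value shrinks; along the small one it grows 1,000-fold.
```

The same measurement error is harmless in the direction of a large singular value and dominant in the direction of a near-zero one, and the ratio of the two effects equals the condition number.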
Saturation of heritability estimates.
According to GCTA, each SNP makes a random contribution to the phenotype; therefore, the heritability estimate, $P\hat{\sigma}_u^2/\sigma_y^2$, is necessarily directly proportional to P (because $\sigma_u^2$ is fixed). Several studies (see figure 5 in ref. 30) using GCTA have found results contradicting this; these studies report a threshold value of P above which introducing more SNPs in the analysis produces only marginal increases in the heritability estimates from GCTA. This saturating behavior implies that above a threshold of P, $P\sigma_u^2$ is nearly independent of P (i.e., $\sigma_u^2$ is inversely proportional to P), which violates GCTA’s assumption that the contribution of each SNP to the heritability estimate is independent of the others.
Bias in the heritability estimates.
We have shown that for a stratified population, the MLEs produced by GCTA are guaranteed to be biased. The bias arises because thousands of eigenvalues of the GRM are closely packed (near 0) and have large sampling errors associated with their values (Appendix B). As a result of the bias, the heritability estimates produced by GCTA are not reflective of the “true” underlying heritability. Furthermore, the SEs reported by GCTA are functions of the MLEs, and so these SEs will also be unreliable.
To demonstrate this unreliability, we first ran GCTA on the Framingham dataset with BP as the phenotype. GCTA reports that $P\sigma_u^2$ has an estimate of 0.263 and a SE of 0.048. This result plus large sample theory imply that the MLE for $\sigma_u^2$ will be approximately normally distributed with mean $0.263/49{,}214 \approx 5.34 \times 10^{-6}$ and SD $0.048/49{,}214 \approx 9.75 \times 10^{-7}$, which forms GCTA’s null hypothesis.
To test this hypothesis, we computed the estimate and SEs of $\sigma_u^2$ for each of the samples used in Fig. 4A. The $\hat{\sigma}_u^2$ for almost all of these samples lies outside the 95% confidence intervals predicted by GCTA’s null (marked by the red arrows), with the largest of these estimates ($5.46 \times 10^{-5}$) being more than 50 SDs away from the mean; these results suggest that GCTA’s null is almost certainly being violated.
Resampling techniques like the bootstrap cannot be used to correct for the bias in heritability estimates because every run of the bootstrap will produce a biased estimate of heritability, and there is no way of estimating the magnitude of the bias in any of these samples. Are there other approaches to fixing GCTA when there is genetic stratification in the population? Two common approaches to fixing this problem are (i) constraining the random effects associated with only some of the SNPs to be relevant (sparsity) (31) or (ii) denoising the matrix (i.e., setting its lower noisy singular values to 0) before constructing it (32). We do not pursue the first approach because it violates the premise of GCTA that each SNP makes a random contribution to the phenotype. Using the latter approach, we show (Appendix C) that the contribution of the random effects term will not be significantly different from 0 (Fig. 6); these results are consistent with our findings on denoising the Framingham dataset. Because the random effects term is the sole driver of the “improved” heritability estimates produced by GCTA, in the term’s absence, the heritability estimates will be no better than those obtained using the significant SNPs in association studies.
It is not surprising that the estimates produced by GCTA are sensitive and biased because the method estimates parameters ( parameters for the N nonzero singular values and their corresponding left and right singular vectors, plus and ) from a dataset containing entries and therefore overfits the data. In contrast, association studies insist on stringent P values for significance and so greatly reduce the effective number of parameters being estimated (see ref. 33 for details); the resulting effective number of parameters is much smaller than the size of the dataset, so overfitting is not a problem.
The MLEs produced by GCTA will be unreliable irrespective of the number of principal components that are included in the model (see Fig. 7, where the MLEs produced by GCTA are unreliable even when five principal components are used as fixed effects; similar unreliabilities were observed when one and three principal components were used as fixed effects, respectively). Price and coworkers (23, 24) have shown that principal components are useful for identifying population stratification and when used with reliable methods like association studies, are effective in correcting for population stratification. When principal components are used in conjunction with GCTA, the analysis step is unreliable as a result of the overfitting. Therefore, although the principal components are still able to accurately identify population stratification, they only serve to compound the bias of GCTA.
Discussion
GCTA analyses have been widely accepted in large part because they produce heritability estimates that are many times larger than earlier estimates from GWAS data and are closer to those obtained from data with reliable pedigrees. The statistical model in GCTA assumes that each SNP makes a small random contribution to the variability in the phenotype. Our analytical and numerical results illustrate the problems with GCTA when (i) the assumptions of the model are satisfied exactly or (ii) the assumptions are violated as a result of genetic stratification. In both cases, the problems associated with GCTA stem from the fact that a high-dimensional correlation matrix is being estimated from a limited amount of data without dimensionality reduction.
When there is genetic stratification in the population, the GRM has a long tail of near-zero eigenvalues; here, GCTA will produce unreliable heritability estimates. GCTA claims that including the first few principal components as fixed effects (following ref. 23) will resolve the problem of stratification, but this is not the case; even after including principal components as fixed effects, the problems associated with the near-zero singular values of $Z$ remain. Principal components are useful for dealing with stratification in the context of association studies because principal components reduce the dimensionality of the problem via stringent P value criteria. We believe that stratification is responsible for many of the counterintuitive results reported by studies using GCTA (a more detailed discussion of these studies can be found in ref. 8). Furthermore, numerous studies on sensitive subjects like childhood intelligence (34), Tourette syndrome (35), and schizophrenia (36) need to be critically reviewed.
Even when there is no genetic stratification, our analysis strongly suggests that the heritability estimates and their SEs produced by GCTA will be unreliable. These unreliabilities are illustrated by our simulations (Fig. 2). We do not prove that they will apply for all datasets where there is no genetic stratification. Our illustration does suggest a simple test of the reliability of GCTA’s estimates by a resampling procedure. First, construct many, say 500, genotype matrices by randomly sampling SNPs from the total P SNPs in the dataset, and use GCTA to compute the estimate $\hat{\sigma}_u^2$ corresponding to each of these genotype matrices. Next, compute the estimate ($\mu$) and SE ($s$) for $\sigma_u^2$ using all P SNPs. Under GCTA’s null, ∼99.5% of the distribution should lie within about 2.8 SEs of the full-data estimate, i.e., in the interval [$\mu - 2.8s$, $\mu + 2.8s$]; if any of the 500 simulations estimates a $\hat{\sigma}_u^2$ far outside this range, GCTA’s estimates should not be trusted.
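The resampling test described above can be sketched as a short routine. This is an illustration under stated assumptions rather than a definitive implementation: `estimate_sigma_u2` is a hypothetical callable standing in for one GCTA run on a chosen SNP subset (no real GCTA options are modeled), the default of 500 replicates follows the text, and the cutoff of 2.81 SEs is the two-sided 99.5% point of the normal distribution.

```python
import random

def resampling_check(snp_ids, estimate_sigma_u2, mu, se, n_subsample,
                     n_rep=500, z_crit=2.81, seed=0):
    """Fraction of subsample estimates falling outside mu +/- z_crit * se.

    snp_ids           : list of SNP identifiers in the full dataset
    estimate_sigma_u2 : hypothetical callable, subset of SNP ids -> per-SNP variance estimate
    mu, se            : full-data estimate of the per-SNP variance and its SE
    """
    rng = random.Random(seed)
    lo, hi = mu - z_crit * se, mu + z_crit * se
    estimates = []
    for _ in range(n_rep):
        subset = rng.sample(snp_ids, n_subsample)  # SNPs drawn without replacement
        estimates.append(estimate_sigma_u2(subset))
    outside = sum(1 for e in estimates if e < lo or e > hi)
    return outside / n_rep, (lo, hi)

# Toy stand-ins for the estimator: one stable, one far from the full-data value.
snps = list(range(1000))
stable_fraction, interval = resampling_check(
    snps, lambda subset: 1.0e-5, mu=1.0e-5, se=1.0e-6, n_subsample=100, n_rep=50)
unstable_fraction, _ = resampling_check(
    snps, lambda subset: 5.0e-5, mu=1.0e-5, se=1.0e-6, n_subsample=100, n_rep=50)
```

A fraction well above 0.005 outside the interval would be a warning sign under the null; in practice each call to the estimator would be a full model fit, so the number of replicates is the main cost.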
Heritability estimates using methods other than GCTA are generally low. These methods use, for example, either single SNP associations or polygenic scores constructed from a few significant SNPs. We have shown that GCTA grossly underestimates the uncertainties associated with the individual SNP contributions and as a result, the only reliable heritability estimates are the 3–4% produced by these other methods.
The problems in GCTA stem from the overfitting of a high-dimensional GRM. To make progress with GCTA-like mixed models, it is critical that the estimates of this matrix be refined. Some progress has been made in this direction using, for example, methods for covariance smoothing (37). There are several alternative methods of describing the relatedness between individuals (for a survey of these methods, see ref. 38), some of which could prove useful in improving the estimate of the GRM.
We have shown that a brute force approach to estimating the covariance structure of the random effects of SNPs (as in GCTA) does not resolve the problem stemming from the number of SNPs (P) being much larger than the number of subjects genotyped (N). We believe that future studies of GWAS data can make progress by incorporating prior information. Two possible ways of so doing are (i) insisting that the basis used for constructing the GRM is sparse (i.e., only some SNPs make random contributions and the rest have fixed contributions) and (ii) incorporating biological information about the relationships between elements in the random covariance matrix.
Appendices
Appendix A: The Likelihood Function and Sensitivity.
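Expressing Eq. 2 in probabilistic form, we have
$$y \mid u \sim N(Zu, \sigma_\varepsilon^2 I), \quad u \sim N(0, \sigma_u^2 I). \qquad [A1]$$
Therefore, the marginal distribution of $y$ will be given by
$$y \sim N(0, \Sigma), \quad \Sigma = \sigma_u^2 ZZ^T + \sigma_\varepsilon^2 I. \qquad [A2]$$
We have only one sample of the vector $y$, namely our observed vector of phenotypes (which we call $y_0$). Because $y$ has a multivariate normal distribution, the log likelihood of observing $y_0$ is
$$\log L = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log\lvert\Sigma\rvert - \frac{1}{2}\, y_0^T \Sigma^{-1} y_0. \qquad [A3]$$
The Woodbury matrix identity (39) states that the inverse of a matrix $A + UCV$ is given by $A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}$. Hence in Eq. A3 with $A = \sigma_\varepsilon^2 I$, $U = Z$, $C = \sigma_u^2 I$, and $V = Z^T$, we have
$$\Sigma^{-1} = \frac{1}{\sigma_\varepsilon^2} I - \frac{1}{\sigma_\varepsilon^4}\, Z\left(\frac{1}{\sigma_u^2} I + \frac{1}{\sigma_\varepsilon^2} Z^T Z\right)^{-1} Z^T. \qquad [A4]$$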
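The Woodbury step can be checked numerically on a toy example. The sketch below assumes the covariance takes the form of an error variance times the identity plus a per-SNP variance times $ZZ^T$ (our reading of the model in Eq. 2); the matrix values, variance components, and helper functions are all hypothetical, and the small Gauss–Jordan routine is only suitable for tiny, well-conditioned matrices.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def identity(n, scale=1.0):
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(A, B, sign=1.0):
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, c):
    return [[c * v for v in row] for row in A]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(M)
    aug = [[float(x) for x in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [x / pv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

Z = [[1.0, -0.5], [0.0, 1.5], [-1.0, -1.0]]  # toy N = 3, P = 2 matrix
s_u2, s_e2 = 0.3, 0.7                         # illustrative variance components
N, P = len(Z), len(Z[0])
Zt = transpose(Z)

# Direct route: invert the N x N covariance.
Sigma = add(identity(N, s_e2), scale(matmul(Z, Zt), s_u2))
direct = inverse(Sigma)

# Woodbury route: only a P x P matrix is inverted.
inner = inverse(add(identity(P, 1.0 / s_u2), scale(matmul(Zt, Z), 1.0 / s_e2)))
correction = scale(matmul(matmul(Z, inner), Zt), 1.0 / s_e2 ** 2)
woodbury = add(identity(N, 1.0 / s_e2), correction, sign=-1.0)

max_diff = max(abs(direct[i][j] - woodbury[i][j]) for i in range(N) for j in range(N))
```

The two routes agree to rounding error here; the identity changes which matrix is inverted, not the conditioning of the underlying problem.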
We now use this formulation to show that for a stratified population, the heritability estimates will be sensitive to the data used in the GWAS.
Instability of the second term in Eq. A3.
Using the SVD of () in Eq. A2, we have
[A5] |
Sylvester’s theorem for determinants (40) states that for invertible matrices , the determinant . Hence in Eq. A5, setting , , and , we obtain
[A6] |
Therefore, because ,
[A7] |
To find the MLEs of , using Eq. A3, we must differentiate with respect to and and set the derivatives to 0. The derivative of the first term is independent of , and the derivative of the second term is 0 because is independent of and . The last term in Eq. A7 is
[A8] |
For a stratified population, thousands of singular values, $w_i$, of $Z$ will be close to 0, and Eq. A8 will be extremely sensitive to small changes in the values of $w_i$.
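One way to see where terms in the reciprocal of the squared singular values can arise is sketched below. It assumes the covariance $\Sigma = \sigma_u^2 ZZ^T + \sigma_\varepsilon^2 I$, an SVD with orthogonal $U$ and singular values $w_i$ (notation assumed here), and strictly positive $w_i$; the arrangement of terms in Eqs. A5–A8 may differ from this sketch.

```latex
\begin{aligned}
\log\lvert\Sigma\rvert
  &= \log\bigl\lvert \sigma_\varepsilon^{2} I + \sigma_u^{2}\, U W W^{T} U^{T} \bigr\rvert
   = \sum_{i=1}^{N} \log\bigl(\sigma_\varepsilon^{2} + \sigma_u^{2} w_i^{2}\bigr) \\
  &= \sum_{i=1}^{N} \log w_i^{2}
   + \sum_{i=1}^{N} \log\Bigl(\sigma_u^{2} + \frac{\sigma_\varepsilon^{2}}{w_i^{2}}\Bigr),
   \qquad w_i > 0 .
\end{aligned}
```

In this form the first sum does not involve the variance parameters, while the second contains $1/w_i^2$, so a small absolute error in a near-zero $w_i$ produces a large change in the corresponding summand.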
Instability of the third term in Eq. A3.
From Eq. A4, we have
[A9] |
Consider just the factor with an underlying curly bracket. Suppose, we perturb to for some small γ. Then, that factor becomes
[A10] |
where we chose such that (the matrix can be trivially constructed to have elements only on its primary diagonal). Because a small perturbation of causes a large change in its spectral properties (26), the vector can be vastly different from , and hence is extremely sensitive to measurement errors in the phenotype.
Appendix B: The Likelihood Function and Bias.
Here, we reformulate the likelihood function in terms of the eigenvalues [ for ] and the eigenvectors ( for ) of the GRM, . Because , we have
[A11] |
Using the SVD of described in Singular Values and GCTA, we have . Now,
[A12] |
Premultiplying Eq. 3 by and using Eq. A12,
[A13] |
Setting the diagonal matrix, , we note that , where is defined in Eq. A2. Using Eq. A13, we can write the likelihood function as
[A14] |
Because is an orthogonal matrix, = 1. Therefore, we have
[A15] |
Using Eq. A15 in Eq. A14 and comparing with Eq. A3, we note that the first two terms in the likelihood function are identical. The only term whose derivative with respect to and depends on the eigenvectors is the third term, and therefore we analyze this term in more detail. The eigenvalues of will be for . The last term in Eq. A14 can be written as
[A16] |
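Written in the eigenbasis of the GRM, a Gaussian log likelihood of this kind separates into one term per eigenvalue. The sketch below assumes the model $y \sim N(0, \sigma_g^2 A + \sigma_\varepsilon^2 I)$, with `eigvals` the eigenvalues of $A$ and `proj` the projections of the observed phenotype onto the corresponding eigenvectors; the function name and the toy numbers are hypothetical, and the paper's Eqs. A14–A16 may group terms differently.

```python
from math import log, pi

def loglik_eigenbasis(eigvals, proj, s_g2, s_e2):
    """Gaussian log likelihood summed over eigen-directions of the GRM.

    eigvals : eigenvalues of the relatedness matrix
    proj    : projections of the phenotype vector onto the matching eigenvectors
    s_g2    : total additive genetic variance
    s_e2    : residual variance
    """
    total = 0.0
    for lam, t in zip(eigvals, proj):
        v = s_g2 * lam + s_e2  # variance of the phenotype along this eigenvector
        total += -0.5 * (log(2.0 * pi) + log(v) + t * t / v)
    return total

# Toy inputs: one large, one typical, and one near-zero eigenvalue.
eigvals = [2.5, 1.0, 0.01]
proj = [1.2, -0.4, 0.3]
value = loglik_eigenbasis(eigvals, proj, s_g2=0.3, s_e2=0.7)
```

The projections enter only through the eigenvectors, which is why errors in the estimated eigenvectors feed directly into the quadratic term of the likelihood.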
Now, suppose that the GRM, , estimated by GCTA differs from the “true” underlying GRM () by a small sampling error . Using standard results from perturbation theory, we can express the eigenvectors of in terms of the spectral properties of and the error matrix, , as
[A17] |
where is the eigenvalue of , is the singular value of , and and are the eigenvectors corresponding to and , respectively. Using Eq. A17 in Eq. A16, we get
[A18] |
Note that J is not symmetric in (because the second term is not symmetric). Differentiating J with respect to , we have
[A19] |
Note in Eq. A19 that unless is exactly 0, the MLE of will always have a factor that does not average out to 0 (because the second term is not symmetric in ), and therefore is guaranteed to be biased. Similar derivations show that the MLE of are also guaranteed to be biased.
Case 1: There is no stratification in the population.
When the assumptions of GCTA are met exactly, asymptotically has an N variate Wishart distribution with P degrees of freedom (15). Marčenko–Pastur theory (13) provides an empirical distribution (which we henceforth refer to as the M-P distribution) for the eigenvalues of the variance-covariance matrix (which in our case will be the GRM, ) as a function of , which although asymptotic, works well even for sample sizes as small as (Fig. 1B). The distribution shows that most of the eigenvalues of the GRM will lie in 0–4. Note from Fig. 1A that when is close to 1, the eigenvalues will be skewed and as the value of becomes smaller, the eigenvalues become concentrated on smaller intervals. For example, when , the eigenvalues lie within 0.25–2.25 (Fig. 1A).
Because the M-P distribution is continuous, we assume without loss of generality that the eigenvalues are unique (i.e., there are no repeated eigenvalues). Because N is large, the eigenvalues of are necessarily packed extremely close to one another. To be specific, for and , the maximum value of the minimum separation between the eigenvalues will be . This upper bound is not tight; our ballpark estimate assumes a (best-case) uniform spread of eigenvalues on the interval. In reality, the distribution is peaked near 0.5 (Fig. 1A), and therefore we expect the eigenvalues near 0.5 to be a lot closer than 0.001 (for our simulation in Fig. 1B, the minimum spacing between the eigenvalues is ).
This small separation causes the eigenvectors of the estimated GRM to be drastically different from those of the true GRM. To see why, consider an eigenvalue (say the one) of , which is closely packed to another eigenvalue (say the one). From Eq. A17, the angle between and will have terms of the form , which will be large because is close to 0. This is just one of the terms; all eigenvalues that are “sticking” close to the eigenvalue will induce large differences between the estimated and true eigenvectors.
These differences are amplified in the expressions for the MLE and its derivatives (see the second and third summations in Eqs. A18 and A19). Therefore, the bias in the heritability estimates produced by GCTA can be large.
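The sensitivity of eigenvectors to closely packed eigenvalues can be seen in a two-dimensional toy case. The sketch below is not the GRM calculation itself; it assumes a diagonal matrix diag(1, 1 + gap) and a hypothetical symmetric error of size eps, and measures how far the first eigenvector rotates.

```python
import numpy as np

def rotation_of_first_eigenvector(gap, eps):
    """Angle (radians) by which the eigenvector of the smaller eigenvalue of
    diag(1, 1 + gap) rotates when a symmetric off-diagonal error eps is added."""
    perturbed = np.array([[1.0, eps], [eps, 1.0 + gap]])
    _, vecs = np.linalg.eigh(perturbed)    # eigenvalues come back in ascending order
    v = vecs[:, 0]                         # its unperturbed counterpart is (1, 0)
    return np.arctan2(abs(v[1]), abs(v[0]))

eps = 1e-4                                 # hypothetical size of the estimation error
angle_close = rotation_of_first_eigenvector(gap=1e-6, eps=eps)
angle_wide = rotation_of_first_eigenvector(gap=1.0, eps=eps)
print(angle_close, angle_wide)
```

For this 2 × 2 case the rotation angle θ satisfies tan 2θ = 2·eps/gap, so the same small error turns the eigenvector by almost 45 degrees when the gap is 1e-6 but by only about 1e-4 radians when the gap is 1.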
Case 2: The population is stratified.
In a stratified population, there are two sources of bias in the GCTA estimates. The first source of bias is identical to that described for the case with no stratification: in a stratified population, thousands of eigenvalues of the GRM are tightly packed near 0, and therefore, from Eqs. A18 and A19, large errors can be associated with the MLEs produced by GCTA.
The second source of bias comes from the large errors associated with the eigenvalues of the GRM (26). Specifically, suppose there are sampling errors and associated with two closely packed eigenvalues whose true magnitudes are and , respectively. We have
[A20] |
where . Because the errors in the near-zero eigenvalues can be large, κ need not be close to 0. As a result, all of the error terms in Eqs. A18 and A19 will have additional amplification factors of the form for every pair (p and q) of near-zero eigenvalues of the GRM.
The bootstrap and GCTA.
Here, we show that resampling techniques (e.g., the bootstrap or the jackknife) cannot be used to improve the estimates produced by GCTA. Loosely put, the bootstrap estimates the parameter in question (in this case, the heritability) by resampling from the original sample and relying on the fact that the sampling errors “average out” to 0. We have shown in Appendix B that the GCTA estimates are overly (erroneously) biased by local information.
For run i of a bootstrap, let the heritability estimate obtained by GCTA be , the biasing error be , the sampling error (with mean 0) be , and the “true” heritability estimate be . The bootstrap estimate can be expressed as
[A21] |
Taking the mean in Eq. A21 over sufficient bootstrap samples, we have
[A22] |
The bootstrap will estimate , which is not helpful because we have no handle on . More importantly, provides no useful information about .
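The averaging argument can be illustrated with a toy simulation. The sketch below assumes that each bootstrap estimate is the sum of a true value, a biasing error with nonzero mean, and a zero-mean sampling error; all numerical values and variable names are hypothetical and are not taken from the Framingham analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
h_true = 0.30                                # hypothetical "true" heritability
n_runs = 20000                               # number of bootstrap runs

bias = rng.normal(0.20, 0.05, n_runs)        # biasing error, nonzero mean (assumed)
noise = rng.normal(0.0, 0.10, n_runs)        # sampling error, mean 0
estimates = h_true + bias + noise            # assumed additive form of each run's estimate

boot_mean = estimates.mean()                 # sampling error averages out, bias does not
print(boot_mean, h_true)
```

The mean over runs settles near the true value plus the mean bias (about 0.5 here) rather than near 0.3, and nothing in the resampled estimates alone reveals how much of that mean is bias.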
Appendix C: Dynamics of GCTA and More Problems with Stratification.
Consider the most general case, where N random people and P random SNPs are used to construct the random genotype matrix (of which one realization is the observed data matrix ). From , we construct the matrix by centering and scaling so that the entries of matrix will be i.i.d., with mean 0 and variance 1. As a result, and will be random Wishart matrices (15), whose eigenvectors are known to be uniformly distributed over the unit sphere in N dimensions (41). This implies that, with high probability, the eigenvectors of ( for ) and ( for ) satisfy and , respectively (42), where the infinity norm of a vector x = (x_1, …, x_n) is defined as ‖x‖∞ = max_i |x_i|. In other words, the maximum entries in the eigenvectors are very small (for N = 2,698 and P = 49,214, and ).
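The claim that the largest entries of the eigenvectors are small can be checked on a simulated matrix. The sketch below assumes i.i.d. standard normal entries and hypothetical sizes far smaller than the N = 2,698 and P = 49,214 quoted above; it computes the largest absolute entry (the infinity norm) of every unit eigenvector of Z Zᵀ/P and prints a √(log N / N) reference scale for comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_snps = 200, 800                  # hypothetical sizes, chosen for speed

z = rng.standard_normal((n_people, n_snps))  # centered, unit-variance entries (assumed)
_, vecs = np.linalg.eigh(z @ z.T / n_snps)   # columns are unit-length eigenvectors

inf_norms = np.abs(vecs).max(axis=0)         # infinity norm of each eigenvector
reference = np.sqrt(np.log(n_people) / n_people)
print(inf_norms.max(), reference)
```

A unit vector in N dimensions cannot have infinity norm below 1/√N, so values of this order indicate eigenvectors whose weight is spread over nearly all individuals rather than concentrated on a few.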
Suppose the first singular values of are much larger than the others. We analyze , the term primarily responsible for the high heritability estimates obtained using Eq. 3. First, we express as a function of the singular values and singular vectors of , namely
[A23] |
In such cases, it is best to discard the near-zero singular values (26) while constructing the GRM. Accordingly, Eq. A23 becomes
[A24] |
In this expression, note that will be very close to 0 (because the norm of is close to 0) for all , and therefore the random effect term will never be significant. Because the norm of for is also close to 0, we expect to be close to the 0 matrix, which is observed to be true in most cases.
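Discarding near-zero singular values when building the GRM can be sketched as a truncated singular value decomposition. The code below assumes the GRM is formed as Z Zᵀ/P from a standardized matrix Z with people in rows; the function name `truncated_grm`, the cutoff k, and the simulated two-stratum data are illustrative assumptions and not the procedure used for the results in this paper.

```python
import numpy as np

def truncated_grm(z, k):
    """GRM rebuilt from only the k largest singular values of z (rows = people)."""
    u, s, _ = np.linalg.svd(z, full_matrices=False)   # s is sorted in descending order
    n_snps = z.shape[1]
    # Keep the first k left singular vectors and scale them by the squared singular values.
    return (u[:, :k] * s[:k] ** 2) @ u[:, :k].T / n_snps

rng = np.random.default_rng(3)
z = rng.standard_normal((50, 300))                    # hypothetical 50 people, 300 SNPs
z[:25] += 2.0 * rng.standard_normal(300)              # shared component for one stratum

grm_k = truncated_grm(z, k=2)                         # retain two leading components
grm_full = truncated_grm(z, k=50)                     # no truncation
print(np.linalg.matrix_rank(grm_k, tol=1e-8))
```

Keeping all 50 singular values reproduces Z Zᵀ/P exactly (up to rounding), whereas k = 2 yields a rank-2 matrix that retains only the dominant structure.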
GCTA tries to explain the variance of the phenotype vector by k random projections onto a plane defined by the columns of . In Eq. A23, we expressed these random projections using the left singular vectors of as a basis. When N is fixed and P grows (i.e., as we collect more genotypic information on a fixed set of individuals) and one (or a few) of the singular values is much larger than the rest, the singular vectors corresponding to these large singular values will be consistent, and the singular vectors corresponding to most of the remaining nonzero singular values will be strongly inconsistent (32). Informally put, consistency describes whether the singular vector estimates from the data matrix approach the “true” singular vectors as more data are collected. This implies that the subspace defined by the higher singular vectors of the data matrix contains little biological information about the genotype matrix. [Similar inconsistency results hold for the case where , and for some constant c (43)].
In Eq. A23, we showed that the random projection of onto the first singular vectors (given by ) is almost surely 0. Because the subspace defined by the higher singular vector estimates () is inconsistent, there is little biological connection between in Eq. A23 and the heritability estimate. It appears that it is this term that is responsible for the high values of the GCTA estimate (because the first term makes nearly 0 contribution to the projection). For every set of people and SNPs that are chosen, the data matrix generates a new set of arbitrary singular vectors, which in turn generate arbitrary and estimates.
Acknowledgments
We thank Chiara Sabatti, Chris Gignoux, David Golan, David Steinsaltz, Edgar Dobriban, Jonathan Pritchard, and Kenneth Wachter for useful comments on earlier drafts of this paper. This project is funded by National Institutes of Health Grant AG22500 (to S.T.) and the Morrison Institute for Population and Resource Studies. S.K.K. is funded by the Stanford Center for Computational, Evolutionary and Human Genomics. D.H.R. is supported by National Institute on Aging Grant K01AG047280. Our use of the Framingham data has been approved by the Stanford Institutional Review Board.
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
References
- 1. Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494.
- 2. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456(7218):18–21. doi: 10.1038/456018a.
- 3. Weedon MN, et al. Diabetes Genetics Initiative; Wellcome Trust Case Control Consortium; Cambridge GEM Consortium. Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008;40(5):575–583. doi: 10.1038/ng.121.
- 4. Yang J, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–569. doi: 10.1038/ng.608.
- 5. Lee SH, et al. Schizophrenia Psychiatric Genome-Wide Association Study Consortium (PGC-SCZ); International Schizophrenia Consortium (ISC); Molecular Genetics of Schizophrenia Collaboration (MGS). Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat Genet. 2012;44(3):247–250. doi: 10.1038/ng.1108.
- 6. Davies G, et al. Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Mol Psychiatry. 2011;16(10):996–1005. doi: 10.1038/mp.2011.85.
- 7. Boardman JD, et al. Is the gene-environment interaction paradigm relevant to genome-wide studies? The case of education and body mass index. Demography. 2014;51(1):119–139. doi: 10.1007/s13524-013-0259-4.
- 8. Charney E. September 19, 2013. Still chasing ghosts: A new genetic methodology will not find the “missing heritability”. Independent Science News.
- 9. Govindaraju DR, et al. Genetics of the Framingham heart study population. Adv Genet. 2008;62:33–65. doi: 10.1016/S0065-2660(08)00602-0.
- 10. Splansky GL, et al. The third generation cohort of the national heart, lung, and blood institute’s Framingham heart study: Design, recruitment, and initial examination. Am J Epidemiol. 2007;165(11):1328–1335. doi: 10.1093/aje/kwm021.
- 11. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414–4423. doi: 10.3168/jds.2007-0980.
- 12. Robinson GK. That blup is a good thing: The estimation of random effects. Stat Sci. 1991;6(1):15–32.
- 13. Marčenko VA, Pastur LA. Distribution of eigenvalues for some sets of random matrices. Sbornik Mathematics. 1967;1(4):457–483.
- 14. Wachter KW. The strong limits of random matrix spectra for sample matrices of independent elements. Ann Probab. 1978;6(1):1–18.
- 15. Johnstone IM. 2006. High dimensional statistical inference and random matrices. arXiv:math/0611589.
- 16. Purcell S, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–575. doi: 10.1086/519795.
- 17. Casella G, Berger RL. Statistical Inference. Vol 2 Duxbury; Pacific Grove, CA: 2002.
- 18. Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet. 2003;361(9357):598–604. doi: 10.1016/S0140-6736(03)12520-2.
- 19. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: Quantification of bias. J Natl Cancer Inst. 2000;92(14):1151–1158. doi: 10.1093/jnci/92.14.1151.
- 20. Heaton MP, et al. Selection and use of SNP markers for animal identification and paternity analysis in U.S. beef cattle. Mamm Genome. 2002;13(5):272–281. doi: 10.1007/s00335-001-2146-3.
- 21. Beraldi D, et al. Quantitative trait loci (QTL) mapping of resistance to strongyles and coccidia in the free-living Soay sheep (Ovis aries). Int J Parasitol. 2007;37(1):121–129. doi: 10.1016/j.ijpara.2006.09.007.
- 22. Tian C, Gregersen PK, Seldin MF. Accounting for ancestry: Population substructure and genome-wide association studies. Hum Mol Genet. 2008;17(R2):R143–R150. doi: 10.1093/hmg/ddn268.
- 23. Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–909. doi: 10.1038/ng1847.
- 24. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190. doi: 10.1371/journal.pgen.0020190.
- 25. Bryc K, Bryc W, Silverstein JW. Separation of the largest eigenvalues in eigenanalysis of genotype data from discrete subpopulations. Theor Popul Biol. 2013;89:34–43. doi: 10.1016/j.tpb.2013.08.004.
- 26. Stewart GW. Perturbation theory for the singular value decomposition. Institute for Advanced Computer Studies, University of Maryland; College Park, MD: 1998.
- 27. Kannel WB. Risk stratification in hypertension: New insights from the Framingham Study. Am J Hypertens. 2000;13(1 Pt 2):3S–10S. doi: 10.1016/s0895-7061(99)00252-6.
- 28. Barendse W. The effect of measurement error of phenotypes on genome wide association studies. BMC Genomics. 2011;12(1):232. doi: 10.1186/1471-2164-12-232.
- 29. Pickering TG, et al. Subcommittee of Professional and Public Education of the American Heart Association Council on High Blood Pressure Research. Recommendations for blood pressure measurement in humans and experimental animals: Part 1: Blood pressure measurement in humans: A statement for professionals from the Subcommittee of Professional and Public Education of the American Heart Association Council on High Blood Pressure Research. Hypertension. 2005;45(1):142–161. doi: 10.1161/01.HYP.0000150859.47929.8e.
- 30. Bérénos C, Ellis PA, Pilkington JG, Pemberton JM. Estimating quantitative genetic parameters in wild populations: A comparison of pedigree and genomic approaches. Mol Ecol. 2014;23(14):3434–3451. doi: 10.1111/mec.12827.
- 31. Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high dimensions. J Am Stat Assoc. 2009;104(486):682–693. doi: 10.1198/jasa.2009.0121.
- 32. Jung S, Marron J. PCA consistency in high dimension, low sample size context. Ann Stat. 2009;37(6B):4104–4130.
- 33. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Vol 2 Springer; New York: 2009.
- 34. Trzaskowski M, Yang J, Visscher PM, Plomin R. DNA evidence for strong genetic stability and increasing heritability of intelligence from age 7 to 12. Mol Psychiatry. 2014;19(3):380–384. doi: 10.1038/mp.2012.191.
- 35. Davis LK, et al. Partitioning the heritability of Tourette syndrome and obsessive compulsive disorder reveals differences in genetic architecture. PLoS Genet. 2013;9(10):e1003864. doi: 10.1371/journal.pgen.1003864.
- 36. de Candia TR, et al. International Schizophrenia Consortium; Molecular Genetics of Schizophrenia Collaboration. Additive genetic variation in schizophrenia risk is shared by populations of African and European descent. Am J Hum Genet. 2013;93(3):463–470. doi: 10.1016/j.ajhg.2013.07.007.
- 37. Crossett A, Lee AB, Klei L, Devlin B, Roeder K. Refining genetically inferred relationships using treelet covariance smoothing. Ann Appl Stat. 2013;7(2):669–690. doi: 10.1214/12-AOAS598.
- 38. Speed D, Balding DJ. Relatedness in the post-genomic era: Is it still useful? Nat Rev Genet. 2015;16(1):33–44. doi: 10.1038/nrg3821.
- 39. Horn RA, Johnson CR. Matrix Analysis. Cambridge Univ Press; Cambridge, UK: 2012.
- 40. Harville DA. Matrix Algebra from a Statistician’s Perspective. Vol 157 Springer; New York: 1997.
- 41. Bai Z, et al. On asymptotics of eigenvectors of large sample covariance matrix. Ann Probab. 2007;35(4):1532–1572.
- 42. Wang K. 2013. Optimal upper bound for the infinity norm of eigenvectors of random matrices. PhD thesis (Rutgers, The State University of New Jersey, New Brunswick, NJ).
- 43. Johnstone IM, Lu AY. 2004. Sparse principal components analysis. arXiv:0901.4392.