Abstract
For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which implements a method we recently developed to address the “missing heritability” problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP with the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.
Main Text
Despite the great success of genome-wide association studies (GWAS), which have identified hundreds of SNPs conferring the genetic variation of human complex diseases and traits,1 the genetic architecture of human complex traits still remains largely unexplained. For most traits, the associated SNPs from GWAS only explain a small fraction of the heritability.2,3 There has not been any consensus on the explanation of the “missing heritability.” Possible explanations include a large number of common variants with small effects, rare variants with large effects, and DNA structural variation.2,4 We recently proposed a method of estimating the total amount of phenotypic variance captured by all SNPs on the current generation of commercial genotyping arrays and estimated that ∼45% of the phenotypic variance for human height can be explained by all common SNPs.5 Thus, most of the heritability for height is hiding rather than missing because of many SNPs with small effects.5,6 In contrast to single-SNP association analysis, the basic concept behind our method is to fit the effects of all the SNPs as random effects by a mixed linear model (MLM),
$$y = X\beta + Wu + \varepsilon, \quad \text{with } V = WW'\sigma_u^2 + I\sigma_\varepsilon^2 \qquad \text{(Equation 1)}$$
where $y$ is an $n \times 1$ vector of phenotypes with $n$ being the sample size, $\beta$ is a vector of fixed effects such as sex, age, and/or one or more eigenvectors from principal component analysis (PCA), $u$ is a vector of SNP effects with $u \sim N(0, I\sigma_u^2)$, $I$ is an $n \times n$ identity matrix, and $\varepsilon$ is a vector of residual effects with $\varepsilon \sim N(0, I\sigma_\varepsilon^2)$. $W$ is a standardized genotype matrix with the $ij$th element $w_{ij} = (x_{ij} - 2p_i)/\sqrt{2p_i(1-p_i)}$, where $x_{ij}$ is the number of copies of the reference allele for the $i$th SNP of the $j$th individual and $p_i$ is the frequency of the reference allele. If we define $A = WW'/N$ and define $\sigma_g^2$ as the variance explained by all the SNPs, i.e., $\sigma_g^2 = N\sigma_u^2$, with $N$ being the number of SNPs, then Equation 1 will be equivalent to:7–9
$$y = X\beta + g + \varepsilon, \quad \text{with } V = A\sigma_g^2 + I\sigma_\varepsilon^2 \qquad \text{(Equation 2)}$$
where $g$ is an $n \times 1$ vector of the total genetic effects of the individuals with $g \sim N(0, A\sigma_g^2)$, and $A$ is interpreted as the genetic relationship matrix (GRM) between individuals. We can therefore estimate $\sigma_g^2$ by the restricted maximum likelihood (REML) approach,10 relying on the GRM estimated from all the SNPs. Here we report a versatile tool called genome-wide complex trait analysis (GCTA), which implements the method of estimating variance explained by all SNPs, and extend the method to partition the genetic variance onto each of the chromosomes and also to estimate the variance explained by the X chromosome and test for dosage compensation in females. We developed GCTA in five function domains: data management, estimation of the GRM from a set of SNPs, estimation of the variance explained by all the SNPs on a single chromosome or the whole genome, estimation of linkage disequilibrium (LD) structure, and simulation.
Estimation of the Genetic Relationship from Genome-wide SNPs
One of the core functions of GCTA is to estimate the genetic relationships between individuals from the SNPs. From the definition above, the genetic relationship between individuals j and k can be estimated by the following equation:
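$$A_{jk} = \frac{1}{N}\sum_{i=1}^{N}\frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2p_i(1-p_i)} \qquad \text{(Equation 3)}$$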
We provide a function to iteratively exclude one individual of a pair whose relationship is greater than a specified cutoff value, e.g., 0.025, while retaining the maximum number of individuals in the data. For data collected from family or twin studies, we recommend that users estimate the genetic relationships with all of the autosomal SNPs and then use this option to exclude close relatives. The reason for exclusion is that the objective of the analysis is to estimate genetic variation captured by all the SNPs, just as GWAS does for single SNPs. Including close relatives, such as parent-offspring pairs and siblings, would result in the estimate of genetic variance being driven by the phenotypic correlations for these pairs (just as in pedigree analysis), and this estimate could be a biased estimate of total genetic variance, for example because of common environmental effects. Even if the estimate is not biased, its interpretation is different from the estimate from “unrelated” individuals: a pedigree-based estimator captures the contribution from all causal variants (across the entire allele frequency spectrum), whereas our method captures the contribution from causal variants that are in LD with the genotyped SNPs.
As a by-product, we provide a function in GCTA to calculate the eigenvectors of the GRM, which are asymptotically equivalent to those from the PCA implemented in EIGENSTRAT11 because the GRM (Ajk) defined in GCTA is approximately half of the covariance matrix (Ψjk) used in EIGENSTRAT. The only purpose of developing this function is to calculate eigenvectors and then include them in the model as covariates to capture variance due to population structure. More sophisticated analyses of the population structure can be found in programs such as EIGENSTRAT11 and STRUCTURE.12
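As an illustration of Equation 3 and of the relatedness cutoff, the following is a minimal pure-Python sketch, not GCTA's own code. The function names `grm` and `prune_related` are hypothetical, and the greedy removal rule (drop the individual with the most relatives above the cutoff, repeat) is an assumption; the text only states that one individual of each related pair is excluded while retaining as many individuals as possible.

```python
from math import sqrt


def grm(genotypes, freqs):
    """Equation 3. genotypes[i][j] is the reference-allele count (0, 1, 2)
    of SNP i in individual j; freqs[i] is the reference-allele frequency."""
    n_snps, n_ind = len(genotypes), len(genotypes[0])
    # Standardized genotypes w_ij = (x_ij - 2 p_i) / sqrt(2 p_i (1 - p_i))
    w = [[(x - 2 * p) / sqrt(2 * p * (1 - p)) for x in row]
         for row, p in zip(genotypes, freqs)]
    a = [[0.0] * n_ind for _ in range(n_ind)]
    for j in range(n_ind):
        for k in range(j, n_ind):
            a[j][k] = a[k][j] = sum(w[i][j] * w[i][k] for i in range(n_snps)) / n_snps
    return a


def prune_related(a, cutoff=0.025):
    """Greedy heuristic (an assumption, not necessarily GCTA's rule):
    repeatedly drop the individual with the most relationships above cutoff."""
    keep = set(range(len(a)))
    while keep:
        counts = {j: sum(1 for k in keep if k != j and a[j][k] > cutoff)
                  for j in sorted(keep)}
        worst = max(counts, key=counts.get)
        if counts[worst] == 0:
            break
        keep.remove(worst)
    return sorted(keep)
```

For real data sets the double loop would be replaced by a matrix product, but the element-wise form makes the correspondence with Equation 3 explicit.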
Estimation of the Variance Explained by Genome-wide SNPs by REML
The GRM estimated from the SNPs can be fitted subsequently in an MLM to estimate the variance explained by these SNPs via the REML method.10 Previously, we included only one genetic factor in the model. Here we extend the model in a general form as
$$y = X\beta + \sum_{i=1}^{r} g_i + \varepsilon, \quad \text{with } V = \sum_{i=1}^{r} A_i\sigma_i^2 + I\sigma_\varepsilon^2$$
where $g_i$ is a vector of random genetic effects, which could be the total genetic effects for the whole genome or for a single chromosome. In this model, the phenotypic variance ($\sigma_P^2$) is partitioned into the variance explained by each of the genetic factors and the residual variance,
$$\sigma_P^2 = \sum_{i=1}^{r}\sigma_i^2 + \sigma_\varepsilon^2$$
where $\sigma_i^2$ is the variance of the $i$th genetic factor with its corresponding GRM, $A_i$.
In GCTA, we provide flexible options to specify different genetic models. For example:
(1) To estimate the variance explained by all autosomal SNPs, we can specify the model as $y = X\beta + g + \varepsilon$ with $V = A_g\sigma_g^2 + I\sigma_\varepsilon^2$, where $g$ is an $n \times 1$ vector of the aggregate effects of all the autosomal SNPs for all of the individuals and $A_g$ is the GRM estimated from these SNPs. This model is the same as Equation 2.
(2) To estimate the variance of genotype-environment interaction effects ($\sigma_{ge}^2$), we can specify the model as $y = X\beta + g + ge + \varepsilon$ with $V = A_g\sigma_g^2 + A_{ge}\sigma_{ge}^2 + I\sigma_\varepsilon^2$, where $ge$ is a vector of genotype-environment interaction effects for all of the individuals with $A_{ge} = A_g$ for the pairs of individuals in the same environment and with $A_{ge} = 0$ for the pairs of individuals in different environments.
(3) To partition genetic variance onto each of the 22 autosomes, we can specify the model as $y = X\beta + \sum_{i=1}^{22} g_i + \varepsilon$ with $V = \sum_{i=1}^{22} A_i\sigma_i^2 + I\sigma_\varepsilon^2$, where $g_i$ is a vector of genetic effects attributed to the $i$th chromosome and $A_i$ is the GRM estimated from the SNPs on the $i$th chromosome.
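The relationship matrices used by models (2) and (3) can be derived from GRMs that have already been computed. The sketch below is illustrative only; `gxe_grm` and `pooled_grm` are hypothetical helper names. The pooling rule follows from Equation 3 being an average over SNPs, so a whole-genome GRM is the SNP-count-weighted mean of the per-chromosome GRMs.

```python
def gxe_grm(a_g, env):
    """Model (2): A_ge equals A_g for pairs of individuals in the same
    environment and 0 for pairs in different environments."""
    n = len(a_g)
    return [[a_g[j][k] if env[j] == env[k] else 0.0 for k in range(n)]
            for j in range(n)]


def pooled_grm(chrom_grms, snp_counts):
    """Combine per-chromosome GRMs (model 3) into one genome-wide GRM,
    weighting each chromosome by its number of SNPs."""
    total = sum(snp_counts)
    n = len(chrom_grms[0])
    return [[sum(m * g[j][k] for g, m in zip(chrom_grms, snp_counts)) / total
             for k in range(n)] for j in range(n)]
```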
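GCTA implements the REML method via the average information (AI) algorithm.13 In the REML iteration process, the estimates of variance components from the $t$th iteration are updated by
$$\theta^{(t+1)} = \theta^{(t)} + \left(AI^{(t)}\right)^{-1}\left.\frac{\partial L}{\partial\theta}\right|_{\theta^{(t)}}$$
where $\theta$ is a vector of variance components ($\sigma_1^2$, …, $\sigma_r^2$ and $\sigma_\varepsilon^2$); $L$ is the log likelihood function of the MLM (ignoring the constant), $L = -\frac{1}{2}\left(\log|V| + \log|X'V^{-1}X| + y'Py\right)$ with $P = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$; $AI$ is the average of the observed and expected information matrices,
$$AI = \frac{1}{2}\begin{bmatrix} y'PA_1PA_1Py & \cdots & y'PA_1PA_rPy & y'PA_1PPy \\ \vdots & \ddots & \vdots & \vdots \\ y'PA_rPA_1Py & \cdots & y'PA_rPA_rPy & y'PA_rPPy \\ y'PPA_1Py & \cdots & y'PPA_rPy & y'PPPy \end{bmatrix}$$
and $\partial L/\partial\theta$ is a vector of first derivatives of the log likelihood function with respect to each variance component,13
$$\frac{\partial L}{\partial\theta} = -\frac{1}{2}\begin{bmatrix} \mathrm{tr}(PA_1) - y'PA_1Py \\ \vdots \\ \mathrm{tr}(PA_r) - y'PA_rPy \\ \mathrm{tr}(P) - y'PPy \end{bmatrix}$$
At the beginning of the iteration process, all of the components are initialized by an arbitrary value, i.e., $\sigma_i^{2(0)} = \sigma_P^2/(r+1)$, which is subsequently updated by the expectation maximization (EM) algorithm, $\sigma_i^{2(1)} = \left[\sigma_i^{4(0)}\,y'PA_iPy + \mathrm{tr}\left(\sigma_i^{2(0)}I - \sigma_i^{4(0)}PA_i\right)\right]/n$. The EM algorithm is used as an initial step to determine the direction of the iteration updates because it is robust to poor starting values. After one EM iteration, GCTA switches to the AI algorithm for the remaining iterations until the iteration converges with the criteria of $L^{(t+1)} - L^{(t)} < 10^{-4}$, where $L^{(t)}$ is the log likelihood of the $t$th iteration. In the iteration process, any component that escapes from the parameter space (i.e., its estimate is negative) will be set to $10^{-6} \times \sigma_P^2$. If a component keeps escaping from the parameter space, it will be constrained at $10^{-6} \times \sigma_P^2$.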
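The quantities in one AI-REML update can be sketched with NumPy as follows. This is an illustrative sketch under standard REML definitions, not GCTA's implementation: the function names `reml_pieces` and `ai_step` are hypothetical, the explicit inverse of V is used for readability rather than efficiency, and taking the sample variance of y as the phenotypic variance in the boundary rule is an assumption.

```python
import numpy as np


def reml_pieces(y, X, grms, theta):
    """Log likelihood, score vector and AI matrix for variance components
    theta = (sigma_1^2, ..., sigma_r^2, sigma_e^2)."""
    n = len(y)
    mats = list(grms) + [np.eye(n)]        # identity pairs with the residual variance
    V = sum(t * M for t, M in zip(theta, mats))
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    Py = P @ y
    logL = -0.5 * (np.linalg.slogdet(V)[1] + np.linalg.slogdet(XtViX)[1] + y @ Py)
    # First derivatives: -1/2 [tr(P A_i) - y'P A_i P y]
    score = np.array([-0.5 * (np.trace(P @ M) - Py @ M @ Py) for M in mats])
    # AI_ij = 1/2 y'P A_i P A_j P y = 1/2 (A_i P y)' P (A_j P y)
    MPy = [M @ Py for M in mats]
    AI = 0.5 * np.array([[a @ P @ b for b in MPy] for a in MPy])
    return logL, score, AI


def ai_step(y, X, grms, theta):
    """One AI update; negative components are reset to 1e-6 * var(y)."""
    _, score, AI = reml_pieces(y, X, grms, theta)
    new = np.asarray(theta, dtype=float) + np.linalg.solve(AI, score)
    floor = 1e-6 * np.var(y, ddof=1)
    return np.where(new < 0, floor, new)
```

Calling `ai_step` repeatedly until the change in `logL` falls below 1e-4 mirrors the convergence rule described above; the initial EM step is omitted from this sketch.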
From the REML analysis, GCTA has an option to provide the best linear unbiased prediction (BLUP) of the total genetic effect for all individuals. BLUP is widely used by plant and animal breeders to quantify the breeding value of individuals in artificial selection programs14 and also by evolutionary geneticists.15 Consider Equations 1 and 2, i.e., $y = X\beta + Wu + \varepsilon$ and $y = X\beta + g + \varepsilon$. Because these two models are mathematically equivalent,7–9 the BLUP of $g$ can be transformed to the BLUP of $u$ by $\hat{u} = W'A^{-1}\hat{g}/N$. Here the estimate of $u_i$ corresponds to the coefficient $w_{ij}$, which is then rescaled for the original $x_{ij}$ by $\hat{u}_i/\sqrt{2p_i(1-p_i)}$. We could obtain the BLUP of SNP effects in a discovery set by GCTA and predict genetic values of the individuals in a validation set ($\hat{g}_{new} = W_{new}\hat{u}$). For example, GCTA could be used to predict SNP effects in a discovery set, and the SNP effects could be used in PLINK to predict whole-genome profiles via the scoring approach in a validation set. If the predictions are unbiased, then the regression slope of the observed phenotypes on the predicted genetic values is 1.14 In that case, the genetic value calculated based on the BLUP of SNP effects is an unbiased predictor of the true genetic value in the validation set ($g_{new}$), in the sense that $E(g_{new} \mid \hat{g}_{new}) = \hat{g}_{new}$.16,17 Prediction analyses of human complex traits have demonstrated that many SNPs that do not pass the genome-wide significance level have substantial contribution to the prediction.18,19 This option is therefore useful for the whole-genome prediction analysis with all of the SNPs, irrespective of their association p values.
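The transformation from the BLUP of total genetic values to the BLUP of SNP effects can be checked numerically. The sketch below uses simulated data and treats the variance components as known, whereas in practice they would come from the REML analysis; all variable names are illustrative. Genotypes are standardized with the simulating allele frequencies rather than sample frequencies, an assumption made here so that A is invertible (centering on sample frequencies would make A singular).

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 20, 200                                   # individuals, SNPs
p = rng.uniform(0.1, 0.5, size=N)                # assumed population allele frequencies
x = rng.binomial(2, p, size=(n, N))              # rows: individuals, columns: SNPs
W = (x - 2 * p) / np.sqrt(2 * p * (1 - p))       # standardized genotypes
A = W @ W.T / N                                  # GRM

sigma_g2, sigma_e2 = 0.5, 0.5                    # treated as known for illustration
y = rng.standard_normal(n)
X = np.ones((n, 1))                              # intercept only
V = A * sigma_g2 + np.eye(n) * sigma_e2
Vi = np.linalg.inv(V)
beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)   # generalized least squares
resid = y - X @ beta

g_hat = sigma_g2 * A @ Vi @ resid                # BLUP of total genetic values
u_hat = W.T @ np.linalg.solve(A, g_hat) / N      # BLUP of SNP effects
u_per_allele = u_hat / np.sqrt(2 * p * (1 - p))  # rescaled to allele counts

# Predicted genetic values for new individuals (a validation set)
x_new = rng.binomial(2, p, size=(5, N))
W_new = (x_new - 2 * p) / np.sqrt(2 * p * (1 - p))
g_new_hat = W_new @ u_hat
```

The per-allele effects are the form a scoring tool would consume: the weighted sum of centered allele counts `(x_new - 2 * p) @ u_per_allele` reproduces `g_new_hat`.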
Estimation of the Variance Explained by the SNPs on the X Chromosome
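The method of estimating the genetic relationship from the X chromosome is different to that for the autosomal SNPs, because males have only one X chromosome. We modified Equation 3 for the X chromosome as:
$$A^X_{jk} = \begin{cases} \dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\dfrac{(x^M_{ij} - p_i)(x^M_{ik} - p_i)}{p_i(1-p_i)} & \text{for a male-male pair} \\ \dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\dfrac{(x^F_{ij} - 2p_i)(x^F_{ik} - 2p_i)}{2p_i(1-p_i)} & \text{for a female-female pair} \\ \dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\dfrac{(x^M_{ij} - p_i)(x^F_{ik} - 2p_i)}{\sqrt{2}\,p_i(1-p_i)} & \text{for a male-female pair} \end{cases}$$
where $x^M_{ij}$ and $x^F_{ij}$ are the number of copies of the reference allele for an X chromosome SNP for a male and a female, respectively.
Assuming the male-female genetic correlation to be 1, the X-linked phenotypic covariance between a pair of individuals is:20
$$\mathrm{cov}(y_j, y_k) = \begin{cases} A^X_{jk}\,\sigma^2_{X(M)} & \text{for a male-male pair} \\ A^X_{jk}\,\sigma^2_{X(F)} & \text{for a female-female pair} \\ A^X_{jk}\,\sigma_{X(M)}\sigma_{X(F)} & \text{for a male-female pair} \end{cases}$$
where $\sigma^2_{X(M)}$ and $\sigma^2_{X(F)}$ are the genetic variance attributed to the X chromosome for males and females, respectively.
The relative values of $\sigma^2_{X(M)}$ and $\sigma^2_{X(F)}$ depend on the assumption made regarding dosage compensation for X chromosome genes. There are two alleles per locus in females, but only one in males. If we assume that each allele has a similar effect on the trait (i.e., no dosage compensation), the genetic variance on the X chromosome for females is twice that for males: i.e., $\sigma^2_{X(F)} = 2\sigma^2_{X(M)}$. Thus,
$$\mathrm{cov}(y_j, y_k) = \begin{cases} \tfrac{1}{2}A^X_{jk}\,\sigma^2_{X(F)} & \text{for a male-male pair} \\ A^X_{jk}\,\sigma^2_{X(F)} & \text{for a female-female pair} \\ \sqrt{\tfrac{1}{2}}\,A^X_{jk}\,\sigma^2_{X(F)} & \text{for a male-female pair} \end{cases}$$
This can be implemented by redefining GRM for the X chromosome as $A^X_{ND} = \tfrac{1}{2}A^X$ for male-male pairs, $A^X_{ND} = A^X$ for female-female pairs, and $A^X_{ND} = \sqrt{1/2}\,A^X$ for male-female pairs. If we assume that each allele in females has only half the effect of an allele in males (i.e., full dosage compensation), the X-linked genetic variance for females is half that for males: i.e., $\sigma^2_{X(F)} = \tfrac{1}{2}\sigma^2_{X(M)}$. Thus,
$$\mathrm{cov}(y_j, y_k) = \begin{cases} 2A^X_{jk}\,\sigma^2_{X(F)} & \text{for a male-male pair} \\ A^X_{jk}\,\sigma^2_{X(F)} & \text{for a female-female pair} \\ \sqrt{2}\,A^X_{jk}\,\sigma^2_{X(F)} & \text{for a male-female pair} \end{cases}$$
Therefore, the raw $A^X$ matrix should be parameterized as $A^X_{FD} = 2A^X$ for male-male pairs, $A^X_{FD} = A^X$ for female-female pairs, and $A^X_{FD} = \sqrt{2}\,A^X$ for male-female pairs. The third possibility is to assume equal genetic variance on the X chromosome for males and females, i.e., $\sigma^2_{X(F)} = \sigma^2_{X(M)}$, in which case the $A^X$ matrix is not redefined at all.
We can estimate $\sigma^2_{X(F)}$ by fitting the model $y = X\beta + g_X + \varepsilon$, where $g_X$ is a vector of genetic effects attributable to the X chromosome, with $\mathrm{var}(g_X) = A^X_{ND}\,\sigma^2_{X(F)}$ assuming no dosage compensation, $\mathrm{var}(g_X) = A^X_{FD}\,\sigma^2_{X(F)}$ assuming full dosage compensation, and $\mathrm{var}(g_X) = A^X\sigma^2_{X(F)}$ assuming equal X-linked genetic variance for males and females. Test of dosage compensation can be achieved by comparing the likelihoods of model fitting under the three assumptions.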
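The X chromosome GRM and its three parameterizations can be sketched in a few lines. This is an illustration, not GCTA's code; `x_grm`, `rescale_for_dosage`, and the model labels are hypothetical names. The sketch relies on the male-female scaling factor being the geometric mean of the male-male and female-female factors, so one per-individual factor suffices.

```python
from math import sqrt


def x_grm(genotypes, freqs, is_male):
    """Raw X-chromosome GRM. genotypes[i][j] is the reference-allele count of
    SNP i in individual j (0 or 1 for males, 0, 1 or 2 for females)."""
    n_snps, n_ind = len(genotypes), len(is_male)

    def w(i, j):
        p, x = freqs[i], genotypes[i][j]
        if is_male[j]:                              # one X chromosome
            return (x - p) / sqrt(p * (1 - p))
        return (x - 2 * p) / sqrt(2 * p * (1 - p))  # two X chromosomes

    return [[sum(w(i, j) * w(i, k) for i in range(n_snps)) / n_snps
             for k in range(n_ind)] for j in range(n_ind)]


# Male-male scaling factor under each assumption about dosage compensation
MALE_FACTOR = {"none": 0.5, "full": 2.0, "equal": 1.0}


def rescale_for_dosage(a_x, is_male, model):
    """Scale male-male entries by c, male-female entries by sqrt(c), and leave
    female-female entries unchanged, where c = MALE_FACTOR[model]."""
    c = MALE_FACTOR[model]
    s = [sqrt(c) if m else 1.0 for m in is_male]
    n = len(s)
    return [[s[j] * s[k] * a_x[j][k] for k in range(n)] for j in range(n)]
```

Fitting the same phenotype with each of the three rescaled matrices and comparing the resulting log likelihoods corresponds to the test of dosage compensation described above.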
Estimation of the Variance Explained by Genome-wide SNPs for a Case-Control Study
The methodology described above is also applicable for case-control data, for which the estimate of variance explained by the SNPs corresponds to variation on the observed 0–1 scale. Under the assumption of a threshold-liability model for a disease, i.e., disease liability on the underlying scale follows standard normal distribution,21 the estimate of variance explained by the SNPs on the observed 0–1 scale can be transformed to that on the unobserved continuous liability scale by a linear transformation.22 The relationship between additive genetic variance on the observed 0–1 and unobserved liability scales was proposed more than a half century ago,23,24 and we recently extended this transformation to account for ascertainment bias in a case-control study, i.e., a much higher proportion of cases in the sample than in the general population (unpublished data). We provide options in GCTA to analyze a binary trait and to transform the estimate on the 0–1 scale to that on the liability scale with an adjustment for ascertainment bias. There is an important caveat in applying the methods described herein to case-control data. Any batch, plate, or other technical artifact that causes allele frequencies between case and control on average to be more different than that under the null hypothesis stating that the samples come from the same population will contribute to the estimation of spurious genetic variation, because cases will appear to be more related to other cases than to controls. Therefore, stringent quality control is essential when applying GCTA to case-control data. Quantitative traits are less likely to suffer from technical genotyping artifacts because they will generally not lead to spurious association between continuous phenotypes and genotypes.
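The linear transformation from the observed 0–1 scale to the liability scale can be sketched as below. The classical factor K(1 − K)/z² is the well-known threshold-model relationship; the additional ascertainment factor K(1 − K)/[P(1 − P)] is an assumption here, because the text refers to the adjustment without stating its form. The function name and arguments are hypothetical.

```python
from statistics import NormalDist


def observed_to_liability(h2_obs, prevalence, case_fraction=None):
    """Transform variance explained on the observed 0-1 scale to the
    liability scale. prevalence is K; case_fraction is the proportion of
    cases in the sample (P). If P is omitted, P = K (no ascertainment)."""
    k = prevalence
    p = k if case_fraction is None else case_fraction
    std_normal = NormalDist()
    t = std_normal.inv_cdf(1 - k)      # liability threshold for prevalence K
    z = std_normal.pdf(t)              # normal density at the threshold
    return h2_obs * (k * (1 - k)) ** 2 / (z ** 2 * p * (1 - p))
```

For a disease with 1% prevalence studied with equal numbers of cases and controls, a call such as `observed_to_liability(0.2, 0.01, case_fraction=0.5)` would apply both factors.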
Estimation of the Inbreeding Coefficient from Genome-wide SNPs
Apart from estimating the genetic relatedness between individuals, GCTA also has a function to estimate the inbreeding coefficient ($F$) from SNP data, i.e., the relationship between haplotypes within an individual. Two estimates have been used: one based on the variance of additive genetic values (diagonal of the SNP-derived GRM) and the other based on SNP homozygosity (implemented in PLINK).25 Let $(1-p_i)^2 + p_i(1-p_i)F$, $2p_i(1-p_i)(1-F)$, and $p_i^2 + p_i(1-p_i)F$ be the frequencies of the three genotypes of a SNP $i$ and let $h_i = 2p_i(1-p_i)$. The estimate based on the variance of additive genotype values is
$$\hat{F}^{I}_i = \frac{[x_i - E(x_i)]^2}{h_i} - 1 = \frac{(x_i - 2p_i)^2}{h_i} - 1$$
where $x_i$ is the number of copies of the reference allele for the $i$th SNP. This is a special case of Equation 3 for a single SNP when $j = k$. The estimate based upon excess homozygosity is
$$\hat{F}^{II}_i = \frac{O(\#\,hom) - E(\#\,hom)}{1 - E(\#\,hom)} = 1 - \frac{x_i(2 - x_i)}{h_i}$$
where $O(\#\,hom)$ and $E(\#\,hom)$ are the observed and expected number of homozygous genotypes in the sample, respectively. Both estimators are unbiased estimates of $F$ in the sense that $E(\hat{F}^{I}_i \mid F) = E(\hat{F}^{II}_i \mid F) = F$, but their sampling variances are dependent on allele frequency, i.e., $(1 - h_i)/h_i$ if $F = 0$. In addition, the covariance between the two estimators is $(3h_i - 1)/h_i + (1 - 2h_i)F/h_i - F^2$, so that the sampling covariance between the estimators is $(3h_i - 1)/h_i$ and the sampling correlation is $(3h_i - 1)/(1 - h_i)$ when $F = 0$. We proposed an estimator based upon the correlation between uniting gametes:5
$$\hat{F}^{III}_i = \frac{x_i^2 - (1 + 2p_i)x_i + 2p_i^2}{h_i}$$
$\hat{F}^{III}_i$ is also an unbiased estimator of $F$ in the sense that $E(\hat{F}^{III}_i \mid F) = F$. If $F = 0$, $\mathrm{var}(\hat{F}^{III}_i) = 1$ regardless of allele frequency, which is smaller than the sampling variance of $\hat{F}^{I}_i$ and $\hat{F}^{II}_i$, i.e., $1 \le (1 - h_i)/h_i$. When $0 < F < 1/3$, $\hat{F}^{III}_i$ also has a smaller variance than $\hat{F}^{I}_i$ and $\hat{F}^{II}_i$. In GCTA, we use $1 + \hat{F}^{III}$ rather than $1 + \hat{F}^{I}$ to calculate the diagonal of the GRM. For multiple SNPs, we average the estimates over all of the SNPs, i.e., $\hat{F} = \frac{1}{N}\sum_{i=1}^{N}\hat{F}_i$.
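The unbiasedness and sampling-variance statements for the three single-SNP estimators can be verified by enumerating the three genotypes with their probabilities. The sketch below does this exactly, without simulation; the function names are illustrative.

```python
def genotype_probs(p, f):
    """Probabilities of carrying 0, 1, 2 reference alleles given allele
    frequency p and inbreeding coefficient f."""
    q = 1 - p
    return {0: q * q + p * q * f, 1: 2 * p * q * (1 - f), 2: p * p + p * q * f}


def f_hat_1(x, p):
    """Estimator based on the variance of additive genotype values."""
    return (x - 2 * p) ** 2 / (2 * p * (1 - p)) - 1


def f_hat_2(x, p):
    """Estimator based on excess homozygosity."""
    return 1 - x * (2 - x) / (2 * p * (1 - p))


def f_hat_3(x, p):
    """Estimator based on the correlation between uniting gametes."""
    return (x * x - (1 + 2 * p) * x + 2 * p * p) / (2 * p * (1 - p))


def mean_var(estimator, p, f):
    """Exact expectation and variance of a single-SNP estimator."""
    probs = genotype_probs(p, f)
    mean = sum(pr * estimator(x, p) for x, pr in probs.items())
    var = sum(pr * (estimator(x, p) - mean) ** 2 for x, pr in probs.items())
    return mean, var
```

Note that `f_hat_3` is the average of `f_hat_1` and `f_hat_2` for every genotype, which is why its variance at F = 0 follows directly from the variances and covariance given above.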
Estimating LD Structure
In a standard GWAS, particularly with a large sample size, the mean (λmean) or median (λmedian) of the test statistics for single-SNP associations often deviates from its expected value under the null hypothesis of no association between any SNP and the phenotype, which is usually interpreted as the effect due to population stratification and/or cryptic relatedness.11,26,27 An alternative explanation is that polygenic variation causes the observed inflated test statistic.18 To predict the genomic inflation factors, λmean and λmedian, from polygenic parameters such as the total amount of variance that is explained by all SNPs, we need to quantify the LD structure between SNPs and putative causal variants (unpublished data). GCTA provides a function to search for all the SNPs in LD with the “causal variants” (mimicked by a set of SNPs chosen by the user). Given a causal variant, we use simple regression to test for SNPs in LD with the causal variant within d Mb distance in either direction. PLINK has an option (“show targets”) to select SNPs in LD with a set of target SNPs with LD r2 larger than a user-specified cutoff value. This function is very useful to distinguish independent association signals but less suited to predict λmean and λmedian, because the test statistics of the SNPs in modest LD with causal variants (SNPs at Mb distance with low r2) will also be inflated to a certain extent, and these test statistics will contribute to the genomic inflation factors.
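A window-based search for SNPs in LD with a chosen causal variant might look like the sketch below. It is only an illustration of the idea: the function name, the arguments, and the screening rule (keep a SNP when n·r² exceeds 3.84, an approximate 1-df chi-square threshold at the 5% level) are assumptions, since the text states only that simple regression is used within d Mb in either direction.

```python
def ld_with_causal(genotypes, positions_bp, causal, window_mb=1.0, min_chisq=3.84):
    """Return (snp_index, r2) for SNPs within window_mb of the causal variant
    whose squared correlation with it passes the n * r2 screening rule.
    genotypes[i] holds the allele counts of SNP i across individuals."""

    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    n = len(genotypes[causal])
    cx = centered(genotypes[causal])
    sxx = sum(v * v for v in cx)
    hits = []
    for i, (g, pos) in enumerate(zip(genotypes, positions_bp)):
        if i == causal or abs(pos - positions_bp[causal]) > window_mb * 1e6:
            continue
        cy = centered(g)
        syy = sum(v * v for v in cy)
        if sxx == 0 or syy == 0:       # monomorphic SNPs carry no LD information
            continue
        r2 = sum(a * b for a, b in zip(cx, cy)) ** 2 / (sxx * syy)
        if n * r2 > min_chisq:
            hits.append((i, r2))
    return hits
```

Unlike a fixed r² cutoff, a significance-based rule of this kind also retains SNPs in modest LD when the sample is large, which is the behaviour the paragraph above argues is needed for predicting the genomic inflation factors.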
GWAS Simulation
We provided a function to simulate GWAS data based on the observed genotype data. For a quantitative trait, the phenotypes are simulated by the simple additive genetic model $y = Wu + \varepsilon$, where the notation is the same as above. Given a set of SNPs assigned as causal variants, the effects of the causal variants are generated from a standard normal distribution, and the residual effects are generated from a normal distribution with mean of 0 and variance of $\sigma_g^2(1/h^2 - 1)$, where $\sigma_g^2$ is the empirical variance of $Wu$ and $h^2$ is the user specified heritability. For a case-control study, assuming a threshold-liability model, disease liabilities are simulated in the same way as that for the phenotypes of a quantitative trait. Any individual with disease liability exceeding a certain threshold $T$ is assigned to be a case and a control otherwise, where $T$ is the threshold of normal distribution truncating the proportion of $K$ (disease prevalence). The only purpose of this function is to do a simple simulation based on the observed genotype data. More complicated simulation can be performed with programs such as ms,28 GENOME,29 FREGENE,30 and HAPGEN.31
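The simulation scheme can be sketched with the Python standard library. The function names are hypothetical, and standardizing the simulated liabilities before applying the standard normal threshold is an assumption made so that the threshold T corresponds to prevalence K.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, pvariance


def simulate_quantitative(w_causal, h2, rng):
    """w_causal[j] holds the standardized genotypes of individual j at the
    causal SNPs. Returns phenotypes, genetic values and residual variance."""
    n_causal = len(w_causal[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(n_causal)]        # causal effects
    g = [sum(w * b for w, b in zip(row, u)) for row in w_causal]
    var_e = pvariance(g) * (1.0 / h2 - 1.0)                   # sigma_g^2 (1/h^2 - 1)
    y = [gj + rng.gauss(0.0, sqrt(var_e)) for gj in g]
    return y, g, var_e


def simulate_case_control(w_causal, h2, prevalence, rng):
    """Threshold-liability model: individuals whose standardized liability
    exceeds T are cases (1), the rest are controls (0)."""
    liability, _, _ = simulate_quantitative(w_causal, h2, rng)
    m, s = mean(liability), sqrt(pvariance(liability))
    t = NormalDist().inv_cdf(1.0 - prevalence)
    return [1 if (l - m) / s > t else 0 for l in liability]
```

Passing a seeded `random.Random` instance as `rng` makes a simulation reproducible.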
Data Management
We chose the PLINK25 compact binary file format (∗.bed, ∗.bim, and ∗.fam) as the input data format for GCTA because of its popularity in the genetics community and its efficiency of data storage. For the imputed dosage data, we use the output files of the imputation program MACH32 (∗.mldose.gz and ∗.mlinfo.gz) as the inputs for GCTA. For the convenience of analysis, we provide options to extract a subset of individuals and/or SNPs and to filter SNPs based on certain criteria, such as chromosome position, minor allele frequency (MAF), and imputation R2 (for the imputed data). However, we do not provide functions for a thorough quality control (QC) of the data, such as Hardy-Weinberg equilibrium test and missingness, because these functions have been well developed in many other genetic analysis packages, e.g., PLINK, GenABEL,33 and SNPTEST.34 We assume that the data have been cleaned by a standard QC process before entering into GCTA.
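For readers unfamiliar with the PLINK binary format, the sketch below decodes the contents of a SNP-major ∗.bed file into counts of the first allele listed in the ∗.bim file. It reflects the published PLINK format (a three-byte header followed by two bits per genotype, four genotypes per byte, lowest-order bits first) and is not taken from GCTA; the function name is hypothetical, and the number of individuals is assumed to be known from the ∗.fam file.

```python
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])   # third byte 0x01 marks SNP-major order

# Two-bit genotype code -> copies of the first (A1) allele; None = missing
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def decode_bed(data, n_individuals):
    """Decode the bytes of a SNP-major .bed file into one list per SNP."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK .bed file")
    block = (n_individuals + 3) // 4     # bytes per SNP, padded to a whole byte
    body = data[3:]
    if len(body) % block != 0:
        raise ValueError("file length does not match the number of individuals")
    snps = []
    for start in range(0, len(body), block):
        row = []
        for j in range(n_individuals):
            byte = body[start + j // 4]
            row.append(CODE_TO_A1_COUNT[(byte >> (2 * (j % 4))) & 0b11])
        snps.append(row)
    return snps
```

In practice the bytes would come from `open(path, "rb").read()`, and missing genotypes would need to be handled before computing a GRM.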
Estimating Total Heritability
The method implemented in GCTA is to estimate the variance explained by chromosome- or genome-wide SNPs rather than the trait heritability. Estimating the heritability (i.e., variance explained by all the causal variants), however, relies on the genetic relationship at causal variants that is predicted with error by the genetic relationship derived from the SNPs as a result of imperfect tagging. We have previously established that the prediction error is $c + 1/N$, with $c$ depending on the distribution of the MAF of causal variants. We therefore developed a method based on simple regression to correct for the prediction error by
$$A^{*}_{jk} = \begin{cases} \beta(A_{jk} - 1) + 1, & j = k \\ \beta A_{jk}, & j \ne k \end{cases}$$
where $\beta = 1 - (c + 1/N)/\mathrm{var}(A_{jk})$. The estimate of variance explained by all of the SNPs after such adjustment is an unbiased estimate of heritability only if the assumption about the MAF distribution of causal variants is correct.
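The regression-based adjustment of the GRM can be sketched as follows. The function name is hypothetical, the value of c must be supplied by the user according to the assumed MAF distribution of causal variants, and taking var(A_jk) as the variance of the off-diagonal elements is an assumption of this sketch.

```python
from statistics import pvariance


def adjust_grm(a, c, n_snps):
    """Shrink a GRM to correct for prediction error c + 1/N.
    Returns the adjusted matrix and the regression coefficient beta."""
    n = len(a)
    off_diagonal = [a[j][k] for j in range(n) for k in range(j)]
    beta = 1 - (c + 1 / n_snps) / pvariance(off_diagonal)
    adjusted = [[beta * (a[j][k] - 1) + 1 if j == k else beta * a[j][k]
                 for k in range(n)] for j in range(n)]
    return adjusted, beta
```

Because beta is below 1 whenever the prediction error is positive, the adjustment pulls off-diagonal relationships toward 0 and diagonal elements toward 1.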
Efficiency of GCTA Computing Algorithm
GCTA implements the REML method based on the variance-covariance matrix V and the projection matrix P. In some of the mixed model analysis packages, such as ASREML,35 to avoid the inversion of the n × n V matrix, people usually use Gaussian elimination of the mixed model equations (MME) to obtain the AI matrix based on sparse matrix techniques. The SNP-derived GRM matrix, however, is typically dense, so the sparse matrix technique will bring an extra cost of memory and CPU time. Moreover, the dimension of MME depends on the number of random effects in the model, whereas the V matrix does not. For example, when fitting the 22 chromosomes simultaneously in the model, the dimension of MME is 22n × 22n (ignoring the fixed effects), whereas the dimension of V matrix is still n × n. We compared the computational efficiency of GCTA and ASREML. When the sample size is small, e.g., n < 3000, both GCTA and ASREML take a few minutes to run. When the sample size is large, e.g., n > 10,000, especially when fitting multiple GRMs, it takes days for ASREML to finish the analysis, whereas GCTA needs only a few hours.
System Requirements
We have released executable versions of GCTA for the three major operating systems: MS Windows, Linux/Unix, and Mac OS. We have also released the source codes so that users can compile them for some specific platforms. GCTA requires a large amount of memory when calculating the GRM or performing an REML analysis with multiple genetic components. For example, it requires ∼4.8 GB memory to calculate the GRM for a data set with 3925 individuals genotyped by 294,831 SNPs, and it takes ∼4 CPU hours (AMD Opteron 2.8 GHz) to finish the computation. We therefore recommend using the 64-bit version of GCTA for large memory support.
Nonadditive Genetic Variance
The analysis approach we have adapted is a logical extension of estimation methods based on pedigrees. It allows estimation of additive genetic variation that is captured by SNP arrays and is therefore informative with respect to the genetic architecture of complex traits. The estimate of variance captured by all of the SNPs obtained in GCTA is directly comparable to the heritability estimated from pedigree analysis in family and twin studies, as well as the variance explained by GWAS hits, so that missing and hiding heritability can be quantified.5 Other sources of genetic variations such as dominance, gene-gene interaction, and gene-environment interaction are also important for complex trait variation but are less relevant to the “missing heritability” problem if the total heritability refers to the narrow-sense heritability, i.e., the proportion of phenotypic variance due to additive genetic variance. The current version of GCTA only provides functions to estimate and partition the variances of additive and additive-environment interaction effects. It is technically feasible to extend the analysis to include dominance and/or gene-gene interaction effects in the future. However, the power to detect the high-order genetic variation will be limited, i.e., the sampling variance of estimated variance components will be very large. Future developments will also include options to do multivariate analyses, to read genotype or imputed probability data in different formats, and to implement other applications of whole-genome or chromosome segment approaches.
In summary, we have developed a versatile tool to estimate genetic relationships from genome-wide SNPs that can subsequently be used to estimate variance explained by SNPs via a mixed model approach. We provide flexible options to specify different genetic models to partition genetic variance onto each of the chromosomes. We developed methods to estimate genetic relationships from the SNPs on the X chromosome and to test the hypotheses of dosage compensation. GCTA is not limited to the analysis of data on human complex traits, but in this report we only use examples and specifications (e.g., the number of autosomes) for humans.
Acknowledgments
We thank Bruce Weir for discussions on the sampling variance of estimators of inbreeding coefficients. We thank Allan McRae and David Duffy for discussions and Anna Vinkhuyzen for software testing. We acknowledge funding from the Australian National Health and Medical Research Council (grants 389892 and 613672) and the Australian Research Council (grants DP0770096 and DP1093900).
Web Resources
The URLs for data presented herein are as follows:
Genome-wide Complex Trait Analysis (GCTA), http://gump.qimr.edu.au/gcta
MACH 1.0: A Markov Chain-based haplotyper, http://www.sph.umich.edu/csg/yli/mach
References
1. Hindorff L.A., Sethupathy P., Junkins H.A., Ramos E.M., Mehta J.P., Collins F.S., Manolio T.A. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA. 2009;106:9362–9367. doi: 10.1073/pnas.0903103106.
2. Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., McCarthy M.I., Ramos E.M., Cardon L.R., Chakravarti A. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
3. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a.
4. Eichler E.E., Flint J., Gibson G., Kong A., Leal S.M., Moore J.H., Nadeau J.H. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 2010;11:446–450. doi: 10.1038/nrg2809.
5. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
6. Gibson G. Hints of hidden heritability in GWAS. Nat. Genet. 2010;42:558–560. doi: 10.1038/ng0710-558.
7. Hayes B.J., Visscher P.M., Goddard M.E. Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. 2009;91:47–60. doi: 10.1017/S0016672308009981.
8. Strandén I., Garrick D.J. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J. Dairy Sci. 2009;92:2971–2975. doi: 10.3168/jds.2008-1929.
9. VanRaden P.M. Efficient methods to compute genomic predictions. J. Dairy Sci. 2008;91:4414–4423. doi: 10.3168/jds.2007-0980.
10. Patterson H.D., Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58:545–554.
11. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847.
12. Falush D., Stephens M., Pritchard J.K. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. doi: 10.1093/genetics/164.4.1567.
13. Gilmour A.R., Thompson R., Cullis B.R. Average information REML: An efficient algorithm for variance parameters estimation in linear mixed models. Biometrics. 1995;51:1440–1450.
14. Henderson C.R. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975;31:423–447.
15. Kruuk L.E. Estimating genetic parameters in natural populations using the “animal model”. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2004;359:873–890. doi: 10.1098/rstb.2003.1437.
16. Goddard M.E., Wray N.R., Verbyla K., Visscher P.M. Estimating effects and making predictions from genome-wide marker data. Stat. Sci. 2009;24:517–529.
17. de Los Campos G., Gianola D., Allison D.B. Predicting genetic predisposition in humans: The promise of whole-genome markers. Nat. Rev. Genet. 2010;11:880–886. doi: 10.1038/nrg2898.
18. Purcell S.M., Wray N.R., Stone J.L., Visscher P.M., O'Donovan M.C., Sullivan P.F., Sklar P., International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752. doi: 10.1038/nature08185.
19. Lango Allen H., Estrada K., Lettre G., Berndt S.I., Weedon M.N., Rivadeneira F., Willer C.J., Jackson A.U., Vedantam S., Raychaudhuri S. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467:832–838. doi: 10.1038/nature09410.
20. Kent J.W., Jr., Dyer T.D., Blangero J. Estimating the additive genetic effect of the X chromosome. Genet. Epidemiol. 2005;29:377–388. doi: 10.1002/gepi.20093.
21. Lynch M., Walsh B. Genetics and Analysis of Quantitative Traits. Sinauer Associates; Sunderland, MA: 1998.
22. Falconer D.S. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann. Hum. Genet. 1965;29:51–76.
23. Dempster E.R., Lerner I.M. Heritability of threshold characters. Genetics. 1950;35:212–236. doi: 10.1093/genetics/35.2.212.
24. Robertson A., Lerner I.M. The heritability of all-or-none traits; viability of poultry. Genetics. 1949;34:395–411. doi: 10.1093/genetics/34.4.395.
25. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
26. Campbell C.D., Ogburn E.L., Lunetta K.L., Lyon H.N., Freedman M.L., Groop L.C., Altshuler D., Ardlie K.G., Hirschhorn J.N. Demonstrating stratification in a European American population. Nat. Genet. 2005;37:868–872. doi: 10.1038/ng1607.
27. Cardon L.R., Palmer L.J. Population stratification and spurious allelic association. Lancet. 2003;361:598–604. doi: 10.1016/S0140-6736(03)12520-2.
28. Hudson R.R. Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology. 1990;7:1–44.
29. Liang L., Zöllner S., Abecasis G.R. GENOME: A rapid coalescent-based whole genome simulator. Bioinformatics. 2007;23:1565–1567. doi: 10.1093/bioinformatics/btm138.
30. Hoggart C.J., Chadeau-Hyam M., Clark T.G., Lampariello R., Whittaker J.C., De Iorio M., Balding D.J. Sequence-level population simulations over large genomic regions. Genetics. 2007;177:1725–1731. doi: 10.1534/genetics.106.069088.
31. Spencer C.C., Su Z., Donnelly P., Marchini J. Designing genome-wide association studies: Sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 2009;5:e1000477. doi: 10.1371/journal.pgen.1000477.
32. Li Y., Abecasis G.R. Mach 1.0: Rapid Haplotype Reconstruction and Missing Genotype Inference. Am. J. Hum. Genet. 2006;S79:2290.
33. Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. GenABEL: An R library for genome-wide association analysis. Bioinformatics. 2007;23:1294–1296. doi: 10.1093/bioinformatics/btm108.
34. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
35. Gilmour A.R., Gogel B.J., Cullis B.R., Thompson R. ASReml User Guide Release 2.0. VSN International; Hemel Hempstead, UK: 2006.