Statistical genetic issues for genome-wide association studies

Bruce S Weir

doi:10.1139/G10-062

Author manuscript; available in PMC: 2015 Dec 21.

Published in final edited form as: Genome. 2010 Nov;53(11):869–875. doi: 10.1139/G10-062

PMCID: PMC4686343 NIHMSID: NIHMS744698 PMID: 21076502

Abstract

Genotyping technology now allows the rapid and affordable generation of million-SNP profiles for humans, leading to considerable activity in association mapping. Similar activity is anticipated for many plant species, including Brassica. These plant association mapping activities will require the same care in quality control and quality assurance as for humans. The subsequent analyses may draw upon the same body of theory that is described here in the language of quantitative genetics.

Keywords: linkage disequilibrium, case-control, quantitative trait, additive variance, dominance variance, trend test

Introduction

In 1969, forty years ago, I worked with plant geneticist R.W. Allard on some plant data generated by the new technique of electrophoresis (Allard et al. 1972). Allard’s laboratory had generated data at four distinct esterase allozyme loci for several generations of an experimental population of barley. The numbers of individual plants genotyped were in the thousands, but our pride in having four molecular markers seems quaint in the current climate of genome-wide association studies (GWAS). My colleagues and I have recently written about our work in preparing data for posting on the database of Genotype and Phenotype (dbGaP) of sets of one million single nucleotide polymorphisms (SNPs) typed on tens of thousands of humans (Laurie et al. 2010). Gore et al. (2009) have reported over two million SNPs among 27 inbred lines of maize. The potential for genetic improvement of crop plants from such whole-genome data sets is enormous, but it comes with the cost of new statistical genetic challenges. These challenges range from the mechanical issues of managing large data sets to the genetic issues of having to allow for dependencies among genetic markers. In this discussion I will show, however, that the data allow new insights into the genomic architecture as well as the genetic architecture of complex traits. Although whole-genome sequence data are not yet economically feasible for plant species (but see Varshney et al. 2009), we already have enough information to address some very long-standing problems in population and quantitative genetics.

GWAS data

Generating the data

A comparison of genomic DNA sequences in different individuals reveals some positions at which two, or in some cases more than two, bases can occur. These SNPs are highly abundant and they have been described as the “ultimate genetic marker” by Duran et al. (2008). Among the many excellent reviews of the technology being used to generate whole-genome sets of SNP genotypes is that of Syvänen (2001). Information on over 12 million SNPs for Homo sapiens is contained in the public dbSNP (available at http://ncbi.nlm.nih.gov/projects/SNP/), along with information for nine other species. For rice, barley, and Brassica there is the autoSNPdb (available at http://autosnpdb.qfab.org.au) described by Duran et al. (2008) that draws data from GenBank (available at http://ncbi.nlm.nih.gov/Genbank/). The success of GWAS has been described by The Wellcome Trust Case Control Consortium (2007) and many other authors.

Managing the data

We have found it convenient to store SNP data in the network Common Data Form (netCDF). NetCDF is a set of software libraries and machine-independent data formats, designed specifically for large sets of array-oriented scientific data, and its use is widespread in those sciences that require very large data sets. It is maintained by the Unidata program at the University Corporation for Atmospheric Research (available at http://unidata.ucar.edu/software/netcdf/). This portable binary file format is convenient since it allows efficient storage of multidimensional array data. The key feature is the speed with which data can be accessed from the file, which is not a trivial consideration when there are a million SNP genotypes for a sample of several thousand individuals. Function documentation can be found at http://cran.r-project.org/web/packages/ncdf/ncdf.pdf and a tutorial that details the steps to create netCDF files for raw SNP intensity files and called SNP genotype files is under preparation by my GENEVA colleagues (http://genevastudy.org). This tutorial describes a set of R programs that provide an alternative GWAS analysis platform to the very useful PLINK package (Purcell et al. 2007).

Cleaning the data

The production of GWAS data involves industrial-scale processes for which systems of quality assurance and quality control are essential (Laurie et al. 2010). Many association mapping methods involve comparisons of SNP allele frequencies among classes of individuals. Even small artifactual differences in allele frequencies can generate false-positive results and there have been several accounts of appropriate quality assurance/quality control (QA/QC) methods (Broman 1999; Chanock et al. 2007; The Wellcome Trust Case Control Consortium 2007; Manolio et al. 2007; Miyagawa et al. 2008).

Sample cleaning

In any large study there is the possibility of misannotation of individuals. For human studies these include wrongly identified gender, race and (or) ethnicity, and family membership. Similar problems can occur with plants, and Yu et al. (2006) referred to possible discrepancies between pedigree records and genetically determined relationships. For human and plant studies the genetic markers themselves allow population or family outliers to be identified.

A search for population outliers and possible misannotations can be accomplished with principal components analysis (PCA), where the million or so SNPs are reduced to a small number of components and individuals are plotted in two-dimensional spaces for pairs of the first few components. Methodology in the GWAS context has been described by Patterson et al. (2006) and Price et al. (2006) and their EIGENSOFT software is available at http://genepath.med.harvard.edu/~reich/Software.htm. An interesting finding from such analyses is that PCA can detect features of the genome: Novembre et al. (2008) reported that a principal component in their study detected polymorphism for an inversion on human chromosome 8. Similar results may be anticipated in plant populations and they raise the issue of how to select SNPs to include in PCA. It may or may not be desirable to have PCA reflect genomic structure.

High-density SNP data are also leading to an appreciation of the prevalence of aneuploidy in humans. Peiffer et al. (2006) and Maciejewski and Mufti (2008) have shown how the SNP intensity signals, and inferred copy number and heterozygosity levels, can identify large portions of a chromosome that do not occur in the diploid state. Similar analyses will be of importance in Brassica studies when there are ploidy issues. It is prudent to remove aberrant chromosomes from association analyses.

Finally, samples with low quantity DNA and (or) high missing SNP genotype rates are generally excluded from further analyses.

SNP cleaning

Even if a sample passes QA/QC checks, it is unlikely that all SNPs will have been genotyped with high quality. A SNP with a high missing rate over all samples is unlikely to be used for association mapping, any more than would be a SNP with a low concordance rate among duplicate samples or with high “Mendelian error” rate (where genotypes are not consistent with known pedigrees). SNPs that have outlying departures from Hardy–Weinberg equilibrium may not have been genotyped correctly and are often found to have substantial excesses or deficiencies of heterozygosity.

In these various SNP tests attention must be paid to multiple-testing issues. A simple Bonferroni correction, such as rejecting Hardy–Weinberg equilibrium for a SNP in a one-million SNP panel if the P value is less than 5 × 10⁻⁸, is unlikely to be as informative as rejecting for those SNPs with a P value that deviates from expectation in a Q-Q probability plot (Weir et al. 2004).
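The Hardy–Weinberg check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by any particular QA/QC pipeline: the function names `hwe_chi_square` and `bonferroni_flags` are hypothetical, the test is the ordinary one-df chi-square goodness-of-fit test (an exact test may be preferable for rare alleles), and the genotype counts in the usage note are invented.

```python
from math import erfc, sqrt


def hwe_chi_square(n_MM, n_Mm, n_mm):
    """One-df chi-square test of Hardy-Weinberg proportions for one SNP.

    Returns (chi-square statistic, P value, estimated inbreeding coefficient).
    """
    n = n_MM + n_Mm + n_mm
    p_M = (2 * n_MM + n_Mm) / (2 * n)  # sample frequency of allele M
    p_m = 1.0 - p_M
    if p_M == 0.0 or p_m == 0.0:
        raise ValueError("monomorphic SNP: the test is undefined")
    expected = (n * p_M ** 2, 2 * n * p_M * p_m, n * p_m ** 2)
    observed = (n_MM, n_Mm, n_mm)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df equals P(|Z| > sqrt(chi2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    # Positive values indicate a deficiency of heterozygotes.
    f_hat = 1.0 - n_Mm / (2 * n * p_M * p_m)
    return chi2, p_value, f_hat


def bonferroni_flags(p_values, alpha=0.05):
    """Flag SNPs whose P value falls below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

For example, `hwe_chi_square(40, 20, 40)` describes a SNP with a marked heterozygote deficiency, whereas `hwe_chi_square(25, 50, 25)` matches Hardy–Weinberg proportions exactly. With one million P values and `alpha=0.05`, `bonferroni_flags` applies the 5 × 10⁻⁸ threshold quoted in the text.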

A final QA/QC check on SNPs is made after the association tests have been run. It is useful to examine the “cluster plots” that display all individuals, colour-coded by assigned genotype, on a grid with the numbers of the two alleles as axes. There should be tight clusters at (2,0), (0,2), and (1,1) for the two homozygotes and the heterozygote.

Association mapping

A valuable review of association mapping in crop plants was given by Sorkheh et al. (2008). The following treatment allows for continuous or discrete traits and uses the classical language of quantitative genetics.

Allelic association

Whether locating the genes affecting economic traits is regarded as the first step in determining the genetic basis for the traits or as the basis for marker-assisted selection, it is helpful to cast the activity in terms of linkage disequilibrium between marker and trait loci. If SNP marker M has alleles M,m with frequencies p_M,p_m and a trait locus T has alleles T,t with frequencies p_T,p_t then for outbreeding random-mating populations the joint genotype frequencies can be set out as in Table 1. Each genotype probability is the product of gamete probabilities and these, in turn, can be expressed in terms of allelic frequencies and linkage disequilibrium D_MT as shown in Table 2. The most useful measure of association between marker and trait alleles is the correlation coefficient ρ_MT defined as

ρ_{M T} = \frac{D_{M T}}{\sqrt{p_{M} p_{m} p_{T} p_{t}}}

If variable x_M (or y_T) is defined for gametes to take the value 1 if the gamete carries allele M (or T) and 0 if the gamete carries allele m (or t), then D_MT is the covariance of x_M and y_T and ρ_MT is the correlation of these two variables.

Table 1.

Marker–trait genotype probabilities.

	TT	Tt	tt	Total
MM	P_MT²	2P_MTP_Mt	P_Mt²	p_M²
Mm	2P_MTP_mT	2P_MTP_mt + 2P_MtP_mT	2P_MtP_mt	2p_Mp_m
mm	P_mT²	2P_mTP_mt	P_mt²	p_m²
Total	p_T²	2p_Tp_t	p_t²	1


Table 2.

Marker–trait gamete probabilities.

	T	t	Total
M	P_MT =p_Mp_T + D_MT	P_Mt =p_Mp_t − D_MT	p_M
m	P_mT =p_mp_T − D_MT	P_mt =p_mp_t + D_MT	p_m
Total	p_T	p_t	1
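The relations in Table 2 can be inverted to recover D_MT and ρ_MT from the four gamete frequencies. The sketch below is illustrative only: the function name `ld_from_gametes` is hypothetical and the numerical values in the usage note are invented.

```python
from math import sqrt


def ld_from_gametes(P_MT, P_Mt, P_mT, P_mt):
    """Return (p_M, p_T, D_MT, rho_MT) from the four gamete frequencies."""
    total = P_MT + P_Mt + P_mT + P_mt
    if abs(total - 1.0) > 1e-9:
        raise ValueError("gamete frequencies must sum to one")
    p_M = P_MT + P_Mt          # row total for allele M in Table 2
    p_T = P_MT + P_mT          # column total for allele T in Table 2
    D = P_MT - p_M * p_T       # linkage disequilibrium coefficient
    rho = D / sqrt(p_M * (1.0 - p_M) * p_T * (1.0 - p_T))
    return p_M, p_T, D, rho
```

With p_M = 0.3, p_T = 0.1, and D_MT = 0.02, Table 2 gives gamete frequencies 0.05, 0.25, 0.05, and 0.65, and `ld_from_gametes(0.05, 0.25, 0.05, 0.65)` returns the same allele frequencies and disequilibrium.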


Trait association

The contribution of locus T to the trait of interest can be indicated by variable G and, still assuming random mating, the genetic mean and variance are

\begin{array}{c} μ_{G} = p_{T}^{2} G_{T T} + 2 p_{T} p_{t} G_{T t} + p_{t}^{2} G_{t t} \\ Var (G) = p_{T}^{2} G_{T T}^{2} + 2 p_{T} p_{t} G_{T t}^{2} + p_{t}^{2} G_{t t}^{2} - μ_{G}^{2} \end{array}

and the variance can be partitioned into additive and dominance components, $Var (G) = σ_{A_{T}}^{2} + σ_{D_{T}}^{2}$ where

\begin{array}{l} σ_{A_{T}}^{2} = 2 p_{T} p_{t} {[p_{T} (G_{T T} - G_{T t}) + p_{t} (G_{T t} - G_{t t})]}^{2} \\ σ_{D_{T}}^{2} = p_{T}^{2} p_{t}^{2} {(G_{T T} - 2 G_{T t} + G_{t t})}^{2} \end{array}
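As a numerical check on this partition, the following sketch evaluates the mean, total variance, and the two components for given genotypic values. The function name `single_locus_components` is hypothetical. The first set of values in the usage note is borrowed from the binary-trait example later in the text, and the second set is invented to show a case with dominance.

```python
def single_locus_components(p_T, G_TT, G_Tt, G_tt):
    """Return (mean, Var(G), additive variance, dominance variance)
    for one trait locus in a random-mating population."""
    p_t = 1.0 - p_T
    mean = p_T ** 2 * G_TT + 2 * p_T * p_t * G_Tt + p_t ** 2 * G_tt
    var_g = (p_T ** 2 * G_TT ** 2 + 2 * p_T * p_t * G_Tt ** 2
             + p_t ** 2 * G_tt ** 2 - mean ** 2)
    var_a = 2 * p_T * p_t * (p_T * (G_TT - G_Tt) + p_t * (G_Tt - G_tt)) ** 2
    var_d = (p_T * p_t * (G_TT - 2 * G_Tt + G_tt)) ** 2
    return mean, var_g, var_a, var_d
```

For p_T = 0.1 and genotypic values 0.82, 0.42, and 0.02 the locus is purely additive: the mean is 0.1, the additive variance is 0.0288, and the dominance variance is zero. For p_T = 0.5 and values 1, 1, and 0 (complete dominance) the total variance 0.1875 splits into 0.125 additive and 0.0625 dominance.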

For a marker locus M with alleles M,m we can assign values X_MM, X_Mm, and X_mm, and have a completely analogous decomposition of the variance of X over individuals in a random mating population. Unless the marker and trait loci are the same, i.e., M is a causal SNP, the nature and location of T are not known and the values of G are not known. The marker and its variable, by contrast, are known.

Association mapping rests on the result (Weir 2008)

Cov (G, X) = ρ_{M T} σ_{A_{M}} σ_{A_{T}} + ρ_{M T}^{2} σ_{D_{M}} σ_{D_{T}}
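This result can be checked numerically by enumerating the nine joint genotypes of Table 1. The sketch below is one way to do so and rests on an assumption that is not spelled out in the text: σ_A and σ_D are taken as signed square roots, so that the sign of the covariance follows the direction of the allelic effects. All function names and the numerical values are hypothetical.

```python
from math import sqrt


def gamete_frequencies(p_M, p_T, D):
    """Table 2: gamete frequencies, keyed by (copies of M, copies of T)."""
    p_m, p_t = 1.0 - p_M, 1.0 - p_T
    return {
        (1, 1): p_M * p_T + D,  # MT
        (1, 0): p_M * p_t - D,  # Mt
        (0, 1): p_m * p_T - D,  # mT
        (0, 0): p_m * p_t + D,  # mt
    }


def signed_roots(p, v2, v1, v0):
    """Signed square roots of the additive and dominance variances.

    v2, v1, v0 are the values of genotypes with 2, 1, 0 copies of the
    allele whose frequency is p (assumption: signs are retained).
    """
    q = 1.0 - p
    root_a = sqrt(2.0 * p * q) * (p * (v2 - v1) + q * (v1 - v0))
    root_d = p * q * (v2 - 2.0 * v1 + v0)
    return root_a, root_d


def covariance_by_enumeration(gametes, X, G):
    """Cov(G, X) over random unions of two gametes (Table 1)."""
    e_x = e_g = e_xg = 0.0
    for (m1, t1), f1 in gametes.items():
        for (m2, t2), f2 in gametes.items():
            w = f1 * f2
            x = X[m1 + m2]  # marker value for the number of M alleles
            g = G[t1 + t2]  # trait value for the number of T alleles
            e_x += w * x
            e_g += w * g
            e_xg += w * x * g
    return e_xg - e_x * e_g


def covariance_by_formula(p_M, p_T, D, X, G):
    """rho * sigma_AM * sigma_AT + rho^2 * sigma_DM * sigma_DT."""
    rho = D / sqrt(p_M * (1.0 - p_M) * p_T * (1.0 - p_T))
    a_M, d_M = signed_roots(p_M, X[2], X[1], X[0])
    a_T, d_T = signed_roots(p_T, G[2], G[1], G[0])
    return rho * a_M * a_T + rho ** 2 * d_M * d_T


# Invented example with dominance at both loci.
p_M, p_T, D = 0.3, 0.1, 0.05
X = {2: 1.0, 1: 0.7, 0: 0.0}
G = {2: 1.0, 1: 0.8, 0: 0.0}
direct = covariance_by_enumeration(gamete_frequencies(p_M, p_T, D), X, G)
formula = covariance_by_formula(p_M, p_T, D, X, G)
```

Under these assumptions `direct` and `formula` agree to rounding error, because the additive parts of X and G covary through D_MT on each of the two gametes and the dominance parts covary through its square.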

Continuous traits

A simple approach for continuous traits is to suppose that trait values Y may be written as the sums of genetic and nongenetic components, Y = G + E. If these components are independent and if E has a mean of zero then Cov(X,Y) = Cov(X,G). If trait values are regressed on the marker genotype values X, the regression coefficient is

β_{Y X} = \frac{Cov (X, Y)}{Var (X)} = \frac{ρ_{M T} σ_{A_{T}} σ_{A_{M}} + ρ_{M T}^{2} σ_{D_{T}} σ_{D_{M}}}{σ_{A_{M}}^{2} + σ_{D_{M}}^{2}}

This reduces to $β_{Y X} = ρ_{M T} σ_{A_{T}} / {(2 p_{M} p_{m})}^{1 / 2}$ for additive marker scoring such as X_MM = 2, X_Mm = 1, and X_mm = 0. In that case, $σ_{A_{M}}^{2} = 2 p_{M} p_{m}$ and $σ_{D_{M}}^{2} = 0$. If, instead, the marker is scored as X_MM = p_m, X_Mm = 0, and X_mm = p_M, making $σ_{A_{M}}^{2} = 0$, then it is only the nonadditive trait variance that influences the regression: $β_{Y X} = ρ_{M T}^{2} σ_{D_{T}} / (p_{M} p_{m})$. For any scoring of the marker genotypes, a significant regression coefficient for phenotype on marker genotype implies a significant linkage disequilibrium ρ_MT between marker and disease loci and there is an implication of physical proximity of these loci (Myles et al. 2009). The extent to which additive and nonadditive trait genetic effects contribute to the regression depends on the choice of marker scaling.

An alternative approach is to work with the correlation ρ_XY of X and Y. Note that the variance of trait values is Var(Y) = Var(G) + V_E. For an additive marker variable

ρ_{X Y} = ρ_{M T} h_{Y}^{T}

where ${(h_{Y}^{T})}^{2} = σ_{A_{T}}^{2} / (σ_{A_{T}}^{2} + σ_{D_{T}}^{2} + V_{E})$ is the heritability of trait Y owing to locus T. Nonadditive marker scoring provides that $ρ_{X Y} \propto ρ_{M T}^{2}$ . Sample values r_XY for the correlation ρ_XY can be transformed to normal variables with Fisher’s transformation

z = \frac{1}{2} \ln (\frac{1 + r_{X Y}}{1 - r_{X Y}})

and the hypothesis ρ_XY = 0, implying that ρ_MT = 0, against the hypothesis ρ_XY > 0 is rejected with significance level α if z ≥ z_{1 − α}. Here z_{1 − α} is the 100(1 − α)th percentile of the standard normal distribution. Standard theory for correlation coefficients provides that the sample size n needed for this test to have 100(1 − β)% power is (approximately)

n = {[\frac{2 (z_{α} + z_{β})}{\ln (\frac{1 + ρ_{X Y}}{1 - ρ_{X Y}})}]}^{2} + 3

For 90% power and 1% significance level, z_β = −1.28 and z_α = −2.33. For a SNP with $ρ_{M T}^{2} = 0.8$ to the disease gene and a trait with heritability ${(h_{Y}^{T})}^{2} = 0.2$, so that ρ_XY = 0.4, this sample size is about 73. Multiple-testing considerations for one-million SNP sets suggest that a significance level of about 10⁻⁸ may be appropriate and then the sample size increases to over 250. These calculations assume a single trait locus, although plant quantitative traits are more likely to be affected by multiple loci and the heritability associated with any one locus will be reduced. If heritability per locus is inversely proportional to the number of loci affecting the trait, then the sample size needed to detect association of a marker to each locus will be roughly proportional to that number. Yang et al. (2010) make the useful distinction between testing for significant marker–trait associations and estimating the combined effects of all markers on a trait. For human height they showed that most of the heritability is accounted for when 300 000 SNPs are used in an analysis, even though individual SNP effects are generally too small to pass stringent significance tests.
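The sample size expression can be evaluated directly. The sketch below is a plain transcription of the formula, with the normal quantiles taken from Python's `statistics.NormalDist`. The function name `sample_size_for_correlation` is hypothetical, and a one-sided test is assumed, as in the text.

```python
from math import log, sqrt
from statistics import NormalDist


def sample_size_for_correlation(rho_xy, alpha, beta):
    """n = [2 (z_alpha + z_beta) / ln((1 + rho) / (1 - rho))]^2 + 3.

    alpha is the one-sided significance level and 1 - beta the power;
    z_alpha and z_beta are lower-tail quantiles (both negative here).
    """
    z_alpha = NormalDist().inv_cdf(alpha)
    z_beta = NormalDist().inv_cdf(beta)
    bracket = 2.0 * (z_alpha + z_beta) / log((1.0 + rho_xy) / (1.0 - rho_xy))
    return bracket ** 2 + 3.0


# Values from the text: rho_MT^2 = 0.8 and single-locus heritability 0.2.
rho_xy = sqrt(0.8 * 0.2)  # correlation of marker score and trait, 0.4
n_one_percent = sample_size_for_correlation(rho_xy, alpha=0.01, beta=0.10)
n_genome_wide = sample_size_for_correlation(rho_xy, alpha=1e-8, beta=0.10)
```

Evaluated as written, the squared term alone is about 73 at the 1% level and the full expression, including the added 3, is about 76. At a significance level of 10⁻⁸ the result is a little under 270, consistent with the statement that the sample size rises to over 250.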

Both the linear regression and correlation approaches can provide single df tests for marker–trait association. Another test follows from analysis of variance of trait values among the three marker genotypic classes. The three trait means are

\begin{array}{l} E (Y | M M) = (P_{M T}^{2} G_{T T} + 2 P_{M T} P_{M t} G_{T t} + P_{M t}^{2} G_{t t}) / p_{M}^{2} \\ \quad = μ_{G} + (p_{M} ρ_{M T} A + ρ_{M T}^{2} D) / p_{M}^{2} \\ E (Y | M m) = μ_{G} + [(p_{m} - p_{M}) ρ_{M T} A - 2 ρ_{M T}^{2} D] / (2 p_{M} p_{m}) \\ E (Y | m m) = μ_{G} - (p_{m} ρ_{M T} A - ρ_{M T}^{2} D) / p_{m}^{2} \end{array}

where $A = σ_{A_{T}} \sqrt{2 p_{M} p_{m}}$ , $D = σ_{D_{T}} p_{M} p_{m}$ , and μ_G is the trait mean. An analysis of variance F test will therefore also test that ρ_MT = 0 and the test will be affected by both additive and dominance effects at the trait locus.

Binary traits

If, however, the trait is binary (presence or absence) the trait genetic value can be interpreted as the probability of an individual having the trait (i.e., being a “case”), and marker genotype frequencies can be conditioned on trait status. Then the random mating assumption leads to

\begin{array}{l} \Pr (M M | case) = p_{M}^{2} + (p_{M} ρ_{M T} A + ρ_{M T}^{2} D) / μ_{G} \\ \Pr (M m | case) = 2 p_{M} p_{m} + [(p_{m} - p_{M}) ρ_{M T} A - 2 ρ_{M T}^{2} D] / μ_{G} \\ \Pr (m m | case) = p_{m}^{2} - (p_{m} ρ_{M T} A - ρ_{M T}^{2} D) / μ_{G} \end{array}

where now μ_G is the trait prevalence or the probability that a random individual has the trait. For individuals not having the trait (i.e., being a “control”)

\begin{array}{l} \Pr (M M | control) = p_{M}^{2} - (p_{M} ρ_{M T} A + ρ_{M T}^{2} D) / (1 - μ_{G}) \\ \Pr (M m | control) = 2 p_{M} p_{m} - [(p_{m} - p_{M}) ρ_{M T} A - 2 ρ_{M T}^{2} D] / (1 - μ_{G}) \\ \Pr (m m | control) = p_{m}^{2} + (p_{m} ρ_{M T} A - ρ_{M T}^{2} D) / (1 - μ_{G}) \end{array}

These expressions lead to case-control test statistics.

For example, allelic data may be set out in a 2 × 2 table of case/control versus M/m as in Table 3. A contingency-table test statistic is

X^{2} = \frac{{(n_{M a} n_{m o} - n_{m a} n_{M o})}^{2}}{n_{M} n_{m} n_{a} n_{o}}

Under the null hypothesis of no marker–trait association, X² is distributed as chi-square with one df. If there is association, the noncentrality parameter is

λ = \frac{n ρ_{M T}^{2} σ_{A_{T}}^{2}}{μ_{G} (1 - μ_{G})}

Table 3.

Allelic case-control counts and probabilities.

		Case	Control	Total
M	Counts	n_Ma	n_Mo	n_M
	Probabilities	p_M μ_G + ½ρ_MT A	p_M(1 − μ_G) − ½ρ_MT A	p_M
m	Counts	n_ma	n_mo	n_m
	Probabilities	p_m μ_G − ½ρ_MT A	p_m(1 − μ_G) + ½ρ_MT A	p_m
Total	Counts	n_a	n_o
	Probabilities	μ_G	1 − μ_G


This allows sample size calculations. Suppose the marker-causal gene linkage disequilibrium is $ρ_{M T}^{2} = 0.8$; the probability that a random individual has the trait is μ_G = 0.1; the causal allele frequencies are p_T = 0.1 and p_t = 0.9; and the probabilities that individuals with genotypes TT, Tt, and tt have the trait are 0.82, 0.42, and 0.02, respectively. Then $σ_{A_{T}}^{2} = 0.0288$ and n ≈ 3λ. For a 1% significance level and a 90% power, n is about 40 and this rises to 140 for a significance level of 10⁻⁸. There are occasions when single markers are associated with genes having such substantial effects on a human disease: the two SNPs reported for macular degeneration (Klein et al. 2005) were found in a case-control study with 96 cases and 50 controls. The more usual situation, however, is that much smaller effects are associated with each SNP: the top SNPs reported for seven common diseases by The Wellcome Trust Case Control Consortium (2007) had odds ratios typically around 1.5 and $σ_{A_{T}}^{2}$ would then drop substantially. It is generally recognized that sample sizes in the thousands are necessary for risk ratios ≤ 1.5 (Risch and Merikangas 1996).

Effects of inbreeding

Plant populations often have some degree of inbreeding and this means that reducing marker genotypes to alleles can be misleading. To illustrate the consequences for association mapping, we extend the discussion of case-control tests to that of trend tests. The usual notation (Sasieni 1997) is shown in Table 4. With that notation, the allelic case-control test statistic is

X_{A}^{2} = \frac{2 N {[N (r_{1} + 2 r_{2}) - R (n_{1} + 2 n_{2})]}^{2}}{S R [2 N (n_{1} + 2 n_{2}) - {(n_{1} + 2 n_{2})}^{2}]}

and the genotypic linear trend test statistic is

X_{T}^{2} = \frac{N {[N (r_{1} + 2 r_{2}) - R (n_{1} + 2 n_{2})]}^{2}}{S R [N (n_{1} + 4 n_{2}) - {(n_{1} + 2 n_{2})}^{2}]}

The trend test statistic can be regarded as N times the squared sample correlation between marker genotype (coded linearly as 2, 1, and 0 for MM, Mm, and mm) and case status (coded as 1 for a case and 0 for a control).

Table 4.

Marker genotype counts in cases and controls.

Genotype	MM	Mm	mm	Total
Case counts	r₀	r₁	r₂	R
Control counts	s₀	s₁	s₂	S
Total counts	n₀	n₁	n₂	N
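Both statistics can be computed directly from the counts in Table 4. The sketch below transcribes the two formulas and adds a separate calculation of the squared sample correlation between genotype score and case status, so that the relation between the trend statistic and the correlation can be checked. The function names and the counts in the usage note are hypothetical. Scoring genotypes 0, 1, 2 rather than 2, 1, 0 leaves the squared correlation unchanged.

```python
def allelic_and_trend_tests(cases, controls):
    """cases = (r0, r1, r2), controls = (s0, s1, s2) for MM, Mm, mm.

    Returns (allelic statistic X_A^2, trend statistic X_T^2).
    """
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    R, S = sum(cases), sum(controls)
    N = R + S
    n1, n2 = r1 + s1, r2 + s2
    a = n1 + 2 * n2  # number of m alleles in the pooled sample
    numerator = (N * (r1 + 2 * r2) - R * a) ** 2
    x2_allelic = 2 * N * numerator / (S * R * (2 * N * a - a ** 2))
    x2_trend = N * numerator / (S * R * (N * (n1 + 4 * n2) - a ** 2))
    return x2_allelic, x2_trend


def squared_correlation(cases, controls):
    """Squared correlation of genotype score (0, 1, 2) with case status."""
    xs, ys = [], []
    for score, (r, s) in enumerate(zip(cases, controls)):
        xs += [score] * (r + s)
        ys += [1] * r + [0] * s
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)
```

With cases (10, 25, 15) and controls (15, 25, 10) the pooled genotype counts (25, 50, 25) are in exact Hardy–Weinberg proportions and the two statistics coincide at 2.0. With cases (30, 10, 10) and controls (10, 10, 30) the pooled sample has a heterozygote deficiency and the allelic statistic exceeds the trend statistic. In both cases the trend statistic equals N times the squared correlation.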


Now suppose the population is inbred to an extent f so that $P_{M M} = p_{M}^{2} + f p_{M} p_{m}$, P_Mm = 2(1 − f)p_Mp_m, and $P_{m m} = p_{m}^{2} + f p_{M} p_{m}$. Under the hypothesis of no association, ρ_MT = 0, the test statistics should have an expected value of 1 and indeed $E (X_{T}^{2}) \approx 1$, but $E (X_{A}^{2}) \approx (1 + f)$. This confirms that the trend test, unlike the allelic case-control test, is robust to departures from Hardy–Weinberg equilibrium. Note that f or F_IS is the within-population inbreeding coefficient.

These last results, and those in the next section, are most easily derived by expressing the test statistics as functions of sample allele frequencies. The numerator of the allelic case-control test statistic $X_{A}^{2}$ can be written as $8 N R^{2} S^{2} {({\tilde{p}}_{M, case} - {\tilde{p}}_{M, control})}^{2}$ where sample values are indicated by tildes. The denominator of $X_{A}^{2}$ can be written as $4 N^{2} S R {\tilde{p}}_{M} (1 - {\tilde{p}}_{M})$ where ${\tilde{p}}_{M}$ is the sample frequency of allele M in the total sample of size N. Under the null hypothesis that the marker allele has the same frequency p_M in cases and controls, and under the assumption that all individuals in the population are independent, the numerator has expectation 4N²SRp_M(1 − p_M)(1 + f) and the denominator has expectation 4N²SRp_M(1 − p_M)[1 − (1 + f)/(2N)] or approximately 4N²SRp_M(1 − p_M), and the expected value of $X_{A}^{2}$ is approximated by the ratio of the expected values of numerator and denominator: $E (X_{A}^{2}) \approx (1 + f)$.

The numerator for the linear trend test statistic $X_{T}^{2}$ is half that for $X_{A}^{2}$ and therefore has the expectation 2N²SRp_M(1 − p_M)(1 + f) under the null hypothesis. The denominator of $X_{T}^{2}$ can be written as $2 N^{2} S R ({\tilde{p}}_{M} + {\tilde{P}}_{M M} - 2 {\tilde{p}}_{M}^{2})$ where ${\tilde{p}}_{M}$ and ${\tilde{P}}_{M M}$ are the sample frequencies of allele M and MM homozygotes in the total sample of size N. Under the null hypothesis that the marker allele has the same frequency p_M in cases and controls, and under the assumption that all individuals in the population are independent, the denominator has expectation $2 N^{2} S R [p_{M} + [p_{M}^{2} + p_{M} (1 - p_{M}) f] - 2 (p_{M} (1 - p_{M}) (1 + f) / (2 N) + p_{M}^{2})]$ or approximately 2N²SRp_M(1 − p_M)(1 + f). The ratio of numerator and denominator expected values provides an approximation to the expected value of $X_{T}^{2}$, namely $E (X_{T}^{2}) \approx 1$.

Effects of population structure

Whether or not a population is inbred there may be substructure and this can inflate association test statistics (Devlin and Roeder 1999; Cardon and Palmer 2003; Wen et al. 2009). A discussion of the effects of population structure involves the total inbreeding coefficient F or F_IT and the related quantity θ or F_ST. The change in perspective is most easily explained by referring to the variance of allele frequencies. In the previous section, the within-population inbreeding coefficient f allows the variance of a sample allele frequency ${\tilde{p}}_{M}$ in a sample of n individuals to be expressed as p_Mp_m(1+f)/2n. This variance is over repeated samples from the same population. In a structured population, there is variation of allele frequencies among subpopulations as well as among samples within subpopulations. The total variance of a sample allele frequency accommodates both sources and is given by p_Mp_m[θ + (1 + F − 2θ)/(2n)].

If the ith subpopulation furnishes R_i cases and S_i controls, with N_i = R_i + S_i, then it is convenient to introduce x_i = R_i/R, y_i = S_i/S, z_i = N_i/N as the proportions of cases, controls and all samples from the ith subpopulation. If it is not known to which subpopulation an individual belongs, the expected values of the association test statistics are

\begin{array}{l} E (X_{A}^{2}) \approx \frac{2 R S θ \sum_{i} {(x_{i} - y_{i})}^{2} + N (1 + F - 2 θ)}{N (1 - θ \sum_{i} z_{i}^{2}) - (1 + F - 2 θ) / 2} \\ E (X_{T}^{2}) \approx \frac{2 R S θ \sum_{i} {(x_{i} - y_{i})}^{2} + N (1 + F - 2 θ)}{N [(1 + F) - 2 θ \sum_{i} z_{i}^{2}]} \end{array}

If there is random mating within each subpopulation, F = θ, and if there are equal numbers of cases and controls R = S = N/2. If, further, there are many subpopulations, we can ignore the term $F \sum_{i} z_{i}^{2}$ in the denominator and then

E (X_{A}^{2}) = E (X_{T}^{2}) \approx 1 + \frac{R F \sum_{i} {(x_{i} - y_{i})}^{2} - 2 F}{1 + F}

The trend test does not offer protection against population structure, and the inflation of association tests increases the more a subpopulation contributes unequally to the case and control samples. With equal representation, x_i = y_i, the test statistics may actually be deflated.
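The size of this inflation can be explored numerically. The sketch below evaluates the two general expressions for the expected test statistics given earlier in this section, before the simplifying assumptions are applied. The function name `expected_test_statistics` and the subpopulation counts in the usage note are hypothetical.

```python
def expected_test_statistics(cases_by_subpop, controls_by_subpop, theta, F):
    """Approximate expected allelic and trend statistics under the null
    hypothesis when subpopulation membership is ignored.

    cases_by_subpop[i] and controls_by_subpop[i] are R_i and S_i.
    Returns (expected X_A^2, expected X_T^2).
    """
    R = sum(cases_by_subpop)
    S = sum(controls_by_subpop)
    N = R + S
    x = [r / R for r in cases_by_subpop]     # case proportions x_i
    y = [s / S for s in controls_by_subpop]  # control proportions y_i
    z = [(r + s) / N for r, s in zip(cases_by_subpop, controls_by_subpop)]
    sum_xy = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    sum_z2 = sum(zi ** 2 for zi in z)
    numerator = 2 * R * S * theta * sum_xy + N * (1 + F - 2 * theta)
    e_allelic = numerator / (N * (1 - theta * sum_z2) - (1 + F - 2 * theta) / 2)
    e_trend = numerator / (N * ((1 + F) - 2 * theta * sum_z2))
    return e_allelic, e_trend
```

Taking two subpopulations with θ = F = 0.05, equal representation such as 500 cases and 500 controls from each gives an expected trend statistic slightly below one. Unequal representation such as 800 and 200 cases against 200 and 800 controls gives a value far above one, which illustrates how strongly the inflation depends on the x_i − y_i differences.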

Effects of relatedness

Individual plants or lines derived from crosses involving the same founders are related by virtue of sharing some of their genome — they have alleles identical by descent from these founders. Plants in a finite population necessarily have common ancestors even if there is random outcrossing, so any distinction between family and evolutionary relatedness is a little artificial. Nevertheless, it is usual to use “relatedness” for plants in the same family or pedigree. The availability of extensive marker data means that relatedness can be estimated in the absence of pedigree information. Even if pedigree information is available, however, there may be a case for using genetic estimates to guard against pedigree errors or to accommodate differences in actual relatedness along the genome whether these are due to chance or to the effects of natural or artificial selection.

A treatment of relatedness rests on the concept of identity by descent. Two noninbred diploid plants may have received identical alleles for neither, either, or both of their maternal and paternal genes. The probabilities of these three events are written as k₀, k₁, and k₂ and these sum to one. For full siblings, k₀ = 1/4, k₁ = 1/2, and k₂ = 1/4. Association testing rests on the kinship coefficient, deliberately written as θ to emphasize the continuity between evolutionary and family relatedness, with θ = k₂/2 + k₁/4.

The allelic case-control test was introduced above as being a contingency table test on the 2 × 2 table of marker allelic state versus case-control state. Equivalently, it can be constructed by assuming that marker allele frequencies are normally distributed and working with the difference of case- and control-sample frequencies $({\tilde{p}}_{M,case} - {\tilde{p}}_{M,control})$ . Specifically, the test statistic is

X^{2} = \frac{{({\tilde{p}}_{M,case} - {\tilde{p}}_{M,control})}^{2}}{Var ({\tilde{p}}_{M,case}) - 2 Cov({\tilde{p}}_{M,case}, {\tilde{p}}_{M,control})+Var({\tilde{p}}_{M,control})}

The denominator depends on the total inbreeding coefficients F_i for all individuals i in the sample and the coancestry coefficients θ_ij for all pairs of individuals i,j. It consists of three sums, over the n_a cases, the n_o controls and the n_an_o case-control pairs:

\begin{array}{l} Var ({\tilde{p}}_{M,case}) = \frac{p_{M} (1 - p_{M})}{n_{a}^{2}} [\sum_{i = 1}^{n_{a}} (1 + F_{i}) + \sum_{i \neq i^{'}} θ_{i i^{'}}] \\ Var ({\tilde{p}}_{M,control}) = \frac{p_{M} (1 - p_{M})}{n_{o}^{2}} [\sum_{j = 1}^{n_{o}} (1 + F_{j}) + \sum_{j \neq j^{'}} θ_{j j^{'}}] \\ Cov ({\tilde{p}}_{M,case}, {\tilde{p}}_{M,control}) = \frac{p_{M} (1 - p_{M})}{n_{a} n_{o}} \sum_{i = 1}^{n_{a}} \sum_{j = 1}^{n_{o}} θ_{i j} \end{array}

Bourgain et al. (2003) suggested using the F_i’s and θ_ij’s derived from known pedigrees, along with allele frequencies p_M estimated from the combined case and control sample. Yu et al. (2006) and Choi et al. (2009) suggested using the SNP genotypes to estimate the inbreeding and coancestry coefficients.

For noninbred individuals, F_i = 0, maximum-likelihood estimation of the three identity coefficients k₀,k₁, and k₂ is straightforward, although it requires iteration. The procedure follows from being able to express the joint genotype probabilities for pairs of individuals in terms of the k’s and allele frequencies, with the coefficients being assumed the same over a set of independent SNPs. A sufficiently dense collection of SNPs would allow for differences in coefficients in different regions of the genome. If the individuals are inbred, however, a set of nine identity by descent coefficients is needed (Weir et al. 2006).

Variation in the identity coefficients and their estimates raises potential problems with incorporating relatedness coefficients into association mapping. Estimation relies on allele frequencies which, in turn, have to be estimated from the set of individuals under consideration. This need to estimate allele frequencies introduces variation into relatedness estimates but a larger source of variation has to do with the inherent variation of the coefficients among loci. At any particular locus, of course, the individuals have only one identity state, so actual probabilities $\hat{k_{0}}$ , $\hat{k_{1}}$ , and $\hat{k_{2}}$ are either one or zero and exactly one of the three equals one. For full siblings, the variances of the three binary quantities are 3/16, 1/4, and 3/16, respectively. When several loci are considered, the variances of the average actual identity coefficients depend on the recombination fractions between pairs of loci. If they are estimated from a very dense set of markers over a genome with m chromosomes with map lengths l_i, i = 1, 2, …, m Morgans and total map length L, then the variance of the average actual coancestry coefficient for full siblings is (Visscher 2009)

Var ({\hat{θ}}^{F S}) = \frac{1}{256 L^{2}} (4 L + \sum_{i} e^{- 4 l_{i}} - m)

This result assumes the Haldane mapping function. Visscher et al. (2006) have proposed using estimated coancestries for estimation of components of genetic variance and, hence, heritabilities of quantitative traits.
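The variance expression above is easy to evaluate for any set of chromosome map lengths. The sketch below does so and also computes the kinship coefficient from the three identity probabilities. The function names are hypothetical and the genome of ten one-Morgan chromosomes in the usage note is invented for illustration.

```python
from math import exp


def kinship_from_k(k0, k1, k2):
    """Kinship (coancestry) coefficient theta = k2/2 + k1/4."""
    if abs(k0 + k1 + k2 - 1.0) > 1e-9:
        raise ValueError("k0, k1 and k2 must sum to one")
    return k2 / 2.0 + k1 / 4.0


def var_actual_coancestry_full_sibs(map_lengths):
    """Variance of the genome-average actual coancestry for full siblings.

    map_lengths holds the chromosome map lengths l_i in Morgans.
    """
    L = sum(map_lengths)
    m = len(map_lengths)
    return (4.0 * L + sum(exp(-4.0 * l) for l in map_lengths) - m) / (256.0 * L ** 2)
```

For full siblings `kinship_from_k(0.25, 0.5, 0.25)` is 0.25. As a single chromosome shrinks towards a point, the variance approaches 1/32, the single-locus variance of a quantity that takes the values 0, 1/4, and 1/2 with probabilities 1/4, 1/2, and 1/4. For ten chromosomes of one Morgan each, `var_actual_coancestry_full_sibs([1.0] * 10)` is close to one tenth of the value for a single one-Morgan chromosome, as expected for independent chromosomes of equal length.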

Discussion

Association mapping for all plants, including Brassica, is entering a phase where the number of genetic markers is not a limiting factor, although a more pessimistic note is sounded by Ganal et al. (2009). We have come a long way from the isozyme studies of the 1970s and even candidate gene studies in the near future will have much more detailed information than those that use fewer than 10 markers (e.g., Krouchi et al. 2008). The lament of Rostoks et al. (2006) that “whole-genome association studies in crop plants, with the exception of rice, are currently limited by the number of markers available, their format, and cost” is no longer necessary. Recent authors (Myles et al. 2009) have stressed that, especially with the imminence of whole-genome sequence data, attention should be directed towards the design of association studies rather than the number or genomic spacing of markers. The observation of Stich et al. (2007) that association mapping is less expensive than linkage mapping because data routinely collected in plant breeding programs can be used remains pertinent.

As marker sets develop, genome-wide association studies in Brassica can be expected to proceed as they have in Arabidopsis (Aranzana et al. 2005), barley (Cockram et al. 2008), and maize (Beló et al. 2008). The special features of Brassica will have to be taken into account but the work already in place for identifying aneuploidy and accommodating population structure and relatedness in humans should provide a foundation.

Acknowledgments

This work was supported in part by NIH grant GM 075091 and by a travel grant from OECD.

Footnotes

This article is one of a selection of papers from the conference “Exploiting Genome-wide Association in Oilseed Brassicas: a model for genetic improvement of major OECD crops for sustainable farming”.

References

Allard RW, Kahler AL, Weir BS. The effect of selection on esterase allozymes in a barley population. Genetics. 1972;72(3):489–503. doi: 10.1093/genetics/72.3.489.
Aranzana MJ, Kim S, Zhao KY, Bakker E, Horton M, Jakob K, et al. Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genes. PLoS Genet. 2005;1(5):e60. doi: 10.1371/journal.pgen.0010060.
Beló A, Zheng P, Luck S, Shen B, Meyer DJ, Li B, et al. Whole genome scan detects an allelic variant of fad2 associated with increased oleic acid levels in maize. Mol Genet Genomics. 2008;279(1):1–10. doi: 10.1007/s00438-007-0289-y.
Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, et al. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet. 2003;73(3):612–626. doi: 10.1086/378208.
Broman KW. Cleaning genotype data. Genet Epidemiol. 1999;17(Suppl 1):S79–S83. doi: 10.1002/gepi.1370170714.
Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet. 2003;361(9357):598–604. doi: 10.1016/S0140-6736(03)12520-2.
Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, Thomas G, et al. Replicating genotype-phenotype associations. Nature. 2007;447(7145):655–660. doi: 10.1038/447655a.
Choi Y, Wijsman EM, Weir BS. Case-control association testing in the presence of unknown relationships. Genet Epidemiol. 2009;33(8):668–678. doi: 10.1002/gepi.20418.
Cockram J, White J, Leigh FJ, Lea VJ, Chiapparino E, Laurie DA, et al. Association mapping of partitioning loci in barley. BMC Genet. 2008;9(1):16. doi: 10.1186/1471-2156-9-16.
Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997–1004. doi: 10.1111/j.0006-341X.1999.00997.x.
Duran C, Appleby N, Clark T, Wood D, Imelfort M, Batley J, Edwards D. AutoSNPdb: an annotated single nucleotide polymorphism database for crop plants. Nucleic Acids Res. 2008;37:D951–D953. doi: 10.1093/nar/gkn650. Database issue.
Ganal MW, Altmann T, Röder MS. SNP identification in crop plants. Curr Opin Plant Biol. 2009;12(2):211–217. doi: 10.1016/j.pbi.2008.12.009.
Gore MA, Chia JM, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, et al. A first-generation haplotype map of maize. Science. 2009;326(5956):1115–1117. doi: 10.1126/science.1177837.
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308(5720):385–389. doi: 10.1126/science.1109557.
Krouchi F, Gustavsson S, Sjödin P, Kruskopf-Österberg M, Lagercrantz U, Lascoux M. Association between COL1 and flowering time in Brassica nigra: replication validation, and genotypic disequilibrium. Int J Plant Sci. 2008;169(9):1229–1237. doi: 10.1086/591989.
Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010;34(6):591–602. doi: 10.1002/gepi.20516.
Maciejewski JP, Mufti GJ. Whole genome scanning as a cytogenetic tool in hematologic malignancies. Blood. 2008;112(4):965–974. doi: 10.1182/blood-2008-02-130435.
Manolio TA, Rodriguez LL, Brooks L, Abecasis G, Ballinger D, Daly M, et al. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat Genet. 2007;39(9):1045–1051. doi: 10.1038/ng2127.
Miyagawa T, Nishida N, Ohashi J, Kimura R, Fujimoto A, Kawashima M, et al. Appropriate data cleaning methods for genome-wide association study. J Hum Genet. 2008;53(10):886–893. doi: 10.1007/s10038-008-0322-y.
Myles S, Peiffer J, Brown PJ, Ersoz ES, Zhang Z, Costich DE, Buckler ES. Association mapping: critical considerations shift from genotyping to experimental design. Plant Cell. 2009;21(8):2194–2202. doi: 10.1105/tpc.109.068437.
Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456(7218):98–101. doi: 10.1038/nature07331.
Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):2074–2093. doi: 10.1371/journal.pgen.0020190.
Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res. 2006;16(9):1136–1148. doi: 10.1101/gr.5402306.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–909. doi: 10.1038/ng1847.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–575. doi: 10.1086/519795.
Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273(5281):1516–1517. doi: 10.1126/science.273.5281.1516.
Rostoks N, Ramsay L, MacKenzie K, Cardle L, Bhat PR, Roose ML, et al. Recent history of artificial outcrossing facilitates whole-genome association mapping in elite inbred crop varieties. Proc Natl Acad Sci USA. 2006;103(49):18656–18661. doi: 10.1073/pnas.0606133103.
Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997;53(4):1253–1261. doi: 10.2307/2533494.
Sorkheh K, Malysheva-Otto LV, Wirthensohn MG, Tarkesh-Esfahani S, Martínez-Gómez P. Linkage disequilibrium, genetic association mapping and gene localization in crop plants. Genet Mol Biol. 2008;31(4):805–814. doi: 10.1590/S1415-47572008005000005.
Stich B, Melchinger AE, Piepho HP, Hamrit S, Schipprack W, Maurer HP, Reif JC. Potential causes of linkage disequilibrium in a European maize breeding program investigated with computer simulations. Theor Appl Genet. 2007;115(4):529–536. doi: 10.1007/s00122-007-0586-1.
Syvänen A-C. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat Rev Genet. 2001;2(12):930–942. doi: 10.1038/35103535.
The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. doi: 10.1038/nature05911.
Varshney RK, Nayak SN, May GD, Jackson SA. Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol. 2009;27(9):522–530. doi: 10.1016/j.tibtech.2009.05.006.
Visscher PM. Whole genome approaches to quantitative genetics. Genetica. 2009;136(2):351–358. doi: 10.1007/s10709-008-9301-7.
Visscher PM, Medland SE, Ferreira MA, Morley KI, Zhu G, Cornes BK, et al. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2006;2(3):e41. doi: 10.1371/journal.pgen.0020041.
Weir BS. Linkage disequilibrium and association mapping. Annu Rev Genomics Hum Genet. 2008;9(1):129–142. doi: 10.1146/annurev.genom.9.081307.164347.
Weir BS, Hill WG, Cardon LR. Allelic association patterns for a dense SNP map. Genet Epidemiol. 2004;27(4):442–450. doi: 10.1002/gepi.20038.
Weir BS, Anderson AD, Hepler AB. Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet. 2006;7(10):771–780. doi: 10.1038/nrg1960.
Wen W, Mei H, Feng F, Yu S, Huang Z, Wu J, et al. Population structure and association mapping on chromosome 7 using a diverse panel of Chinese germplasm of rice (Oryza sativa L.). Theor Appl Genet. 2009;119(3):459–470. doi: 10.1007/s00122-009-1052-z.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–569. doi: 10.1038/ng.608.
Yu JM, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38(2):203–208. doi: 10.1038/ng1702.
