Genetics. 2012 Jan;190(1):275–277. doi: 10.1534/genetics.111.134841

“SNP Snappy”: A Strategy for Fast Genome-Wide Association Studies Fitting a Full Mixed Model

Karin Meyer, Bruce Tier
Editor: G A Churchill
PMCID: PMC3249377  PMID: 22021386

Abstract

A strategy to reduce the computational demands of genome-wide association studies fitting a mixed model is presented. Improvements are achieved by exploiting the fact that a large proportion of the calculations remain constant across the analyses for individual markers, with estimates obtained without inverting large matrices.


GENOME-WIDE association studies (GWAS) have become a routine task for geneticists in a range of areas. Analyses employing a mixed model are widely used as this provides a flexible framework to account for systematic differences and covariances due to other sources, such as population stratification and a family structure among genotyped individuals (Kang et al. 2010; Price et al. 2010; Zhang et al. 2010). A common type of investigation involves solving a system of mixed model equations (MME) fitting one or a few single nucleotide polymorphism markers (SNP) at a time, treating SNP effects as covariables, with variance components fixed at their estimates from an analysis omitting SNP. Typically this is done by inverting the coefficient matrix in the MME for each analysis. While individual analyses take only seconds, analyzing all markers for a high-density chip imposes a considerable computational burden. Hence, estimation of SNP effects by first fitting the mixed model excluding any SNP effects and then applying the SNP-wise analysis to the resulting residuals has been suggested (Aulchenko et al. 2007). However, this may lead to biased results if genotypes are not randomized across the effects in the model or if SNP effects and population strata are partially confounded. A typical example is an analysis comprising animals of different breeds with different allele frequencies (Johnston and Graser 2010).

When fitting the full model, we can partition the pertaining MME into a small part due to SNP effects and a part due to the other effects in the model. For complete genotype information only the former changes as different SNPs are considered. This structure can be exploited to reduce computational requirements. We present the strategy to do so, describe its implementation in freely available mixed model software, and show an example application.

Computing Strategy

Consider a mixed model

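$$\mathbf{y} = \mathbf{X}\mathbf{b}^k + \mathbf{Z}\mathbf{u}^k + \mathbf{W}^k\mathbf{s}^k + \mathbf{e}^k, \tag{1}$$

with $\mathbf{y}$, $\mathbf{b}^k$, $\mathbf{u}^k$, $\mathbf{s}^k$, and $\mathbf{e}^k$ denoting the vector of observations (phenotypes), fixed effects other than SNP effects, random effects, SNP effects, and residuals, respectively, and $\mathbf{X}$, $\mathbf{Z}$, and $\mathbf{W}^k$ the incidence matrices pertaining to $\mathbf{b}^k$, $\mathbf{u}^k$, and $\mathbf{s}^k$. As emphasized by the superscript $k$, only $\mathbf{W}^k$ differs between analyses for different SNPs, with the elements of $\mathbf{W}^k$ equal to the number of copies of the reference allele (0, 1, or 2 in a biallelic model) for the SNP(s) in the $k$th analysis. To obtain the estimate $\hat{\mathbf{s}}^k$ we need to solve the MME

$$
\begin{pmatrix}
\mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{W}^k \\
\mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{W}^k \\
\mathbf{W}^{k\prime}\mathbf{R}^{-1}\mathbf{X} & \mathbf{W}^{k\prime}\mathbf{R}^{-1}\mathbf{Z} & \mathbf{W}^{k\prime}\mathbf{R}^{-1}\mathbf{W}^k
\end{pmatrix}
\begin{pmatrix} \hat{\mathbf{b}}^k \\ \hat{\mathbf{u}}^k \\ \hat{\mathbf{s}}^k \end{pmatrix}
=
\begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{W}^{k\prime}\mathbf{R}^{-1}\mathbf{y} \end{pmatrix}, \tag{2}
$$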

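with $\mathbf{R} = \mathrm{Var}(\mathbf{e})$ and $\mathbf{G} = \mathrm{Var}(\mathbf{u})$ the covariance matrices among residuals and random effects, respectively. Rewrite Equation 2 as

$$
\begin{pmatrix} \mathbf{C}_{11} & \mathbf{C}_{12}^k \\ \mathbf{C}_{21}^k & \mathbf{C}_{22}^k \end{pmatrix}
\begin{pmatrix} \hat{\mathbf{v}}^k \\ \hat{\mathbf{s}}^k \end{pmatrix}
=
\begin{pmatrix} \mathbf{r}_1 \\ \mathbf{r}_2^k \end{pmatrix}, \tag{3}
$$

with $\mathbf{C}_{11}$ of size $n_1 \times n_1$ denoting the part of the coefficient matrix that is constant and $\mathbf{r}_1$ the pertaining vector of right-hand sides, $\mathbf{C}_{22}^k$, of size $n_2 \times n_2$, and $\mathbf{r}_2^k$ the corresponding terms for the effects changing with each analysis, and $\mathbf{C}_{12}^k$ and $\mathbf{C}_{21}^k$ the off-diagonal blocks in the coefficient matrix. With $\hat{\mathbf{v}}^k$ generally not of interest, we can estimate $\hat{\mathbf{s}}^k$ as a solution to

$$\left(\mathbf{C}_{22}^k - \mathbf{C}_{21}^k\mathbf{C}_{11}^{-1}\mathbf{C}_{12}^k\right)\hat{\mathbf{s}}^k = \mathbf{r}_2^k - \mathbf{C}_{21}^k\mathbf{C}_{11}^{-1}\mathbf{r}_1. \tag{4}$$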
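As a quick numerical illustration of Equations 2 to 4, the following Python sketch builds the MME for a small, made-up data set and checks that the reduced system in Equation 4 returns the same SNP estimate as solving the full system in Equation 3. All dimensions, design matrices, phenotypes, and variance components are hypothetical and chosen only to keep the example small; NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy data: 12 records, 2 fixed effects, 4 random levels, 1 SNP.
n_rec, n_ran = 12, 4
idx = np.arange(n_rec)
X = np.column_stack([np.ones(n_rec), idx % 2])   # mean and one 0/1 covariable
Z = np.eye(n_ran)[idx % n_ran]                   # one random-effect level per record
W = np.array([[0, 1, 2, 1, 0, 2, 1, 1, 0, 2, 0, 1]], dtype=float).T  # allele counts
y = np.sin(idx) + 0.3 * W[:, 0]                  # arbitrary phenotypes

R_inv = np.eye(n_rec)          # assumption: R = I
G_inv = 2.0 * np.eye(n_ran)    # assumption: G = 0.5 * I

# Constant part of the MME (C11, r1): everything except the SNP.
T = np.hstack([X, Z])
C11 = T.T @ R_inv @ T
C11[2:, 2:] += G_inv           # add G^{-1} to the random-effects block
r1 = T.T @ R_inv @ y

# SNP-specific parts (C21, C22, r2).
C21 = W.T @ R_inv @ T
C22 = W.T @ R_inv @ W
r2 = W.T @ R_inv @ y

# Full solve of Equation 3; the last element is the SNP effect.
C = np.block([[C11, C21.T], [C21, C22]])
s_full = np.linalg.solve(C, np.concatenate([r1, r2]))[-1]

# Reduced system of Equation 4.
C11_inv = np.linalg.inv(C11)
lhs = C22 - C21 @ C11_inv @ C21.T
rhs = r2 - C21 @ C11_inv @ r1
s_reduced = np.linalg.solve(lhs, rhs)[0]

print(np.isclose(s_full, s_reduced))
```

The explicit inverse of $\mathbf{C}_{11}$ is used here only to mirror Equation 4 literally; it is exactly the step that becomes expensive when $n_1$ is large.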

With $n_2$ small, inversion of the coefficient matrix and direct solution of Equation 4 is undemanding. While $\mathbf{C}_{11}^{-1}$ remains constant and thus needs to be determined only once, computations for the inversion are proportional to $n_1^3$ and thus can be nontrivial for large $n_1$. Fortunately, we can obtain $\hat{\mathbf{s}}^k$ without inverting $\mathbf{C}_{11}$. Let

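$$\mathbf{L} = \begin{pmatrix} \mathbf{L}_{11} & \mathbf{0} \\ \mathbf{L}_{21}^k & \mathbf{L}_{22}^k \end{pmatrix} \tag{5}$$

denote the Cholesky factor of the coefficient matrix in Equation 3 with $\mathbf{C}_{11} = \mathbf{L}_{11}\mathbf{L}_{11}'$, $\mathbf{C}_{21}^k = \mathbf{L}_{21}^k\mathbf{L}_{11}'$, and $\mathbf{C}_{22}^k = \mathbf{L}_{21}^k\mathbf{L}_{21}^{k\prime} + \mathbf{L}_{22}^k\mathbf{L}_{22}^{k\prime}$. Substituting these terms in Equation 4 yields

$$\mathbf{L}_{22}^k\mathbf{L}_{22}^{k\prime}\hat{\mathbf{s}}^k = \mathbf{r}_2^k - \mathbf{L}_{21}^k\hat{\mathbf{t}}_1 \quad \text{with} \quad \hat{\mathbf{t}}_1 = \mathbf{L}_{11}^{-1}\mathbf{r}_1. \tag{6}$$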
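The substitution can be spelled out term by term. The lines below use only the block identities of the Cholesky factor, the symmetry of the coefficient matrix (so that the upper off-diagonal block is the transpose of the lower one), and a prime to denote a transpose:

$$
\begin{aligned}
\mathbf{C}_{21}^k\mathbf{C}_{11}^{-1}\mathbf{C}_{12}^k
  &= \mathbf{L}_{21}^k\mathbf{L}_{11}'\left(\mathbf{L}_{11}\mathbf{L}_{11}'\right)^{-1}\mathbf{L}_{11}\mathbf{L}_{21}^{k\prime}
   = \mathbf{L}_{21}^k\mathbf{L}_{21}^{k\prime}, \\
\mathbf{C}_{22}^k - \mathbf{C}_{21}^k\mathbf{C}_{11}^{-1}\mathbf{C}_{12}^k
  &= \mathbf{L}_{21}^k\mathbf{L}_{21}^{k\prime} + \mathbf{L}_{22}^k\mathbf{L}_{22}^{k\prime} - \mathbf{L}_{21}^k\mathbf{L}_{21}^{k\prime}
   = \mathbf{L}_{22}^k\mathbf{L}_{22}^{k\prime}, \\
\mathbf{C}_{21}^k\mathbf{C}_{11}^{-1}\mathbf{r}_1
  &= \mathbf{L}_{21}^k\mathbf{L}_{11}'\left(\mathbf{L}_{11}'\right)^{-1}\mathbf{L}_{11}^{-1}\mathbf{r}_1
   = \mathbf{L}_{21}^k\hat{\mathbf{t}}_1 .
\end{aligned}
$$

The inverse of the constant block therefore never has to be formed: it enters only through the triangular factor and the vector obtained from it.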

This suggests that estimates $\hat{\mathbf{s}}^k$ for $k = 1, \ldots, K$ can be obtained efficiently by splitting computations as follows.

To be performed once

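  • Set up $\mathbf{C}_{11}$ and $\mathbf{r}_1$, i.e., the MME omitting SNP effects.

  • Perform the Cholesky factorization of $\mathbf{C}_{11}$ to obtain $\mathbf{L}_{11}$.

  • Determine $\hat{\mathbf{t}}_1$ as a solution to $\mathbf{L}_{11}\hat{\mathbf{t}}_1 = \mathbf{r}_1$. With $\mathbf{L}_{11}$ triangular, this involves forward substitution steps
    $$\hat{t}_1 = r_1/\ell_{11}$$
    and
    $$\hat{t}_i = \Bigl(r_i - \sum_{j=1}^{i-1}\ell_{ij}\hat{t}_j\Bigr)\Big/\ell_{ii} \quad \text{for } i = 2, \ldots, n_1$$

for $r_i$ and $\hat{t}_i$ the $i$th elements of $\mathbf{r}_1$ and $\hat{\mathbf{t}}_1$, and $\ell_{ij}$ the $ij$th element of $\mathbf{L}_{11}$.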
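The forward substitution step can be sketched in a few lines of plain Python. The function name and the 3 × 3 triangular factor below are illustrative assumptions, not taken from the article or from any software; a production implementation would operate on a sparse factor.

```python
def forward_substitution(L, r):
    """Solve L t = r for a lower-triangular L given as a list of rows."""
    n = len(r)
    t = [0.0] * n
    for i in range(n):
        acc = r[i]
        # Subtract the contributions of the elements already solved for.
        for j in range(i):
            acc -= L[i][j] * t[j]
        t[i] = acc / L[i][i]
    return t


# Hypothetical lower-triangular factor and right-hand side.
L11 = [[2.0, 0.0, 0.0],
       [1.0, 3.0, 0.0],
       [4.0, 5.0, 6.0]]
r1 = [2.0, 7.0, 32.0]

t1 = forward_substitution(L11, r1)
print(t1)  # [1.0, 2.0, 3.0]
```

The loop touches each nonzero element of the factor once, so the cost is proportional to the number of nonzero elements rather than to the cube of the number of equations.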

To be performed for each set of SNPs

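  • Determine parts of the MME specific to the $k$th analysis, $\mathbf{C}_{21}^k$, $\mathbf{C}_{22}^k$, and $\mathbf{r}_2^k$.

  • Set up $\mathbf{L}^*$, representing the intermediate matrix arising in factorizing the coefficient matrix in Equation 3 after rows 1 to $n_1$ have been processed:
    $$\mathbf{L}^* = \begin{pmatrix} \mathbf{L}_{11} & \mathbf{0} \\ \mathbf{C}_{21}^k & \mathbf{C}_{22}^k \end{pmatrix}.$$
  • “Complete” the factorization steps for rows $n_1 + 1$ to $n_1 + n_2$ using
    $$\ell_{ij}^* = \Bigl(\ell_{ij}^* - \sum_{k=1}^{j-1}\ell_{ik}^*\ell_{jk}^*\Bigr)\Big/\ell_{jj}^* \quad \text{for } j < i \quad \text{and} \quad \ell_{ii}^* = \sqrt{\ell_{ii}^* - \sum_{k=1}^{i-1}\left(\ell_{ik}^*\right)^2}$$

(for $\ell_{ij}^*$ the $ij$th element of $\mathbf{L}^*$). Processing columns 1 to $n_1$ column-wise replaces $\mathbf{C}_{21}^k$ in $\mathbf{L}^*$ with $\mathbf{L}_{21}^k$. The remaining elements (in columns $n_1 + 1$ to $n_1 + n_2$) are then adjusted row-wise, overwriting $\mathbf{C}_{22}^k$ with $\mathbf{L}_{22}^k$.

  • Determine a generalized inverse of $\mathbf{L}_{22}^k$, $\mathbf{L}_{22}^{k-}$, to obtain $\hat{\mathbf{s}}^k = (\mathbf{L}_{22}^{k-})'\,\mathbf{L}_{22}^{k-}\left(\mathbf{r}_2^k - \mathbf{L}_{21}^k\hat{\mathbf{t}}_1\right)$. Sampling variances of $\hat{\mathbf{s}}^k$ are given by the diagonal elements of $(\mathbf{L}_{22}^{k-})'\,\mathbf{L}_{22}^{k-}$.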

Note that $\mathbf{L}_{22}^k$ can have diagonal elements of zero if a SNP is monomorphic or if SNPs with proportional allele counts are considered simultaneously. This is accounted for in the generalized inverse. If $n_2$ is not small or if sampling variances are not required, an alternative to solve for $\hat{\mathbf{s}}^k$ is a series of forward and backward substitution steps.
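The split into "once" and "per SNP" computations can be sketched with dense NumPy arrays as follows. This is a minimal illustration under stated assumptions, not the WOMBAT implementation: the function names, the toy data, and the variance components (R = I, G = 0.5 I) are hypothetical, and a plain inverse is used in place of the generalized inverse, so the sketch assumes every SNP is polymorphic and not collinear with the other effects.

```python
import numpy as np

def forward_solve(L, B):
    """Solve L X = B for lower-triangular L by forward substitution."""
    B = np.array(B, dtype=float)
    X = np.zeros_like(B)
    for i in range(L.shape[0]):
        X[i] = (B[i] - L[i, :i] @ X[:i]) / L[i, i]
    return X

def setup_once(C11, r1):
    """Steps performed once: factorize C11 and solve L11 t1 = r1."""
    L11 = np.linalg.cholesky(C11)        # lower triangular, C11 = L11 L11'
    return L11, forward_solve(L11, r1)

def snp_step(L11, t1, C21, C22, r2):
    """Steps performed per SNP: complete the factor, then solve Equation 6."""
    L21 = forward_solve(L11, C21.T).T    # from C21 = L21 L11'
    L22 = np.linalg.cholesky(C22 - L21 @ L21.T)
    L22_inv = np.linalg.inv(L22)         # plain inverse: assumes L22 is nonsingular
    V = L22_inv.T @ L22_inv              # (L22 L22')^{-1}
    return V @ (r2 - L21 @ t1), np.diag(V)

# Hypothetical toy data: 12 records, 2 fixed effects, 4 random levels, 3 SNPs.
n_rec, n_ran = 12, 4
idx = np.arange(n_rec)
X = np.column_stack([np.ones(n_rec), idx % 2])
Z = np.eye(n_ran)[idx % n_ran]
geno = np.array([[0, 1, 2, 1, 0, 2, 1, 1, 0, 2, 0, 1],
                 [2, 2, 1, 0, 1, 1, 0, 2, 1, 0, 1, 2],
                 [1, 0, 0, 1, 2, 1, 2, 0, 1, 1, 2, 0]], dtype=float).T
y = np.sin(idx) + 0.3 * geno[:, 0]

# MME omitting SNP effects, with R = I and G^{-1} = 2 I (assumed values).
T = np.hstack([X, Z])
C11 = T.T @ T
C11[2:, 2:] += 2.0 * np.eye(n_ran)
r1 = T.T @ y

L11, t1 = setup_once(C11, r1)            # done once

estimates = []
for k in range(geno.shape[1]):           # done for each SNP
    W = geno[:, [k]]
    s_hat, var = snp_step(L11, t1, W.T @ T, W.T @ W, W.T @ y)
    estimates.append((s_hat[0], var[0]))

for k, (s, v) in enumerate(estimates):
    print(f"SNP {k}: effect {s:.4f}, sampling variance {v:.4f}")
```

Only the small blocks involving the allele counts are recomputed inside the loop; the factor of the constant block and the vector derived from it are reused unchanged for every SNP.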

Implementation

The strategy described above has been implemented in the mixed model package WOMBAT (Meyer 2007), utilizing the existing capabilities to set up the MME for an arbitrary model and sparse matrix calculations, including Cholesky factorization of the coefficient matrix. Estimation of SNP effects is invoked through a run-time option. In addition to the data, pedigree, and parameter files as required for standard analyses, allele counts for each SNP analysis are expected to be read sequentially from a separate file. The software and user manual together with a worked example illustrating its use for GWAS analyses are available for download from http://didgeridoo.une.edu.au/km/wmbdownloads.php.

Application

Our strategy was applied to estimate effects for 4541 SNPs on age at first corpus luteum in beef cattle. Any missing allele counts were imputed so that marker information was complete. Records were a subset of data analyzed previously (Hawken et al. 2011). The model of analysis fitted five fixed effects and two linear covariables as well as a linear regression on a single SNP effect. Animals’ additive genetic effects were fitted as random effects with the relationship matrix determined from pedigree information. There were 941 animals with genotypes and phenotypes. Including additive genetic effects for parents without records yielded a total of 3858 animals in the model and 3909 equations in total.

Calculations were carried out on a desktop computer with an Intel i7 processor rated at 3.2 GHz. Performing single SNP analyses one by one, inverting the complete coefficient matrix in the MME each time, required a total of 1784 sec CPU time. Estimation using our new strategy reduced this to 16 sec. When the SNP information was repeated 150 times to mimic a high-density chip with 681,150 SNPs, the analysis was completed in 2295 sec. With a set-up time of ∼1 sec, this gave an average of 297 SNPs analyzed per second.
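The reported figures are mutually consistent, as a short check shows. The speed-up factor in the last expression is derived here from the two reported CPU times and is not stated in the original text:

$$
4541 \times 150 = 681{,}150, \qquad
\frac{681{,}150\ \text{SNPs}}{(2295 - 1)\ \text{sec}} \approx 297\ \text{SNPs per sec}, \qquad
\frac{1784\ \text{sec}}{16\ \text{sec}} \approx 111.5 .
$$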

Conclusions

Computational demands for GWAS analyses fitting a full mixed model can be reduced by orders of magnitude by exploiting the fact that a large part of the MME and of the computations involved remains constant across analyses. The computing strategy described to exploit this is straightforward and is readily implemented in existing mixed model software. Savings that can be achieved increase with the number of effects in the mixed model and are proportional to the number of SNP effects considered.

Acknowledgments

This work was supported by Meat and Livestock Australia under grant B.BFG.0050. We are indebted to the CRC for Beef Genetic Technologies for the data used in the example. AGBU is a joint venture of New South Wales Department of Primary Industries and the University of New England.

Literature Cited

  1. Aulchenko Y. S., de Koning D.-J., Haley C., 2007. Genomewide rapid association using mixed model and regression: A fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177: 577–585
  2. Hawken R. J., Zhang Y., Fortes M. R. S., Collis E., Reverter A., et al., 2011. Dissecting the genetics underlying reproduction rate in tropically adapted beef cattle. In Applied Genomics for Sustainable Livestock Breeding. Sir Mark Oliphant Conferences, Melbourne
  3. Johnston D. J., Graser H.-U., 2010. Estimated gene frequencies of GeneSTAR markers and their size of effects on meat tenderness, marbling, and feed efficiency in temperate and tropical beef cattle breeds across a range of production systems. J. Anim. Sci. 88: 1917–1935
  4. Kang H. M., Sul J. H., Service S. K., Zaitlen N. A., Kong S.-Y., et al., 2010. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42: 348–354
  5. Meyer K., 2007. WOMBAT – a tool for mixed model analyses in quantitative genetics by REML. J. Zhejiang Uni. SCIENCE B 8: 815–821
  6. Price A. L., Zaitlen N. A., Reich D., Patterson N., 2010. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11: 459–463
  7. Zhang Z., Ersoz E., Lai C.-Q., Todhunter R. J., Tiwari H. K., et al., 2010. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42: 355–360

Articles from Genetics are provided here courtesy of Oxford University Press
