“SNP Snappy”: A Strategy for Fast Genome-Wide Association Studies Fitting a Full Mixed Model

Karin Meyer; Bruce Tier

Genetics. 2012 Jan;190(1):275–277. doi: 10.1534/genetics.111.134841

Editor: G A Churchill

PMCID: PMC3249377 PMID: 22021386

Abstract

A strategy to reduce computational demands of genome-wide association studies fitting a mixed model is presented. Improvements are achieved by utilizing a large proportion of calculations that remain constant across the multiple analyses for individual markers involved, with estimates obtained without inverting large matrices.

GENOME-WIDE association studies (GWAS) have become a routine task for geneticists in a range of areas. Analyses employing a mixed model are widely used as this provides a flexible framework to account for systematic differences and covariances due to other sources, such as population stratification and a family structure among genotyped individuals (Kang et al. 2010; Price et al. 2010; Zhang et al. 2010). A common type of investigation involves solving a system of mixed model equations (MME) fitting one or a few single nucleotide polymorphism markers (SNP) at a time, treating SNP effects as covariables, with variance components fixed at their estimates from an analysis omitting SNP. Typically this is done by inverting the coefficient matrix in the MME for each analysis. While individual analyses take only seconds, analyzing all markers for a high-density chip imposes a considerable computational burden. Hence, estimation of SNP effects by first fitting the mixed model excluding any SNP effects and then applying the SNP-wise analysis to the resulting residuals has been suggested (Aulchenko et al. 2007). However, this may lead to biased results if genotypes are not randomized across the effects in the model or if SNP effects and population strata are partially confounded. A typical example is an analysis comprising animals of different breeds with different allele frequencies (Johnston and Graser 2010).

When fitting the full model, we can partition the pertaining MME into a small part due to SNP effects and a part due to the other effects in the model. For complete genotype information only the former changes as different SNPs are considered. This structure can be exploited to reduce computational requirements. We present the strategy to do so, describe its implementation in freely available mixed model software, and show an example application.

Computing Strategy

Consider a mixed model

$$y = X b^{k} + Z u^{k} + W^{k} s^{k} + e^{k}, \tag{1}$$

with y, $b^{k}$, $u^{k}$, $s^{k}$, and $e^{k}$ denoting the vectors of observations (phenotypes), fixed effects other than SNP effects, random effects, SNP effects, and residuals, respectively, and X, Z, and $W^{k}$ the incidence matrices pertaining to $b^{k}$, $u^{k}$, and $s^{k}$. As emphasized by the superscript k, only $W^{k}$ differs between analyses for different SNPs, with the elements of $W^{k}$ equal to the number of copies of the reference allele (0, 1, or 2 in a biallelic model) for the SNP(s) in the kth analysis. To estimate ${\hat{s}}^{k}$ we need to solve the MME

$$\begin{pmatrix} X' R^{-1} X & X' R^{-1} Z & X' R^{-1} W^{k} \\ Z' R^{-1} X & Z' R^{-1} Z + G^{-1} & Z' R^{-1} W^{k} \\ W^{k\prime} R^{-1} X & W^{k\prime} R^{-1} Z & W^{k\prime} R^{-1} W^{k} \end{pmatrix} \begin{pmatrix} \hat{b}^{k} \\ \hat{u}^{k} \\ \hat{s}^{k} \end{pmatrix} = \begin{pmatrix} X' R^{-1} y \\ Z' R^{-1} y \\ W^{k\prime} R^{-1} y \end{pmatrix}, \tag{2}$$

with R = Var(e) and G = Var(u) the covariance matrices among residuals and random effects, respectively. Rewrite Equation 2 as

$$\begin{pmatrix} C_{11} & C_{12}^{k} \\ C_{21}^{k} & C_{22}^{k} \end{pmatrix} \begin{pmatrix} \hat{v}^{k} \\ \hat{s}^{k} \end{pmatrix} = \begin{pmatrix} r_{1} \\ r_{2}^{k} \end{pmatrix}, \tag{3}$$

with C₁₁ of size n₁ × n₁ denoting the part of the coefficient matrix that is constant and r₁ the pertaining vector of right-hand sides, $C_{22}^{k}$, of size n₂ × n₂, and $r_{2}^{k}$ the corresponding terms for the effects changing with each analysis, and $C_{12}^{k}$ and $C_{21}^{k}$ the off-diagonal blocks in the coefficient matrix. With ${\hat{v}}^{k}$ (the stacked estimates ${\hat{b}}^{k}$ and ${\hat{u}}^{k}$) generally not of interest, we can estimate ${\hat{s}}^{k}$ as a solution to

$$\left(C_{22}^{k} - C_{21}^{k} C_{11}^{-1} C_{12}^{k}\right) \hat{s}^{k} = r_{2}^{k} - C_{21}^{k} C_{11}^{-1} r_{1}. \tag{4}$$
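As a numerical illustration of this reduction, the following sketch compares the SNP part of the solution of a full partitioned system with the solution of the reduced system. The matrix sizes, the random positive definite matrix standing in for the MME coefficient matrix, and all variable names are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 6, 2  # hypothetical block sizes: constant part and SNP part

# Random symmetric positive definite matrix standing in for the MME coefficients
A = rng.standard_normal((n1 + n2, n1 + n2))
C = A @ A.T + (n1 + n2) * np.eye(n1 + n2)
r = rng.standard_normal(n1 + n2)

# Partition as in the block form of the MME
C11, C12 = C[:n1, :n1], C[:n1, n1:]
C21, C22 = C[n1:, :n1], C[n1:, n1:]
r1, r2 = r[:n1], r[n1:]

# Reference: solve the complete system and keep the last n2 solutions
s_full = np.linalg.solve(C, r)[n1:]

# Reduced system: absorb the constant block into the SNP block
C11_inv_C12 = np.linalg.solve(C11, C12)
C11_inv_r1 = np.linalg.solve(C11, r1)
s_reduced = np.linalg.solve(C22 - C21 @ C11_inv_C12, r2 - C21 @ C11_inv_r1)

print(np.allclose(s_full, s_reduced))
```

Only the small n₂ × n₂ system depends on the SNP block here; the two solves involving C₁₁ are the costly part that the remainder of the strategy avoids repeating.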

With n₂ small, inversion of the coefficient matrix and direct solution of Equation 4 is undemanding. While $C_{11}^{-1}$ remains constant and thus needs only to be determined once, computations for the inversion are proportional to $n_{1}^{3}$ and thus can be nontrivial for large n₁. Fortunately, we can obtain ${\hat{s}}^{k}$ without inverting C₁₁. Let

$$L = \begin{pmatrix} L_{11} & 0 \\ L_{21}^{k} & L_{22}^{k} \end{pmatrix} \tag{5}$$

denote the Cholesky factor of the coefficient matrix in Equation 3 with $C_{11} = L_{11} L_{11}'$, $C_{21}^{k} = L_{21}^{k} L_{11}'$, and $C_{22}^{k} = L_{21}^{k} L_{21}^{k\prime} + L_{22}^{k} L_{22}^{k\prime}$. Substituting these terms in Equation 4 yields

$$L_{22}^{k} L_{22}^{k\prime} \hat{s}^{k} = r_{2}^{k} - L_{21}^{k} \hat{t}_{1} \quad \text{with} \quad \hat{t}_{1} = L_{11}^{-1} r_{1}. \tag{6}$$
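The substitution can be written out term by term. The derivation below is a worked sketch that uses only the Cholesky identities stated for the blocks of the coefficient matrix, together with $C_{12}^{k} = C_{21}^{k\prime} = L_{11} L_{21}^{k\prime}$, which follows from the symmetry of the coefficient matrix.

```latex
\begin{align*}
C_{21}^{k} C_{11}^{-1} C_{12}^{k}
  &= L_{21}^{k} L_{11}' \,(L_{11} L_{11}')^{-1}\, L_{11} L_{21}^{k\prime}
   = L_{21}^{k} L_{21}^{k\prime}, \\
C_{22}^{k} - C_{21}^{k} C_{11}^{-1} C_{12}^{k}
  &= L_{21}^{k} L_{21}^{k\prime} + L_{22}^{k} L_{22}^{k\prime}
     - L_{21}^{k} L_{21}^{k\prime}
   = L_{22}^{k} L_{22}^{k\prime}, \\
C_{21}^{k} C_{11}^{-1} r_{1}
  &= L_{21}^{k} L_{11}' \,(L_{11}')^{-1} L_{11}^{-1}\, r_{1}
   = L_{21}^{k} L_{11}^{-1} r_{1}
   = L_{21}^{k} \hat{t}_{1}.
\end{align*}
```

The first two lines give the left-hand side of the reduced system and the third gives the adjustment to its right-hand side, so that no inverse of C₁₁ appears explicitly.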

This suggests that estimates ${\hat{s}}^{k}$ for k = 1, …, K can be obtained efficiently by splitting computations as follows.

To be performed once

Set up C₁₁ and r₁, i.e., the MME omitting SNP effects.
Perform the Cholesky factorization of C₁₁ to obtain L₁₁.
Determine ${\hat{t}}_{1}$ as a solution to $L_{11} {\hat{t}}_{1} = r_{1}$. With L₁₁ triangular, this involves forward substitution steps
$$\hat{t}_{1} = r_{1} / \ell_{11}$$
and
$$\hat{t}_{i} = \Bigl(r_{i} - \sum_{j=1}^{i-1} \ell_{ij} \hat{t}_{j}\Bigr) / \ell_{ii} \quad \text{for } i = 2, \ldots, n_{1},$$

for r_i and ${\hat{t}}_{i}$ the ith elements of the vectors r₁ and ${\hat{t}}_{1}$, and ℓ_ij the ijth element of L₁₁ (in the two element-wise expressions, r₁ and ${\hat{t}}_{1}$ denote the first elements of these vectors).
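The forward substitution step can be sketched in a few lines. The function name `forward_substitution` and the small random test matrix are assumptions for illustration; a production implementation would operate on a sparse factor rather than a dense array.

```python
import numpy as np

def forward_substitution(L, r):
    """Solve L t = r for a lower-triangular L, one element at a time."""
    n = L.shape[0]
    t = np.zeros(n)
    for i in range(n):
        # Subtract contributions of already-solved elements, then scale by the diagonal
        t[i] = (r[i] - L[i, :i] @ t[:i]) / L[i, i]
    return t

# Hypothetical constant block C11 and right-hand side r1
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C11 = A @ A.T + 5 * np.eye(5)
r1 = rng.standard_normal(5)

L11 = np.linalg.cholesky(C11)       # lower-triangular factor, C11 = L11 L11'
t1 = forward_substitution(L11, r1)  # computed once and reused for every SNP

print(np.allclose(L11 @ t1, r1))
```

Because neither L₁₁ nor r₁ depends on the SNP being analyzed, the resulting vector can be stored and reused across all analyses.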

To be performed for each set of SNPs

Determine parts of the MME specific to the kth analysis, $C_{21}^{k}$, $C_{22}^{k}$, and $r_{2}^{k}$.
Set up L*, representing the intermediate matrix arising in factorizing the coefficient matrix in Equation 3 after rows 1 to n₁ have been processed:
$$L^{*} = \begin{pmatrix} L_{11} & 0 \\ C_{21}^{k} & C_{22}^{k} \end{pmatrix}.$$
“Complete” the factorization steps for rows n₁ + 1 to n₁ + n₂ using
$$\ell_{ij}^{*} = \Bigl(\ell_{ij}^{*} - \sum_{m=1}^{j-1} \ell_{im}^{*} \ell_{jm}^{*}\Bigr) / \ell_{jj}^{*} \ \text{ for } j < i \quad \text{and} \quad \ell_{ii}^{*} = \sqrt{\ell_{ii}^{*} - \sum_{m=1}^{i-1} (\ell_{im}^{*})^{2}}$$

(for $\ell_{ij}^{*}$ the ijth element of L*, with each element overwritten in place). Processing columns 1 to n₁ column-wise replaces $C_{21}^{k}$ in L* with $L_{21}^{k}$. The remaining elements (in columns n₁ + 1 to n₁ + n₂) are then adjusted row-wise, overwriting $C_{22}^{k}$ with $L_{22}^{k}$.

Determine a generalized inverse of $L_{22}^{k}$, $L_{22}^{k-}$, to obtain ${\hat{s}}^{k} = (L_{22}^{k-})' L_{22}^{k-} (r_{2}^{k} - L_{21}^{k} {\hat{t}}_{1})$. Sampling variances of ${\hat{s}}^{k}$ are given by the diagonal elements of $(L_{22}^{k-})' L_{22}^{k-}$.

Note that $L_{22}^{k}$ can have diagonal elements of zero if a SNP is monomorphic or if SNPs with proportional allele counts are considered simultaneously. This is accounted for in the generalized inverse. If n₂ is not small or if sampling variances are not required, an alternative way to solve for ${\hat{s}}^{k}$ is a series of forward and backward substitution steps.
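The per-SNP steps can be illustrated end to end on a toy data set. This is a minimal dense sketch, not the implementation used by the authors. It assumes R = σ²ₑI, unrelated animals (G = σ²ₐI), one record per animal, simulated allele counts, and SNP blocks of full rank, so that an ordinary inverse stands in for the generalized inverse. All names (`complete_factor`, `snp_effect`, `snp_effect_direct`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 5                 # hypothetical numbers of records and SNPs
var_e, var_a = 1.0, 0.5      # variance components, assumed known

X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # mean + one covariable
Z = np.eye(n)                # one record per animal (assumption)
G_inv = np.eye(n) / var_a    # unrelated animals, G = var_a * I (assumption)
W_all = rng.binomial(2, 0.4, size=(n, K)).astype(float)    # simulated allele counts
y = rng.standard_normal(n)
p = X.shape[1]

# --- performed once: MME without SNP effects, its factor, and t1 ---
T = np.hstack([X, Z])
C11 = T.T @ T / var_e
C11[p:, p:] += G_inv
r1 = T.T @ y / var_e
n1 = C11.shape[0]
L11 = np.linalg.cholesky(C11)
t1 = np.linalg.solve(L11, r1)  # stands in for forward substitution

def complete_factor(C21, C22):
    """Finish the Cholesky factorization for the last n2 rows of L*."""
    n2 = C22.shape[0]
    L = np.zeros((n1 + n2, n1 + n2))
    L[:n1, :n1] = L11              # copied for clarity; real code would reuse storage
    L[n1:, :n1] = C21
    L[n1:, n1:] = np.tril(C22)
    for i in range(n1, n1 + n2):   # processed row by row; same arithmetic as in the text
        for j in range(i):
            L[i, j] = (L[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        L[i, i] = np.sqrt(L[i, i] - L[i, :i] @ L[i, :i])
    return L[n1:, :n1], L[n1:, n1:]   # L21, L22

def snp_effect(W):
    """Estimates and sampling variances for the SNP(s) in the columns of W."""
    C21, C22, r2 = W.T @ T / var_e, W.T @ W / var_e, W.T @ y / var_e
    L21, L22 = complete_factor(C21, C22)
    L22_inv = np.linalg.inv(L22)      # assumes L22 has no zero diagonal elements
    V = L22_inv.T @ L22_inv           # inverse of L22 L22'
    return V @ (r2 - L21 @ t1), np.diag(V)

def snp_effect_direct(W):
    """Reference: build and invert the complete coefficient matrix."""
    M = np.hstack([T, W])
    C = M.T @ M / var_e
    C[p:n1, p:n1] += G_inv
    C_inv = np.linalg.inv(C)
    return (C_inv @ (M.T @ y / var_e))[n1:], np.diag(C_inv)[n1:]

results = [snp_effect(W_all[:, [k]]) for k in range(K)]
```

In this sketch only `complete_factor` and the small n₂ × n₂ inverse are repeated per SNP, while the factorization of C₁₁ is done once. `snp_effect_direct` is included purely as a check that both routes give the same estimates and sampling variances.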

Implementation

The strategy described above has been implemented in the mixed model package WOMBAT (Meyer 2007), utilizing its existing capabilities to set up the MME for an arbitrary model and to perform sparse matrix calculations, including Cholesky factorization of the coefficient matrix. Estimation of SNP effects is invoked through a run-time option. In addition to the data, pedigree, and parameter files required for standard analyses, allele counts for each SNP analysis are expected to be read sequentially from a separate file. The software and user manual, together with a worked example illustrating its use for GWAS analyses, are available for download from http://didgeridoo.une.edu.au/km/wmbdownloads.php.

Application

Our strategy was applied to estimate effects for 4541 SNPs on age at first corpus luteum in beef cattle. Any missing allele counts were imputed so that marker information was complete. Records were a subset of data analyzed previously (Hawken et al. 2011). The model of analysis fitted five fixed effects and two linear covariables as well as a linear regression on a single SNP effect. Animals’ additive genetic effects were fitted as random effects with the relationship matrix determined from pedigree information. There were 941 animals with genotypes and phenotypes. Including additive genetic effects for parents without records yielded a total of 3858 animals in the model and 3909 equations in total.

Calculations were carried out on a desktop computer with an Intel i7 processor rated at 3.2 GHz. Performing single-SNP analyses one by one, inverting the complete coefficient matrix in the MME each time, required a total of 1784 sec of CPU time. Estimation using our new strategy reduced this to 16 sec. Repeating the SNP information 150 times to mimic a high-density chip with 681,150 SNPs, the analysis was completed in 2295 sec. With a set-up time of ∼1 sec, this gave an average of 297 SNPs analyzed per second.
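The reported figures can be reproduced with simple arithmetic. The short check below assumes that the set-up time is subtracted before computing the throughput; the speed-up ratio is derived here from the two reported CPU times and is not a figure stated by the authors.

```python
n_snps = 4541            # SNPs in the original analysis
replicates = 150         # times the SNP information was repeated
total_snps = n_snps * replicates

cpu_total = 2295.0       # sec for the mimicked high-density chip
setup = 1.0              # approximate set-up time in sec
snps_per_sec = total_snps / (cpu_total - setup)

speedup = 1784 / 16      # CPU time with repeated inversion vs. the new strategy

print(total_snps, round(snps_per_sec), speedup)
```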

Conclusions

Computational demands for GWAS analyses fitting a full mixed model can be reduced by orders of magnitude by exploiting the fact that a large part of the MME, and of the computations involved, remains constant across analyses. The computing strategy described to exploit this is straightforward and is readily implemented in existing mixed model software. Savings that can be achieved increase with the number of effects in the mixed model and are proportional to the number of SNP effects considered.

Acknowledgments

This work was supported by Meat and Livestock Australia under grant B.BFG.0050. We are indebted to the CRC for Beef Genetic Technologies for the data used in the example. AGBU is a joint venture of New South Wales Department of Primary Industries and the University of New England.

Literature Cited

Aulchenko Y. S., de Koning D.-J., Haley C., 2007. Genomewide rapid association using mixed model and regression: A fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177: 577–585
Hawken R. J., Zhang Y., Fortes M. R. S., Collis E., Reverter A., et al., 2011. Dissecting the genetics underlying reproduction rate in tropically adapted beef cattle. In Applied Genomics for Sustainable Livestock Breeding. Sir Mark Oliphant Conferences, Melbourne
Johnston D. J., Graser H.-U., 2010. Estimated gene frequencies of GeneSTAR markers and their size of effects on meat tenderness, marbling, and feed efficiency in temperate and tropical beef cattle breeds across a range of production systems. J. Anim. Sci. 88: 1917–1935
Kang H. M., Sul J. H., Service S. K., Zaitlen N. A., Kong S.-Y., et al., 2010. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42: 348–354
Meyer K., 2007. WOMBAT – a tool for mixed model analyses in quantitative genetics by REML. J. Zhejiang Uni. SCIENCE B 8: 815–821
Price A. L., Zaitlen N. A., Reich D., Patterson N., 2010. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11: 459–463
Zhang Z., Ersoz E., Lai C.-Q., Todhunter R. J., Tiwari H. K., et al., 2010. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42: 355–360
