Author manuscript; available in PMC: 2013 Mar 14.
Published in final edited form as: Nat Methods. 2012 May 30;9(6):525–526. doi: 10.1038/nmeth.2037

Improved linear mixed models for genome-wide association studies

Jennifer Listgarten 1,5, Christoph Lippert 1,2,5, Carl M Kadie 3, Robert I Davidson 3, Eleazar Eskin 4, David Heckerman 1,5
PMCID: PMC3597090  NIHMSID: NIHMS448825  PMID: 22669648

To the Editor: The use of linear mixed models (LMMs) in genome-wide association studies (GWAS) is now widely accepted [1] because LMMs have been shown to be capable of correcting for several forms of confounding due to genetic relatedness, such as population structure and familial relatedness [1], and because recent advances have made them computationally efficient [1,2]. LMMs tackle confounding by using a matrix of pairwise genetic similarities to model the relatedness among subjects. The consensus until now has been that all available single-nucleotide polymorphisms (SNPs) should be used to determine these similarities [1]. Here, however, we show theoretically and experimentally that carefully selecting a small number of SNPs systematically increases power (that is, it jointly reduces false positives and false negatives), improves calibration (lessens inflation or deflation of the test statistic) and reduces computational cost.

Our approach is motivated by two considerations. First, an LMM with no fixed effects using genetic similarities constructed from a set of SNPs is mathematically equivalent to a linear regression of the phenotype on the SNPs (with weights integrated over independent normal distributions having the same variance, namely the genetic variance) [3]. That is, an LMM using a given set of SNPs for genetic similarity is equivalent to (Bayesian) linear regression using those SNPs as covariates to correct for confounding. In theory, this equivalence holds only for certain forms of genetic similarity matrices, such as the realized relationship matrix [2,3]. In practice, however, the realized relationship matrix and other measures of similarity, such as identity by state [1], yield very similar measures of association (Supplementary Note 1), and thus our demonstration is quite general.
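The equivalence can be written out in a few lines. The following is a sketch under stated assumptions, not the derivation from ref. 3: the symbols are introduced here for illustration, with y the phenotype vector for n individuals, X the n × s matrix of SNPs used for genetic similarity (assumed standardized, with any 1/s normalization absorbed into X), β the SNP weights, σ_g² the genetic variance and σ_e² the residual variance.

```latex
% Bayesian linear regression of the phenotype on the s similarity SNPs
y = X\beta + \varepsilon, \qquad
\beta \sim \mathcal{N}\!\left(0,\ \sigma_g^2 I_s\right), \qquad
\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I_n\right)

% Integrating out the weights gives the LMM marginal likelihood
p(y) = \int \mathcal{N}\!\left(y \mid X\beta,\ \sigma_e^2 I_n\right)
            \mathcal{N}\!\left(\beta \mid 0,\ \sigma_g^2 I_s\right) d\beta
     = \mathcal{N}\!\left(y \mid 0,\ \sigma_g^2 K + \sigma_e^2 I_n\right),
\qquad K = X X^{\mathsf{T}}
```

Under these assumptions K plays the role of the realized relationship matrix, so choosing which SNPs enter K is the same decision as choosing which SNPs enter the regression as covariates.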

Second, regardless of the form of regression used for GWAS, the significance of SNP-phenotype association should be determined by conditioning on exactly those SNPs that are associated with the phenotype. These SNPs include causal SNPs, or those nearby that tag causal SNPs, and SNPs that are associated by way of confounding (for example, because of population structure). By conditioning on causal or tagging SNPs, we reduce the noise in the assessment of the association [4]. By conditioning on SNPs associated because of confounding, we control for such confounding [5]. Moreover, if a SNP is unrelated to the phenotype, it should not be in the conditioning set. In the particular case in which we use Bayesian linear regression for GWAS, the inclusion of unrelated SNPs in the genetic similarity matrix decreases the relative influence of each SNP on the phenotype (because all SNP weights share the same prior distribution, whose variance, the genetic variance in the LMM view, is estimated from the data). The decrease in influence leads to incomplete correction for confounding and hence inflated test statistics and reduced power. We refer to this phenomenon as ‘dilution.’

To identify SNPs that satisfy these principles, we developed a simple heuristic that yields improved power and calibration. First, we order SNPs by their linear-regression P values from lowest to highest. Then we construct genetic similarity matrices with an increasing number of SNPs as previously ordered until we find the first minimum in λGC (the genomic control factor). In practice, the number of SNPs selected is typically smaller than the number of individuals analyzed, a condition that can be exploited by an existing algorithm, FaST-LMM, to yield large computational savings [2].
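The selection heuristic can be sketched in Python as follows. This is an illustrative sketch, not the FaST-LMM-Select implementation: all function names are hypothetical, the LMM scan itself is left to a caller-supplied function (`lambda_gc_for`), the candidate matrix sizes are chosen by the caller, and "first minimum" is assumed to mean the last size before λGC first increases. The genomic control factor is computed in the usual way, as the median 1-d.f. chi-square statistic divided by the chi-square median (about 0.455).

```python
from statistics import NormalDist, median
from typing import Callable, Dict, List, Optional, Sequence

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.25) ** 2


def genomic_control_lambda(p_values: Sequence[float]) -> float:
    """Genomic control factor from two-sided P values (each in (0, 1])."""
    # A two-sided P value p corresponds to the statistic z^2 with z = Phi^-1(p / 2).
    stats = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(stats) / _CHI2_1DF_MEDIAN


def rank_snps_by_p_value(p_by_snp: Dict[str, float]) -> List[str]:
    """Order SNP identifiers by linear-regression P value, lowest first."""
    return sorted(p_by_snp, key=p_by_snp.get)


def select_snp_count(
    ranked_snps: Sequence[str],
    candidate_sizes: Sequence[int],
    lambda_gc_for: Callable[[Sequence[str]], float],
) -> Optional[int]:
    """Grow the similarity-matrix SNP set until lambda_GC first increases.

    lambda_gc_for (hypothetical) runs an LMM scan with the given SNPs in the
    genetic similarity matrix and returns the resulting lambda_GC.
    """
    best_size, best_lambda = None, None
    for size in candidate_sizes:
        current = lambda_gc_for(ranked_snps[:size])
        if best_lambda is not None and current > best_lambda:
            break  # lambda_GC went up: the previous size was the first minimum
        best_size, best_lambda = size, current
    return best_size
```

In use, `lambda_gc_for` would wrap a full association scan and pass its P values to `genomic_control_lambda`; the search stops as soon as λGC rises, so larger matrices are never built.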

The equivalence between the LMM and Bayesian linear regression also implies that, when a given SNP is being tested, that SNP should be excluded from the computation of genetic similarity to avoid using it as a covariate. Including the SNP would make the log likelihood of the null model higher than it should be and lead to deflation of the test statistic and loss of power. We call this phenomenon ‘proximal contamination’. In addition to the SNP being tested, we also exclude those SNPs in close proximity (for example, within 2 centimorgans), as linkage disequilibrium will lead to a similar deflation and loss of power. A naive algorithm for excluding these from the similarity matrix is computationally expensive, so we developed a speedup (Supplementary Note 2). Together, the linear-regression scan to select SNPs for inclusion in the matrix along with the efficient removal of the test SNPs and those nearby constitute our new approach, FaST-LMM-Select.
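The exclusion rule for proximal contamination can be illustrated with a naive filter. This is a minimal sketch of the rule only, not the speedup described in Supplementary Note 2: the `Snp` record, the function name and the inclusive 2-centimorgan window on the same chromosome are assumptions made for illustration.

```python
from typing import List, NamedTuple, Sequence


class Snp(NamedTuple):
    """Hypothetical SNP record: identifier, chromosome, genetic position in cM."""
    snp_id: str
    chromosome: str
    position_cm: float


def similarity_snps_for_test(
    test_snp: Snp,
    selected_snps: Sequence[Snp],
    window_cm: float = 2.0,
) -> List[Snp]:
    """Return the selected SNPs that may stay in the similarity matrix.

    SNPs on the same chromosome within window_cm of the test SNP (including
    the test SNP itself, at distance 0) are dropped.
    """
    kept = []
    for snp in selected_snps:
        same_chromosome = snp.chromosome == test_snp.chromosome
        if same_chromosome and abs(snp.position_cm - test_snp.position_cm) <= window_cm:
            continue  # too close to the test SNP: would act as a covariate for it
        kept.append(snp)
    return kept
```

Applying this filter separately for every tested SNP means rebuilding the similarity matrix each time, which is the cost the naive approach incurs.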

When applied to Wellcome Trust data for Crohn’s disease [6] (Table 1, Supplementary Fig. 1, Supplementary Table 1 and Supplementary Methods) that includes family members and non-Caucasians, FaST-LMM-Select yielded slightly less inflation, fewer false positives and fewer false negatives (due to lack of dilution) compared to the use of all SNPs while accounting for proximal contamination. When all SNPs were used, proximal contamination had a dramatic effect on calibration and false positives even though correction for it excluded (on average) only 516 of the available 356,441 SNPs from the genetic similarity matrix. Compared with the original version of FaST-LMM, wherein equally spaced SNPs were used to reduce computational demands, FaST-LMM-Select had far better calibration and fewer false positives. FaST-LMM-Select also performed well on synthetic data (Supplementary Note 1) and other real cohorts with substantial genetic structure (Supplementary Note 3).

Table 1.

Comparison of calibration, power and computational costs on a GWAS of Crohn’s disease

Columns 1 to 4 give the algorithm parameters; columns 5 to 10 give the algorithm performance.

| Algorithm | SNP selection method | No. SNPs for matrix | Proximal contamination avoided? | λGC | No. false positives | No. false negatives | Runtime without speedup (min) | Runtime with speedup (min) | Memory usage (GB) |
|---|---|---|---|---|---|---|---|---|---|
| FaST-LMM-Select | Select | 310 | Yes | 1.08 | 0 | 1 | 1.3 × 10^3 | 45 | <1 |
| FaST-LMM (all) | All | All | Yes | 1.09 | 2 | 2 | 4.0 × 10^6 | 4,567 | 86 |
| FaST-LMM (orig 310) | Equally spaced | 310 | Yes | 1.26 | 9 | 1 | 1.1 × 10^3 | 6 | <1 |
| FaST-LMM (orig 4,000) | Equally spaced | 4,000 | Yes | 1.17 | 5 | 1 | 2.1 × 10^5 | 30 | 2 |
| Traditional | All | All | No | 0.97 | 2 | 6 | 4.2 × 10^1 | NA | 45 |

The original version of FaST-LMM, which used equally spaced SNPs to estimate genetic similarity, was evaluated using 310 SNPs (the same number used by FaST-LMM-Select) and 4,000 SNPs (as used in the original version of FaST-LMM (ref. 2)). The five algorithms yielded substantially different P values (Supplementary Fig. 1), which in turn led to different SNPs being deemed significant (using the P value threshold of 5 × 10^-7 (ref. 6)). Previous studies were used to determine the gold standard in order to label the false positive and false negative loci (Supplementary Table 1). Details of the analysis are described in the Supplementary Methods.

FaST-LMM-Select is available at http://mscompbio.codeplex.com/.

ACKNOWLEDGMENTS

We thank J. Carlson for help with tools to manage and analyze the data and P. Palamara for cataloging the positions and genetic distances of SNPs in the data for Crohn’s disease. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk/. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475. E.E. is supported by US National Science Foundation grants 0916676 and 1065276 and by US National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568 and P01-HL28481.

Footnotes

Note: Supplementary information is available at http://www.nature.com/doifinder/10.1038/nmeth.2037.

AUTHOR CONTRIBUTIONS

J.L., C.L. and D.H. designed and performed the research, contributed analytic tools, analyzed data and wrote the paper. C.M.K. and R.I.D. contributed analytic tools. E.E. helped to write the paper.

COMPETING FINANCIAL INTERESTS

The authors declare competing financial interests: details are available at http://www.nature.com/doifinder/10.1038/nmeth.2037.

Contributor Information

Jennifer Listgarten, Email: jennl@microsoft.com.

Christoph Lippert, Email: christoph.lippert@tuebingen.mpg.de.

David Heckerman, Email: heckerma@microsoft.com.
