Author manuscript; available in PMC: 2013 Aug 1.
Published in final edited form as: Nat Genet. 2012 Jul 22;44(8):955–959. doi: 10.1038/ng.2354

Fast and accurate genotype imputation in genome-wide association studies through pre-phasing

Bryan Howie 1,6, Christian Fuchsberger 2,6, Matthew Stephens 1,3, Jonathan Marchini 4,5, Gonçalo R Abecasis 2
PMCID: PMC3696580  NIHMSID: NIHMS475535  PMID: 22820512

Abstract

Sequencing efforts, including the 1000 Genomes Project and disease-specific efforts, are producing large collections of haplotypes that can be used for genotype imputation in genome-wide association studies (GWAS). Imputing from these reference panels can help identify new risk alleles, but the use of large panels with existing methods imposes a high computational burden. To keep imputation broadly accessible, we introduce a strategy called “pre-phasing” that maintains the accuracy of leading methods while cutting computational costs by orders of magnitude. In brief, we first statistically estimate the haplotypes for each GWAS individual (“pre-phasing”) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because: (i) the GWAS samples must be phased only once, whereas standard methods would implicitly re-phase with each reference panel update; (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match unphased GWAS genotypes to a pair of reference haplotypes. This strategy will be particularly valuable for repeated imputation as reference panels evolve.


Genotype imputation is a key step in the analysis of genome-wide association studies. The approach works by finding haplotype segments that are shared between study individuals, which are typically genotyped on a commercial array with 300,000–2,500,000 SNPs, and a reference panel of more densely typed individuals, such as those provided by The International HapMap Project1,2, The 1000 Genomes Project3, or by sequencing a subset of study individuals.

Imputation methods can accurately estimate genotypes, or genotype probabilities, at markers that have not been directly examined in a GWAS. Imputed genotypes are now routinely used to increase the power of GWAS analyses, guide fine-mapping efforts, and facilitate meta-analysis of studies genotyped on different marker sets4,5.

The maturation of high-throughput genotyping and sequencing technologies has led to a rapid increase in the size of publicly available reference datasets. For example, whereas HapMap Phase II included 210 unrelated individuals typed at ~4M SNPs, the Phase I variant call set from The 1000 Genomes Project (released March 2012) includes 1,092 individuals typed at >38M polymorphic sites. This dataset will grow to >2,000 individuals typed at an even greater number of sites in future phases of the project, and other sequencing efforts are also producing large genetic variation resources.

These developments can provide immediate benefits to GWAS through imputation: a more complete catalog of variants increases the chances that a causal or trait-associated variant will be imputed, and reference panels with more haplotypes increase imputation accuracy and power for downstream association analysis, especially for variants with low allele frequencies4,5. On the other hand, many existing genotype imputation methods require substantial computing power to run using large reference datasets. This problem is compounded by the fact that reference collections are now regularly improved and expanded, so that investigators might benefit from re-imputing their samples multiple times over the course of a study.

Here, we propose a practical solution that maintains imputation accuracy while greatly reducing computational costs. Our approach is motivated by the observation that imputation methods spend much of their time accounting for the unknown phase of GWAS genotypes. Some methods do this through analytical calculations that integrate over all possible phase configurations for each study individual, while other methods average the imputation probabilities across multiple haplotypes sampled by a phasing algorithm4. Both approaches have limitations. Analytical phase integration becomes computationally expensive as reference panels grow (as we show below) and is only possible when the study individuals are treated independently, which sacrifices linkage disequilibrium (LD) information in the GWAS data. Sampling-based methods can scale better with reference panel size and capture LD information to improve imputation accuracy6, but they may still have non-trivial computational costs because of the need to sample and impute into several haplotype configurations for each individual.

In our new approach, we first statistically estimate the haplotypes underlying the GWAS genotypes (“pre-phasing”), then impute into these haplotypes as if they were correct; see Figure 1 for an illustration of a traditional workflow and the more efficient workflow proposed here. Imputing into pre-phased haplotypes is known to be fast, and it is highly accurate when the haplotypes are estimated through genotyped family members7,8 or long segments of recent shared ancestry9. These two phasing techniques cannot be used on unrelated individuals from outbred populations (a common study design in GWAS), which means many datasets can only be phased by statistical algorithms that yield lower-quality (but still reasonable) haplotypes. The central aims of this work are (i) to show that the GWAS haplotypes estimated by existing algorithms can produce accurate imputation and (ii) to quantify the efficiency gains of pre-phasing the study genotypes. We assume throughout that the reference genotypes were also phased prior to imputation, as is typical for public reference datasets.

Figure 1.


Imputation schematic. Each box represents a genetic dataset and each arrow represents an analysis step. The sizes of the boxes reflect the relative numbers of genotypes they contain, and the widths of the arrows reflect the relative computational costs of the analyses. Given a single GWAS dataset (red box), successively larger reference panels (blue boxes) lead to larger and more accurate imputed datasets (orange boxes). The computational cost of imputation is much lower when using pre-phased GWAS haplotypes (green box, right-hand side) than when using traditional imputation approaches (left-hand side).

RESULTS

Pre-phasing run-time performance

To illustrate the computational advantages of pre-phasing, we analyzed a GWAS dataset of 2,490 individuals from the 1958 British Birth Cohort of the Wellcome Trust Case Control Consortium 2 (WTCCC2; ref. 10). We imputed this dataset from a series of reference panels using related imputation methods that account for phase uncertainty in different ways (Table 1). IMPUTE version 1 (“IMPUTE1”; ref. 11) uses an analytical integration strategy. This was relatively efficient with a reference panel of 60 individuals (41 minutes per genome with 1000 Genomes Pilot data), but the computational burden grew quickly as haplotypes were added to the reference set. By contrast, IMPUTE version 2 (“IMPUTE2”; ref. 6) uses a haplotype sampling strategy. This approach scaled more favorably with larger reference panels, but it still required 512 minutes per genome to impute from the latest 1000 Genomes panel. By comparison, an updated version of IMPUTE2 that uses our proposed approach required a one-time pre-phasing investment of 25 minutes per genome, then just 24 minutes to impute each sample from the largest reference panel. We observed similar trends with MaCH (ref. 12; which typically uses a similar approach to IMPUTE1) and minimac (which performs imputation with pre-phased haplotypes in the MaCH framework), as shown in Supplementary Table 1.

Table 1.

Running times in WTCCC2 controls (in CPU minutes needed to impute one individual genome-wide) for different imputation methods and reference panels.

| Reference panel (a) | IMPUTE1 | IMPUTE2 (sampling) (b) | IMPUTE2 (pre-phasing) (c) |
|---|---|---|---|
| HapMap 2 CEU (60 indiv, 2.5M SNPs) | 14 | 31 | <1 |
| 1000G CEU (60 indiv, 7.3M SNPs) | 41 | 48 | 1 |
| 1000G EUR (283 indiv, 11.6M SNPs) | 1,287 | 144 | 6 |
| 1000G EUR (381 indiv, 37.4M SNPs) | 7,800 (d) | 512 | 24 |

(a) Reference panels, in order: HapMap 2 release #22; 1000 Genomes low-coverage pilot (June 2010); 1000 Genomes interim release (Aug 2010); 1000 Genomes interim Phase I release (Nov 2010).

(b) Version of the IMPUTE2 algorithm published by Howie et al. (2009). This method averages the imputation results across 20 sampled haplotype configurations per individual.

(c) Running times do not include the initial investment required to phase the GWAS genotypes, which took 25 minutes per individual.

(d) Projected running time extrapolated from existing benchmarks.

Haplotype estimation

These results demonstrate that pre-phasing can greatly speed up the imputation process, but the accuracy of imputation with this shortcut may depend on how well the GWAS haplotypes were estimated. The accuracy of computationally estimated haplotypes depends on a number of factors including marker density, relatedness of sampled individuals, sample size, and demography (refs. 12,13). In founder populations (ref. 14), long-range haplotypes can be estimated very accurately even with modest sample sizes (ref. 9). For example, by comparing the results of population- and trio-based phasing in Finnish samples from the FUSION study of type 2 diabetes (refs. 2,15), we estimate that population phasing produces <1 switch error (ref. 16) per 5.5 megabases (Mb). These results were aided by the relatively large number of genotyped individuals (>2,000) and by the fact that Finland is a founder population in which long haplotypes are shared between apparently unrelated individuals. In more diverse populations, haplotype estimation may often be less accurate. For example, average distances between switches in European GWAS datasets are typically in the range of 0.6 to 1.4 Mb (ref. 17).
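To make the switch-error metric concrete, the following is a minimal sketch of how switch errors could be counted when an estimated haplotype is compared against a trusted one (for example, a trio-phased haplotype). The function names and the toy data are hypothetical and are not taken from the software used in this study.

```python
def count_switch_errors(true_hap, est_hap, het_sites):
    """Count switch errors in one estimated haplotype of one individual.

    true_hap, est_hap: sequences of 0/1 alleles for the same individual.
    het_sites: indices of heterozygous sites, in genomic order (phase is
    only defined at heterozygous sites).
    """
    switches = 0
    prev_match = None
    for i in het_sites:
        match = true_hap[i] == est_hap[i]
        # A switch occurs when the estimate flips from tracking one true
        # haplotype to tracking the other between adjacent het sites.
        if prev_match is not None and match != prev_match:
            switches += 1
        prev_match = match
    return switches


def mean_switch_spacing_mb(region_length_mb, n_switches):
    """Average distance between switches, in Mb (infinite if none occur)."""
    return region_length_mb / n_switches if n_switches else float("inf")


# Toy example: the estimate follows the true haplotype, flips to the other
# haplotype for two sites, then flips back, giving two switch errors.
truth = [0, 1, 0, 1, 1, 0]
estimate = [0, 1, 1, 0, 1, 0]
n = count_switch_errors(truth, estimate, range(len(truth)))
```

Dividing the length of the compared region by the switch count gives a spacing that can be read against the "distance between switches" figures quoted above.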

Genotype imputation accuracy in diverse populations

Here, we evaluate whether pre-phased haplotypes can be used to accurately impute missing genotypes in three GWAS datasets sampled from diverse populations: the WTCCC2 data described above, a European-American case-control dataset from a psoriasis study by the Genetic Association Information Network (GAIN; ref. 18), and a set of African Americans from the Women’s Health Initiative (WHI; ref. 19). In each dataset we masked and imputed a subset of the genotyped SNPs, as detailed in Table 2. We measured imputation accuracy at these SNPs as the average squared correlation (mean $R^2$) between masked array genotypes and imputed allele dosages (posterior mean genotypes).
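The accuracy metric can be sketched as follows. This is an illustrative implementation using only the Python standard library; the function names are hypothetical, and skipping SNPs with no variance is an assumption of this sketch rather than a documented choice of the study.

```python
def dosage(p_het, p_hom_alt):
    """Posterior mean genotype (allele dosage) from genotype probabilities."""
    return p_het + 2.0 * p_hom_alt


def squared_correlation(genotypes, dosages):
    """Squared Pearson correlation between masked genotypes (0/1/2) and
    imputed dosages across individuals at one SNP. Returns None when either
    vector has no variance, since the correlation is then undefined."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_d = sum(dosages) / n
    cov = sum((g - mean_g) * (d - mean_d) for g, d in zip(genotypes, dosages))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    if var_g == 0 or var_d == 0:
        return None
    return cov * cov / (var_g * var_d)


def mean_r2(per_snp_pairs):
    """Average R^2 over SNPs; per_snp_pairs is a list of
    (genotypes, dosages) pairs, one pair per masked SNP."""
    values = [squared_correlation(g, d) for g, d in per_snp_pairs]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)
```

In practice the SNPs would first be grouped into MAF bands (1–3%, 3–5%, >5%) and `mean_r2` applied within each band, as in Table 2.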

Table 2.

Accuracy of different imputation methods and 1000 Genomes reference panels applied to various GWAS datasets.

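| GWAS dataset | Imputation method (a) | Reference panel (b) | Mean R2 (c), MAF 1–3% | Mean R2 (c), MAF 3–5% | Mean R2 (c), MAF >5% |
|---|---|---|---|---|---|
| GAIN Psoriasis (European-American; N = 2,759) | MaCH | 60 CEU | 0.67 | 0.76 | 0.91 |
| | minimac (pre-phasing) | 60 CEU | **0.69** | **0.77** | **0.91** |
| | minimac (pre-phasing) | 283 EUR | **0.73** | **0.78** | **0.92** |
| | minimac (pre-phasing) | 381 EUR | **0.83** | **0.85** | **0.94** |
| WTCCC2 (United Kingdom; N = 2,490) | IMPUTE2 (sampling) | 60 CEU | 0.66 | 0.78 | 0.88 |
| | IMPUTE2 (pre-phasing) | 60 CEU | **0.65** | **0.77** | **0.87** |
| | IMPUTE2 (sampling) | 283 EUR | 0.77 | 0.82 | 0.89 |
| | IMPUTE2 (pre-phasing) | 283 EUR | **0.75** | **0.81** | **0.88** |
| | IMPUTE2 (sampling) | 381 EUR | 0.84 | 0.88 | 0.92 |
| | IMPUTE2 (pre-phasing) | 381 EUR | **0.82** | **0.86** | **0.91** |
| WHI (African-American; N = 8,421) | MaCH | 60 CEU + 59 YRI | 0.51 | 0.73 | 0.83 |
| | minimac (pre-phasing) | 60 CEU + 59 YRI | **0.49** | **0.70** | **0.80** |
| | minimac (pre-phasing) | 283 EUR + 172 AFR | **0.55** | **0.72** | **0.81** |
| | minimac (pre-phasing) | 381 EUR + 174 AFR | **0.61** | **0.75** | **0.83** |
| 1000 Genomes EUR (European ancestry; N = 381) | IMPUTE2 (sampling) | 380 EUR (WTCCC2 SNPs) | 0.82 | 0.86 | 0.92 |
| | IMPUTE2 (pre-phasing) | 380 EUR (WTCCC2 SNPs) | **0.81** | **0.85** | **0.91** |
| | IMPUTE2 (sampling) | 380 EUR (sequence SNPs) | 0.66 | 0.79 | 0.91 |
| | IMPUTE2 (pre-phasing) | 380 EUR (sequence SNPs) | **0.64** | **0.78** | **0.90** |

(a) We imputed each GWAS dataset with an existing imputation method (plain text) and its pre-phasing counterpart (bold text).

(b) Reference panels used to impute each GWAS dataset, in order: 1000 Genomes low-coverage pilot (June 2010); 1000 Genomes interim release (Aug 2010); 1000 Genomes interim Phase I release (Nov 2010).

(c) Each cell shows the mean R2 between true genotypes and imputed dosages for the specified MAF band and reference panel. For a given GWAS dataset, all accuracy values within an MAF band were calculated on the same set of SNPs; the corresponding SNP counts are shown in Figure S1. Accuracy values from pre-phasing are shown in bold (some analyses were performed only with pre-phasing).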

We used the GAIN dataset to compare a well-benchmarked imputation method (MaCH) against a related method that uses pre-phasing (minimac). Encouragingly, both methods produced similar accuracy when applied to a common reference panel of 60 CEU individuals (Table 2). In fact, our pre-phasing strategy generated slightly better results despite ignoring the uncertainty in the estimated GWAS haplotypes, possibly because pre-phasing captures joint LD information that is not used by the analytical phasing/imputation framework of methods like MaCH and IMPUTE1.

We then used the WTCCC2 data to compare pre-phasing against a haplotype sampling approach, both of which were implemented in IMPUTE2 (ref. 6). Our results again show that pre-phasing can provide comparable accuracy to the existing imputation method, although in this case the pre-phasing results were slightly less accurate (Table 2). Both pre-phasing and haplotype sampling capture LD information in the GWAS data, but the sampling approach also accounts for some of the uncertainty in phasing the GWAS genotypes, which could explain why it is more accurate here.

Finally, since it is well established that phasing and imputation can be more challenging in individuals with recent African ancestry because of their reduced LD and higher genetic diversity, we evaluated our pre-phasing approach in the WHI GWAS of African Americans. In this comparison pre-phasing was less accurate than MaCH’s analytical approach by the largest (but still small) margin in Table 2, which we interpret as evidence that accounting for phase uncertainty is more important when the haplotypes are harder to estimate.

The advantages of pre-phasing become particularly clear when considering successive reference panels that have been updated over time. Following a relatively modest pre-phasing investment, each new reference panel can be imputed at a low computational cost while improving the accuracy and completeness of the imputed genotypes. This point is illustrated by Table 2, which shows that adding haplotypes to the 1000 Genomes resource increased accuracy for all SNPs, especially those with minor allele frequency (MAF) of 1–3%: from a mean $R^2$ of 0.65 to 0.75 to 0.82 in the WTCCC2; from a mean $R^2$ of 0.69 to 0.73 to 0.83 in GAIN; and from a mean $R^2$ of 0.49 to 0.55 to 0.61 in the WHI. Beyond the accuracy increase at known variants, each new panel also introduces many novel variants that could lead to additional association signals and biological insights.

Evaluation of imputation accuracy using sequence data

One caveat to the comparisons above is that SNPs on GWAS arrays tend to be more common (e.g., see Supplementary Fig. 1) and easier to impute than unascertained SNPs2. We addressed this issue by performing a cross-validation in the EUR panel of 1000 Genomes Phase I, which includes a more complete set of SNPs discovered by low-pass whole-genome and high-pass exome sequencing in >1,000 individuals. For each of the 381 EUR individuals in turn, we masked genotypes on chromosome 10 at all sites except those included on Affymetrix 500k SNP arrays, then imputed the missing sites using the Affymetrix 500k scaffold and the remaining 760 EUR haplotypes. To mimic pre-phasing in a GWAS, we reduced the EUR dataset to sites present on the array scaffold, re-phased the genotypes, and then used these estimated haplotypes when imputing masked genotypes for a given individual.

The bottom rows in Table 2 show the results of running this experiment. To provide a point of comparison with the GWAS results in Table 2, we initially imputed only the SNPs that were used in the WTCCC2 analysis (“WTCCC2 SNPs”). The imputation accuracy at these SNPs was slightly lower in the EUR cross-validation than in the WTCCC2 analysis; e.g., pre-phasing in EUR produced mean $R^2$ values of 0.81, 0.85, and 0.91 for SNPs in ascending MAF bins, as compared to 0.82, 0.86, and 0.91 for the WTCCC2 experiment with the same scaffold SNPs, reference panel, and phasing approach (Table 2). These differences in pre-phasing accuracy may reflect the relative amount of phase information in a sample of 381 individuals (1000 Genomes EUR) and a sample of nearly 2,500 individuals (WTCCC2). Nonetheless, the overall similarity in results suggests that our EUR cross-validation provides a good approximation to a European GWAS.

We next extended the experiment by imputing the full set of SNPs in the EUR sequence data (“sequence SNPs”). As expected, the sequence SNPs were imputed less accurately than the WTCCC2 SNPs within each frequency bin. For example, haplotype sampling produced mean $R^2$ values of 0.82, 0.86, and 0.92 (for MAF 1–3%, 3–5%, and >5%, respectively) in the array SNP analysis described above, but the accuracy dropped to 0.66, 0.79, and 0.91 when evaluating all sequence SNPs in the same frequency ranges (Table 2). Despite the added difficulty of imputing low-frequency and unascertained variants, pre-phasing was still nearly as effective as haplotype sampling at these SNPs (mean $R^2$ of 0.64 vs. 0.66 for MAF 1–3%; Table 2). This analysis also allows us to measure accuracy at SNPs with MAF < 1%, where we observed mean $R^2$ values of 0.42 and 0.44 for pre-phasing and haplotype sampling, respectively. Hence, while all methods have lower imputation accuracy at unascertained and low-frequency SNPs, pre-phasing still achieves competitive accuracy at such variants.

Multiple imputation

The examples in Table 2 show that imputation accuracy may sometimes decrease when using the most likely haplotype pair for each GWAS individual rather than integrating over the phase uncertainty. We also note that, over the span of entire chromosomes and in the datasets examined here, haplotype estimates will almost never match the true underlying haplotypes. These considerations led us to assess whether we could improve accuracy for a reasonable increment in computing time by saving multiple sampled haplotype configurations at the pre-phasing stage and then imputing into each of these (see Supplementary Notes for details). Supplementary Fig. 2 and Supplementary Fig. 3 show that imputing into 4–10 sampled haplotypes per individual provides a small increase in accuracy while increasing computational costs by 4–10x. At the same time, Supplementary Fig. 2 shows that using a much larger number of sampled haplotypes (up to 500 per individual, for a 500x increase in computing time) provides only a modest additional increase in accuracy, which confirms that a single pre-phased configuration provides nearly as much accuracy as much more compute-intensive methods for capturing haplotype uncertainty. These results suggest that pre-phasing is a good general strategy for genome-wide imputation, while slower but more accurate approaches may be useful for follow-up analyses near putative disease loci.
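One simple way to combine results across several sampled haplotype configurations is to average the genotype probabilities obtained from each. The sketch below assumes plain averaging with equal weights; the exact combination rule used in our experiments is described in the Supplementary Notes, and the function name here is hypothetical.

```python
def average_genotype_probs(per_config_probs):
    """Average genotype probabilities for one SNP in one individual.

    per_config_probs: list with one (P(G=0), P(G=1), P(G=2)) tuple per
    sampled haplotype configuration. Equal weighting is an assumption of
    this sketch.
    """
    k = len(per_config_probs)
    return tuple(sum(p[g] for p in per_config_probs) / k for g in range(3))


# Two sampled configurations that disagree about the genotype.
combined = average_genotype_probs([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```

The cost of this scheme grows linearly with the number of configurations, since each one requires a separate imputation pass.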

DISCUSSION

We have described a practical strategy for imputing genotypes from the large reference panels that are now emerging from sequencing efforts like the 1000 Genomes Project. These panels are regularly updated, both to incorporate newly sequenced individuals and to take advantage of improved methods for analyzing next-generation sequence data that can handle increasingly diverse variant types, including insertion/deletion polymorphisms and copy number variants. New reference datasets may provide significant benefits for disease studies, but imputing them into large-scale GWAS currently requires substantial computational resources. The pre-phasing strategy introduced here will allow investigators to routinely impute from these emerging reference panels at a reasonable computational cost and thereby enhance studies of complex disease genetics.

Overall, our results show that pre-phasing provides comparable accuracy to state-of-the-art imputation methods. While Tables 1 and 2 focus on selected combinations of data and methods, we also find that minimac and IMPUTE2 produce very similar trends in accuracy and running time when applied to the same dataset (for details, see Supplementary Fig. 4 and compare Table 1 and Supplementary Table 1).

It is somewhat surprising that pre-phasing remains competitive with other methods when imputing rare variants (MAF < 1%; Table 2). Such variants should require longer flanking haplotypes for successful imputation, and a single pre-phasing solution may include errors that break up long-range haplotypes. One possible explanation is that existing methods also struggle to infer very long haplotypes, so that pre-phasing still looks accurate by comparison. Conversely, it is important to realize that imputation accuracy is affected by phasing accuracy in the reference panel and by the GWAS SNPs used to drive imputation4,5. Imperfections in the reference haplotypes would limit imputation accuracy even with perfectly phased GWAS haplotypes, and it may be difficult to impute rare variants with any method when using sparse GWAS scaffolds. These factors may outweigh the differences between methods that use pre-phasing and those that integrate over phase uncertainty.

We consider extensions of the pre-phasing approach in the Supplementary Material, including an exploration of multiple imputation approaches and an example of how imputation accuracy can be improved by pre-phasing with other haplotyping engines like SHAPEIT (ref. 17). We also note that when genotypes from family members are available, it may be particularly attractive to use our imputation software with haplotypes informed by transmission patterns in pedigrees, where the best phasing tool may depend on family structure.

Our results show that pre-phasing is a highly generalizable strategy that can be adapted to most imputation engines, and we expect that it will be combined with future methodological developments to make imputation even faster and more flexible. Software implementing our approach, using either the IMPUTE2 or MaCH/minimac framework, is available from the authors’ websites.

Online Methods

FUSION Dataset

The Finland-United States Investigation of Non-Insulin-Dependent Diabetes Mellitus Genetics (FUSION) Study consists of 1,161 Finnish individuals with type 2 diabetes (T2D) and 1,174 normal glucose tolerant Finnish controls. Samples were genotyped with the Illumina HumanHap300 BeadChip (v1.1). In sum, 306,222 autosomal SNPs passed quality control (HWE $P \geq 10^{-6}$ in the total sample, call frequency ≥ 0.90 and MAF > 0.01; ref. 15). In addition, 120 trios were genotyped with the same chip; their haplotypes were estimated based on the most likely pattern of gene flow using Merlin (ref. 20) and compared with haplotypes estimated statistically using population information and the software program MaCH (ref. 12).

GAIN Psoriasis Dataset

The Genetic Association Information Network (GAIN; ref. 21) supported a series of Genome-Wide Association Studies (GWAS) designed to identify specific points of DNA variation associated with the occurrence of a particular common disease. Data used for this study included 1,359 psoriasis cases and 1,400 controls genotyped at Perlegen Sciences using a custom genotyping array. In total, 438,670 autosomal SNPs passed the quality control filters (HWE $P \geq 10^{-6}$ in the total sample, call frequency ≥ 0.95 and MAF > 0.01; ref. 21). In this study, 88 individuals were also genotyped using Affymetrix 6.0 arrays and these genotypes were used to evaluate imputation accuracy by examining the correlation between imputed dosages and array genotypes (for markers that were present on the Affymetrix 6.0 arrays but not on the Perlegen custom array).

WTCCC2 Dataset

We used genotype data from the Wellcome Trust Case Control Consortium 2 (WTCCC2; ref. 10) on members of the 1958 British Birth Cohort, which comprises controls sampled from the United Kingdom. These individuals were genotyped on Affymetrix 6.0 and Illumina 1.2M SNP arrays. The WTCCC2 merged genotypes across platforms and applied standard QC filters, which resulted in 2,490 individuals and 71,190 SNPs on chromosome 10. For our imputation experiments we masked the SNPs not found on the Affymetrix 500k array, imputed the masked SNPs, and compared the imputed dosages against the original array genotypes.

WHI Dataset

We obtained genotype data for the Women’s Health Initiative (WHI; ref. 19) study from dbGaP (http://www.ncbi.nlm.nih.gov/gap). The dataset included 8,421 African Americans genotyped on Affymetrix 6.0. We removed SNPs with genotype call rates < 90%, HWE $P < 10^{-6}$, or MAF < 1%, resulting in 829,370 SNPs passing quality control criteria. For our imputation experiments we masked every 10th SNP and repeated in sliding windows, such that each analysis was informed by ~90% of the array SNPs and every array SNP was imputed exactly once.
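The sliding-window masking scheme can be sketched as below. This is an illustration of the idea only: the function name is hypothetical, and SNPs are assumed to be indexed from 0 in array order.

```python
def masking_folds(n_snps, step=10):
    """Yield (offset, masked_indices) for each of `step` analyses.

    In the analysis with a given offset, every `step`-th SNP starting at
    that offset is masked and imputed from the remaining SNPs, so each SNP
    is masked in exactly one analysis.
    """
    for offset in range(step):
        yield offset, list(range(offset, n_snps, step))


# With 25 SNPs and step 10, the first analysis masks SNPs 0, 10 and 20.
folds = dict(masking_folds(25, step=10))
first_fold = folds[0]
```

With `step=10`, each analysis retains roughly 90% of the array SNPs as the imputation scaffold.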

Phasing

Haplotyping approaches such as those implemented in MaCH and IMPUTE2 proceed through a series of iterative steps. In each step, a new pair of haplotypes is sampled for each individual as an imperfect mosaic22 of the estimated haplotypes (“templates”) for other individuals in the dataset. After a number of iterations, “best-guess” haplotypes are constructed for each individual by combining information across the sampled haplotype configurations; both MaCH and IMPUTE2 perform this step by minimizing the expected switch error rate23. The computational cost of phasing with these methods depends on the number of iterations performed and the number of template haplotypes that are used in each update. For the experiments described here, we used 20 iterations and 200 – 400 templates for MaCH and 30 iterations (first 10 discarded as burn-in) and 80 templates for IMPUTE2. These methods differ in various details, such as how they fit the parameters of their models and how they choose templates for each haplotype sampling step; further information is provided in the original papers6,12.

Imputation into Phased Haplotypes

When GWAS genotypes have been phased prior to imputation, each haplotype can be imputed separately if we assume that the GWAS haplotypes are conditionally independent, given a reference panel. The reference panel provides template haplotypes for the imputation model, and marginal probabilities for the untyped alleles in each GWAS haplotype are estimated via standard hidden Markov model (HMM) calculations (the “forward-backward” algorithm24). The parameters of the HMM are estimated in different ways by minimac and IMPUTE2; see elsewhere for details6,12. Allelic probabilities are converted to genotypic probabilities for each individual by assuming Hardy-Weinberg equilibrium; these genotypic probabilities can be directly compared with those produced by other imputation approaches.
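The calculation described above can be illustrated with a deliberately simplified haploid copying model. The sketch below is not the model used by minimac or IMPUTE2: it assumes a constant template-switch probability (`switch`) and a constant allelic mismatch probability (`error`), both hypothetical parameters, and it treats the two haplotypes of an individual as independent when forming genotype probabilities, which is how we read the Hardy-Weinberg assumption here.

```python
def impute_haplotype(ref_haps, observed, switch=0.01, error=0.001):
    """Forward-backward imputation of one phased haplotype.

    ref_haps: list of reference haplotypes (lists of 0/1 alleles).
    observed: 0/1 at typed markers, None at untyped markers.
    Returns, per marker, the posterior probability that the copied
    reference template carries allele 1.
    """
    n_ref, n_markers = len(ref_haps), len(observed)

    def emission(h, m):
        # Untyped markers carry no information about the hidden template.
        if observed[m] is None:
            return 1.0
        return 1.0 - error if ref_haps[h][m] == observed[m] else error

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at each marker to avoid underflow.
    fwd = [normalize([emission(h, 0) / n_ref for h in range(n_ref)])]
    for m in range(1, n_markers):
        prev = fwd[-1]
        fwd.append(normalize([
            ((1.0 - switch) * prev[h] + switch / n_ref) * emission(h, m)
            for h in range(n_ref)
        ]))

    # Backward pass, also rescaled.
    bwd = [[1.0] * n_ref for _ in range(n_markers)]
    for m in range(n_markers - 2, -1, -1):
        weighted = [emission(h, m + 1) * bwd[m + 1][h] for h in range(n_ref)]
        jump = switch / n_ref * sum(weighted)
        bwd[m] = normalize([(1.0 - switch) * w + jump for w in weighted])

    # Posterior over templates, collapsed to the probability of allele 1.
    allele1 = []
    for m in range(n_markers):
        post = normalize([fwd[m][h] * bwd[m][h] for h in range(n_ref)])
        allele1.append(sum(p for h, p in enumerate(post) if ref_haps[h][m] == 1))
    return allele1


def genotype_probs(p1, p2):
    """Combine allele-1 probabilities from an individual's two haplotypes
    into (P(G=0), P(G=1), P(G=2)), treating the haplotypes as independent."""
    return ((1 - p1) * (1 - p2), p1 * (1 - p2) + (1 - p1) * p2, p1 * p2)


# Toy panel with two templates; the middle marker is untyped in the GWAS.
probs = impute_haplotype([[0, 0, 0], [1, 1, 1]], [1, None, 1])
```

Because each marker update sums over templates in a single pass, the work per haplotype grows linearly with the number of reference haplotypes, which is the source of the speed-up over diploid calculations.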

Computational Costs

Many existing imputation methods (e.g., MaCH and IMPUTE1) use analytical integration to account for the unknown phase of GWAS genotypes. The computational cost of this approach is proportional to the number of GWAS individuals ($N$), the number of genotyped markers in the reference panel ($M_{\mathrm{REF}}$), and the square of the number of reference haplotypes ($H$), or $O(N \cdot M_{\mathrm{REF}} \cdot H^2)$. Some methods, such as fastPHASE (ref. 23) and Beagle (ref. 25), reduce $H$ by grouping similar haplotypes into clusters. The quadratic term affects all markers, whether they are typed in a GWAS or just in the reference panel. Consequently, the computational cost grows quickly with reference panel size, and it can be time-consuming to run these methods on modern reference datasets.

IMPUTE2 aims to reduce the computing burden through a Monte Carlo algorithm that separates the phasing and imputation tasks. This approach alternately samples phase configurations for genotyped markers and imputes allele probabilities for markers not typed in the GWAS. The cost of the phasing component is proportional to the number of GWAS individuals ($N$), the number of genotyped markers in the GWAS data ($M_{\mathrm{GWAS}}$), the number of iterations ($I$), and the square of the number of templates used in each phasing update ($K^2$), or $O(N \cdot M_{\mathrm{GWAS}} \cdot I \cdot K^2)$. The cost of the imputation component is proportional to the number of GWAS haplotypes ($2N$), the number of markers in the reference panel ($M_{\mathrm{REF}}$), the number of iterations ($I$), and the number of haplotypes in the reference panel ($H$), or $O(N \cdot M_{\mathrm{REF}} \cdot I \cdot H)$. Partitioning the analysis in this way allows better scaling with reference panel size, but it requires $I$ repetitions of the imputation step (one for each sampled phase configuration).

Like the IMPUTE2 Monte Carlo algorithm, pre-phasing separates the phasing and imputation steps when imputing a GWAS dataset. The computational cost of pre-phasing in our framework is $O(N \cdot M_{\mathrm{GWAS}} \cdot I \cdot K^2)$. This is the same as the phasing cost for Monte Carlo integration, although in this context the phasing must be performed just once per GWAS dataset. Given a set of pre-phased GWAS haplotypes, the cost of imputation is then $O(N \cdot M_{\mathrm{REF}} \cdot H)$; the efficiency of this step makes imputation from pre-phased haplotypes very fast. The cost of each step in our current computing system, in CPU hours, is approximately $N \cdot M_{\mathrm{GWAS}} \cdot I \cdot K^2 \cdot 10^{-11}$ for phasing and $N \cdot M_{\mathrm{REF}} \cdot H \cdot 10^{-11}$ for imputation.
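These approximations are easy to turn into a back-of-the-envelope calculator. The sketch below simply evaluates the two CPU-hour expressions given above; the function names are hypothetical, and the example inputs (500,000 GWAS markers, 762 reference haplotypes) are illustrative assumptions rather than the exact settings behind Table 1, so the outputs should be read as rough orders of magnitude only.

```python
def phasing_cpu_hours(n, m_gwas, iterations, templates, const=1e-11):
    """Approximate pre-phasing cost: N * M_GWAS * I * K^2 * 1e-11 CPU hours."""
    return n * m_gwas * iterations * templates ** 2 * const


def imputation_cpu_hours(n, m_ref, n_ref_haps, const=1e-11):
    """Approximate cost of imputing into pre-phased haplotypes:
    N * M_REF * H * 1e-11 CPU hours."""
    return n * m_ref * n_ref_haps * const


# One individual, 30 phasing iterations with 80 templates (the IMPUTE2
# settings described under Phasing), on an assumed 500,000-marker array.
one_time_phasing = phasing_cpu_hours(1, 500_000, 30, 80)

# One individual imputed from an assumed panel of 37.4M markers and
# 762 haplotypes (2 x 381 individuals).
per_panel_imputation = imputation_cpu_hours(1, 37_400_000, 762)
```

The phasing term is paid once per GWAS dataset, whereas the imputation term is paid again for each new reference panel, which is why the second expression dominates the long-run cost of repeated imputation.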

Supplementary Material


Acknowledgments

We thank Michael Boehnke for critical reading, advice, and suggestions, Yun Li for aid with cleaning the WHI data, and two anonymous reviewers for helpful comments.

B.H. and M.S. were supported by NHGRI grant HGO2585 to M.S. J.M. was supported by United Kingdom Medical Research Council grant number G0801823. C.F. and G.R.A. were supported by NIH grant numbers DK0855840, HG005552, and HG005581.

This study makes use of data generated by the Wellcome Trust Case-Control Consortium, the Genetic Association Information Network and the Women’s Health Initiative. A full list of the investigators who contributed to the generation of the WTCCC data is available from www.wtccc.org.uk. The WTCCC was partially funded by the Wellcome Trust under awards 076113 and 085475. For details of contributors to the GAIN and WHI studies please see the corresponding dbGap accessions.

Footnotes

AUTHOR CONTRIBUTIONS

B.H., C.F., M.S., J.M. and G.R.A. designed the methods and experiments. B.H. and C.F. ran the experiments and wrote the first draft; all authors contributed critical reviews of the manuscript during its preparation.

URL

For minimac and instructions on how to implement our pre-phasing approach using MaCH, see: http://genome.sph.umich.edu/wiki/minimac

For separate pre-phasing and imputation with IMPUTE 2.0: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#prephasing_gwas

References

  • 1. The International HapMap Project. Nature. 2003;426:789–96. doi: 10.1038/nature02168.
  • 2. Altshuler DM, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–8. doi: 10.1038/nature09298.
  • 3. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–73. doi: 10.1038/nature09534.
  • 4. Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11:499–511. doi: 10.1038/nrg2796.
  • 5. Li Y, Willer C, Sanna S, Abecasis G. Genotype imputation. Annu Rev Genomics Hum Genet. 2009;10:387–406. doi: 10.1146/annurev.genom.9.081307.164242.
  • 6. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
  • 7. Burdick JT, Chen WM, Abecasis GR, Cheung VG. In silico method for inferring genotypes in pedigrees. Nat Genet. 2006;38:1002–4. doi: 10.1038/ng1863.
  • 8. Chen WM, Abecasis GR. Family-based association tests for genomewide association scans. Am J Hum Genet. 2007;81:913–26. doi: 10.1086/521580.
  • 9. Kong A, et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet. 2008;40:1068–75. doi: 10.1038/ng.216.
  • 10. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–78. doi: 10.1038/nature05911.
  • 11. Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–13. doi: 10.1038/ng2088.
  • 12. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–34. doi: 10.1002/gepi.20533.
  • 13. Varilo T, Peltonen L. Isolates and their potential use in complex gene mapping efforts. Curr Opin Genet Dev. 2004;14:316–23. doi: 10.1016/j.gde.2004.04.008.
  • 14. Peltonen L, Palotie A, Lange K. Use of population isolates for mapping complex traits. Nat Rev Genet. 2000;1:182–90. doi: 10.1038/35042049.
  • 15. Scott LJ, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science. 2007;316:1341–5. doi: 10.1126/science.1142382.
  • 16. Marchini J, et al. A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet. 2006;78:437–50. doi: 10.1086/500808.
  • 17. Delaneau O, Marchini J, Zagury JF. A linear complexity phasing method for thousands of genomes. Nat Methods. 2012;9:179–81. doi: 10.1038/nmeth.1785.
  • 18. Manolio TA, et al. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat Genet. 2007;39:1045–51. doi: 10.1038/ng2127.
  • 19. Design of the Women’s Health Initiative clinical trial and observational study. The Women’s Health Initiative Study Group. Control Clin Trials. 1998;19:61–109. doi: 10.1016/s0197-2456(97)00078-0.
  • 20. Abecasis GR, Wigginton JE. Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. Am J Hum Genet. 2005;77:754–67. doi: 10.1086/497345.
  • 21. Nair RP, et al. Genome-wide scan reveals association of psoriasis with IL-23 and NF-kappaB pathways. Nat Genet. 2009;41:199–204. doi: 10.1038/ng.311.
  • 22. Stephens M, Donnelly P. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet. 2003;73:1162–9. doi: 10.1086/379378.
  • 23. Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–44. doi: 10.1086/502802.
  • 24. Baum LE, Petrie T, Soules G, Weiss N. A Maximization Technique Occurring in Statistical Analysis of Probabilistic Functions of Markov Chains. Ann Math Stat. 1970;41:164.
  • 25. Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84:210–23. doi: 10.1016/j.ajhg.2009.01.005.
