A reference panel of 64,976 haplotypes for genotype imputation

Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey Barrett; Dorret I Boomsma; Kari Branham; Gerome Breen; Chad Brummet; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine Min; Karen L Mohlke; John Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenheer; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti

doi:10.1038/ng.3643

Author manuscript; available in PMC: 2017 Apr 11.

Published in final edited form as: Nat Genet. 2016 Aug 22;48(10):1279–1283. doi: 10.1038/ng.3643

Shane McCarthy ^1,^*, Sayantan Das ^2,^3,^*, Warren Kretzschmar ^4,^*, Olivier Delaneau ⁵, Andrew R Wood ⁶, Alexander Teumer ^7,⁸, Hyun Min Kang ^2,³, Christian Fuchsberger ^2,³, Petr Danecek ⁹, Kevin Sharp ¹⁰, Yang Luo ¹, Carlo Sidore ¹¹, Alan Kwong ^2,³, Nicholas Timpson ¹², Seppo Koskinen ¹³, Scott Vrieze ^14,¹⁵, Laura J Scott ^2,³, He Zhang ¹⁶, Anubha Mahajan ⁴, Jan Veldink ¹⁷, Ulrike Peters ^18,¹⁹, Carlos Pato ²⁰, Cornelia M van Duijn ²¹, Christopher E Gillies ²², Ilaria Gandin ²³, Massimo Mezzavilla ²⁴, Arthur Gilly ¹, Massimiliano Cocca ²⁵, Michela Traglia ²⁵, Andrea Angius ⁵, Jeffrey Barrett ¹, Dorret I Boomsma ²⁶, Kari Branham ²⁷, Gerome Breen ^28,²⁹, Chad Brummet ³⁰, Fabio Busonero ¹¹, Harry Campbell ³¹, Andrew Chan ^32,³³, Sai Chen ^2,^3,^34,³⁵, Emily Chew ³⁶, Francis S Collins ³⁷, Laura Corbin ¹², George Davey Smith ¹², George Dedoussis ³⁸, Marcus Dorr ^39,⁴⁰, Aliki-Eleni Farmaki ³⁸, Luigi Ferrucci ⁴¹, Lukas Forer ⁴², Ross M Fraser ³¹, Stacey Gabriel ⁴³, Shawn Levy ⁴⁴, Leif Groop ^45,^46,⁴⁷, Tabitha Harrison ¹⁸, Andrew Hattersley ⁴⁸, Oddgeir L Holmen ⁴⁹, Kristian Hveem ⁴⁹, Matthias Kretzler ^34,^35,⁵⁰, James Lee ^51,⁵², Matt McGue ⁵³, Thomas Meitinger ^54,⁵⁵, David Melzer ⁵⁶, Josine Min ¹², Karen L Mohlke ⁵⁷, John Vincent ^58,^59,⁶⁰, Matthias Nauck ^8,⁴⁰, Deborah Nickerson ⁶¹, Aarno Palotie ^43,^61,⁶², Michele Pato ²⁰, Nicola Pirastu ²³, Melvin McInnis ⁶³, Brent Richards ⁶⁴, Cinzia Sala ²⁵, Veikko Salomaa ¹³, David Schlessinger ^65,^66,⁶⁷, Sebastian Schoenheer ⁴², P Eline Slagboom ⁶⁸, Kerrin Small ⁶⁹, Timothy Spector ⁶⁹, Dwight Stambolian ⁷⁰, Marcus Tuke ⁶, Jaakko Tuomilehto ^71,^72,^73,⁷⁴, Leonard Van den Berg ¹⁷, Wouter Van Rheenen ¹⁷, Uwe Volker ^40,⁷⁵, Cisca Wijmenga ⁷⁶, Daniela Toniolo ²⁵, Eleftheria Zeggini ¹, Paolo Gasparini ^23,⁷⁷, Matthew G Sampson ²², James F Wilson ^31,⁷⁸, Timothy Frayling ⁶, Paul de Bakker ^79,⁸⁰, Morris A Swertz ^76,⁸¹, Steven McCarroll ^82,⁸³, Charles Kooperberg ¹⁸, Annelot Dekker ¹⁷, David Altshuler 
^43,^83,^84,^85,^86,⁸⁷, Cristen Willer ^16,^34,³⁵, William Iacono ⁵³, Samuli Ripatti ⁸⁸, Nicole Soranzo ¹, Klaudia Walter ¹, Anand Swaroop ⁸⁹, Francesco Cucca ¹¹, Carl Anderson ¹, Michael Boehnke ^2,³, Mark I McCarthy ^4,^90,⁹¹, Richard Durbin ^1,^**, Gonçalo Abecasis ^2,^3,^**, Jonathan Marchini ^10,^4,^**; for the Haplotype Reference Consortium

¹Human Genetics, Wellcome Trust Sanger Institute, Hinxton, UK

²Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA

³Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA

⁴Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK

⁵Genetics and Development, University of Geneva, Geneva, Switzerland

⁶Genetics of Complex Traits, Institute of Biomedical Science, University of Exeter Medical School, Exeter, UK

⁷Institute for Community Medicine, University Medicine Greifswald, Germany

⁸Institute of Clinical Chemistry and Laboratory Medicine, University Medicine Greifswald, Germany

⁹Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Oxford, UK

¹⁰Department of Statistics, University of Oxford, Oxford, UK

¹¹IRGB, CNR, Sardinia, Italy

¹²MRC Integrative Epidemiology Unit, University of Bristol, Oakfield Grove, UK

¹³THL, Finland

¹⁴Institute for Behavioral Genetics, University of Colorado, Boulder, Colorado, USA

¹⁵Department of Psychology and Neurosurgery, University of Colorado, Boulder, Colorado, USA

¹⁶Department of Internal Medicine, Division of Cardiovascular Medicine, University of Michigan, Ann Arbor, Michigan, USA

¹⁷Department of Neurology and Neurosurgery, Brain Center Rudolf Magnus, Utrecht, the Netherlands

¹⁸Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA

¹⁹Department of Epidemiology, University of Washington School of Public Health, Seattle, Washington, USA

²⁰Department of Psychiatry, SUNY Downstate, Brooklyn, New York, USA

²¹Genetic Epidemiology Unit, Department of Epidemiology, ErasmusMC, Rotterdam, the Netherlands

²²Department of Pediatrics-Nephrology, University of Michigan School of Medicine, Ann Arbor, Michigan, USA

²³DSM, University of Trieste, Trieste, Italy

²⁴Genetica Medica, IRCCS-Burlo Garofolo, Trieste, Italy

²⁵Genetics and Cell Biology, San Raffaele Research Institute, Milano, Italy

²⁶Department of Biological Psychology, VU Amsterdam, Neuroscience Campus, Amsterdam, the Netherlands

²⁷Department of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor, Michigan, USA

²⁸MRC Social Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology & Neuroscience, King’s College, London, UK

²⁹NIHR Biomedical Research Centre for Mental Health, Institute of Psychiatry, Psychology & Neuroscience, King’s College London and The South London Maudsley Hospital, London, UK

³⁰Department of Anesthesiology, University of Michigan, Ann Arbor, Michigan, USA

³¹The Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK

³²Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA

³³Brigham and Women’s Hospital, Channing Division of Network Medicine, Boston, Massachusetts, USA

³⁴Department of Computational Medicine, University of Michigan, Ann Arbor, Michigan, USA

³⁵Department of Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

³⁶Division of Epidemiology and Clinical Applications, National Eye Institute, Bethesda, Maryland, USA

³⁷Medical Genomics and Metabolic Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA

³⁸Department of Nutrition and Dietetics, School of Health Science and Education, Harokopio University, Athens, Greece

³⁹Department of Internal Medicine B, University Medicine Greifswald, Germany

⁴⁰DZHK (German Centre for Cardiovascular Research), Greifswald, Germany

⁴¹Longitudinal Studies Section, Clinical Research Branch, Gerontology Research Centre, National Institute on Aging, Baltimore, Maryland, USA

⁴²Center for Biomedicine, University of Innsbruck, Bolzano, Italy

⁴³Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA

⁴⁴HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, USA

⁴⁵Department of Clinical Sciences, Diabetes and Endocrinology, University of Lund, Malmo, Sweden

⁴⁶Finnish Institute for Molecular Medicine, University of Helsinki, Helsinki, Finland

⁴⁷Research Programs Unit, Diabetes and Obesity, University of Helsinki, Helsinki, Finland

⁴⁸Department of Diabetes and Vascular Medicine, University of Exeter Medical School, Exeter, UK

⁴⁹Hunt Research Centre, Department of Public Health and General Practice, Norwegian University of Science and Technology, Levanger, Norway

⁵⁰Department of Internal Medicine, University of Michigan School of Medicine, Ann Arbor, Michigan, USA

⁵¹Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK

⁵²Department of Medicine, University of Cambridge School of Clinical Medicine, Addenbrooke’s Hospital, Cambridge, UK

⁵³Department of Psychology, University of Minnesota, Minneapolis, Minnesota, USA

⁵⁴Institute of Human Genetics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany

⁵⁵Institute of Human Genetics, Technische Universität München, Munich, Germany

⁵⁶Epidemiology and Public Health, Institute of Biomedical and Clinical Science, University of Exeter Medical School, Exeter, UK

⁵⁷Department of Genetics, University of North Carolina, Chapel Hill, Chapel Hill, North Carolina, USA

⁵⁸Molecular Neuropsychiatry and Development Laboratory, Centre for Addiction and Mental Health, Toronto, Canada

⁵⁹Department of Psychiatry, University of Toronto, Toronto, Canada

⁶⁰Institute of Medical Science, University of Toronto, Toronto, Canada

⁶¹Genome Sciences, University of Washington, Seattle, Washington, USA

⁶²Institute for Molecular Medicine, FIMM, Helsinki, Finland

⁶³Department of Psychiatry, University of Michigan, Ann Arbor, Michigan, USA

⁶⁴Massachusetts General Hospital, Boston, Massachusetts, USA

⁶⁵Department of Medicine, McGill University, Montreal, Canada

⁶⁶Department of Human Genetics, McGill University, Montreal, Canada

⁶⁷National Institute on Aging, National Institutes of Health, Baltimore, Maryland, USA

⁶⁸Molecular Epidemiology Section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands

⁶⁹The Department of Twin Research and Genetic Epidemiology, King’s College, London, UK

⁷⁰Department of Ophthalmology, University of Pennsylvania, Philadelphia, Pennsylvania, USA

⁷¹Chronic Disease Prevention Unit, National Institute for Health and Welfare, Helsinki, Finland

⁷²Instituto de Investigacion Sanitaria del Hospital Universario LaPaz, University Hospital LaPaz, Autonomous University of Madrid, Madrid, Spain

⁷³Center for Vascular Prevention, Danube University Krems, Krems, Austria

⁷⁴Diabetes Research Group, King Abdulaziz University, Saudi Arabia

⁷⁵Interfaculty Institute for Genetics and Functional Genomics, University Medicine Greifswald, Germany

⁷⁶Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands

⁷⁷Department of Experimental Genetics, Sidra, Doha, Qatar

⁷⁸MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, UK

⁷⁹Medical Genetics, University Medical Center, Utrecht, the Netherlands

⁸⁰Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, the Netherlands

⁸¹Genomics Coordination Center, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands

⁸²Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA

⁸³Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA

⁸⁴Department of Molecular Biology, Massachusetts General Hospital, Boston, Massachusetts, USA

⁸⁵Diabetes Research Center (Diabetes Unit), Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA

⁸⁶Department of Medicine, Harvard Medical School, Boston, Massachusetts, USA

⁸⁷Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

⁸⁸Department of Public Health, University of Helsinki, Helsinki, Finland

⁸⁹Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, Bethesda, Maryland, USA

⁹⁰Oxford Centre for Diabetes, Endocrinology and Metabolism, Radcliffe Department of Medicine, University of Oxford, Oxford, UK

⁹¹Oxford NIHR Biomedical Research Centre, Churchill Hospital, Headington, Oxford, UK

*These authors should be considered joint first authors on this paper.

**These authors jointly supervised the research.

PMCID: PMC5388176 EMSID: EMS71361 PMID: 27548312

Abstract

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

Over the last decade, large-scale international collaborative efforts have created successively larger and more ethnically diverse genetic variation resources. For example, in 2007 the International HapMap Project produced a haplotype reference panel of 420 haplotypes at 3.1M SNPs in 3 continental populations¹. More recently, the 1000 Genomes Project has produced a series of datasets built using low-coverage whole genome sequencing (WGS), culminating in 2015 in a reference panel (1000GP3) of 5,008 haplotypes at over 88M variants from 26 world-wide populations². In addition, several other projects have collected low-coverage WGS data in large numbers of samples that could potentially also be used to build haplotype reference panels³–⁵. A major use of these resources has been to facilitate imputation of unobserved genotypes into genome-wide association study (GWAS) samples that have been assayed using relatively sparse genome-wide microarray chips. As the reference panels have increased in number of haplotypes, SNPs and populations, genotype imputation accuracy has increased, allowing researchers to impute and test SNPs for association at ever lower minor allele frequencies. A succession of methods developments has provided researchers with the tools to cope with these increasingly large panels⁶–¹¹.

We formed the Haplotype Reference Consortium (HRC) (see URLs) to bring together as many WGS datasets as possible to build a much larger combined haplotype reference panel. By doing so, our aim is to provide a single centralized resource for human genetics researchers to carry out genotype imputation. Here we describe the first HRC reference panel, which combines datasets from 20 different studies (Supplementary Table 1). The majority of these studies have low-coverage WGS data (4-8X coverage) and are known to consist of samples with predominantly European ancestry. However, the 1000 Genomes Phase 3 cohort, which has diverse ancestry, is also included. This reference panel consists of 64,976 haplotypes at 39,235,157 SNPs that have evidence of having a minor allele count (MAC) greater than or equal to 5.

We took the following approach to create the reference panel. We combined existing sets of genotype calls from each study to determine a ‘union’ set of 95,855,206 SNP sites with MAC >= 2. After initial tests, we decided for this first version of the HRC panel not to include small insertions and deletions (indels), since these were very inconsistently called across projects. We then used a standard tool to calculate the genotype likelihoods consistently for each sample at each site from the original study BAM files (see Methods) and to make a baseline set of non-LD based genotype calls. We next applied a number of filters to remove poor quality sites (see Methods). We restricted this site list to sites with MAC >= 5 based on the calls made originally by the individual studies, corresponding to a minimum minor allele frequency (MAF) of 0.0077%, since sites with lower MAF would be likely to be poorly imputed. We then added back sites that are present on several SNP microarray chips commonly used in GWAS. The resulting site list of 44,187,567 sites exhibited improved quality compared to the unfiltered MAC >= 5 site list when assessed by measuring a per sample transition-to-transversion (Ts/Tv) ratio (Supplementary Figures 1-2). We also detected and removed 301 duplicate samples across the whole dataset (see Methods).
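The correspondence between the MAC threshold and the quoted minimum MAF follows from dividing the allele count by the number of haplotypes in the final panel. A minimal sketch of that arithmetic is given here; the function name is ours and purely illustrative.

```python
def mac_to_maf_percent(mac: int, n_haplotypes: int) -> float:
    """Convert a minor allele count into a minor allele frequency in percent."""
    return 100.0 * mac / n_haplotypes


# Final HRC panel size (64,976 haplotypes) and the MAC threshold of 5.
maf_percent = mac_to_maf_percent(5, 64976)
print(f"{maf_percent:.4f}%")
```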

Calling genotypes and phasing using low-coverage WGS data has been a computationally challenging step for many of the 20 studies providing data. To reduce computation, we carried out this step on genotype likelihoods from all 32,611 samples together, and leveraged the original separately called haplotypes from each study to help reduce the search space of the calling algorithm (see Methods). We then applied a further refinement step by re-phasing the called genotypes using the SHAPEIT3 method¹², based on experience from the UK10K project, which found that this re-phasing approach produced substantially improved imputation accuracy when using the haplotypes⁴. After final genotype calling, we removed a further 123 samples (see Methods) and filtered out 4,952,410 sites whose MAC after refinement and sample removal was below 5, resulting in a final set of 39,235,157 sites and 32,488 samples. By measuring genotype discordance of the called genotypes compared to Illumina OMNI2.5M chip genotypes available on the 1000 Genomes samples, we showed that both our site filtering strategy and the increased sample size of HRC led to improved accuracy (Supplementary Table 2). For example, we obtained a non-reference allele discordance of 0.39% on the full HRC dataset with site filtering, compared to 0.67% on the subset of 1000GP3 samples.

We next carried out experiments to assess and illustrate the downstream imputation performance compared to previous haplotype reference panels. To mimic a typical imputation analysis, we created a pseudo-GWAS dataset using high-coverage Complete Genomics (CG) WGS genotypes on 10 CEU samples (see URLs). We extracted the CG SNP genotypes at all the sites included on an Illumina 1M SNP array (Human1M-Duo v3C). These were used to impute the remaining genotypes, which were then compared to the held-out genotypes, stratifying results by MAF of the imputed sites. Figure 1 shows that the HRC reference panel leads to a large increase in imputation performance when using a 1M SNP chip, compared to the 1000GP3 (r²=0.64 vs r²=0.36 at MAF = 0.1%), and also that the re-phasing step using SHAPEIT3 is worthwhile. HRC imputation at 0.1% frequency provides similar accuracy to 1000GP3 imputation at 0.6% frequency. Supplementary Figures 3 and 4 show the results from a denser (Illumina OMNI 5M) and a sparser (Illumina Core Exome) SNP chip.

Figure 1. The x-axis shows the non-reference allele frequency of the SNP being imputed on a log scale. The y-axis shows imputation accuracy measured by aggregate r² when imputing SNP genotypes into 10 CEU samples. These results are based on using genotypes at sites on the Illumina OMNI 1M SNP array as pseudo-GWAS data.
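One way to compute an aggregate r² stratified by allele frequency is to pool the held-out genotypes and imputed dosages of all SNPs falling in a frequency bin and take the squared Pearson correlation of the pooled vectors. The sketch below illustrates that idea only; the pooling scheme, bin edges, function names and input layout are our assumptions and are not taken from the analysis pipeline.

```python
def squared_pearson(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) * (a - mean_x) for a in x)
    syy = sum((b - mean_y) * (b - mean_y) for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either vector is constant
    return (sxy * sxy) / (sxx * syy)


def aggregate_r2_by_maf(records, bin_edges):
    """Pool SNPs into frequency bins and compute one r2 per bin.

    records: iterable of (maf, true_genotypes, imputed_dosages), one per SNP,
             with genotypes coded 0/1/2 and dosages in [0, 2].
    bin_edges: increasing frequency boundaries (hypothetical choice).
    """
    pooled = [([], []) for _ in range(len(bin_edges) - 1)]
    for maf, truth, dosage in records:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= maf < bin_edges[i + 1]:
                pooled[i][0].extend(truth)
                pooled[i][1].extend(dosage)
                break
    return {
        (bin_edges[i], bin_edges[i + 1]): squared_pearson(t, d)
        for i, (t, d) in enumerate(pooled)
        if t  # skip empty bins
    }
```

Pooling across SNPs before correlating, rather than averaging per-SNP values, keeps the measure defined for rare variants that are monomorphic in a small validation set.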

To illustrate the benefits of using the HRC resource, we imputed a GWAS of 1,210 samples from the InCHIANTI study¹³, including 534 that did not contribute to the HRC reference panel because they were not sequenced. Imputing using the HRC panel resulted in 15,501,516 SNPs passing an imputation quality threshold of r²≥0.5, compared to 13,238,968 variants (11,908,509 SNPs and 1,330,459 indels) when imputing using 1000 Genomes Phase 3, an increase of over 2 million variants. Taking the intersection of variant sites between the two panels to account for the filtering applied to the HRC panel resulted in 13,364,795 SNPs and 10,728,322 SNPs with r²≥0.5 for the HRC and 1000 Genomes Phase 3 panels, respectively. The majority of these additional SNPs occur at the lower frequency range (Supplementary Table 3).

We next tested the HRC imputed genotypes for association with 93 circulating blood marker phenotypes, including many of relevance to human health such as lipids, vitamins, ions, inflammatory markers and adipokines¹⁴,¹⁵. This analysis highlighted potential novel associations at the nominal GWAS significance threshold of 5e-8 (Supplementary Table 4). When we repeated the imputation using the HRC panel without the overlapping InCHIANTI samples, we obtained similar results (Supplementary Table 4). We took these SNPs forward for replication in the SHIP and SHIP-TREND cohorts (see Methods) and found that two of the SNPs replicated (Supplementary Table 5). Specifically, we found that SNP rs150956780 (MAF = 0.6%) was associated with the Lactic Dehydrogenase phenotype (meta-analysis p-value = 3.779E-29) and SNP rs147142246 (MAF = 0.6%) was associated with the Potassium phenotype (meta-analysis p-value = 8.7E-09). We also found that it is possible for HRC imputation to refine signals of association. For example, Figure 2 shows the association results of HapMap2, 1000GP3 and HRC based imputation for the α₁-antitrypsin phenotype at the SERPINA1 locus. HRC imputation gives a clear refinement of the signal at the rare causal SNP rs28929474 (MAF = 0.5%) (Supplementary Table 6), known to predispose to emphysema, a lung condition caused by alpha 1 antitrypsin deficiency¹⁶,¹⁷. Similar results were obtained when using the HRC panel that excluded the InCHIANTI samples (data not shown).

Figure 2. Association test statistics on the –log10 p-value scale (y-axis) are plotted for each SNP position (x-axis). Three different imputation panels were used: HapMap2 (left), 1000GP3 (middle), HRC release 1 (right). The SNP rs28929474 is shown in purple and other SNPs are coloured according to their level of LD (r²) with this SNP (see r² legend in each subplot).

Since the HRC reference panel combines data from many different studies with a range of restrictions on data release, we have developed centralized imputation server resources (see URLs). Under this model, researchers upload phased or unphased genotype data and imputation is carried out on central servers. Once imputation is complete, researchers can download the imputed datasets. Along similar lines, we have also developed a lower throughput phasing server for haplotype estimation of clinical samples with genotypes from high-coverage WGS data that takes advantage of rare variant sharing¹⁸ (see URLs). A limited subset of HRC haplotypes will be made available for researchers via the European Genome-phenome Archive (EGA) for the sole purpose of phasing and imputation.

This first release of the HRC is the largest human genetic variation resource to date and has been created via an unprecedented collaboration of data sharing across many groups. We envisage continuing to expand the HRC and are currently planning a second HRC release differing from the first release in two ways. Firstly, we aim to substantially increase the ethnic diversity of the panel by including data from sequencing studies in world-wide sample sets such as the CONVERGE study¹⁹, AGVP²⁰ and HGDP²¹. Secondly, we aim to include short insertions and deletions in addition to SNP variants. In the limit of a reference panel consisting of the whole human population except the person being imputed, imputation would likely be almost perfect for alleles at any frequency, since the panel would contain close relatives that share long and almost identical tracts of sequence. Therefore, we expect to be able to make future gains in imputation performance. In some populations that have experienced isolation (like Sardinia or Iceland) we expect to approach this limit much faster. Thinking further ahead, we hope to work closely with efforts under way to collect large samples of high-coverage sequenced samples such as the UK 100,000 Genomes Project (see URLs).

Online methods

Union site list

Every study provided us with the most recent version of their haplotypes in VCF format, with one VCF for every autosome. For every cohort, bcftools (v0.2.0-rc12) was used to create an entire-autosome, SNP-only site list with alternate and total allele count information from these per-chromosome haplotypes. Multiallelic SNPs were broken into biallelics using ‘bcftools norm’. These per-cohort site lists were merged into a single file using an in-house Perl script that correctly merges alternate and total allele counts. We created site lists called MAC2 and MAC5 containing only sites with a minor allele count (MAC) across all studies of >= 2 and >= 5, respectively, using bcftools. These site lists contained 95,855,206 and 51,060,347 sites, respectively.
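The merging of per-cohort site lists and the MAC thresholds can be illustrated with a short sketch. This is not the in-house Perl script; it is a simplified Python stand-in, and the data layout (a dictionary keyed by chromosome, position, reference and alternate allele) and function names are our assumptions. A site absent from a cohort contributes nothing to the totals here, which is a simplification.

```python
from collections import defaultdict


def merge_site_lists(cohort_site_lists):
    """Sum alternate (AC) and total (AN) allele counts per site across cohorts.

    Each cohort site list maps (chrom, pos, ref, alt) -> (AC, AN).
    """
    merged = defaultdict(lambda: [0, 0])
    for sites in cohort_site_lists:
        for key, (ac, an) in sites.items():
            merged[key][0] += ac
            merged[key][1] += an
    return {key: (ac, an) for key, (ac, an) in merged.items()}


def filter_by_mac(merged_sites, min_mac):
    """Keep sites whose minor allele count across all cohorts is >= min_mac."""
    return {
        key: (ac, an)
        for key, (ac, an) in merged_sites.items()
        if min(ac, an - ac) >= min_mac
    }


# Toy example: a singleton in each of two cohorts becomes a MAC2 site.
cohort_a = {("20", 1000, "A", "G"): (1, 200)}
cohort_b = {("20", 1000, "A", "G"): (1, 300), ("20", 2000, "C", "T"): (1, 300)}
merged = merge_site_lists([cohort_a, cohort_b])
mac2 = filter_by_mac(merged, 2)
```

Taking the minimum of AC and AN minus AC means a site where the alternate allele is the common one is still judged by its rarer allele.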

Genotype likelihood calculations

The 'samtools mpileup' command was used to generate genotype likelihoods (GLs) at all MAC2 sites on a per sample basis from each sample’s BAM file. The pipeline and software versions have been made available online (see URLs). The resulting BCF files were merged using the 'bcftools merge' command and the MAC2 sites and alleles extracted using the 'bcftools call' command. The use of 'bcftools call' here made a baseline set of non-LD based genotype calls for each site across all samples. These calls were used for some initial sample QC (see Sample filtering section). We calculated GLs on 33,070 samples in total.

Site filtering

We used an ad-hoc method for initial variant filtering which enabled us to identify variants that had been filtered out ‘quite often’ by our submitting studies. For each site and for each cohort, we labelled the site as “called” in that study if the putative calls from bcftools based on GLs exhibited more than one allele in that cohort, or “not called” if it showed no variation. We also used the haplotype sets provided by each study to determine whether each study had filtered out each site or not using their own internal calling pipeline. To determine a threshold of “number of times filtered out”, we stratified the sites according to their called status versus their filtered status (Supplementary Figure 5). We also measured the Ts/Tv ratio of the set of SNPs for each of these stratified combinations. SNPs corresponding to the cells above the red line in the figure were filtered out, removing all cells that had been filtered out by more than 4 studies or had a Ts/Tv ratio less than 1.7.
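The Ts/Tv ratio and the cell-level removal rule can be sketched as follows. The function names and input layout are ours; the two thresholds (more than 4 studies, Ts/Tv below 1.7) are those stated above, and the sketch assumes biallelic SNPs with single-base alleles.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    if ref == alt:
        raise ValueError("ref and alt must differ for a SNP")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps) -> float:
    """Ratio of transitions to transversions for a list of (ref, alt) pairs."""
    snps = list(snps)
    n_ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    n_tv = len(snps) - n_ts
    return n_ts / n_tv if n_tv else float("inf")


def cell_is_removed(n_studies_filtered: int, cell_ts_tv: float) -> bool:
    """Removal rule for one (called, filtered) cell of the stratification."""
    return n_studies_filtered > 4 or cell_ts_tv < 1.7
```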

We also applied a set of additional site filters as follows. We filtered out sites not on the MAC5 site list to restrict the site list to those that could be imputed well. We also filtered out sites if (i) any study (apart from 1000 Genomes) had a Hardy-Weinberg Equilibrium (HWE) p-value < 10⁻¹⁰, (ii) any study (apart from 1000 Genomes) had an overall inbreeding coefficient < -0.1, or (iii) the site had MAF > 0.1 and was called in fewer than 3 of the studies and not called in 1000 Genomes (the latter restriction kept sites present at high frequencies in non-European populations that were only called in 1000 Genomes). We also filtered out sites called only in the GoNL study or IBD cohort. We completely excluded GPC haplotypes from this step of the site list creation process.
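Filter (ii) can be illustrated with the standard estimator of the inbreeding coefficient, F = 1 - H_obs / H_exp, where strongly negative values indicate an excess of heterozygotes, a common signature of mis-called sites. Whether this exact estimator was used is an assumption on our part, and the function names below are hypothetical. The caller is expected to pass genotype counts for every study except 1000 Genomes.

```python
def inbreeding_coefficient(n_ref_ref: int, n_ref_alt: int, n_alt_alt: int) -> float:
    """F = 1 - observed heterozygosity / expected heterozygosity under HWE."""
    n = n_ref_ref + n_ref_alt + n_alt_alt
    if n == 0:
        return float("nan")
    p_alt = (2 * n_alt_alt + n_ref_alt) / (2 * n)
    expected_het = 2 * p_alt * (1 - p_alt)
    if expected_het == 0:
        return float("nan")  # monomorphic in this study
    observed_het = n_ref_alt / n
    return 1 - observed_het / expected_het


def fails_inbreeding_filter(per_study_counts, threshold: float = -0.1) -> bool:
    """True if any study shows an inbreeding coefficient below the threshold."""
    for counts in per_study_counts:
        f = inbreeding_coefficient(*counts)
        if f < threshold:  # comparisons with NaN are False, so NaN never fails
            return True
    return False
```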

After applying these filters, the site list comprised 44,038,997 sites. Finally, we made sure that 4,914,335 sites found on a selection of common SNP genotyping arrays and those used in the GIANT consortium and the Global Lipids Consortium (Supplementary Table 7) were included in the final site list. The final site list after this filtering contained 44,187,567 sites.

Sample filtering

Having used 'bcftools call' to extract sites and alleles, we had a set of baseline non-LD genotype calls (see Genotype likelihood calculations section). Based on these calls for chromosome 22, some outlier samples were evident and we removed 150 samples showing evidence for fewer than 10,000 non-reference SNPs or more than 10 singletons across the chromosome. This left a total of 32,920 samples.

To detect possible duplicates we used the original genotype calls submitted by the individual studies. We selected 1000 random sites that (1) were biallelic; (2) had European minor allele frequency > 5% in 1000GP3; and (3) had no missing data in any of the individual studies. Using the 'bcftools gtcheck' command, we counted the number of genotypes that differed between each sample pair. There was a clear set of 269 sample pairs with very few genotypes differing over the 1000 sites. We identified these samples as duplicates either within or between studies and removed one of the samples in the pair as described in Supplementary Table 8. Due to some samples being represented more than twice, there were a total of 261 samples removed due to duplicates. Before genotype calling, we also removed (i) 9 samples for which we had Complete Genomics data so that we could use these samples for testing purposes, (ii) 31 samples from 1000GP3 that were related samples (see URLs), (iii) 8 samples from the HELIC, AMD and ProjectMinE studies with sample labeling inconsistencies. These filters resulted in 32,611 samples being used for the genotype calling and phasing steps.
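The pairwise comparison behind the duplicate search can be sketched in a few lines. The analysis itself used 'bcftools gtcheck'; the code below is a simplified stand-in, and the function names and the `max_differences` threshold are hypothetical (the text states only that duplicate pairs differed at very few of the sites).

```python
from itertools import combinations


def count_differences(genotypes_1, genotypes_2) -> int:
    """Number of sites at which two samples carry different genotypes."""
    return sum(1 for a, b in zip(genotypes_1, genotypes_2) if a != b)


def candidate_duplicate_pairs(genotypes, max_differences: int):
    """Return (sample_1, sample_2, n_differences) for near-identical pairs.

    genotypes: dict mapping sample ID -> list of genotypes (0/1/2) at the
               same ordered set of sites for every sample.
    """
    pairs = []
    for (s1, g1), (s2, g2) in combinations(sorted(genotypes.items()), 2):
        n_diff = count_differences(g1, g2)
        if n_diff <= max_differences:
            pairs.append((s1, s2, n_diff))
    return pairs
```

The all-pairs loop is quadratic in the number of samples, which is why restricting the comparison to a small random set of common sites keeps it tractable.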

In addition, after phasing, 83 samples from the AMD study were removed as consent for these samples had been withdrawn. We also repeated the duplicate detection process on the final HRC genotype calls, since some studies increased in size late in the analysis process. This resulted in an extra 40 samples being removed and a total of 32,488 samples in the final phased reference panel.
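The sample counts reported across the filtering steps can be checked with simple bookkeeping. The sketch below uses only the numbers given in this section; the helper name is ours.

```python
def apply_removals(start: int, removals) -> int:
    """Subtract each (reason, count) removal from the starting sample count."""
    remaining = start
    for _reason, count in removals:
        remaining -= count
    return remaining


samples_with_gls = 33070
samples_for_calling = apply_removals(
    samples_with_gls,
    [
        ("chromosome 22 outliers", 150),
        ("duplicates", 261),
        ("Complete Genomics test samples", 9),
        ("related 1000GP3 samples", 31),
        ("labeling inconsistencies", 8),
    ],
)
final_samples = apply_removals(
    samples_for_calling,
    [("consent withdrawn (AMD)", 83), ("late duplicates", 40)],
)
final_haplotypes = 2 * final_samples
print(samples_for_calling, final_samples, final_haplotypes)
```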

Genotype calling method leveraging existing haplotype calls

We called genotypes from the genotype likelihoods computed on the HRC samples by extending the SNPTools²² algorithm to leverage pre-existing haplotypes available from each cohort. Like other phasing and calling approaches⁸,¹⁰, SNPTools is an MCMC approach in which each sample's haplotypes and genotypes are iteratively updated using the current estimates of all other samples. A low-complexity Hidden Markov Model (HMM) with just four states is used to update each sample, where the states are a set of four "surrogate parent" haplotypes. The MCMC sampler employs a Metropolis-Hastings (MH) step to sample the set of surrogate parents. In large sample sizes the search space for these surrogate haplotypes is huge and results in low acceptance rates for the sampler. Our extension, called GLPhase (see URLs), uses pre-existing haplotypes to restrict the set of possible haplotypes from which the MH sampler may choose surrogate parent haplotypes. For each individual, we restrict the search space to 200 haplotypes that most closely match the two pre-existing haplotypes of the individual using a Hamming distance metric (100 for each haplotype). We run the method on chunks of 1,024 sites at a time, which is the default setting for SNPTools. Since the pre-existing haplotypes from each study do not contain exactly the same set of sites, we filled in missing alleles in the pre-existing haplotypes at our site list using the major allele at each site.
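The restriction of the search space by Hamming distance can be sketched as follows. This is an illustrative brute-force version and not the GLPhase implementation; the function names, the 0/1 list encoding of haplotypes, and the handling of overlap between the two candidate lists (a plain set union, which can yield fewer than 200 haplotypes) are our assumptions.

```python
def hamming_distance(hap_1, hap_2) -> int:
    """Number of sites at which two haplotypes carry different alleles."""
    return sum(1 for a, b in zip(hap_1, hap_2) if a != b)


def restricted_candidates(own_haplotypes, panel, exclude, k_per_haplotype=100):
    """Indices of panel haplotypes closest to an individual's own haplotypes.

    own_haplotypes: the individual's two pre-existing haplotypes.
    panel: list of all pre-existing haplotypes (lists of 0/1 alleles).
    exclude: indices of the individual's own haplotypes within the panel.
    """
    candidates = set()
    for hap in own_haplotypes:
        pool = [i for i in range(len(panel)) if i not in exclude]
        pool.sort(key=lambda i: hamming_distance(hap, panel[i]))
        candidates.update(pool[:k_per_haplotype])
    return candidates


# Toy panel of six haplotypes over four sites; individual owns indices 0 and 1.
panel = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]
chosen = restricted_candidates([panel[0], panel[1]], panel, exclude={0, 1}, k_per_haplotype=1)
```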

Restricting the search space in this way allows us to reduce the number of burn-in iterations from 56 to 5, the number of sampling iterations from 200 to 95, and the number of MH steps taken at each iteration for each individual from 2N to 100, where N is the number of samples being phased. This reduces the complexity of our phasing algorithm from O(N²) to O(N). Although our implementation of the Hamming distance search has complexity O(N²), for N = 30,000, the impact of the search on run time is small (~5% of run time on each chunk). A chunk of 1,024 sites can be phased in ~200 minutes using ~1.3GB of RAM. Once sample sizes are encountered where the Hamming distance search begins to dominate, our implementation could be replaced with O(N log N) clustering algorithms that we have implemented within the SHAPEIT3 algorithm¹².
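The change in complexity can be made concrete by counting MH steps. With 2N steps for each of N individuals, the work per iteration grows as 2N²; with a fixed 100 steps per individual it grows as 100N. The sketch below simply counts steps using the iteration settings quoted above; the function names are ours and the count ignores the cost of each step.

```python
def mh_steps_per_iteration(n_samples: int, restricted: bool) -> int:
    """MH steps in one MCMC iteration over all individuals."""
    steps_per_individual = 100 if restricted else 2 * n_samples
    return n_samples * steps_per_individual


def total_mh_steps(n_samples: int, restricted: bool) -> int:
    """MH steps over all burn-in and sampling iterations."""
    iterations = (5 + 95) if restricted else (56 + 200)
    return iterations * mh_steps_per_iteration(n_samples, restricted)


n = 32611  # samples used for genotype calling and phasing
speedup = total_mh_steps(n, restricted=False) / total_mh_steps(n, restricted=True)
print(f"step-count reduction: {speedup:.0f}x")
```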

To illustrate how important GLPhase was to genotype calling and phasing at such a large sample size, we compared it to Beagle 3.1, Beagle 4.1 and the original SNPTools method. We ran all four methods on five randomly selected 1,024-site chunks from chromosome 20 on the cluster using increasing sample sizes and measured run time. Supplementary Figure 6 shows that GLPhase is approximately 100 times faster than the next quickest method at the full HRC sample size.

Final phasing and haplotype estimation

We estimated haplotypes from GLPhase genotype calls using SHAPEIT3¹². Chromosomes were phased in chunks consisting of 16,000 variants plus 3,300 variants overlapping with neighboring chunks on either side. The non-default command line option -w 0.5 was used for SHAPEIT3. Chunks were ligated using the ligateHAPLOTYPES program (see URLs). SHAPEIT3 does not handle multiple variants at the same genomic coordinate, so multiallelic sites (SNPs with 3 or 4 alleles) were shifted by one or two base pairs for rephasing, and then moved back to their original position after chunk ligation.
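The chunking and the temporary relocation of multiallelic sites can be sketched as follows. This is an illustration under one reading of the text (a 16,000-variant core flanked by up to 3,300 variants on each side, and repeated coordinates offset by +1, +2). It is not code from SHAPEIT3 or ligateHAPLOTYPES, and the function names are hypothetical.

```python
def chunk_windows(n_variants, core=16000, flank=3300):
    """Return (start, core_start, core_end, end) variant-index windows.

    Each window covers a core of up to `core` variants plus up to `flank`
    variants shared with the neighbouring chunk on either side (half-open).
    """
    windows = []
    for core_start in range(0, n_variants, core):
        core_end = min(core_start + core, n_variants)
        start = max(0, core_start - flank)
        end = min(n_variants, core_end + flank)
        windows.append((start, core_start, core_end, end))
    return windows


def shift_multiallelic(positions):
    """Offset repeated coordinates in a sorted list by +1, +2, ...

    The caller keeps the original list to restore positions after ligation.
    Assumption: no other variant already sits at the shifted coordinate.
    """
    shifted, previous, offset = [], None, 0
    for pos in positions:
        offset = offset + 1 if pos == previous else 0
        shifted.append(pos + offset)
        previous = pos
    return shifted
```

A site with three alleles occupies two records and so receives a one-base shift. A site with four alleles occupies three records and receives shifts of one and two bases.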

Evaluation of genotype calling process

We tested the genotype calling process on data from chromosome 20 with different combinations of site lists and sample sets to assess both the effects of site filtering and the benefits of increasing sample size. We evaluated 3 different site lists: the 1000 Genomes Phase 3 set of sites (775,927), our HRC MAC5 site list (1,128,114) and our HRC MAC5 site list with additional site filtering (1,006,559). We ran the genotype calling method on 3 different sets of samples: the 2,525 original 1000 Genomes Phase 3 samples, a subset of 13,309 HRC samples that we used at an early stage of HRC testing (HRC Pilot) from studies 1000GP3, AMD, GoNL, GoT2D, ORCADES, SardinIA, FINLAND and UK10K, and the near-final full set of 32,905 HRC samples. We called genotypes using GLPhase on each of these 9 datasets and examined genotype discordance compared to Illumina OMNI2.5M genotypes produced by the 1000 Genomes Project. For this comparison, we focused only on genotypes from 365 samples shared across the 3 sample sets and at 42,244 SNP sites. We calculated percentage discordance for each of the 3 possible genotypes consisting of reference (REF) and alternate (ALT) alleles, as well as an overall non-reference allele discordance rate (NRD). Results are shown in Supplementary Table 2.
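The discordance measures can be sketched as below. The text does not spell out the formulas, so this uses one common convention as an assumption: per-genotype discordance is computed among comparisons in which the chip genotype has the given value, and NRD is the number of discordant calls divided by all comparisons except those where both calls are REF/REF. Genotypes are coded as ALT allele counts (0, 1, 2), and the function name is hypothetical.

```python
def discordance(truth, called):
    """Return ({genotype: % discordant}, NRD %) for paired genotype calls.

    truth, called: equal-length sequences of ALT allele counts (0, 1 or 2),
    where `truth` holds the chip genotypes used as the reference.
    """
    per_genotype = {}
    for g in (0, 1, 2):
        calls = [c for t, c in zip(truth, called) if t == g]
        if calls:  # skip genotype classes absent from the truth set
            per_genotype[g] = 100 * sum(c != g for c in calls) / len(calls)
    mismatches = sum(t != c for t, c in zip(truth, called))
    both_ref = sum(t == 0 and c == 0 for t, c in zip(truth, called))
    denominator = len(truth) - both_ref
    nrd = 100 * mismatches / denominator if denominator else None
    return per_genotype, nrd
```

Excluding concordant REF/REF pairs from the denominator keeps the very large number of homozygous-reference genotypes from masking errors at sites that carry ALT alleles.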

Downstream imputation performance

We assessed the imputation accuracy of 4 different reference panels: 1000 Genomes Phase 3, UK10K, and two versions of the HRC reference panel, with and without re-phasing with SHAPEIT3. To do this we used high-coverage WGS data made publicly available by Complete Genomics (CG) (see URLs). For the pseudo-GWAS samples we used data from 10 CEU samples that also occur in the 1000 Genomes Phase 3 samples. These samples were removed from the various reference panels before using them to assess imputation performance.

Three pseudo-GWAS panels were created based on three chip lists (see URLs): the Illumina Omni 5M SNP array (HumanOmni5-4v1-1_A), the Illumina Omni 1M SNP array (Human1M-Duo v3C), and the Illumina Core Exome SNP array (humancoreexome-12v1-1_a). For these comparisons we only used sites in the intersection of the reference panels to enable a direct comparison.

These pseudo-chip genotypes were used to impute the remaining genotypes, which were then compared to the held-out genotypes, stratifying results by the MAF of the imputed sites.

Imputation was carried out using IMPUTE2⁷, which chooses a custom reference panel for each study individual in each 2 Mb segment of the genome. We set the khap parameter of IMPUTE2 to 1000. All other parameters were set to default values. We stratified imputed variants into allele frequency bins and, within each bin, calculated the squared correlation between the imputed allele dosages and the masked CG genotypes (called aggregate r² in Figure 1). Non-reference allele frequency for each SNP was calculated from HRC release 1 GLs at MAC ≥ 5 sites. Figure 1 shows the results for the Illumina Omni 1M chip. Supplementary Figures 3 and 4 show the results from the Illumina Core Exome chip and the Illumina Omni 5M chip respectively.
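The aggregate r² statistic can be sketched as follows, on the assumption that dosages and masked genotypes from all variants in a frequency bin are pooled before a single squared Pearson correlation is computed. The input layout, the half-open bins and the function names are illustrative choices and are not taken from the analysis scripts.

```python
from bisect import bisect_right


def squared_correlation(x, y):
    """Squared Pearson correlation, or None if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy * sxy / (sxx * syy)


def aggregate_r2(variants, bin_edges):
    """Pool variants into bins [edge_i, edge_i+1) and compute r2 per bin.

    variants: iterable of (allele_frequency, imputed_dosages, true_genotypes).
    Variants whose frequency falls outside the bin edges are ignored.
    """
    pooled = {}
    for freq, dosages, truth in variants:
        b = bisect_right(bin_edges, freq) - 1
        if b < 0 or b >= len(bin_edges) - 1:
            continue
        xs, ys = pooled.setdefault(b, ([], []))
        xs.extend(dosages)
        ys.extend(truth)
    return {(bin_edges[b], bin_edges[b + 1]): squared_correlation(xs, ys)
            for b, (xs, ys) in sorted(pooled.items())}
```

Pooling across variants gives a defined value even in rare-variant bins, where a single variant may be monomorphic in a small validation set (here, 10 samples) and a per-variant r² would be undefined.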

Details of imputation, association testing and replication in the InCHIANTI study

A total of 1,210 individuals from the InCHIANTI study were genotyped using the Illumina Infinium HumanHap550 genotyping array¹³,¹⁴. Individuals were pre-phased using autosomal SNPs after filtering out SNPs with MAF <1%, Hardy-Weinberg p-value <10⁻⁴, or missingness >1%. SNPs were also removed if they could not be remapped to the GRCh37 (hg19) human reference. This resulted in 483,991 SNPs available for pre-phasing. Phasing was performed locally using SHAPEIT2¹⁰.
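The three SNP filters can be sketched from per-SNP genotype counts. The thresholds are those stated above. The Hardy-Weinberg test used here is a one-degree-of-freedom chi-square test, chosen for brevity and flagged as an assumption, since the text does not say which test was applied (an exact test is a common alternative). Function names are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-d.f. chi-square test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:  # monomorphic SNP: nothing to test
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # Survival function of a chi-square variable with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_maf=0.01, min_hwe_p=1e-4, max_missing=0.01):
    """Apply the missingness, MAF and HWE filters to one SNP."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    if n_missing / (called + n_missing) > max_missing:
        return False
    p = (2 * n_aa + n_ab) / (2 * called)
    if min(p, 1 - p) < min_maf:
        return False
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p
```

For example, `passes_qc(500, 0, 500, 0)` rejects a SNP with no heterozygotes at an allele frequency of 0.5, a pattern typical of genotyping artefacts.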

Imputation was performed remotely using the Michigan Imputation Server (see URLs). A total of 39,235,157 SNPs and 47,045,346 variants were imputed from the HRC and 1000 Genomes Phase 3 (v5) reference panels, respectively. An imputation quality threshold of r² >0.5 was subsequently applied to both imputation datasets prior to association testing. This resulted in 15,501,516 and 13,589,949 variants available for association analysis derived from HRC- and 1000 Genomes-based imputation, respectively.

A total of 93 circulating factors available in the InCHIANTI study were double inverse-normalised, with adjustment for age and sex, prior to association testing¹⁴,¹⁵. Association analysis was performed using a linear mixed model framework as implemented in GEMMA (see URLs). Plots of association in Figure 2 were produced using LocusZoom (see URLs).
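One common reading of double inverse-normalisation is sketched below: the phenotype is rank-transformed to normal quantiles, regressed on age and sex, and the residuals are transformed again. The order of steps, the (rank − 0.5)/n offset, the arbitrary tie-breaking and the function names are assumptions and may differ from the procedure actually used.

```python
from statistics import NormalDist


def inverse_normal(values):
    """Rank-based inverse normal transform (ties broken by input order)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out


def ols_residuals(y, covariates):
    """Residuals of y regressed on an intercept plus covariates.

    covariates: one tuple per individual, e.g. (age, sex). Assumes the
    covariates are not collinear, so the normal equations are solvable.
    """
    rows = [[1.0] + list(c) for c in covariates]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * z for x, z in zip(a[r], a[col])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    return [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(rows, y)]


def double_inverse_normal(y, covariates):
    """Transform, adjust for covariates, then transform the residuals."""
    return inverse_normal(ols_residuals(inverse_normal(y), covariates))
```

The second transform restores an approximately standard normal distribution after covariate adjustment, which keeps effect sizes comparable across the 93 traits.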

We attempted to replicate the associations reported in Supplementary Table 3 in the SHIP and SHIP-TREND cohorts²³. The SHIP samples were genotyped using the Affymetrix Genome-Wide Human SNP Array 6.0. The SHIP-TREND samples were genotyped using the Illumina Human Omni 2.5 array. Prior to imputation, duplicate samples (by IBS), samples with a mismatch between reported and genotyped gender, and samples with a very high heterozygosity rate were excluded. Additionally, all monomorphic SNPs, SNPs with a duplicate chromosomal position, SNPs with pHWE <0.0001 and SNPs with a call rate <95% were filtered. Imputation was performed on the Sanger Imputation Service (see URLs) against the HRC panel. In total, 4,070 SHIP samples and 986 SHIP-TREND samples were included in the imputation of genotypes. Association analyses were conducted using SNPTEST v2.5.2²⁴.

Supplementary Material

Supplementary Data

NIHMS71361-supplement-Supplementary_Data.pdf (1.9MB, pdf)

Acknowledgements

J.M acknowledges support from the ERC (Grant no. 617306). W.K acknowledges support from the Wellcome Trust (Grant no. WT097307). S.M and R.D acknowledge support from Wellcome Trust grant WT098051. This study makes use of data generated by the UK10K Consortium. The Wellcome Trust provided funding for UK10K (WT091310). We are grateful to all participants of all the studies that have contributed data to the HRC. A full list of acknowledgements for cohorts is given in the Supplementary Note.

Footnotes

URLs

Haplotype Reference Consortium

http://www.haplotype-reference-consortium.org/

Michigan Imputation Server

https://imputationserver.sph.umich.edu/

Sanger Imputation Server

https://imputation.sanger.ac.uk/

Oxford Phasing Server

https://phasingserver.stats.ox.ac.uk/

Genotype Likelihood calculation scripts

https://github.com/mcshane/hrc-release1

GLPhase

http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html

ligateHAPLOTYPES

https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html

Complete Genomics high-coverage WGS genotypes

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130524_cgi_combined_calls/

1000 Genomes Project OMNI genotypes

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz

100,000 Genomes Project

http://www.genomicsengland.co.uk/the-100000-genomes-project/

GEMMA

http://www.xzlab.org/software.html

LocusZoom

http://locuszoom.sph.umich.edu/locuszoom/

1000GP3 related samples

ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/release/20130502/20140625_related_individuals.txt

SNP chip site lists

http://www.well.ox.ac.uk/~wrayner/strand/

Author contributions

The HRC was initially conceived through discussions between J.M, G.A, R.D, M.M and M.B. Analysis and methods development were carried out by S.M, S.D, W.K, O.D, A.R.W, P.D and H.K. Supervision of the research was provided by J.M, G.A and R.D. The Michigan Imputation server was developed by C.F, L.F, S.S and G.A. The Sanger Imputation server was developed by P.D, S.M and R.D. The Oxford Statistics Phasing server was developed by W.K, K.S and J.M. All other authors contributed datasets to the project or provided advice.

References

1. International HapMap Consortium et al. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258.
2. 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
3. Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet. 2014;46:818–825. doi: 10.1038/ng.3021.
4. Huang J, et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nature Communications. 2015;6:8111. doi: 10.1038/ncomms9111.
5. Sidore C, et al. Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers. Nat Genet. 2015;47:1272–1281. doi: 10.1038/ng.3368.
6. Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–913. doi: 10.1038/ng2088.
7. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
8. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–834. doi: 10.1002/gepi.20533.
9. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 2012;44:955–959. doi: 10.1038/ng.2354.
10. Delaneau O, Zagury J-F, Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods. 2013;10:5–6. doi: 10.1038/nmeth.2307.
11. Fuchsberger C, Abecasis GR, Hinds DA. minimac2: faster genotype imputation. Bioinformatics. 2015;31:782–784. doi: 10.1093/bioinformatics/btu704.
12. O'Connell J, Sharp K, Delaneau O, Marchini J. Haplotype estimation for biobank scale datasets. Nat Genet. 2016. doi: 10.1038/ng.3583. In press.
13. Ferrucci L, et al. Subsystems contributing to the decline in ability to walk: bridging the gap between epidemiology and geriatric practice in the InCHIANTI study. J Am Geriatr Soc. 2000;48:1618–1625. doi: 10.1111/j.1532-5415.2000.tb03873.x.
14. Melzer D, et al. A Genome-Wide Association Study Identifies Protein Quantitative Trait Loci (pQTLs). PLoS Genet. 2008;4:e1000072. doi: 10.1371/journal.pgen.1000072.
15. Wood AR, et al. Imputation of variants from the 1000 Genomes Project modestly improves known associations and can identify low-frequency variant-phenotype associations undetected by HapMap based imputation. PLoS ONE. 2013;8:e64343. doi: 10.1371/journal.pone.0064343.
16. Bathurst IC, Travis J, George PM, Carrell RW. Structural and functional characterization of the abnormal Z α1-antitrypsin isolated from human liver. FEBS Letters. 1984;177:179–183. doi: 10.1016/0014-5793(84)81279-x.
17. Ferrarotti I, et al. Serum levels and genotype distribution of α1-antitrypsin in the general population. Thorax. 2012;67. doi: 10.1136/thoraxjnl-2011-201321. thoraxjnl-2011-201321-674.
18. Sharp K, Kretzschmar W, Delaneau O, Marchini J. Phasing for medical sequencing using rare variants and large haplotype reference panels. Bioinformatics. 2016. doi: 10.1093/bioinformatics/btw065. btw065.
19. CONVERGE consortium. Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature. 2015;523:588–591. doi: 10.1038/nature14659.
20. Gurdasani D, et al. The African Genome Variation Project shapes medical genetics in Africa. Nature. 2015;517:327–332. doi: 10.1038/nature13997.
21. Rosenberg NA, et al. Genetic structure of human populations. Science. 2002;298:2381–2385. doi: 10.1126/science.1078311.
22. Wang Y, Lu J, Yu J, Gibbs RA, Yu F. An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res. 2013;23:833–842. doi: 10.1101/gr.146084.112.
23. Völzke H, et al. Cohort profile: the study of health in Pomerania. Int J Epidemiol. 2011;40:294–307. doi: 10.1093/ije/dyp394.
24. Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11:499–511. doi: 10.1038/nrg2796.



Figure 2. Association signal for the α₁-antitrypsin phenotype at the SERPINA1 locus.