Author manuscript; available in PMC: 2015 Jul 15.
Published in final edited form as: Nature. 2014 Dec 3;517(7534):327–332. doi: 10.1038/nature13997

The African Genome Variation Project shapes medical genetics in Africa

Deepti Gurdasani 1,2, Tommy Carstensen 1,2, Fasil Tekola-Ayele 3, Luca Pagani 1, Ioanna Tachmazidou 1, Konstantinos Hatzikotoulas 1, Savita Karthikeyan 1,2, Louise Iles 1,2, Martin O Pollard 1, Ananyo Choudhury 4, Graham R S Ritchie 1,5, Yali Xue 1, Jennifer Asimit 1, Rebecca N Nsubuga 6, Elizabeth H Young 1,2, Cristina Pomilla 1,2, Katja Kivinen 1, Kirk Rockett 7, Anatoli Kamali 6, Ayo P Doumatey 3, Gershim Asiki 6, Janet Seeley 6, Fatoumatta Sisay-Joof 8, Muminatou Jallow 8, Stephen Tollman 4,9, Ephrem Mekonnen 10, Rosemary Ekong 11, Tamiru Oljira 10, Neil Bradman 12, Kalifa Bojang 8, Michele Ramsay 4, Adebowale Adeyemo 3, Endashaw Bekele 10, Ayesha Motala 13, Shane A Norris 4, Fraser Pirie 13, Pontiano Kaleebu 6, Dominic Kwiatkowski 1, Chris Tyler-Smith 1, Charles Rotimi 3, Eleftheria Zeggini 1, Manjinder S Sandhu 1,2
PMCID: PMC4297536  NIHMSID: NIHMS637763  PMID: 25470054

Abstract

Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterisation of African genetic diversity is needed. The African Genome Variation Project (AGVP) provides a resource to help design, implement and interpret genomic studies in sub-Saharan Africa (SSA) and worldwide. The AGVP represents dense genotypes from 1,481 and whole genome sequences (WGS) from 320 individuals across SSA. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across SSA. We identify new loci under selection, including for malaria and hypertension. We show that modern imputation panels can identify association signals at highly differentiated loci across populations in SSA. Using WGS, we show further improvement in imputation accuracy supporting efforts for large-scale sequencing of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa, showing for the first time that such designs are feasible.

Introduction

Globally, human populations show structured genetic diversity as a result of geographical dispersion, selection and drift. Understanding this variation can provide insights into evolutionary processes that shape both human adaptation and variation in disease susceptibility.1 Although the HapMap Project2 and 1000 Genomes Project (1000GP)3 have greatly enhanced our understanding of genetic variation globally, the characterisation of African populations remains limited. Other efforts examining African genetic diversity have been limited by variant density and sample sizes in individual populations,4 or have focused on isolated groups, such as hunter-gatherers (HG),5,6 limiting relevance to more widespread populations across Africa.

The African Genome Variation Project (AGVP) is an international collaboration that expands on these efforts by systematically assessing genetic diversity among 1,481 individuals from 18 ethno-linguistic groups from sub-Saharan Africa (SSA) (Figure 1 and SM Tables 1 and 2) with the HumanOmni2.5 genotyping array and whole genome sequences (WGS) from 320 individuals (SM Table 2). Importantly, the AGVP has evolved to help develop local resources for public health and genomic research, including strengthening research capacity, training, and collaboration across the region. We envisage that data from this project will provide a global resource for researchers, as well as facilitate genetic studies in Africa.7

Figure 1. Populations studied in the African Genome Variation Project.

Figure 1a represents 18 African populations studied in the AGVP including 2 populations from the 1000 Genomes Project. Figure 1b and c represent ADMIXTURE analysis of these 18 populations alone, and in a global context, respectively. For the AGVP populations, the first 6 clusters are represented, while 18 clusters are shown for the global dataset, as K=6 and K=18 were the most likely clusters on ADMIXTURE analysis. ADMIXTURE analysis suggests substructure between North, East, West and South Africa. Studying these populations in the context of Eurasian and African HG populations suggests extensive Eurasian and HG admixture across Africa.

Population structure in SSA

On examining ~2.2M variants, we found modest differentiation among SSA populations (mean pairwise FST 0.019) (Supplementary Methods and Supplementary Table 1). Differentiation among the Niger-Congo language groups, the predominant linguistic grouping across Africa, was also modest (mean pairwise FST 0.009) (Supplementary Table 1), providing evidence for the “Bantu expansion”, a recent population expansion and movement throughout SSA originating in West Africa around 3,000–5,000 ya.8
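As an illustration of how a mean pairwise FST of the kind quoted above can be computed from allele frequencies, the sketch below uses Hudson's estimator in its ratio-of-averages form. This choice is an assumption: the estimator actually used is described in the Supplementary Methods, which are not reproduced here, and the function names and toy frequencies are hypothetical.

```python
from itertools import combinations


def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST between two populations (ratio of averages over SNPs).

    p1, p2: per-SNP allele frequencies in each population.
    n1, n2: number of sampled chromosomes in each population.
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite-sample noise.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that one chromosome drawn from each population differs.
        den += a * (1 - b) + b * (1 - a)
    return num / den if den > 0 else float("nan")


def mean_pairwise_fst(freqs, sizes):
    """Average Hudson FST over all unordered population pairs."""
    pairs = list(combinations(sorted(freqs), 2))
    total = sum(hudson_fst(freqs[x], freqs[y], sizes[x], sizes[y]) for x, y in pairs)
    return total / len(pairs)


# Toy allele frequencies (hypothetical, not AGVP data).
freqs = {
    "popA": [0.10, 0.50, 0.80],
    "popB": [0.12, 0.45, 0.82],
    "popC": [0.30, 0.60, 0.60],
}
sizes = {"popA": 200, "popB": 200, "popC": 200}
print(round(mean_pairwise_fst(freqs, sizes), 4))
```

Summing numerators and denominators across SNPs before dividing keeps rare variants from dominating the estimate, which is why the ratio-of-averages form is a common choice for genome-wide summaries.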

We identified 29.8M SNPs from Ethiopian, Zulu and Bagandan WGS (Extended Data Figure 1 and Supplementary Methods). A substantial proportion of unshared (11–23%) and novel (16–24%) variants were observed, with the highest proportion among Ethiopian populations (Extended Data Figure 1). These findings reinforce the need for large-scale sequencing across Africa, including among genetically divergent populations.

We used principal component analysis (PCA) to explore relationships among AGVP populations (Extended Data Figures 2–5, Supplementary Figures 1 and 2). PC1 appeared to represent a cline extending from West and East African populations towards Ethiopian populations, possibly suggesting Eurasian gene flow, while PC2 separated West African and South/East African populations (Extended Data Figure 2). Inclusion of 1000GP, North African and Khoe-San populations in PCA (Extended Data Figures 3–5, and Supplementary Figures 1 and 2) suggested possible HG ancestry among southern Niger-Congo groups, highlighted by clustering towards the Khoe-San, in addition to confirming a cline towards Eurasian populations. Unsupervised ADMIXTURE9 analysis including the 1000GP and Human Origins datasets (Figure 1) also supported evidence for substantial Eurasian and HG ancestry in SSA (Figure 1 and Extended Data Figure 6).

In order to assess the effect of gene flow on population differentiation in SSA, we masked Eurasian ancestry across the genome (Supplementary Methods, Supplementary Note 6). This markedly reduced population differentiation as measured by a decline in mean pairwise FST from 0.021 to 0.015 (Supplementary Note 6) suggesting that Eurasian ancestry has a substantial impact on differentiation among SSA populations. We speculate that residual differentiation between Ethiopian and other SSA populations after masking Eurasian ancestry (pairwise FST = 0.027) may be a remnant of East African diversity pre-dating the Bantu expansion.10

Population admixture in SSA

Formal tests for admixture (f3 tests),11 confirmed widespread Eurasian and HG admixture in SSA (Supplementary Tables 2 and 3). Quantification of admixture (Supplementary Table 4, Supplementary Methods, Supplementary Notes 3 and 4) indicated substantial Eurasian ancestry in many African populations (ranging from 0–50%), with the greatest proportion in East Africa (Figure 2, Supplementary Table 4). Similarly, HG admixture ranged from 0–23%, being greatest among Zulu and Sotho (Figure 2 and Supplementary Table 5).
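To make the logic of the f3 test concrete, the following sketch computes f3(C; A, B) from allele frequencies: a clearly negative value indicates that the target C is admixed between populations related to A and B. It is a simplified illustration under stated assumptions, not the analysis pipeline used here; the function name and toy frequencies are hypothetical, and the block-jackknife standard errors needed for a formal test are omitted.

```python
def f3(c, a, b, n_c):
    """f3(C; A, B) averaged over SNPs.

    c, a, b: per-SNP allele frequencies in the target C and sources A, B.
    n_c: number of sampled chromosomes in C (used for bias correction).
    """
    total = 0.0
    for pc, pa, pb in zip(c, a, b):
        # Finite-sample correction for the squared target-frequency term.
        correction = pc * (1 - pc) / (n_c - 1)
        total += (pc - pa) * (pc - pb) - correction
    return total / len(c)


# Toy frequencies (hypothetical): C sits midway between A and B at every SNP,
# as expected for a 50/50 mixture with no subsequent drift.
src_a = [0.1, 0.9, 0.2]
src_b = [0.9, 0.1, 0.8]
admixed = [0.5, 0.5, 0.5]
print(f3(admixed, src_a, src_b, n_c=200))
```

When C is intermediate between A and B, the two factors (c - a) and (c - b) have opposite signs, which is what drives the statistic negative; drift specific to C pushes it back towards positive values, so a non-negative f3 does not rule out admixture.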

Figure 2. Dating and proportion of Eurasian and HG admixture among African populations.

Figures 2a and b show the proportion and distribution of Eurasian and HG admixture among different populations across Africa, with approximate dating of admixture using MALDER.

We found novel evidence for historically complex and regionally distinct admixture with multiple HG and Eurasian populations across SSA (Figure 2 and Supplementary Note 5). Specifically, ancient Eurasian admixture was observed in central West African populations (Yoruba; ~7,500–10,500 ya), old admixture among Ethiopian populations (~2,400–3,200 ya) consistent with previous reports,10,12 and more recent complex admixture in some East African populations (~150–1,500 ya) (Figure 2, Extended Data Figure 7 and Supplementary Note 5). Our finding of ancient Eurasian admixture corroborates findings of non-zero Neanderthal ancestry in Yoruba, which is likely to have been introduced through Eurasian admixture and back migration possibly facilitated by greening of the Sahara desert during this period.13,14
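The dates above rest on the decay of admixture-induced linkage disequilibrium (LD) with genetic distance. The sketch below illustrates the idea for a single admixture pulse, where LD is modelled as A·exp(−n·d), with d in Morgans and n the number of generations since admixture. It is a deliberately simplified stand-in: MALDER fits multiple exponential components to weighted LD curves, and the 29-year generation time, function names and synthetic curve are assumptions made for illustration only.

```python
import math


def fit_admixture_generations(distances_morgans, weighted_ld):
    """Least-squares fit of log(LD) = log(A) - n * d; returns n (generations)."""
    xs = list(distances_morgans)
    ys = [math.log(v) for v in weighted_ld]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


def generations_to_years(generations, years_per_generation=29.0):
    """Convert generations to years; the generation time is an assumption."""
    return generations * years_per_generation


# Synthetic, noise-free decay curve for a pulse 100 generations ago.
distances = [0.005, 0.01, 0.02, 0.05]
curve = [0.01 * math.exp(-100 * d) for d in distances]
n_gen = fit_admixture_generations(distances, curve)
print(round(n_gen, 3), round(generations_to_years(n_gen)))
```

Older admixture gives faster decay, so ancient events leave signal only at very short genetic distances, which is one reason they are harder to date precisely than recent ones.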

We also find novel evidence for complex and regionally distinct HG admixture across SSA (Supplementary Note 5, Extended Data Figure 7 and Figure 2), with ancient gene flow (~9,000 ya) among Igbo and more recent admixture in East and South Africa (multiple events ranging from 100–3,000 ya), broadly consistent with historical movements reflecting the Bantu expansion. An exploration of the likeliest sources of admixture in our data suggested that HG admixture in Igbo was most closely represented by modern day Khoe-San populations rather than by rainforest HG (rHG) populations (Supplementary Note 5). Given limited archaeological and linguistic evidence for the presence of Khoe-San populations in West Africa, this extant HG admixture might represent ancient populations, consistent with the presence of mass HG graves from the early Holocene period comprising skeletons with distinct morphological features,15 and with evidence of HG rock art dating to this period in the Western Sahara.16,17 In East Africa, our analyses suggested that Mbuti rHG populations most closely represented ancient HG mixing populations (Supplementary Note 5), with admixture dating to ~3,000 years ago, suggesting HG ancestry here is likely to be older than previously reported.18 The primary source of HG admixture in Zulu and Sotho populations was from Khoe-San populations (Figure 2, Supplementary Note 5), consistent with linguistic assimilation of click consonants among these populations.

Positive Selection in SSA

We examined highly differentiated SNPs between European and African populations, as well as among African populations to gain insights into loci that may have undergone selection in response to local adaptive forces (Supplementary Methods). To account for confounding due to Eurasian admixture, we also conducted analyses after masking Eurasian ancestry (Supplementary Methods and Supplementary Note 6).

On examining locus-specific Europe-Africa differentiation, enrichment of loci known to be under positive selection was observed among the most differentiated sites (p=1.4×10−31). Furthermore, there was statistically significant enrichment for genic variants among these, indicating this differentiation was unlikely to have arisen purely from random drift (p=0.0002). Additionally, we found no evidence for background selection as the primary driver of differentiation among these loci (Supplementary Note 7).

In addition to genes known to be under positive selection (e.g. SLC24A5, SLC45A2 and OCA2 19, LARGE 20 and CYP3A4/5) (Supplementary Figure 3), we found evidence of differentiation in novel gene regions, including those implicated in malaria (CR1) (Extended Data Figure 8). Complement receptor 1 (CR1) carries the Knops blood group antigens and has previously been implicated in malaria susceptibility21 and severity,22 with evidence suggesting positive selection in malaria-endemic regions23 (Extended Data Figure 8). We also identified highly differentiated variants within genes involved in osmoregulation (ATP1A1 and AQP2) (Extended Data Figure 8). Deregulation of AQP2 expression and loss-of-function mutations in ATP1A1 have been associated with essential and secondary hypertension, respectively.24,25 Climatic adaptive changes in these gene regions could potentially provide a biological basis for the high burden of hypertension and differences in salt sensitivity observed in SSA.26

By contrast, overall differentiation among African populations was modest (maximum masked FST=0.19) (Supplementary Figure 4) and only 56/1237 sites remained in the tail distribution after masking (Supplementary Methods, Supplementary Table 6). This suggests that a large proportion of differentiation observed among African populations could be due to Eurasian admixture, rather than adaptation to selective forces (Supplementary Note 6). Known genes under selection were significantly enriched among the most differentiated loci after masking of Eurasian ancestry (p=2.3×10−16). Among the 56 loci robust to Eurasian ancestry masking (Supplementary Table 6), we identified several known loci under selection (Extended Data Figure 8) including a highly differentiated variant (rs1378940) in the CSK gene region implicated in hypertension in GWAS.27 The major allele of rs1378940 among Africans was in complete LD with the risk allele of the GWAS SNP rs1378942 28, with the frequency of this allele highly correlated with latitude (r=−0.67), providing support for local adaptation in response to temperature as a possible mechanism for hypertension (Supplementary Figure 5).29–31

Comparing populations residing in endemic and non-endemic infectious disease regions (Supplementary Methods), we identified several novel loci associated with infectious disease susceptibility and severity. As well as the known sickle cell locus for malaria, this approach identified additional signals under potential selection, including the PKLR region,32 RUNX3,33 the haptoglobin locus, CD163,34 IL10 35,36, CFH, and the CD28-ICOS-CTLA4 locus (Supplementary Table 7, Extended Data Figure 8).37 Similar comparisons for Lassa fever identified the known LARGE gene, as well as novel candidates associated with viral entry and immune response, including in the HLA, DC-SIGN/DC-SIGNR,38 RNASEL, CXCR6, IFIH1 39 and OAS2/3 regions (Supplementary Table 7). For trypanosomiasis, we identified APOL1 40, as well as several novel loci implicated in immune response and binding to trypanosoma, including FAS, FASL 41,42, IL23R 43, SIGLEC6 and SIGLEC12 (Supplementary Table 7).44 For trachoma, we identified signals in ABCA1 and CXCR6, which may be important for the growth of the parasite and host immune response, respectively (Supplementary Table 7).45,46

Designing medical genetics studies in Africa

To inform the design of genomic studies in Africa, we addressed the following questions: 1) How well do current genotype arrays perform in African populations using existing reference panels for imputation? 2) Can these genotype arrays and reference panels identify and fine-map association signals in populations across Africa? 3) Can we improve imputation accuracy in African populations using a new African reference panel? and 4) What are the most cost-effective designs for large-scale GWAS in Africa?

The 1000GP phase I integrated panel provided reasonably accurate imputation into the Illumina Omni 2.5M array in all populations (Supplementary Note 10). However, imputation accuracy was lower among Sotho, Zulu and Afro-Asiatic populations possibly reflecting poor representation of some African haplotypes (including Khoe-San haplotypes) within the 1000GP panel. These findings suggest that improvements in imputation accuracy across diverse population groups may require larger and more diverse reference panels.

We assessed the reproducibility and potential for fine-mapping association signals within Africa and globally at several disease susceptibility loci (Supplementary Methods, Supplementary Table 8 and Extended Data Figure 9). Current genotype arrays and imputation panels allowed for identification of relevant association signals at most loci across populations in SSA, demonstrating that association signals are reproducible across populations in SSA (Extended Data Figure 9 and Supplementary Figures 7–18). African populations are likely to provide better fine-mapping resolution around the causal locus (Supplementary Table 8). We highlight one example here: the sickle cell anaemia locus (HBB),47 under selection due to protection conferred against severe malaria. This locus showed marked heterogeneity in association signals across populations, reflecting different LD patterns and allele frequencies among populations in SSA (Supplementary Figures 9–10). This pattern is likely the result of independent selective sweeps at this locus in different parts of Africa, leading to differences in hitchhiking rare haplotypes that attained high frequencies among different populations.48 This suggests that these signatures are recent and occurred during or after the Bantu expansion, consistent with the hypothesis that the advent of agriculture and increased malaria transmission may have resulted in increased selection pressure.49 However, in contrast to previous reports,47 we show that association signals even at such highly differentiated loci can be captured with dense genotype data using existing reference panels for imputation, despite individual population groups not being fully represented in these. This argues against the need for large-scale population-specific sequencing across Africa, and instead favours a broad sequencing approach targeted at capturing widespread haplotype diversity.

To assess the utility of a larger and more diverse African reference panel for imputation, we generated a panel integrating the phase I 1000GP and AGVP WGS panels (Supplementary Methods, Supplementary Note 9). Using this integrated panel, we observed marked improvements in imputation accuracy across the whole range of the allele frequency spectrum in specific populations poorly represented by the 1000GP panel (Supplementary Note 11 and Figure 3). These findings suggest that even common haplotypes in some SSA populations may not be sufficiently captured by existing panels, limiting our power to examine associations of common variants with disease. Importantly, given the specificity of the improvement in imputation accuracy, we infer that targeted sequencing of divergent populations representing a broad spectrum of haplotypes across Africa, including HG and North/East African haplotypes, rather than widespread population sequencing, is likely to provide a more efficient strategy to improve imputation accuracy and a practicable GWAS framework in Africa.
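Imputation accuracy of this kind is commonly summarised, per variant, as the squared Pearson correlation (r2) between imputed allele dosages and genotypes held out as truth. The sketch below shows that calculation; it assumes this metric for illustration (the exact procedure is in the Supplementary Notes, not reproduced here), and the function name and toy values are hypothetical.

```python
def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes (0/1/2)
    and imputed dosages (continuous, 0 to 2) for one variant."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        # r2 is undefined when either vector has no variation.
        return float("nan")
    return cov * cov / (var_t * var_d)


# Toy example (hypothetical values): one variant typed in six individuals.
truth = [0, 1, 2, 1, 0, 2]
imputed = [0.1, 0.9, 1.8, 1.2, 0.0, 1.9]
print(round(dosage_r2(truth, imputed), 3))
```

Averaging this quantity within allele frequency bins, separately for each reference panel, gives the kind of comparison across the frequency spectrum described above.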

Figure 3. Improvement in imputation accuracy with the AGVP whole genome sequencing panel.

Figure 3 depicts the substantial improvement in imputation accuracy in some populations (Sotho), compared to minimal improvement in others (Igbo) with addition of the AGVP WGS panel to the 1000GP phase I reference panel, suggesting poor representation of some haplotypes (e.g. Khoe-San haplotypes in Sotho) in the 1000GP reference panel.

We compared the utility of existing chip designs (2.5M Illumina) and ultra-low-coverage (ULC) WGS designs (0.5x, 1x, 2x coverage) in order to determine the optimal design for African GWAS. Sensitivity for common variation was >90% at all sequencing depths (Supplementary Note 12). Examining the effective sample size for a fixed budget,50 we found the effective sample size was greater for all ULC-WGS and chip array designs compared with 4x WGS. When computational costs were accounted for (Supplementary Note 12), the 2.5M array provided the greatest effective sample size supporting the development and large-scale use of efficient genotype arrays in Africa, where these have been underutilised.
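The fixed-budget comparison can be sketched with the usual approximation that a design genotyping N samples with mean accuracy r2 has the power of about N × r2 perfectly genotyped samples. The code below applies that approximation; the per-sample costs, r2 values and design labels are hypothetical placeholders, not figures from this study.

```python
def effective_sample_size(budget, cost_per_sample, mean_r2):
    """Approximate effective sample size for a fixed budget.

    The number of samples affordable is budget // cost_per_sample; each
    sample contributes a fraction mean_r2 of a perfectly genotyped sample.
    """
    n_samples = int(budget // cost_per_sample)
    return n_samples * mean_r2


# Hypothetical designs: (cost per sample, mean r2 with true genotypes).
designs = {
    "design_A": (50.0, 0.90),
    "design_B": (200.0, 0.98),
}
budget = 100_000.0
for name, (cost, r2) in designs.items():
    print(name, effective_sample_size(budget, cost, r2))
```

The trade-off this exposes is that a cheaper, slightly less accurate design can outperform a more accurate one once the extra samples it buys are taken into account; adding computational cost per sample to `cost_per_sample` changes the ranking in the way described above.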

We therefore sought to evaluate a potential chip design to tag common variation across a wider range of African populations (Supplementary Note 13). Importantly, we show that an array with 1M genetic variants could capture >80% of common variation (MAF>5%) across the genome (Extended Data Figure 10). These analyses suggest that designing a pan-African genotype array to effectively capture common genetic variation across Africa is feasible, and could greatly facilitate large-scale genomic studies in Africa.

Discussion

The marked haplotype diversity within Africa has important implications for the design of large-scale medical genomics studies across the region, as well as studies of population history and evolution. In this context, the AGVP is a resource that will facilitate a broad range of genomic studies in Africa and globally.

Although Africa is the most genetically diverse region in the world, we provide evidence for relatively modest differentiation among populations representing the major sub-populations in SSA, consistent with recent population movement and expansion across the region beginning around 5,000 ya—the Bantu expansion.8 Although the history of the Bantu expansion is likely to have been complex, assessments of population admixture can provide new insights into this. We note historically complex and regionally distinct admixture with multiple HG and Eurasian populations across SSA, including ancient HG and Eurasian ancestry in West and East Africa and more recent complex HG admixture in South Africa. As well as explaining genetic differentiation among modern populations in SSA, these admixture patterns provide novel genetic evidence for early back to Africa migrations, the possible existence of extant HG populations in Western Africa—compatible with archaeological evidence,15 and patterns of gene flow consistent with the Bantu expansion, including genetic assimilation of residing populations across the region.

This admixture also has important implications for the assessment of differentiation and positive selection in Africa. Accounting for these elements, we identify novel loci under positive selection linked with hypertension, malaria, and other pathogens. This provides a proof-of-concept for the utility of geographically widespread genetic data within Africa to identify loci under selection related to diverse environments.

Our evidence for the broad transferability of genetic association signals and their statistical refinement, has important implications for medical genetic research in Africa. Importantly, we highlight that such studies are feasible and can be enabled through the development of more efficient genotype arrays and diverse WGS reference panels for accurate imputation of common variation. In this context, we describe a framework for a new pan-African genotype array that could directly facilitate large-scale genomic studies in Africa.

A critical next step would be to conduct large-scale deep-sequencing of multiple and diverse populations across the region, and integrate ancient DNA resources, to identify and understand signals of ancient admixture, patterns of historical population movements, and provide a comprehensive resource to conduct medical genomic studies in Africa.

Extended Data

Extended Data Figure 1.

Extended Data Figure 1a represents the overlap of SNPs between 4x whole genome sequence data from Zulu, Ugandan and Ethiopian individuals (subsampled to 100 samples each). EDF 1b represents the overlap of novel variants (those not in the 1000GP phase I integrated call set) between the 3 populations. EDF 1c and d represent the allele frequency spectra of variants in different portions of the Venn diagrams depicted in EDF 1a and b, respectively. There appears to be a large proportion of unshared (private) variants in each population, between 10–23% of the total number of variants in a given population. The proportion of novel variants was high, with Ethiopia showing the greatest proportion of novel variation. Most of the novel variation appears to be unshared and rare.

Extended Data Figure 2.

Extended Data Figure 2 represents the first ten PCs for the African dataset. PC 1 shows a cline among several African populations, most likely to represent Eurasian gene flow. PC2 shows a clear separation between West and South/East Africa. Subsequent PCs show more detailed structure between, and within African populations.

Extended Data Figure 3.

Extended Data Figure 3 represents the first ten PCs for the global dataset, including populations from the 1000 Genomes Project. PC 1 shows a cline among several African populations extending towards European populations, most likely to represent non-SSA gene flow. PC2 shows a clear separation between European and Asian populations. Subsequent PCs show more detailed structure between populations globally, and within African populations.

Extended Data Figure 4.

Extended Data Figure 4 represents the first ten PCs for the global extended dataset, including populations from the 1000 Genomes Project, North African and Khoe-San population groups. PC 1 shows a cline among several African populations extending towards European populations, most likely to represent non-SSA gene flow. PC2 shows a clear separation between European and Asian populations. Subsequent PCs show more detailed structure between populations globally, and within African populations.

Extended Data Figure 5.

Extended Data Figure 5a represents the projection of principal components calculated on YRI and CEU from the 1000 Genomes Project onto the African populations. The AGVP populations are seen to fall on a cline between YRI and CEU, with Ethiopian populations closest to CEU. This is suggestive of Eurasian ancestry among these populations. Extended Data Figure 5b represents the projection of principal components calculated on YRI and Ju/’hoansi onto the AGVP and other Khoe-San populations. The AGVP and Khoe-San populations are seen to fall on a cline between YRI and Ju/’hoansi, with Zulu and Sotho leading the cline among the AGVP populations. This is suggestive of HG gene flow among these populations.

Extended Data Figure 6.

Extended Data Figure 6 represents ADMIXTURE clustering analysis for AGVP samples combined with 1000 Genomes, HGDP, North African and Khoe-San samples. Cluster 2 shows separation of European and African ancestry, with delineation of Asian and Khoe-San ancestry in Cluster 4. Subsequent clusters show separation of East, West, North and South African ancestral components.

Extended Data Figure 7.

Extended Data Figure 7a represents the time and most likely sources of admixture with 95% confidence intervals for different AGVP populations estimated with MALDER (See Supplementary Note 5). Circular markers with a line drawn around them represent high probability events, while those with no line around them represent low probability events. Extended Data Figure 7b represents the time and most likely sources of admixture estimated with MALDER for the same populations using high quality imputed data to improve resolution.

Extended Data Figure 8.

Extended Data Figure 8 shows loci with marked allelic differentiation either globally or within Africa. The derived and ancestral alleles are depicted in blue and red, respectively, for all loci. EDF 8a represents the global distribution of the non-synonymous variant rs17047661 at the CR1 locus implicated in malaria severity. This locus was noted to be among the most differentiated sites (top 0.1%) between Europe and Africa. EDF Fig 8b depicts the global distribution of the rs10216063 SNP at the AQP2 locus. The derived allele appears to be the major allele among European populations in contrast to African populations. EDF Fig 8c represents the allele frequency distribution of rs10924081 at the ATP1A1 locus. Marked differentiation is observed globally, with the derived allele noted to be the major allele among European populations. EDF Fig 8d shows the global distribution of the risk allele for the SNP rs1378940 in the CSK locus associated with hypertension. This locus was found to be within the top 0.1% of differentiated loci within Africa, and within the top 1% of differentiated loci globally. EDF Fig 8e shows the allele frequency distribution of the rs3213419 SNP at the HP locus. EDF Fig 8f shows the allele frequency distribution of the rs7313726 SNP at the CD163 locus. The HP and CD163 are among the top 0.1% of differentiated sites between malaria endemic and non-endemic regions in Africa.

Extended Data Figure 9.

Extended Data Figure 9 represents the global distribution of biologically relevant loci used for simulation of traits to examine reproducibility of signals across AGVP populations. EDF 9a represents the frequency of the sickle cell variant (rs334) in different regions globally. The blue portion of each pie chart represents the frequency of the causal allele A. EDF 9b represents the distribution of the SORT1 causal SNP rs12740374, with the derived allele T depicted in blue. EDF 9c, d, e, f represent the distributions of the APOL1 variant rs73885319, TCF7L2 variant rs7903146, the APOE variant rs429358 and the PRDM9 variant rs6889665, respectively.

Extended Data Figure 10.

Extended Data Figure 10 outlines the coverage obtained across the genome for variants at different allele frequencies for a hypothetical African genotype array with 1M tagging variants. Different allele frequency bins are depicted in different colours. The lines show the coverage that can be achieved by imputation at different r2 thresholds. Coverage, here, is defined as the proportion of variants within an allele frequency captured above a pre-defined r2 threshold (along the x axis) after imputation. The solid lines represent the coverage obtained with 1M variants selected using the hybrid tagging and imputation approach, while the broken lines represent the coverage obtained by using a simple pairwise tagging approach to capture 1M tagging variants. The hybrid method improves coverage obtained, particularly for common variation. Coverage for common variants (>5%) appears to be high at an r2 threshold of 0.8 and above, with >80% of these variants accurately imputed.
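The coverage definition in this caption can be written out directly: within each allele frequency bin, coverage is the fraction of variants whose imputation r2 reaches the chosen threshold. The sketch below is a minimal illustration of that definition; the function name, the half-open bin convention, the use of "greater than or equal to" at the threshold, and the toy values are assumptions.

```python
def coverage_by_maf_bin(variants, r2_threshold, bin_edges):
    """Proportion of variants per MAF bin with imputation r2 >= threshold.

    variants: iterable of (maf, r2) pairs, one per variant.
    bin_edges: ascending MAF edges; bins are (lo, hi] for consecutive edges.
    Returns {(lo, hi): coverage}, with NaN for empty bins.
    """
    tallies = {(lo, hi): [0, 0] for lo, hi in zip(bin_edges, bin_edges[1:])}
    for maf, r2 in variants:
        for (lo, hi), tally in tallies.items():
            if lo < maf <= hi:
                tally[0] += 1          # variants falling in this bin
                if r2 >= r2_threshold:
                    tally[1] += 1      # variants captured at the threshold
                break
    return {
        bounds: (hits / total if total else float("nan"))
        for bounds, (total, hits) in tallies.items()
    }


# Toy (maf, r2) pairs, hypothetical values only.
toy_variants = [(0.02, 0.5), (0.03, 0.9), (0.10, 0.85), (0.30, 0.95), (0.40, 0.7)]
coverage = coverage_by_maf_bin(toy_variants, r2_threshold=0.8, bin_edges=[0.01, 0.05, 0.5])
print(coverage)
```

Evaluating this function over a grid of thresholds for each bin traces out curves of the kind plotted in the figure, one per allele frequency bin.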

Supplementary Material

supp methods
supp tables + figs

Acknowledgments

This project was funded in part by the Wellcome Trust, The Wellcome Trust Sanger Institute (WT098051), Bill and Melinda Gates Foundation, Foundation for National Institutes of Health, and the UK Medical Research Council (G0901213-92157, G0801566, and MR/K013491/1). We also acknowledge the National Institute of Health Research Cambridge Biomedical Research Centre and the Wellcome Trust Cambridge Centre for Global Health Research.

We are very grateful to Dr. Joseph Pickrell for sharing Human Origins data and MALDER code, and for his useful input on interpretations of these analyses. We also thank Erik Garrison for his suggestions on using Platinum genomes sets for validation of whole genome sequencing data.

We also thank the African Partnership for Chronic Disease Research (APCDR) for providing a network to support this study as well as a repository for deposition of curated data. Sample collections from South Africa were funded by The South African Sugar Association, Servier South Africa and The Victor Daitz Foundation. The Kenyan samples were collected by Prof Duncan Ngare of Moi University, Eldoret, Kenya as part of the Africa America Diabetes Mellitus (AADM) study and the International HapMap project. Prof Ngare, who is now deceased, was a great supporter of genomics in Africa as exemplified by his leadership in engaging the Luhya and Maasai communities for the HapMap project. The Igbo samples were collected by Prof Johnnie Oli of the University of Nigeria, Enugu, Nigeria. The Ga-Adangbe samples were collected by the laboratories of Prof Albert Amoah of the University of Ghana, Accra, Ghana and Joseph Acheampong of the University of Science and Technology, Kumasi, Ghana. Support for the AADM study is provided by the National Institute on Minority Health and Health Disparities, the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and the National Human Genome Research Institute (NHGRI). This research was supported in part by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH - Z01HG200362). D.G. was funded by the Cambridge Commonwealth Scholarship. We thank the 1000 Genomes Project for sharing genotype data that were analysed as part of this project. We also thank all study participants who contributed to this study.

Footnotes

Authors’ contributions

Overall project coordination: D.G., C.P., M.S.S. (Project Chair), E.H.Y. and E.Z.

Analysis and writing: C.P. coordinated sample collation, genotyping, quality control and data generation for the study. J.A., T.C., D.G. and C.P. carried out quality control and curation of data. R.N. and Y.X. undertook quality control for the MalariaGEN and Ethiopian population sets, respectively. M.O.P. carried out quality control and BAM improvement of sequence data at all depths. T.C. curated and generated all sequence data, and carried out comparisons with genotype array data and with higher-coverage data. D.G. carried out the population structure and admixture analyses. A.C., D.G., S.K. and L.P. carried out the analysis of positive selection and population differentiation. L.P. and I.T. carried out the analysis of LD decay. T.C., K.H. and I.T. carried out imputation-based analyses. T.C. developed an efficient tagging algorithm and analysed the coverage of tagging variants for the design of the African genotype array. D.G. and F.T. carried out fine-mapping analyses. C.R., M.S.S., C.T. and E.Z. critically appraised and commented on the manuscript. D.G., T.C., L.P. and M.S.S. prepared the manuscript and supplementary materials. C.P. and L.I. contributed to the writing of the supplementary materials. All authors commented on the interpretation of results, and reviewed and approved the final manuscript.

Management, fieldwork, laboratory analyses and coordination of contributing cohorts: K.B., M.J., K.K., D.K., K.R. and F.S. (The Gambian cohorts-MalariaGEN); G.A., P.K., A.K., M.S.S. and J.S. (The General Population Cohort Study); A.M. and F.P. (South African Zulu cohort); A.A., A.D., C.R. and F.T. (The Kenyan, Ghanaian and Nigerian cohorts); A.C., S.N., M.R. and S.T. (South African Sotho cohort); E.B., N.B., R.E., E.M., T.O., L.P. and C.T. (Ethiopian cohort).

Conflicts of interest

There were no conflicts of interest.

Supplementary Materials

Supplementary methods
Supplementary tables and figures