This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2023 Oct 2:2023.09.30.560263. [Version 1] doi: 10.1101/2023.09.30.560263

Admix-kit: An Integrated Toolkit and Pipeline for Genetic Analyses of Admixed Populations

Kangcheng Hou 1,*, Stephanie Gogarten 2, Joohyun Kim 3, Xing Hua 4, Julie-Alexia Dias 5, Quan Sun 6, Ying Wang 7, Taotao Tan 8; Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium Methods Working Group, Elizabeth G Atkinson 8, Alicia Martin 7, Jonathan Shortt 9, Jibril Hirbo 10, Yun Li 6, Bogdan Pasaniuc 1,*, Haoyu Zhang 4,*
PMCID: PMC10592849  PMID: 37873338

Summary:

Admixed populations, with their unique and diverse genetic backgrounds, are often underrepresented in genetic studies. This oversight not only limits our understanding but also exacerbates existing health disparities. One major barrier has been the lack of efficient tools tailored to the particular challenges of genetic studies in admixed populations. Here, we present admix-kit, an integrated toolkit and pipeline for genetic analyses of admixed populations. Admix-kit implements a suite of methods to facilitate genotype and phenotype simulation, association testing, genetic architecture inference, and polygenic scoring in admixed populations.

Introduction

Admixed individuals inherit a mosaic of ancestry segments originating from multiple continental ancestral populations, leading to their complex and diverse genetic backgrounds encompassing a wide spectrum of human genetic variation (Seldin et al., 2011). An understanding of such genetic ancestry mosaicism within admixed populations offers opportunities to gain insights into the origins and health implications of various genetic traits and diseases, contributing to a more comprehensive understanding of human genetics (Wojcik et al., 2019; Tan and Atkinson, 2023).

Despite the genetic richness and crucial insights they can offer, admixed populations remain significantly underrepresented in current genetic studies (Mills and Rahal, 2020). This underrepresentation can be attributed to various challenges, including the complexity of analyzing diverse genetic backgrounds and the lack of efficient tools and standardized practices for handling the genetic data of admixed populations. This gap not only hinders progress in genetic research but also exacerbates health disparities. For example, genetic risk prediction models built on findings from European ancestry datasets can introduce bias into personalized risk prevention strategies (Martin et al., 2019; Ding et al., 2023).

To address these challenges, we introduce admix-kit, an integrated and flexible Python toolkit along with workflows developed using the Workflow Description Language (WDL), specifically designed for the simulation and analysis of genetic data from admixed populations. We anticipate that our proposed software packages and workflows will help overcome these analytical challenges, enabling the inclusion of admixed individuals in future genetic studies.

Results

Computational toolkit for analyzing admixed genotypes

We begin by outlining the data structures and computational tools in admix-kit for analyzing admixed genetic datasets. Genotype and local ancestry data are each organized as a matrix of shape N x M x 2 (N and M denote the number of individuals and SNPs respectively, and ‘2’ denotes the two haplotypes; Figure 1a). Because these matrices often exceed memory capacity when N and/or M is large, we adopt a chunked array representation, implemented with the Dask Python library (Rocklin, 2015). Each chunk is loaded from disk on demand, which conserves memory and facilitates large-scale analyses. We employ the pgenlib Python API (URLs) to read phased genotypes. Local ancestry matrices are stored in a compressed format that leverages their contiguous nature (local ancestry at nearby SNPs is often identical within each individual). By translating genotype and local ancestry matrices into local-ancestry-specific (LAS) genotype dosages, we have also implemented a set of utility functions tailored for LAS genetic analysis, including LAS allele frequencies, polygenic scores and phenotype modeling that allows for LAS genetic architecture (Figure 1b).
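
To make the data layout concrete, the following is a minimal sketch of how LAS dosages could be derived from N x M x 2 genotype and local ancestry arrays, and of how a contiguous local ancestry vector could be compressed into segments. It uses plain in-memory NumPy arrays rather than Dask chunks, assumes ancestries are coded as integers 0 and 1, and the function names las_dosage and run_length_encode are hypothetical illustrations, not part of the admix-kit API.

```python
import numpy as np

def las_dosage(geno, lanc, n_anc=2):
    """Split phased genotypes into ancestry-specific dosages.

    geno, lanc: integer arrays of shape (N, M, 2); lanc holds the
    ancestry label (0..n_anc-1) of each haplotype at each SNP.
    Returns an array of shape (N, M, n_anc) counting the alleles
    carried on haplotypes of each ancestry.
    """
    geno = np.asarray(geno)
    lanc = np.asarray(lanc)
    if geno.shape != lanc.shape or geno.shape[-1] != 2:
        raise ValueError("geno and lanc must both have shape (N, M, 2)")
    dosage = np.zeros(geno.shape[:2] + (n_anc,), dtype=np.int8)
    for a in range(n_anc):
        # keep only alleles sitting on haplotypes of ancestry a
        dosage[:, :, a] = (geno * (lanc == a)).sum(axis=2)
    return dosage

def run_length_encode(row):
    """Compress a non-empty 1-D ancestry vector into
    (start, end, ancestry) segments, with end exclusive."""
    row = np.asarray(row)
    breaks = np.flatnonzero(np.diff(row)) + 1
    starts = np.concatenate(([0], breaks))
    stops = np.concatenate((breaks, [row.size]))
    return [(int(s), int(e), int(row[s])) for s, e in zip(starts, stops)]

# Toy example (hypothetical data): one individual, four SNPs.
geno = np.array([[[1, 0], [1, 1], [0, 1], [0, 0]]])
lanc = np.array([[[0, 1], [0, 1], [1, 1], [1, 1]]])
dos = las_dosage(geno, lanc)
segments = run_length_encode(lanc[0, :, 0])
```

Summing the dosage over the ancestry axis recovers the usual genotype dosage, which is a convenient sanity check when adapting this sketch to chunked arrays.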

Figure 1: Overview of admix-kit’s data structure, functionality, and illustrative analyses using a simulated dataset.

(a) Local ancestry and phased genotypes are stored in matrix format. Individual-specific and SNP-specific covariates are stored as two tables with matching orders. (b) Analysis based on ancestry-specific genotype dosage. Starting with a phased genotype for an individual (0/1 denotes presence of minor allele, blue/red color denotes European/African local ancestry), genotypes are separated into ancestry-specific dosages. Local ancestry-informed downstream analyses can be subsequently performed. (c) Visualization of local ancestry tracts. (d) Consistency of genome-wide genetic ancestry of the simulated dataset. (e) Consistency of allele frequencies from the simulated admixed genotypes.

Workflow for simulating admixture genotypes

Genotype simulation is essential for testing and benchmarking genetic analysis methodologies. Simulating admixed genomes is a significant challenge and is often the most time-consuming step among common analyses involving admixture. We developed a workflow to specifically address this bottleneck (Figure S1a). We primarily focus on two-way admixture for demonstration while noting that our software and pipeline are adaptable to various admixture scenarios. First, we use HAPGEN2 (Su et al., 2011) to expand sets of unique haplotypes within each reference genetic ancestry group (e.g. European/African), preserving the minor allele frequency (MAF) and linkage disequilibrium (LD) structure. Second, using the expanded haplotype sets in both genetic ancestry groups, we simulate the admixture process employing admix-simu (URLs) with parameters for genetic ancestry proportion and the number of admixture generations. This process mimics random mating and recombination events to generate a realistic distribution of local ancestry segments, MAF and LD structure for the generated genotypes. To make this simulation process more accessible, we have implemented these functionalities as command-line tools within admix-kit (Figure S1a). In detail, admix hapgen2 --pfile ${src_plink2} --n-indiv ${n_indiv} --out ${expanded_pop} is used to expand the source population with HAPGEN2, and admix admix-simu --pfile-list "['pop1', 'pop2']" --admix-prop "[0.2,0.8]" --n-indiv ${n_indiv} --n-gen ${n_gen} --out ${admix} can be used to simulate the admixture process across source populations. Furthermore, a number of functions are implemented to enable LAS analysis including association testing (Pasaniuc et al., 2011; Atkinson et al., 2021; Hou et al., 2021; Mester et al., 2023) (admix assoc), genetic architecture inference (Hou et al., 2023) (admix genet-cor) and polygenic scoring (Marnetto et al., 2020; Sun et al., 2022) (admix calc-partial-pgs).

We also implemented a user-friendly WDL-based workflow for genotype simulation that can be run on cloud-based computing platforms (e.g., AnVIL [https://anvilproject.org/], BioData Catalyst [https://biodatacatalyst.nhlbi.nih.gov/]) (Figure S1b). Users can input essential parameters to define the admixture process and provide the input genotype path of ancestral populations (a preprocessed 1000 Genomes dataset is provided for default usage). The workflow runs through each of the steps above and produces the simulated admixed genotype dataset. The admix-kit software is encapsulated in a publicly available Docker image (URLs).
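
The idea behind the admixture step can be illustrated with a toy simulation. The sketch below is not the algorithm used by admix-simu; it is a simplified single-pulse model, assumed here for illustration, in which each admixed haplotype copies from a randomly chosen source haplotype and switches source with probability 1 - exp(-n_gen * r) between adjacent SNPs, where r is a hypothetical per-SNP recombination rate. The function name and all parameter defaults are likewise assumptions.

```python
import numpy as np

def simulate_admixed_haplotypes(hap_pops, admix_prop, n_hap, n_gen,
                                recomb_per_snp=1e-3, seed=0):
    """Toy mosaic model of admixture (illustrative only).

    hap_pops: list of 0/1 arrays, each (n_source_hap, M), one per
              source population.
    admix_prop: ancestry proportions, one per source population.
    Returns (hap, lanc), each of shape (n_hap, M).
    """
    rng = np.random.default_rng(seed)
    n_pop = len(hap_pops)
    n_snp = hap_pops[0].shape[1]
    # more generations since admixture -> more switches, shorter tracts
    p_switch = 1.0 - np.exp(-n_gen * recomb_per_snp)
    hap = np.empty((n_hap, n_snp), dtype=np.int8)
    lanc = np.empty((n_hap, n_snp), dtype=np.int8)
    for i in range(n_hap):
        anc = rng.choice(n_pop, p=admix_prop)
        src = rng.integers(hap_pops[anc].shape[0])
        for j in range(n_snp):
            if j > 0 and rng.random() < p_switch:
                # recombination: redraw ancestry and source haplotype
                anc = rng.choice(n_pop, p=admix_prop)
                src = rng.integers(hap_pops[anc].shape[0])
            hap[i, j] = hap_pops[anc][src, j]
            lanc[i, j] = anc
    return hap, lanc

# Hypothetical source panels: population 0 carries only allele 0 and
# population 1 only allele 1, so the copied allele reveals the ancestry.
pop0 = np.zeros((10, 500), dtype=np.int8)
pop1 = np.ones((10, 500), dtype=np.int8)
hap, lanc = simulate_admixed_haplotypes(
    [pop0, pop1], admix_prop=[0.2, 0.8], n_hap=400, n_gen=8
)
```

Pairing consecutive haplotypes into individuals would give arrays in the N x M x 2 layout described earlier. A realistic simulation would instead use a genetic map and expanded reference haplotypes, as the workflow does.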

Example analysis of a simulated dataset

We demonstrate the practicality of admix-kit through analyses of a simulated dataset. All associated code and notebooks have been made publicly accessible (https://github.com/UW-GAC/admix-kit_workflow). This ensures our results are fully reproducible and can be seamlessly deployed in a cloud platform (e.g., AnVIL). We used the AnVIL workflow to simulate N = 1,000 admixed individuals with M = 174K SNPs on chromosomes 1 and 2, using a demographic model similar to African American individuals with 8 generations of admixture and an average ancestry proportion of 80% African and 20% European (Kidd et al., 2012) (ancestry proportion varies by individual). Notably, the genotype simulation took less than 30 minutes and scales to a much larger number of individuals and SNPs. Using principal component analysis (PCA), we observed that individuals within the simulated dataset are positioned along a cline between individuals labeled as European and African in the 1000 Genomes reference dataset, suggesting high quality of the simulated genotype dataset (Figure 1c,d). Allele frequencies computed within genotype segments corresponding to the respective local ancestry displayed high consistency with those computed in the reference population, indicating that the MAF structure is well preserved in the simulated genotypes (Figure 1e).
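
The allele frequency check described above can be sketched as follows: for each ancestry, count alleles only on haplotypes assigned to that ancestry and divide by the number of such haplotypes. This is a minimal NumPy illustration under the same assumptions as before (in-memory arrays of shape N x M x 2, ancestries coded 0 and 1); the function name las_allele_freq is hypothetical and is not claimed to match admix-kit's own utility.

```python
import numpy as np

def las_allele_freq(geno, lanc, n_anc=2):
    """Local-ancestry-specific allele frequencies.

    geno, lanc: integer arrays of shape (N, M, 2).
    Returns an (M, n_anc) float array; entries are NaN where no
    haplotype carries that ancestry at the SNP.
    """
    geno = np.asarray(geno)
    lanc = np.asarray(lanc)
    freq = np.full((geno.shape[1], n_anc), np.nan)
    for a in range(n_anc):
        mask = lanc == a
        n_hap = mask.sum(axis=(0, 2))           # haplotypes of ancestry a
        n_alt = (geno * mask).sum(axis=(0, 2))  # alleles on those haplotypes
        freq[:, a] = np.where(n_hap > 0, n_alt / np.maximum(n_hap, 1), np.nan)
    return freq

# Toy example (hypothetical data): two individuals, two SNPs.
geno = np.array([[[1, 0], [1, 1]],
                 [[0, 1], [0, 0]]])
lanc = np.array([[[0, 1], [0, 0]],
                 [[0, 1], [1, 1]]])
freq = las_allele_freq(geno, lanc)
```

Comparing each column of the result against frequencies computed in the corresponding reference population gives the kind of consistency check shown in Figure 1e.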

Discussion

Addressing the underrepresentation of admixed individuals in genetic studies is not only a scientific necessity but also a commitment to equity. With this goal in mind, we introduce admix-kit, a comprehensive toolkit and workflow tailored for admixed populations. We anticipate that our software package and workflows will facilitate greater inclusion of admixed individuals in future genetic studies.

Development of software and methodology in genetic studies relies heavily on the use of simulated datasets. These datasets help benchmark performance and facilitate comparisons with existing software. Traditionally, simulated datasets have been derived from publicly available reference populations. Often, these populations are selected based on a high degree of genetic similarity among individuals in the population (e.g., individuals having all four grandparents from a small geographic region). For instance, HAPGEN2 has recently been widely used for simulating large-scale genetic datasets that mimic the LD structures of reference populations such as European, African, American, East Asian, and South Asian using data from the 1000 Genomes Project (Su et al., 2011; Ruan et al., 2022; Zhang et al., 2022; Miao et al., 2023). While these simulations can recreate datasets with the same homogeneous LD as the reference populations, they may not capture the complex population structures seen in admixed populations (Figure S2). Consequently, these sampling conditions are not representative of global human genetic variation. As a remedy, simulating admixture among reference populations can provide datasets that more rigorously test the performance of new software. For example, our simulation pipeline can be used to investigate factors that potentially impact the accuracy of ancestry inference (including ancestry composition in the reference panel, the demographic model of the simulated admixed population and error in inferred local ancestry) and to understand how errors in ancestry inference propagate to downstream disease mapping and prediction applications.

Admix-kit holds significant potential for the development of polygenic risk scores (PRS). The efficacy of PRS is known to hinge on the similarity of the target population to the training population (Ding et al., 2023). With the PRIMED consortium working on methods to improve the performance of PRS in diverse populations, simulations will be pivotal for method evaluation (Kachuri et al., 2023). In this context, we expect that admix-kit will be an essential part of this effort.

Acknowledgements

This research was funded in part by the National Institutes of Health under award U01-HG011715 (B.P.). H.Z. was supported by the NIH Intramural Research Program. E.G.A. was supported by grant K01MH121659 from the National Institutes of Health/National Institute of Mental Health, the Caroline Wiess Law Fund for Research in Molecular Medicine, and the ARCO Foundation Young Teacher–Investigator Fund at Baylor College of Medicine. Research reported in this publication was supported by the National Institutes of Health for the project “Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium”, with grant funding for Study Sites CAPE (U01HG011715) and FFAIRR-PRS (U01HG011719), and the Coordinating Center (U01HG011697). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium Methods Working Group

Sally Adebamowo1, Adebowale Adeyemo2, Paul Auer3, Taoufik Bensellak2, Sonja Berndt2, Rohan Bhukar4, Hongyuan Cao5, Clinton Cario6, Nilanjan Chatterjee7, Jiawen Chen8, Tinashe Chikowore9, Ananyo Choudhury9, Matthew Conomos10, David Conti11, Sinead Cullina12, Burcu Darst13, Yi Ding14, Ruocheng Dong15, Rui Duan16, Yasmina Fakim17, Nora Franceschini8, Tian Ge18, Anisah W. Ghoorah17, Chris Gignoux19, Stephanie Gogarten10, Neil Hanchard2, Rachel Hanisch2, Michael Hauser20, Scott Hazelhurst9, Jibril Hirbo21, Whitney Hornsby18, Kangcheng Hou14, Xing Hua2, Alicia Huerta22, Micah Hysong8, Jin Jin23, Angad Johar24, Jon Judd6, Linda Kachuri6, Abram Bunya Kamiza9, Eimear Kenny12, Alyna Khan10, Elena Kharitonova8, Joohyun Kim21, Iain Konigsberg19, Charles Kooperberg13, Matt Kosel24, Iftikhar Kullo24, Ethan Lange19, Yun Li8, Qing Li2, Maria Liivrand25, Kirk Lohmueller14, Kevin Lu21, Ravi Mandla4, Alisa Manning4, Iman Martin2, Alicia Martin4, Shannon McDonnell24, Leah Mechanic2, Josep Mercader4, Rachel Mester14, Maggie Ng21, Kevin Nguyen1, Kristján Norland24, Franklin Ockerman8, Loes Olde Loohuis14, Ebuka Onyenobi1, Bogdan Pasaniuc14, Aniruddh Patel4, Ella Petter14, Kenneth Rice10, Joseph Rothstein12, Bryce Rowan12, Robb Rowley2, Yunfeng Ruan4, Sriram Sankararaman14, Ambra Sartori7, Dan Schaid24, Ruhollah Shemirani12, Jonathan Shortt19, Xueling Sim26, Johanna Smith24, Maggie Stanislawski19, Daniel Stram11, Quan Sun8, Bamidele Tayo27, Buu Truong4, Kristin Tsuo4, Sarah Urbut18, Ying Wang4, Wallace Minxian Wang4, Riley Wilson2, John Witte6, Genevieve Wojcik7, Jingning Zhang7, Ruyue Zhang8, Haoyu Zhang2, Yuji Zhang1, Michael Zhong1, Laura Zhou8

1University of Maryland Baltimore, Baltimore, MD, United States, 2National Institutes of Health, Bethesda, MD, United States, 3Medical College of Wisconsin, Milwaukee, WI, United States, 4Broad Institute, Cambridge, MA, United States, 5Florida State University, Tallahassee, FL, United States, 6Stanford University, Stanford, CA, United States, 7Johns Hopkins University, Baltimore, MD, United States, 8University of North Carolina at Chapel Hill, Chapel Hill, NC, United States, 9University of the Witwatersrand, Johannesburg, South Africa, 10University of Washington, Seattle, WA, United States, 11University of Southern California, Los Angeles, CA, United States, 12Mount Sinai, New York City, NY, United States, 13Fred Hutchinson Cancer Center, Seattle, WA, United States, 14University of California Los Angeles, Los Angeles, CA, United States, 15University of Wisconsin Milwaukee, Milwaukee, WI, United States, 16Harvard, Cambridge, MA, United States, 17University of Mauritius, Réduit, Mauritius, 18Massachusetts General Hospital, Boston, MA, United States, 19University of Colorado, Aurora, CO, United States, 20Duke University, Durham, NC, United States, 21Vanderbilt University Medical Center, Nashville, TN, United States, 22Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubiran, Mexico City, CDMX, Mexico, 23University of Pennsylvania, Philadelphia, PA, United States, 24Mayo Clinic, Rochester, MN, United States, 25Genevia Technologies, Tampere, Finland, 26National University of Singapore, Singapore, 27Loyola University of Chicago, Chicago, IL, United States

Footnotes

Competing interests

The authors declare no competing interests.

Availability and implementation: The admix-kit package is open source and available at https://github.com/KangchengHou/admix-kit. Additionally, users can use the pipeline designed for admixed genotype simulation, available at https://github.com/UW-GAC/admix-kit_workflow.

References

  1. Atkinson E.G. et al. (2021) Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat. Genet., 53, 195–204.
  2. Ding Y. et al. (2023) Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature, 618, 774–781.
  3. Hou K. et al. (2023) Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet., 55, 549–558.
  4. Hou K. et al. (2021) On powerful GWAS in admixed populations. Nat. Genet., 53, 1631–1633.
  5. Kachuri L. et al. (2023) Principles and methods for transferring polygenic risk scores across global populations. Nat. Rev. Genet.
  6. Kidd J.M. et al. (2012) Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation. Am. J. Hum. Genet., 91, 660–671.
  7. Marnetto D. et al. (2020) Ancestry deconvolution and partial polygenic score can improve susceptibility predictions in recently admixed individuals. Nat. Commun., 11, 1628.
  8. Martin A.R. et al. (2019) Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet., 51, 584–591.
  9. Mester R. et al. (2023) Impact of cross-ancestry genetic architecture on GWASs in admixed populations. Am. J. Hum. Genet., 110, 927–939.
  10. Miao J. et al. (2023) Quantifying portable genetic effects and improving cross-ancestry genetic prediction with GWAS summary statistics. Nat. Commun., 14, 1–13.
  11. Mills M.C. and Rahal C. (2020) The GWAS Diversity Monitor tracks diversity by disease in real time. Nat. Genet., 52, 242–243.
  12. Pasaniuc B. et al. (2011) Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet., 7, e1001371.
  13. Rocklin M. (2015) Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference. SciPy.
  14. Ruan Y. et al. (2022) Improving polygenic prediction in ancestrally diverse populations. Nat. Genet., 54, 573–580.
  15. Seldin M.F. et al. (2011) New approaches to disease mapping in admixed populations. Nat. Rev. Genet., 12, 523–528.
  16. Su Z. et al. (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 27, 2304–2305.
  17. Sun Q. et al. (2022) Improving polygenic risk prediction in admixed populations by explicitly modeling ancestral-specific effects via GAUDI. bioRxiv.
  18. Tan T. and Atkinson E.G. (2023) Strategies for the genomic analysis of admixed populations. Annu. Rev. Biomed. Data Sci., 6, 105–127.
  19. Wojcik G.L. et al. (2019) Genetic analyses of diverse populations improves discovery for complex traits. Nature, 570, 514–518.
  20. Zhang H. et al. (2022) A new method for multi-ancestry polygenic prediction improves performance across diverse populations. bioRxiv, 2022.03.24.485519.

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints