Briefings in Bioinformatics. 2018 Jun 29;20(5):1709–1724. doi: 10.1093/bib/bby044

A comprehensive survey of models for dissecting local ancestry deconvolution in human genome

Ephifania Geza 1,2, Jacquiline Mugo 1, Nicola J Mulder 2, Ambroise Wonkam 3, Emile R Chimusa 3, Gaston K Mazandu 1,2,3
PMCID: PMC7373186  PMID: 30010715

Abstract

Over the past decade, studies of admixed populations have increasingly gained interest in both medical and population genetics. These studies have so far shed light on the patterns of genetic variation throughout modern human evolution and have improved our understanding of the demographics and adaptive processes of human populations. To date, there exist about 20 methods or tools to deconvolve local ancestry. These methods have merits and drawbacks in estimating local ancestry in multiway admixed populations. In this article, we survey existing ancestry deconvolution methods, with special emphasis on multiway admixture, and compare these methods based on simulation results reported by different studies, computational approaches used, including mathematical and statistical models, and biological challenges related to each method. This should orient users on the choice of an appropriate method or tool for given population admixture characteristics and update researchers on current advances, challenges and opportunities behind existing ancestry deconvolution methods.

Keywords: local ancestry, linkage disequilibrium, admixture, hidden Markov models, ancestry informative markers

Introduction

Recent advances in high-throughput technologies have led to the availability of enormous amounts of publicly available genomic data. These data require appropriate techniques for informed decisions, which are useful in both medical and population genetics. This helps in understanding population history, demography, disease aetiology and the way different individuals respond to drugs. Genetic diversities observed in the human deoxyribonucleic acid (DNA) sequences result from inheritance processes, including mutation and recombination, and from population migration [1]. Owing to population migration, most modern humans are admixed [2, 3], as a result of the mating between two or more previously isolated populations. This interbreeding yields genetic recombination break points and the formation of variation and mixed DNA segments. Admixed genomes are a mosaic of segments originating from different ancestral populations [4–9]. There is a great interest in understanding the dynamics related to the origin of these variations, the evolutionary process and its consequences in human health. Studying admixed populations may improve our understanding of the genetic structure of diseases [10–13].

Today, advances in high-throughput sequencing and genotyping technologies, combined with appropriate computational and statistical methods, have made it feasible to infer the admixture history of human populations. As a result, several local ancestry inference or ancestry deconvolution methods have been implemented. Since 2003, >20 ancestry deconvolution methods have been introduced, some of which have similar statistical models. Here we have categorized them into two main groups based on whether they account for linkage disequilibrium (LD). LD-based models include STRUCTURE [14], SABER [15], HAPMIX [4] and ALLOY [16], and non-LD-based models include LAMP [17], WINPOP [18] and LOTER [19]. These local ancestry inference approaches have been used in several applications, mostly in admixed populations from the American continent, including understanding signatures of natural selection, ethnic differences in drug response and the risk of complex diseases. It is worth noting that these methods have merits and drawbacks in inferring local ancestry estimates in multiway admixed populations within a complex or multifaceted admixture model.

Some review studies on ancestry inference exist, but they are limited in describing existing ancestry inference-associated tools, their relationship with one another and applications [20, 21]. Unlike the reviews of Liu et al. [20] and Padhukasahasram [21], in 2013, Gompert and Buerkle [22] provided a mathematical structure of some existing methods, highlighting their similarities and differences. Here, we systematically survey all the ancestry deconvolution methods/tools we are aware of. We dissect approaches for pinpointing local ancestry along the genome of admixed populations and summarize the evolutionary factors they account for. Importantly, this survey explores different statistical model properties, assumptions and results from existing evaluations. This may orient the development of new methods and enable researchers to choose appropriate tools for given population characteristics, depending on where the local ancestry estimates should be applied. Furthermore, this study will enable software developers to identify current challenges and advances in local ancestry deconvolution, providing them with appropriate information for implementing practical and integrative tools, which consider current medical and population genetics demands. It also provides a reference manual on ancestry deconvolution models and on how biologically practical these models are, depending on the use of the inferred local ancestry estimates.

Importance and application of local ancestry deconvolution

Admixture introduces a type of LD, known as ‘admixture LD’ (Box 2). The distribution of admixture at the genetic level and/or along the genome of an admixed individual can exhibit a pattern of single nucleotide polymorphisms (SNPs), which may have either medical or evolutionary implications. Local ancestry deconvolution in human populations has contributed to a wide range of biomedical applications from identifying both local selection and genetic variants underlying ethnic difference in disease risk and drug/treatment responses to an understanding of population genetic history [23, 24]. These studies have so far shed light on the patterns of genetic variation throughout modern human evolution and have improved our understanding of the demographics and adaptive processes of human populations. As the world has become ‘smaller’, with travel making migration easier, many global populations are likely to become admixed [3]. This is likely to increase the complexity of the population admixture dynamics, resulting in multifaceted admixture events. There is a high chance of the SNPs of admixed populations contributing significantly to the phenotypic variability and differences in drug response therapies, which may not be the case in homogeneous populations (Box 2) [25]. As a result, genetic and clinical studies of admixed populations are complicated by the degree of admixture/traits association [25]. Accordingly, understanding the structure of human genetic admixture is not only of anthropological relevance but also a medical necessity; thus, knowledge of local ancestry estimates is of utmost importance. Local ancestry deconvolution is also of great relevance in gene–environment interactions. For example, local ancestry inference may be a key in pharmacogenetics and personalizing medicine [26, 27], considering potential interactions between genes and a drug (or environment). 
Moreover, local ancestry estimates are also crucial in gene–gene interactions [28], when trying to understand how multiple genes interact to confer a complex trait or disease. This suggests that local ancestry estimates may also be important in understanding signatures of natural selection and complex diseases through admixture mapping as an alternative approach to the popular genome-wide association studies, particularly in admixed populations.

Box 2.

Key definitions

Population admixture: The mating of two or more previously isolated populations.

n-way admixture: An admixed population resulting from the mixing of n previously distinct populations.

Global ancestry: The genome-wide fraction of ancestry contributed by each ancestral population.

Candidate ancestry: A possible ancestry that admixed to form an admixed population.

Local ancestry (locus-specific ancestry): The number of alleles inherited from a particular candidate ancestry at a site; values are 0, 1 or 2.

Proxy ancestry (candidate ancestry): A population used to represent an ancestral population of admixed individuals.

LD: The non-random association of alleles between two or more loci. There exist three major types of LD:

Mixture LD: The LD that is caused by differences in the average ancestral proportions contributed by ancestral populations in sampled individuals. It occurs even if alleles are unlinked.

Admixture LD: It results when ancestry at nearby markers is inherited together in local chromosomes as a result of the admixture process, resulting in large haplotype blocks. It is weak at all distances and decays slowly.

Background LD: The LD within ancestral populations; it depends highly on population history (i.e. it is generated by genetic drift and population bottlenecks). It is strong at short distances, results in short haplotype blocks and decays rapidly.

Local ancestry deconvolution: The inference of the ancestral origin of each of the alleles at a site.

AIMs: Markers that show significant differences in frequency between diverged populations.

HMMs: Hidden Markov models are statistical models that consist of two sets of variables: hidden variables, assumed to follow a Markov (memoryless) process, and observed variables, which are generated by the hidden variables. They are used to model sequence data sets for which the hidden variables form a Markov chain.

There are several applications of local ancestry deconvolution in understanding the demographics and admixture structure of various admixed populations. However, as indicated in Table 2, there exist few applications relevant to disease risk and clinical prediction of treatment. These applications are mostly based on admixed populations from the American continent. Table 2 provides more detail on these studies that applied local ancestry estimates, the reference paper, the admixed populations under study, the number of samples used and markers studied and the results obtained. This table suggests that local ancestry inference has been applied in identifying selection signatures, by examining the distribution of genome-wide ancestry in admixed populations [21] or by comparing the local ancestry estimates in cases to local ancestry estimates in control samples or local ancestry in cases at a site compared with the average genome-wide ancestry.

Table 2.

Applications of local ancestry distribution

| Aim of study | Admixed populations | Number of study samples | Markers | Results obtained | Reference paper |
|---|---|---|---|---|---|
| Determining if genetic ancestry is associated with complex diseases (type 2 diabetes) | Two admixed populations: Medellin, Colombia and central Mexico | 499 cases and 197 controls from Medellin, and 163 cases and 72 controls from central Mexico | 67 AIMs | Non-European ancestry proportion is associated with type 2 diabetes and lower socio-economic status in Latino admixed populations from South and North America | Florez et al. [52] |
| Detecting natural selection in African Americans | African Americans | 5210 individuals (1890 African Americans and 3320 non-African Americans) | 491 526 | Six positions were identified with excess European or African ancestry: chr1: 17409539, chr2: 241750403, chr2: 37451925, chr3: 116930811, chr6: 163653158 and chr16: 61214438, in the corresponding regions 1p36, 2q37, 2p22, 3q13, 6q26 and 16q2, respectively. The disease genes in the identified regions were known to be highly associated with human diseases like prostate cancer and hypertension | Jin et al. [53] |
| Inferring genome-wide admixture patterns in the Qatar population | Qatar population | 156 individuals | 71 982 SNPs | The Qatar population was subdivided by SUPPORTMIX into three: Arab-Qataris, a mixture of Middle Eastern population and Bedouin or Hadar-Arab ancestry; Persian-Qataris, a mixture of Europeans and non-Middle East Asian populations and sub-Saharan Africans; and African-Qataris, African, Persian and fewer Middle Eastern populations | Omberg et al. [33] |
| Understanding the disease genetics and history of Latino populations by assessing local ancestry performance in Latino populations | GALA data set of mainland United States: Puerto Rico and Mexico, and Multiethnic Cohort (MEC) data set | 489 mother–father–child trio families of Puerto Rican and Mexican ethnic groups and 3204 unrelated Latino individuals from MEC | 127 935 common SNPs | Using WINPOP, LAMPLD, ALLOY and PCADMIX, it was shown that deviations exist in local ancestry inference. It was also shown that sites/loci with increased deviations in local ancestry Mendelian inconsistencies in local ancestry (MILANC) | Pasaniuc et al. [25] |
| Finding the signals of gene–gene interaction on human diseases | Simulated admixed population | 10 000 admixed individuals | 1 million SNPs | Local ancestry deconvolution estimates provide promising results in detecting gene–gene interaction | Aschard et al. [28] |
| Identifying signatures of selection in Latinos | 13 admixed Latin American populations | 249 individuals | 678 genome-wide micro-satellite markers | Three loci were identified as selection candidates. Two regions had excess of European ancestry at 1p36 and 14q32, and one had excess of African ancestry at 6p22 | Deng et al. [54] |

Overview of local ancestry deconvolution

Owing to the popularity of hidden Markov models (HMMs) in local ancestry inference, and to guide readers, we illustrate the HMM-based local ancestry problem with a general formulation. Under a null model of two-way admixture and simplifying assumptions, it considers an admixed individual, $i$, with genotypes coded by the number of copies of the minor allele, $Y_{it} \in \{0, 1, 2\}$. Unless otherwise stated, throughout this article, we denote:

$k = 1, 2, \ldots, K$ the population index for ancestral populations, assumed to be previously isolated and mixed $g$ generations ago,
$i = 1, \ldots, n_k$ the individual index for ancestral population $k$, with $n_k$ the number of individuals in ancestral population $k$, meaning that the total number, $n$, of ancestral individuals is $n = \sum_{k=1}^{K} n_k$,
$t = 1, 2, \ldots, T$ the site or SNP index,
$h \in \{0, 1\}$ the haplotype,
$\pi_k$ the proportion of ancestry inherited from population $k$, such that $\sum_{k=1}^{K} \pi_k = 1$,
$\pi$ the vector of ancestry inherited from each ancestral population,
$\Pi$ the $n \times K$ matrix of proportions of each individual $i$ inherited from population $k$,
$p_t$ the allele frequency at a site $t$ and $P$ the matrix of ancestral allele frequencies,
$X_{it}$ the local ancestry at site $t$, for individual $i$, copied from $k$,
$Y_t$ the observed admixed genotype at $t$,
$r$ the recombination rate,
$d_t$ the genetic distance between $t$ and $t-1$, and
$\sigma(k = k')$ the indicator function, with $\sigma(k = k') = 1$ if $k = k'$ and $0$ otherwise, and $\sigma(k \neq k') = 1 - \sigma(k = k')$.

For window-based methods, a window is indexed by $w$, and the distance between the centres of two consecutive windows is $d_w$.

The fraction of the individual's genome-wide ancestry from ancestral population A ($\pi_A$) and the average number of generations since admixture ($g$) are assumed to be known. Let $X_t \in \{0, 1, 2\}$ be the unobserved state, the number of allele copies inherited from ancestral population A by this individual at marker $t$ along the genome. An HMM is completely defined by three sets of parameters, namely, initial, transition and emission probabilities. At the left end of the chromosome, we set the initial probabilities to $P(X_0 = 0) = (1-\pi_A)^2$, $P(X_0 = 1) = 2\pi_A(1-\pi_A)$ and $P(X_0 = 2) = \pi_A^2$. Let $d$ be the genetic distance (in Morgans) between markers $t$ and $t+1$; the transition probabilities are as follows:

  • 1. Switching probabilities while in State 0

  •     $P(X_{t+1} = 0 \mid X_t = 0) = e^{-2gd} + 2e^{-gd}(1-e^{-gd})(1-\pi_A) + (1-e^{-gd})^2(1-\pi_A)^2$

  •     $P(X_{t+1} = 1 \mid X_t = 0) = 2e^{-gd}(1-e^{-gd})\pi_A + (1-e^{-gd})^2\, 2\pi_A(1-\pi_A)$

  •     $P(X_{t+1} = 2 \mid X_t = 0) = (1-e^{-gd})^2\pi_A^2$

  • 2. Switching probabilities while in State 1

  •     $P(X_{t+1} = 0 \mid X_t = 1) = e^{-gd}(1-e^{-gd})(1-\pi_A) + (1-e^{-gd})^2(1-\pi_A)^2$

  •     $P(X_{t+1} = 1 \mid X_t = 1) = e^{-2gd} + e^{-gd}(1-e^{-gd}) + (1-e^{-gd})^2\, 2\pi_A(1-\pi_A)$

  •     $P(X_{t+1} = 2 \mid X_t = 1) = e^{-gd}(1-e^{-gd})\pi_A + (1-e^{-gd})^2\pi_A^2$

  • 3. Switching probabilities while in State 2

  •     $P(X_{t+1} = 0 \mid X_t = 2) = (1-e^{-gd})^2(1-\pi_A)^2$

  •     $P(X_{t+1} = 1 \mid X_t = 2) = 2e^{-gd}(1-e^{-gd})(1-\pi_A) + (1-e^{-gd})^2\, 2\pi_A(1-\pi_A)$

  •     $P(X_{t+1} = 2 \mid X_t = 2) = e^{-2gd} + 2e^{-gd}(1-e^{-gd})\pi_A + (1-e^{-gd})^2\pi_A^2$
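As an illustration only, the diploid transition probabilities can be assembled from two per-haplotype chains. The Python sketch below assumes that the two haplotypes switch ancestry independently, each with no-recombination probability exp(-g·d); the function name `diploid_transition_matrix` and its arguments are hypothetical and not taken from any of the surveyed tools.

```python
import math

def diploid_transition_matrix(pi_a, g, d):
    """Return T with T[u][v] = P(X_{t+1} = v | X_t = u).

    States count the allele copies inherited from population A (0, 1 or 2).
    Assumption: the two haplotypes evolve independently along the genome.
    """
    stay = math.exp(-g * d)   # no recombination on one haplotype
    move = 1.0 - stay         # recombination, ancestry is re-drawn
    # Per-haplotype transitions; index 0 = ancestry B, index 1 = ancestry A.
    hap = [
        [stay + move * (1.0 - pi_a), move * pi_a],   # currently B
        [move * (1.0 - pi_a), stay + move * pi_a],   # currently A
    ]
    # Haplotype ancestries underlying each diploid state.
    start = {0: (0, 0), 1: (1, 0), 2: (1, 1)}
    T = [[0.0] * 3 for _ in range(3)]
    for u, (h1, h2) in start.items():
        for a in (0, 1):
            for b in (0, 1):
                T[u][a + b] += hap[h1][a] * hap[h2][b]
    return T

T = diploid_transition_matrix(pi_a=0.3, g=8, d=0.01)
```

Each row of `T` sums to one by construction, which gives a quick sanity check when comparing against the closed-form expressions.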

Let $p_A$ and $p_B$ be the minor allele frequencies at marker $t$ in proxy ancestral populations A and B, respectively. For $u \in \{0, 2\}$, both allele copies derive from a single ancestral population $k \in \{A, B\}$ ($k = B$ in State 0 and $k = A$ in State 2), and the likelihoods (the emission probabilities) are $P(Y_{it} = 0 \mid X_t = u) = (1-p_k)^2$, $P(Y_{it} = 1 \mid X_t = u) = 2p_k(1-p_k)$ and $P(Y_{it} = 2 \mid X_t = u) = p_k^2$. When one copy derives from each of populations A and B (State 1), the emission probabilities are $P(Y_{it} = 0 \mid X_t = 1) = (1-p_A)(1-p_B)$, $P(Y_{it} = 1 \mid X_t = 1) = p_A(1-p_B) + p_B(1-p_A)$ and $P(Y_{it} = 2 \mid X_t = 1) = p_A p_B$. The quantity of interest is the posterior probability $P(X_t \mid Y_{it})$, obtained while integrating over uncertainty in $\pi_A$, $g$, $p_A$ and $p_B$.
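The emission table for the two-way model can be sketched in a few lines of Python. This is a minimal illustration that treats the allele frequencies as known constants; the helper names `genotype_emission` and `single_site_posterior` are hypothetical, and the single-site posterior ignores neighbouring markers, which a full HMM would not.

```python
def genotype_emission(p_a, p_b):
    """Return E with E[u][y] = P(Y = y | X = u).

    u = number of allele copies inherited from population A,
    y = observed minor-allele count, p_a / p_b = minor allele frequencies.
    """
    def pair(p1, p2):
        # Genotype distribution for two independently drawn alleles.
        return [(1 - p1) * (1 - p2), p1 * (1 - p2) + p2 * (1 - p1), p1 * p2]
    return [pair(p_b, p_b), pair(p_a, p_b), pair(p_a, p_a)]

def single_site_posterior(pi_a, p_a, p_b, y):
    """Posterior over states at one marker, using only the initial probabilities."""
    prior = [(1 - pi_a) ** 2, 2 * pi_a * (1 - pi_a), pi_a ** 2]
    emis = genotype_emission(p_a, p_b)
    weights = [prior[u] * emis[u][y] for u in range(3)]
    total = sum(weights)
    return [w / total for w in weights]

posterior = single_site_posterior(pi_a=0.3, p_a=0.2, p_b=0.6, y=2)
```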

In general, HMM family-based models in the context of local ancestry address one or all of the three problems, i.e. (1) determining the probability of observing a particular sequence given HMM parameters, (2) determining the most probable hidden state sequence given the observed sequence and (3) finding the HMM parameters that maximize the probability of observing a given sequence [29]. Methods to address these three respective problems exist and include the forward, forward–backward or Viterbi and expectation maximization (EM) algorithms. All the extensions of HMM family-based models have the three HMM parameters and most of them make use of the forward–backward or Viterbi and EM algorithms for inference. Box 1 describes the forward–backward, Viterbi and EM algorithms. It is worth noting that few local ancestry deconvolution methods make use of other approaches, such as discriminative or generative models. Discriminative models use the conditional distribution to model the decision boundary, including conditional random fields (CRFs) and support vector machines (SVMs). On the other hand, generative approaches, which use joint probability distribution, are the most common in ancestry deconvolution, as they are easy to fit as compared with discriminative approaches, which often lead to solving a convex programming problem (Supplementary File).

Box 1.

Forward–backward algorithm

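Given the observed sequence and HMM parameters, $\lambda$, we seek to estimate the probability of a hidden state at time or marker $t$. The forward–backward algorithm is an application of dynamic programming, which avoids the cost of the naive computation of the joint probability of hidden and observed states over all hidden-state paths, i.e. $P(X_t \mid Y_{1:T})$ is estimated as follows:

$$
\begin{aligned}
P(X_t = k \mid Y_{1:T}) &\propto P(X_t = k, Y_{1:T}) \propto P(Y_{t+1:T} \mid X_t = k, Y_{1:t})\, P(X_t = k \mid Y_{1:t}) \\
&\propto P(Y_{t+1:T} \mid X_t = k, Y_{1:t})\, P(X_t = k, Y_{1:t})
\end{aligned}
\tag{4}
$$

The second factor in the second line of Equation (4) is given by:

$$
\begin{aligned}
P(X_t = k, Y_{1:t}) &= \sum_{k'} P(X_t = k \mid X_{t-1} = k')\, P(Y_t \mid X_t = k, X_{t-1} = k', Y_{1:t-1})\, P(X_{t-1} = k', Y_{1:t-1}) \\
&= \sum_{k'} P(X_t = k \mid X_{t-1} = k')\, P(Y_t \mid X_t = k)\, P(X_{t-1} = k', Y_{1:t-1}) \\
&= E_t(k) \sum_{k'} A(k', k)\, P(X_{t-1} = k', Y_{1:t-1})
\end{aligned}
\tag{5}
$$

where $E_t(k) = P(Y_t \mid X_t = k)$ is the emission probability and $A(k', k) = P(X_t = k \mid X_{t-1} = k')$ is the transition probability. Letting $P(X_t = k, Y_{1:t}) = \alpha_t(k)$ in Equation (5), we obtain:

$$
\alpha_t(k) = E_t(k) \sum_{k'} A(k', k)\, \alpha_{t-1}(k') \tag{6}
$$

Setting the first factor of the same line of Equation (4) to $\beta_t(k)$, i.e. $\beta_t(k) = P(Y_{t+1:T} \mid X_t = k)$ (by the Markov property, $Y_{1:t}$ can be dropped from the conditioning), we have:

$$
\begin{aligned}
\beta_t(k) &= \sum_{k'} P(X_{t+1} = k', Y_{t+1}, Y_{t+2:T} \mid X_t = k) \\
&= \sum_{k'} P(Y_{t+1} \mid X_{t+1} = k')\, P(Y_{t+2:T} \mid X_{t+1} = k')\, P(X_{t+1} = k' \mid X_t = k) \\
&= \sum_{k'} E_{t+1}(k')\, \beta_{t+1}(k')\, A(k, k')
\end{aligned}
\tag{7}
$$

Substituting $\alpha_t(k)$ and $\beta_t(k)$, and letting $\gamma_t(k) = P(X_t = k \mid Y_{1:T})$ in Equation (4), we have $\gamma_t(k) \propto \alpha_t(k)\beta_t(k)$.

At any locus $t$, the likelihood of the observed sequence is given by $P(Y_{1:T}) = \sum_{k} \alpha_t(k)\beta_t(k)$. However, Equation (6) depends highly on whether the data are phased. The Viterbi algorithm is another dynamic programming technique, similar to back-tracking in the forward–backward algorithm [37].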
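The recursions for the forward and backward quantities can be sketched as follows. This is a minimal pure-Python illustration, not the implementation of any surveyed tool: it assumes a single transition matrix and a single emission table shared by all markers (real methods use marker-specific values), and it rescales the forward values at each step to avoid numerical underflow. The function name `forward_backward` is hypothetical.

```python
import math

def forward_backward(init, trans, emis, obs):
    """Return (gamma, loglik) for a discrete HMM.

    init[k]     = P(X_1 = k)
    trans[j][k] = P(X_t = k | X_{t-1} = j)   (assumed constant over markers)
    emis[k][y]  = P(Y_t = y | X_t = k)       (assumed constant over markers)
    gamma[t][k] = P(X_t = k | Y_{1:T})
    """
    K, T = len(init), len(obs)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    scale = [0.0] * T

    # Forward pass with per-marker rescaling.
    alpha[0] = [init[k] * emis[k][obs[0]] for k in range(K)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        for k in range(K):
            prev = sum(trans[j][k] * alpha[t - 1][j] for j in range(K))
            alpha[t][k] = emis[k][obs[t]] * prev
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, rescaled with the same constants.
    for t in range(T - 2, -1, -1):
        for k in range(K):
            beta[t][k] = sum(
                trans[k][j] * emis[j][obs[t + 1]] * beta[t + 1][j] for j in range(K)
            ) / scale[t + 1]

    # Posterior state probabilities at each marker.
    gamma = []
    for t in range(T):
        w = [alpha[t][k] * beta[t][k] for k in range(K)]
        z = sum(w)
        gamma.append([x / z for x in w])
    loglik = sum(math.log(s) for s in scale)   # log P(Y_{1:T})
    return gamma, loglik
```

The rescaling constants multiply to the sequence likelihood, so the log-likelihood is available at no extra cost.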

Dissecting current local ancestry deconvolution approaches

This section provides a detailed theoretical overview of existing local ancestry deconvolution and an exploration of their statistical models. Generally, deconvolving genetic ancestry may yield global ancestry and/or local ancestry from a given proxy ancestry (Box 2). Modelling the complex nature of the linkage patterns and the choice of HMM play a critical role in yielding better performance in inferring local ancestry. As mentioned earlier, existing local ancestry deconvolution approaches can be subdivided into two main groups based on whether the method models admixture/background LD (refer to Box 2 for description of LD types). Figure 1 shows the evolution of different local ancestry deconvolution methods graphically, indicating the time each method was proposed. Below, we discuss these two categories of local ancestry deconvolution approaches in more detail.

Figure 1.

The evolution of local ancestry deconvolution methods from 2003 to 2017.

LD-based methods

This category consists of all methods that account for either admixture or background LD or both, and model ancestry along the genome of an admixed individual using HMM family-based models [14, 15]. LD is not strictly important in the inference process; however, removing SNPs in LD reduces the estimation accuracy. Moreover, not modelling LD and using all SNPs (linked and unlinked) may cause noise and systematic biases [30] because of the complex nature of modelling LD patterns in both an admixed population and its proxy ancestral populations. Although it is a challenge, accurately modelling LD in local ancestry deconvolution may improve the accuracy. On the other hand, inaccurately modelled background LD within ancestral populations may yield spurious results [11]. Ancestry deconvolution models assume that ancestry at different sites or windows follows a Markov chain. Recombination is assumed to occur at every generation, resulting in Poisson-distributed recombination points with a rate that depends on both the recombination rate, $r$, and the number of generations since admixture, $g$ [6, 22, 30, 31]. Individuals are assumed to be independent of each other; hence, for admixed haploid data, the ancestry at a site, $X_t \in \{1, \ldots, K\}$, is modelled by:

$$P(X_1 = k \mid \pi, r) = \pi_k, \tag{1}$$

at the first marker, and the transition probability between consecutive markers is modelled by:

$$P(X_t = k \mid X_{t-1} = k', \pi, r) = \sigma(k = k')\, e^{-d_t r} + (1 - e^{-d_t r})\, \pi_k, \tag{2}$$

where $\sigma(k = k')$ and $d_t$ are as defined above. On haploid data, the probability of a recombination event is $1 - e^{-d_t r}$, meaning that the probability of no recombination is $e^{-d_t r}$ [13, 31].
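As a small illustration, the haploid transition model of Equation (2) can be written as a K×K matrix. The sketch below is an assumption-laden example rather than code from any surveyed tool; the function name `haploid_transition_matrix` is hypothetical, and `r` is treated as a single effective rate.

```python
import math

def haploid_transition_matrix(pi, r, d_t):
    """Return A with A[kp][k] = P(X_t = k | X_{t-1} = kp) for haploid data.

    pi  = ancestry proportions (sum to 1),
    r   = recombination rate, d_t = genetic distance to the previous marker.
    """
    stay = math.exp(-d_t * r)   # probability of no recombination
    K = len(pi)
    return [
        [(stay if k == kp else 0.0) + (1.0 - stay) * pi[k] for k in range(K)]
        for kp in range(K)
    ]

A = haploid_transition_matrix(pi=[0.5, 0.3, 0.2], r=10.0, d_t=0.02)
```

Under this model the ancestry proportions are stationary: starting from `pi` and applying `A` returns `pi`, so the marginal ancestry distribution is the same at every marker.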

Modelling only admixture LD

These include the early methods that use ancestry informative markers (AIMs) to estimate the ancestry of alleles inherited from a particular population at a chromosomal site. Examples include STRUCTURE [14, 21], which pioneered genetic ancestry inference in 2003, and ANCESTRYMAP [31] and ADMIXMAP [32], both proposed in 2004 (Figure 1). They are all aimed at improving the application of local ancestry estimates in admixture mapping. They are based on the Bayesian inference framework [22] and assume markers are independent. The ancestry estimation process is memoryless, as admixture LD is modelled by Markov chains. These early methods do not model background LD, and most of them were designed for two-way admixtures, although some deconvolve ancestry in multiway admixtures, e.g. STRUCTURE. Local ancestry is deconvolved by a standard HMM integrated with Markov chain Monte Carlo (MCMC) to account for uncertainty in model parameters [12, 13, 20]. Admixture generations, average genome-wide ancestry proportions and ancestral allele frequencies are assumed to be known [14, 31, 32], which is not always the case in reality. As they are based on HMMs, the emission probability models of these early methods depend on the frequency of allele $a$ originating from one of the $K$ unknown ancestries at locus $t$. Assuming the ancestry of origin is $X_t$ and the observed minor allele at $t$ is $Y_t$, the emission probability is computed as follows:

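$$P(Y_t = z \mid X_t = k) = \frac{1}{2 n_k} \sum_{i=1}^{n_k} \sum_{h=0}^{1} \sigma(a_{kiht} = z), \tag{3}$$

where $\sigma$ is the indicator function for $a_{kiht}$, the allele at site $t$ on haplotype $h$ of individual $i$ in population $k$, such that $a_{kiht} \in \{A, C, G, T, -\}$, and $h$ is as defined previously.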
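Equation (3) is simply the frequency of allele z among the 2n_k reference haplotypes of population k. A minimal Python sketch follows, assuming each reference individual is stored as a pair of equal-length haplotype strings; the data layout and the function name `emission_frequency` are hypothetical.

```python
def emission_frequency(ref_haplotypes, t, z):
    """Estimate P(Y_t = z | X_t = k) from the reference panel of population k.

    ref_haplotypes: list of (hap0, hap1) strings, one pair per individual.
    t: zero-based site index, z: allele symbol in {"A", "C", "G", "T", "-"}.
    """
    n_k = len(ref_haplotypes)
    # Count haplotypes carrying allele z at site t (the indicator sum).
    count = sum(1 for individual in ref_haplotypes for hap in individual if hap[t] == z)
    return count / (2 * n_k)

panel = [("AC", "AT"), ("GC", "AC")]     # two individuals, two sites
freq = emission_frequency(panel, t=0, z="A")
```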

Advances in technology, which produce dense data sets, and progress made in designing statistical and computational methods have brought to light the inadequacy of AIMs and standard HMM methods in modelling ancestry along an admixed genome. Although they violate the independence assumption because of LD [20], high-density SNP data sets are more powerful [11]. In addition to the early methods, SUPPORTMIX [33] also models admixture LD only, by combining SVMs and HMMs (Supplementary Text S1). The method was proposed in 2012 in an effort to reduce computational time and to address the challenge of having few typed or existing reference panels, improving multiway local ancestry deconvolution overall. Note that SVMs are popular in computational biology [34] because of their flexibility in data types and ability to handle big data. In local ancestry deconvolution, SVMs are trained on ancestral populations, which are larger than the admixed populations, and then classify segments of admixed individuals. Refer to Table 1 for the biological or statistical parameters of this model and to Supplementary Text S1 for the mathematics of SVMs. Despite using the rich information on haplotypes, this method is faster than other LD-based methods. In the same year, PCADMIX [39] was proposed; it leverages principal component analysis of proxy ancestral haplotypes and models only admixture LD under an HMM similar to the one above (Table 1). Compared with other LD-based methods, SUPPORTMIX [33] and PCADMIX [39] are much faster. In 2013, as high-quality genotypes are not always available, SEQMIX [6], a method that is based on exome sequence reads and uses HMMs, was proposed. This method does not model SNPs in LD. To reduce the noise and systematic biases from the use of all SNPs [30], methods that model both admixture and background LD emerged [11].

Table 1.

Existing ancestry deconvolution tools

| Software | Multiway | Account LD | LD model | Biological/statistical parameters | Reference populations | Admixed populations | Reference |
|---|---|---|---|---|---|---|---|
| STRUCTURE V2* | | | HMM | Markers and ancestry proportions | Unphased | Unphased | [14] |
| ANCESTRYMAP* | | | HMM | Physical map, recombination and ancestry proportions | Unphased | Unphased | [30] |
| ADMIXMAP* | | | HMM | Physical map and ancestry proportions | Unphased | Unphased | [31] |
| SABER | | | MHMM | Physical map or recombination distance | Phased/unphased | Phased/unphased | [15] |
| LAMP | | | | Admixture generations, LD threshold and physical map | Unphased | Unphased | [17] |
| HAPAA | | | HHMM | Admixture generations and genetic divergence | Phased | Phased | [35] |
| SWITCH | | | MHMM | Recombination rate | Phased | Phased | [36] |
| GEDI-ADMX | | | Fixed size FHMM | Admixed and ancestral SNPs (physical map) | Phased | Unphased | [37] |
| WINPOP | | | | Recombination, admixture generations, LD threshold and physical map | Unphased | Unphased | [18] |
| HAPMIX | | | HHMM | Genetic map, mutation rate and admixed and ancestral SNPs | Phased | Unphased | [4] |
| CHROMOPAINTER | | | Co-ancestry matrix | Recombination rate | Phased | Phased | [38] |
| LAMPLD | | | HHMM | Number of hidden states, window size and physical map | Phased | Unphased | [30] |
| SUPPORTMIX* | | | HMM | Admixture generations and genetic map | Phased | Phased | [33] |
| PCADMIX* | | | Windows of blocks of SNPs | Genetic map and window size | Phased | Phased | [39] |
| mSPECTRUM | | | | SNPs, mutation and recombination rate | Phased | Phased | [40] |
| MULTIMIX | | | MVN | Genetic map, legend file and misfitting probabilities | Phased/unphased | Phased/unphased | [41] |
| ALLOY | | | Non-homogeneous VLMC | Markers, ancestral proportions, admixture generations and genetic map | Phased | Phased | [16] |
| RFMIX | | | | Genetic map, window size and admixture generations | Phased | Phased | [42] |
| EILA | | | | Physical map | Unphased (no missing values) | Unphased (no missing values) | [43] |
| SEQMIX | | | | Genetic map | Unphased | Unphased | [6] |
| ELAI | | | Two-layer HMM | Admixture generations, lower and upper cluster | Phased/unphased | Phased/unphased | [44] |
| LOTER | | | | | Phased | Phased | [19] |

Note: The columns indicate the tool name (Column 1); tools with * account for admixture LD and not background LD and those without * account for both admixture and background LD, the software’s: ability to model multiway admixtures (Column 2), ability to model LD (Column 3), LD model (Column 4), biological and statistical parameters required (Column 5), population data type required: ancestral (Column 6) and admixed (Column 7) and finally (Column 8), the reference paper from which more details are found, respectively. ✓ indicates the ability of the software to perform a specified task, and ✗ indicates the inapplicability of the task by a particular tool.

Modelling admixture and background LD

Admixture LD HMM-based approaches are often not realistic in modelling biological data, as dense SNPs violate the independence of alleles within the ancestral population (background LD). To model background LD in addition to admixture LD, the HMM was extended and rich ancestral haplotype information was also used (Boxes 3 and 4). Methods modelling background LD include SABER, SWITCH [36], HAPAA [35], HAPMIX [4], MULTIMIX [41], ALLOY [16] and ELAI [44]. These are based on HMM extensions, including the Markov HMM (MHMM), hierarchical HMM (HHMM), factorial HMM (FHMM) and two-layer HMM, and on multivariate normal (MVN) distributions, to model background LD. Two existing methods use the MHMM: SABER, proposed in 2006, was the first method to account for background LD in genetic ancestry inference, and SWITCH [36] was proposed later, in 2008. Unlike a standard HMM, MHMM emission models depend on the joint distribution of alleles at nearby markers if ancestry is the same at consecutive markers [21, 22]. Otherwise, if ancestry switches, the emission model reduces to that of the standard HMM. Box 3 discusses the differences in emission models, while Box 4 discusses the switch models of different local ancestry deconvolution models. The expected length of blocks inherited from population k is inversely related to a parameter τ, which approximately corresponds to the number of generations since admixture happened, with all ancestral populations mixed at a single point back in time, referred to as a hybrid isolation model. SABER extends Equation (2) for K>3 given the different admixing times, gk [21]. Assuming the inverse of gk is the expected length of blocks inherited from population k, the transition matrix is given in Box 4.

Box 3.

Assuming that paternal and maternal ancestral states are independent of each other, given $Y_t^m$ as the observed maternal haplotype at $t$ and the corresponding ancestral state $X_t^m$, LD is modelled by a first-order Markov chain, such that

$$
P(Y_t^m \mid X_1^m, X_2^m, \ldots, X_t^m, Y_1^m, \ldots, Y_{t-1}^m) =
\begin{cases}
P(Y_t^m \mid Y_{t-1}^m, X_t^m, X_{t-1}^m), & \text{for } X_t^m = X_{t-1}^m, \\
P(Y_t^m \mid X_t^m), & \text{otherwise.}
\end{cases}
$$

As a result, the emission probability of maternal alleles at two neighbouring markers $t-1$ and $t$ in the SABER model is given by:

$$
P(Y_t^m = s \mid Y_{t-1}^m = r, X_t^m = k, X_{t-1}^m = k') = B_t(s, r, k, k') =
\begin{cases}
\tilde{B}_{k,t}(s, r), & \text{for } k = k', \\
\bar{B}_{k,t}(s), & \text{otherwise,}
\end{cases}
\tag{8}
$$

where $\tilde{B}_{k,t}(s, r)$ is the chance of having allele $s$ at marker $t$, given that there was allele $r$ at $t-1$ and that $r$ and $s$ originated from the same population, i.e. $k = k'$. $\bar{B}_{k,t}(s)$ is the frequency of allele $s$ at marker $t$ with origin $k$, which is evaluated as in Equation (3). In Equation (8), if ancestry switches between consecutive markers, the first-order MHMM emission model reduces to the standard HMM emission model, while having the same ancestry at consecutive markers makes the chance of observing an allele at $t$ depend on the allele at marker $t-1$ [20]. Posterior probabilities of the ancestry given observations are estimated by the likelihood approach through the forward–backward algorithm.

In the SWITCH model, the emission probabilities at $t$ depend on those at $t-1$. For instance, given a two-way admixture, the emission probability is defined as:

$$
P(Y_{h,t} = 1 \mid W_{h,t}, X_{h,t}, Y_{h,t-1}, p_t, q_t, p_{t-1,t}, q_{t-1,t}) =
\begin{cases}
P(Y_{h,t} = 1 \mid X_{h,t}, p_t, q_t), & \text{for } W_{h,t} = 1, \\
P(Y_{h,t} = 1 \mid X_{h,t}, Y_{h,t-1}, p_{t-1,t}, q_{t-1,t}), & \text{otherwise,}
\end{cases}
\tag{9}
$$

where $Y_{h,t} \in \{0, 1\}$ is the observed allele at the $t$th marker of haplotype $h$, and $X_{h,t}$ is the hidden state of haplotype $h$ at marker $t$. $p_t$ and $q_t$ are the allele frequencies of Populations 1 and 2 at marker $t$, respectively, $p_{t-1,t}$ and $q_{t-1,t}$ are pairwise SNP frequencies at $t-1$ and $t$, and $W_{h,t}$ is the recombination indicator. Equations (8) and (9) are the same, except that, in Equation (9), recombination is considered, while Equation (8) ignores all recombinations that do not change ancestry [22]. Unlike the MHMM emission probability models of SWITCH and SABER, the emission probability of HAPAA is a $5 \times 5$ stochastic matrix such that

$$P(\bar{a}_t = z \mid X_t = X_{kih}) = E(a_{kiht}, z) \tag{10}$$

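where $z \in \{A, C, G, T, -\}$, and $E(z, z' \neq z)$ captures mutation, genotyping error and haplotypes not observed in representative individuals. Unlike Equation (8), the HAPMIX emission model is a $3 \times 3$ matrix representing the three possible genotypes, i.e. $u = 0, 1$ or $2$, where $u$ is the minor allele count of allele ‘1’. Given the underlying hidden state $(ijklmn)$ and the parental individual $k$ in reference population $j$, $p_{jk}$, denote the probability of observing genotype $u$ as $e_{ijklmn}^{u}(t)$, that is,

$$e_{ijklmn}^{u}(t) = \begin{cases} \left(1 - e_{ijk}^{1}(t)\right)\left(1 - e_{lmn}^{1}(t)\right), & \text{for } u = 0, \\ e_{lmn}^{1}(t)\left(1 - e_{ijk}^{1}(t)\right) + e_{ijk}^{1}(t)\left(1 - e_{lmn}^{1}(t)\right), & \text{for } u = 1, \\ e_{lmn}^{1}(t)\, e_{ijk}^{1}(t), & \text{for } u = 2, \end{cases}$$

where $e_{ijk}^{1}(t)$ is the probability that an admixed chromosome carries the type ‘1’ allele at marker $t$, given by:

$$e_{ijk}^{1}(t) = \begin{cases} \theta_i\, \sigma(\rho_{jk} = 0) + (1 - \theta_i)\, \sigma(\rho_{jk} = 1) & \text{if } i = j \\ \theta_3\, \sigma(\rho_{jk} = 0) + (1 - \theta_3)\, \sigma(\rho_{jk} = 1) & \text{if } i \neq j, \end{cases}$$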

where $\theta_i$ are mutation parameters capturing historical mutation and genotyping error for $i = 1, 2, 3$ and $\sigma$ is the indicator function. In the window-based ancestry deconvolution model, LAMPLD, for a pair of ancestries in a window, $X^w = (X_1^w, X_2^w)$, the emission probability of observing a genotype $Y^w$ in window $w$ is given by:

$$P(Y^w \mid X^w = (X_1^w, X_2^w)) = \sum_{(H_1^w, H_2^w)} P(H_1^w \mid X_1^w)\, P(H_2^w \mid X_2^w), \qquad (11)$$

where $P(H_i^w \mid X_i^w)$ is the probability that haplotype $H_i$ is emitted by ancestry $X_i$ in window $w$, and the sum runs over the haplotype pairs $(H_1^w, H_2^w)$ compatible with the observed genotype $Y^w$.
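The sum over compatible haplotype pairs in Equation (11) can be sketched by brute-force enumeration for a short window. This is an illustration only: the per-ancestry haplotype probability used here is a site-independent product of allele frequencies, which is a simplifying assumption (LAMPLD itself derives $P(H_i^w \mid X_i^w)$ from an HMM over reference haplotypes), and the function names are hypothetical.

```python
from itertools import product


def haplotype_prob(hap, freqs):
    """P(H | X) under a site-independent model (an assumption made for this sketch).

    hap   : tuple of 0/1 alleles in the window
    freqs : frequency of allele 1 at each site in the given ancestry
    """
    p = 1.0
    for allele, f in zip(hap, freqs):
        p *= f if allele == 1 else 1.0 - f
    return p


def window_emission(genotype, freqs_x1, freqs_x2):
    """Sum over ordered haplotype pairs (H1, H2) with H1 + H2 equal to the genotype."""
    total = 0.0
    for h1 in product((0, 1), repeat=len(genotype)):
        # The second haplotype is fully determined by the genotype and h1.
        h2 = tuple(g - a for g, a in zip(genotype, h1))
        if all(b in (0, 1) for b in h2):
            total += haplotype_prob(h1, freqs_x1) * haplotype_prob(h2, freqs_x2)
    return total


# Two-SNP window: genotype counts of allele 1 are (2, 0).
example = window_emission((2, 0), (0.5, 0.2), (0.9, 0.1))
```

Enumeration grows as $2^L$ in the window length $L$, so this form is only practical for very small windows; it is meant to show what "compatible with the observed genotype" means, not how the sum is computed in practice.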

Box 4.

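The SABER switch model is given by $P(X_t = k' \mid X_{t-1} = k, \pi, r) = e^{d_t A}$, where $A$ is a $K \times K$ matrix representing the rate of transition from ancestral population $i$ to ancestral population $j$, obtained as follows:

$$A_{ij} = \begin{cases} \dfrac{\pi_i g_i^2}{\sum_{k=1}^{K} \pi_k g_k} - g_i, & \text{for } i = j, \\[2ex] \dfrac{\pi_j g_i g_j}{\sum_{k=1}^{K} \pi_k g_k}, & \text{otherwise} \end{cases}$$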
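As a sanity check on the rate matrix above, the following Python sketch builds $A$ and confirms that each row sums to zero, as expected of a continuous-time Markov rate matrix. It reads the diagonal entry as $\pi_i g_i^2 / \sum_k \pi_k g_k - g_i$, which is the reading under which the rows sum to zero. The box does not define $g_i$; the sketch treats it as a positive per-population rate parameter, which is an assumption, and the input values are made up.

```python
def saber_rate_matrix(pi, g):
    """Build the K x K rate matrix A from ancestry proportions pi and rates g.

    Off-diagonal: A[i][j] = pi[j] * g[i] * g[j] / sum_k(pi[k] * g[k])
    Diagonal    : the same expression with j = i, minus g[i]
    """
    denom = sum(p * gk for p, gk in zip(pi, g))
    K = len(pi)
    A = [[pi[j] * g[i] * g[j] / denom for j in range(K)] for i in range(K)]
    for i in range(K):
        A[i][i] -= g[i]
    return A


# Hypothetical three-way admixture: proportions sum to one, rates are arbitrary.
A = saber_rate_matrix([0.5, 0.3, 0.2], [2.0, 5.0, 8.0])
row_sums = [sum(row) for row in A]
```

The transition probabilities over a genetic distance $d_t$ would then follow from the matrix exponential $e^{d_t A}$, which is not computed here.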

The switch model in another MHMM-based model, SWITCH [22], depends on the occurrence of a recombination event. Let $K = 2$. If recombination occurs between markers $t$ and $t-1$ of haplotype $h$, define ancestry $X_{h,t} \in \{0, 1\}$ such that ancestry is 0 if it is inherited from the first population, which is selected with probability $\alpha$, and 1 otherwise, with probability $1 - \alpha$. In the absence of recombination, ancestry stays the same at consecutive markers, and thus, the transition model is as follows:

$$P(X_{h,t} \mid X_{h,t-1}, W_{h,t}, \alpha) = \begin{cases} \sigma(X_{h,t}, X_{h,t-1}), & W_{h,t} = 0, \\ (1 - \alpha)^{X_{h,t}}\, \alpha^{1 - X_{h,t}}, & W_{h,t} = 1 \end{cases} \qquad (12)$$

where $\sigma$ is the indicator function, and $W_{h,t}$ indicates a recombination event between markers $t$ and $t-1$. Recombination is a Poisson process, and hence, $W_{h,u}$ and $W_{h,v}$ are independent for $u \neq v$, and the probability of at least one recombination is $1 - e^{-(g-1) d_t r_t}$, where $g$ is the number of generations since admixture. $d_t$ and $r_t$ are the genetic distance between markers $t$ and $t-1$, and the recombination rate in the region, respectively. Marginalizing over $W_{h,t}$, Equation (12) reduces to Equation (2). As HHMMs are two-scale HMMs, there are two switch models in an HHMM, i.e. the switch model at a small scale, which is between haplotypes within a particular reference population, and at a large scale, which consists of transitions between populations. HAPAA's small-scale transitions depend on the phasing error and the probability of recombination. From any putative haplotype, a transition can be made to any one of the following three states: itself, the other putative haplotype or the (non-emitting) end state, $O_k$. If there is no phase error, then, within an individual, there is no transition between putative haplotypes. There are two emitting states, $X_{kih} \in \{X_{ki0}, X_{ki1}\}$, denoting haplotype $h \in \{0, 1\}$ for an individual $i$ of population $k$, and two non-emitting states for each population $k$, $I_k$ and $O_k$, activating the emitting states. Given the hidden state $X_t$ at marker $t$, recombination points follow a Poisson process with rate $d_t r$, where $d_t$ is the genetic distance between markers $t$ and $t-1$, and $r$ is the recombination rate. The transition probability can be expressed as in the following equations:

$P(X_t = X_{kih}) = \frac{1}{2 n_k}$ for the initial state and $P(X_t = a' \mid X_{t-1} = a) = \theta e^{-d_t r} + (1 - e^{-d_t r})\, \omega$, where

$$\theta = \begin{cases} w_{kit} & \text{if } a, a' \in \{X_{ki0}, X_{ki1}\} \text{ and } a' \neq a \\ 1 - w_{kit} & \text{if } a, a' \in \{X_{ki0}, X_{ki1}\} \text{ and } a' = a \end{cases}$$

for $w_{kit}$ the probability of having a phasing error between markers $t$ and $t-1$, for individual $i$ in population $k$, and $\omega = 0$ if both states are emitting states and 1 if at least one of the two consecutive states is the non-emitting state $O_k$. From the end state, $O_k$, a transition is made into the other non-emitting state, $I_{k'}$, resulting in a $K \times K$ matrix denoted $A(k, k') = P(O_k \to I_{k'})$, while the transition from $I_k$ to an emitting state $X_{kih}$ (of either the same or a different population) is uniform with probability $1/(2 n_k)$.

Despite being a two-way admixture ancestry deconvolution model, HAPMIX has many parameters, including the miscopying parameters, $p_l$ for $l = 1, 2$, where $p_l$ is defined as the probability of copying ancestry from population 1 given the actual ancestry was copied from population 2. In addition, switches between individuals within a reference population group occur as a Poisson process with rate $\rho_l$, the recombination rate [5]. As a result, given a haploid admixed chromosome, a hidden state at marker $t$ is represented by $(a, b, c)$, where $a = 1$ or $2$ is the index of the population from which ancestry was drawn, $b = 1$ or $2$ is the index of the population the chromosome copies from at $t$, so $b$ may be different from $a$ [5], and $c$ is the individual from which the chromosomal segment is copied. The probability of transition from state $(a, b, c)$ to state $(d, e, f)$ between consecutive sites $t$ and $t-1$ is denoted by

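$$p_t(a,b,c;d,e,f) = \begin{cases} a \pi_d \dfrac{1 - p_d}{n_e} & \text{if } d \neq a \text{ and } e = d \\[1.5ex] a \pi_d \dfrac{p_d}{n_e} & \text{if } d \neq a \text{ and } e \neq d \\[1.5ex] \bar{a} b \dfrac{1 - p_d}{n_e} + a \pi_d \dfrac{1 - p_d}{n_e} & \text{if } d = a \text{ and } e = d \text{ and } (b \neq e \text{ or } c \neq f) \\[1.5ex] \bar{a}\bar{b} + \bar{a} b \dfrac{1 - p_d}{n_e} + a \pi_d \dfrac{1 - p_d}{n_e} & \text{if } d = a \text{ and } e = d \text{ and } (b = e \text{ and } c = f) \\[1.5ex] \bar{a} b \dfrac{p_d}{n_e} + a \pi_d \dfrac{p_d}{n_e} & \text{if } d = a \text{ and } e \neq d \text{ and } (b \neq e \text{ or } c \neq f) \\[1.5ex] \bar{a}\bar{b} + \bar{a} b \dfrac{p_d}{n_e} + a \pi_d \dfrac{p_d}{n_e} & \text{if } d = a \text{ and } e \neq d \text{ and } (b = e \text{ and } c = f), \end{cases} \qquad (13)$$

where, inside the probabilities, $a = 1 - e^{-r_t g}$, $b = 1 - e^{-r_t \rho_d}$ and $\bar{q} = 1 - q$; $g$ is the number of generations since admixture, and $\pi_k$ and $n_k$ are as defined before. The HAPMIX switch model has $2(n_1 + n_2)$ states.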
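The SWITCH transition in Equation (12) becomes a single expression once the recombination indicator is marginalized out, using the probability of at least one recombination, $1 - e^{-(g-1) d_t r_t}$. The Python sketch below shows this for the two-way case. The function names and the numerical inputs are hypothetical, and the units of $d_t$ and $r_t$ are assumed to be chosen so that their product is an expected number of crossovers per generation.

```python
import math


def recombination_prob(g, d_t, r_t):
    """Probability of at least one recombination between markers t-1 and t."""
    return 1.0 - math.exp(-(g - 1) * d_t * r_t)


def switch_transition(x_t, x_prev, alpha, g, d_t, r_t):
    """P(X_t | X_{t-1}) for two-way admixture after summing over the indicator W.

    No recombination (W = 0): ancestry is copied from the previous marker.
    Recombination (W = 1)   : ancestry is redrawn, 0 with probability alpha.
    """
    p_rec = recombination_prob(g, d_t, r_t)
    stay = 1.0 if x_t == x_prev else 0.0
    redraw = alpha if x_t == 0 else 1.0 - alpha
    return (1.0 - p_rec) * stay + p_rec * redraw


# Hypothetical inputs: 8 generations since admixture, alpha = 0.8.
row = [switch_transition(x, 0, alpha=0.8, g=8, d_t=0.01, r_t=1.0) for x in (0, 1)]
```

Each row of the resulting $2 \times 2$ transition matrix sums to one, and with $d_t = 0$ the transition collapses to the identity, i.e. ancestry cannot change between coincident markers.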

With a large parameter set, SABER's Markov model only accounts for background LD between consecutive markers [11]; hence, some researchers described it as an incomplete model of background LD. As a result, in 2008, SWITCH [36], which considers recombination even if it does not result in an ancestry switch, emerged. SWITCH differs from SABER in that it conditions its MHMM on recombination, whereas SABER conditions it on the ancestry states. The SWITCH model is described in Boxes 3 and 4, and marginalizing this model over recombination yields Equations (1) and (2). The probability of recombination depends on the number of generations since admixture, the genetic distance between consecutive SNPs and the recombination rate. Emission models of SABER and SWITCH are the same, except that SABER ignores recombination that does not result in an ancestry change [36]. If ancestry switches between sites, then the emission model of the MHMM is as in Box 3. For SABER, the same ancestry at consecutive markers implies the emission model depends on the allele frequencies at the two consecutive markers [11], while in SWITCH, the model depends on pairwise SNP allele frequencies between consecutive markers (Box 3) [36]. Although SWITCH models background LD and estimates recombination rates, the authors recommended the development of richer MHMMs or other models different from the pairwise models in SWITCH and SABER [36]. As a result, methods that use both large- and small-scale HMMs, referred to as the HHMM, were proposed. The first model to use the HHMM is a multiway local ancestry deconvolution method, HAPAA [35]. It filters admixed segments by removing those with sizes that do not meet the expected minimum threshold [35] and assumes the phase of admixed individuals is known. 
It accounts for phase switch errors (Box 4) through a variable, θ, and its transition probability model is illustrated in Box 4, with the indicator function in Equation (2) replaced by a function of phasing error between consecutive markers for each given individual in a given population. Unlike Equation (3), the emission model in HHMM depends on base pairs, capturing mutation and genotyping error (Equations 10 and 11, Box 3). In 2009, a genotype imputation-based method modelling admixture and background LD, GEDI-ADMX [37], was introduced. It uses HMMs similar to those in Marchini et al. [45], Kimmel and Shamir [46] and Rastas et al. [47], referred to as FHMMs. This method trains the FHMM on ancestral haplotypes for genotype imputation [37]. In addition to estimating the local ancestry, GEDI-ADMX imputes missing genotypes at typed or untyped SNPs, which is not offered by most of the methods described so far [35]. Furthermore, GEDI-ADMX HMMs are of fixed state space [37] unlike those leading to a quadratic computational time based on the number of ancestral haplotypes in HHMM models.

HAPMIX [4] was proposed in 2009 and uses an HHMM emission model (Box 3). It assumes the admixed individual’s phase information is unknown and uses a built-in phasing algorithm [4] to phase the study samples. HAPMIX is computationally expensive, and its processing time is proportional to the square of the number of ancestral haplotypes. This is mainly because of the number of parameters to be estimated under its ancestry switch model (Box 4). Despite outperforming most existing state-of-the-art methods, HAPMIX handles two-way admixture only. HHMM methods face the following issues: (1) the model complexity increases with the number of ancestral individuals [30], and (2) a haplotype may be locally more similar to haplotypes from a different ancestral population. As a result, HHMMs yield spurious ancestry switches, which explains why HAPMIX allows miscopying from the wrong population. The problem worsens when some candidate reference populations are small in size, are not good proxies or are closely related [30]. These issues opened opportunities for other HMM-based methods, such as the FHMM and the two-layer HMM, in multiway ancestry deconvolution.

CHROMOPAINTER [38] leverages the candidate ancestral haplotypes in a way similar to the clustering approach in STRUCTURE, extracting more relevant genealogical information from the haplotype data, which is then represented in a co-ancestry matrix [22]. The elements of the co-ancestry matrix are estimated using an HMM [38] suggested by Li and Stephens [48]. The co-ancestry matrix measures the genetic similarity between individuals while accounting for the genealogical processes and the structure of the genome. The CHROMOPAINTER model can deconvolve ancestry on linked or unlinked SNPs. Although it is a haplotype-based method, CHROMOPAINTER is inaccurate in estimating local ancestry in populations with sub-continental ancestry and does not incorporate familial relationships, information on admixture or genetic drift. Hence, LAMPLD [30], which uses the Li and Stephens model (HHMM) [48] to leverage the structure of LD in ancestral populations (Box 3), as in HAPAA and HAPMIX, was proposed. Unlike the other HHMMs, by integrating the HHMM within a window-based framework (Box 3), LAMPLD compresses the haplotype information in each ancestry, reducing the computational time and addressing the ancestry switch problem. It is less prone to misspecification of model parameters, as it estimates parameters from the ancestral haplotypes. The only limitation of LAMPLD is that ancestry is assumed constant within each window. Just like the standard HMM in Equations (1) and (2), the ancestry switch model of LAMPLD (Box 5) depends on a constant recombination rate and the length in base pairs between consecutive windows [30]. Refer to Table 1 for the biological/statistical parameters required to run LAMPLD. To account for model correlation between proxy ancestral haplotypes whilst catering for the unavailability of some phased reference panels [30], later in the same year, MULTIMIX [41], which is also a window-based method, was suggested. 
It uses an MVN distribution, whereby model parameters are trained from MCMC, the EM and the classification EM algorithms (Supplementary Text S2). In the same year, a study implemented PCADMIX (Supplementary Text S2), which uses an MVN emission model. Both MULTIMIX (see admixture LD-based methods) and PCADMIX have ancestry switch models as in Equations (1) and (2) above. Despite the close relationship between LD and population structure, existing methods treated these independently. To address this, mSPECTRUM [40] was proposed. It leverages proxy ancestral haplotypes to jointly infer population structure and recombination events using an infinite HMM. Although it is a multiway admixture deconvolution method, it is computationally expensive like HAPMIX [22].

Box 5.

The window-based method LAMPLD estimates the HMM parameters from the reference populations and has a top-level HMM with $\binom{K}{2}$ hidden states, which are the local ancestries on each chromosome within each window, $X^w = (X_1^w, X_2^w)$, that generate genotype blocks, $Y^w$. The probability of transition from state $(X_1^w, X_2^w)$ to state $(X_1^{w'}, X_2^{w'})$ is given by:

$$P\left(X_t = (X_1^{w'}, X_2^{w'}) \mid X_{t-1} = (X_1^w, X_2^w)\right) = \begin{cases} \theta & \text{if } X_i^w \neq X_i^{w'} \text{ for at least one } i \\ \theta^2 & \text{if } X_i^w \neq X_i^{w'}, \ \forall i = 1, 2 \\ 1 - 2\theta - 3\theta^2 & \text{if } X_i^w = X_i^{w'} \end{cases} \qquad (14)$$

where $i = 1, 2$; $\theta = D \times 10^{-8}$, with $D$ the length in base pairs between window $w = [i, i+L]$ and $w' = [i+L, i+2 \times L]$, where $L$ is the number of SNPs in each window. Given the notation as in the FHMM and emission model (Supplementary File: Text S3), and the probability of occurrence of recombination between two consecutive sites, $P(r_t) = 1 - e^{-\phi((g-1) d_t)}$, with complement $P(\bar{r}_t)$, the transition probabilities for ALLOY are given by:

$$P(X_t \mid X_{t-1}) = P_{anc(X_t)}(X_t \mid X_{t-1})\, P(\bar{r}_t) + P(r_t)\, P_{anc(X_t)}(X_t) = P_{anc(X_t)}(X_t \mid X_{t-1})\, v + P_{anc(X_t)}(X_t)\, \bar{v} \qquad (15)$$

where $v = e^{-\phi((g-1) d_t)}$, $\bar{v} = 1 - v$, $P_{anc(X_t)}(X_t)$ is the probability of the ancestral haplotype cluster at $t$ and $P_{anc(X_t)}(X_t \mid X_{t-1})$ models transitions within ancestral population $anc(X_t)$, accounting for LD. $d_t$ and $g$ are as defined above, and $\phi(z)$ is the recombination rate function. The prior probability of the ancestral haplotype cluster depends on the ancestry proportions [33], $\pi$, and the intra-population haplotype cluster probability, $P_{anc(X_t)}(X_t)$, such that

$$P(X_t) = \pi_{anc(X_t)}\, P_{anc(X_t)}(X_t) \qquad (16)$$

In the ALLOY model, given a set of markers from a single ancestry, markers are processed in chromosomal order. Nodes represent some history of the allele sequences and are split according to the alleles for such nodes. Nodes are then merged based on the Markov criterion, such that at $t$, two clusters are merged if the allele sequence probabilities at sites $t+1, t+2, \ldots, t+z$ resemble each other for some parameter $z$. For each population, a directed acyclic graph is formed with edges $e_t^i$ for cluster $i$, weighted by the number of haplotypes in the population sample passing through cluster $i$ at $t$, denoted by $w_t^i$. Let $s_t^i$ and $T_t^i$ be the source node and target node, respectively, of the edge $e_t^i$ at $t$ for cluster $i$. Hence, the prior and transition probabilities in Equations (15) and (16) are calculated as follows:

$$P_{anc}(X_t = a_t^{anc,i}) = \frac{w_t^i}{\sum_j w_t^j}, \quad \text{and} \quad P_{anc}(X_t = a_t^{anc,i} \mid X_{t-1} = a_{t-1}^{anc,j}) = \begin{cases} \dfrac{w_t^i}{w_{t-1}^j}, & \text{if } T_{t-1}^j = s_t^i \\ 0, & \text{otherwise} \end{cases} \qquad (17)$$

The specific $P_{anc(a_t)}(X_t)$ and $P_{anc(a_t)}(X_t = a_t \mid X_{t-1} = a_{t-1})$ are obtained by repeating the above separately for each ancestry. The notation in what follows is the one described for the two-layer HMM (Supplementary File: Text S4). For the main HMM of ELAI, the switching model is given by:

$$P(X_1 = k, Z_1 = s) = P(Z_1 = s \mid X_1 = k)\, P(X_1 = k) = \pi_k^{(i)} \beta_{1ks}$$

at the first marker, and the transition from marker $t-1$ to $t$ is modelled as follows:

$$P(X_t = k', Z_t = s' \mid X_{t-1} = k, Z_{t-1} = s) = j_t\, \pi_{k'}^{(i)} \beta_{tk's'} + (1 - j_t)\, r_t\, \beta_{tks'}\, \sigma(k = k') + (1 - j_t)(1 - r_t)\, \sigma(k = k')\, \sigma(s = s') \qquad (18)$$

where $\pi_k^{(i)}$ is the probability that individual $i$ jumps into the upper cluster $k$ given that a jump occurs, $\beta_{tks}$ is the probability that individual $i$ jumps to lower cluster $s$ given that a jump occurs with the upper cluster being $k$, $\beta$ is a $T \times K \times S$ tensor shared by individuals [36] and $\sigma$ is an indicator function. More details on the assumptions made in Equation (18) are described in [28]. The other HMM layer in the ELAI model, referred to as the ancillary HMM, is independent of the main HMM given that $m$ ancestral haplotypes have been observed. In the ancillary HMM, the hidden state $X$ is replaced by $W$, which represents the upper cluster, such that $W_t^{(m)} \in \{1, \ldots, K\}$. The transition of hidden states is given by:

$$P(W_1^{(m)} = k) = a_k^{(m)}, \quad \text{and} \quad P(W_t^{(m)} = k' \mid W_{t-1}^{(m)} = k) = \rho_t\, a_{k'}^{(m)} + (1 - \rho_t)\, \sigma(k = k'), \qquad (19)$$

where there is no association between the jump probabilities $\rho$ in the ancillary HMM and the jump probabilities in the main HMM, and $a_k^{(m)}$ is the probability that the upper cluster is $k$.
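The three terms of the main-HMM transition in Equation (18) can be checked numerically. The Python sketch below reads the equation as a transition from $(k, s)$ to $(k', s')$: an upper-cluster jump with probability $j_t$, a lower-cluster-only jump with probability $(1 - j_t) r_t$, and no jump otherwise. The function name and all parameter values are hypothetical, and each row of $\beta_t$ is assumed to sum to one over the lower clusters.

```python
def elai_main_transition(k_new, s_new, k_old, s_old, j_t, r_t, pi_i, beta_t):
    """Transition probability from (k_old, s_old) to (k_new, s_new) at one marker.

    pi_i[k]      : probability that individual i jumps into upper cluster k
    beta_t[k][s] : probability of lower cluster s given upper cluster k at marker t
    """
    same_upper = 1.0 if k_new == k_old else 0.0
    same_lower = 1.0 if s_new == s_old else 0.0
    upper_jump = j_t * pi_i[k_new] * beta_t[k_new][s_new]
    lower_jump = (1.0 - j_t) * r_t * beta_t[k_old][s_new] * same_upper
    no_jump = (1.0 - j_t) * (1.0 - r_t) * same_upper * same_lower
    return upper_jump + lower_jump + no_jump


# Hypothetical setting: K = 2 upper clusters, S = 3 lower clusters.
pi_i = [0.6, 0.4]
beta_t = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
row_total = sum(
    elai_main_transition(k, s, 0, 1, 0.05, 0.2, pi_i, beta_t)
    for k in range(2)
    for s in range(3)
)
```

Summing over all destination states gives $j_t + (1 - j_t) r_t + (1 - j_t)(1 - r_t) = 1$, so the three terms form a proper transition distribution under the stated assumptions.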

In the race towards accurately modelling the LD structure to improve local ancestry deconvolution, in 2013, ALLOY [16] was developed. It accounts for LD through a non-homogeneous variable length Markov chain (VLMC). ALLOY captures the way admixed maternal and paternal haplotypes can be inherited using an FHMM (Supplementary Text S3) [16]. Its accuracy improved because it leveraged the haplotype structure in the compound state. The emission and ancestry switch models are described in Box 5. In 2014, ELAI [44], a two-layer HMM that models LD at two scales, i.e. one between ancestral populations and the other within ancestral populations between haplotype groups, was proposed for multiway local ancestry deconvolution. The model is divided into the main HMM and an ancillary HMM (Supplementary Text S4). ELAI was motivated by the need to infer recombination rates using local ancestry deconvolution. It uses either phased or unphased data, limiting phase uncertainty problems. As a result, ELAI can still perform well when high-quality haplotype information on the ancestral populations is unavailable [44]. Given an upper cluster of one, ELAI extends fastPhase, a phasing method. If the upper and lower clusters are equal, then it yields independent markers, while having a lower cluster descending from an upper cluster reduces ELAI to STRUCTURE [44].

Non-LD-based methods

Methods discussed in this section do not model either background or admixture LD. Some of the methods prune or remove SNPs in LD (LAMP [17] and WINPOP [18]), while others do not model LD but include both linked and unlinked SNPs in local ancestry deconvolution. Owing to the complexity of the MHMM, a window-based method, LAMP [17], emerged in 2008. It is fast and robust and can infer local ancestry in the absence of proxy ancestral genotypes. It uses the naive Bayes classifier and iterated conditional modes, a clustering algorithm, to estimate the most probable ancestry at a particular chromosomal site by applying a majority vote for each SNP [17]. It avoids the issues arising from the use of HMMs and their extensions, but at the cost of neglecting admixture and background LD information.
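The per-SNP majority vote mentioned above can be sketched as follows in Python. This is an illustration, not LAMP's implementation: the ancestry call inside each window is taken as given input, the window coordinates and labels are hypothetical, and overlapping windows are assumed so that each SNP can receive several votes.

```python
from collections import Counter


def majority_vote(n_snps, window_calls):
    """Assign each SNP the ancestry label called most often by the windows covering it.

    window_calls : list of (start, end, ancestry) with end exclusive.
    SNPs covered by no window get None; ties go to the label seen first.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, ancestry in window_calls:
        for snp in range(start, min(end, n_snps)):
            votes[snp][ancestry] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Four overlapping windows over six SNPs, with hypothetical ancestry calls.
calls = [(0, 3, "A"), (0, 4, "A"), (1, 4, "A"), (3, 6, "B")]
per_snp = majority_vote(6, calls)
```

In this toy example the fourth SNP is covered by two "A" windows and one "B" window, so the vote keeps it in ancestry "A" and places the break point one SNP later.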

WINPOP [18], a dynamic programming algorithm, was proposed to deconvolve ancestry in closely related populations and reduce the computational cost of HMM-based methods. It extended LAMP by assuming at least one recombination event in each window and varying the window length depending on how similar the ancestral populations are. Similar to LAMP, WINPOP assumes unlinked markers and removes SNPs in LD. WINPOP outperforms other methods in closely related populations (Table 3) [18], and LAMP outperformed them in distant ancestral populations (Table 3). As sequence data availability increased, in 2013, Maples et al. [42] proposed a discriminative approach, RFMIX, to estimate local ancestry. RFMIX uses the information contained in admixed samples, while generative ancestry deconvolution methods use the information in ancestral samples. Using the information contained in admixed samples is advantageous when reference populations are scarce, e.g. the Native Americans [42]. RFMIX uses CRFs parametrized with random forests to estimate ancestry. First-order CRFs are shown in Supplementary Text S5. This method seemed to improve and stabilize the local ancestry estimates in various admixture models, in particular multiway admixture scenarios (Table 3). This could be because it models phase switch errors, just like HAPMIX.

Table 3.

Details of comparing or evaluating local ancestry deconvolution methods

| Aim of study | Methods used to deconvolve ancestry | Rank as per findings | Reference paper |
| --- | --- | --- | --- |
| Introducing a local ancestry deconvolution method: WINPOP | SABER, HAPAA, LAMP and WINPOP | 1. WINPOP; 2. LAMP; 3. HAPAA; 4. SABER | Pasaniuc et al. [18] |
| Introducing a local ancestry deconvolution method: LAMPLD | WINPOP, GEDI-ADMX, HAPMIX and LAMPLD | 1. LAMPLD; 2. WINPOP; 3. GEDI-ADMX; 4. HAPMIX | Baran et al. [30] |
| Introducing a local ancestry deconvolution method: SUPPORTMIX | WINPOP and SUPPORTMIX | 1. SUPPORTMIX; 2. WINPOP | Omberg et al. [33] |
| Introducing a local ancestry deconvolution method: PCADMIX | LAMP, HAPMIX and PCADMIX | 1. PCADMIX; 2. LAMP; 3. HAPMIX | Brisbin et al. [39] |
| Introducing a local ancestry deconvolution method: mSPECTRUM | LAMP, HAPMIX and mSPECTRUM | 1. mSPECTRUM; 2. LAMP; 3. HAPMIX | Sohn et al. [40] |
| Introducing a local ancestry deconvolution method: ALLOY | WINPOP and ALLOY | 1. ALLOY; 2. WINPOP | Rodriguez et al. [16] |
| Introducing a local ancestry deconvolution method: RFMIX | LAMPLD, SUPPORTMIX and RFMIX | 1. RFMIX; 2. LAMPLD; 3. SUPPORTMIX | Maples et al. [42] |
| Introducing a local ancestry deconvolution method: EILA | HAPMIX, LAMP and EILA | 1. EILA; 2. LAMP; 3. HAPMIX | Yang et al. [43] |
| Association studies | LAMPLD and WINPOP | 1. LAMPLD; 2. WINPOP | Chimusa et al. [51] |
| Admixture mapping | LAMP, LAMPLD and MULTIMIX | 1. LAMPLD; 2. MULTIMIX; 3. LAMP | Chen et al. [50] |
| Introducing a local ancestry deconvolution method: ELAI | HAPMIX, LAMPLD and ELAI | 1. ELAI; 2. LAMPLD; 3. HAPMIX | Guan [44] |
| Introducing a local ancestry deconvolution method: LOTER | LAMPLD, RFMIX and LOTER | 1. LOTER; 2. RFMIX; 3. LAMPLD | Dias-Alves et al. [19] |

Note: Studies considered multiway admixtures. Ranking is for distant ancestral populations. Though all methods underperformed in closely related populations, HAPMIX was better than the other two. Ranking based on three-way ancient admixtures.

To address some of the challenges in local ancestry deconvolution that reduce the inference power, including the use of only three genotype values, the identification of break points, which are useful in estimating generations since admixture, and the assumption of independent SNPs, Yang et al. [43] proposed in 2013 a multivariate statistics-based method, EILA (Supplementary Text S6). It uses quantile regression and k-means classifiers to infer ancestry from genotype data. To overcome the limited power from using three genotype values, EILA assigns a numerical score, $ns$, to admixed genotypes. This numerical score is a value between 0 and 1 and measures the similarity of SNPs to certain ancestral populations [43]. Fused quantile regression is then used to identify break points. After break point identification, k-means classifiers are used to estimate local ancestry using all SNPs (linked and unlinked).

Recently, a software package, LOTER [19], which deconvolves local ancestry in multiway admixtures for a wide range of species, was proposed, but the LOTER model can currently model phase errors (Table 1) in two-way admixture only. Despite the availability of dense genotype and sequence data on a wide range of species [19], deconvolving local ancestry in non-model species is a challenge. Existing methods require biological and statistical parameters to deconvolve ancestry, while LOTER requires none of these parameters. Biological parameters include admixture time, recombination rates, mutation rates, physical map, etc., while statistical parameters include the number of hidden states for the HMM and window size [19]. LOTER is a non-probabilistic approach formulated as an optimization problem (Supplementary Text S7), using Li and Stephens' copying model [48], where the optimal solution is obtained by dynamic programming. In the LOTER model, an optimization problem similar to that of EILA is solved. Table 1 provides a summary of different ancestry deconvolution tools. From a number of previous reviews [20–22, 49], we observed that some non-LD-based methods outperform LD-based methods, which could be because of the high number of parameters to be estimated in some of the LD-based approaches, given the complexity of modelling the LD structure. Another reason could be the ability of non-LD-based methods to use the information contained in the haplotypes and also model phase switch errors, for example RFMIX and LOTER.
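To make the idea of a copying model solved by dynamic programming concrete, the Python sketch below finds, for one admixed haplotype, the sequence of reference haplotypes to copy from that minimizes the number of allele mismatches plus a fixed penalty per switch of reference haplotype. This objective is an assumption made for illustration (the exact formulation is in Supplementary Text S7 and is not reproduced here), and the function names, the penalty value and the toy data are hypothetical.

```python
def copy_path(target, references, switch_penalty):
    """Return the index of the reference haplotype copied at each site.

    Minimizes (number of mismatches) + switch_penalty * (number of switches)
    with a Viterbi-style dynamic program.
    """
    n_ref, n_sites = len(references), len(target)
    cost = [int(references[r][0] != target[0]) for r in range(n_ref)]
    back = []
    for j in range(1, n_sites):
        best_prev = min(range(n_ref), key=cost.__getitem__)
        new_cost, pointers = [], []
        for r in range(n_ref):
            stay, jump = cost[r], cost[best_prev] + switch_penalty
            prev, base = (r, stay) if stay <= jump else (best_prev, jump)
            new_cost.append(base + int(references[r][j] != target[j]))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace the optimal path back from the cheapest final state.
    state = min(range(n_ref), key=cost.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def local_ancestry(target, references, ref_labels, switch_penalty=1.5):
    """Label each site with the population of the reference haplotype copied there."""
    return [ref_labels[r] for r in copy_path(target, references, switch_penalty)]


# Toy data: one reference haplotype per population.
refs = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
calls = local_ancestry([0, 0, 0, 1, 1, 1], refs, ["A", "B"])
```

With this penalty, a single mismatching site is absorbed as a mismatch rather than explained by two switches, whereas a run of three mismatches is cheaper to explain with one switch of ancestry.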

Challenges, evaluation and opportunities

Despite the efforts that have been invested in local ancestry inference or local ancestry deconvolution, several challenges still exist. Previously, we discussed the importance of local ancestry and the applications of the estimates from such studies. However, the success of these studies depends heavily on the accuracy of the estimates. Studies in Tables 2 and 3 showed that deviations in local ancestry estimates still exist in multiway admixtures. As an illustration, Chen et al. [50] used three state-of-the-art methods (refer to Table 3) to deconvolve local ancestry. The best two methods (Table 3), both LD-based models, differed in their ancestry estimates on almost 20% of the analysed SNPs. This could be because of the way these two approaches account for the biological or statistical parameters within their models. For example, although both methods require a window size to approximate ancestry estimates, LAMPLD performed well across all the window sizes used (150, 100, 75 and 50 SNPs), while MULTIMIX performance improved as the window size decreased [50]. In addition, Chimusa et al. [51], using a different population with two state-of-the-art methods, also highlighted that admixture mapping is limited by inaccuracies in local ancestry inference. However, the two methods used do not fall under the same category, and in this case, LAMPLD, which uses the haplotypic information in ancestral populations and models LD, outperformed the non-LD-based method that uses allele frequencies of ancestral populations to infer ancestry.

Deviations in local ancestry may have several causes, such as signals of selection, risk of disease, miscalling of true ancestry and genotyping error [24, 50]. As a result, inaccuracies in local ancestry are one of the main challenges in studying genetic ancestry. There are several possible causes of inaccuracies in local ancestry deconvolution: (a) most existing methods (1) require biological or statistical parameters to infer ancestry, which are not always accurate even when provided, (2) assume information on reference ancestral populations is known and (3) assume markers are unlinked; (b) few individuals have been genotyped; (c) existing methods generally underperform in ancient admixtures; (d) genotypes only take three values, which might reduce the power to infer ancestry; (e) existing methods are benchmarked or tested for up to three-way admixture; (f) HMM-based methods have a large parameter set and (g) methods do not model natural selection when estimating ancestry. Some of these challenges have been successfully addressed, while others are still pending. To account for (e), a framework that deconvolves local ancestry using existing tools, adjusted to incorporate more than three ancestral populations where possible, should be implemented. This framework should also facilitate the analysis of all the autosomal chromosomes. Challenges (a(1)) and (c) have recently been addressed by LOTER, as it performs well in admixed populations that mixed >150 generations ago [19]. The SWITCH model addressed (f) by using LAMP to initialize the HMM parameters. Challenge (g) is yet to be addressed. SWITCH [36] also addressed the absence of one of the ancestral reference panels, as it estimates the allele frequencies of the missing reference from the genotype data. 
On the other hand, RFMIX [42] uses the information in admixed samples themselves to address (a(2)) and account for (b), but the user is required to provide the ancestral populations.

Admixture scenarios are diverse, and the local ancestry deconvolution tools are proposed for different applications, under different model assumptions. Therefore, local ancestry inference methods are not easy to compare, especially if they do not fall under the same category. As a result, following Pasaniuc et al. [25], we compare methods that fall within the same category, that is, LD-based or non-LD-based methods. Tables 2 and 3 are used to compare existing methods using existing studies. To assess whether the local ancestry problem is solved in multiway admixtures, Chen et al. [50] used two LD-based, multiway ancestry deconvolution methods (LAMPLD and MULTIMIX), which both divide the admixed genome into windows before ancestry estimation. LAMPLD outperformed MULTIMIX. On introducing RFMIX, Maples et al. [42] assessed LAMPLD and SUPPORTMIX and found that LAMPLD again performed better. Although LAMPLD outperformed ALLOY and PCADMIX in the Mexican population in 2014 (Table 2) [25], it was outperformed by ELAI (Table 3) [44]. However, as Guan [44] is the only study that compared ELAI with other local ancestry inference methods, more studies should be done to confirm whether ELAI is the best local ancestry deconvolution method that models LD. Given distant ancestral populations, EILA outperforms LAMP (Table 3), while because of the use of the haplotype information, LOTER and RFMIX outperform WINPOP, which uses ancestral allele frequencies. LOTER outperformed these in both distant ancestral populations and recent and ancient admixtures.

Conclusions and perspectives

Over the past decade, studies of admixed populations have increasingly gained interest and today >20 ancestry deconvolution methods have been introduced. It was expected that critical applications out of these ancestry deconvolution approaches would not only shed light on the demographics and admixture structure of various admixed populations but also improve our understanding of signatures of natural selection, ethnic difference in drug response and the risk of complex diseases. Although a few applications of local ancestry deconvolution, mostly in admixed populations from the American continent, have emerged, unusual ancestry deviation and switch error in estimating local ancestry estimates in multiway admixed populations within a complex or multifaceted admixture model are still a challenge. Here, we have dissected approaches for pinpointing ancestry along the genome of admixed populations and summarized evolutionary factors they account for. We discussed different statistical model properties, assumptions and results from existing evaluations. We believe that this survey may orient the development of new methods and enable researchers to choose appropriate tools for given population characteristics depending on where the local ancestry estimates should be applied. Furthermore, this survey will enable software developers to identify current challenges and advances in ancestry deconvolution, providing them with useful information for the implementation of practical tools, which consider current medical and population genetics demands.

Although recent development of the LOTER method [19] provides more insights into previous challenges regarding ancestry deconvolution in ancient admixtures and a complete reliance on biological parameters, such as LAMP [17] and WINPOP [18], the computational accuracy is still a challenge because of the use of optimization problems. On the other hand, methods that can model biological parameters, such as mutations and recombination, and allow some miscopying as in HAPMIX, CHROMOPAINTER, etc., may suffer from ancestry deviation and computational complexity. Thus, taking into account current challenges, we recommend SWITCH or ELAI [44] to researchers interested in estimating recombination events. If the aim is to understand admixture patterns without knowledge of some of the ancestral populations of the study samples, while having access to a pool of reference population panels, it would be more effective to use SUPPORMIX [33]. In addition, for ancient or recent admixture with large sample sizes of reference panels (large training sets), ALLOY [16] may be more relevant, as the large sample sizes of reference panels improves the LD model used in ALLOY [16]. In cases where the users suspect imperfect reconstruction in the haplotype, LAMPLD [30], RFMIX [42] and LOTER [19] are instrumental, and HAPMIX should be avoided in such scenarios, where possible, when deconvolving local ancestry users should avoid using default parameters, e.g, admixture generations when using RFMIX. This is because the default generations are too recent. In the case of low coverage sequence data sets, such as exome or targeted data for history analysis or genetic association studies, it may be more effective to use SEQMIX [6]. In summary, current implemented local ancestry deconvolution models are not fully accurate in complex admixture modelling and do not fully enable admixture mapping in multifaceted admixed populations. 
There is still a need to design a simulation framework that covers different admixture scenarios and assesses all existing local ancestry deconvolution models. This may enable the development of a novel and/or integrative local ancestry inference framework that captures appropriate biological parameters given a population admixture model, is flexible enough to handle the presence or absence of proxy ancestral data and allows end users to select the most appropriate models.
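The tool recommendations above can be condensed into a small lookup, shown here as a minimal Python sketch. The scenario labels and the `recommend` function are hypothetical names introduced only for illustration. The tool names follow the text, and the mapping is a reading aid, not a substitute for evaluating each method on the data at hand.

```python
# Hypothetical scenario labels mapped to the tools recommended in the text.
RECOMMENDED = {
    "estimate_recombination_events": ["SWITCH", "ELAI"],
    "unknown_ancestors_with_reference_pool": ["SUPPORTMIX"],
    "large_reference_panels": ["ALLOY"],
    "imperfect_haplotype_reconstruction": ["LAMPLD", "RFMIX", "LOTER"],
    "low_coverage_sequence_data": ["SEQMIX"],
}

# Tools the text advises against for a given scenario.
DISCOURAGED = {
    "imperfect_haplotype_reconstruction": ["HAPMIX"],
}


def recommend(scenario):
    """Return the tools to use and to avoid for a scenario label."""
    if scenario not in RECOMMENDED:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}")
    return {
        "use": list(RECOMMENDED[scenario]),
        "avoid": list(DISCOURAGED.get(scenario, [])),
    }


if __name__ == "__main__":
    for label in RECOMMENDED:
        print(label, recommend(label))
```

Whatever tool is selected, the caution about default parameters still applies: settings such as the number of admixture generations should be chosen for the study population and not left at their defaults.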

Key Points

  • Dissecting existing ancestry deconvolution methods and underlying tools.

  • Comprehensive survey and consistent classification of existing local ancestry inference approaches.

  • Discussion of issues related to application of existing ancestry deconvolution methods.

  • Highlighting possible gaps that still need to be filled in the area of local ancestry deconvolution.

  • Orienting local ancestry deconvolution tool end users based on the evolution of these tools.

Supplementary Material

Supp_bby044

Acknowledgements

The authors thank researchers who have contributed towards advancing the local ancestry inference around the world. Also, the authors thank those who have helped in the preparation of this manuscript.

Funding

Some of the authors are supported by the Organization for Women in Science for the Developing World (OWSD) and Swedish International Development Cooperation Agency (Sida), and the German Academic Exchange Service (DAAD), and others are supported in part by the National Institutes of Health (NIH) Common Fund (grant numbers U24HG006941 and U01HG009716), Wellcome Trust/AESA Ref: H3A/18/001 and 1U54HG009790-01 through H3ABioNet project, HI Genes Africa and IFGeneRA, respectively.

Ephifania Geza, Masters degree in Mathematical Sciences. She is currently a PhD candidate in Bioinformatics at the Computational Biology Division at University of Cape Town in collaboration with the African Institute for Mathematical Sciences (AIMS).

Jacquiline Mugo, Masters degree in Mathematical Sciences. She is currently a PhD candidate in Bioinformatics at the Computational Biology Division at University of Cape Town.

Nicola J. Mulder, PhD in Medical Microbiology. She is a Professor and Head of the Computational Biology Division at University of Cape Town.

Ambroise Wonkam, MD, DMedSc, PhD degree. He is a Professor at the Division of Human Genetics, Department of Pathology and senior specialist medical geneticist, HPCSA MP0686980, University of Cape Town.

Emile R. Chimusa, PhD in Bioinformatics. He is currently a Senior Lecturer at the Division of Human Genetics, Department of Pathology, University of Cape Town.

Gaston K. Mazandu, PhD in Bioinformatics. He is an honorary senior member of the Computational Biology Division at UCT and an associate researcher at AIMS, and holds a senior lecturer position at the Division of Human Genetics, Department of Pathology, University of Cape Town.

References

  • 1. Cavalli-Sforza LL, Feldman MW. The application of molecular genetic approaches to the study of human evolution. Nat Genet 2003;33:266–75.
  • 2. Yang JJ, Cheng C, Devidas M, et al. Ancestry and pharmacogenomics of relapse in acute lymphoblastic leukemia. Nat Genet 2011;43(3):237–41.
  • 3. Koehl A. Estimating ancestry and genetic diversity in admixed populations. The University of New Mexico, 2016.
  • 4. Price AL, Tandon A, Patterson N, et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet 2009;5(6):e1000519.
  • 5. Gravel S. Population genetics models of local ancestry. Genetics 2012;191(2):607–19.
  • 6. Hu Y, Willer C, Zhan X, et al. Accurate local-ancestry inference in exome-sequenced admixed individuals via off-target sequence reads. Am J Hum Genet 2013;93(5):891–9.
  • 7. Ma Y, Zhao J, Wong JS, et al. Accurate inference of local phased ancestry of modern admixed populations. Sci Rep 2015;4(1):5800.
  • 8. Durand EY, Do CB, Mountain JL, et al. Ancestry composition: a novel, efficient pipeline for ancestry deconvolution. BioRxiv 2014. https://www.biorxiv.org/content/early/2014/10/18/010512.
  • 9. Khayatzadeh N, Mészáros G, Gredler B, et al. Prediction of global and local simmental and red holstein friesian admixture levels in Swiss fleckvieh cattle. Poljoprivreda 2015;21(Suppl 1):63–7.
  • 10. Shriner D. Overview of admixture mapping. Curr Protoc Hum Genet 2013;Chapter 1:Unit 1.23.
  • 11. Seldin MF, Pasaniuc B, Price AL. New approaches to disease mapping in admixed populations. Nat Rev Genet 2011;12(8):523–8.
  • 12. Reich D, Patterson N, De Jager PL, et al. A whole-genome admixture scan finds a candidate locus for multiple sclerosis susceptibility. Nat Genet 2005;37(10):1113–18.
  • 13. Zhu X, Cooper RS, Elston RC. Linkage analysis of a complex disease through use of admixed populations. Am J Hum Genet 2004;74(6):1136–53.
  • 14. Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 2003;164(4):1567–87.
  • 15. Tang H, Coram M, Wang P, et al. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet 2006;79(1):1–12.
  • 16. Rodriguez JM, Bercovici S, Elmore M, et al. Ancestry inference in complex admixtures via variable-length Markov chain linkage models. J Comput Biol 2013;20(3):199–211.
  • 17. Sankararaman S, Sridhar S, Kimmel G, et al. Estimating local ancestry in admixed populations. Am J Hum Genet 2008;82(2):290–303.
  • 18. Pasaniuc B, Sankararaman S, Kimmel G, et al. Inference of locus-specific ancestry in closely related populations. Bioinformatics 2009;25(12):i213–21.
  • 19. Dias-Alves T, Mairal J, Blum MG. Loter: a software package to infer local ancestry for a wide range of species. BioRxiv 2017. https://www.biorxiv.org/content/early/2018/04/26/213728.
  • 20. Liu Y, Nyunoya T, Leng S, et al. Softwares and methods for estimating genetic ancestry in human populations. Hum Genomics 2013;7:1.
  • 21. Padhukasahasram B. Inferring ancestry from population genomic data and its applications. Front Genet 2014;5:204.
  • 22. Gompert Z, Buerkle CA. Analyses of genetic ancestry enable key insights for molecular ecology. Mol Ecol 2013;22(21):5278–94.
  • 23. Martin AR, Gignoux CR, Walters RK, et al. Human demographic history impacts genetic risk prediction across diverse populations. Am J Hum Genet 2017;100(4):635–49.
  • 24. Chimusa ER, Daya M, Moller MR, et al. Determining ancestry proportions in complex admixture scenarios in South Africa using a novel proxy ancestry selection method. PLoS One 2013;8(9):e73971.
  • 25. Pasaniuc B, Sankararaman S, Torgerson DG, et al. Analysis of Latino populations from GALA and MEC studies reveals genomic loci with biased local ancestry estimation. Bioinformatics 2013;29(11):1407–15.
  • 26. Duconge J, Ruano G. The emerging role of admixture in the pharmacogenetics of Puerto Rican Hispanics. J Pharmacogenomics Pharmacoproteomics 2010;1:101. doi: 10.4172/2153-0645.1000101.
  • 27. Goetz LH, Uribe-Bruce L, Quarless D, et al. Admixture and clinical phenotypic variation. Hum Hered 2014;77(1–4):73–86.
  • 28. Aschard H, Gusev A, Brown R, et al. Leveraging local ancestry to detect gene-gene interactions in genome-wide data. BMC Genet 2015;16:124.
  • 29. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989;77(2):257–86.
  • 30. Baran Y, Pasaniuc B, Sankararaman S, et al. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 2012;28(10):1359–67.
  • 31. Patterson N, Hattangadi N, Lane B, et al. Methods for high-density admixture mapping of diseases genes. Am J Hum Genet 2004;74(5):979–1000.
  • 32. Hoggart CJ, Shriver MD, Kittles RA, et al. Design and analysis of admixture mapping studies. Am J Hum Genet 2004;74(5):965–78.
  • 33. Omberg L, Salit J, Hackett N, et al. Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations. BMC Genet 2012;13(1):49.
  • 34. Eilers PH, De Menezes RX. Quantile smoothing of array CGH data. Bioinformatics 2005;21(7):1146–53.
  • 35. Sundquist A, Fratkin E, Do CB, et al. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Res 2008;18(4):676–82.
  • 36. Sankararaman S, Kimmel G, Halperin E, et al. On the inference of ancestries in admixed populations. Genome Res 2008;18(4):668–75.
  • 37. Paşaniuc B, Kennedy J, Mandoiu I. Imputation-based local ancestry inference in admixed populations. In: Măndoiu I, Narasimhan G, Zhang Y (eds). Bioinformatics Research and Applications (ISBRA 2009), Lecture Notes in Computer Science, Vol. 5542. Springer, Berlin, Heidelberg, 2009, 221–33.
  • 38. Lawson DJ, Hellenthal G, Myers S, et al. Inference of population structure using dense haplotype data. PLoS Genet 2012;8(1):e1002453.
  • 39. Brisbin A, Bryc K, Byrnes J, et al. PCAdmix: principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations. Hum Biol 2012;84(4):343–64.
  • 40. Sohn KA, Ghahramani Z, Xing EP. Robust estimation of local genetic ancestry in admixed populations using a nonparametric Bayesian approach. Genetics 2012;191(4):1295–308.
  • 41. Churchhouse C, Marchini J. Multi way admixture deconvolution using phased or unphased ancestral panels. Genet Epidemiol 2013;37(1):1–12.
  • 42. Maples BK, Gravel S, Kenny EE, et al. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet 2013;93(2):278–88.
  • 43. Yang JJ, Li J, Buu A, et al. Efficient inference of local ancestry. Bioinformatics 2013;29(21):2750–6.
  • 44. Guan Y. Detecting structure of haplotypes and local ancestry. Genetics 2014;196(3):625–42.
  • 45. Marchini J, Howie B, Myers S, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 2007;39(7):906–13.
  • 46. Kimmel G, Shamir R. A block-free hidden Markov model for genotypes and its application to disease association. J Comput Biol 2005;12(10):1243–60.
  • 47. Rastas P, Koivisto M, Mannila H, et al. Phasing genotypes using a hidden Markov model. In: Bioinformatics Algorithms: Techniques and Applications. 2005, 355–73.
  • 48. Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 2003;165(4):2213–33.
  • 49. Tang H, Choudhry S, Mei R, et al. Recent genetic selection in the ancestral admixture of Puerto Ricans. Am J Hum Genet 2007;81(3):626–33.
  • 50. Chen M, Yang C, Li C, et al. Admixture mapping analysis in the context of GWAS with GAW18 data. BMC Proc 2014;8(Suppl 1):S3.
  • 51. Chimusa ER, Zaitlen N, Daya M, et al. Genome-wide association study of ancestry-specific TB risk in the South African coloured population. Hum Mol Genet 2014;23(3):796–809.
  • 52. Florez JC, Price AL, Campbell D, et al. Strong association of socioeconomic status with genetic ancestry in Latinos: implications for admixture studies of type 2 diabetes. Diabetologia 2009;52(8):1528–36.
  • 53. Jin W, Shuhua X, Haifeng W, et al. Genome-wide detection of natural selection in African Americans pre- and post-admixture. Genome Res 2012;22(3):519–27.
  • 54. Deng L, Ruiz-Linares A, Xu S, et al. Ancestry variation and footprints of natural selection along the genome in Latin American populations. Sci Rep 2016;6(1):21766.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press