Abstract
Local ancestry inference is an indispensable component of a variety of analyses in medical and population genetics, from admixture mapping to characterizing demographic history. However, the accuracy of local ancestry inference depends on a number of factors such as phase quality (for phase-based local ancestry inference methods) and time since admixture. Here, we present an empirical analysis of four local ancestry inference methods using simulated individuals of mixed African and European ancestry, examining the impact of variable phase quality and a range of demographic scenarios. We find that regardless of phasing options, calls from local ancestry inference methods that operate on unphased genotypes (phase-free local ancestry inference) have 2.6–4.6% higher Pearson correlation with the ground truth than methods that operate on phased genotypes (phase-based local ancestry inference). Applying the TRACTOR phase correction algorithm led to modest improvements in phase-based local ancestry inference, but despite this, the Pearson correlation of phase-free local ancestry inference remains 2.4–3.8% higher than phase-corrected phase-based approaches (considering the best-performing methods in each category). Further, analyzing perfectly phased data yields accuracies for the phase-based local ancestry inference methods that are only slightly inferior to those of HAPMIX. Phase-free and phase-based local ancestry inference accuracy differences can dramatically impact downstream analyses: estimates of the time since admixture using phase-based local ancestry inference tracts are upwardly biased by ≈10 generations using our highest quality statistically phased data but have virtually no bias using phase-free local ancestry inference calls. This study underscores the strong dependence of phase-based local ancestry inference accuracy on phase quality and highlights the merits of local ancestry inference approaches that analyze unphased genetic data.
Keywords: local ancestry inference, admixture, phase, haplotypes, ancestry
Inferring the source population of genetic markers in an individual, or local ancestry inference (LAI), is important for numerous applications in genetics. However, few independent analyses of existing LAI methods’ performance have been done, and while the impact of phase uncertainty on LAI is broadly appreciated, this relationship has not been examined in detail. Avadhanam and Williams’ work aims to bridge this gap and their findings indicate that in datasets with thousands of individuals, phase-free approaches outperform phase-based approaches.
Introduction
The problem of inferring the ancestral population of each locus in an admixed individual’s genome, or local ancestry inference (LAI), has received attention for well over a decade (Hoggart et al. 2004; Montana and Pritchard 2004; Patterson et al. 2004; Sankararaman et al. 2008; Price et al. 2009; Baran et al. 2012; Maples et al. 2013; Guan 2014; Browning et al. 2023), and a variety of LAI approaches now exist. Early LAI methods only analyzed markers that are not in linkage disequilibrium (LD) (Hoggart et al. 2004; Montana and Pritchard 2004; Patterson et al. 2004), thus reducing the information available for inference while greatly simplifying the modeling. The advent of the Li and Stephens haplotype model (Li and Stephens 2003), which accounts for LD, enabled the development of several LAI methods that leverage all available marker data. HAPMIX (Price et al. 2009) was one of the first approaches to exploit this rich information, which yielded dramatic improvements in both the accuracy and resolution of LAI compared to earlier methods. Subsequent methods that allow for multiple source populations or have improved runtime scaling have been developed (Baran et al. 2012; Maples et al. 2013; Guan 2014; Browning et al. 2023), with many recent LAI methods requiring that the admixed individuals be phased prior to analysis (Maples et al. 2013; Browning et al. 2023). On the other hand, methods like HAPMIX, LAMP-LD (Baran et al. 2012), and ELAI (Guan 2014) analyze unphased genotypes and output inferred local ancestry in an unphased manner.
Local ancestry calls have wide-ranging use in both medical and population genetics. Admixture mapping (Winkler et al. 2010; Shriner 2017) detects trait-associated loci by exploiting the fact that inferred local ancestry tracts are a proxy for variants not directly tested in a study and for a large number of haplotypic backgrounds from the same population. Moreover, methods exist for increasing the power of genome-wide association studies (Pasaniuc et al. 2011; Atkinson et al. 2021) and for improving heritability estimates and mapping in eQTL studies (Zhong et al. 2019) by including local ancestry calls. At the same time, population genetic analyses have used LAI calls to characterize demographic patterns among admixed groups, detecting geographic trends that reflect historical migrations (Bryc et al. 2015), and have inferred times since admixture using local ancestry tract lengths (Gravel 2012).
Phasing errors, or misassignment of alleles to haplotypes (herein measured as switch errors), can confound LAI by introducing a short tract of a different ancestry onto a haplotype (in cases of two nearby switch errors) or by prematurely ending a tract. Besides merely switching the ancestry assignment to the opposite haplotype, switch errors decrease the quality of LAI calls in part because accurately detecting tract boundaries is difficult, and short, incorrectly phased tracts may not be detected. Furthermore, LAI switch errors can have important consequences for inferring admixture demography in that they alter observed tract lengths (Gravel 2012). Phase-based LAI methods sometimes attempt to account for phasing errors, as in RFMix, which jointly models switch errors and local ancestry (Maples et al. 2013). By contrast, phase-free LAI methods do not read phase information for the target samples and are, therefore, unaffected by switch errors.
Recently developed statistical phasing methods can scale to thousands or even millions of individuals (Browning et al. 2021; Hofmeister et al. 2023), and when analyzing the UK Biobank or 23andMe datasets, can correctly phase even entire chromosomes for many samples (ignoring single-site errors; Williams et al. 2024). Even so, such high phase quality is only achievable given either family data or very large sample sizes (since statistical phasing accuracy is a function of sample size; Browning and Browning 2011). Phase accuracy is also a function of the number of genotyped variants (Browning and Browning 2011) and genotyping accuracy. Sample sizes, variant density, and genotype accuracy are often lower when analyzing nonhuman data, particularly in nonmodel organism studies. As such, it is beneficial to consider the impact of phase quality on LAI, noting that phase accuracies that would be considered low in human studies may be representative or even high when analyzing nonhuman data.
Given the many applications of LAI and an appreciation of the associated phasing issues, characterizing the performance of LAI methods is important; yet, many comparisons are in the papers that describe new methods, with relatively few independent analyses. One recent LAI benchmarking study included a range of LAI methods and is complementary to our work in that it does not analyze the impact of phasing errors on the tools (Schubert et al. 2020).
Here, we present a comparison of four LAI methods using genome-wide data for simulated individuals of mixed African and European ancestry. To appropriately model phasing error, we converted the simulated haplotypes into diploid genotypes and phased these using a range of sample sizes, thus accounting for and quantifying the impact of phasing errors on LAI. Moreover, we considered the impact of demography—genome-wide ancestry fraction and time since admixture—as well as phasing choice with respect to the LAI reference panels on LAI accuracy.
Materials and methods
To systematically investigate the effect of various demographic parameters and phasing options on LAI accuracy, we developed a pipeline (Fig. 1) that: (1) simulates genotypes of mixed African and European ancestry with African ancestry proportion and time since admixture (and separately ); and (2) phases these genotypes using three different sample sizes (noting that sample size substantially impacts phase quality; Browning and Browning 2011). At the same time, we varied phasing with respect to the unadmixed reference panels by applying three different panel phasing strategies (see below). To further quantify how much phase quality affects LAI accuracy, we also analyzed perfectly phased haplotypes as generated by simulation.
Fig. 1.
Flowchart depicting our simulation and phasing pipeline. In step a), we simulated two-way admixed individuals under different settings of T and p; in step b), we pooled the simulated data from a) with small, medium, and large sample sizes from the PAGE data; and in step c), we applied three different panel phasing strategies (see Varying simulation phase quality and panel phasing strategies in Materials and methods). Finally in step d), we inferred local ancestry on the output from c) and compared the resulting inferred calls with the ground truth using Pearson correlation.
The LAI methods we benchmarked fall into two categories: phase-free, which take unphased data as input and output unphased local ancestry calls, and phase-based, which require phased haplotypes and output phased local ancestry calls. In the phase-free category, we applied HAPMIX (Price et al. 2009; version 2) and LAMP-LD (Baran et al. 2012; version 1.1), and in the phase-based category we ran RFMix (Maples et al. 2013; version 2) and FLARE (Browning et al. 2023; version 0.3.0). Notable omissions from our list of LAI methods include LOTER (Dias-Alves et al. 2018), MOSAIC (Salter-Townshend and Myers 2019), ELAI (Guan 2014), and MULTIMIX (Churchhouse and Marchini 2013), which have comparable overall accuracy to the approaches we analyzed (Maples et al. 2013; Schubert et al. 2020). The different phasing options we applied reflect workflows that could occur under variable real-world constraints, namely access to admixed datasets of different sizes, choice of panel-based phasing strategy, and choice of LAI method (which may itself be constrained by phase quality). For example, a user who has few admixed samples from a population of interest might phase them jointly and retain the phase supplied by, say, the 1,000 Genomes Project (1,000 G) (1000 Genomes Project Consortium et al. 2015) for the unadmixed reference haplotypes (i.e. rather than rephasing this panel; see below). There are also several statistical phasing tools to choose from (Loh et al. 2016; Delaneau et al. 2019; Browning et al. 2021), and in this study we used SHAPEIT4 (Delaneau et al. 2019). Alternatively, the phasing step may not be necessary if a user applies a phase-free LAI method such as HAPMIX or LAMP-LD.
We evaluated the LAI accuracies by measuring the Pearson correlation (R) between the inferred and simulated local ancestry calls (the latter are the ground truth assignments) across all markers. We calculated this correlation using unphased (i.e. diploid) local ancestry calls for two reasons. First, measuring accuracy in a phase-based way is complicated by the presence of switch errors that are introduced upstream of the LAI method. Second, ignoring phase ensures a uniform metric of comparison between phase-based and phase-free LAI methods.
Together with these accuracy calculations, we evaluated the LAI methods’ performance for estimating time since admixture using an adapted version of a Markovian model described by Gravel (2012). We applied this model directly on the output of the phase-based method FLARE and indirectly using PAPI (Avadhanam and Williams 2022) on the output of the phase-free method HAPMIX.
Data-efficient simulation of admixed genotypes
In recent work (Avadhanam and Williams 2022), we described a pipeline for simulating admixed individuals that first uses Ped-sim (Caballero et al. 2019) to generate crossovers in a fixed pedigree, and then uses admix-simu (Williams 2016) to sample haplotype segments at each crossover breakpoint. The latter draws each segment from the same population as the unadmixed ancestor that transmitted it, with the advantage (versus simply using Ped-sim) of requiring far fewer unadmixed individuals to simulate each person. As an example, unmodified, Ped-sim requires 128 founders to simulate one admixed individual when T = 7. Furthermore, these founders must be excluded from any subsequent simulation in order to produce unrelated samples. Our pipeline uses far fewer founders, with the requisite number being independent of T (Avadhanam and Williams 2022). We employed a pedigree topology containing the admixed individual in the most recent generation (generation 0) and all that person’s ancestors up to and including generation T (i.e. with founders in the oldest generation). The pipeline also uses the parameter p to specify the proportion of founders of African ancestry (Avadhanam and Williams 2022).
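The founder requirement quoted above follows from simple arithmetic, sketched here under the assumption that the unmodified approach uses a complete pedigree in which every ancestor T generations back is a distinct founder. The function names are illustrative and are not part of Ped-sim or admix-simu.

```python
def full_pedigree_founders(generations: int) -> int:
    """Founders in a complete pedigree whose oldest generation is
    `generations` steps above the focal individual (assumed: no inbreeding,
    so the number of ancestors doubles every generation)."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 2 ** generations


def founders_for_unrelated_samples(n_samples: int, generations: int) -> int:
    """Founders needed if no founder may be reused across simulated samples."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return n_samples * full_pedigree_founders(generations)


# Growth of the founder requirement with pedigree depth.
for depth in range(1, 8):
    print(depth, full_pedigree_founders(depth))
```

Under these assumptions, simulating 90 unrelated individuals from depth-7 pedigrees would need 90 × 128 = 11,520 distinct founders, far more than the 107 YRI and 95 CEU samples available, which illustrates why a data-efficient sampling scheme is attractive.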
In order to make maximal use of the unadmixed 1,000 G individuals and reserve some as LAI reference panels, we simulated by dividing the unadmixed data into three batches, each composed of a nonoverlapping set of 35 YRI and 31 CEU individuals (out of 107 YRI and 95 CEU samples). We mapped these 3 batches arbitrarily to the 3 settings of and simulated admixed individuals from that assigned founder batch for all 3 settings of T. To better represent the variance from sampling crossover breakpoints and founder haplotype segments, we simulated 3 replicates of 30 admixed individuals for each of the nine settings of T and p (for a total of 90 admixed individuals per setting). We used the phase supplied by 1,000 G for these simulations and excluded all trio/duo children. This process ensures that each batch has a corresponding nonoverlapping holdout set of 72 YRI and 64 CEU individuals for use as LAI reference panels.
Following these simulations, for each replicate, we merged the 30 simulated samples with additional admixed individuals of three different sample sizes (small, medium, and large) and phased these using three different panel phasing strategies (see the next subsection and Fig. 1). Considering all sample sizes and phasing strategies, this yielded a total of nine phasing conditions analyzed for each setting of p and T. A benefit of this design is that every comparison between phasing sample sizes and panel strategies operates on the same simulated individuals, thus automatically controlling for the noise generated by the simulation of admixed genotypes. In other words, while comparisons between different values of p and/or T analyze different simulated admixed samples, comparisons between different phasing sample sizes or panel phasing strategies (but the same p and T) consider the exact same simulated data.
Varying simulation phase quality and panel phasing strategies
We induced varying degrees of phase quality by pooling the simulated admixed individuals with different numbers of individuals from the BioMe Biobank subset of the Population Architecture using Genomics and Epidemiology (PAGE II; Wojcik et al. 2017; Wojcik et al. 2019) study and phasing them jointly using SHAPEIT4 (Delaneau et al. 2019; see below and Fig. 1). To that end, we ascertained two-way admixed PAGE individuals of mixed African and European ancestry by following a procedure that we employed in earlier work (Avadhanam and Williams 2022). We first ran ADMIXTURE (Alexander et al. 2009) with on the PAGE samples merged with 176 HapMap trio parents evenly split between the CEU and YRI populations; this allowed us to determine which of the K components corresponded to African and European ancestry. We then selected individuals that have (1) ≥5% African ancestry; (2) ≥5% European ancestry; and (3) the sum of these two ancestries ≥99.5%; this yielded 5,786 PAGE samples. We then randomly sampled 5,780 of these PAGE individuals to use as the “large” sample size, 2,890 for the “medium” sample size, and 580 for the “small” sample size. We verified that the phasing error generally correlated inversely with sample size by measuring switch error rates using vcftools with the --diff-switch-error flag, where we took the phase output by the simulator to be the ground truth (see Results).
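The switch error measurement can be illustrated with a minimal sketch. It assumes one common definition of the switch error rate: among consecutive heterozygous sites, the fraction of adjacent pairs whose relative phase differs between the inferred and true haplotypes. The exact conventions used by vcftools may differ in detail, and the function and type names below are hypothetical.

```python
from typing import List, Sequence, Tuple

# One phased genotype: (allele on haplotype 0, allele on haplotype 1).
PhasedGenotype = Tuple[int, int]


def switch_error_rate(truth: Sequence[PhasedGenotype],
                      inferred: Sequence[PhasedGenotype]) -> float:
    """Fraction of adjacent heterozygous-site pairs with a phase switch."""
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same sites")

    same_orientation: List[bool] = []
    for t, q in zip(truth, inferred):
        t, q = tuple(t), tuple(q)
        if sorted(t) != sorted(q):
            raise ValueError("unphased genotypes disagree at a site")
        if t[0] == t[1]:
            continue  # homozygous sites carry no phase information
        same_orientation.append(t == q)

    if len(same_orientation) < 2:
        return 0.0  # fewer than two heterozygous sites: no switch possible

    # A switch occurs wherever the orientation changes between neighbours.
    switches = sum(a != b for a, b in
                   zip(same_orientation, same_orientation[1:]))
    return switches / (len(same_orientation) - 1)


truth = [(0, 1), (0, 1), (1, 0), (0, 0), (0, 1)]
inferred = [(0, 1), (1, 0), (0, 1), (0, 0), (1, 0)]
print(switch_error_rate(truth, inferred))
```

In this toy example, the inferred haplotypes match the truth at the first heterozygous site and are flipped at the remaining three, so a single switch is counted among three adjacent pairs. A sample whose two haplotypes are swapped along the entire chromosome has no switch errors under this definition.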
Besides varying the sample size, we applied three different panel-based phasing strategies: (1) the “default” phasing strategy, where we retained the original 1,000 G phase for the panels; (2) the “reference” phasing strategy, where we used the --reference option so that SHAPEIT4 conditions on the panel haplotypes when phasing the admixed samples (this is recommended when using such a panel for downstream analyses such as imputation or LAI); and (3) the “rephase” strategy, where we pooled the reference panels with the admixed genotypes and retrieved the resulting rephased haplotypes for use as panels for LAI. The latter two strategies help make the admixed individuals’ phase more consistent with that of the reference panel, which contrasts with the default strategy, where the reference haplotypes have the phase present in the 1,000 G data, which were generated independently of the admixed individuals.
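The difference between the three strategies comes down to which samples are given to the phaser and whether a reference panel is supplied. The sketch below builds SHAPEIT4 command lines for each case without running them. The executable name, file paths, and region are placeholders invented for illustration, and the merging of panel and admixed genotypes for the rephase strategy is assumed to happen in a separate upstream step.

```python
from typing import List, Optional


def shapeit4_command(input_vcf: str, genetic_map: str, region: str,
                     output_vcf: str, reference_vcf: Optional[str] = None,
                     threads: int = 1) -> List[str]:
    """Assemble (but do not execute) a SHAPEIT4 invocation."""
    cmd = ["shapeit4",  # assumed executable name; installations may differ
           "--input", input_vcf,
           "--map", genetic_map,
           "--region", region,
           "--output", output_vcf,
           "--thread", str(threads)]
    if reference_vcf is not None:
        # Condition on already-phased panel haplotypes.
        cmd += ["--reference", reference_vcf]
    return cmd


# Hypothetical file names for chromosome 1.
GMAP, REGION = "chr1.gmap.gz", "1"

# (1) default: phase admixed samples alone; the panel keeps its original phase.
default_cmd = shapeit4_command("admixed.chr1.vcf.gz", GMAP, REGION,
                               "admixed.default.chr1.vcf.gz")

# (2) reference: phase admixed samples conditional on the panel haplotypes.
reference_cmd = shapeit4_command("admixed.chr1.vcf.gz", GMAP, REGION,
                                 "admixed.reference.chr1.vcf.gz",
                                 reference_vcf="panel.chr1.vcf.gz")

# (3) rephase: phase admixed and panel genotypes together, then split the
# output so the rephased panel haplotypes can serve as the LAI reference.
rephase_cmd = shapeit4_command("admixed_plus_panel.chr1.vcf.gz", GMAP, REGION,
                               "pooled.rephase.chr1.vcf.gz")

for name, cmd in [("default", default_cmd), ("reference", reference_cmd),
                  ("rephase", rephase_cmd)]:
    print(name, " ".join(cmd))
```

Each list could be passed to `subprocess.run` once real paths are substituted. Keeping command construction separate from execution makes it easy to check that only the reference strategy supplies a panel to the phaser.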
Filtering and processing the PAGE data
The data we used for this work are a merged version of the PAGE and 1,000 G data that we used previously. Full details of the merging process and quality control filters we applied are available in our earlier work (Avadhanam and Williams 2022), and we provide a brief summary here. The key steps of the pipeline are filtering SNPs in the PAGE dataset and intersecting these filtered data with the 1,000 G dataset to obtain a common set of SNPs. We filtered the PAGE SNPs using the quality control report distributed with the dataset. This applies a composite filter including those for, among others, Hardy–Weinberg equilibrium, sites with discordant calls in duplicated samples, and those with Mendelian errors. We further ensured there were no allele coding inconsistencies between the two datasets by recoding the PAGE data to the forward strand, filtering out A/T and C/G SNPs, and applying an allele frequency difference test filter. These steps yielded 494,219 markers common to both datasets, which we used for all analyses.
Performing LAI and measuring accuracy
After phasing with SHAPEIT4, we passed the datasets as input to each of the LAI methods (in the case of phase-free LAI, we erased the phase) and we used 1,000 G YRI and CEU haplotypes as LAI reference panels. As described above, we ensured that in all cases the LAI reference panels and the set of individuals that we used as simulation founders were disjoint. We ran the LAI methods with default settings (except for FLARE where we set min-mac=0 and min-maf=0) and on each chromosome separately.
We compared the inferred local ancestry calls with the ground truth by representing all markers sequentially (across all 22 chromosomes) in vectors and whose elements represent the number of European haplotypes at a site. The Pearson correlation coefficient (R) between and then gives a measure of performance for each simulation setting and LAI method used. For HAPMIX, which outputs a posterior probability of each ancestry state at each marker (i.e. for ), we first calculate before computing the Pearson correlation coefficient between and . The elements of are , which represent the expected local ancestry call for each marker.
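The accuracy metric can be sketched in a few lines. The example assumes that each marker's posterior is given as three probabilities for 0, 1, or 2 European haplotypes, in that order, so that the expected call is the probability-weighted sum of those counts. This ordering and the function names are assumptions made for illustration and do not reflect HAPMIX's actual output format.

```python
import math
from typing import List, Sequence, Tuple


def expected_calls(posteriors: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Expected number of European haplotypes per marker.

    Each entry is (P(0 copies), P(1 copy), P(2 copies)); the expectation
    is 0*P0 + 1*P1 + 2*P2.
    """
    return [p1 + 2.0 * p2 for (_p0, p1, p2) in posteriors]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Toy data: true diploid calls at six markers and made-up posteriors.
truth = [0, 0, 1, 2, 2, 1]
posteriors = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.1, 0.8, 0.1),
              (0.0, 0.1, 0.9), (0.0, 0.3, 0.7), (0.2, 0.7, 0.1)]
print(pearson_r(truth, expected_calls(posteriors)))
```

For methods that output hard calls, the inferred vector can be passed to `pearson_r` directly. In a real analysis the vectors would span all markers on all 22 chromosomes rather than six toy sites.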
Comparing admixture time estimates
Our final analysis examines the performance of estimating time since admixture using local ancestry calls from phase-free and phase-based LAI. For phase-free LAI, we applied PAPI (Avadhanam and Williams 2022), a tool for inferring parental admixture proportions and times since admixture from unphased local ancestry calls. PAPI produces two estimates of admixture time, one for each parent of the admixed sample, which we average to obtain a single admixture time estimate ().
To estimate admixture times from phased local ancestry calls, we first grouped together the tracts of Morgan length corresponding to each ancestry , where , and is the number of tracts with ancestry a. We then computed four statistics:
| (1) |
and
| (2) |
Here, is the estimated exponential rate parameter for switching from ancestry a to the opposite ancestry (switching from to , for example), and is the focal individual’s estimated ancestry fraction from population a. The term denotes the number of tracts of ancestry a that occur at the end of a chromosome (so ) and the factor in the numerator of Equation (1) accounts for the fact that a chromosome end prematurely cuts off a tract before the next crossover is observed (Caballero et al. 2019). Note that the denominator of Equation (2) is always twice the length of the genome.
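Written out under assumed notation, the statistics described here can take the following form. The symbols are labels chosen for this sketch rather than notation confirmed by the source: $\ell_i^a$ is the Morgan length of the $i$-th tract of ancestry $a \in \{0, 1\}$, $n_a$ is the number of such tracts, $e_a$ is the number of them that end at a chromosome end, $\hat{\lambda}_a$ is the estimated switch rate out of ancestry $a$, and $\hat{p}_a$ is the estimated ancestry fraction.

```latex
\begin{align}
  \hat{\lambda}_a &= \frac{n_a - e_a}{\sum_{i=1}^{n_a} \ell_i^a},
  \tag{1} \\
  \hat{p}_a &= \frac{\sum_{i=1}^{n_a} \ell_i^a}
                    {\sum_{i=1}^{n_0} \ell_i^0 + \sum_{i=1}^{n_1} \ell_i^1}.
  \tag{2}
\end{align}
```

In this form, the numerator of (1) counts only tracts that end in an observed ancestry switch, and the denominator of (2) sums all tract lengths over both haplotypes, which is twice the genome length.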
Next, we applied PAPI’s internal model for estimating admixture time to these haploid tracts. PAPI assumes that the observed local ancestry tracts are generated by transmitted crossovers within a pedigree that includes unadmixed founders of different ancestries. Further, PAPI’s model treats these founder haplotypes as a pool, where the observed haplotypes after T generations of crossovers are generated by a Markovian path that switches to a random haplotype in the pool at rate T per Morgan. Under this model, the rate of between-ancestry crossovers (or switches) from ancestry a to ancestry is approximately the overall switch rate T times a factor equal to the proportion of haplotypes with ancestry in the founder pool (Gravel 2012; Avadhanam and Williams 2022). A reasonable estimate of that proportion is simply the proportion of this ancestry in the admixed individual being analyzed, so:
| (3) |
Solving this for T gives two estimates of admixture time, one for and one for : and . We combined these by simply averaging them, and our estimate of admixture time from phased LAI is thus .
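The phase-based estimator can be sketched end to end as follows. The sketch assumes that the switch rate out of ancestry a is estimated as the number of tracts of that ancestry not truncated by a chromosome end divided by their total Morgan length, and that dividing this rate by the estimated fraction of the opposite ancestry gives a per-ancestry time estimate. The tract representation and function name are hypothetical and this is not PAPI's implementation.

```python
from typing import Iterable, Tuple

# (ancestry label 0 or 1, length in Morgans, tract ends at a chromosome end)
Tract = Tuple[int, float, bool]


def estimate_admixture_time(tracts: Iterable[Tract]) -> float:
    """Average of the two per-ancestry admixture time estimates."""
    total_len = [0.0, 0.0]   # summed tract length per ancestry
    n_tracts = [0, 0]        # number of tracts per ancestry
    n_at_end = [0, 0]        # tracts cut off by a chromosome end

    for ancestry, length, at_end in tracts:
        if ancestry not in (0, 1):
            raise ValueError("ancestry must be 0 or 1")
        if length <= 0:
            raise ValueError("tract lengths must be positive")
        total_len[ancestry] += length
        n_tracts[ancestry] += 1
        n_at_end[ancestry] += int(at_end)

    if total_len[0] == 0.0 or total_len[1] == 0.0:
        raise ValueError("both ancestries must be present")

    genome_len = total_len[0] + total_len[1]
    estimates = []
    for a in (0, 1):
        # Switch rate out of ancestry a, ignoring chromosome-end truncations.
        rate = (n_tracts[a] - n_at_end[a]) / total_len[a]
        # Fraction of the opposite ancestry in this individual.
        p_other = total_len[1 - a] / genome_len
        estimates.append(rate / p_other)
    return (estimates[0] + estimates[1]) / 2.0


# Toy input: two 1-Morgan haploid chromosomes, each with one ancestry switch.
toy_tracts = [(0, 0.5, False), (1, 0.5, True),
              (1, 0.5, False), (0, 0.5, True)]
print(estimate_admixture_time(toy_tracts))
```

In the toy input, each ancestry has one observed switch over one Morgan of sequence and the opposite ancestry fraction is 0.5, so both per-ancestry estimates equal 2 generations. Spurious short tracts created by switch errors add to the switch counts without adding length, which inflates the estimate.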
Results
We measured LAI accuracy for two phase-free methods—HAPMIX and LAMP-LD—and two phase-based methods—RFMix and FLARE—across every combination of four input parameters: , , phasing sample size (small, medium, and large), and panel phasing strategy (default, reference, and rephase) (see Materials and methods). We further examined a wider range of times since admixture by considering . Prior to these LAI analyses, we calculated switch error rates in the phased data in order to investigate the sensitivity of phase quality to the parameters; a priori we expected only phasing sample size to have a strong effect on quality. Moreover, to better understand the impact of phase on LAI, we examined the performance of each method using perfectly phased data and also applied a phase correction algorithm that analyzes local ancestry calls (see below). Finally, we estimated the impact of LAI accuracy on downstream admixture time estimates using LAI tracts from one phase-free method (HAPMIX) and one phase-based method (FLARE).
Phasing sample size and other simulation parameters affect switch error rates
After phasing with SHAPEIT4, we measured switch error rates in the simulated samples for every setting of T, p, sample size, and panel phasing strategy (Fig. 2, Supplementary Fig. S1). Overall, the variable with the largest impact on phase quality is sample size. Regardless of strategy or simulation setting, jointly phasing the simulated individuals with the largest possible set of accompanying samples produces the lowest switch error rates, as is commonly seen in phasing analyses (Browning and Browning 2011; Fig. 2c). Averaged across all settings of T, p, and phasing strategy, the switch error rates for the small, medium, and large phasing sample sizes are , , and , respectively. The relative improvement obtained by moving from the small () to medium () sample size is considerably greater () than that of moving from the medium to large () sample size (). This is again consistent with prior studies: phase quality improves dramatically when increasing from smaller sample sizes, but shows a trend of diminishing returns in larger samples (Browning and Browning 2011).
Fig. 2.
Switch error rates depend on phasing sample size and several simulation parameters. Switch error rates plotted against a) proportion of African ancestry p, b) panel phasing strategy, c) phasing sample size, and d) admixture time T (i.e. generations since admixture). Each panel includes data points for all values of the other variables. Boxplot lengths represent the inter-quartile range (IQR) and whiskers extend up to .
As may be expected, phase quality is also sensitive to panel phasing strategy (see Fig. 2b, Supplementary Fig. S2): the average switch error rate is lowest for the reference phasing strategy at , followed by the rephase strategy at , and default phasing at . This indicates that including unadmixed reference haplotypes when phasing—either by pooling them with the admixed individuals as in the rephase approach, or conditioning on them as in reference phasing—improves phase quality markedly. However, the effect may be due to an increase in effective phasing sample size and not specifically because we included these unadmixed reference haplotypes.
In turn, phasing simulated individuals with a larger proportion of African ancestry (p) yields improved phase, with the lowest switch error rate of occurring when , followed by when , and when (see Fig. 2a, Supplementary Fig. S3). A possible explanation is that the admixed haplotypes simulated with may be more similar to the admixed haplotypes from the PAGE dataset that they are pooled with—the PAGE per-person average is (Avadhanam and Williams 2022). Finally, the switch error rates show effectively no dependence on the time since admixture T, with only slight variation that may be driven by statistical noise (, , and , respectively, for , 6, and 7; Fig. 2d).
Phase-free LAI is more accurate than phase-based LAI for recent admixture and a range of sample sizes
A central question in our analysis is whether phase-free LAI is more accurate than phase-based LAI given the phase quality our data provided. We find that both phase-free methods (HAPMIX and LAMP-LD) are substantially more accurate than phase-based methods (RFMix and FLARE), even when the latter receive the highest-quality phased data (Fig. 3, Supplementary Fig. S4). In particular, HAPMIX outperforms all other methods in all parameter settings, with an average R of (range 0.965–0.995); the average R of LAMP-LD is slightly lower at (range 0.946–0.993). The corresponding R values for the phase-based methods are for FLARE (range 0.905–0.983) and for RFMix (range 0.909–0.981).
Fig. 3.
Performance of LAI methods across all simulation parameters. Correlations R between inferred and true local ancestry assignments for all LAI methods plotted against a) proportion of African ancestry p, b) panel phasing strategy, c) phasing sample size, and d) admixture time T (i.e. generations since admixture). Each panel includes data points from all other simulation variables. Boxplot lengths represent the IQR and whiskers extend up to .
A further advantage of using phase-free methods is that they ignore the phase of the target samples by design and are therefore robust to the impact of phasing-specific variables (notably, we input the same data to all tools, but erased the phase for the phase-free methods). In particular, HAPMIX and LAMP-LD show an average R that is identical ( and , respectively) across all phasing sample sizes and phasing strategies.
To assess the generalizability of these findings, we repeated these analyses for simulated admixture times while applying only the reference phasing strategy (Fig. 4). The performances of all methods diminish as T increases, which is unsurprising because local ancestry tract lengths decrease as T increases. Notably, LAMP-LD’s accuracy drops more rapidly for larger T than other methods, such that while it outperforms the phase-based methods in all settings, when , it is only slightly more accurate at mean versus for FLARE and for RFMix. In turn, RFMix and FLARE have very similar accuracy when (mean and , respectively), but RFMix’s relative performance increases for larger T, with an accuracy that differs from FLARE’s more noticeably at (mean and , respectively). Overall, HAPMIX outperforms all methods by for all simulated settings of T, highlighting the effectiveness of its model that analyzes data free of switch errors.
Fig. 4.
Performance of LAI methods for larger values of T using the reference phasing strategy. Each panel shows boxplots of correlations (R) between inferred and true local ancestry assignments across different simulation settings of a) proportion of African ancestry p, b) phasing sample size, and c) admixture time T (i.e. generations since admixture). Boxplot lengths represent the IQR and whiskers extend up to .
Phasing settings and demographic parameters impact LAI performance
As expected for phase-based LAI approaches, RFMix and FLARE are both sensitive to phasing sample size (Fig. 3c, Supplementary Fig. S4), with RFMix performing identically to FLARE for medium () and large () sample sizes, but outperforming FLARE when the sample size is low ( and respectively). Both methods underperform relative to the phase-free methods, as noted above.
With phase-based LAI, we find that the reference and rephase strategies work equally well and are more accurate than the default strategy (Fig. 3b, Supplementary Fig. S4). Specifically, RFMix has an average accuracy of for both the reference and rephase approaches (identical up to three significant figures) and for the default strategy. FLARE has an average for the reference and rephase strategies, and an R value of with the default approach. When the input phasing sample sizes are small, the improvements of reference and rephase strategies ( and respectively, for RFMix and for both strategies using FLARE) over the default phasing strategy ( and for RFMix and FLARE, respectively) are more substantial.
All LAI methods that we examined are relatively insensitive to African admixture proportion p. For instance, the phase-based methods RFMix and FLARE have virtually identical (up to three significant figures) average R of 0.957 and 0.954, respectively, at and , and a slightly higher R of 0.959 and 0.958, respectively, when (Fig. 3a, Supplementary Fig. S4). LAMP-LD has a drop in accuracy between and , from 0.978 to 0.976, and an increase in R to 0.979 when . Overall, HAPMIX is the most robust to changes in p, staying consistently more accurate than the rest at (identically up to three significant figures) for all values of p.
In contrast to p, all methods show a similar and clear trend of decreasing accuracy as T increases (Figs. 3d, 4, Supplementary Fig. S4). More specifically, when moving from to , the methods’ mean R values decrease by for HAPMIX, for LAMP-LD, for FLARE, and for RFMix. This is likely due to the increased number of ancestral recombinations for larger values of T that produce shorter local ancestry tracts. As shorter tracts have fewer SNPs than longer ones, they have lower informativeness, making LAI more challenging. Furthermore, calling local ancestry at tract boundaries is difficult, and a larger number of such boundaries may also contribute to the reduction in accuracy. Here, as with admixture proportion, HAPMIX remains the most accurate LAI method across the range of admixture times we tested.
HAPMIX outperforms phase-based methods even when simulated data is free of switch errors
To further characterize the effect of phase on LAI, we ran each method on the perfectly phased data produced by our simulation pipeline. Strikingly, HAPMIX outperformed all other methods even in this limiting case, with mean (Fig. 5). Even so, FLARE’s performance with mean is on par with that of HAPMIX, and between the phase-based methods, FLARE better exploits the lack of switch errors in these data (RFMix’s mean ). Both phase-based methods outperformed LAMP-LD’s mean by a considerable margin. This is consistent with another benchmarking study that analyzed perfectly phased two-way admixed samples and found that RFMix has higher accuracy than LAMP-LD (Schubert et al. 2020). Yet, it contrasts with our results using statistically phased data (Figs. 3 and 4)—highlighting the importance of analyzing realistically phased data to test LAI methods.
Fig. 5.
HAPMIX outperforms phase-based LAI even when perfectly phased data is available. Plot shows correlations R between the inferred and true local ancestry assignments by phasing sample size, including results for FLARE and RFMix when analyzing perfectly phased data (as produced by the simulation pipeline). Each panel includes data points from all other simulation variables (with ). Boxplot lengths represent the IQR and whiskers extend up to .
Taken together, our results suggest that HAPMIX’s underlying model can produce the most accurate diploid calls regardless of phase quality or sample demography. It is surprising that HAPMIX performs better than the phase-based methods even when they are given perfect phase. There could be many reasons for this—the modeling approaches differ markedly between HAPMIX (a generative model) and RFMix (a discriminative one); and HAPMIX and FLARE, despite being similar in many respects, each implement different extensions of the Li and Stephens model (for example, HAPMIX has a mis-copying parameter that allows a small probability of copying from a haplotype of a different ancestry than the one represented in the hidden state). In general, the methods differ with respect to parameterization, and HAPMIX could be better tuned for the settings we investigated; there are also compute time versus accuracy tradeoffs—HAPMIX is more accurate but considerably slower than other methods we tested.
We also applied TRACTOR’s phase correction algorithm (Atkinson et al. 2021), which rephases input data by detecting local ancestry-based switch errors, i.e. nearby positions where the ancestries of the two haplotypes switch. RFMix’s performance after applying TRACTOR is modestly improved, with using the corrected haplotypes versus with the original data (averaged over all simulation parameters). With the exception of the default panel phasing strategy, the performance improves across all sample sizes, demographic parameters, and phasing conditions (Supplementary Fig. S5). Despite this, the accuracy of the phase-corrected LAI calls remains much lower than that of the HAPMIX calls. For example, the average R for HAPMIX is 0.988, which is still 2.92% higher than that of TRACTOR-corrected RFMix ().
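The idea of detecting ancestry-based switch errors can be sketched as follows. This is a deliberately simplified illustration, not TRACTOR's actual implementation: it assumes per-site ancestry labels for the two haplotypes of one individual, flags sites where both haplotypes change ancestry at the same position in opposite directions, and swaps the haplotype labels downstream of each such site. The function name `unkink` and the input encoding (0 and 1 for the two ancestries) are hypothetical.

```python
import numpy as np


def unkink(anc_a, anc_b):
    """Swap haplotype ancestry labels downstream of every site where both
    haplotypes switch ancestry simultaneously and in opposite directions.

    Returns the corrected arrays and the number of swaps performed."""
    a = np.asarray(anc_a).copy()
    b = np.asarray(anc_b).copy()
    if a.shape != b.shape:
        raise ValueError("haplotype arrays must have the same length")
    n_fixed = 0
    for i in range(1, len(a)):
        both_switch = a[i] != a[i - 1] and b[i] != b[i - 1]
        exchanged = a[i] == b[i - 1] and b[i] == a[i - 1]
        if both_switch and exchanged:
            # Likely a phase switch error: exchange everything from site i on.
            a[i:], b[i:] = b[i:].copy(), a[i:].copy()
            n_fixed += 1
    return a, b, n_fixed


if __name__ == "__main__":
    hap_a = [0, 0, 0, 1, 1, 1]
    hap_b = [1, 1, 1, 0, 0, 0]
    print(unkink(hap_a, hap_b))
```

A real correction would also need to exchange the alleles of the two haplotypes at the same sites and to tolerate switches that are near each other but not at exactly the same position. A switch on only one haplotype is left untouched here because it is consistent with a true ancestry change.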
Admixture time estimates using phase-based LAI are strongly biased
To characterize the impact of LAI accuracy on downstream applications, we calculated admixture time estimates using the output of both FLARE and HAPMIX. Because HAPMIX produces unphased local ancestry calls, we provided these as input to PAPI (Avadhanam and Williams 2022) and report its time estimates; we also applied a model that parallels that of PAPI to FLARE’s tracts (see Materials and methods). Figure 6 plots the deviations of the admixture time estimates from the ground truth () for the reference phasing strategy data and for each phasing sample size. Strikingly, estimates using HAPMIX’s calls are virtually unbiased, while the FLARE-based estimates are strongly upwardly biased, even when the phase quality is high. Specifically, the average deviation values using FLARE segments are 14.2, 11.1, and 9.84 for small, medium, and large sample sizes, respectively. (For context, FLARE also internally estimates admixture time and these values are similarly upwardly biased, with an average deviation of 9.61 for the large sample size and reference phasing strategy setting, considering .) In contrast, HAPMIX has a slight downward bias of 0.0909 for all sample sizes (as a phase-free LAI method, HAPMIX’s results are expected to remain unchanged with respect to phasing sample size).
Fig. 6.
Deviations of admixture time estimates using local ancestry calls from HAPMIX and FLARE. Plots depict histograms of the deviations of admixture time estimates for individual samples using a phase-free method (applying PAPI to the output of HAPMIX) and a phase-based method (applying Equation (3) to the output of FLARE) for small (top), medium (middle), and large (bottom) phasing sample sizes. Data points are pooled across all other simulation variables.
Discussion
In this work, we analyzed the performance of four state-of-the-art LAI methods and found that despite large sample sizes and good quality phase, phase-based methods do not perform as well as phase-free ones in terms of call accuracy of unphased ancestry states (Fig. 3). Furthermore, phase-free LAI remains more accurate even after applying TRACTOR’s phase correction algorithm, which identifies likely switch errors from changes in the local ancestry on both haplotypes of an individual (Supplementary Fig. S5). Even so, use of perfectly phased data greatly improves the phase-based methods’ performance, with FLARE’s accuracies being only slightly inferior to those of HAPMIX in this case (Fig. 5).
Based on our analyses, the most important factor that affects phase quality in admixed individuals is sample size (Fig. 2). Moreover, for a fixed sample size, we found empirically that it is best to include the unadmixed reference haplotypes when phasing, with a slight improvement by conditioning on the pre-phased 1000 Genomes data (using, e.g. SHAPEIT's --reference or Beagle’s ref option; Browning et al. 2021) rather than by pooling them with the admixed individuals (as in the rephasing strategy). That is, there is an improvement in phase quality alone by including these reference haplotypes, perhaps because they increase the sample size. Additionally, conditioning on reference haplotypes or rephasing them makes the target individuals’ phase more consistent with these panels and is therefore recommended for LAI and other downstream processes (such as imputation) that utilize reference panels.
A combination of both switch errors breaking up ancestry tracts and reduced accuracy of phase-based LAI methods can substantially impact downstream analyses that leverage these data, such as estimating admixture times. Our findings show that even with good phase quality (i.e. phasing sample sizes of 5,810 individuals), admixture time estimates derived from FLARE’s output are biased by more than an order of magnitude relative to those based on HAPMIX calls (Fig. 6). Indeed, concern over phase-based LAI quality motivated us to develop PAPI for estimating time since admixture and parent ancestry proportions using unphased local ancestry calls (Avadhanam and Williams 2022). Because of these issues, many estimates of time since admixture use trio-phased data (Gravel 2012), but given the quality of phase-free methods, tools such as PAPI can leverage the more abundant nontrio samples to perform even sample-specific admixture time estimates.
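To make the mechanism concrete, the following Python sketch shows how spurious ancestry switches can inflate a time estimate. It uses a simple moment estimator under a simplified single-pulse model (an assumption for illustration; it is neither PAPI's model nor Equation (3) of this study): with African proportion p and a haplotype of genetic length L Morgans, the expected number of ancestry switches is 2p(1 − p)TL, so T can be estimated from the observed switch count. The function name and all numeric values are hypothetical.

```python
def estimate_T_from_switches(n_switches, p, length_morgans):
    """Moment estimate of generations since admixture from the number of
    ancestry switches on one haplotype (simplified single-pulse model)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if length_morgans <= 0:
        raise ValueError("length_morgans must be positive")
    # Expected switches per Morgan per generation is 2 * p * (1 - p).
    return n_switches / (2.0 * p * (1.0 - p) * length_morgans)


if __name__ == "__main__":
    p, L = 0.8, 35.0            # illustrative proportion and map length
    true_switches = 112         # matches T = 10 under this model
    spurious = 100              # extra switches caused by phase errors
    print(estimate_T_from_switches(true_switches, p, L))
    print(estimate_T_from_switches(true_switches + spurious, p, L))
```

Because every uncorrected switch error inside a region of heterozygous ancestry adds an ancestry switch to each haplotype, the estimate can only move upward, which matches the direction of the bias observed for phase-based tracts.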
The difference in accuracy between phase-based and phase-free LAI is substantial enough that careful deliberation on whether to apply phase-based LAI may be warranted, depending on the setting. This may especially be true when analyzing nonmodel organisms where sample sizes are sometimes small and, therefore, phase quality low. Our results suggest that HAPMIX, in particular, is considerably more accurate than the other methods in all parameter settings we considered (see Results), making it a clear choice when high-quality LAI calls are required but high-quality phase is unavailable. Of course, a key limitation of HAPMIX is that it applies only to two-way admixed samples; yet LAMP-LD is a competitive alternative ( on average versus HAPMIX’s ) that can perform LAI in multiway admixed samples. We note that our primary metric is the per-site correlation of the true local ancestry with each method’s local ancestry calls, where we calculated the expected call per-site for HAPMIX since its output is probabilistic. HAPMIX’s performance may benefit from this choice of metric, but this reflects its more informative probabilistic output—information that can be incorporated in downstream applications. When considering which LAI method to apply, another important factor is runtime. While phase-free methods may be slower than phase-based ones, it is worth noting that the latter require the input to be phased, which can be a computationally intensive step.
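The metric described above can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that probabilistic output is available as a per-site array of posterior probabilities for carrying 0, 1, or 2 African-ancestry haplotypes (this is not HAPMIX's actual file format), and that the truth and hard calls are diploid counts in {0, 1, 2}. The function names are hypothetical.

```python
import numpy as np


def expected_dosage(posteriors):
    """Expected number of African-ancestry copies per site from an
    (n_sites, 3) array of probabilities for 0, 1, and 2 copies."""
    post = np.asarray(posteriors, dtype=float)
    if post.ndim != 2 or post.shape[1] != 3:
        raise ValueError("posteriors must have shape (n_sites, 3)")
    return post @ np.array([0.0, 1.0, 2.0])


def per_site_pearson_r(truth, calls):
    """Pearson correlation between true and inferred diploid ancestry."""
    truth = np.asarray(truth, dtype=float)
    calls = np.asarray(calls, dtype=float)
    return float(np.corrcoef(truth, calls)[0, 1])


if __name__ == "__main__":
    truth = [0, 1, 2, 2, 1]
    hard_calls = [0, 1, 2, 1, 1]
    post = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9],
            [0.0, 0.3, 0.7], [0.2, 0.7, 0.1]]
    print(per_site_pearson_r(truth, hard_calls))
    print(per_site_pearson_r(truth, expected_dosage(post)))
```

An expected dosage can take fractional values, so a site with an uncertain call contributes a partial rather than an all-or-nothing error to the correlation, which is how a probabilistic method may benefit from this metric.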
There are notable limitations to this study. The sample sizes we used yield good but far less accurate phase than is available with biobank-scale data (Delaneau et al. 2019; Browning et al. 2021), and phase-based LAI performance may be competitive with phase-free methods given such well-phased data. Indeed, we found that RFMix outperforms LAMP-LD when using perfectly phased data (mirroring the results of a previous study; Schubert et al. 2020), while in our statistically phased samples, LAMP-LD performs better, consistent with earlier work (Dias-Alves et al. 2018; Fig. 5). Further, we focused on two-way admixed samples, examining the archetypal case of admixture between relatively divergent populations (Africans and Europeans) where we can expect the highest quality LAI. In that regard, our work can be viewed as a best-case empirical analysis for sub-biobank-scale data, and demonstrates that challenges for phase-based methods apply even in such settings. Our simulated admixed samples are generated from a subset of the unadmixed 1000 Genomes individuals, and we excluded these individuals from the LAI reference panels, reducing their size. The size of these panels typically impacts LAI quality, and previous work found that MOSAIC is more robust to small reference panels than both RFMix and FLARE (Browning et al. 2023). Thus, the phase-based method MOSAIC may outperform these methods in our analyses. Additionally, we considered investigating three-way admixture such as that of Latin American populations, but due to a lack of both unadmixed Indigenous American reference panel data and three-way admixed individuals where all samples are genotyped on substantial numbers of overlapping sites, we excluded this analysis from the current study (data for admixed samples are required to realistically phase the simulated individuals).
Finally, our analysis considered only recent admixture times (), and future work studying an expanded range of such times and multiway admixture would be valuable.
This work highlights important concerns that can arise in LAI from phase quality issues—even when the phasing is done with several thousand genotyped samples. Future LAI methods development could focus on new phase-free approaches or on phase-based methods that internally model and correct the switch errors most likely to impact accuracy, such as those that change the ancestry of the underlying haplotypes. An effective approach would be similar to the corrections TRACTOR makes (while having the advantage of access to the complete internal model) and would deviate from current approaches (Maples et al. 2013). Our findings underscore the importance of designing studies that apply LAI carefully, with due consideration to factors such as phase quality, utilization of reference panels, choice of LAI method, and admixture demography. Most crucially, we demonstrate that the phase quality of the admixed samples has a large impact on LAI accuracy for phase-based methods, an issue implicitly mitigated by phase-free LAI methods. Finally, while the current study analyzes human data, the issues we highlight here are particularly relevant in settings where high-quality reference datasets may not be readily available, such as for nonmodel organisms.
Acknowledgments
We thank Shai Carmi, Jeffrey Ross-Ibarra, and the anonymous reviewer for their helpful feedback on the manuscript. Computing was performed on a cluster administered by the Biotechnology Resource Center at Cornell University. Samples and data of The Charles Bronfman Institute for Personalized Medicine (IPM) BioMe BioBank used in this study were provided by The Charles Bronfman Institute for Personalized Medicine at the Icahn School of Medicine at Mount Sinai (New York). Phenotype data collection was supported by The Andrea and Charles Bronfman Philanthropies. Funding support for genotyping, which was performed at The Center for Inherited Disease Research (CIDR), was provided by the NIH (U01HG007417). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000925.v1.p1.
Contributor Information
Siddharth Avadhanam, Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA.
Amy L Williams, Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Department of Computer Science, Brigham Young University, Provo, UT 84602, USA.
Data availability
Genotype data for the BioMe Biobank subset of the PAGE II dataset are available (dbGaP:phs000925.v1.p1), and the phased 1000 Genomes data used for local ancestry inference are publicly available at https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/. The simulated genotype data supporting the current study have not been deposited in a public repository but can be reproduced using scripts distributed with PAPI, Ped-sim (https://github.com/williamslab/ped-sim) and admix-simu (https://github.com/williamslab/admix-simu).
Supplemental material available at G3 online.
Funding
Funding for this work was provided by National Institutes of Health grant R35 GM133805.
Literature cited
- Alexander DH, Novembre J, Lange K. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19(9):1655–1664. 10.1101/gr.094052.109
- Atkinson EG, Maihofer AX, Kanai M, Martin AR, Karczewski KJ, Santoro ML, Ulirsch JC, Kamatani Y, Okada Y, Finucane HK, et al. 2021. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet. 53(2):195–204. 10.1038/s41588-020-00766-y
- 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, et al. 2015. A global reference for human genetic variation. Nature. 526(7571):68–74. 10.1038/nature15393
- Avadhanam S, Williams AL. 2022. Simultaneous inference of parental admixture proportions and admixture times from unphased local ancestry calls. Am J Hum Genet. 109(8):1405–1420. 10.1016/j.ajhg.2022.06.016
- Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, Rodriguez-Cintron W, Chapela R, Ford JG, Avila PC, et al. 2012. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 28(10):1359–1367. 10.1093/bioinformatics/bts144
- Browning SR, Browning BL. 2011. Haplotype phasing: existing methods and new developments. Nat Rev Genet. 12(10):703–714. 10.1038/nrg3054
- Browning BL, Tian X, Zhou Y, Browning SR. 2021. Fast two-stage phasing of large-scale sequence data. Am J Hum Genet. 108(10):1880–1890. 10.1016/j.ajhg.2021.08.005
- Browning SR, Waples RK, Browning BL. 2023. Fast, accurate local ancestry inference with FLARE. Am J Hum Genet. 110(2):326–335. 10.1016/j.ajhg.2022.12.010
- Bryc K, Durand EY, Macpherson JM, Reich D, Mountain JL. 2015. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet. 96(1):37–53. 10.1016/j.ajhg.2014.11.010
- Caballero M, Seidman DN, Qiao Y, Sannerud J, Dyer TD, Lehman DM, Curran JE, Duggirala R, Blangero J, Carmi S, et al. 2019. Crossover interference and sex-specific genetic maps shape identical by descent sharing in close relatives. PLoS Genet. 15(12):e1007979. 10.1371/journal.pgen.1007979
- Churchhouse C, Marchini J. 2013. Multiway admixture deconvolution using phased or unphased ancestral panels. Genet Epidemiol. 37(1):1–12. 10.1002/gepi.21692
- Delaneau O, Zagury JF, Robinson MR, Marchini JL, Dermitzakis ET. 2019. Accurate, scalable and integrative haplotype estimation. Nat Commun. 10(1):5436. 10.1038/s41467-019-13225-y
- Dias-Alves T, Mairal J, Blum MG. 2018. Loter: a software package to infer local ancestry for a wide range of species. Mol Biol Evol. 35(9):2318–2326. 10.1093/molbev/msy126
- Gravel S. 2012. Population genetics models of local ancestry. Genetics. 191(2):607–619. 10.1534/genetics.112.139808
- Guan Y. 2014. Detecting structure of haplotypes and local ancestry. Genetics. 196(3):625–642. 10.1534/genetics.113.160697
- Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. 2023. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet. 55(7):1243–1249. 10.1038/s41588-023-01415-w
- Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. 2004. Design and analysis of admixture mapping studies. Am J Hum Genet. 74(5):965–978. 10.1086/420855
- Li N, Stephens M. 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 165(4):2213–2233. 10.1093/genetics/165.4.2213
- Loh PR, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, Schoenherr S, Forer L, McCarthy S, Abecasis GR, et al. 2016. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 48(11):1443–1448. 10.1038/ng.3679
- Maples B, Gravel S, Kenny E, Bustamante C. 2013. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 93(2):278–288. 10.1016/j.ajhg.2013.06.020
- Montana G, Pritchard JK. 2004. Statistical tests for admixture mapping with case-control and cases-only data. Am J Hum Genet. 75(5):771–789. 10.1086/425281
- Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WL, Ruczinski I, Fornage M, Siscovick DS, Zhu X, et al. 2011. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a breast cancer consortium. PLoS Genet. 7(4):e1001371. 10.1371/journal.pgen.1001371
- Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O’Brien SJ, Altshuler D, et al. 2004. Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 74(5):979–1000. 10.1086/420871
- Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, Beaty TH, Mathias R, Reich D, Myers S. 2009. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5(6):e1000519. 10.1371/journal.pgen.1000519
- Salter-Townshend M, Myers S. 2019. Fine-scale inference of ancestry segments without prior knowledge of admixing groups. Genetics. 212(3):869–889. 10.1534/genetics.119.302139
- Sankararaman S, Sridhar S, Kimmel G, Halperin E. 2008. Estimating local ancestry in admixed populations. Am J Hum Genet. 82(2):290–303. 10.1016/j.ajhg.2007.09.022
- Schubert R, Andaleon A, Wheeler HE. 2020. Comparing local ancestry inference models in populations of two- and three-way admixture. PeerJ. 8:e10090. 10.7717/peerj.10090
- Shriner D. 2017. Overview of admixture mapping. Curr Protoc Hum Genet. 94(1):1–23. 10.1002/0471142905.2017.94.issue-1
- Williams A. 2016. Admix-Simu: program to simulate admixture between multiple populations. https://github.com/williamslab/admix-simu.git.
- Williams CM, O’Connell J, Freyman WA, 23andMe Research Team, Gignoux CR, Ramachandran S, Williams AL. 2024. Phasing millions of samples achieves near perfect accuracy, enabling parent-of-origin classification of variants. bioRxiv. 10.1101/2024.05.06.592816, preprint: not peer reviewed.
- Winkler CA, Nelson GW, Smith MW. 2010. Admixture mapping comes of age. Annu Rev Genomics Hum Genet. 11(1):65–89. 10.1146/genom.2010.11.issue-1
- Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, Highland HM, Patel YM, Sorokin EP, Avery CL, et al. 2017. The Charles Bronfman Institute for Personalized Medicine (IPM) BioMe BioBank. dbGaP; phs000925.v1.p1 https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000925.v1.p1.
- Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, Highland HM, Patel YM, Sorokin EP, Avery CL, et al. 2019. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 570(7762):514–518. 10.1038/s41586-019-1310-4
- Zhong Y, Perera MA, Gamazon ER. 2019. On using local ancestry to characterize the genetic architecture of human traits: genetic regulation of gene expression in multiethnic or admixed populations. Am J Hum Genet. 104(6):1097–1115. 10.1016/j.ajhg.2019.04.009