This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Dec 29:2024.12.29.630632. [Version 1] doi: 10.1101/2024.12.29.630632

Coancestry superposed on admixed populations yields measures of relatedness at individual-level resolution

Danfeng Chen, John D Storey
PMCID: PMC11703181  PMID: 39763999

Abstract

The admixture model is widely applied to estimate and interpret population structure among individuals. Here we consider a “standard admixture” model that assumes the admixed populations are unrelated and also a generalized model, where the admixed populations themselves are related via coancestry (or covariance) of allele frequencies. The generalized model yields a potentially more realistic and substantially more flexible model that we call “super admixture”. This super admixture model provides a one-to-one mapping in terms of probability moments with the population-level kinship model, the latter of which is a general model of genome-wide relatedness and structure based on identity-by-descent. We introduce a method to estimate the super admixture model that is based on method of moments, does not rely on likelihoods, is computationally efficient, and scales to massive sample sizes. We apply the method to several human data sets and show that the admixed populations are indeed substantially related, implying the proposed method captures a new and important component of evolutionary history and structure in the admixture model. We show that the fitted super admixture model estimates relatedness between all pairs of individuals at a resolution similar to the kinship model. The super admixture model therefore provides a tractable, forward generating probabilistic model of complex structure and relatedness that should be useful in a variety of scenarios.

Keywords: admixture, coancestry, kinship, population structure

1. Introduction

Populations are structured when genotype frequencies do not follow Hardy-Weinberg proportions. This may be due to several factors, including finite population sizes, migration, and genetic drift [1, 2]. Our goal here is to develop a framework and estimation method of a forward generating probability process that captures the observed genetic structure and relatedness among a set of individuals in a population-based study.

The framework is based on covarying allele frequencies among populations [3] and individuals [4], which we will refer to as coancestry [3–5]. The data underlying the proposed method are single nucleotide polymorphism (SNP) genotypes measured throughout the genome on a set of individuals. The aim is to formulate and estimate a model of the underlying process that leads to individual-specific allele frequencies (IAFs), which are parameters consisting of possibly distinct allele frequencies for every individual-SNP pair. IAFs have been formulated in previous work [6, 7] and they are the estimation target in several established admixture methods [8–10], a genome-wide association test for structured populations [11], and a test of structural Hardy-Weinberg equilibrium [12].

A joint probability distribution of the IAFs under a neutral model has been developed that yields covariances for all pairs of IAFs, parameterized by ancestral allele frequencies and coancestry parameters [4, 5]. This model produces a one-to-one mapping with the kinship parameters from the identity-by-descent model [13, 14], excluding close familial genetic relationships. This coancestry model therefore captures pairwise individual-level structure and relatedness equivalent to the kinship model. However, similarly to the kinship model, the coancestry parameterization is in terms of expected values, variances, and covariances of the IAFs and genotypes. It does not explicitly define a forward-generating probability model of IAFs.

Admixture models have been explored as a possible way to define such a forward-generating probability model [4, 5]. The products of an admixture model are individual-specific admixture proportions and population-specific allele frequencies. The IAFs are modeled as a weighted average of these antecedent population allele frequencies, with weights given by the individual-specific admixture proportions. Several methods treat the admixture proportions and antecedent population allele frequencies as unknown parameters without explicitly making any assumptions about their random distributions [8–10]. Other methods place a prior probability distribution on them for Bayesian model fitting purposes [15–17]; however, these Bayesian methods do not include these prior distributions as an inference target.

In considering a model of random antecedent population allele frequencies, one could assume that the allele frequencies are independently generated among all antecedent populations based on a common set of parameters (e.g., independent draws from the Balding-Nichols distribution [18]). We will call this assumption the “standard admixture” model. However, this standard admixture model may be overly restrictive; rather, one could implement a coancestry model of the antecedent allele frequencies according to pairwise covariances [4, 5]. We will call this model the “super admixture” model, as coancestry (or covariance) is superposed on the admixed antecedent populations. Fig. 1 displays a schematic of these models.

Figure 1:


Graphical representations of the coancestry model, the standard admixture model, and the super admixture model. (A) In the coancestry model, individuals in the present-day population are connected by a complex genealogy. (B) In the standard admixture model, the arrows connecting $T$ with $S_1,\ldots,S_K$ reflect that the antecedent populations evolved independently from $T$. Arrows connecting $S_1,\ldots,S_K$ with individuals in the present-day population reflect that these individuals were admixed from independent antecedent populations. (C) In the super admixture model, dashed lines connecting all pairs of antecedent populations reflect that antecedent populations have coancestry parameterized by $\Lambda$. Arrows connecting $S_1,\ldots,S_K$ with individuals in the present-day population reflect that these individuals were admixed from covarying antecedent populations.

Here, we develop a method that estimates the parameters in the super admixture model, which includes the standard admixture model as a special case. The method is based on method of moments estimation and geometric considerations, so it does not make assumptions about the probability distributions of the parameters and it does not involve costly likelihood maximization computations. Likelihood maximization is the most common approach used in fitting admixture models [8, 9, 15–17], but we build from a recently proposed distribution-free moment-based method, called ALStructure, that only uses linear projections and geometric constraints on parameters to estimate the model [10]. ALStructure compares favorably with likelihood-based methods (even in achieving a high likelihood) and can be tractably scaled to massive data sets. Our proposed super admixture method complements this framework and has similar advantages.

We establish super admixture through computational studies and analyses of data sets, including the human genome diversity panel (HGDP) [19], the 1000 genomes project (TGP) [20], the Human Origins study (HO) [21], and a study on individuals with Indian ancestry (IND) [22]. We show on all of these data sets that the super admixture method is capable of capturing the same relatedness and structure as a model-free individual-level coancestry estimator [4], whereas the standard admixture model does not. We demonstrate that the framework can generate bootstrap genotypes that retain the structure seen in the human studies. For example, Fig. 2 shows these results on the HO study. We show that the coancestry among antecedent populations estimated by super admixture yields new insights and visualizations of structure previously unavailable; see, for example, Fig. 3 on the HO study. We develop and perform a statistical test to demonstrate on the studies that coancestry among the admixed antecedent populations is statistically different from zero to a high degree of significance.

Figure 2:


Heatmaps of individual-level coancestry estimates in the HO data set.

Figure 3:


(A) Heatmap of antecedent population coancestry estimates in the HO data set. (B) Dendrogram representation of the antecedent population coancestry estimates. (C) Stacked bar plot of admixture proportions.

Our proposed framework makes several contributions: (i) a distribution-free framework that can account for arbitrarily complex relationships among the admixed antecedent populations in the admixture model; (ii) admixture-based estimation of individual-level pairwise coancestry at a resolution equivalent to general, model-free coancestry and kinship; (iii) a partitioning of the super admixture model into evolutionary, genealogical, and statistical sampling components; and (iv) a tractable algorithm to form bootstrap samples of genotypes from the estimated evolutionary process.

2. Super admixture framework

Here, we first introduce the data and models, and then we detail the proposed framework. We describe how the framework is used to estimate the super admixture model, generate parameters and data from the model, and perform a hypothesis test of the standard versus super admixture models.

2.1. Coancestry

We assume that $m$ SNPs are measured on $n$ individuals. The genotype measurements are denoted by $x_{ij}$ for $i=1,\ldots,m$ and $j=1,\ldots,n$. For each SNP, one of the alleles is counted as a 0 and the other as a 1, implying that the SNP genotypes are $x_{ij}\in\{0,1,2\}$, where $x_{ij}=0$ is homozygous for the 0 allele, $x_{ij}=1$ is a heterozygote, and $x_{ij}=2$ is homozygous for the 1 allele. We assume that $\mathrm{E}[x_{ij}\mid\pi_{ij}]=2\pi_{ij}$ for IAF $\pi_{ij}$. This IAF parameterization allows each individual-SNP pair to possibly have a distinct allele frequency. The classical scenario where there is one allele frequency per SNP is a special case where $\pi_{i1}=\pi_{i2}=\cdots=\pi_{in}$. The conditional expected value $\mathrm{E}[x_{ij}\mid\pi_{ij}]=2\pi_{ij}$ also allows for the IAFs $\pi_{ij}$ to be random parameters, which we assume here.

We utilize an existing coancestry model where the IAFs are random parameters with respect to some ancestral population T that is common to all n individuals [4, 5]. This is a neutral model where

$$\mathrm{E}[\pi_{ij}\mid T]=a_i \tag{1}$$
$$\mathrm{Cov}(\pi_{ij},\pi_{ik}\mid T)=a_i(1-a_i)\theta_{jk} \tag{2}$$

for $i=1,\ldots,m$ and $j,k=1,\ldots,n$. The parameter $a_i$ is the ancestral allele frequency in $T$ for SNP $i$ and $0\le\theta_{jk}\le1$ is the coancestry for individuals $j$ and $k$ with respect to $T$. (Note that the $a_i$ and $\theta_{jk}$ parameters depend on $T$ and could be different if conditioning on a different common ancestral population.) The coancestry model we utilize also makes the assumption used in previous work [4, 5, 7–12] that

$$x_{ij}\mid\pi_{ij}\sim\mathrm{Binomial}(2,\pi_{ij})$$

where the $x_{ij}$ are jointly independent. Under this model, it follows that

$$\mathrm{Cov}(x_{ij},x_{ik}\mid T)=\begin{cases}2a_i(1-a_i)(1+\theta_{jj}) & \text{if } j=k,\\ 4a_i(1-a_i)\theta_{jk} & \text{if } j\neq k.\end{cases}$$
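As a brief sketch of why these two cases hold, the following derivation uses only Eqs. (1) and (2), the Binomial assumption, and the law of total variance and covariance; it is an illustrative check rather than part of the original development.

```latex
\begin{aligned}
\mathrm{Var}(x_{ij}\mid T)
  &= \mathrm{E}\!\left[\mathrm{Var}(x_{ij}\mid\pi_{ij})\mid T\right]
   + \mathrm{Var}\!\left(\mathrm{E}[x_{ij}\mid\pi_{ij}]\mid T\right) \\
  &= \mathrm{E}\!\left[2\pi_{ij}(1-\pi_{ij})\mid T\right] + 4\,\mathrm{Var}(\pi_{ij}\mid T) \\
  &= 2a_i(1-a_i)(1-\theta_{jj}) + 4a_i(1-a_i)\theta_{jj} \\
  &= 2a_i(1-a_i)(1+\theta_{jj}), \\[4pt]
\mathrm{Cov}(x_{ij},x_{ik}\mid T)
  &= \mathrm{Cov}(2\pi_{ij},2\pi_{ik}\mid T)
   = 4a_i(1-a_i)\theta_{jk} \qquad (j\neq k).
\end{aligned}
```

The third line uses $\mathrm{E}[\pi_{ij}^2\mid T]=a_i^2+a_i(1-a_i)\theta_{jj}$, and the $j\neq k$ case uses the independence of the genotypes given the IAFs, so the conditional covariance term vanishes.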

A one-to-one mapping exists with the identity-by-descent kinship model (often used in GWAS methods), whose parameters are denoted by $\phi_{jk}$, by matching variances and covariances [4, 5]. The parameters map so that

$$\theta_{jk}=\begin{cases}2\phi_{jk}-1 & \text{if } j=k,\\ \phi_{jk} & \text{if } j\neq k.\end{cases} \tag{3}$$

When $\min_{j\neq k}\theta_{jk}=0$, then $T$ is the most recent common ancestral population [4]. The full set of parameters is denoted by the $n\times n$ symmetric matrix $\Theta$ with $(j,k)$ entry $\theta_{jk}$.
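The mapping in Eq. (3) can be illustrated with a minimal Python sketch. The function names and the two-individual kinship matrix below are hypothetical and chosen only for illustration.

```python
def kinship_to_coancestry(phi):
    """Map a kinship matrix (nested lists) to coancestry following Eq. (3)."""
    n = len(phi)
    return [
        [2.0 * phi[j][k] - 1.0 if j == k else phi[j][k] for k in range(n)]
        for j in range(n)
    ]


def coancestry_to_kinship(theta):
    """Inverse map: diagonal entries become (1 + theta_jj) / 2."""
    n = len(theta)
    return [
        [(theta[j][k] + 1.0) / 2.0 if j == k else theta[j][k] for k in range(n)]
        for j in range(n)
    ]


# Hypothetical values: two individuals with self-kinship 0.55 and 0.50.
phi = [[0.55, 0.05],
       [0.05, 0.50]]
theta = kinship_to_coancestry(phi)
```

Only the diagonal entries change under the map; off-diagonal kinship and coancestry values coincide.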

2.2. Admixture models

General admixture

We first describe a general formulation of the admixture model, of which standard and super admixture are special cases. There are $K$ populations $S_1,S_2,\ldots,S_K$ descended from $T$ that precede the present-day population, which we refer to as “antecedent populations”. While $T$ has allele frequencies $a_1,a_2,\ldots,a_m$, antecedent population $S_u$ has allele frequencies $p_{1u},p_{2u},\ldots,p_{mu}$ for $u=1,2,\ldots,K$. The allele frequencies $p_{iu}$ are random parameters from a distribution parameterized by $a_i$ plus other possible parameters that characterize the evolutionary process from $T$ to $S_u$.

For each individual $j$, there is a genealogical process from population $T$ to the present-day population. This is captured by a random $K$-vector $(q_{1j},q_{2j},\ldots,q_{Kj})$ of admixture proportions, where $0\le q_{uj}\le1$ and $\sum_{u=1}^{K}q_{uj}=1$. The parameter $q_{uj}$ is the proportion of individual $j$ randomly descended from $S_u$. Therefore, the IAFs are such that

$$\pi_{ij}=\sum_{u=1}^{K}p_{iu}q_{uj}. \tag{4}$$

Collecting the antecedent population allele frequencies into the $m\times K$ matrix $P$ and the admixture proportions into the $K\times n$ matrix $Q$, it follows that

$$\Pi=PQ,$$

where $\Pi$ is an $m\times n$ matrix with $(i,j)$ entry $\pi_{ij}$.

Standard admixture

We define the standard admixture model to be the case where the antecedent allele frequencies are independently distributed. Specifically, in this model $p_{iu}$ is a random parameter with mean $a_i$ and variance $a_i(1-a_i)f_u$. The standard admixture model is defined as follows for $i=1,2,\ldots,m$ and $u=1,2,\ldots,K$.

$$\text{Standard Admixture:}\quad\begin{cases}p_{i1},p_{i2},\ldots,p_{iK}\text{ are jointly independent}\\ \mathrm{E}[p_{iu}\mid T]=a_i\\ \mathrm{Var}(p_{iu}\mid T)=a_i(1-a_i)f_u\end{cases}$$

Under this parameterization, $a_i$ is the ancestral allele frequency in $T$ and $f_u$ is the inbreeding coefficient or $F_{\mathrm{ST}}$ of antecedent population $S_u$ with respect to $T$. Since the $p_{iu}$ are jointly independent, there is no coancestry among antecedent populations and there is no dependence among loci.

One well-known distribution that could be utilized here is the Balding-Nichols (BN) distribution [18] with parameters $a_i$ and $f_u$:

$$p_{iu}\sim\mathrm{Beta}\left(\frac{1-f_u}{f_u}a_i,\ \frac{1-f_u}{f_u}(1-a_i)\right). \tag{5}$$

We will write this re-parameterized Beta distribution as $\mathrm{BN}(a_i,f_u)$. This achieves the expected value and variance of the standard admixture definition. The Balding-Nichols distribution is often used to generate allele frequencies for a set of populations to achieve desired expected allele frequencies and $F_{\mathrm{ST}}$ values. This distribution has been discussed as useful for generating antecedent allele frequencies in the standard admixture model [4–7, 23].
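A minimal Python sketch of BN sampling as parameterized in Eq. (5) is given below. The function name `draw_bn`, the seed, and the values of `a` and `f` are assumptions for illustration; the empirical moments are compared against the targets $a_i$ and $a_i(1-a_i)f_u$.

```python
import random


def draw_bn(a, f, rng=random):
    """Draw one allele frequency from BN(a, f) = Beta((1-f)/f * a, (1-f)/f * (1-a))."""
    if not (0.0 < a < 1.0 and 0.0 < f < 1.0):
        raise ValueError("a and f must lie strictly between 0 and 1")
    scale = (1.0 - f) / f
    return rng.betavariate(scale * a, scale * (1.0 - a))


# Hypothetical settings: ancestral frequency 0.3 and F_ST 0.1.
rng = random.Random(0)
a, f = 0.3, 0.1
draws = [draw_bn(a, f, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((p - mean) ** 2 for p in draws) / len(draws)
# Targets under the standard admixture definition: mean a, variance a * (1 - a) * f.
target_var = a * (1.0 - a) * f
```

With many draws, `mean` and `var` should lie close to `a` and `target_var`, which is the moment-matching property the text relies on.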

Super admixture

The super admixture model extends the standard admixture model in that it includes a covariance among antecedent population allele frequencies, which we refer to as population-level coancestry. While we denoted individual-level coancestry by $\theta_{jk}$, we will denote population-level coancestry by $\lambda_{uv}$ for $u,v=1,2,\ldots,K$ where $0\le\lambda_{uv}\le1$. We collect these values into the $K\times K$ symmetric coancestry matrix $\Lambda$. The super admixture model is defined as follows for $i=1,2,\ldots,m$ and $u,v=1,2,\ldots,K$.

$$\text{Super Admixture:}\quad\begin{cases}p_{i1},p_{i2},\ldots,p_{iK}\text{ are jointly dependent}\\ \mathrm{E}[p_{iu}\mid T]=a_i\\ \mathrm{Var}(p_{iu}\mid T)=a_i(1-a_i)\lambda_{uu}\\ \mathrm{Cov}(p_{iu},p_{iv}\mid T)=a_i(1-a_i)\lambda_{uv}\end{cases} \tag{6}$$

In this model we assume that allele frequencies between loci are independent, so the random $K$-vectors $(p_{h1},p_{h2},\ldots,p_{hK})$ and $(p_{i1},p_{i2},\ldots,p_{iK})$ are independent for $h\neq i$. Thus, a potential generalization of the super admixture model is to include dependence among loci. Otherwise, the super admixture model is general in that it allows for the full range of coancestry values among antecedent populations.

Forward generating probability process

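We now describe the super admixture model as a forward generating probability process. Suppose that the admixture proportions $Q$ are drawn from some probability distribution $\mathcal{Q}$. Then, for $i=1,2,\ldots,m$ and $j=1,2,\ldots,n$:

$$\begin{aligned}(p_{i1},p_{i2},\ldots,p_{iK})&\sim(a,\Lambda)\\ (q_{1j},q_{2j},\ldots,q_{Kj})&\sim\mathcal{Q}\\ \pi_{ij}&=\sum_{u=1}^{K}p_{iu}q_{uj}\\ x_{ij}\mid\pi_{ij}&\sim\mathrm{Binomial}(2,\pi_{ij})\end{aligned}$$

The joint probability of all random quantities can be factored as follows:

$$\Pr(X,Q,P\mid T,\mathcal{Q})=\Pr(P\mid T)\,\Pr(Q\mid\mathcal{Q})\,\Pr(X\mid P,Q).$$

One interpretation of this is that $\Pr(P\mid T)$ represents evolutionary sampling, $\Pr(Q\mid\mathcal{Q})$ represents genealogical sampling, and $\Pr(X\mid P,Q)$ represents statistical sampling.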

Individual-level coancestry in the admixture models

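Recall that in the covariance model, the covariance of two IAFs for a given SNP is $\mathrm{Cov}(\pi_{ij},\pi_{ik}\mid T)=a_i(1-a_i)\theta_{jk}$, shown in Eq. (2). Conditioning on the admixture proportions $Q$, which are ancillary to allele frequencies, this covariance under the super admixture model is, for $j,k=1,2,\ldots,n$,

$$\begin{aligned}\mathrm{Cov}(\pi_{ij},\pi_{ik}\mid Q,T)&=\mathrm{Cov}\left(\sum_{u=1}^{K}p_{iu}q_{uj},\ \sum_{v=1}^{K}p_{iv}q_{vk}\ \middle|\ Q,T\right)\\&=\sum_{u=1}^{K}\sum_{v=1}^{K}q_{uj}q_{vk}\,\mathrm{Cov}(p_{iu},p_{iv}\mid T)\\&=a_i(1-a_i)\sum_{u=1}^{K}\sum_{v=1}^{K}q_{uj}q_{vk}\lambda_{uv}.\end{aligned} \tag{7}$$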

By setting the covariance from Eq. (2) equal to Eq. (7), it follows that under the super admixture model the individual-level coancestry is the following.

$$\text{Super Admixture Individual-level Coancestry:}\quad\theta_{jk}=\sum_{u=1}^{K}\sum_{v=1}^{K}q_{uj}q_{vk}\lambda_{uv} \tag{8}$$

In the standard admixture model, $\mathrm{Var}(p_{iu}\mid T)=a_i(1-a_i)f_u$, whereas in the super admixture model $\mathrm{Var}(p_{iu}\mid T)=a_i(1-a_i)\lambda_{uu}$. If we set $f_u=\lambda_{uu}$, the difference between the standard and super admixture models is therefore that in the standard model, $\lambda_{uv}=0$ for $u\neq v$. To work with a single notation, we will therefore write $\lambda_{uu}$ in place of $f_u$ for the standard admixture model. The coancestry in this model is as follows.

$$\text{Standard Admixture Individual-level Coancestry:}\quad\theta_{jk}=\sum_{u=1}^{K}q_{uj}q_{uk}\lambda_{uu}\qquad(\lambda_{uv}=0\text{ for }u\neq v) \tag{9}$$

Considering all pairs of individuals simultaneously, the individual-level coancestry matrix $\Theta$ can be written in terms of the antecedent population-level coancestry $\Lambda$ and the admixture proportions $Q$ as

$$\Theta=Q^{\top}\Lambda Q,$$

which is an important relationship we utilize to estimate Λ.
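To make the contrast between Eqs. (8) and (9) concrete, the following Python sketch evaluates the individual-level coancestry for a small hypothetical example with $K=2$ antecedent populations and $n=3$ individuals. The function name and all numerical values are illustrative assumptions.

```python
def individual_coancestry(Q, Lam):
    """Theta = Q^T Lam Q for Q (K x n) and Lam (K x K), given as nested lists."""
    K, n = len(Q), len(Q[0])
    return [
        [
            sum(Q[u][j] * Lam[u][v] * Q[v][k] for u in range(K) for v in range(K))
            for k in range(n)
        ]
        for j in range(n)
    ]


# Hypothetical admixture proportions: individuals 0 and 2 are unadmixed,
# individual 1 is an even mix of the two antecedent populations.
Q = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
Lam_sup = [[0.10, 0.04],   # super admixture: off-diagonal coancestry allowed
           [0.04, 0.20]]
Lam_std = [[0.10, 0.00],   # standard admixture: off-diagonal entries are zero
           [0.00, 0.20]]

theta_sup = individual_coancestry(Q, Lam_sup)
theta_std = individual_coancestry(Q, Lam_std)
```

In this example the two unadmixed individuals from different antecedent populations have coancestry 0.04 under the super admixture parameters and 0 under the standard admixture parameters, which is the kind of relatedness the off-diagonal entries of $\Lambda$ carry.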

2.3. Estimating coancestry among antecedent populations

Here, we propose a method to estimate the antecedent population-level coancestry $\Lambda$ under the super admixture model, with the standard admixture model estimate as a special case. The rationale is to leverage the relationship $\Theta=Q^{\top}\Lambda Q$. Given values for $\Theta$ and $Q$, we identify values of $\Lambda$ that make $Q^{\top}\Lambda Q$ close to $\Theta$, while obeying the geometric constraints of $\Lambda$ (i.e., $0\le\lambda_{uv}\le1$ and $\lambda_{uv}=\lambda_{vu}$).

Given values for $\Theta$ and $Q$, we formulate the problem of estimating the antecedent population-level coancestry $\Lambda$ under the super admixture model as follows.

Problem 1.

$$\min_{\Lambda\in\mathbb{R}^{K\times K}}\ \|\Theta-Q^{\top}\Lambda Q\|_F^2\quad\text{subject to: }0\le\lambda_{uv}\le1\text{ and }\lambda_{uv}=\lambda_{vu}\text{ for }u,v=1,\ldots,K$$

where $\|\cdot\|_F$ represents the Frobenius norm defined in Appendix A.1. We utilize the proximal forward-backward (PFB) method [24] to solve this optimization problem, resulting in Algorithm 1. Every sequence $\{\Lambda_t\}_{t\in\mathbb{N}}$ generated by this algorithm is guaranteed to converge to a solution of Problem 1. The PFB method and its application to our setting are detailed in Appendix B.2. The performance of Algorithm 1 is demonstrated in Appendix C.

Algorithm 1:

Estimating Λ for the super admixture model given Θ and Q

input: Coancestry matrix Θ and admixture proportions matrix Q
1 let $L=\sigma_{\max}^{4}(Q)$
2 let $\Lambda_0\leftarrow(QQ^{\top})^{-1}Q\,\Theta\,Q^{\top}(QQ^{\top})^{-1}$
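3 for $t=1,2,\ldots$ do
4   $G\leftarrow2Q(Q^{\top}\Lambda_{t-1}Q-\Theta)Q^{\top}$
5   $\Lambda^{*}\leftarrow\Lambda_{t-1}-\frac{1}{L}G$
6   $\Lambda_t=\{\lambda_{uv,t}\}$ where $\lambda_{uv,t}=\max(0,\min(1,\lambda^{*}_{uv}))$
7 return $\Lambda_t$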

$\sigma_{\max}(\cdot)$ denotes the maximum singular value (Appendix A.1).
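The projected-gradient idea behind Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name, the stopping rule, the iteration cap, and the example matrices are assumptions. The sketch also assumes a least-squares starting value and a conservative step size of $1/(2\sigma_{\max}^{4}(Q))$, the reciprocal of the Lipschitz constant of the gradient of the squared Frobenius objective, which may differ from the constant used in Algorithm 1.

```python
import numpy as np


def estimate_lambda(theta, Q, n_iter=1000, tol=1e-12):
    """Projected-gradient sketch for min ||theta - Q^T Lam Q||_F^2 with 0 <= Lam <= 1."""
    # Assumption of this sketch: step 1/L with L = 2 * sigma_max(Q)^4.
    L = 2.0 * np.linalg.norm(Q, 2) ** 4
    QQt_inv = np.linalg.inv(Q @ Q.T)
    lam = QQt_inv @ Q @ theta @ Q.T @ QQt_inv        # unconstrained least-squares start
    for _ in range(n_iter):
        grad = 2.0 * Q @ (Q.T @ lam @ Q - theta) @ Q.T
        lam_new = np.clip(lam - grad / L, 0.0, 1.0)  # gradient step, then box projection
        converged = np.max(np.abs(lam_new - lam)) < tol   # assumed stopping rule
        lam = lam_new
        if converged:
            break
    return lam


# Hypothetical example with K = 2 antecedent populations and n = 4 individuals.
Q = np.array([[0.9, 0.6, 0.3, 0.1],
              [0.1, 0.4, 0.7, 0.9]])
lam_true = np.array([[0.10, 0.04],
                     [0.04, 0.20]])
theta = Q.T @ lam_true @ Q
lam_hat = estimate_lambda(theta, Q)
```

When $\Theta$ is exactly of the form $Q^{\top}\Lambda Q$ with a feasible $\Lambda$ and $Q$ has full row rank, the starting value already solves the problem; the iterations matter when $\Theta$ is a noisy estimate and the box constraints become active.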

To estimate all components of the super admixture model, one needs estimates of the $n\times n$ individual-level coancestry matrix $\Theta$, the $K\times K$ antecedent population-level coancestry matrix $\Lambda$, the $m\times K$ matrix of antecedent population allele frequencies $P$, and the $K\times n$ matrix of admixture proportions $Q$. There exists a wide range of methods for estimating $P$ and $Q$ [9, 10, 15, 17, 25]. Here, we utilize the ALStructure method [10], which implements method of moments and geometric constraints to estimate $Q$ similarly to our approach here. In that method, a linear basis of $Q$ is determined from $X$ that has theoretical guarantees to span the true basis as the number of SNPs $m$ grows large. A projection-based estimate $\hat{\Pi}$ of the IAFs is also formed. The quantity $\|\hat{\Pi}-PQ\|_F$ is then algorithmically minimized through geometrically constrained alternating least squares to yield estimates $\hat{Q}$ and $\hat{P}$.

We utilize the structural Hardy-Weinberg (sHWE) test [12] for determining the number of antecedent populations $K$, as outlined in that work. The approach is to consider a range of $K$ values to test the assumption that $x_{ij}\mid\pi_{ij}\sim\mathrm{Binomial}(2,\pi_{ij})$ based on the estimates $\hat{\pi}_{ij}$ and a goodness-of-fit statistic with a parametric bootstrap null distribution; $K$ is then parsimoniously chosen to satisfy this modeling assumption from a genome-wide perspective. A method of moments estimator of $\Theta$ was derived in ref. [4], where it was shown to have favorable properties and to be consistent for the true values under certain assumptions. We denote this Ochoa-Storey (OS) estimate by $\hat{\Theta}^{\mathrm{OS}}$ and review its details in Appendix B.1. If one has alternative ways to estimate $\Theta$ and $Q$, and to determine $K$, then those can be used within our framework as well.

Algorithm 2:

Estimating Λ for the super admixture model given X

input: Genotype matrix X
1 calculate the OS estimate of individual-level coancestry $\hat{\Theta}^{\mathrm{OS}}$
2 choose $K$ from the structural Hardy-Weinberg (sHWE) goodness-of-fit procedure
3 calculate the estimate $\hat{Q}$ for $K$ via the ALStructure method
4 calculate the estimate $\hat{\Lambda}^{\mathrm{sup}}$ by applying Algorithm 1 with inputs $\hat{\Theta}^{\mathrm{OS}}$ and $\hat{Q}$
5 return $\hat{\Lambda}^{\mathrm{sup}}$

Note that one can further calculate a corresponding estimate for individual-level coancestry by

$$\hat{\Theta}^{\mathrm{sup}}=\hat{Q}^{\top}\hat{\Lambda}^{\mathrm{sup}}\hat{Q},$$

which can be compared to $\hat{\Theta}^{\mathrm{OS}}$ in order to aid in model fit assessment.

We can estimate $\Lambda$ under the standard admixture model by modifying the constraints in Problem 1. This leads to Problem B.1 and Algorithm B.2 described in Appendix B.2. Algorithm 2 can then be used to form the estimate $\hat{\Lambda}^{\mathrm{std}}$ under the standard admixture model with Algorithm 1 in Line 4 replaced by Algorithm B.2. The corresponding estimate for individual-level coancestry can be calculated as $\hat{\Theta}^{\mathrm{std}}=\hat{Q}^{\top}\hat{\Lambda}^{\mathrm{std}}\hat{Q}$. The performance of Algorithm B.2 is also demonstrated in Appendix C.

2.4. Simulating antecedent population allele frequencies

We now introduce a method to generate antecedent population allele frequencies with given coancestry $\Lambda$. We noted above in Eq. (5) that for the standard admixture model, one way to generate allele frequencies $p_{i1},p_{i2},\ldots,p_{iK}$ is via independent realizations from the Balding-Nichols (BN) distribution: $p_{iu}\sim\mathrm{BN}(a_i,\lambda_{uu})$ for $u=1,2,\ldots,K$. As there is no default approach to extending this to the super admixture case, we propose a method here called “double-admixture”. The main idea of the method is that we form two layers of allele frequencies: the first layer is composed of independent draws from the BN distribution, and the second layer mixes these to create $p_{i1},p_{i2},\ldots,p_{iK}$ with coancestry $\Lambda$.

Let $S$ be the number of components that will be mixed, $W$ be the $S\times K$ matrix of mixture proportions, and $\Gamma$ an $S\times S$ diagonal matrix. The entries of $W$ are $w_{su}$ where $0\le w_{su}\le1$ and $\sum_{s=1}^{S}w_{su}=1$ for $u=1,2,\ldots,K$. The diagonal values of $\Gamma$ are represented by $\gamma_s$ where $0\le\gamma_s\le1$, and all other values are 0. Suppose that for $i=1,\ldots,m$ we generate

$$z_{is}\sim\mathrm{BN}(a_i,\gamma_s)$$

independently for $s=1,\ldots,S$, and we then set

$$p_{iu}=\sum_{s=1}^{S}z_{is}w_{su}$$

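for $u=1,\ldots,K$. It can be verified that

$$\begin{aligned}\mathrm{E}[p_{iu}]&=a_i,\quad u=1,\ldots,K\\ \mathrm{Cov}(p_{iu},p_{iv})&=a_i(1-a_i)\sum_{s=1}^{S}w_{su}w_{sv}\gamma_s\end{aligned}$$

for $u,v=1,2,\ldots,K$. By matching these equations with Eq. (6), one can see that if

$$\lambda_{uv}=\sum_{s=1}^{S}w_{su}w_{sv}\gamma_s \tag{10}$$

then $(p_{i1},p_{i2},\ldots,p_{iK})$ has coancestry $\Lambda$ as desired. In matrix terms, Eq. (10) is equivalent to

$$\Lambda=W^{\top}\Gamma W. \tag{11}$$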

Therefore, the double-admixture method is based on the following optimization problem.

Problem 2.

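$$\min_{W,\Gamma}\ \|\Lambda-W^{\top}\Gamma W\|_F^2\quad\text{subject to: }0\le w_{su}\le1,\ \sum_{s=1}^{S}w_{su}=1,\ \epsilon\le\gamma_s\le1-\epsilon\text{ for small }\epsilon>0,\quad\text{for }u=1,2,\ldots,K;\ s=1,2,\ldots,S$$

Algorithm 3: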

Calculating W and Γ in the double-admixture method

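input: Antecedent populations coancestry $\Lambda$, number of BN distributions $S$, step size parameters $\tau_1$ and $\tau_2$, and a small positive number $\epsilon$
1 let $\Gamma_0$ be an $S\times S$ diagonal matrix with diagonal elements drawn independently from $\mathrm{Uniform}(0,1)$
2 let $W_0$ be an $S\times K$ matrix whose columns $(w_{1u},w_{2u},\ldots,w_{Su})$ are drawn independently from $\mathrm{Dirichlet}(\mathbf{1})$
3 for $t=1,2,\ldots$ do
4   $L_1\leftarrow4\left(\|\Lambda\|_2\|\Gamma_{t-1}\|_2+3K\|\Gamma_{t-1}\|_2^2\right)$
5   $G_1\leftarrow-4\Gamma_{t-1}W_{t-1}\left(\Lambda-W_{t-1}^{\top}\Gamma_{t-1}W_{t-1}\right)$
6   $W^{*}\leftarrow W_{t-1}-\frac{1}{\tau_1L_1}G_1$
7   for $u=1,\ldots,K$ do
8     $w_{u,t}\leftarrow\mathcal{P}_{\Delta}(w^{*}_{u})$ where $w_{u,t}$ and $w^{*}_{u}$ are the corresponding columns of $W_t$ and $W^{*}$
9   $L_2\leftarrow2\|W_t\|_2^4$
10   $G_2\leftarrow-2W_t\left(\Lambda-W_t^{\top}\Gamma_{t-1}W_t\right)W_t^{\top}$
11   $\Gamma^{*}\leftarrow\Gamma_{t-1}-\frac{1}{\tau_2L_2}G_2$
12   $\gamma_{s,t}=\max(\epsilon,\min(1-\epsilon,\gamma^{*}_{ss}))$ for $s=1,2,\ldots,S$
13   $\Gamma_t=\mathrm{diag}(\gamma_{1,t},\gamma_{2,t},\ldots,\gamma_{S,t})$
14 return $\Gamma_t$ and $W_t$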

Here, we set $S=2K$, $\tau_1=\tau_2=1.1$, and $\epsilon=0.01$; users should investigate their own choices. $\|\cdot\|_2$ denotes the spectral norm and $\mathcal{P}_{\Delta}$ denotes projection onto the unit simplex (Appendix A.1).
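The alternating projected-gradient updates of Algorithm 3 can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the fixed iteration count, the seed, the sort-based simplex projection, the flat Dirichlet initialization, the returned objective history, and the example $\Lambda$ are all choices made here.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of a vector onto the unit simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)


def fit_double_admixture(lam, S, tau1=1.1, tau2=1.1, eps=0.01, n_iter=500, seed=0):
    """Alternate projected-gradient steps on W (S x K) and the diagonal of Gamma."""
    rng = np.random.default_rng(seed)
    K = lam.shape[0]
    gamma = rng.uniform(0.0, 1.0, size=S)          # diagonal of Gamma_0
    W = rng.dirichlet(np.ones(S), size=K).T        # columns on the simplex (assumed flat Dirichlet)
    lam_norm = np.linalg.norm(lam, 2)

    def objective(W, gamma):
        return float(np.sum((lam - W.T @ (gamma[:, None] * W)) ** 2))

    history = [objective(W, gamma)]
    for _ in range(n_iter):
        g_norm = np.max(np.abs(gamma))             # spectral norm of a diagonal matrix
        L1 = 4.0 * (lam_norm * g_norm + 3.0 * K * g_norm ** 2)
        GW = gamma[:, None] * W                    # Gamma @ W without forming Gamma
        G1 = -4.0 * GW @ (lam - W.T @ GW)
        W_star = W - G1 / (tau1 * L1)
        W = np.column_stack([project_simplex(W_star[:, u]) for u in range(K)])
        L2 = 2.0 * np.linalg.norm(W, 2) ** 4
        G2 = -2.0 * W @ (lam - W.T @ (gamma[:, None] * W)) @ W.T
        gamma = np.clip(gamma - np.diag(G2) / (tau2 * L2), eps, 1.0 - eps)
        history.append(objective(W, gamma))
    return W, gamma, history


# Hypothetical coancestry matrix for K = 3 antecedent populations, with S = 2K.
lam = np.array([[0.10, 0.04, 0.02],
                [0.04, 0.20, 0.05],
                [0.02, 0.05, 0.15]])
W_hat, gamma_hat, history = fit_double_admixture(lam, S=6)
```

The returned `history` records the squared Frobenius residual after each sweep, which gives a simple way to judge whether a given choice of $S$, $\tau_1$, and $\tau_2$ yields an adequate approximation of $\Lambda$.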

We adapted the proximal alternating linearized minimization (PALM) method [26] to solve Problem 2, resulting in Algorithm 3 for calculating the parameters in the double-admixture method. Every sequence $\{(W_t,\Gamma_t)\}_{t\in\mathbb{N}}$ generated by Algorithm 3 is guaranteed to converge to a critical point. Integrating Algorithm 3 with the generative steps for $p_{iu}$ described above, Algorithm 4 simulates antecedent population allele frequencies with the desired coancestry. In Appendix B.3, the PALM method is briefly introduced and the convergence of Algorithm 3 is proved.

Algorithm 4:

The double-admixture algorithm for simulating P

input: Ancestral allele frequencies $a$, coancestry among antecedent populations $\Lambda$, other input arguments for Algorithm 3
1 calculate $\hat{\Gamma}$ and $\hat{W}$ using Algorithm 3
2 for $i=1,\ldots,m$ do
3   generate $z_{is}\sim\mathrm{BN}(a_i,\hat{\gamma}_s)$ independently for $s=1,2,\ldots,S$
4   set $p_{iu}\leftarrow\sum_{s=1}^{S}z_{is}\hat{w}_{su}$ for $u=1,2,\ldots,K$
5 return $P$
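The generative steps of Algorithm 4 (Lines 2 to 4) can be sketched in NumPy as follows, taking $W$ and the diagonal of $\Gamma$ as given. The function name, the seed, and the numerical values of `W` and `gamma` are hypothetical; a constant ancestral frequency is used only so that the empirical covariance can be compared with $W^{\top}\Gamma W$.

```python
import numpy as np


def simulate_P(a, W, gamma, rng):
    """Two-layer draw: z_is ~ BN(a_i, gamma_s), then p_iu = sum_s z_is * w_su."""
    a = np.asarray(a, dtype=float)[:, None]        # m x 1
    scale = (1.0 - gamma) / gamma                  # length S
    Z = rng.beta(scale * a, scale * (1.0 - a))     # m x S, broadcast over loci and components
    return Z @ W                                   # m x K


# Hypothetical mixing weights (S = 3, K = 2; columns sum to 1) and BN variances.
W = np.array([[0.7, 0.2],
              [0.2, 0.3],
              [0.1, 0.5]])
gamma = np.array([0.10, 0.20, 0.05])
rng = np.random.default_rng(0)
m = 100_000
a = np.full(m, 0.3)                                # same ancestral frequency at every locus
P = simulate_P(a, W, gamma, rng)

# Empirical coancestry: covariance of the columns of P scaled by a(1 - a).
lam_emp = np.cov(P, rowvar=False) / (0.3 * 0.7)
lam_target = W.T @ np.diag(gamma) @ W
```

Here `lam_emp` should approximate `lam_target`, whose entries are 0.0575, 0.0285, and 0.0345, consistent with Eq. (10).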

One possible drawback of the double-admixture method is that the approach relies on the existence of $W$ and $\Gamma$ such that $\Lambda=W^{\top}\Gamma W$. We do not currently have a theoretical guarantee that such $W$ and $\Gamma$ exist (although one may exist since $S$ can be made large). Therefore, we provide a complementary method in Appendix B.4, the NORmal To Anything (NORTA) approach [27], serving as a tool for simulating $P$ when the double-admixture method is not applicable. It should be noted that the double-admixture method solves the optimization one time for the entire process so that its running time is independent of the number of loci $m$. In contrast, the NORTA method has to solve $K(K-1)/2$ root-finding problems per locus and therefore has a complexity of $\mathcal{O}(K^2m)$, rendering it significantly more time consuming. The performances of the double-admixture and NORTA methods are demonstrated in Appendix C.

Note that if we set $\Gamma=\Lambda$ for a diagonal standard admixture $\Lambda$ and $W=I_K$ (where $I_K$ is the $K\times K$ identity matrix), then the double-admixture method reduces to the BN sampling from Eq. (5), which produces valid antecedent population frequencies for the standard admixture model. From this observation, the double-admixture method can be viewed as a generalization of BN sampling.

2.5. Generating bootstrap datasets from realistic population structures

By utilizing the double-admixture method, we implemented Algorithm 5 to simulate genotypes from the super admixture model. In Appendix C, we assessed whether Algorithm 5 generates genotypes that satisfy the moment constraints imposed by the super admixture model. Algorithm 5 is especially useful when the inputs $a$, $\Lambda$, and $Q$ reflect real populations. When these parameters are unavailable, one can utilize an admixture method to estimate $Q$ and the method proposed here to estimate $\Lambda$. The ancestral allele frequencies can be estimated with simple sample means. We outline this procedure in Algorithm 6, with the ALStructure algorithm for estimation of $Q$ and the super admixture algorithm for estimation of $\Lambda$.

Algorithm 5:

Generating genotypes X from the super admixture model

input: Ancestral allele frequencies a, antecedent populations coancestry Λ, and admixture proportions Q
1 generate P using Algorithm 4
2 let Π=PQ
3 let $X=\{x_{ij}\}$ by generating $x_{ij}\mid\pi_{ij}\sim\mathrm{Binomial}(2,\pi_{ij})$ independently for $i=1,2,\ldots,m$ and $j=1,2,\ldots,n$
4 return X

Line 1 can also be completed with the NORTA method, Algorithm B.4.
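Lines 2 and 3 of Algorithm 5 amount to a matrix product followed by Binomial sampling, which can be sketched in NumPy as below. The function name, the seed, and the inputs are hypothetical; in particular, `P` is filled with uniform draws here purely as a stand-in for the output of Algorithm 4.

```python
import numpy as np


def simulate_genotypes(P, Q, rng):
    """Form IAFs Pi = P Q and draw x_ij ~ Binomial(2, pi_ij) independently."""
    Pi = np.clip(P @ Q, 0.0, 1.0)    # guard against floating-point overshoot
    return rng.binomial(2, Pi)       # m x n matrix with entries in {0, 1, 2}


rng = np.random.default_rng(1)
m = 20_000
P = rng.uniform(0.1, 0.9, size=(m, 2))   # stand-in for antecedent allele frequencies
Q = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])          # K = 2, n = 3; columns sum to 1
X = simulate_genotypes(P, Q, rng)
```

Because each column of $Q$ lies on the simplex, every $\pi_{ij}$ is a convex combination of allele frequencies and therefore a valid Binomial probability; the clipping only handles rounding error.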

Algorithm 6:

Generating bootstrap genotypes X* from observed genotypes X

input: Genotype matrix X
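1 let $\hat{a}=\{\hat{a}_i\}$ where $\hat{a}_i=\frac{1}{2n}\sum_{j=1}^{n}x_{ij}$ for $i=1,2,\ldots,m$
2 obtain $\hat{\Lambda}^{\mathrm{sup}}$ and $\hat{Q}$ from Algorithm 2 with input $X$
3 generate $P^{*}$ using Algorithm 4 with inputs $\hat{a}$ and $\hat{\Lambda}^{\mathrm{sup}}$
4 let $\Pi^{*}=P^{*}\hat{Q}$
5 let $X^{*}=\{x^{*}_{ij}\}$ by generating $x^{*}_{ij}\mid\pi^{*}_{ij}\sim\mathrm{Binomial}(2,\pi^{*}_{ij})$ independently for $i=1,2,\ldots,m$ and $j=1,2,\ldots,n$
6 return $X^{*}$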

$\hat{\Lambda}^{\mathrm{sup}}$ can be replaced with $\hat{\Lambda}^{\mathrm{std}}$ in Line 2, in which case the BN sampling from Eq. (5) is used in Line 3. Line 3 can also be completed with the NORTA method, Algorithm B.4, if using $\hat{\Lambda}^{\mathrm{sup}}$.

We note that Algorithm 6 is a semi-parametric bootstrap simulation: Line 3 is semi-parametric, $\Pi^{*}$ is semi-parametric because $\hat{Q}$ is nonparametric, and Line 5 is parametric. The output $X^{*}$ can be interpreted as a bootstrap replication of $X$, where the population structure in $X^{*}$ recapitulates the structure in $X$. The process that the bootstrap method recapitulates is not just resampled genotypes for a fixed matrix of estimated IAFs. Rather, the antecedent population allele frequencies are resampled, also leading to resampled IAFs, so both evolutionary and statistical resampling occur.

2.6. Significance test of coancestry among antecedent populations

Here, we develop a hypothesis test of the standard admixture model (null) versus the super admixture model (alternative). We show below that on real data sets the test results are highly significant against the null in favor of the alternative. In terms of model parameters, the test is defined as follows:

$$H_0:\ \max_{u\neq v}\lambda_{uv}=0\quad\text{(standard admixture model)}$$
$$H_1:\ \max_{u\neq v}\lambda_{uv}>0\quad\text{(super admixture model)}$$

A straightforward test-statistic is $U=\|\hat{\Lambda}^{\mathrm{sup}}-\hat{\Lambda}^{\mathrm{std}}\|_F$. The larger $U$ is, the more evidence there is against the null hypothesis in favor of the alternative hypothesis. In order to calculate a p-value for this test-statistic, we need to know the distribution of $U$ when the null hypothesis is true. To this end, we adapt the bootstrap method of Algorithm 6, leading to Algorithm 7.

Algorithm 7:

Hypothesis test of no coancestry among antecedent populations

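input: Genotype matrix $X$ and number of bootstrap replications $B$
1 calculate $\hat{a}_i=\frac{1}{2n}\sum_{j=1}^{n}x_{ij}$ for $i=1,2,\ldots,m$
2 calculate estimates $\hat{\Lambda}^{\mathrm{std}}$, $\hat{\Lambda}^{\mathrm{sup}}$, and $\hat{Q}$ by Algorithm 2 with input $X$
3 calculate the observed test-statistic $U=\|\hat{\Lambda}^{\mathrm{sup}}-\hat{\Lambda}^{\mathrm{std}}\|_F$
4 for $b=1,2,\ldots,B$ do
5   generate $p^{*}_{iu}\sim\mathrm{BN}(\hat{a}_i,\hat{\lambda}^{\mathrm{std}}_{uu})$ independently and let $P^{*}=\{p^{*}_{iu}\}$ for $i=1,2,\ldots,m$ and $u=1,2,\ldots,K$
6   let $\Pi^{*}=P^{*}\hat{Q}$
7   let $X^{*}=\{x^{*}_{ij}\}$ by generating $x^{*}_{ij}\mid\pi^{*}_{ij}\sim\mathrm{Binomial}(2,\pi^{*}_{ij})$ independently for $i=1,2,\ldots,m$ and $j=1,2,\ldots,n$
8   calculate estimates $\hat{\Lambda}^{\mathrm{std}*}$ and $\hat{\Lambda}^{\mathrm{sup}*}$ by Algorithm 2 with input $X^{*}$
9   calculate the bootstrap null test-statistic $U^{*(b)}=\|\hat{\Lambda}^{\mathrm{sup}*}-\hat{\Lambda}^{\mathrm{std}*}\|_F$
10 return p-value $=\frac{1}{B}\sum_{b=1}^{B}\mathbb{1}\left(U^{*(b)}\ge U\right)$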
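The test-statistic and the final p-value step of Algorithm 7 can be sketched in plain Python as follows. The function names and all numerical values, including the list of bootstrap null statistics, are hypothetical and serve only to illustrate the computation.

```python
import math


def frobenius_distance(A, B):
    """Frobenius norm of A - B for two equally sized nested-list matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(A, B) for x, y in zip(row_a, row_b))
    )


def bootstrap_p_value(u_obs, u_null):
    """Fraction of bootstrap null statistics at least as large as the observed one."""
    if not u_null:
        raise ValueError("need at least one bootstrap replicate")
    return sum(1 for u in u_null if u >= u_obs) / len(u_null)


# Hypothetical estimates for K = 2 antecedent populations.
Lam_sup = [[0.10, 0.04],
           [0.04, 0.20]]
Lam_std = [[0.10, 0.00],
           [0.00, 0.20]]
U = frobenius_distance(Lam_sup, Lam_std)

# Hypothetical bootstrap null statistics from B = 4 replicates.
U_null = [0.01, 0.02, 0.06, 0.03]
p_value = bootstrap_p_value(U, U_null)
```

When no bootstrap null statistic reaches the observed value, this estimate equals 0, and the result is naturally reported as a p-value below $1/B$.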

To evaluate the validity of the proposed test, we performed this hypothesis testing on various simulation designs (Appendix C). Our simulations show that the test produces valid p-values, which are conservative (Fig. C.4), meaning the test has a maximum type I error rate less than or equal to the nominal level of the test. On real data sets analyzed below, these p-values are small, so the conservative behavior that we observe in simulations does not appear to be relevant for populations with nontrivial levels of structure.

3. Analysis of human studies

We applied the super admixture framework to four published studies: the human genome diversity panel (HGDP) [19], the 1000 genomes project (TGP) [20], the Human Origins study (HO) [21], and a study on individuals with Indian ancestry (IND) [22]. Within the TGP study, we also analyzed a subset of admixed populations with American ancestry, denoted by AMR. While HGDP, TGP, and HO are sampled from ancestries throughout the world, the IND and AMR data sets are regionally sampled. This yielded five data sets that collectively represent a range of population structures and study designs. Discussions of the results on HO, AMR, and IND are in the main text, while HGDP and TGP are in Appendix D.

3.1. Calculations

We processed the data sets and performed quality control checks to produce a genotype matrix $X$ for each as the starting point of our analysis (Appendix D.1). We next applied Algorithm 2 to $X$ to obtain $\hat{\Lambda}^{\mathrm{sup}}$ and $\hat{\Lambda}^{\mathrm{std}}$, the estimates of antecedent population coancestry for the super admixture and standard admixture models, respectively. We also calculated their corresponding individual-level coancestry estimates $\hat{\Theta}^{\mathrm{sup}}$ and $\hat{\Theta}^{\mathrm{std}}$. As a part of Algorithm 2, we calculated the appropriate number of antecedent populations $K$ using the structural Hardy-Weinberg method [12] (detailed in Appendix D.6). The values of $K$ ranged from $K=11$ for HO to $K=3$ for AMR, which are consistent with earlier work [10, 12, 28]. Also, in Algorithm 2 we calculated estimates of the admixture proportion matrices $\hat{Q}$ using the ALStructure method [10].

To evaluate the accuracy of $\hat{\Theta}^{\mathrm{sup}}$ and $\hat{\Theta}^{\mathrm{std}}$, we computed the OS estimate [4] of individual-level coancestry $\hat{\Theta}^{\mathrm{OS}}$ on each data set. The OS estimate of $\Theta$ is based on general assumptions and is a consistent estimator for arbitrary population structures under the appropriate conditions. Since $\hat{\Theta}^{\mathrm{OS}}$ makes no assumptions about the distributions of the IAFs or coancestry parameters, it serves as a benchmark for our methods, allowing us to observe whether the super admixture or standard admixture models lose information about individual-level coancestry relative to OS. As shown in Table D.1, the Frobenius-based distances from $\hat{\Theta}^{\mathrm{sup}}$ to $\hat{\Theta}^{\mathrm{OS}}$ are about 10 to 40 times smaller than those from $\hat{\Theta}^{\mathrm{std}}$ to $\hat{\Theta}^{\mathrm{OS}}$. The distance from $\hat{\Theta}^{\mathrm{sup}}$ to $\hat{\Theta}^{\mathrm{OS}}$ is arguably too small to be practically relevant, meaning that $\hat{\Theta}^{\mathrm{sup}}$ achieves the resolution of $\hat{\Theta}^{\mathrm{OS}}$ for practical purposes.

We carried out Algorithm 7 to perform a hypothesis test of the standard admixture model versus the super admixture model for all five datasets, with B=1000 bootstrap iterations. For all data sets, no bootstrap null test-statistic was equal to or greater than the observed test-statistic, so p-value < 0.001 for all data sets. The bootstrap null test-statistics and observed test-statistic for all data sets are shown in Fig. D.9.
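The p-value statement above follows from simple counting, which can be sketched as follows. This is a generic bootstrap p-value calculation under the stated convention (count null statistics equal to or greater than the observed one), not a reproduction of Algorithm 7. The function name `bootstrap_p_value` and the toy null statistics are hypothetical.

```python
import numpy as np

def bootstrap_p_value(observed, null_stats):
    """Return (p_hat, n_exceed, B) for a one-sided bootstrap test.

    p_hat is the fraction of the B bootstrap null statistics that are
    equal to or greater than the observed statistic.
    """
    null_stats = np.asarray(null_stats, dtype=float)
    B = null_stats.size
    n_exceed = int(np.sum(null_stats >= observed))
    return n_exceed / B, n_exceed, B

# Toy null distribution with B = 1000 values in [0, 1], and an observed
# statistic that lies well beyond all of them.
null_stats = np.linspace(0.0, 1.0, 1000)
observed = 2.0

p_hat, n_exceed, B = bootstrap_p_value(observed, null_stats)

# With zero exceedances the p-value is only bounded by the bootstrap
# resolution, so it is reported as p < 1/B rather than as exactly zero.
p_upper_bound = 1.0 / B
```

With B = 1000 and no exceedances, the bound is 0.001, matching the reported p-value < 0.001.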

We applied Algorithm 6 to generate bootstrap replications X* from each data set’s genotype matrix X. We applied the double-admixture method (Algorithm 4) and the NORTA method (Algorithm B.5) to include the performance of both methods. We computed the OS estimate ΘˆOS* of individual-level coancestry for each X*.

3.2. Visualizing results

We first visualized the results by making heatmaps of the individual-level coancestry estimates ΘˆOS, Θˆsup, and Θˆstd. We also made heatmaps of ΘˆOS* from bootstrap resampled genotypes using both the double-admixture and NORTA methods for generating antecedent population allele frequencies. These are displayed as follows: HO − Fig. 2, AMR − Fig. 4, IND − Fig. 6, HGDP − Fig. D.5, and TGP − Fig. D.7. For all data sets, ΘˆOS and Θˆsup are qualitatively equivalent, which is quantitatively supported by the small distances between them reported in Table D.1. The estimates ΘˆOS* from the two bootstrap methods are also qualitatively equivalent to ΘˆOS and Θˆsup. Finally, the standard admixture coancestry estimate Θˆstd is not close to the other estimates, further indicating that the standard admixture model is not sufficient for these data sets.

Figure 4:

Heatmaps of individual-level coancestry estimates in the AMR data set.

Figure 6:

Heatmap of individual-level coancestry estimates in the merged data set of mainland Indians from IND, and Central/South Asians and East Asians from HGDP.

We next visualized the results by building on the standard colored stacked bar plots of Qˆ, which display the admixture proportions of the K antecedent populations for the individuals. In our case, we have additional information, namely the estimated antecedent population coancestry matrix Λˆsup from the super admixture model. This matrix describes the relationships among the antecedent populations, which we would like to visualize. The first way we visualized Λˆsup was to create a heatmap of its values. We then constructed a dendrogram from Λˆsup that is displayed above the stacked bar plot. This gives the user insight into the relatedness of the antecedent populations and connects them to the stacked bar plots. To this end, we calculated a distance matrix D from Λˆsup according to:

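$$d_{uv} = \begin{cases} 0 & \text{if } u = v, \\ \max\left(\hat{\Lambda}^{\text{sup}}\right) - \hat{\lambda}_{uv}^{\text{sup}} & \text{if } u \neq v. \end{cases}$$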

We then applied standard agglomerative clustering to D using the “weighted pair group method with arithmetic mean” (WPGMA) to obtain a dendrogram. The resulting plots are displayed for each data set as follows: HO − Fig. 3, AMR − Fig. 5, IND − Fig. 7, HGDP − Fig. D.6, and TGP − Fig. D.8.
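The distance construction and clustering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it takes the maximum over all entries of the coancestry matrix (our reading of the formula above) and uses SciPy's `linkage` with `method="weighted"`, which is SciPy's name for WPGMA. The helper name `coancestry_to_distance` and the toy 3 × 3 matrix are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def coancestry_to_distance(lam):
    """Convert a symmetric K x K coancestry matrix into a distance matrix.

    Off-diagonal entries are max(lam) - lam[u, v]; diagonal entries are 0.
    The maximum is taken over all entries (an assumption of this sketch).
    """
    lam = np.asarray(lam, dtype=float)
    d = lam.max() - lam
    np.fill_diagonal(d, 0.0)
    return d

# Toy coancestry among three antecedent populations; the first two are
# more related to each other than either is to the third.
lam = np.array([[0.30, 0.20, 0.05],
                [0.20, 0.25, 0.04],
                [0.05, 0.04, 0.15]])

D = coancestry_to_distance(lam)

# linkage expects the condensed (upper-triangular) form of D.
Z = linkage(squareform(D), method="weighted")
```

Each row of `Z` records one merge (the two cluster indices, the merge height, and the new cluster size). The linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree above a stacked bar plot.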

Figure 5:

(A) Heatmap of antecedent population coancestry estimates in AMR. (B) Dendrogram representation of the antecedent population coancestry estimates. (C) Stacked bar plot of admixture proportions.

Figure 7:

(A) Heatmap of antecedent population coancestry estimates in the merged data set of mainland Indians from IND, and Central/South Asians and East Asians from HGDP. (B) Dendrogram representation of the antecedent population coancestry estimates. (C) Stacked bar plot of admixture proportions.

3.3. Human Origins (HO) study

The Human Origins data set (HO) consists of 2124 individuals from 170 sub-subpopulations grouped into 11 subpopulations. We observed that the estimated individual-level coancestry agrees with current knowledge of early human migrations [29−32]. In Fig. 2, we observed the first major split between Sub-Saharan Africa and North Africa. This split reflects the divergence between Sub-Saharan Africans and the rest of human populations resulting from an out-of-Africa migration around 50–60 kya. Another split occurred between South Asia and East Asia, revealing the separation between West Eurasians and East Asians around 40–45 kya. Within the East Asia clade, the Oceanians have the highest within-group coancestry and the lowest coancestry with other subpopulations, consistent with the theory that Oceanians split earliest from the rest of East Asians.

The coancestry among antecedent populations is also compatible with early human dispersals (Fig. 3). Specifically, in the dendrogram plot of the antecedent population coancestry (Fig. 3B), we note that the first branch split individuals from Sub-Saharan Africa, represented by the antecedent populations S1 and S2, from individuals outside of Sub-Saharan Africa, represented by the other antecedent populations. Individuals outside of Sub-Saharan Africa further branched off into two lineages: the West Eurasians, represented by antecedent populations S3, S4, and S5, and the East Asians, represented by antecedent populations S6-S11. Then the Oceanians, represented by the antecedent population S9, split off from the majority of East Asian ancestry, while the latter further diverged into present-day Asians (antecedent populations S6, S7, and S8) and present-day Americans (antecedent populations S10 and S11).

3.4. Admixed individuals (AMR) from the 1000 Genomes Project (TGP)

The AMR subset of TGP has 353 individuals from four regions (Mexican-American (MXL): 65, Puerto Rican (PUR): 104, Colombian (CLM): 97, Peruvian (PEL): 87). The individual-level coancestry plot (Fig. 4) revealed that this dataset does not have a discrete population structure. Instead, the coancestry changes smoothly over individuals, indicating wide-ranging historical admixture events. This is consistent with the AMR population descending from European, Native American, and Sub-Saharan African ancestries during the post-Columbian era [33, 34].

In the analysis of the coancestry among antecedent populations (Fig. 5), we identified three major sources of ancestry: Sub-Saharan African ancestry represented by the antecedent population S1, West Eurasian ancestry represented by the antecedent population S2, and Native American ancestry represented by the antecedent population S3. The first split occurred between Sub-Saharan Africans (S1) and individuals outside of Sub-Saharan Africa (S2 and S3), and the second split occurred between the West Eurasians (S2) and the Native Americans (S3). We also noted that the Puerto Ricans contain the highest amount of Sub-Saharan African ancestry, the Peruvians have the highest proportion of Native American ancestry, and the Colombians and the Mexican-Americans display extensive variation in their admixture proportions of European and Native American ancestry. These observations are consistent with previous analyses of AMR populations [28, 33, 34].

3.5. Indian (IND) study

We combined the mainland Indians from the IND study with the Central/South Asia and the East Asia populations from HGDP to study the relationship between present-day Indians and other populations in Asia. Our merged data set consists of 298 mainland Indians from four linguistic groups (Indo-European (IE): 92, Dravidian: 53, Austro-Asiatic (AA): 79, Tibeto-Burman (TB): 74), together with 190 Central/South Asians and 210 East Asians from HGDP. Previous analyses of South Asian populations have shown that the Indo-European speakers show a considerable amount of Western Eurasian relatedness and are ancestrally close to Central Asians. The Austro-Asiatic speakers and the Tibeto-Burman speakers were mixed from East Asian ancestry. The Tibeto-Burman speakers generally have significant genomic proportions derived from East Asian ancestry, so that some Tibeto-Burman speakers can be difficult to distinguish from East Asian populations based on genome-wide measures of relatedness. Consistent with these findings [22, 35, 36], we observe a split between Indo-European speakers and the rest of mainland Indians in the heatmap of individual-level coancestry (Fig. 6). The Indo-European speakers and the Central/South Asians of HGDP have relatively similar levels of coancestry. The second split occurred between the Austro-Asiatic speakers and the Tibeto-Burman speakers. The Tibeto-Burman speakers and East Asians of HGDP have relatively similar levels of coancestry.

Our analysis reveals that there are three major branches of antecedent populations for this dataset (Fig. 7). The branch of antecedent populations S1 and S2 is most prevalent in Central/South Asians of HGDP and Indo-European speakers, suggesting this branch was at least partially derived from a West Eurasian source. The branch of the antecedent populations S3,S4 and S5 is widespread in Dravidian speakers and Austro-Asiatic speakers, indicating it is relevant to South Indian ancestry and Austro-Asiatic speaker ancestry. The third branch of the antecedent populations S6 and S7 likely represents East Asian ancestry due to its high prevalence in the Tibeto-Burman speakers and East Asians of HGDP.

4. Discussion

The super admixture framework is an extension of the highly used admixture model. It superposes coancestry among the admixed antecedent populations. It provides a forward generating probability process that encompasses random evolutionary, genealogical, and statistical sampling processes. The antecedent populations are modeled to have an arbitrarily complex coancestry. This allows the generation of individual-specific allele frequencies (IAFs) that capture complex population structures and permit the estimation of individual-level coancestry that is at the resolution of general individual-level coancestry and kinship estimators for arbitrarily complex structures.

There are numerous parameters estimated from genome-wide genotype data that relate to structure, such as coancestry, inbreeding, and FST. When traits are included, one often estimates parameters in the context of genome-wide association studies [23, 37], genome-wide heritability [38−40], and polygenic risk scores [41, 42]. There does not exist a straightforward, general method for quantifying uncertainty among these various estimates. Within our framework, we have shown how to perform a bootstrap resampling method that randomly generates new genetic data that recapitulate population structure observed in real data. This bootstrap method may provide a way to formulate general methods for quantifying uncertainty in genome-wide genotype studies.

We developed a hypothesis test where one can test the standard versus super admixture model on real data. When we applied it to the five data sets analyzed here, all of them were highly significant in rejecting the standard admixture model in favor of the super admixture model. The individual-level coancestry estimates from the super admixture model also agreed with the general coancestry estimate, whereas the standard admixture individual-level coancestry estimates did not.

The stacked bar plot visualization of admixture proportions among individuals is ubiquitous in analyzing population structure. We showed here how the estimated antecedent population coancestry can be plotted with the stacked bar plot to visualize the relationship among the antecedent populations in conjunction with the bar plot. The admixture proportions among individuals are then interpretable in terms of the evolutionary history of the antecedent populations. We demonstrated this visualization on five data sets and showed how it agreed with known results on these human populations.

Understanding population structure in humans is one of the central problems in modern genetics. We demonstrated that the proposed super admixture framework is a powerful tool for learning admixed population coancestry, improving the analysis of genetic data from structured populations, bridging admixture with individual-level coancestry and kinship, and simulating new data reflecting a structured population. We anticipate that the super admixture framework will be useful in analyzing complex population structure in future applications.

Acknowledgments

This work was supported in part by US National Institutes of Health grant R01 HG006448.

Footnotes

Resources

The superadmixture software package is available at https://github.com/StoreyLab/superadmixture. The results in this paper can be reproduced with code available at https://github.com/StoreyLab/superadmixture-manuscript-analysis.

1. Note also that the OS estimate of Θ is equal to the OS estimate of kinship, Φ, except for the diagonal elements, where $\hat{\theta}_{jj}^{\text{OS}} = 2\hat{\phi}_{jj}^{\text{OS}} - 1$, as shown in Eq. (3).

References

  • [1] Slatkin M. “Gene flow and the geographic structure of natural populations”. Science 236(4803) (1987), pp. 787–792.
  • [2] Bohonak A. J. “Dispersal, gene flow, and population structure”. The Quarterly Review of Biology 74(1) (1999), pp. 21–45.
  • [3] Weir B. S. and Hill W. G. “Estimating F-statistics”. Annual Review of Genetics 36(1) (2002), pp. 721–750.
  • [4] Ochoa A. and Storey J. D. “Estimating FST and kinship for arbitrary population structures”. PLoS Genetics 17(1) (2021), e1009241.
  • [5] Ochoa A. and Storey J. D. “FST and kinship for arbitrary population structures I: Generalized definitions”. bioRxiv (2016), doi: 10.1101/083915.
  • [6] Thornton T. et al. “Estimating kinship in admixed populations”. The American Journal of Human Genetics 91(1) (2012), pp. 122–138.
  • [7] Hao W., Song M., and Storey J. D. “Probabilistic models of genetic variation in structured populations applied to global human studies”. Bioinformatics 32(5) (2016), pp. 713–721.
  • [8] Tang H. et al. “Estimation of individual admixture: Analytical and study design considerations”. Genetic Epidemiology 28(4) (2005), pp. 289–301.
  • [9] Alexander D. H., Novembre J., and Lange K. “Fast model-based estimation of ancestry in unrelated individuals”. Genome Research 19(9) (2009), pp. 1655–1664.
  • [10] Cabreros I. and Storey J. D. “A likelihood-free estimator of population structure bridging admixture models and principal components analysis”. Genetics 212(4) (2019), pp. 1009–1029.
  • [11] Song M., Hao W., and Storey J. D. “Testing for genetic associations in arbitrarily structured populations”. Nature Genetics 47(5) (2015), pp. 550–554.
  • [12] Hao W. and Storey J. D. “Extending tests of Hardy–Weinberg equilibrium to structured populations”. Genetics 213(3) (2019), pp. 759–770.
  • [13] Wright S. “The genetical structure of populations”. Annals of Eugenics 15(1) (1949), pp. 323–354.
  • [14] Jacquard A. “Inbreeding: One word, several meanings”. Theoretical Population Biology 7(3) (1975), pp. 338–363.
  • [15] Pritchard J. K., Stephens M., and Donnelly P. “Inference of population structure using multilocus genotype data”. Genetics 155(2) (2000), pp. 945–959.
  • [16] Raj A., Stephens M., and Pritchard J. K. “fastSTRUCTURE: Variational inference of population structure in large SNP data sets”. Genetics 197(2) (2014), pp. 573–589.
  • [17] Gopalan P. et al. “Scaling probabilistic models of genetic variation to millions of humans”. Nature Genetics 48(12) (2016), pp. 1587–1590.
  • [18] Balding D. J. and Nichols R. A. “A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity”. Genetica 96(1) (1995), pp. 3–12.
  • [19] Cavalli-Sforza L. L. “The Human Genome Diversity Project: past, present and future”. Nature Reviews Genetics 6(4) (2005), pp. 333–340.
  • [20] Auton A. et al. “A global reference for human genetic variation”. Nature 526(7571) (2015), pp. 68–74.
  • [21] Lazaridis I. et al. “Ancient human genomes suggest three ancestral populations for present-day Europeans”. Nature 513(7518) (2014), pp. 409–413.
  • [22] Basu A., Sarkar-Roy N., and Majumder P. P. “Genomic reconstruction of the history of extant populations of India reveals five distinct ancestral components and a complex structure”. Proceedings of the National Academy of Sciences 113(6) (2016), pp. 1594–1599.
  • [23] Price A. L. et al. “Principal components analysis corrects for stratification in genome-wide association studies”. Nature Genetics 38(8) (2006), pp. 904–909.
  • [24] Combettes P. L. and Wajs V. R. “Signal recovery by proximal forward-backward splitting”. Multiscale Modeling & Simulation 4(4) (2005), pp. 1168–1200.
  • [25] Falush D., Stephens M., and Pritchard J. K. “Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies”. Genetics 164(4) (2003), pp. 1567–1587.
  • [26] Bolte J., Sabach S., and Teboulle M. “Proximal alternating linearized minimization for nonconvex and nonsmooth problems”. Mathematical Programming 146(1) (2014), pp. 459–494.
  • [27] Cario M. C. and Nelson B. L. “Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix”. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL (1997), pp. 1–19.
  • [28] Ochoa A. and Storey J. D. “New kinship and FST estimates reveal higher levels of differentiation in the global human population”. bioRxiv (2019), doi: 10.1101/653279.
  • [29] Wall J. D. “Inferring human demographic histories of non-African populations from patterns of allele sharing”. The American Journal of Human Genetics 100(5) (2017), pp. 766–772.
  • [30] Lipson M. and Reich D. “A working model of the deep relationships of diverse modern human genetic lineages outside of Africa”. Molecular Biology and Evolution 34(4) (2017), pp. 889–902.
  • [31] Nielsen R. et al. “Tracing the peopling of the world through genomics”. Nature 541(7637) (2017), pp. 302–310.
  • [32] Bergström A. et al. “Origins of modern human ancestry”. Nature 590(7845) (2021), pp. 229–237.
  • [33] Bryc K. et al. “Genome-wide patterns of population structure and admixture among Hispanic/Latino populations”. Proceedings of the National Academy of Sciences 107(Supplement 2) (2010), pp. 8954–8961.
  • [34] Adhikari K. et al. “Admixture in Latin America”. Current Opinion in Genetics & Development. Genetics of human origin 41 (2016), pp. 106–114.
  • [35] de Barros Damgaard P. et al. “The first horse herders and the impact of early Bronze Age steppe expansions into Asia”. Science 360(6396) (2018), eaar7711.
  • [36] Narasimhan V. M. et al. “The formation of human populations in South and Central Asia”. Science 365(6457) (2019), eaat7487.
  • [37] Astle W. and Balding D. J. “Population structure and cryptic relatedness in genetic association studies”. Statistical Science 24(4) (2009), pp. 451–471.
  • [38] Kang H. M. et al. “Variance component model to account for sample structure in genome-wide association studies”. Nature Genetics 42(4) (2010), pp. 348–354.
  • [39] Yang J. et al. “Common SNPs explain a large proportion of the heritability for human height”. Nature Genetics 42(7) (2010), pp. 565–569.
  • [40] Zhou X. and Stephens M. “Genome-wide efficient mixed-model analysis for association studies”. Nature Genetics 44(7) (2012), pp. 821–824.
  • [41] Márquez-Luna C. et al. “Multiethnic polygenic risk scores improve risk prediction in diverse populations”. Genetic Epidemiology 41(8) (2017), pp. 811–823.
  • [42] Weissbrod O. et al. “Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores”. Nature Genetics 54(4) (2022), pp. 450–458.
