2022 Jul 20;22(7):2614–2626. doi: 10.1111/1755-0998.13647

FSTruct: An FST‐based tool for measuring ancestry variation in inference of population structure

Maike L Morrison 1, Nicolas Alcala 2, Noah A Rosenberg 1
PMCID: PMC9544611  PMID: 35596736

Abstract

In model‐based inference of population structure from individual‐level genetic data, individuals are assigned membership coefficients in a series of statistical clusters generated by clustering algorithms. Distinct patterns of variability in membership coefficients can be produced for different groups of individuals, for example, representing different predefined populations, sampling sites or time periods. Such variability can be difficult to capture in a single numerical value; membership coefficient vectors are multivariate and potentially incommensurable across predefined groups, as the number of clusters over which individuals are distributed can vary among groups of interest. Further, two groups might share few clusters in common, so that membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector and does not depend on its mean. We apply the approach, which makes use of a normalized FST statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more predefined groups in their level of membership coefficient variability. Our methods are implemented in the R package FSTruct.

Keywords: FST, admixture, population structure

1. INTRODUCTION

In the past two decades, computational methods for inference of population structure from individual‐level genetic data have contributed a rich and informative set of approaches for the analysis of genetic variation. Model‐based clustering methods such as admixture (Alexander et al., 2009; Alexander & Lange, 2011), baps (Corander et al., 2004, 2008) and structure (Falush et al., 2003, 2007; Hubisz et al., 2009; Pritchard et al., 2000) are now routinely used to generate insights into population structure and evolutionary history in diverse species of interest in ecology, evolution, conservation biology and agriculture (Guillot & Orlando, 2017).

In model‐based inference of population structure, individuals are clustered based on their multilocus genotypes into a series of statistical clusters, such that each individual possesses a membership coefficient for each cluster. Each membership coefficient represents the proportion of an individual's ancestry that is derived from the associated cluster. Interpreting the membership coefficients of individuals from various predefined populations, sampling sites or other groups of biological interest can illuminate patterns of genetic variation and population structure. Researchers often investigate variability of membership patterns within predefined groups, as well as similarities and differences in the membership patterns of distinct groups.

One type of comparison that is frequently of interest is an assessment of relative levels of variation in membership coefficients among the individuals belonging to two or more predefined groups. This type of comparison arises in many contexts, such as when exploring differences in membership variability between admixed and nonadmixed populations, between populations from different time periods or between different types of data from the same sampled individuals.

For example, in a study of ancient human DNA samples dating over a period of thousands of years, Antonio et al. (2019) sought to examine whether the population of Rome possessed greater diversity in ancestry during certain periods of the Roman Empire. They estimated membership coefficients using admixture and interpreted the inferred coefficients to claim that during the Imperial Rome period, when the Roman Empire was at its peak, ancestry was more variable than during earlier periods, when Rome was more isolated (Figure 1 of Antonio et al., 2019).

Interpretations of inferred membership coefficients to make relative claims about membership variability have generally relied on visual assessment of population structure diagrams rather than on statistical hypothesis testing. In particular, as in Antonio et al. (2019), researchers seeking to quantify variability in membership coefficients across individuals or to compare this variability between two or more groups often do so visually or informally.

Here, we introduce a statistical method to measure variability in membership coefficients inferred by model‐based clustering and to compare this variability across populations. We apply the method to examples from real and simulated data. The method is implemented in the R package FSTruct.

2. MATERIALS AND METHODS

2.1. Overview

The output of population structure inference software programs such as structure and admixture is a representation of individual membership coefficients in matrix form. The matrix, often denoted Q and termed a ‘Q matrix’, has I rows, corresponding to I individuals, and K columns, corresponding to the total number of clusters (Figure 1b). The entry in row i and column k, $q_{ki}$, represents the membership coefficient of individual i in cluster k: the proportion of the ancestry of individual i that is assigned to cluster k. Each row sums to 1, or $\sum_{k=1}^{K} q_{ki} = 1$ for each i.

FIGURE 1.

The analogy of the use of FST to measure membership variability. (a) A standard application of FST to measure variability of allele frequency vectors across populations; $q_{ki}$ is the frequency of allele k in population i. (b) Use of FST to measure variability of membership coefficient vectors across individuals; $q_{ki}$ is the membership coefficient of individual i in cluster k. The matrix containing entries $q_{ki}$ is a Q matrix.

We seek to compute a measure of variability among ancestry vectors for individuals: among rows of Q. We wish for the measure to be comparable across different data sets, possibly representing different samples. This problem is complicated by the fact that different Q matrices might include different numbers of clusters; furthermore, column entries for some clusters might vary greatly across individuals, while other columns are more uniform.

We approach the problem by modifying the population differentiation statistic FST to fit this ancestry scenario. FST measures allele frequency variability among subpopulations, and it is computed using a set of allele frequency vectors that each sum to 1. This setting is mathematically analogous to Q matrices, in which vectors of membership coefficients for each individual sum to 1. In the analogy, each individual represents a ‘population’, and its cluster membership is analogous to an ‘allele frequency’ (Figure 1).

By computing FST among individual vectors of membership coefficients, we can measure the variability of a single Q matrix. To facilitate comparisons of Q matrices with different numbers of individuals or clusters, we use a normalization of FST. Despite the general understanding that FST can in principle reach 1, features of a data set constrain the maximal value of FST, so that the maximum is often less than 1 (Alcala & Rosenberg, 2017, 2019; Jakobsson et al., 2013). The constrained maximum is relatively low when I, the number of individuals in a Q matrix, is small (analogous to a small number of populations), or when M, the mean membership of the highest‐membership ancestry cluster, is close to its minimum, $1/K$, or its maximum, 1 (analogous to an extreme value for the frequency of the most frequent allele). Denoting this maximum FSTmax, we normalize FST by its maximum, using the ratio FST/FSTmax as a measure of variability that is comparable across Q matrices of different size. This measure ranges between 0 and 1, equalling 0 when members of a population have identical membership and equalling 1 when vectors of membership coefficients are maximally variable.

2.2. The FST/FSTmax formula

Consider a scenario with I subpopulations and K distinct alleles. Allele k has frequency $q_{ki}$ in subpopulation i, with $0 \le q_{ki} \le 1$ and $\sum_{k=1}^{K} q_{ki} = 1$.

To calculate FST among the I subpopulations, we use $F_{ST} = (H_T - H_S)/H_T$, where $H_S$ represents the mean heterozygosity of the subpopulations and $H_T$ represents the heterozygosity of the total population formed by pooling the subpopulations.

The subpopulation heterozygosity $H_S$ is the mean expected frequency of heterozygotes across all I subpopulations, assuming Hardy–Weinberg equilibrium within subpopulations, or $H_S = 1 - \frac{1}{I}\sum_{i=1}^{I}\sum_{k=1}^{K} q_{ki}^2$. The total heterozygosity $H_T$ is the expected frequency of heterozygotes under Hardy–Weinberg equilibrium in a population whose allele frequencies equal the mean allele frequencies across subpopulations: $H_T = 1 - \sum_{k=1}^{K}\left(\frac{1}{I}\sum_{i=1}^{I} q_{ki}\right)^2$. The quantity $\frac{1}{I}\sum_{i=1}^{I} q_{ki}$ gives the mean frequency of allele k across subpopulations.
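
As a minimal sketch of these definitions (plain Python rather than the FSTruct implementation; the function name `fst` and the list-of-rows layout are assumptions made for illustration), FST can be computed directly from the rows of a Q matrix:

```python
def fst(q_matrix):
    """Return F_ST = (H_T - H_S) / H_T for a Q matrix given as a list of rows."""
    n_rows = len(q_matrix)
    n_cols = len(q_matrix[0])
    # H_S: one minus the mean, over rows, of the sum of squared coefficients.
    h_s = 1.0 - sum(sum(q * q for q in row) for row in q_matrix) / n_rows
    # Column means play the role of allele frequencies in the pooled population.
    col_means = [sum(row[k] for row in q_matrix) / n_rows for k in range(n_cols)]
    # H_T: one minus the sum of squared column means.
    h_t = 1.0 - sum(m * m for m in col_means)
    if h_t <= 0.0:
        raise ValueError("H_T must be positive for F_ST to be defined")
    return (h_t - h_s) / h_t


identical = [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]]  # no variation among rows
mixed = [[1.0, 0.0], [0.5, 0.5]]                  # rows differ
print(fst(identical))
print(fst(mixed))
```

Identical rows give a value of 0 up to floating-point rounding, because $H_T$ and $H_S$ then coincide.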

With the total population assumed to be polymorphic so that $H_T > 0$, for the setting of I subpopulations and K alleles, with K possibly arbitrarily large, Alcala and Rosenberg (2022) obtained the maximal value possible for FST given a fixed value of $M = \frac{1}{I}\sum_{i=1}^{I} q_{1i}$, where allele k = 1 represents the allele of greatest mean frequency across the I subpopulations. Writing $\sigma_1 = IM$, $J = \lceil \sigma_1^{-1} \rceil$ and $\{\sigma_1\} = \sigma_1 - \lfloor \sigma_1 \rfloor$, we have (Alcala & Rosenberg, 2022, Equation 3)

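$$
F_{ST}^{\max} =
\begin{cases}
1, & \sigma_1 = 1, 2, \ldots, I-1 \\[6pt]
\dfrac{(I-1)\left[1 - \sigma_1 (J-1)(2 - J\sigma_1)\right]}{I - \left[1 - \sigma_1 (J-1)(2 - J\sigma_1)\right]}, & 0 < \sigma_1 < 1 \\[10pt]
\dfrac{I(I-1) - \sigma_1^2 + \lfloor\sigma_1\rfloor - 2(I-1)\{\sigma_1\} + (2I-1)\{\sigma_1\}^2}{I(I-1) - \sigma_1^2 - \lfloor\sigma_1\rfloor + 2\sigma_1 - \{\sigma_1\}^2}, & \text{noninteger } \sigma_1,\ 1 < \sigma_1 < I.
\end{cases}
\tag{1}
$$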

This maximum is plotted as a function of M for five different values of I in Figure 2.
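
The three cases of Equation 1 can be sketched as a small Python function. This is an illustrative transcription, not code from FSTruct; the name `fst_max` and its argument order are assumptions, and the integer case is detected by an exact floating-point comparison, which is a simplification.

```python
import math


def fst_max(n_individuals, m):
    """Upper bound on F_ST given I rows and mean M of the largest column (Equation 1)."""
    i = n_individuals
    sigma1 = i * m
    if not 0.0 < sigma1 < i:
        raise ValueError("need 0 < I*M < I")
    whole = math.floor(sigma1)
    frac = sigma1 - whole  # {sigma_1}, the fractional part
    if frac == 0.0:
        return 1.0  # sigma_1 is one of 1, 2, ..., I-1
    if sigma1 < 1.0:
        j = math.ceil(1.0 / sigma1)
        inner = 1.0 - sigma1 * (j - 1) * (2.0 - j * sigma1)
        return (i - 1) * inner / (i - inner)
    # Noninteger sigma_1 with 1 < sigma_1 < I.
    num = i * (i - 1) - sigma1 ** 2 + whole - 2 * (i - 1) * frac + (2 * i - 1) * frac ** 2
    den = i * (i - 1) - sigma1 ** 2 - whole + 2 * sigma1 - frac ** 2
    return num / den


for m in (0.25, 0.5, 0.75):
    print(m, fst_max(2, m))
```

For I = 2, the bound equals 1 at M = 0.5 and falls to 1/3 at both M = 0.25 and M = 0.75, which illustrates how the maximum drops below 1 away from integer values of $\sigma_1$.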

FIGURE 2.

Bounds on FST as a function of M, the frequency of the most frequent allele (or the ancestry cluster of greatest membership, in our analogy). Bounds are evaluated using Equation 1 for different values of I, the number of populations (or the number of individuals, in our analogy). (a) I = 2. (b) I = 3. (c) I = 5. (d) I = 10. (e) I = 50.

In the language of our analogy, I is the number of individuals, that is, the number of rows in the Q matrix; M is the sample mean membership coefficient for the most frequent ancestral cluster across all I individuals; and $\sigma_1 = IM$ is the largest entry in the vector that sums column entries of the Q matrix across rows. The latter case of Equation 1, with $1 < \sigma_1 < I$, is generally more relevant in the setting of population clustering, as I is typically larger than K, so that $\sigma_1 > I/K > 1$.

The ratio FST/FSTmax, which represents a normalized measure of variability that can be compared among different groups of individuals with different values of I or K, or both, ranges between 0 and 1, taking a value of 0 when all individuals in a group have identical membership coefficients. It has a value of 1 when they are as variable as possible given M.

Alcala and Rosenberg (2022) showed that for $0 < \sigma_1 \le 1$, the maximum is realized when each ancestry cluster is found in only a single individual and each individual has exactly J ancestry clusters with coefficients greater than zero: J − 1 clusters with coefficients of $\sigma_1$, one cluster with a coefficient of $1 - (J-1)\sigma_1 \le \sigma_1$ and all others with coefficients of 0. Note that in the scenario $0 < \sigma_1 \le 1$, the number of clusters K is larger than the number of individuals I; at the maximum, multiple clusters are tied with the same mean membership coefficient M.

For $1 < \sigma_1 < I$, the maximum is realized when only the ancestry cluster of greatest membership is shared among individuals, and at most a single individual contains ancestry from multiple sources. More formally, this scenario occurs when $\lfloor\sigma_1\rfloor$ individuals possess all of their membership in the cluster of greatest membership (i.e. $q_{1i} = 1$ for these individuals), a single individual has membership coefficient $\{\sigma_1\}$ for the cluster of greatest membership and coefficient $1 - \{\sigma_1\}$ for one other cluster, and the remaining $I - \lfloor\sigma_1\rfloor - 1$ individuals each have membership coefficient 1 for mutually distinct ancestry clusters.
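
The maximizing configuration for noninteger $\sigma_1$ with $1 < \sigma_1 < I$ can be checked numerically. The sketch below builds that Q matrix, computes FST from its rows and compares the result with the third case of Equation 1. All function names are hypothetical, and the choice I = 5, $\sigma_1 = 2.4$ is an arbitrary illustration.

```python
import math


def fst(q_matrix):
    """F_ST = (H_T - H_S) / H_T computed from the rows of a Q matrix."""
    n = len(q_matrix)
    k = len(q_matrix[0])
    h_s = 1.0 - sum(sum(q * q for q in row) for row in q_matrix) / n
    means = [sum(row[c] for row in q_matrix) / n for c in range(k)]
    h_t = 1.0 - sum(m * m for m in means)
    return (h_t - h_s) / h_t


def extremal_q_matrix(n_individuals, sigma1):
    """Rows described in the text for noninteger sigma1 with 1 < sigma1 < I."""
    whole = math.floor(sigma1)
    frac = sigma1 - whole
    n_rest = n_individuals - whole - 1  # individuals with their own private cluster
    k = 2 + n_rest                      # shared cluster, one partner cluster, private clusters
    rows = [[1.0] + [0.0] * (k - 1) for _ in range(whole)]  # all membership in cluster 1
    rows.append([frac, 1.0 - frac] + [0.0] * (k - 2))       # the single mixed individual
    for j in range(n_rest):
        row = [0.0] * k
        row[2 + j] = 1.0                # membership 1 in a distinct cluster
        rows.append(row)
    return rows


def bound_noninteger(n_individuals, sigma1):
    """Third case of Equation 1."""
    i = n_individuals
    whole = math.floor(sigma1)
    frac = sigma1 - whole
    num = i * (i - 1) - sigma1 ** 2 + whole - 2 * (i - 1) * frac + (2 * i - 1) * frac ** 2
    den = i * (i - 1) - sigma1 ** 2 - whole + 2 * sigma1 - frac ** 2
    return num / den


q = extremal_q_matrix(5, 2.4)
print(fst(q), bound_noninteger(5, 2.4))  # the two values agree up to rounding
```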

2.3. Statistical test to compare values of FST/FSTmax

In applications, we may wish not only to compute FST/FSTmax for a single population but also to compare this ratio between two or more populations using a statistical test. We accomplish this task by bootstrap resampling of rows to generate replicate Q matrices for each population. We then compute the FST/FSTmax statistic for each of these replicate matrices. This process generates a bootstrap distribution of the statistic for each population. We then use a Wilcoxon rank‐sum test to determine whether pairs of bootstrap distributions of the statistic for different sets of individuals are significantly different; we use a Kruskal–Wallis test to compare three or more sets of individuals.
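
The procedure can be sketched in plain Python as follows. This is not the FSTruct code: the helper names are hypothetical, the two toy Q matrices are invented for illustration, only the case $\sigma_1 \ge 1$ of Equation 1 is handled, and the rank-sum p-value uses a normal approximation without a tie correction to the variance. Returning 0 for a resample with no variation at all is also an assumption.

```python
import math
import random


def fst_ratio(q_matrix):
    """FST/FSTmax for a Q matrix (list of rows); assumes sigma1 = I*M >= 1."""
    n, k = len(q_matrix), len(q_matrix[0])
    h_s = 1.0 - sum(sum(q * q for q in row) for row in q_matrix) / n
    means = [sum(row[c] for row in q_matrix) / n for c in range(k)]
    h_t = 1.0 - sum(m * m for m in means)
    if h_t <= 0.0:
        return 0.0  # assumption: a sample with no variation at all is scored 0
    sigma1 = n * max(means)
    if sigma1 < 1.0:
        raise ValueError("this sketch only covers sigma1 >= 1")
    whole = math.floor(sigma1)
    frac = sigma1 - whole
    if frac == 0.0:
        bound = 1.0
    else:
        num = n * (n - 1) - sigma1 ** 2 + whole - 2 * (n - 1) * frac + (2 * n - 1) * frac ** 2
        den = n * (n - 1) - sigma1 ** 2 - whole + 2 * sigma1 - frac ** 2
        bound = num / den
    return (h_t - h_s) / h_t / bound


def bootstrap_ratios(q_matrix, n_reps, rng):
    """Resample rows with replacement and recompute FST/FSTmax for each replicate."""
    return [fst_ratio([rng.choice(q_matrix) for _ in q_matrix]) for _ in range(n_reps)]


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, average ranks for ties)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x, pos = 0.0, 0
    while pos < len(pooled):
        end = pos
        while end < len(pooled) and pooled[end][0] == pooled[pos][0]:
            end += 1
        avg_rank = (pos + 1 + end) / 2.0  # mean of the ranks pos+1, ..., end
        rank_sum_x += avg_rank * sum(1 for _, g in pooled[pos:end] if g == 0)
        pos = end
    n, m = len(x), len(y)
    z = (rank_sum_x - n * (n + m + 1) / 2.0) / math.sqrt(n * m * (n + m + 1) / 12.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


rng = random.Random(1)
q_low = [[0.70, 0.30], [0.72, 0.28], [0.68, 0.32], [0.71, 0.29], [0.69, 0.31],
         [0.73, 0.27], [0.67, 0.33], [0.70, 0.30], [0.71, 0.29], [0.69, 0.31]]
q_high = [[0.90, 0.10], [0.10, 0.90], [0.80, 0.20], [0.20, 0.80], [0.95, 0.05],
          [0.30, 0.70], [0.85, 0.15], [0.15, 0.85], [0.60, 0.40], [0.40, 0.60]]
boot_low = bootstrap_ratios(q_low, 200, rng)
boot_high = bootstrap_ratios(q_high, 200, rng)
p_value = rank_sum_p_value(boot_low, boot_high)
print(p_value)
```

A comparison of three or more groups would replace the rank-sum step with a Kruskal–Wallis test, which is not implemented in this sketch.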

2.4. Software availability

We have implemented our method in the R package FSTruct (pronounced ‘F‐struct’), which is available for download from github.com/MaikeMorrison/FSTruct. This package includes functions that compute FST/FSTmax from a Q matrix such as those produced by admixture or structure, generate bootstrap samples and distributions for arbitrarily many Q matrices and visualize Q matrices.

3. RESULTS

3.1. Simulation examples

3.1.1. Dirichlet model

To illustrate our method, we used individual membership coefficient vectors drawn from a Dirichlet distribution (Kotz et al., 2000). This distribution is suited for use as the underlying model for finite vectors of nonnegative numbers $(q_1, q_2, \ldots, q_K)$ that sum to one, $\sum_{k=1}^{K} q_k = 1$, and it has appeared in previous studies of membership coefficient vectors (Huelsenbeck & Andolfatto, 2007; Pritchard et al., 2000).

We treat individual membership coefficient vectors in a population as following a Dirichlet distribution with parameter vector $\alpha\lambda = \alpha(\lambda_1, \lambda_2, \ldots, \lambda_K)$, where $\sum_{k=1}^{K} \lambda_k = 1$. We denote this distribution by $\mathrm{Dir}(\alpha\lambda_1, \alpha\lambda_2, \ldots, \alpha\lambda_K)$. Here, λ is a vector of length K whose elements determine the parametric mean membership coefficient for each ancestral cluster. The value of α controls the variance of $q_k$, the individual membership coefficient in cluster k: $\mathrm{Var}[q_k] = \lambda_k(1 - \lambda_k)/(\alpha + 1)$. Thus, an increase in α lowers the variances of membership coefficients.

To generate a random Q matrix with I individuals and K ancestry clusters, we draw I independent and identically distributed $\mathrm{Dir}(\alpha\lambda_1, \alpha\lambda_2, \ldots, \alpha\lambda_K)$ vectors, $(q_1, \ldots, q_K)$, which each comprise a set of membership coefficients for a single individual. Each vector is a row of the simulated Q matrix and is a draw from a Dirichlet distribution with mean membership coefficients $(\lambda_1, \lambda_2, \ldots, \lambda_K)$. Variability of membership coefficients across individuals is controlled by α. Hence, we proceed by (1) using the Dirichlet distribution to simulate Q matrices with specified parametric membership coefficient means and variances, (2) computing FST/FSTmax for each Q matrix and (3) examining the relationship between the value of FST/FSTmax for each Q matrix and the parametric variance of the Dirichlet distribution used to simulate it.
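
Step (1) can be sketched with the Python standard library by normalizing independent gamma variates, a standard way to draw from a Dirichlet distribution. The function name `simulate_q_matrix` and the parameter values in the usage lines are illustrative assumptions, not part of FSTruct.

```python
import random


def simulate_q_matrix(n_individuals, alpha, lam, rng):
    """Draw I rows from Dir(alpha * lam) by normalizing independent gamma variates."""
    rows = []
    while len(rows) < n_individuals:
        gammas = [rng.gammavariate(alpha * l, 1.0) for l in lam]
        total = sum(gammas)
        if total == 0.0:
            continue  # underflow is possible when alpha * lam_k is tiny; redraw the row
        rows.append([g / total for g in gammas])
    return rows


rng = random.Random(0)
q = simulate_q_matrix(50, 341 / 99, [2 / 3, 1 / 3], rng)
print(len(q), len(q[0]))  # 50 rows, 2 columns
print(sum(row[0] for row in q) / len(q))  # sample mean of cluster 1
```

Each row sums to 1 by construction, and the column means fluctuate around λ with a spread governed by α.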

3.1.2. Dirichlet simulations

To investigate the behaviour of FST/FSTmax in relation to a measure of variability in membership coefficients, we used the Dirichlet distribution to simulate Q matrices with known variability. We simulated Q matrices with I = 50 individuals and K = 2 clusters. Each simulation replicate thus drew I = 50 ancestry vectors from a $\mathrm{Dir}(\alpha\lambda_1, \alpha\lambda_2)$ distribution.

We fixed $(\lambda_1, \lambda_2) = (2/3, 1/3)$, so that membership in cluster 1 has parametric mean 2/3 across individuals in a population and membership in cluster 2 has parametric mean 1/3. The parametric variance of the membership coefficient for a specific cluster, across sampled individuals, then equals $\sigma^2 = \mathrm{Var}[q_1] = \mathrm{Var}[q_2] = \left(\frac{2}{3} \times \frac{1}{3}\right)/(\alpha + 1)$; both coefficients have the same variance. As α ranges in $(0, \infty)$, the variance ranges in $(0, 2/9)$.

We performed 500 replicate simulations of samples of 50 individuals for each of 45 values of α, choosing α values to obtain parametric variances 0.001, 0.005, 0.01, 0.015, …, 0.22, ranging from near the lower bound of 0 on the variance and stopping short of the upper bound of 2/9.
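
The α value that produces a target variance follows from inverting $\sigma^2 = \lambda_1\lambda_2/(\alpha + 1)$, which gives $\alpha = \lambda_1\lambda_2/\sigma^2 - 1$. A short sketch with exact rational arithmetic (the helper name `alpha_for_variance` is an assumption) reproduces the grid of 45 variances:

```python
from fractions import Fraction


def alpha_for_variance(lam1, lam2, variance):
    """Solve variance = lam1 * lam2 / (alpha + 1) for alpha."""
    return lam1 * lam2 / variance - 1


lam1, lam2 = Fraction(2, 3), Fraction(1, 3)
# 0.001 followed by 0.005, 0.010, ..., 0.220.
targets = [Fraction(1, 1000)] + [Fraction(5 * n, 1000) for n in range(1, 45)]
alphas = [alpha_for_variance(lam1, lam2, v) for v in targets]

print(len(targets))           # 45
print(alphas[0], alphas[-1])  # 1991/9 1/99
```

The first value, 1991/9, equals 21901/99, so the endpoints agree with the values $\alpha_1$ and $\alpha_4$ reported in the caption of Figure 4.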

Next, we compared the value of FST/FSTmax for each simulated Q matrix to the parametric variance of the Dirichlet distribution used to generate it. As FST/FSTmax measures variability of Q matrices, we expect to see a positive relationship between the Dirichlet variance used to generate the Q matrix and our estimate of its variability, FST/FSTmax.

Simulation results, depicting the 500 values of FST/FSTmax for each of the 45 choices of the Dirichlet variance $\sigma^2 = \mathrm{Var}[q_1] = \mathrm{Var}[q_2] = \left(\frac{2}{3} \times \frac{1}{3}\right)/(\alpha + 1)$, appear in Figure 3. In the figure, the relationship between FST/FSTmax and $\sigma^2$ is strongly linear, with slope 4.5.

FIGURE 3.

Linear relationship between FST/FSTmax and $\mathrm{Var}[q_1] = \mathrm{Var}[q_2]$, the variance across individuals of individual membership coefficients under a Dirichlet distribution. For each of 45 values of $\mathrm{Var}[q_1] = \mathrm{Var}[q_2]$, 500 points are plotted, each representing a random Q matrix with dimensions 50 × 2. Rows of the Q matrix are simulated using a Dirichlet distribution with means $\lambda = (\lambda_1, \lambda_2) = (2/3, 1/3)$ and variances $\mathrm{Var}[q_1] = \mathrm{Var}[q_2] = \lambda_1\lambda_2/(\alpha + 1)$, with α chosen to produce variances 0.001, 0.005, 0.01, 0.015, …, 0.22. Each Q matrix gives rise to an associated value of FST/FSTmax, plotted on the vertical axis. A regression line fit to the 500 × 45 points with intercept 0 has slope 4.5, or $1/(\lambda_1\lambda_2) = 1/\left(\frac{2}{3} \times \frac{1}{3}\right)$, and it explains 99% of the variability in FST/FSTmax. Grey lines mark the 2.5th and 97.5th percentiles, and thus contain 95% of the points.

Noticing that the empirical slope, 4.5, was the reciprocal of 2/9, the upper bound of the Dirichlet variance, we sought to obtain a mathematical relationship between $E[F_{ST}/F_{ST}^{\max}; \alpha, \lambda_1, \lambda_2, I]$, the expectation of FST/FSTmax under the Dirichlet model, and the parametric variance of each membership coefficient in the model. This calculation, performed in the Appendix, confirms the relationship (Equation A11)

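$$
E\left[\frac{F_{ST}}{F_{ST}^{\max}}; \alpha, \lambda_1, \lambda_2, I\right] \approx \frac{1}{\alpha + 1} = \frac{1}{\lambda_1\lambda_2}\,\frac{\lambda_1\lambda_2}{\alpha + 1} = \frac{1}{\lambda_1\lambda_2}\,\mathrm{Var}[q_1] = \frac{1}{\lambda_1\lambda_2}\,\mathrm{Var}[q_2], \tag{2}
$$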

where $1/(\lambda_1\lambda_2) = 4.5$ in the example plotted in Figure 3. Thus, the simulations and an analytical calculation confirm that in a simple Dirichlet model, the FST/FSTmax measure has a linear relationship with the parametric variance across sampled individuals of membership coefficient $q_1$ (or $q_2$). Importantly, the expected value of FST/FSTmax in Equation 2 is independent of the parametric mean membership coefficients, depending only on the Dirichlet parameter α, which controls variability. This result supports the use of FST/FSTmax to measure variability in populations that possess different mean membership coefficients.
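
The relationship in Equation 2 can be checked with a small simulation. The sketch below is an independent illustration in plain Python, not the authors' simulation code: the helper names, the seed and the choice of 300 replicates at a single α are assumptions, and only the $\sigma_1 \ge 1$ cases of Equation 1 are handled.

```python
import math
import random


def simulate_q_matrix(n_individuals, alpha, lam, rng):
    """Draw I rows from Dir(alpha * lam) by normalizing independent gamma variates."""
    rows = []
    while len(rows) < n_individuals:
        gammas = [rng.gammavariate(alpha * l, 1.0) for l in lam]
        total = sum(gammas)
        if total > 0.0:  # redraw in the rare event of underflow
            rows.append([g / total for g in gammas])
    return rows


def fst_ratio(q_matrix):
    """FST/FSTmax for a Q matrix with sigma1 = I*M >= 1 and H_T > 0."""
    n, k = len(q_matrix), len(q_matrix[0])
    h_s = 1.0 - sum(sum(q * q for q in row) for row in q_matrix) / n
    means = [sum(row[c] for row in q_matrix) / n for c in range(k)]
    h_t = 1.0 - sum(m * m for m in means)
    sigma1 = n * max(means)
    whole = math.floor(sigma1)
    frac = sigma1 - whole
    if frac == 0.0:
        bound = 1.0
    else:
        num = n * (n - 1) - sigma1 ** 2 + whole - 2 * (n - 1) * frac + (2 * n - 1) * frac ** 2
        den = n * (n - 1) - sigma1 ** 2 - whole + 2 * sigma1 - frac ** 2
        bound = num / den
    return (h_t - h_s) / h_t / bound


alpha = 341 / 99          # parametric variance 0.05 when lambda = (2/3, 1/3)
lam = [2 / 3, 1 / 3]
rng = random.Random(2022)
ratios = [fst_ratio(simulate_q_matrix(50, alpha, lam, rng)) for _ in range(300)]
mean_ratio = sum(ratios) / len(ratios)
print(round(mean_ratio, 3), round(1 / (alpha + 1), 3))  # simulated mean vs 1/(alpha + 1)
```

Rerunning with a different λ and the same α should leave the simulated mean close to the same value, in line with the independence from λ noted above.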

3.1.3. Visual illustration of values of FST/FSTmax

Continuing with the Dirichlet simulations, we next sought to visually illustrate the relationship of FST/FSTmax to the variance and mean of membership coefficients. We considered Q matrices with four different values of α, representing four levels of parametric variance in membership coefficients, and two different vectors for the parametric mean membership coefficients λ. For each of the eight settings (four variances, two means), we considered two Q matrices.

These eight simulated pairs of Q matrices are visualized in Figure 4a,d, where they are coloured according to the value of the α parameter used to simulate them. For the lowest‐variability case ($\alpha_1$, red), the simulated individual membership coefficients show little deviation from the mean, $\lambda = (2/3, 1/3)$ for Figure 4a and $(9/10, 1/10)$ for Figure 4d. As the parametric variance increases ($\alpha_2$, purple; $\alpha_3$, blue), variance in membership coefficients is increasingly visible. For the highest‐variability case ($\alpha_4$, green), membership coefficients are centred on $(\lambda_1, \lambda_2) = (1, 0)$ for approximately 2/3 or 9/10 of the individuals, and on $(\lambda_1, \lambda_2) = (0, 1)$ for the remaining individuals.

FIGURE 4.

Dependence of bootstrap distributions of FST/FSTmax for simulated Q matrices on the Dirichlet variance parameter α, rather than the Dirichlet mean λ. (a, d) Q matrices simulated using specified Dir(αλ) distributions. (b, c) Bootstrap distributions of FST/FSTmax for Q matrices from (a) and (d), plotted directly below or above the corresponding matrix. In both (a) and (d), eight matrices were simulated, two for each of four values of α selected to span the range of parametric variances: $\alpha_1 = 21901/99$, $\alpha_2 = 341/99$, $\alpha_3 = 101/99$ and $\alpha_4 = 1/99$. Matrices are annotated by associated parametric variances $\sigma^2 = \lambda_1\lambda_2/(\alpha + 1)$. In (a), matrices are simulated with parametric mean $\lambda = (2/3, 1/3)$ and are taken from matrices plotted in Figure 3. In (d), matrices are simulated with a more extreme parametric mean, $\lambda = (9/10, 1/10)$. Each vertical bar represents an individual membership coefficient vector $(q_1, q_2)$; the proportion of each bar coloured a darker shade represents $q_1$ and the proportion in a lighter shade corresponds to $q_2$. The parametric variance of a Q matrix, $\sigma^2 = \lambda_1\lambda_2/(\alpha + 1)$, ranges in $(0, 2/9)$ for $\lambda = (2/3, 1/3)$ and in $(0, 0.09)$ for $\lambda = (9/10, 1/10)$. The empirical variance $s^2$ is computed for each matrix using the sample mean $\bar{q} = \left(\frac{1}{I}\sum_{i=1}^{I} q_{1i}, \frac{1}{I}\sum_{i=1}^{I} q_{2i}\right)$ in place of the parametric mean λ. The values of FST/FSTmax for the eight matrices in (a) are 0.004 and 0.005 for the two simulated with $\alpha_1$, 0.203 and 0.230 for $\alpha_2$, 0.496 and 0.461 for $\alpha_3$, and 1.000 and 0.997 for $\alpha_4$. The values of FST/FSTmax for the eight matrices in (d) are 0.003 and 0.005 for $\alpha_1$, 0.157 and 0.287 for $\alpha_2$, 0.539 and 0.571 for $\alpha_3$, and 1.000 and 1.000 for $\alpha_4$. In (b) and (c), each bootstrap distribution includes 1000 bootstrap samples of the I = 50 individuals in the associated Q matrix.

Bootstrap distributions of FST/FSTmax appear in Figure 4b for $\lambda = (2/3, 1/3)$ and in Figure 4c for $\lambda = (9/10, 1/10)$. In these panels, we observe that FST/FSTmax increases from the lowest‐variability case ($\alpha_1$) to the highest‐variability case ($\alpha_4$), in accord with the interpretation that FST/FSTmax measures variability in membership coefficients. As α increases (membership variability across individuals decreases), the variance of FST/FSTmax across bootstrap samples decreases; this pattern is driven by the fact that the rows of a high‐α (low‐variability) Q matrix are very similar, so bootstrap‐sampled matrices drawn from this matrix will necessarily also be similar to one another.

Comparing Figure 4b with Figure 4c, we observe that the value of FST/FSTmax is similar between matrices simulated with the same Dirichlet α parameter, irrespective of the mean membership coefficient vectors (λ) used to simulate the matrices. This pattern accords with the interpretation that FST/FSTmax is driven by the variance of membership coefficients and not the mean, as reflected in the analytical result in Equation 2 that under the Dirichlet model, $E[F_{ST}/F_{ST}^{\max}; \alpha, \lambda, I]$ can be written so that it depends on α but not on λ.

In fact, in some cases, matrices simulated with the same value of α but different means (λ) are more similar than matrices simulated with both the same α and the same means. We tested all $\binom{16}{2} = 120$ pairwise comparisons of the 16 bootstrap distributions in Figure 4 and found that nearly all pairs of distributions were significantly different (Wilcoxon rank‐sum tests, $p < 10^{-6}$). Interestingly, the only two pairs that were not significantly different were pairs with the same α but different means: the left‐hand $\alpha_1$ distribution with mean $(2/3, 1/3)$ in Figure 4b and the right‐hand $\alpha_1$ distribution with mean $(9/10, 1/10)$ in Figure 4c (Wilcoxon rank‐sum test, p = .270), and the left‐hand $\alpha_3$ distribution with mean $(2/3, 1/3)$ in Figure 4b and the left‐hand $\alpha_3$ distribution with mean $(9/10, 1/10)$ in Figure 4c (Wilcoxon rank‐sum test, p = .002). That pairs with the same α and different means can have the same FST/FSTmax, while pairs with different α and either the same or different means have different FST/FSTmax, underscores the point that FST/FSTmax can be used to compare the variability of Q matrices with quite different mean membership.

We also observe in Figure 4a,d that the sampling variability of features of Q matrices simulated from the Dirichlet distribution with identical parameters (as reflected in comparisons of pairs of matrices of the same colour within a panel) increases as α decreases. We confirm in Figure S1 that the variability in the mean membership coefficients $(\bar{q}_1, \bar{q}_2)$ of simulated Q matrices increases as the α value used to simulate the matrices decreases (i.e. as the Dirichlet variance increases). This increased variability in sampled Q matrix mean memberships leads to increased variability among sampled Q matrix membership variances (Figure S2). Sampling variability can lead Q matrices simulated with the same parameter values to possess quite different sample means and variances, as is the case particularly for the two pairs of matrices simulated with $\alpha_4$ in Figure 4d. Despite this sampling variability of Q matrices under the Dirichlet model, we observe that FST/FSTmax, which is largely driven by the underlying parameter α, is relatively stable across pairs of Q matrices.

3.2. Data examples

To illustrate the application of FSTruct, we apply the method to data examples that represent each of three distinct scenarios in which ancestry variability is of interest: (1) ancestry comparisons of admixed and nonadmixed populations, (2) ancestry comparisons of populations representing different time periods or spatial locations and (3) ancestry comparisons of distinct data sets corresponding to different sets of loci for the same individuals.

3.2.1. Admixed populations

A characteristic feature of recently admixed populations is that individuals vary greatly in their ancestry, with some individuals possessing most of their ancestry from one source population, and others possessing most of their ancestry from another source (Gravel, 2012; Verdu & Rosenberg, 2011). Thus, in examining inferred cluster memberships, admixed populations might be expected to give rise to greater variability in ancestry than nonadmixed populations.

We therefore evaluated FST/FSTmax in three populations from an admixture analysis performed by Verdu et al. (2017). The populations include an admixed population from Cape Verde, and Gambian and Iberian populations taken to represent African and European sources for the admixed population. The inferred genetic structure for the three populations is redrawn in Figure 5a.

FIGURE 5.

Variability of ancestry in admixed and nonadmixed populations. (a) K = 4 admixture analysis of Gambian (n = 109), Cape Verdean (n = 44) and Iberian (n = 107) samples. Adapted from Verdu et al. (2017). (b) Bootstrap distributions of the ancestry variability measure, FST/FSTmax, for each population (1000 samples)

We computed FST/FSTmax for each of the three populations, measuring ancestry variability of the inferred cluster memberships within each of the three groups. For the nonadmixed source populations, this quantity is 0.078 for the Gambian population and 0.064 for the Iberian population (Figure 5b). The value for the admixed Cape Verdean population is greater, equalling 0.100. Pairs of bootstrap distributions of FST/FSTmax are significantly different ($p < 2 \times 10^{-16}$ for all three pairwise combinations, Wilcoxon rank‐sum test). The admixed Cape Verdean population is indeed observed to have greater variability in ancestry according to the FST/FSTmax measure than the putative source populations, supporting the use of the measure to distinguish clustering patterns in admixed and nonadmixed populations.

3.2.2. Populations over time or space

Geographic movements of populations shape patterns of genetic ancestry for samples collected in different spatial locations or from the same location in different time periods. Locations or time periods whose samples contain individuals from many different sources or from recently admixed populations are expected to have highly variable ancestry, whereas locations or periods in which mixing of disparate populations is less salient are expected to have more homogeneous ancestry.

To explore an example of ancestry variability over time, we evaluated FST/FSTmax in a structure analysis conducted by Antonio et al. (2019) on samples from 29 archaeological sites near Rome spanning the last 12,000 years. These samples represent eight time periods: Mesolithic, Neolithic, Copper Age, Iron Age and Roman Republic, Imperial Rome, Late Antiquity, Medieval and Early Modern, and the present. The plot of the inferred genetic structure for these samples is redrawn in Figure 6a. Antonio et al. (2019) argued, based in part on their version of Figure 6a, that ancestry was variable during the Iron Age and Roman Republic, and highly variable during the Imperial Rome and Late Antiquity periods.

FIGURE 6.

Variability of ancestry over time. (a) K = 5 structure analysis of samples from eight time periods: Mesolithic (n = 3), Neolithic (n = 10), Copper Age (n = 3), Iron Age and Roman Republic (n = 11), Imperial Rome (n = 48), Late Antiquity (n = 24), Medieval and Early Modern (n = 28) and Present (n = 15). Adapted from Antonio et al. (2019). (b) Bootstrap distributions of the ancestry variability measure, FST/FSTmax, for each population (1000 samples)

We computed FST/FSTmax for each time period. This ratio is 0 for the Mesolithic, 0.0131 for the Neolithic, 0.0041 for the Copper Age, 0.0183 for the Iron Age and Roman Republic, 0.0192 for Imperial Rome, 0.0244 for Late Antiquity, 0.0186 for the Medieval and Early Modern period and 0.0011 for modern individuals (Figure 6b). Pairs of bootstrap distributions of FST/FSTmax are significantly different ($p < 2 \times 10^{-9}$ for all 28 pairwise combinations, Wilcoxon rank‐sum test). The numerical results validate the claims of Antonio et al. (2019) of high variability during the Iron Age and Roman Republic, Imperial Rome and Late Antiquity periods. They lend increased granularity to these claims, suggesting that ancestry variability was steadily increasing during these three periods, with a maximum achieved during Late Antiquity.

3.2.3. Different genetic loci in the same samples

The ancestry patterns identified by population structure inference methods are influenced by the choice of loci used for the analysis. When data sets possess few loci, structure is not observed, and individuals have membership coefficients close to $(1/K, 1/K, \ldots, 1/K)$; different individuals possess similar membership coefficients. As the number of loci increases, individuals come to have different membership coefficients, with, for example, individuals from two predefined populations possessing membership primarily in two distinct clusters.

To explore patterns of ancestry variability in data sets of different size, we evaluated FST/FSTmax using results from a structure analysis conducted by Algee‐Hewitt et al. (2016). This study focused on 13 tetranucleotide loci commonly used for individual identification in forensic applications, the ‘codis loci’. In a worldwide human sample, the study compared analyses with the codis loci to analyses with a larger set of 779 non‐codis loci and to analyses with sets of 13 non‐codis tetranucleotide loci. The study claimed that the codis loci have similar ancestry information to sets of 13 non‐codis tetranucleotide loci.

Four ancestry patterns from Algee‐Hewitt et al. (2016), inferred from the same sample of individuals, are replotted in Figure 7. Figure 7a depicts a plot based on the codis loci. Figure 7b plots a ‘null data set’ designed to possess no structure. Figure 7c plots a set of 13 non‐codis tetranucleotide loci, and Figure 7d depicts a plot with 779 loci. The ‘null’ plot shows little structure, the two plots with 13 loci show some structure, and the plot with 779 loci shows substantial structure.

FIGURE 7.

Variability of ancestry for analyses with different loci from the same samples. K = 4 structure analyses of four different sets of loci for a worldwide human sample. Adapted from Algee‐Hewitt et al. (2016). (a) 13 codis tetranucleotide microsatellite loci. (b) A simulated null data set with no population structure. (c) 13 non‐codis tetranucleotide microsatellite loci. (d) Full data set of 779 tetranucleotide loci. (e) Bootstrap distributions of the ancestry variability measure, FST/FSTmax, for each data set (1000 samples)

We computed FST/FSTmax for each analysis, for each plot evaluating variability in ancestry across all individuals within the plot. The ratio is lowest for the null data set, with a value of 0.009. It is 0.100 for both the codis loci and for the 13 non‐codis loci. The ratio is substantially higher for the full 779 loci, with a value of 0.529. Five of the six pairs of bootstrap distributions of FST/FSTmax are significantly different ($p < 2 \times 10^{-16}$, Wilcoxon rank‐sum test), the exception being that the two plots with 13 loci, codis and non‐codis, do not show a significant difference (p = .56). The pattern of FST/FSTmax values, with the smallest value for Figure 7b, intermediate values for Figure 7a,c, and largest value for Figure 7d, captures increasing ancestry variability as the analyses move from a largely unstructured plot (Figure 7b) to partially unstructured plots (Figure 7a,c) to a substantially structured plot (Figure 7d). The lack of a significant difference in FST/FSTmax between the plot for the codis loci and the plot for equally many non‐codis loci supports the claim of Algee‐Hewitt et al. (2016) that the codis loci contain comparable information about ancestry to other sets of loci with the same size.

4. DISCUSSION

We have introduced a measure for quantifying variability across vectors of individual membership coefficients, as produced by population structure inference programs such as structure and admixture. Our measure is based on a mathematical analogy with the population differentiation statistic $F_{ST}$. Whereas $F_{ST}$ traditionally measures variability in allele frequency vectors among populations, we have used $F_{ST}$ to measure variability in membership coefficient vectors among individuals. Because the upper bound of $F_{ST}$ as a function of the frequency of the most frequent allele is usually less than 1, we have employed a normalized version of this statistic, $F_{ST}/F_{ST}^{\max}$, which ranges in $[0, 1]$ for all matrices of membership coefficients and can thus be used to compare ancestry variability among different matrices.

Through both simulation and an analytical calculation under a Dirichlet distribution for membership coefficient vectors, we demonstrated that the expected value of $F_{ST}/F_{ST}^{\max}$ increases with the variance of membership coefficients across individuals (Figures 3 and 4); indeed, in a remarkably simple result, we find that it scales approximately linearly with the parametric variance in a model with $K = 2$ ancestral clusters (Equation 2). This result supports the use of $F_{ST}/F_{ST}^{\max}$ as a measure of variability in ancestry across individuals. Note that although our analytical result that $E[F_{ST}/F_{ST}^{\max}; \alpha, \lambda_1, \lambda_2, I] \approx 1/(\alpha + 1)$ relies on the case of $K = 2$ ancestral clusters, additional simulations with larger $K$ suggest that similar results hold for larger $K$, as such simulations find that the mean $F_{ST}/F_{ST}^{\max}$ values across simulated Q matrices with fixed parameter values match $1/(\alpha + 1)$, irrespective of the value of $K$ (Figure S3).

We have proposed that the FST/FSTmax measure can be used in a statistical test of the equality of ancestry variability between two Q matrices by generating bootstrap samples of the individuals in each Q matrix, computing FST/FSTmax for each bootstrap‐sampled matrix and comparing bootstrap distributions of FST/FSTmax using a Wilcoxon rank‐sum test. In analysing our simulated and empirical data, this test performed appropriately. It distinguished between matrices with meaningfully distinct variabilities, such as between matrices simulated with different Dirichlet α parameter values (Figure 4). It notably failed to find a significant difference in a case where the true variabilities of the Q matrices were similar, with the Q matrices representing ancestry inferred using two sets of 13 loci (Figure 7). To further support the use of this bootstrap test, we include supplementary figures that demonstrate that under the null hypothesis, p‐values for the test have the appropriate uniform distribution; this result is seen in simulations that consider different numbers of bootstrap replicates (Figure S4), different numbers of clusters (Figure S5) and different numbers of individuals (Figure S6).
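The bootstrap comparison described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the FSTruct implementation: it is restricted to K = 2 clusters so that it can use the K = 2 expression for FST/FSTmax given in Appendix A, the function names (`fst_ratio_k2`, `bootstrap_ratios`, `rank_sum_p`) are hypothetical, the two groups are simulated from Beta distributions, and the rank‐sum p‐value uses a plain normal approximation without tie or continuity corrections.

```python
import math
import random

def fst_ratio_k2(q1):
    """F_ST/F_ST^max for K = 2, from memberships q1 in one cluster (Equation A4).

    Raises ZeroDivisionError if every individual has membership 1 in one cluster.
    """
    total = sum(q1)                       # IM
    if total < len(q1) / 2.0:             # track the more prevalent cluster
        q1 = [1.0 - q for q in q1]
        total = sum(q1)
    n = len(q1)                           # I
    mean = total / n                      # M
    frac = total - math.floor(total)      # fractional part {IM}
    numerator = sum(q * q for q in q1) - n * mean ** 2
    denominator = math.floor(total) + frac ** 2 - n * mean ** 2
    return numerator / denominator

def bootstrap_ratios(q1, n_reps, rng):
    """Resample individuals with replacement; recompute the ratio for each replicate."""
    n = len(q1)
    return [fst_ratio_k2([q1[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_reps)]

def rank_sum_p(a, b):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    # Mann-Whitney U: count pairs with a-value above b-value, ties count one half.
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    n1, n2 = len(a), len(b)
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical data: two groups of 100 individuals with the same mean
# membership (0.7) but different variability (alpha = 1 versus alpha = 9).
rng = random.Random(42)
group_a = [rng.betavariate(1 * 0.7, 1 * 0.3) for _ in range(100)]
group_b = [rng.betavariate(9 * 0.7, 9 * 0.3) for _ in range(100)]
boot_a = bootstrap_ratios(group_a, 200, rng)
boot_b = bootstrap_ratios(group_b, 200, rng)
print("p-value:", rank_sum_p(boot_a, boot_b))
```

Because the p‐value is computed from bootstrap replicates rather than from independent samples, the number of replicates affects how small it can become; the sketch only illustrates the mechanics of the procedure.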

The expected value of $F_{ST}/F_{ST}^{\max}$ behaves sensibly as the number of individuals, $I$, increases (Figure S7). In particular, simulated values of $E[F_{ST}/F_{ST}^{\max}; \alpha, \lambda, I]$ remain constant with $I$: as the number of simulated individuals increases at a fixed variability of membership, the mean $F_{ST}/F_{ST}^{\max}$ across simulations remains the same and the variance of $F_{ST}/F_{ST}^{\max}$ decreases. More generally, we have seen that $F_{ST}/F_{ST}^{\max}$ does not depend on the mean membership of the Q matrices under analysis, which makes it well suited to comparing the ancestry variabilities of populations with different mean memberships. To clarify, the test of equality of $F_{ST}/F_{ST}^{\max}$ values cannot be used to assess the equality of mean membership among Q matrices; it compares their variability, not their mean membership.

We demonstrated the use of the FST/FSTmax measure in data sets exemplifying three scenarios in which ancestry variability is of particular interest. In a comparison of ancestry measured in admixed and nonadmixed populations by Verdu et al. (2017), we found that the recently admixed Cape Verdean population exhibited greater variability in ancestry, as measured by FST/FSTmax, than did nonadmixed populations (Figure 5). In a comparison of ancestries measured in different time periods in the same location, we provided quantitative support for a claim of Antonio et al. (2019) that certain eras in ancient Rome possessed more variable ancestry than others (Figure 6). Finally, in a comparison of different sets of loci studied in the same individuals, we found quantitative support both for the observation of Algee‐Hewitt et al. (2016) that ancestry variability across individuals was similar for two different sets of 13 loci, and for an increase in ancestry variability in high‐resolution data compared to data of lower resolution. In all three cases, our analyses provided quantitative support for claims previously argued primarily by qualitative observation.

Because the FST/FSTmax measure depends on Q matrices, limitations of the methods used to generate the Q matrices extend to its calculation. For example, if individuals were mislabelled prior to analysis with methods such as structure or admixture, then our measure would be affected. Further, Q matrices generated by structure and admixture do not contain information about the magnitude of the difference between ancestral clusters; our measure only captures variation in ancestry with respect to the clusters that such programs infer.

The new measure, which we have implemented in the r package FSTruct, contributes to a body of methods for quantitative analysis of inferred membership coefficients. This collection of methods includes computations useful for analysing the level of support observed for different numbers of clusters K (Alexander & Lange, 2011; Evanno et al., 2005) and methods of aligning the clustering solutions observed in replicate analyses (Behr et al., 2016; Jakobsson & Rosenberg, 2007; Kopelman et al., 2015), as well as software for graphical display (Ramasamy et al., 2014; Rosenberg, 2004) and for managing files and workflows associated with the analysis (Earl & VonHoldt, 2012; Francis, 2017).

A number of other studies have considered related but distinct problems in assessing variability of ancestry based on membership fractions. Rosenberg et al. (2005) described a ‘clusteredness’ statistic that measures the extent to which individuals are placed into single clusters rather than across multiple clusters. This statistic is maximal if each individual possesses a permutation of the membership vector $(1, 0, \ldots, 0)$ and minimal if all individuals possess membership vector $(1/K, 1/K, \ldots, 1/K)$. Kerminen et al. (2021) evaluated the Shannon entropy applied to individual‐level membership vectors, assessing variation in time in the Shannon entropy for study participants with different birth years. Whereas both the clusteredness statistic of Rosenberg et al. (2005) and the Shannon entropy statistic of Kerminen et al. (2021) consider variability of the ancestry coefficients of single individuals, our FST/FSTmax measure examines variability of ancestry coefficient vectors across individuals. Thus, for example, comparing individuals in corresponding matrices in Figure 4a,d, clusteredness increases (and Shannon entropy decreases) as the membership of the highest‐membership cluster increases from Figure 4a to Figure 4d. However, FST/FSTmax, measuring variability across individuals, is similar in corresponding matrices in the two panels, reflecting the visual similarity between panels of the interindividual patterns.
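The distinction between within‐individual and across‐individual statistics can be illustrated with a small Python sketch. The two Q matrices below are hypothetical toy examples with K = 2, not data from the paper, and `fst_ratio_k2` is an illustrative helper based on the K = 2 expression in Appendix A. In both matrices every individual has the same Shannon entropy, yet only the first matrix shows variability across individuals.

```python
import math

def shannon_entropy(q):
    """Shannon entropy (natural log) of one individual's membership vector."""
    return -sum(x * math.log(x) for x in q if x > 0)

def fst_ratio_k2(q1):
    """F_ST/F_ST^max for K = 2, from memberships q1 in one cluster (Equation A4)."""
    total = sum(q1)                       # IM
    if total < len(q1) / 2.0:             # track the more prevalent cluster
        q1 = [1.0 - q for q in q1]
        total = sum(q1)
    n = len(q1)
    mean = total / n
    frac = total - math.floor(total)      # fractional part {IM}
    numerator = sum(q * q for q in q1) - n * mean ** 2
    denominator = math.floor(total) + frac ** 2 - n * mean ** 2
    return numerator / denominator

# Hypothetical Q matrices (rows are individuals, columns are clusters).
mixed = [(0.9, 0.1), (0.9, 0.1), (0.1, 0.9), (0.1, 0.9)]   # individuals differ
uniform = [(0.9, 0.1)] * 4                                  # individuals identical

for name, q_matrix in (("mixed", mixed), ("uniform", uniform)):
    entropies = [shannon_entropy(row) for row in q_matrix]
    ratio = fst_ratio_k2([row[0] for row in q_matrix])
    print(name, "mean entropy:", sum(entropies) / len(entropies), "ratio:", ratio)
```

The mean per‐individual entropy is the same for both matrices, whereas the across‐individual ratio is about 0.64 for the first matrix and essentially 0 for the second.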

We note that in addition to analysing the Q‐matrices produced by population structure inference programs such as structure and admixture, FSTruct can quantify variability in any matrix whose rows sum to 1. Applications are potentially numerous. For example, single‐cell sequencing technologies have enabled the identification and quantification of cell populations within tissues, revealing different patterns of variation, with some tissues containing few cell populations, while others are more diverse (Wang et al., 2019). Our method enables comparisons of the variability of within‐tissue cell populations, where tissues are analogous to individuals and cell populations are analogous to cluster memberships. Our method could also be applied to quantify variability among individuals of features such as mutational signatures, where the proportion of mutations belonging to a mutational type is analogous to a cluster membership (Alexandrov et al., 2013; Rahbari et al., 2016).
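For such applications, raw counts first need to be converted into a matrix whose rows sum to 1. The Python sketch below shows one simple way to do this; the function name `to_proportions` and the cell counts are hypothetical, and other normalizations may be more appropriate for particular data types.

```python
def to_proportions(counts):
    """Convert rows of non-negative counts into rows of proportions summing to 1."""
    rows = []
    for row in counts:
        total = sum(row)
        if total <= 0:
            raise ValueError("each row needs at least one positive count")
        rows.append([c / total for c in row])
    return rows

# Hypothetical counts: rows are tissues, columns are cell populations.
cell_counts = [
    [120, 30, 50],
    [10, 180, 10],
    [70, 60, 70],
]
for row in to_proportions(cell_counts):
    print(row, sum(row))
```

Each resulting row plays the role of one individual's membership vector, with tissues analogous to individuals and cell populations analogous to clusters.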

AUTHOR CONTRIBUTIONS

MLM, NA and NAR designed the study and performed the theoretical analysis. MLM conducted the simulations, analysed the data and wrote the software. NAR supervised the study. All authors wrote the manuscript.

CONFLICT OF INTEREST

The authors have no conflicts of interest to report.

Supporting information

Appendix S1

ACKNOWLEDGEMENTS

We acknowledge support from NIH grant R01 HG005855 and NSF grant BCS‐2116322. MLM acknowledges support from a National Science Foundation Graduate Research Fellowship and the Anne T. and Robert M. Bass Stanford Graduate Fellowship. We thank P. Verdu, M. Antonio and M. Edge for assistance with data sets from their studies, H. Moots and J. Pritchard for helpful comments on the method, and D. Cotter and J. Mooney for suggestions for the software. Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer/World Health Organization.

APPENDIX A.

A.1.  

In this appendix, we evaluate the approximate expected value of $F_{ST}/F_{ST}^{\max}$ calculated for a sample of $I$ individuals $i = 1, 2, \ldots, I$, each with $K = 2$ membership coefficients $(q_{1i}, q_{2i})$ drawn independently from the Dirichlet distribution $\mathrm{Dir}(\alpha\lambda_1, \alpha\lambda_2)$. We use the notation $E[F_{ST}/F_{ST}^{\max}; \alpha, \lambda_1, \lambda_2, I]$ or simply $E[F_{ST}/F_{ST}^{\max}]$ to denote this expectation.

A.2. Overview

To obtain the expectation $E[F_{ST}/F_{ST}^{\max}]$, we first sample $I$ independent and identically distributed $\mathrm{Dir}(\alpha\lambda_1, \alpha\lambda_2)$ random variables $(q_{1i}, q_{2i})$, where $q_{1i}$ is the membership coefficient of individual $i$ in cluster 1, $q_{2i}$ is the membership coefficient of individual $i$ in cluster 2, and $q_{1i} + q_{2i} = 1$. We assume that the sample size $I$ is large.

We assume without loss of generality that the parametric mean membership coefficient for cluster 1 is at least as large as that for cluster 2; that is, $\lambda_1 \geq \lambda_2$. As $I \to \infty$, by the strong law of large numbers (Serfling, 1980, section 1.8), the sample mean membership coefficient for cluster 1, $\frac{1}{I}\sum_{i=1}^{I} q_{1i}$, converges almost surely to the parametric mean $\lambda_1$, and the sample mean membership coefficient for cluster 2, $\frac{1}{I}\sum_{i=1}^{I} q_{2i}$, converges almost surely to the parametric mean $\lambda_2$. Hence, for large $I$, the probability approaches 1 that $\frac{1}{I}\sum_{i=1}^{I} q_{1i} \geq \frac{1}{I}\sum_{i=1}^{I} q_{2i}$. As a result, because we consider large $I$, we assume that the cluster with the greater parametric mean membership coefficient, cluster 1, also has the greater sample mean membership coefficient. We denote this sample mean, the mean membership of cluster 1 in a simulated population, by $M = \frac{1}{I}\sum_{i=1}^{I} q_{1i}$. By definition, $M \geq \frac{1}{2}$. As stated above, $M \xrightarrow{a.s.} \lambda_1$ as $I \to \infty$.

The quantity whose expectation we wish to evaluate under the model, $F_{ST}/F_{ST}^{\max}$, is a function of the sampled membership coefficients, $(q_{11}, q_{21}), (q_{12}, q_{22}), \ldots, (q_{1I}, q_{2I})$. We let $f(q_{1i}, q_{2i})$ represent the Dirichlet probability density for $(q_{1i}, q_{2i})$; because we are considering vectors with two components, the Dirichlet reduces to a Beta distribution (Kotz et al., 2000, p. 487),

$$f(q_{1i}, q_{2i}) = \frac{q_{1i}^{\alpha\lambda_1 - 1}\, q_{2i}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)}, \tag{A1}$$

where

$$B(\alpha\lambda_1, \alpha\lambda_2) = \frac{\Gamma(\alpha\lambda_1)\,\Gamma(\alpha\lambda_2)}{\Gamma(\alpha\lambda_1 + \alpha\lambda_2)} = \frac{\Gamma(\alpha\lambda_1)\,\Gamma(\alpha\lambda_2)}{\Gamma(\alpha)}, \tag{A2}$$

and $\Gamma$ is the gamma function.

The $(q_{1i}, q_{2i})$ are independent and identically distributed. Hence, the expectation is

$$E\!\left[\frac{F_{ST}}{F_{ST}^{\max}}; \alpha, \lambda_1, \lambda_2, I\right] = \int_{q_{11}=0}^{1} \int_{q_{12}=0}^{1} \cdots \int_{q_{1I}=0}^{1} \frac{F_{ST}}{F_{ST}^{\max}}(q_{11}, q_{12}, \ldots, q_{1I}) \times f(q_{11}, q_{21})\, f(q_{12}, q_{22}) \cdots f(q_{1I}, q_{2I})\; dq_{11}\, dq_{12} \cdots dq_{1I}. \tag{A3}$$

With this expression in hand, we proceed by writing the expression for $F_{ST}/F_{ST}^{\max}$ in terms of the membership coefficients $(q_{1i}, q_{2i})$ and the sample size $I$. We then compute the integral, making use of the Dirichlet parameters $\alpha$, $\lambda_1$ and $\lambda_2$.

A.3. Approximating FST/FSTmax under the Dirichlet model

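The value of $F_{ST}/F_{ST}^{\max}$ calculated for a population of $I$ individuals with membership coefficients $(q_{11}, q_{21}), \ldots, (q_{1I}, q_{2I})$ can be written using Equations 3 and 5 of Alcala and Rosenberg (2017),

$$F_{ST} = \frac{\frac{1}{I}\sum_{i=1}^{I} q_{1i}^2 - M^2}{M(1 - M)}, \qquad F_{ST}^{\max} = \frac{\lfloor IM \rfloor + \{IM\}^2 - IM^2}{IM(1 - M)}.$$

We obtain

$$\frac{F_{ST}(q_{11}, q_{12}, \ldots, q_{1I})}{F_{ST}^{\max}(q_{11}, q_{12}, \ldots, q_{1I})} = \frac{\sum_{i=1}^{I} q_{1i}^2 - IM^2}{\lfloor IM \rfloor + \{IM\}^2 - IM^2}. \tag{A4}$$

Recall that $M = \frac{1}{I}\sum_{i=1}^{I} q_{1i}$ is the sample mean membership of the most prevalent ancestral cluster, assuming that the cluster with the greater parametric mean membership is also the cluster with the greater sample mean membership.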
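As a concrete illustration of these K = 2 expressions, the following Python sketch computes the two statistics and their ratio from a list of memberships in one cluster. It is a sketch rather than the FSTruct implementation: the function name `fst_and_max_k2` is hypothetical, and the relabelling step simply enforces the convention that the tracked cluster is the one with the larger sample mean.

```python
import math

def fst_and_max_k2(q1):
    """Return (F_ST, F_ST^max, F_ST/F_ST^max) for K = 2 clusters.

    q1[i] is individual i's membership in one cluster; membership in the
    other cluster is 1 - q1[i].
    """
    total = sum(q1)
    if total < len(q1) / 2.0:             # relabel so that M >= 1/2
        q1 = [1.0 - q for q in q1]
        total = sum(q1)
    n = len(q1)                           # I, the number of individuals
    mean = total / n                      # M
    if mean >= 1.0:
        raise ValueError("F_ST is undefined when all membership is in one cluster")
    frac = total - math.floor(total)      # fractional part {IM}
    mean_sq = sum(q * q for q in q1) / n
    fst = (mean_sq - mean ** 2) / (mean * (1.0 - mean))
    fst_max = (math.floor(total) + frac ** 2 - n * mean ** 2) / (total * (1.0 - mean))
    return fst, fst_max, fst / fst_max

# Two individuals with memberships 1 and 0.5 in cluster 1 (M = 0.75).
print(fst_and_max_k2([1.0, 0.5]))
# Four identical individuals: no variability across individuals.
print(fst_and_max_k2([0.75, 0.75, 0.75, 0.75]))
```

In the first example both statistics equal 1/3, so the ratio is 1 even though the unnormalized statistic is far below 1; in the second, the unnormalized statistic and the ratio are both 0.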

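We now make an approximation to the denominator of Equation A4. Because $\lfloor IM \rfloor = IM - \{IM\}$, $\lfloor IM \rfloor + \{IM\}^2 = IM - \{IM\} + \{IM\}^2 = IM - \delta$, where the error term $\delta = \{IM\}(1 - \{IM\})$ lies in $[0, \frac{1}{4}]$, taking its maximal value of $\frac{1}{4}$ when $\{IM\} = \frac{1}{2}$. For large sample size $I$, because $M \geq \frac{1}{2}$ and $\delta \leq \frac{1}{4}$, $IM \gg \delta$, so that $\lfloor IM \rfloor + \{IM\}^2 \approx IM$. Thus, we substitute $IM$ in place of $\lfloor IM \rfloor + \{IM\}^2$ in Equation A4, obtaining

$$\frac{F_{ST}}{F_{ST}^{\max}}(q_{11}, q_{12}, \ldots, q_{1I}) \approx \frac{\sum_{i=1}^{I} q_{1i}^2 - IM^2}{IM(1 - M)}. \tag{A5}$$

This assumption is equivalent to setting $F_{ST}^{\max} = 1$.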
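The size of this approximation error can be checked numerically. The Python sketch below compares the exact and approximate denominators for a fixed mean membership and increasing sample sizes; the function name `denominator_error` and the choice M = 0.75 with I = 2, 22, 202, 2002 are illustrative, picked so that the fractional part of IM is 1/2 and the error term takes its largest possible value.

```python
import math

def denominator_error(n, mean):
    """Return (exact, approx, delta) for the denominator of Equation A4.

    exact  = floor(IM) + {IM}^2 - I*M^2
    approx = IM - I*M^2 = IM(1 - M)
    delta  = {IM}(1 - {IM}), the difference approx - exact
    """
    total = n * mean                          # IM
    frac = total - math.floor(total)          # fractional part {IM}
    exact = math.floor(total) + frac ** 2 - n * mean ** 2
    approx = total - n * mean ** 2
    delta = frac * (1.0 - frac)
    return exact, approx, delta

# Worst case for delta: sample sizes chosen so that {IM} = 1/2 when M = 0.75.
for n in (2, 22, 202, 2002):
    exact, approx, delta = denominator_error(n, 0.75)
    print(n, delta, (approx - exact) / exact)
```

The absolute error never exceeds 1/4, so the relative error printed in the last column shrinks as the number of individuals grows.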

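To find an approximation for $E[F_{ST}/F_{ST}^{\max}]$, it is convenient to make a further approximation in Equation A5, substituting $M$ with $\lambda_1$. We justify this substitution by proving that as $I \to \infty$,

$$\frac{\sum_{i=1}^{I} q_{1i}^2 - IM^2}{IM(1 - M)} \xrightarrow{a.s.} \frac{\sum_{i=1}^{I} q_{1i}^2 - I\lambda_1^2}{I\lambda_1(1 - \lambda_1)}. \tag{A6}$$

Subtracting the right‐hand side from the left‐hand side, proving Equation A6 is equivalent to proving

$$\left(\frac{1}{I}\sum_{i=1}^{I} q_{1i}^2\right)\left(\frac{1}{M(1 - M)} - \frac{1}{\lambda_1(1 - \lambda_1)}\right) + \left(-\frac{M}{1 - M} + \frac{\lambda_1}{1 - \lambda_1}\right) \xrightarrow{a.s.} 0. \tag{A7}$$

If $X_n \xrightarrow{a.s.} X$ and $Y_n \xrightarrow{a.s.} Y$, then $X_n + Y_n \xrightarrow{a.s.} X + Y$ (Grimmett & Stirzaker, 2001a, p. 336, exercise 2; Grimmett & Stirzaker, 2001b, p. 354, exercise 2), so the sum of two terms that converge almost surely to 0 also converges almost surely to 0. Hence, it suffices to separately prove almost sure convergence to 0 of the two terms summed in Equation A7.

For the right‐hand term, we use the continuous mapping theorem, which states that for a continuous function $g$ and a random vector $X_n$, if $X_n \xrightarrow{a.s.} X$, then $g(X_n) \xrightarrow{a.s.} g(X)$ (van der Vaart, 1998, p. 7, Theorem 2.3). We consider the continuous function $g(x) = -x/(1 - x) + \lambda_1/(1 - \lambda_1)$ and recall that as $I \to \infty$, $M \xrightarrow{a.s.} \lambda_1$. It follows that as $I \to \infty$ and $M \xrightarrow{a.s.} \lambda_1$, $g(M) \xrightarrow{a.s.} g(\lambda_1)$; that is, $-M/(1 - M) + \lambda_1/(1 - \lambda_1) \xrightarrow{a.s.} 0$.

For the left‐hand term, the factor $1/[M(1 - M)] - 1/[\lambda_1(1 - \lambda_1)]$ converges almost surely to 0 by the continuous mapping theorem with $g(x) = 1/[x(1 - x)] - 1/[\lambda_1(1 - \lambda_1)]$. By the strong law of large numbers, $\frac{1}{I}\sum_{i=1}^{I} q_{1i}^2$ converges almost surely to $\gamma = E[q_{1i}^2] = \mathrm{Var}(q_{1i}) + \lambda_1^2 = \lambda_1(1 - \lambda_1)/(\alpha + 1) + \lambda_1^2 = \lambda_1(\alpha\lambda_1 + 1)/(\alpha + 1)$ (Serfling, 1980, Theorem B). If $X_n \xrightarrow{a.s.} X$ and $Y_n \xrightarrow{a.s.} Y$, then $X_n Y_n \xrightarrow{a.s.} XY$ (Grimmett & Stirzaker, 2001a, p. 336, exercise 2; Grimmett & Stirzaker, 2001b, p. 354, exercise 2), so that the left‐hand term of Equation A7 converges almost surely to $\gamma \times 0 = 0$.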
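For readers who wish to verify the value of the second moment used here, the intermediate steps can be written out using the standard variance of a Beta distribution. The shorthand $a = \alpha\lambda_1$ and $b = \alpha\lambda_2$, with $a + b = \alpha$ because $\lambda_1 + \lambda_2 = 1$, is introduced only for this worked step.

$$\mathrm{Var}(q_{1i}) = \frac{ab}{(a + b)^2 (a + b + 1)} = \frac{\alpha^2 \lambda_1 \lambda_2}{\alpha^2 (\alpha + 1)} = \frac{\lambda_1 (1 - \lambda_1)}{\alpha + 1},$$

$$E[q_{1i}^2] = \frac{\lambda_1 (1 - \lambda_1)}{\alpha + 1} + \lambda_1^2 = \frac{\lambda_1 - \lambda_1^2 + \alpha\lambda_1^2 + \lambda_1^2}{\alpha + 1} = \frac{\lambda_1 (\alpha\lambda_1 + 1)}{\alpha + 1}.$$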

A.4. Evaluating EFST/FSTmax under the Dirichlet model

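Inserting our expression for $f(q_{1i}, q_{2i})$ from Equation A1 and our expression for $F_{ST}/F_{ST}^{\max}$ from Equation A6 into Equation A3 allows us to write an approximate expression for the expectation of $F_{ST}/F_{ST}^{\max}$ given the parameters of the Dirichlet distribution and the sample size $I$:

$$E\!\left[\frac{F_{ST}}{F_{ST}^{\max}}; \alpha, \lambda_1, \lambda_2, I\right] \approx \int_{q_{11}=0}^{1} \int_{q_{12}=0}^{1} \cdots \int_{q_{1I}=0}^{1} \frac{\sum_{i=1}^{I} q_{1i}^2 - I\lambda_1^2}{I\lambda_1(1 - \lambda_1)} \times \frac{q_{11}^{\alpha\lambda_1 - 1}\, q_{21}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)} \times \frac{q_{12}^{\alpha\lambda_1 - 1}\, q_{22}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)} \times \cdots \times \frac{q_{1I}^{\alpha\lambda_1 - 1}\, q_{2I}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)}\; dq_{11}\, dq_{12} \cdots dq_{1I}. \tag{A8}$$

Examining the quantity $\sum_{i=1}^{I} q_{1i}^2 - I\lambda_1^2$, we observe that Equation A8 can be decomposed as a sum of $I + 1$ terms, one for each of the $q_{1i}^2$ terms, and one for the $I\lambda_1^2$ term. Assign the first $I$ of these separate terms the labels $L_1, L_2, \ldots, L_I$ and the $I\lambda_1^2$ term the label $L^*$, so that $E[F_{ST}/F_{ST}^{\max}; \alpha, \lambda_1, \lambda_2, I] \approx L_1 + L_2 + \cdots + L_I + L^*$.

We begin by evaluating the term $L^*$, which can be written

$$L^* = -\frac{I\lambda_1^2}{I\lambda_1(1 - \lambda_1)} \prod_{i=1}^{I} \int_{q_{1i}=0}^{1} \frac{q_{1i}^{\alpha\lambda_1 - 1}\, q_{2i}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)}\, dq_{1i}.$$

Recalling that $q_{2i} = 1 - q_{1i}$, we observe that the integrand $q_{1i}^{\alpha\lambda_1 - 1} q_{2i}^{\alpha\lambda_2 - 1} / B(\alpha\lambda_1, \alpha\lambda_2)$ is simply the Beta probability density function, which integrates to one. Hence, the product evaluates to 1 and $L^*$ simply equals a constant:

$$L^* = -\frac{I\lambda_1^2}{I\lambda_1(1 - \lambda_1)} = -\frac{\lambda_1}{1 - \lambda_1}. \tag{A9}$$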

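We next evaluate the $L_i$ terms. For each $i$ in $1, 2, \ldots, I$,

$$L_i = \frac{1}{I\lambda_1(1 - \lambda_1)} \int_{q_{1i}=0}^{1} \frac{q_{1i}^{\alpha\lambda_1 + 1}\, q_{2i}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)}\, dq_{1i} \prod_{\substack{j=1 \\ j \neq i}}^{I} \int_{q_{1j}=0}^{1} \frac{q_{1j}^{\alpha\lambda_1 - 1}\, q_{2j}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)}\, dq_{1j}.$$

As was the case for $L^*$, the integrand of the integral inside the product is the Beta probability density function, so the product evaluates to one. Thus,

$$L_i = \frac{1}{I\lambda_1(1 - \lambda_1)} \int_{q_{1i}=0}^{1} \frac{q_{1i}^{\alpha\lambda_1 + 1}\, q_{2i}^{\alpha\lambda_2 - 1}}{B(\alpha\lambda_1, \alpha\lambda_2)}\, dq_{1i}.$$

The remaining integral can be evaluated by noting that $\int_0^1 x^{a - 1}(1 - x)^{b - 1}\, dx = B(a, b)$. We employ this identity to simplify $L_i$, obtaining

$$L_i = \frac{1}{I\lambda_1(1 - \lambda_1)} \cdot \frac{B(\alpha\lambda_1 + 2, \alpha\lambda_2)}{B(\alpha\lambda_1, \alpha\lambda_2)}.$$

By Equation A2 and the property of gamma functions $\Gamma(z + 1) = z\Gamma(z)$, this expression simplifies to

$$L_i = \frac{\alpha\lambda_1 + 1}{I(1 - \lambda_1)(\alpha + 1)}. \tag{A10}$$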
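The simplification of the ratio of Beta functions can be spelled out as follows. The shorthand $a = \alpha\lambda_1$ and $b = \alpha\lambda_2$, with $a + b = \alpha$, is introduced only for this worked step; each equality applies the definition of the Beta function or $\Gamma(z + 1) = z\Gamma(z)$ twice.

$$\frac{B(a + 2, b)}{B(a, b)} = \frac{\Gamma(a + 2)\,\Gamma(b)}{\Gamma(a + b + 2)} \cdot \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} = \frac{a(a + 1)}{(a + b)(a + b + 1)} = \frac{\alpha\lambda_1(\alpha\lambda_1 + 1)}{\alpha(\alpha + 1)} = \frac{\lambda_1(\alpha\lambda_1 + 1)}{\alpha + 1}.$$

Dividing by $I\lambda_1(1 - \lambda_1)$ cancels the factor $\lambda_1$ and gives $(\alpha\lambda_1 + 1) / [I(1 - \lambda_1)(\alpha + 1)]$.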

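We now combine Equations A9 and A10 to complete the calculation in Equation A8, noting that $L_i$ does not depend on $i$, so that each $L_i$ follows Equation A10.

$$E\!\left[\frac{F_{ST}}{F_{ST}^{\max}}; \alpha, \lambda_1, \lambda_2, I\right] \approx L_1 + L_2 + \cdots + L_I + L^* = I\,\frac{\alpha\lambda_1 + 1}{I(1 - \lambda_1)(\alpha + 1)} - \frac{\lambda_1}{1 - \lambda_1} = \frac{1}{\alpha + 1}. \tag{A11}$$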
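The final result can be checked by Monte Carlo simulation. The Python sketch below is not the simulation code used for the paper's figures; the function names, the number of individuals (200), the number of replicates (200) and the parameter grid are all illustrative assumptions. It draws K = 2 memberships from a Beta distribution with parameters αλ1 and α(1 − λ1), computes the exact ratio of Equation A4 for each replicate, and compares the mean with 1/(α + 1).

```python
import math
import random

def fst_ratio_k2(q1):
    """Exact F_ST/F_ST^max for K = 2 (Equation A4), from memberships in one cluster."""
    total = sum(q1)                       # IM
    if total < len(q1) / 2.0:             # track the more prevalent cluster
        q1 = [1.0 - q for q in q1]
        total = sum(q1)
    n = len(q1)
    mean = total / n
    frac = total - math.floor(total)      # fractional part {IM}
    numerator = sum(q * q for q in q1) - n * mean ** 2
    denominator = math.floor(total) + frac ** 2 - n * mean ** 2
    return numerator / denominator

def mean_ratio(alpha, lam1, n_individuals, n_reps, rng):
    """Average the ratio over n_reps simulated populations."""
    total = 0.0
    for _ in range(n_reps):
        q1 = [rng.betavariate(alpha * lam1, alpha * (1.0 - lam1))
              for _ in range(n_individuals)]
        total += fst_ratio_k2(q1)
    return total / n_reps

rng = random.Random(1)
for alpha in (1.0, 4.0):
    for lam1 in (0.5, 0.7, 0.9):
        simulated = mean_ratio(alpha, lam1, 200, 200, rng)
        print(alpha, lam1, round(simulated, 3), "expected approx.", 1.0 / (alpha + 1.0))
```

Under these settings the simulated means should lie close to 1/(α + 1) for every value of λ1, in line with the claim that the expectation does not depend on the mean membership.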

Morrison, M. L. , Alcala, N. , & Rosenberg, N. A. (2022). FSTruct: An F ST‐based tool for measuring ancestry variation in inference of population structure. Molecular Ecology Resources, 22, 2614–2626. 10.1111/1755-0998.13647

Handling Editor: Nick Hamilton Barton

Contributor Information

Maike L. Morrison, Email: maikem@stanford.edu.

Nicolas Alcala, Email: alcalan@iarc.fr.

DATA AVAILABILITY STATEMENT

The FSTruct r package is available for download from https://github.com/MaikeMorrison/FSTruct. The introductory vignette is linked from the package README file and provides a guide to use of the package. The Q matrices visualized in Figures 5, 6, 7 are available as supplemental files.

REFERENCES

  1. Alcala, N., & Rosenberg, N. A. (2017). Mathematical constraints on $F_{ST}$: Biallelic markers in arbitrarily many populations. Genetics, 206, 1581–1600.
  2. Alcala, N., & Rosenberg, N. A. (2019). $G'_{ST}$, Jost's D, and $F_{ST}$ are similarly constrained by allele frequencies: A mathematical, simulation, and empirical study. Molecular Ecology, 28, 1624–1636.
  3. Alcala, N., & Rosenberg, N. A. (2022). Mathematical constraints on $F_{ST}$: Multiallelic markers in arbitrarily many populations. Philosophical Transactions of the Royal Society B, 377, 20200414.
  4. Alexander, D. H., & Lange, K. (2011). Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics, 12, 246.
  5. Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model‐based estimation of ancestry in unrelated individuals. Genome Research, 19, 1655–1664.
  6. Alexandrov, L. B., Nik‐Zainal, S., Wedge, D. C., Aparicio, S. A. J. R., Behjati, S., Biankin, A. V., Bignell, G. R., Bolli, N., Borg, A., Børresen‐Dale, A. L., Boyault, S., Burkhardt, B., Butler, A. P., Caldas, C., Davies, H. R., Desmedt, C., Eils, R., Eyfjörd, J. E., Foekens, J. A., … Stratton, M. R. (2013). Signatures of mutational processes in human cancer. Nature, 500, 415–421.
  7. Algee‐Hewitt, B. F., Edge, M. D., Kim, J., Li, J. Z., & Rosenberg, N. A. (2016). Individual identifiability predicts population identifiability in forensic microsatellite markers. Current Biology, 26, 935–942.
  8. Antonio, M. L., Gao, Z., Moots, H. M., Lucci, M., Candilio, F., Sawyer, S., Oberreiter, V., Calderon, D., Devitofranceschi, K., Aikens, R. C., Aneli, S., Bartoli, F., Bedini, A., Cheronet, O., Cotter, D. J., Fernandes, D. M., Gasperetti, G., Grifoni, R., Guidi, A., … Pritchard, J. K. (2019). Ancient Rome: A genetic crossroads of Europe and the Mediterranean. Science, 366, 708–714.
  9. Behr, A. A., Liu, K. Z., Liu‐Fang, G., Nakka, P., & Ramachandran, S. (2016). Pong: Fast analysis and visualization of latent clusters in population genetic data. Bioinformatics, 32, 2817–2823.
  10. Corander, J., Marttinen, P., Sirén, J., & Tang, J. (2008). Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics, 9, 539.
  11. Corander, J., Waldmann, P., Marttinen, P., & Sillanpää, M. J. (2004). BAPS 2: Enhanced possibilities for the analysis of genetic population structure. Bioinformatics, 20, 2363–2369.
  12. Earl, D. A., & VonHoldt, B. M. (2012). STRUCTURE HARVESTER: A website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources, 4, 359–361.
  13. Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: A simulation study. Molecular Ecology, 14, 2611–2620.
  14. Falush, D., Stephens, M., & Pritchard, J. K. (2003). Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics, 164, 1567–1587.
  15. Falush, D., Stephens, M., & Pritchard, J. K. (2007). Inference of population structure using multilocus genotype data: Dominant markers and null alleles. Molecular Ecology Notes, 7, 574–578.
  16. Francis, R. M. (2017). Pophelper: An R package and web app to analyse and visualize population structure. Molecular Ecology Resources, 17, 27–32.
  17. Gravel, S. (2012). Population genetics models of local ancestry. Genetics, 191, 607–619.
  18. Grimmett, G. R., & Stirzaker, D. R. (2001a). One thousand exercises in probability. Oxford University Press.
  19. Grimmett, G. R., & Stirzaker, D. R. (2001b). Probability and random processes (3rd ed.). Oxford University Press.
  20. Guillot, G., & Orlando, L. (2017). Population structure. Oxford Bibliographies. 10.1093/obo/9780199941728-0057
  21. Hubisz, M. J., Falush, D., Stephens, M., & Pritchard, J. K. (2009). Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources, 9, 1322–1332.
  22. Huelsenbeck, J. P., & Andolfatto, P. (2007). Inference of population structure under a Dirichlet process model. Genetics, 175, 1787–1802.
  23. Jakobsson, M., Edge, M. D., & Rosenberg, N. A. (2013). The relationship between $F_{ST}$ and the frequency of the most frequent allele. Genetics, 193, 515–528.
  24. Jakobsson, M., & Rosenberg, N. A. (2007). CLUMPP: A cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 23, 1801–1806.
  25. Kerminen, S., Cerioli, N., Pacauskas, D., Havulinna, A. S., Perola, M., Jousilahti, P., Salomaa, V., Daly, M. J., Vyas, R., Ripatti, S., & Pirinen, M. (2021). Changes in the fine‐scale genetic structure of Finland through the 20th century. PLoS Genetics, 17, e1009347.
  26. Kopelman, N. M., Mayzel, J., Jakobsson, M., Rosenberg, N. A., & Mayrose, I. (2015). CLUMPAK: A program for identifying clustering modes and packaging population structure inference across K. Molecular Ecology Resources, 15, 1179–1191.
  27. Kotz, S., Balakrishnan, N., & Johnson, N. (2000). Continuous multivariate distributions, volume 1: Models and applications (2nd ed.). Wiley.
  28. Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155, 945–959.
  29. Rahbari, R., Wuster, A., Lindsay, S. J., Hardwick, R. J., Alexandrov, L. B., Turki, S. A., Dominiczak, A., Morris, A., Porteous, D., Smith, B., Stratton, M. R., UK10K Consortium, & Hurles, M. E. (2016). Timing, rates and spectra of human germline mutation. Nature Genetics, 48, 126–133.
  30. Ramasamy, R. K., Ramasamy, S., Bindroo, B. B., & Naik, V. G. (2014). STRUCTURE PLOT: A program for drawing elegant STRUCTURE bar plots in user friendly interface. Springerplus, 3, 431.
  31. Rosenberg, N. A. (2004). DISTRUCT: A program for the graphical display of population structure. Molecular Ecology Notes, 4, 137–138.
  32. Rosenberg, N. A., Mahajan, S., Ramachandran, S., Zhao, C., Pritchard, J. K., & Feldman, M. W. (2005). Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics, 1, 660–671.
  33. Serfling, R. (1980). Approximation theorems of mathematical statistics. Wiley.
  34. van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press.
  35. Verdu, P., Jewett, E. M., Pemberton, T. J., Rosenberg, N. A., & Baptista, M. (2017). Parallel trajectories of genetic and linguistic admixture in a genetically admixed creole population. Current Biology, 27, 2529–2535.
  36. Verdu, P., & Rosenberg, N. A. (2011). A general mechanistic model for admixture histories of hybrid populations. Genetics, 189, 1413–1426.
  37. Wang, X., Park, J., Susztak, K., Zhang, N. R., & Li, M. (2019). Bulk tissue cell type deconvolution with multi‐subject single‐cell expression reference. Nature Communications, 10, 380.

Articles from Molecular Ecology Resources are provided here courtesy of Wiley
