Abstract
The site frequency spectrum is an important summary statistic in population genetics used for inference on demographic history and selection. However, estimation of the site frequency spectrum from called genotypes introduces bias when working with low-coverage sequencing data. Methods exist for addressing this issue but sometimes suffer from 2 problems. First, they can have very high computational demands, to the point that it may not be possible to run estimation for genome-scale data. Second, existing methods are prone to overfitting, especially for multidimensional site frequency spectrum estimation. In this article, we present a stochastic expectation–maximization algorithm for inferring the site frequency spectrum from NGS data that addresses these challenges. We show that this algorithm greatly reduces runtime and enables estimation with constant, trivial RAM usage. Furthermore, the algorithm reduces overfitting and thereby improves downstream inference. An implementation is available at github.com/malthesr/winsfs.
Keywords: site frequency spectrum, next-generation sequencing, expectation–maximization, genotype likelihoods, low-coverage data, demographic history
Introduction
The site frequency spectrum (SFS) is the joint distribution of allele frequencies among 1 or more populations, and it serves as an important summary statistic in population genetics. For instance, the SFS is sufficient for computing nucleotide diversity (Korneliussen et al. 2013), $F_{ST}$ (Bhatia et al. 2013), and f-statistics (Peter 2016). Furthermore, the SFS may be used for inferring demographic history (Marth et al. 2004; Gutenkunst et al. 2009; Excoffier et al. 2013) and selection (Tajima 1989; Fay and Wu 2000; Nielsen et al. 2005).
When working with high-quality data, it is usually straightforward to estimate the SFS from called genotypes. However, when genotype calls are uncertain, standard methods lead to significant bias in the estimated SFS (Nielsen et al. 2011), which propagates to downstream inference (Han et al. 2013). In particular, this situation arises when working with next-generation sequencing (NGS) data at low coverage and may be compounded by additional data-quality issues. Low-coverage NGS data are sometimes the only available option, for instance when working with ancient DNA (Olalde et al. 2019; Margaryan et al. 2020; van der Valk et al. 2021). Sequencing at low coverage is also a popular choice to reduce sequencing costs, since most of the key population genetics analyses remain possible with such data (Lou et al. 2021).
To estimate the SFS from low-coverage data, several methods have been proposed, which account for the genotype uncertainty in estimation of the SFS (Li 2011; Nielsen et al. 2011). These are based on finding the SFS that maximizes the data likelihood using numeric optimization. Two factors combine to create a computational challenge for such methods. First, to achieve an accurate estimate of the SFS, these methods usually require many iterations, each of which requires a full pass over the input data. Second, unlike most genetics analyses, the SFS cannot be based on only the small subset of the variable sites but must consider all sites. Taken together, this means that some summary of the full data must be held in RAM and iterated over many times. For genome-scale NGS data from more than a few dozen samples, or in more than 1 dimension, this is often not computationally feasible, as tens of hours of runtime and hundreds of gigabytes of RAM may be required. Current approaches for dealing with this issue restrict the analysis to fewer individuals and/or smaller regions of the genome (Sánchez-Barreiro et al. 2021), leading to less accurate results.
An additional problem with current methods is that they are prone to overfitting. In the multidimensional setting in particular, there is often very little information available for many of the entries in the frequency spectrum. Therefore, by considering the full data set, existing algorithms risk fitting noise, leading to estimates with poor generalizability.
In this article, we present a novel version of the stochastic expectation–maximization (EM) algorithm for estimation of the SFS from NGS data. In each pass through the data, this algorithm updates the SFS estimate multiple times in smaller blocks of sites. We show that for low-coverage whole genome sequencing (WGS) data, this algorithm requires only a few full passes over the data. This considerably decreases running time and means that it is possible to estimate the SFS using constant, negligible RAM usage by streaming data from disk. Moreover, by only considering smaller subsets of the data at a time, we show that this method reduces overfitting, which in turn leads to improved downstream inference.
Materials and methods
Estimation of the SFS from low-coverage sequencing data requires precomputing site allele frequency (SAF) likelihoods for each site, and these are based on genotype likelihoods. We begin by briefly reviewing these concepts.
Genotype likelihoods
Assume we have NGS data X sampled from K different populations (indexed by k), with $N_k$ individuals in the kth population. Furthermore, say that we have M diallelic sites (indexed by m), so that $G_{mkn} \in \{0, 1, 2\}$ is the genotype of a diploid individual n at site m in population k, coding genotypes as the number of derived alleles. In the same way, we use $X_{mkn}$ to refer to the sequencing data at this location.
We define the genotype likelihood as the probability of the data given a particular genotype. Genotype likelihoods form the basis of genotype calling and are calculated from aligned sequencing reads by various bioinformatic tools including bcftools/samtools (Li et al. 2009; Danecek et al. 2021), GATK (McKenna et al. 2010), and ANGSD (Korneliussen et al. 2014), using slightly different models. For clarity, we outline the basic GATK model below, though the choice of model is not important for our purposes.
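For D sequencing reads aligned to position m for individual n in population k, let $b_d$ be the base call of the dth read. Assuming independence of base calls, we have

$$p(X_{mkn} \mid G_{mkn}) = \prod_{d=1}^{D} p(b_d \mid G_{mkn}). \tag{1}$$

If we consider the genotype as 2 alleles $a_1, a_2 \in \{0, 1\}$ such that $G_{mkn} = a_1 + a_2$, then by random sampling of the parental alleles,

$$p(b_d \mid G_{mkn}) = \frac{1}{2} p(b_d \mid a_1) + \frac{1}{2} p(b_d \mid a_2). \tag{2}$$

In turn, this probability is modeled by

$$p(b_d \mid a) = \begin{cases} 1 - \epsilon_d & \text{if } b_d = a, \\ \epsilon_d / 3 & \text{otherwise,} \end{cases} \tag{3}$$

where $\epsilon_d$ is the sequencing error probability associated with the dth base.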
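As a concrete illustration of Equations (1)–(3), the sketch below computes the three genotype likelihoods for a single individual at a single site. It is a minimal example rather than the GATK implementation: the function name `genotype_likelihoods` and the base coding (0 for the ancestral allele, 1 for the derived allele, any other integer for a base matching neither allele) are assumptions made for illustration.

```python
import numpy as np

def genotype_likelihoods(bases, error_probs):
    """Return p(X | G = g) for g = 0, 1, 2 derived alleles.

    bases: base call per read, coded 0 (ancestral), 1 (derived), or any
        other integer for a base matching neither allele (assumed coding).
    error_probs: per-base sequencing error probabilities (epsilon_d).
    """
    bases = np.asarray(bases)
    eps = np.asarray(error_probs, dtype=float)
    # Equation (3): p(b_d | a) for allele a = 0 and allele a = 1.
    p_given_allele = np.stack(
        [np.where(bases == a, 1.0 - eps, eps / 3.0) for a in (0, 1)]
    )  # shape (2, D)
    likelihoods = np.empty(3)
    for g, (a1, a2) in enumerate([(0, 0), (0, 1), (1, 1)]):
        # Equation (2): each read samples one of the two alleles at random.
        per_read = 0.5 * p_given_allele[a1] + 0.5 * p_given_allele[a2]
        # Equation (1): base calls are independent, so multiply over reads.
        likelihoods[g] = per_read.prod()
    return likelihoods
```

For deep data a log-scale version would be preferable to avoid underflow; the product form is kept here to mirror the equations.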
SAF likelihoods
Using genotype likelihoods, we can calculate SAF likelihoods, also sometimes known as sample allele frequency likelihoods. It is possible to think of the SAF likelihoods as the generalization of genotype likelihoods from individuals to populations: instead of asking about the probability of the data for 1 individual given a genotype, we ask about the probability of the data for a population given the sum of their derived alleles.
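More formally, define the sum of derived alleles for population k at site m,

$$Z_{mk} = \sum_{n=1}^{N_k} G_{mkn}, \tag{4}$$

with each $Z_{mk} \in \{0, 1, \dots, 2N_k\}$ corresponding to possible sample frequencies $\{0, 1/(2N_k), \dots, 1\}$. Now define the SAF likelihood for a single population k,

$$p(X_{mk} \mid Z_{mk} = j_k) = \sum_{g \in \{0,1,2\}^{N_k}} p(G_{mk} = g \mid Z_{mk} = j_k) \prod_{n=1}^{N_k} p(X_{mkn} \mid G_{mkn} = g_n), \tag{5}$$

where $X_{mk}$ is the data for all individuals sampled in population k at site m, $p(G_{mk} = g \mid Z_{mk} = j_k)$ is the combinatorial probability of the genotype vector $G_{mk} = (G_{mk1}, \dots, G_{mkN_k})$ conditional on the sum of the genotypes being $j_k$, and $p(X_{mkn} \mid G_{mkn} = g_n)$ is a standard genotype likelihood. Using a dynamic programming algorithm, SAF likelihoods can be calculated from the genotype likelihoods of N individuals in $O(N^2)$ time per site (Nielsen et al. 2012), and a linear time approximation has also been given (Han et al. 2014).
To extend this to the multidimensional SFS with K populations, let $\mathcal{Z} = \{0, 1, \dots, 2N_1\} \times \dots \times \{0, 1, \dots, 2N_K\}$ be the set of possible derived allele count combinations across populations, let $X_m$ be the data across all individuals in all populations at site m, and define $Z_m = (Z_{m1}, \dots, Z_{mK}) \in \mathcal{Z}$. Then

$$p(X_m \mid Z_m = j) = \prod_{k=1}^{K} p(X_{mk} \mid Z_{mk} = j_k) \tag{6}$$

is the joint SAF likelihood for K populations.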
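The quadratic-time dynamic programming idea can be sketched as follows. This is an illustrative version, not the ANGSD code: it assumes the combinatorial probability of a genotype vector given its sum is the product of binomial(2, g) terms divided by binomial(2N, j), and the function name `saf_likelihoods` is hypothetical.

```python
import numpy as np
from math import comb

def saf_likelihoods(gls):
    """SAF likelihoods p(X | Z = j) for j = 0..2N at one site.

    gls: (N, 3) array of genotype likelihoods for one population.
    """
    gls = np.asarray(gls, dtype=float)
    n = gls.shape[0]
    # h[j]: weighted sum over genotype vectors of the individuals seen so
    # far that carry j derived alleles in total.
    h = np.zeros(2 * n + 1)
    h[0] = 1.0
    weights = (1.0, 2.0, 1.0)  # binomial(2, g) for g = 0, 1, 2
    for i in range(n):
        new = np.zeros_like(h)
        for g in range(3):
            # Adding an individual with genotype g shifts the count by g;
            # before this individual only counts 0..2i are reachable.
            new[g:g + 2 * i + 1] += weights[g] * gls[i, g] * h[:2 * i + 1]
        h = new
    # Normalize by the number of ways to place j derived alleles among 2N.
    return h / np.array([comb(2 * n, j) for j in range(2 * n + 1)])
```

Each individual adds three shifted copies of the running vector, which gives the quadratic cost per site mentioned above.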
Site frequency spectrum
Using the definition of $\mathcal{Z}$ above, we define the SFS as a parameter $\phi = \{\phi_j : j \in \mathcal{Z}\}$ such that $\phi_j$ is the probability that $Z_m = j$. That is,

$$\phi_j = p(Z_m = j \mid \phi) \tag{7}$$

for any site m. In other words, the SFS is the probability of a particular vector of derived allele sums at a site chosen at random.
When genotypes are available, the SFS can be estimated simply by counting observed allele count combinations. When genotypes cannot be called, the standard approach is maximum likelihood estimation.
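When genotypes are known, the counting estimator amounts to a histogram over allele count combinations. The sketch below assumes genotypes are supplied as one integer array per population; the function name and input layout are illustrative choices.

```python
import numpy as np

def sfs_from_genotypes(genotypes_by_pop):
    """Estimate the SFS by counting allele count combinations.

    genotypes_by_pop: list of K integer arrays, each of shape (M, N_k),
        holding the number of derived alleles (0, 1, 2) per site and
        individual.
    """
    shape = tuple(2 * g.shape[1] + 1 for g in genotypes_by_pop)
    # Derived allele sum per site in each population, as in Equation (4).
    counts = tuple(np.asarray(g).sum(axis=1) for g in genotypes_by_pop)
    sfs = np.zeros(shape)
    np.add.at(sfs, counts, 1.0)  # tally sites per combination
    return sfs / sfs.sum()       # normalize counts to probabilities
```

For example, a single population with 2 individuals and per-site genotypes (0, 1), (2, 2), and (0, 0) has allele sums 1, 4, and 0, so the estimate places one third of the mass on each of those bins.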
Assuming independence of sites, we write the likelihood function
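$$p(X \mid \phi) = \prod_{m=1}^{M} p(X_m \mid \phi) = \prod_{m=1}^{M} \sum_{j \in \mathcal{Z}} p(X_m \mid Z_m = j) \, \phi_j, \tag{8}$$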
where Xm refers to all sequencing data for site m. Note that the likelihood can be expressed solely in terms of joint SAF likelihoods.
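To make the last point concrete, the following sketch builds joint SAF likelihoods as an outer product over populations [Equation (6)] and evaluates the likelihood of Equation (8) on the log scale. The array layout (sites along the first axis) and the function names are assumptions for illustration.

```python
import numpy as np

def joint_saf(saf_by_pop):
    """Joint SAF likelihoods as an outer product over populations.

    saf_by_pop: list of K arrays, each of shape (M, 2 * N_k + 1).
    Returns an array of shape (M, 2 * N_1 + 1, ..., 2 * N_K + 1).
    """
    joint = saf_by_pop[0]
    for saf in saf_by_pop[1:]:
        # Append a new axis for this population and broadcast per site.
        new_shape = (saf.shape[0],) + (1,) * (joint.ndim - 1) + (saf.shape[1],)
        joint = joint[..., np.newaxis] * saf.reshape(new_shape)
    return joint

def log_likelihood(joint, sfs):
    """Log of Equation (8): sum over sites of log sum_j p(X_m | j) * sfs_j."""
    axes = tuple(range(1, joint.ndim))
    return np.log((joint * sfs).sum(axis=axes)).sum()
```

Materializing the joint array for all sites is only practical for small inputs; it is done here purely to keep the correspondence with the equations visible.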
The maximum likelihood estimate $\hat{\phi}$ cannot be found analytically. Instead, $\hat{\phi}$ is typically estimated using some iterative procedure such as BFGS (Nielsen et al. 2012) or an EM algorithm (Li 2011; Korneliussen et al. 2014), of which the latter has become the standard choice. An overview of this algorithm is given below. For details and proof, see Supplementary Section 1.
Standard EM algorithm
Before optimization, we precompute the SAF likelihoods for all sites, populations, and possible sample frequencies. In addition, we make an arbitrary initial guess of the SFS $\hat{\phi}^{(0)}$. The EM algorithm then alternates between an E step and an M step.
The E step consists of computing posterior probabilities of derived allele counts conditional on the current SFS estimate,

$$q_{mj}^{(t)} = p(Z_m = j \mid X_m, \hat{\phi}^{(t)}) = \frac{p(X_m \mid Z_m = j) \, \hat{\phi}_j^{(t)}}{\sum_{i \in \mathcal{Z}} p(X_m \mid Z_m = i) \, \hat{\phi}_i^{(t)}} \tag{9}$$

for all sites $m \in \{1, \dots, M\}$ and possible derived allele counts $j \in \mathcal{Z}$. Note that this conditional posterior depends only on the current SFS estimate and the (joint) SAF likelihoods.
Using the result of the E step, the M step updates the estimate by setting

$$\hat{\phi}_j^{(t+1)} = \frac{1}{M} \sum_{m=1}^{M} q_{mj}^{(t)} \tag{10}$$

for all $j \in \mathcal{Z}$.
The EM algorithm guarantees a monotonically increasing likelihood of successive values of $\hat{\phi}^{(t)}$. The runtime of the algorithm is linear in the number of iterations required before convergence, with each iteration taking $O(M \prod_{k=1}^{K} N_k)$ time. In practice, the standard implementation is realSFS (Nielsen et al. 2012) from the software suite ANGSD (Korneliussen et al. 2014), which uses a generic EM acceleration scheme (Varadhan and Roland 2008). The details of this acceleration are not important in this context, so we omit them.
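A plain, unaccelerated version of this EM iteration can be sketched in a few lines. This is not the realSFS implementation and it omits the acceleration scheme; it assumes the joint SAF likelihoods fit in memory as an (M, J) array with the allele count combinations flattened along the second axis, and the names `em_step` and `standard_em` as well as the default tolerance are illustrative.

```python
import numpy as np

def em_step(saf, sfs):
    """One EM iteration.

    saf: (M, J) joint SAF likelihoods; sfs: (J,) current estimate.
    Returns the updated estimate and the log-likelihood of the input one.
    """
    weighted = saf * sfs                      # numerator of Equation (9)
    site_lik = weighted.sum(axis=1)           # denominator of Equation (9)
    posterior = weighted / site_lik[:, None]  # E step
    log_lik = np.log(site_lik).sum()          # Equation (8) on the log scale
    return posterior.mean(axis=0), log_lik    # M step, Equation (10)

def standard_em(saf, sfs, tol=1e-4, max_epochs=100):
    """Iterate until the log-likelihood improves by less than tol."""
    previous = -np.inf
    for _ in range(max_epochs):
        sfs, log_lik = em_step(saf, sfs)
        if log_lik - previous < tol:
            break
        previous = log_lik
    return sfs
```

The per-site likelihood is already available as the denominator of the E step, so the log-likelihood comes at almost no extra cost.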
Window EM algorithm
As in standard EM, we precompute all SAF likelihoods and make an arbitrary initial guess of the SFS. In addition, we choose 2 hyperparameters B (the number of blocks) and W (the window size). Before starting optimization, all site indices are randomly assigned to one of B blocks $\mathcal{B}_1, \dots, \mathcal{B}_B$ with $|\mathcal{B}_b| = \lfloor M/B \rfloor$ for $b < B$, and $|\mathcal{B}_B| = M - (B - 1) \lfloor M/B \rfloor$. The reason for doing so is simply to break patterns of linkage disequilibrium in particular blocks of input data, which will make the SFS within each block more similar to the global SFS. Blocks are nonoverlapping and exhaustive, so that $\mathcal{B}_a \cap \mathcal{B}_b = \emptyset$ for $a \neq b$ and $\bigcup_{b=1}^{B} \mathcal{B}_b = \{1, \dots, M\}$.
After this initialization, the window EM algorithm is defined as an iterative procedure that alternates between an E step and an M step, where the M step in turn is split into an M1 step and an M2 step.
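The E step of the algorithm involves computing posteriors conditional on the current estimate of the SFS, much like standard EM. The difference is that we only process a single block of sites. Let $b_t = t \bmod B + 1$, so that $b_t \in \{1, \dots, B\}$ for $t \in \{0, 1, \dots\}$. Then, at time step t, we compute $q_{mj}^{(t)}$ for all $m \in \mathcal{B}_{b_t}$ and all possible derived allele counts $j \in \mathcal{Z}$ using Equation (9).
In the M1 step, the qs for the current block are used to give a block SFS estimate $\hat{\psi}_{b_t}$. This is analogous to the standard M step [Equation (10)], so that for each $j \in \mathcal{Z}$

$$\hat{\psi}_{b_t j} = \frac{1}{|\mathcal{B}_{b_t}|} \sum_{m \in \mathcal{B}_{b_t}} q_{mj}^{(t)}. \tag{11}$$

These block estimates are then used in the M2 step to update the overall SFS estimate for each $j \in \mathcal{Z}$,

$$\hat{\phi}_j^{(t+1)} = \sum_{i \in \mathcal{W}_t} \frac{|\mathcal{B}_i|}{\sum_{i' \in \mathcal{W}_t} |\mathcal{B}_{i'}|} \, \hat{\psi}_{ij} \doteq \frac{1}{|\mathcal{W}_t|} \sum_{i \in \mathcal{W}_t} \hat{\psi}_{ij}, \tag{12}$$

where $\mathcal{W}_t$ is the window of the W latest block indices at time t. We use $\doteq$ to express equality under the common special case when either $M \bmod B = 0$ or $B \notin \mathcal{W}_t$, so that there are no issues with blocks of unequal sizes in the current window. In this case, the M2 step simplifies to the mean of the past W block estimates.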
Pseudo-code for window EM is given in Algorithm 1, and an illustration comparing window EM to standard EM is shown in Fig. 1.
Algorithm 1.
Window EM algorithm
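Input (1) SAF likelihoods $p(X_{mk} \mid Z_{mk} = j_k)$ for sites $m \in \{1, \dots, M\}$ and $N_k$ individuals in each of populations $k \in \{1, \dots, K\}$, with $j_k \in \{0, \dots, 2N_k\}$. (2) Random, nonoverlapping assignment of site indices from 1 to M into B blocks $\mathcal{B}_1, \dots, \mathcal{B}_B$. (3) Initial SFS estimate $\hat{\phi}^{(0)}$.
Output Estimate $\hat{\phi}$ of the K-dimensional SFS.
Parameters Number of blocks B, number of blocks per window W.
t ← 0
while not converged do
    $b_t$ ← t mod B + 1    ▹ Block index
    for $m \in \mathcal{B}_{b_t}$ do
        for $j \in \mathcal{Z}$ do
            $q_{mj}^{(t)}$ ← $p(X_m \mid Z_m = j) \, \hat{\phi}_j^{(t)} / \sum_{i \in \mathcal{Z}} p(X_m \mid Z_m = i) \, \hat{\phi}_i^{(t)}$    ▹ E step
    $\mathcal{W}_t$ ← {(t − w) mod B + 1 | w ∈ {0, …, min(t, W − 1)}}    ▹ Window indices
    for $j \in \mathcal{Z}$ do
        $\hat{\psi}_{b_t j}$ ← $\frac{1}{|\mathcal{B}_{b_t}|} \sum_{m \in \mathcal{B}_{b_t}} q_{mj}^{(t)}$    ▹ M1 step
        $\hat{\phi}_j^{(t+1)}$ ← $\sum_{i \in \mathcal{W}_t} |\mathcal{B}_i| \, \hat{\psi}_{ij} / \sum_{i' \in \mathcal{W}_t} |\mathcal{B}_{i'}|$    ▹ M2 step
    t ← t + 1
return $\hat{\phi}^{(t)}$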
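The pseudo-code above translates fairly directly into NumPy. The following is a minimal in-memory sketch, not the winsfs implementation (which streams data from disk): it runs for a fixed number of epochs instead of testing convergence, uses 0-based block indices, and splits sites with `np.array_split`, so block sizes differ by at most 1 rather than placing the remainder in the last block. All names are illustrative.

```python
import numpy as np

def window_em(saf, sfs, n_blocks, window, n_epochs, seed=0):
    """Window EM on in-memory data.

    saf: (M, J) joint SAF likelihoods; sfs: (J,) initial estimate.
    """
    rng = np.random.default_rng(seed)
    n_sites, n_bins = saf.shape
    # Random, nonoverlapping, exhaustive assignment of sites to blocks.
    blocks = np.array_split(rng.permutation(n_sites), n_blocks)
    sizes = np.array([len(block) for block in blocks], dtype=float)
    block_sfs = np.zeros((n_blocks, n_bins))
    for t in range(n_epochs * n_blocks):
        b = t % n_blocks                        # block index (0-based)
        weighted = saf[blocks[b]] * sfs         # E step on one block only
        posterior = weighted / weighted.sum(axis=1, keepdims=True)
        block_sfs[b] = posterior.mean(axis=0)   # M1 step, Equation (11)
        # Window of the most recently processed block indices (as a set).
        recent = np.unique(
            [(t - w) % n_blocks for w in range(min(t, window - 1) + 1)]
        )
        # M2 step, Equation (12): size-weighted mean of block estimates.
        sfs = sizes[recent] @ block_sfs[recent] / sizes[recent].sum()
    return sfs
```

With `n_blocks=1` and `window=1` every iteration covers all sites, and the procedure reduces to the standard EM update.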
Fig. 1.
Schematic illustration of the standard and window EM algorithms for input consisting of a single population with N = 3 individuals and M = 50 sites. Sites are shown horizontally, and derived allele frequencies are shown vertically. The precomputed SAF likelihoods are illustrated at the bottom with blocks indicated by dashed lines. Standard EM computes the conditional posterior derived allele counts over all sites (E step) and uses these to update the SFS estimate (M step). Window EM computes the conditional posteriors for a small block of sites (E step), computes a block SFS estimate after each block (M1 step), and updates the overall estimate as a sliding window average (M2 step) of the W past block estimates. In this example, the sites have been split into B = 5 blocks with 10 sites each, and the sliding window covers W = 3 blocks.
In the following, we are interested in comparing standard EM and window EM. For clarity, we will use the term “epoch” to refer to a full pass through the data for either algorithm. In the case of standard EM, an epoch is simply a single iteration; for window EM, an epoch corresponds to B iterations.
Convergence
In the standard EM algorithm, the data log-likelihood [Equation (8)] can typically be evaluated with little computational overhead during the E step. Therefore, a common convergence criterion is based on the difference between the log-likelihood values of successive epochs. That is, let
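$$\mathcal{L}^{(e)} = \log p(X \mid \hat{\phi}^{(e)}), \tag{13}$$

and convergence is reached when $\mathcal{L}^{(e+1)} - \mathcal{L}^{(e)} < \delta$, for some tolerance δ decided ahead of time.
For window EM, the same does not apply, since no full E step is ever taken. However, the likelihood for each block can be calculated cheaply during each block E step. Therefore, we define for epoch $e \in \{1, 2, \dots\}$,

$$\mathcal{L}_W^{(e)} = \sum_{t=(e-1)B}^{eB-1} \frac{1}{|\mathcal{B}_{b_t}|} \sum_{m \in \mathcal{B}_{b_t}} \log \sum_{j \in \mathcal{Z}} p(X_m \mid Z_m = j) \, \hat{\phi}_j^{(t)}, \tag{14}$$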
that is, the sum of log-likelihoods of SFS estimates used over the past epoch, each evaluated in the block for which they were used in a block E step, normalized by block size for convenience.
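The bookkeeping for this quantity can be sketched as below. The helper names are hypothetical, and the stopping rule in `has_converged` is an assumption: it applies the same successive-difference test as for standard EM to the epoch-level sums.

```python
import numpy as np

def block_log_likelihood(saf_block, sfs):
    """Log-likelihood of one block under the estimate used in its E step,
    divided by the number of sites in the block."""
    return np.log((saf_block * sfs).sum(axis=1)).sum() / saf_block.shape[0]

def epoch_criterion(block_values, n_blocks):
    """Sum the normalized block log-likelihoods within each completed epoch.

    block_values: one value per iteration t = 0, 1, ..., in order.
    """
    values = np.asarray(block_values, dtype=float)
    n_epochs = len(values) // n_blocks
    complete = values[:n_epochs * n_blocks]
    return complete.reshape(n_epochs, n_blocks).sum(axis=1)

def has_converged(epoch_values, tol):
    """Assumed stopping rule: improvement between epochs is below tol."""
    if len(epoch_values) < 2:
        return False
    return bool(epoch_values[-1] - epoch_values[-2] < tol)
```

Because each block value is produced as a by-product of the block E step, tracking the criterion needs no additional pass over the data.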
Results
To test the window EM algorithm, we implemented it in the winsfs program, available at github.com/malthesr/winsfs. We compare winsfs to realSFS, which implements the standard EM algorithm and serves as the current state of the art. We adopt 2 complementary approaches for evaluating the performance of winsfs. First, we use 2 different real-world WGS data sets to compare winsfs to realSFS. realSFS has already been validated on simulated data (Han et al. 2014; Korneliussen et al. 2014), and we use split training and test data sets to evaluate any observed differences from winsfs. Second, we use simulated data to validate winsfs under conditions of known truth across a range of data qualities and sample sizes.
Real-world data sets
We tested winsfs and realSFS on 2 real-world WGS data sets of very different quality as described below. An overview is shown in Table 1.
Table 1.
Overview of the input training data.
| Population | Individuals | Sites | Median depth (range) | Contigs | N50 | $F_{ST}$ |
|---|---|---|---|---|---|---|
| Human | | | | 52 | $1.5 \times 10^{8}$ | 0.13 |
| YRI | 10 | $1.17 \times 10^{9}$ | 5.0× (3.2×–7.2×) | | | |
| CEU | 10 | $1.17 \times 10^{9}$ | 6.2× (2.9×–8.0×) | | | |
| Impala | | | | 7,811 | $3.4 \times 10^{5}$ | 0.24 |
| Maasai Mara | 12 | $6.34 \times 10^{8}$ | 2.8× (1.4×–10.2×) | | | |
| Shangani | 8 | $6.34 \times 10^{8}$ | 2.9× (2.6×–16.8×) | | | |
We first analyzed 10 random individuals from each of the YRI (Yoruba Nigerian) and CEU (Europeans in Utah) populations from the 1000 Genomes Project (The 1000 Genomes Project Consortium 2015). These human data were sequenced to 3×–8× coverage and mapped to the high-quality human reference genome. We created SAF files using ANGSD (Korneliussen et al. 2014) requiring minimum base and mapping quality 30 and polarizing the spectrum using the chimpanzee as an outgroup. We then split these input data into test and training data, such that the first half of each autosome was assigned to the training set and the second half to the test set. The resulting training data set contains $1.17 \times 10^{9}$ sites for both YRI and CEU, while the test data set contains $1.35 \times 10^{9}$ sites for both. Training set depth distributions for each individual are shown in Supplementary Fig. 1.
We also analyzed a data set of much lower quality consisting of 12 and 8 individuals from 2 impala populations that we refer to as “Maasai Mara” and “Shangani,” respectively, based on their sampling locations. These populations were sequenced to only 1×–3× with the addition of a single high-depth sample in each population (see Supplementary Fig. 2). The data were mapped to a very fragmented assembly, and we then split the data into training and test sets just as for the human data. However, due to the low-quality assembly, we analyzed only sites on contigs larger than 100 kb and filtered sites based on depth outliers, excess heterozygosity, mappability, and repeat regions. We polarized using the impala reference itself. This process is meant to mirror a realistic workflow for working with low-quality data from a nonmodel organism. The impala input data end up somewhat smaller than the human data set, with approximately $6.3 \times 10^{8}$ sites in both test and training data sets.
Broadly, the human data are meant to exemplify medium-quality data with coverage toward the lower end, but with no other significant issues. The impala data, on the other hand, represent low-quality data: not only is the coverage low and fewer sites available, but the impala reference genome is also of poor quality, with 7,811 contigs greater than 100 kb and $N50 = 3.4 \times 10^{5}$ (i.e. 50% of the assembly bases lie on contigs of this size or greater). This serves to introduce further noise in the mapping process, which amplifies the overall data uncertainty. Finally, the impala populations are more distinct, with $F_{ST} = 0.24$ compared to 0.13 between the human populations. As we will see below, this creates additional challenges for estimation of the 2-dimensional SFS.
Estimation
Using the training data sets, we estimated the 1-dimensional SFS for YRI and Maasai Mara, as well as the 2-dimensional SFS for CEU/YRI and Shangani/Maasai Mara. We ran winsfs for 500 epochs using a fixed number of blocks B = 500 and window sizes $W \in \{100, 250, 500\}$. We will focus on the setting with window size W = 100. For convenience, we introduce the notation winsfs100 to refer to winsfs with hyperparameter settings B = 500 and W = 100. We return to the topic of hyperparameter settings below.
To compare, we ran realSFS using default settings, except allowing it to run for a maximum of 500 epochs rather than the default 100. We will still take the 100 epochs cutoff to mark convergence, if it has not occurred by other criteria before then, but results past 100 will be shown in places.
In each case, we evaluated the full log-likelihood [Equation (8)] of the estimates after each epoch on both the training and test data sets. In addition, we computed various summary statistics from the estimates after each epoch. For details, see Supplementary Section 2.
One-dimensional SFS
Main results for the 1-dimensional estimates are shown in Fig. 2.
Fig. 2.
One-dimensional SFS estimation. a) YRI SFS estimates from realSFS and winsfs100 after various epochs. Only variable sites are shown; the proportion of fixed sites is given in the legend. The final realSFS estimate is overlaid with dots on the winsfs plot for comparison. b) YRI Tajima’s θ estimates calculated from realSFS and winsfs over epochs. c) Maasai Mara SFS estimates from realSFS and winsfs100 after various epochs. Only variable sites are shown; the proportion of fixed sites is given in the legend. The final realSFS estimate is overlaid with dots on the winsfs plot for comparison. d) Maasai Mara Tajima’s θ estimates calculated from realSFS and winsfs over epochs.
For the human YRI population, we find that a single epoch of winsfs100 produces an estimate of the SFS that is visually indistinguishable from the converged estimate of realSFS at 39 epochs (Fig. 2a). Train and test set log-likelihoods (Supplementary Fig. 3) confirm that the likelihood at this point is only very marginally lower for winsfs100 than the last realSFS. By increasing the window size to 250 or 500, we get test log-likelihood values equal to or above those achieved by realSFS, and still within the first 5 epochs.
As an example of a summary statistic derived from the 1-dimensional SFS, Fig. 2b shows that winsfs100 finds an estimate of Tajima’s θ that is very near to the final realSFS estimate, with a difference on the order of $1 \times 10^{-6}$. Increasing the window size removes this difference at the cost of a few more epochs.
In the case of Maasai Mara, realSFS runs for the full 500 epochs without reporting convergence, so we take epoch 100 to mark convergence. On these data, winsfs100 requires 2 epochs to give a good estimate of the SFS, as shown in Fig. 2c. Some subtle differences relative to the realSFS results remain, however, especially at the middle frequencies: the realSFS estimate exhibits a “wobble” such that even bins are consistently higher than odd bins. Such a pattern is not biologically plausible and is not seen in the winsfs spectrum.
Supplementary Fig. 4 shows train and test log-likelihood data for Maasai Mara, which again support the conclusions drawn from looking at the estimates themselves. In theory, we expect that the test log-likelihood should be adversely impacted by the realSFS “wobble” pattern. In practice, however, with more than 99.5% fixed sites, the fixed end of the spectrum dominates the likelihood to the extent that the effect is not visible. We return to this point below.
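For reference, a per-site Tajima’s θ can be read off a 1-dimensional SFS as the expected number of pairwise differences. The sketch below assumes this pairwise-difference definition and an SFS normalized to probabilities; the function name is illustrative, and the exact computation used for the figures is described in Supplementary Section 2.

```python
import numpy as np

def tajimas_theta(sfs):
    """Per-site Tajima's theta from a normalized 1-dimensional SFS.

    sfs: array of length 2N + 1 with the probability of each derived
        allele count 0..2N.
    """
    sfs = np.asarray(sfs, dtype=float)
    n = len(sfs) - 1                 # number of sampled chromosomes, 2N
    j = np.arange(n + 1)
    # Probability that 2 distinct chromosomes differ at a site with j
    # derived copies: j * (n - j) pairs out of n * (n - 1) / 2.
    pairwise_diff = j * (n - j) / (n * (n - 1) / 2)
    return (sfs * pairwise_diff).sum()
```

The fixed bins (count 0 and 2N) get zero weight, so only the variable part of the spectrum contributes, scaled by the proportion of variable sites.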
Finally, Fig. 2d shows that Tajima’s θ is likewise well-estimated by 1 or 2 epochs of winsfs100 on the impala data.
Two-dimensional SFS
Overall results for the joint spectra are seen in Fig. 3.
Fig. 3.
Two-dimensional SFS estimation. a) CEU/YRI SFS estimates from realSFS after 93 epochs (converged) and from winsfs100 after a single epoch. Fixed sites not shown for scale, total proportion indicated by arrows. b, c) CEU/YRI SFS train and test log-likelihood over epochs for realSFS and winsfs. d) CEU/YRI Hudson’s $F_{ST}$ estimates calculated from realSFS and winsfs over epochs. e) Shangani/Maasai Mara SFS estimates from realSFS after 100 epochs (converged) and from winsfs100 after a single epoch. Fixed reference sites not shown for scale, proportions indicated by arrows. f, g) Shangani/Maasai Mara SFS train and test log-likelihood over epochs for realSFS and winsfs. h) Shangani/Maasai Mara Hudson’s $F_{ST}$ estimates calculated from realSFS and winsfs over epochs.
On the human data, winsfs100 takes a single epoch for an estimate of the SFS that is near-identical to realSFS at convergence after 93 epochs. Looking at the log-likelihood results, it is notable that while realSFS does better than winsfs when evaluated on the training data (Fig. 3b), the picture is reversed when evaluated on the test data (Fig. 3c). In fact, all winsfs hyperparameter settings achieved better test log-likelihood values in the first 10 epochs than achieved by realSFS at convergence. This is likely caused by a faint “checkerboard” pattern in the realSFS estimate due to overfitting, as we expect the spectrum to be smooth. We note that both realSFS and winsfs preserve an excess of sites where all individuals are heterozygous, corresponding to the peak in the center of the spectrum. This is a known issue with this data set (Meisner and Albrechtsen 2019), likely caused by paralogs in the mapping process. It is an artifact that can be removed by filtering the data before SAF calculation, which we have not done here. Given this choice, it is to be expected that this peak remains.
In 2 dimensions, we compute both Hudson’s $F_{ST}$ (Fig. 3d) and the f2-statistic (Supplementary Fig. 5) from SFS estimates after all epochs, and we note similar patterns for these as we have seen before: 1 epoch of winsfs100 gives an estimate of the summary statistic that is almost identical to the final realSFS estimate.
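Hudson’s $F_{ST}$ can be computed from a 2-dimensional SFS as a ratio of averages over the spectrum. The sketch below assumes the estimator described by Bhatia et al. (2013), with sample allele frequencies taken from the bin indices; the function name is illustrative, and the computation used for the figures is given in Supplementary Section 2.

```python
import numpy as np

def hudson_fst(sfs):
    """Hudson's Fst as a ratio of averages over a 2-dimensional SFS.

    sfs: (2 * N1 + 1, 2 * N2 + 1) array of probabilities.
    """
    sfs = np.asarray(sfs, dtype=float)
    n1, n2 = sfs.shape[0] - 1, sfs.shape[1] - 1   # chromosomes per population
    p1 = (np.arange(n1 + 1) / n1)[:, None]        # frequency in population 1
    p2 = (np.arange(n2 + 1) / n2)[None, :]        # frequency in population 2
    # Per-bin numerator and denominator of the estimator.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    # Weight each bin by its SFS probability, then take the ratio.
    return (sfs * num).sum() / (sfs * den).sum()
```

Bins in which both populations are fixed for the same allele contribute zero to both sums, so only the remaining bins determine the value.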
For the impalas, winsfs100 requires 2 epochs for a good estimate of the spectrum, while realSFS again does not report convergence within the first 100. What is immediately striking about the impala results, however, is that the checkerboard pattern is very pronounced for realSFS, and again absent for winsfs (Fig. 3e). The problem for realSFS is likely exacerbated by 2 factors: first, the sequencing depth is lower, increasing the uncertainty, and second, the relatively high divergence of the impala populations pushes most of the mass in the spectrum toward the edges. Together, this means that very little information is available for most of the estimated parameters. It appears that realSFS therefore ends up overfitting to the particularities of the training data at these bins.
This is also reflected in the difference between train and test log-likelihood (Fig. 3f and g). Like in the case of the human data, the SFS estimated by winsfs performs better on the test data compared to realSFS, while realSFS performs the best on the training data. On the test data, all winsfs settings again reach log-likelihood values comparable to or better than realSFS in few epochs. However, the differences between realSFS and winsfs remain relatively small in terms of log-likelihood, even on the test set. This is somewhat surprising, given the marked checkerboarding in the spectrum itself. Again, we attribute this to the fact that the log-likelihood is dominated by all the mass lying in or around the zero–zero bin. We expect, therefore, that methods that rely on the “interior” of the SFS should do better when using winsfs, compared to realSFS.
Before turning to test this prediction, we briefly note that $F_{ST}$ (Fig. 3h) and the f2-statistic (Supplementary Fig. 5) are also adequately estimated for the impalas by winsfs100 in 1 epoch.
Demographic inference
All the SFS-derived summary statistics considered so far are heavily influenced by the fixed allele bins (i.e. count 0 or $2N_k$ in all populations), or they are sums of alternating frequency bins. In either case, this serves to mask issues with checkerboard areas of the SFS in the lower-frequency bins. However, this will not be the case for downstream methods that rely on the shape of the spectrum in more detail.
To illustrate, we present a small case study of inferring the demographic history of the impala populations using the ∂a∂i (Gutenkunst et al. 2009) software with the estimated impala spectra shown in Fig. 3e, though folded due to the lack of an outgroup for proper polarization. Briefly, based on an estimated SFS and a user-specified demographic model, ∂a∂i fits a model SFS based on the demographic parameters so as to maximize the likelihood of these parameters. Our approach was to fit a simple demographic model for the Shangani and Maasai Mara populations and then gradually add parameters to the model as required based on the residuals of the input and model spectra. We take this to be representative of a typical workflow for demographic inference.
For each successive demographic model (Portik et al. 2017), we ran ∂a∂i on the folded spectra by performing 100 independent optimization runs from random starting parameters and checking for convergence by requiring the top 3 results to be within 5 log-likelihood units of each other. If the optimization did not converge, we did additional optimization runs until either they converged or 500 independent runs were reached without likelihood convergence. In that case, we inspected the results for the top runs to assess whether they were reliably reaching similar estimates and likelihoods. Results are shown in Fig. 4.
Fig. 4.
Demographic inference results. Each row corresponds to a demographic model fitted using ∂a∂i. On the left, a schematic of the model is shown including parameter estimates using SFS estimates from realSFS after 100 epochs or from winsfs100 after 2 epochs. Time is given in years, population sizes in number of individuals, and migration rates per chromosome per generation. All parameters were scaled assuming a mutation rate of $1.41 \times 10^{-8}$ per site per generation and a generation time of 5.7 years. On the right, the residuals of the SFS fitted by ∂a∂i. Note that ∂a∂i folds the input SFS, hence the residuals are likewise folded. The fixed category is omitted to avoid distorting the scale. a, b) Model with symmetric migration and constant population size. c, d) Model with asymmetric migration and constant population size. e, f) Model with asymmetric migration and a single, instantaneous population size change.
The first, basic model assumes that the populations have had constant population sizes and a symmetric migration rate since diverging. The parameter estimates based on realSFS and winsfs are similar, though the winsfs model fit has significantly higher log-likelihood (Fig. 4a). However, when inspecting the residuals in Fig. 4b, the realSFS residuals suffer from a heavy checkerboard pattern, making it hard to distinguish noise from model misspecification. In contrast, the winsfs residuals clearly show areas of the spectrum where the model poorly fits the data.
In particular, the residuals along the very edge of the spectrum suggest that a symmetric migration rate is not appropriate. Therefore, we fit a second model with asymmetric migration (Fig. 4c). Now ∂a∂i finds migration rates from Shangani to Maasai Mara an order of magnitude higher than vice versa. The results for winsfs (Fig. 4d) show improved residuals, while the realSFS residuals remain hard to interpret.
Finally, an area of positive residuals in the fixed and rare-variant end of the Shangani spectrum suggests that this population has recently undergone a significant bottleneck. Therefore, the third model allows for an instantaneous size change in each of the impala populations (Fig. 4e). At this point, the winsfs residuals (Fig. 4f) are negligible, suggesting that no more parameters should be added to the model. Once again, however, the realSFS residuals leave us uncertain whether further model extensions are required.
When looking at the final model fits, the parameter estimates from realSFS and winsfs also start to differ slightly. In several instances, estimates disagree by about 50%, and the log-likelihood remains much higher for winsfs, with a difference of 45,000 log-likelihood units to realSFS. In addition, we confirmed that the log-likelihood of the data in the original test SAF files given the SFS fitted by ∂a∂i is higher for winsfs than for realSFS. We stress, however, that we would likely never have found the appropriate model without using winsfs, since the interpretation of the realSFS results is difficult. In relation to this point, we note that the final model results in considerably different estimates for parameters of biological interest, such as split times and recent population sizes, relative to the initial model. We also find that the last model is supported by the literature: previous genetic and fossil evidence suggests extant common impala populations derive from a refugium in Southern Africa that subsequently colonized East Africa in the middle-to-late Pleistocene (Lorenzen et al. 2006, 2012; Faith et al. 2013). This is broadly consistent with the estimated split time, and the reduction in population size in East African populations as they colonized the new habitat. The difference in effective population size between the southern Shangani population and the eastern Maasai Mara was previously also found using microsatellite data (Lorenzen et al. 2006).
Simulations
To validate these findings in conditions with a known SFS, we ran simulations using msprime (Baumdicker et al. 2021) and tskit (Kelleher et al. 2018). Briefly, we simulated 2 populations, which we simply refer to as A and B. Populations A and B diverged 10,000 generations ago and both have effective populations sizes of 10,000 individuals, except for a period of 1,000 generations after the split, during which time B went through a bottleneck of size 1,000. We simulated 22 independent chromosomes of 10 Mb for a total genome size of 220 Mb, using a mutation rate of 2.5 × 10−8 and a uniform recombination rate of 1 × 10−8. To explore the consequences of varying sample sizes, we sampled 5, 10, or 20 individuals from the 2 populations. For each of these 3 scenarios, we calculated the true SFS from the resulting genotypes (shown in Supplementary Fig. 6).
Using the true genotypes as input, we simulated the effects of NGS with sequencing error for both variable and invariant sites. At every position in the genome, including the monomorphic sites, we sampled bases and introduced errors at a constant rate, independently for each base. We calculated genotype likelihoods according to the GATK model outlined in Equations (1)–(3) and output GLF files. Using these, we created SAF files for A and B with no further filtering using ANGSD. The mean depth λ was set to 2, 4, or 8 to investigate the performance of winsfs at different sequencing depths. This results in a grid of 3 × 3 simulated NGS data sets, with 3 different sample sizes and 3 different mean depth values.
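The read-sampling and genotype-likelihood step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the simulations: it assumes Poisson-distributed depth with mean λ, a hypothetical error rate `eps`, errors spread uniformly over the 3 other bases, and a GATK-style likelihood in which each read derives from either allele with probability 1/2. All function names are illustrative.

```python
import math
import random

BASES = "ACGT"

def poisson(rng, lam):
    """Draw a Poisson variate using Knuth's method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_reads(rng, genotype, mean_depth, eps):
    """Sample bases from a diploid genotype; each base errs with probability eps."""
    reads = []
    for _ in range(poisson(rng, mean_depth)):  # assumption: Poisson depth
        base = rng.choice(genotype)            # pick one of the two alleles
        if rng.random() < eps:                 # error: switch to another base
            base = rng.choice([b for b in BASES if b != base])
        reads.append(base)
    return reads

def genotype_log_likelihood(reads, genotype, eps):
    """log P(reads | genotype) under the assumed GATK-style model."""
    def p_base(base, allele):
        return 1.0 - eps if base == allele else eps / 3.0
    a1, a2 = genotype
    return sum(math.log(0.5 * p_base(b, a1) + 0.5 * p_base(b, a2)) for b in reads)

# Usage: one heterozygous site at mean depth 4 with an illustrative error rate.
rng = random.Random(1)
reads = sample_reads(rng, "AC", mean_depth=4, eps=0.002)
gls = {g: genotype_log_likelihood(reads, g, 0.002) for g in ("AA", "AC", "CC")}
```

At low depth, many sites receive 0 or 1 reads, so the 3 genotype likelihoods are often nearly indistinguishable; this is the uncertainty that SAF-based estimation has to integrate over.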
From the simulated SAF files, we ran winsfs and realSFS as above to generate the 2-dimensional SFS, except with a maximum of 100 epochs. For each method and each epoch e until convergence, we calculated the log-likelihood of the observed true SFS given the SFS estimated at that epoch [Equation (15)], where M denotes the total number of sites. Figure 5 shows how the log-likelihood evolves over epochs for winsfs and realSFS for sample sizes 5, 10, and 20 and simulated mean depths 2, 4, and 8. We observe that at a mean depth of 2, winsfs100 outperforms realSFS by a wide margin in terms of both speed and final log-likelihood. At mean depth 4, winsfs remains much faster and still achieves meaningfully better log-likelihoods, especially at higher sample sizes. Finally, at mean depth 8, winsfs100 still converges 5–10 times faster than realSFS (measured in epochs), but the methods provide estimates of similar quality.
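A comparison of this kind can be sketched as follows. The code assumes a multinomial model for the true site counts given the estimated SFS and drops the multinomial coefficient, which does not depend on the estimate; both choices are assumptions made for illustration and are not necessarily the exact form of Equation (15). The function name and the flattened-list representation of the spectrum are hypothetical.

```python
import math

def sfs_log_likelihood(true_counts, estimated_sfs):
    """Multinomial log-likelihood, up to an additive constant, of observed
    site counts given an estimated SFS expressed as probabilities.

    Both arguments are flattened spectra of equal length.
    """
    total = 0.0
    for count, prob in zip(true_counts, estimated_sfs):
        if count > 0:
            if prob <= 0.0:
                return float("-inf")  # observed bin has zero estimated mass
            total += count * math.log(prob)
    return total

# Usage: an estimate close to the truth scores higher than a flat spectrum.
counts = [6, 3, 1]
good = sfs_log_likelihood(counts, [0.6, 0.3, 0.1])
flat = sfs_log_likelihood(counts, [1 / 3, 1 / 3, 1 / 3])
```

Because the dropped constant is the same for every estimate, differences between methods and between epochs are unaffected by it.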
Fig. 5.
Log-likelihood over epochs of the true observed SFS given the 2-dimensional SFS estimated by winsfs and realSFS. Different simulated scenarios (mean depth 2, 4, or 8; sample size 5, 10, or 20) are shown. For each method, the epoch at which the default stopping criterion is triggered is marked. Note that the y-scale varies across sample sizes and depths in order to show the full range of data (main plot) and the difference between realSFS and winsfs (zoom plot). For each column of plots, corresponding to a simulated sample size, the y-scale in the zoom plot is held constant to allow for comparison across depths.
The estimated spectra for realSFS and winsfs100 at their default stopping points are shown in Supplementary Figs. 7 and 8, respectively. These confirm that the spectra on the whole are well-estimated by winsfs100 as compared to the true SFS (Supplementary Fig. 6). Moreover, we again observe that realSFS introduces a checkerboard pattern in the low-information part of the spectrum at 2×–4×, which is not present in the true spectrum, and which is not inferred by winsfs. The pattern is more pronounced at higher sample sizes. This supports the hypothesis that realSFS tends to overfit in situations where many parameters must be inferred with little information.
Peak simulations
The averaging of block estimates in the window EM algorithm appears to induce a certain “smoothing” of the spectrum at low depth. This smoothing effect is implicit in the sense of being nowhere explicitly modeled, and each parameter is estimated independently. Nevertheless, this observation may give rise to a concern that winsfs, unlike the maximum likelihood estimate from realSFS, might remove true abrupt peaks in the SFS.
To investigate, we modified the demographic simulation with sample size 20 described above in the following way. In each of 7 arbitrarily chosen bins near the center of the SFS, we artificially spiked 10,000 counts into the true spectrum after running the demographic simulations (Supplementary Fig. 9). This represents a 30- to 40-fold increase relative to the original count and the neighboring cells. Based on this altered spectrum, we simulated sequencing data for depth 2×, 4×, and 8×, created SAF files, and ran realSFS and winsfs100 as before. The residuals of the realSFS and winsfs estimates are shown in Supplementary Figs. 10 and 11, respectively. In this fairly extreme scenario, the spectra inferred by both winsfs and realSFS appear to have a small but noticeable downward bias in the peak region at 2× and 4×. However, compared to realSFS, winsfs has smaller residuals in all scenarios, and the apparent bias is inversely correlated with depth. These results confirm that using the window EM algorithm does not lead to excess flattening of SFS peaks compared with the maximum likelihood estimate from the standard EM algorithm.
Hyperparameters
The window EM algorithm requires hyperparameter settings for B and W. Moreover, it requires a choice of stopping criterion. For ease of use, the winsfs software ships with defaults for these settings, and we briefly describe these.
We expect that the choice of B is less important than the ratio W/B, which governs the fraction of data that is directly considered in any one update step. Having analyzed input data varying in size from 220 Mb (simulations) to 1.17 Gb (human data), we find that fixing B = 500 works well as a default across a wide range of input sizes. Therefore, the more interesting question is how to set the window size. In theory, there should be a tradeoff between speed of convergence and accuracy of results, where a lower window size favors the former and a higher window size the latter. In practice, however, we have not seen evidence in our results that using W = 500 over W = 100 leads to significantly better inference. On the other hand, the lower window size converges significantly faster. Based on this, we consider a window size of 100 to be the best general default. By default, the winsfs software therefore uses B = 500 blocks and a window size of W = 100.
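To make the roles of B and W concrete, the following is a minimal 1-dimensional sketch of the window EM idea. It is not the winsfs implementation: it assumes each site is given as a list of SAF likelihoods, that each block estimate is normalised before being added to the window, and that the window persists across epochs. All names are hypothetical.

```python
import random
from collections import deque

def normalise(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def block_estimate(sfs, block):
    """One EM step on a block: average posterior over sample frequencies."""
    counts = [0.0] * len(sfs)
    for saf in block:
        joint = [p * l for p, l in zip(sfs, saf)]
        norm = sum(joint)
        for k, j in enumerate(joint):
            counts[k] += j / norm
    return normalise(counts)

def window_em(sites, n_blocks=500, window=100, epochs=1, seed=0):
    """Update the SFS once per block, averaging the last `window` block estimates."""
    sites = list(sites)
    random.Random(seed).shuffle(sites)      # break up blocks of LD
    dim = len(sites[0])
    sfs = [1.0 / dim] * dim                 # flat starting point
    block_size = max(1, -(-len(sites) // n_blocks))  # ceiling division
    recent = deque(maxlen=window)           # sliding window of block estimates
    for _ in range(epochs):
        for start in range(0, len(sites), block_size):
            block = sites[start:start + block_size]
            recent.append(block_estimate(sfs, block))
            sfs = normalise([sum(column) for column in zip(*recent)])
    return sfs

# Usage: toy input in which 60%, 30%, and 10% of sites favour bins 0, 1, and 2.
toy = [[0.9, 0.05, 0.05]] * 600 + [[0.05, 0.9, 0.05]] * 300 + [[0.05, 0.05, 0.9]] * 100
estimate = window_em(toy, n_blocks=10, window=5, epochs=5)
```

In this sketch, each update draws directly on W/B of the data (here 5/10), which is the quantity discussed above.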
As for stopping, winsfs implements the criterion based on differences δ over successive epochs [Equation (14)]. Based on the initial analysis of the human and impala data, we chose the default threshold (see Supplementary Fig. 12) and used the simulations to validate this choice. Figure 5 shows the point at which stopping occurs, which is generally around the maximum log-likelihood, as desired.
Streaming
In the main usage mode, precalculated SAF likelihoods are read into RAM, as in realSFS. However, it is also possible to run winsfs while keeping the data on disk and streaming through the intersecting sites in the SAF files. We refer to this as a “streaming mode.”
Since the window EM algorithm requires randomly shuffling the input data, a preparation step is required in which SAF likelihoods are (jointly) shuffled into a new file. We wish to avoid loading the data into RAM to perform a shuffle, and we also do not want multiple intermediate writes to disk. To our knowledge, it is not possible to perform a true shuffle of the input data within these constraints. Instead, since we are only interested in shuffling for the purposes of breaking up blocks of LD, we perform a pseudo-shuffle according to the following scheme. We preallocate a file with space for exactly M intersecting sites in the input data. This file is then split into S contiguous sections of roughly equal size, and we then assign the input site with index i (counting from 0) to position ⌊i/S⌋ in section i mod S, where mod is the remainder operation. That is, the first S sites in the input end up in the first positions of each section, the next S sites in the input end up in the second positions of each section, and so on. This operation can be performed with constant memory and without intermediate writes to disk, and it has the benefit of being reversible.
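The index mapping of the pseudo-shuffle, and its inverse, can be sketched as follows. The sketch operates on an in-memory list for clarity, whereas the scheme described above writes each site directly to its destination in a preallocated file. It assumes that when M is not divisible by S, the first M mod S sections each hold one extra site; the function names are hypothetical.

```python
def section_start(section, n_sites, n_sections):
    """Offset of the first slot of a section in the output."""
    base, extra = divmod(n_sites, n_sections)
    return section * base + min(section, extra)

def shuffled_position(i, n_sites, n_sections):
    """Destination of input site i: position i // S within section i % S."""
    section, position = i % n_sections, i // n_sections
    return section_start(section, n_sites, n_sections) + position

def original_index(j, n_sites, n_sections):
    """Inverse mapping: recover the input index from an output position."""
    base, extra = divmod(n_sites, n_sections)
    boundary = extra * (base + 1)  # end of the sections holding one extra site
    if j < boundary:
        section, position = divmod(j, base + 1)
    else:
        section, position = divmod(j - boundary, base)
        section += extra
    return position * n_sections + section

def pseudo_shuffle(items, n_sections):
    """Apply the mapping to a list (stand-in for writing into a file)."""
    out = [None] * len(items)
    for i, item in enumerate(items):
        out[shuffled_position(i, len(items), n_sections)] = item
    return out

# Usage: 10 sites spread over 3 sections.
print(pseudo_shuffle(list(range(10)), 3))  # [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
```

Each destination depends only on i, M, and S, so sites can be written one at a time as they are read, and neighbouring input sites end up roughly M/S positions apart.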
After preparing the pseudo-shuffled file, winsfs can be run exactly as in the main mode. To confirm that this pseudo-shuffle is sufficient for the purposes of the window EM algorithm, we ran 10 epochs of winsfs in the streaming mode for the impala and human data sets in both 1 and 2 dimensions. After each epoch, we calculated the log-likelihood of the resulting SFS and compared it to the log-likelihood obtained by running in main mode above. The results are shown in Supplementary Fig. 13 and show that streaming mode yields comparable results to the main, in-RAM usage: the likelihood differs slightly, but is neither systematically better nor worse.
Benchmark
To assess its performance characteristics, we benchmarked winsfs in both the main mode and the streaming mode, as well as realSFS, on the impala data. For each of the 3 configurations, we ran estimation until convergence, as well as until various epochs before then, collecting benchmark results using Snakemake (Koster and Rahmann 2012). Both realSFS and winsfs were given 20 cores. Results are shown in Fig. 6. In terms of runtime, we find that running winsfs in RAM is significantly faster than realSFS (Fig. 6a). This is in part because winsfs requires fewer epochs, but also because winsfs runs faster than realSFS epoch by epoch. As expected, when switching winsfs to the streaming mode, runtime suffers as epochs increase. However, taking the number of epochs required for convergence into account, streaming winsfs remains competitive with realSFS, even when including the initial overhead to shuffle SAF likelihoods on disk.
Fig. 6.
Computational resource usage of winsfs and realSFS for the joint estimation of the SFS of the Shangani and Maasai Mara impala populations. winsfs can be run while loading input data into RAM, or while streaming through it on disk. In the latter case, data must be shuffled on disk beforehand. a) Runtime required with 20 threads for various numbers of epochs. Results for winsfs are shown for in-memory usage and streaming mode. For streaming mode, times are given with and without the extra time taken to shuffle data on disk before running. b) Peak memory usage (maximum resident set size).
Looking at memory consumption, streaming winsfs has a trivial peak memory usage of 10 MB, including the initial pseudo-shuffle. In comparison, when reading data into RAM, realSFS and winsfs require 137 and 107 GB, respectively, even on the fairly small impala data set.
Discussion
We have presented the window EM algorithm for inferring the SFS from low-depth data, as well as the winsfs implementation of this algorithm. The window EM algorithm updates SFS estimates in smaller blocks of sites and averages these block estimates in larger windows. We have argued that this approach has 3 related advantages relative to current methods. First, by updating more often, convergence happens 1–2 orders of magnitude faster. Due to the window averaging, this improvement in convergence times does not occur at the cost of stability. Second, due to the fast convergence, it is feasible to run the window EM algorithm out of memory. This brings the memory requirements of the algorithm from hundreds of gigabytes of RAM to virtually nothing. Third, by optimizing over different subsets of the data in each iteration, the algorithm is prevented from overfitting to the input data. In practice, this means that we get biologically more plausible spectra.
On this last point, it is worth emphasizing that while winsfs appears to have the effect of smoothing the spectrum in a beneficial way, this smoothing effect is entirely implicit. That is, it is nowhere explicitly modeled that each estimated bin should be similar to neighboring bins to avoid checkerboard patterns. Rather, the apparent smoothing emerges because winsfs mitigates some of the issues with overfitting that may otherwise manifest as a checkerboard pattern. As shown in the simulations, winsfs does not remove true peaks in the SFS. In the broader setting of stochastic optimization, window EM is in this way related to forms of Polyak–Ruppert iterate averaging schemes as used in stochastic gradient methods (Ruppert 1988; Polyak and Juditsky 1992), variants of which have also been shown to control variance and induce regularization (Jain et al. 2018; Neu and Rosasco 2018), similar to what we have observed here.
Within the EM literature, window EM is prima facie quite similar in spirit to other versions of the stochastic EM algorithm (Neal and Hinton 1998; Sato and Ishii 2000; Cappé and Moulines 2009; Liang and Klein 2009; Chen et al. 2018). They too work on smaller blocks and seek some way of controlling stability in how the block estimate is incorporated into the overall estimate. Typically, this involves an update that interpolates between the current overall estimate and the new block estimate, with a weight γt on the block estimate that decays as a function of iteration t. During initial experimentation, we empirically found that such methods tended to increase the noise in the spectrum, rather than reduce it. This problem likely arises because estimating the multidimensional SFS requires estimating many parameters for which very little information is available in any one batch. Therefore, by having an update step involving only the current estimate and a single, small batch of sites, significant noise is introduced in the low-density part of the spectrum. In contrast, the window EM approach still optimizes over smaller batches for speed but actually considers large amounts of data in the update step by summing the entire window of batch estimates, thereby decreasing the noise.
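For contrast with the window average, a generic decaying-weight update of the kind used in stepwise or online EM can be sketched as below. The step-size schedule (t + 2)^(−α) and the value of α are illustrative assumptions rather than settings taken from any experiment reported here, and the function names are hypothetical.

```python
def step_size(t, alpha=0.7):
    """Weight on the new block estimate at iteration t (assumed schedule)."""
    return (t + 2) ** -alpha

def stepwise_update(current, block_estimate, t, alpha=0.7):
    """Interpolate between the current estimate and a single block estimate."""
    gamma = step_size(t, alpha)
    return [(1 - gamma) * c + gamma * b for c, b in zip(current, block_estimate)]

# Usage: with alpha = 1 and t = 0 the weight is 1/2.
updated = stepwise_update([0.5, 0.5], [1.0, 0.0], t=0, alpha=1.0)
print(updated)  # [0.75, 0.25]
```

Each such update sees only one small block, so a bin that receives little or no posterior mass in that block is pulled towards a noisy value; averaging over a window of many block estimates dampens this effect.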
For SFS inference specifically, prior work exists to improve estimation for low-depth sequencing data. For example, it has been proposed to “band” SAF likelihoods to make estimation scale better in the number of sampled individuals (Han et al. 2014; Mas-Sandoval et al. 2022). Briefly, the idea is that at each site, all the mass in the SAF likelihood tends to be concentrated in a small band around the most likely sample frequency and downstream inference can be adequately carried out by only propagating this band and setting all others to zero. By doing so, runtime and RAM can be saved by simply ignoring all the zero bins outside the chosen band. We note that such ideas are orthogonal to the work presented here, since they are concerned with the representation of the input data, and thereby indirectly modify all downstream optimization methods. Future work on winsfs may involve the ability to run from banded SAF likelihoods. This will be important with large sample sizes, in the hundreds of individuals.
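As a rough illustration of the banding idea, the sketch below keeps a fixed-width band of bins around the most likely sample frequency at one site and sets all other bins to zero. The symmetric, fixed half-width is an assumption made for simplicity; the cited methods may choose the band differently, and the function name is hypothetical.

```python
def band_saf(saf, half_width):
    """Zero all SAF likelihood bins outside a band around the maximum."""
    peak = max(range(len(saf)), key=saf.__getitem__)
    low = max(0, peak - half_width)
    high = min(len(saf), peak + half_width + 1)
    return [value if low <= k < high else 0.0 for k, value in enumerate(saf)]

# Usage: keep one bin on either side of the peak.
print(band_saf([0.01, 0.1, 0.7, 0.15, 0.04], 1))  # [0.0, 0.1, 0.7, 0.15, 0.0]
```

In a real implementation, only the band and its offset would be stored, which is where the savings in runtime and RAM come from.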
Others have focused on the implementation details of the EM algorithm, for instance using GPU acceleration (Lu et al. 2012). Such efforts still have the typical high memory requirements and do not address the overfitting displayed by the standard EM algorithm. Moreover, we find that the presented algorithmic improvements, combined with an efficient implementation, serve to make winsfs more than competitive with such efforts in terms of runtime. Indeed, with winsfs converging in-memory in less than an hour on genome-scale data, runtime is no longer a significant bottleneck for SFS estimation.
We emphasize, however, that the window EM algorithm and winsfs are unlikely to yield any meaningful benefits with sequencing data at above around 10×–12× coverage. With such data, better inference of the SFS will be obtained by estimation directly from genotype calls with appropriate filters. Nevertheless, efficient and robust methods remain important for low-coverage data. This is partly because low-coverage data may sometimes be the only option, for example when working with ancient DNA. Also, such methods allow intentionally sequencing at lower coverage, decreasing the sequencing cost per individual.
In addition, we do not expect winsfs to perform better than realSFS when data are not available for many sites (e.g. <100 Mb) due to the fact that winsfs only uses parts of the available data directly in the final estimation.
Finally, improvements in the SFS estimates by winsfs are unlikely to be significant for simple summary statistics like θ, FST, or f-statistics. For such purposes, winsfs simply produces results similar to realSFS, although much faster. However, as the number of dimensions and samples increases, and as sequencing depth decreases, overfitting will start to influence the low-density bins of the spectrum. Where this information is used downstream, winsfs will lead to better and more interpretable results and can potentially help solve commonly known biases in parameter estimates arising from model misspecification (Momigliano et al. 2021). We have seen this in the case study, but we believe that the same would be true of other popular demographic inference frameworks including fastsimcoal (Excoffier and Foll 2011; Excoffier et al. 2013), moments (Jouganous et al. 2017), and momi (Kamm et al. 2017). It may also be significant for other methods for complex inference from the multidimensional spectrum, including inference of fitness effects using fit∂a∂i (Kim et al. 2017; Huang et al. 2021) or introgression using DFS (Martin and Amos 2020), though we have not explored these methods.
Supplementary Material
Contributor Information
Malthe Sebro Rasmussen, Department of Biology, University of Copenhagen, 2200 København N, Denmark.
Genís Garcia-Erill, Department of Biology, University of Copenhagen, 2200 København N, Denmark.
Thorfinn Sand Korneliussen, Globe Institute, University of Copenhagen, 1350 København K, Denmark.
Carsten Wiuf, Department of Mathematical Sciences, University of Copenhagen, 2100 København Ø, Denmark.
Anders Albrechtsen, Department of Biology, University of Copenhagen, 2200 København N, Denmark.
Data Availability
The human data analyzed are part of the 1000 Genomes (The 1000 Genomes Project Consortium 2015) phase 3 low depth sequencing data. Alignments have been made available by the 1000G project and can be accessed at ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/. The impala data have been made available via the SRA with accession PRJNA862915. Analysis and plotting code, as well as the cleaned data corresponding to the final results, are available at github.com/malthesr/window and the winsfs software itself at github.com/malthesr/winsfs.
Supplemental material is available at GENETICS online.
Funding
AA, CW, and MSR are supported by the Independent Research Fund Denmark (grant numbers: 8021-00360B and 0135-00211B) and the University of Copenhagen through the Data+ initiative. GGE is supported by the Independent Research Fund Denmark (grant number: 8049-00098B). TSK is funded by a Carlsberg Foundation Young Researcher Fellowship awarded by the Carlsberg Foundation in 2019 (CF19-0712).
Conflicts of interest
None declared.
Literature cited
- Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, et al. Efficient ancestry and mutation simulation with msprime 1.0. Genetics. 2021;220:iyab229. doi: 10.1093/genetics/iyab229.
- Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting FST: the impact of rare variants. Genome Res. 2013;23:1514–1521.
- Cappé O, Moulines E. On-line expectation-maximization algorithm for latent data models. J R Stat Soc Ser B: Stat Methodol. 2009;71:593–613.
- Chen J, Zhu J, Teh YW, Zhang T. Stochastic expectation maximization with variance reduction. Proceedings of the 32nd International Conference on Neural Information Processing Systems. Vol. 31; 2018. p. 7967–7977.
- Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10.
- Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9:e1003905.
- Excoffier L, Foll M. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 2011;27:1332–1334.
- Faith JT, Tryon CA, Peppe DJ, Beverly EJ, Blegen N. Biogeographic and evolutionary implications of an extinct late Pleistocene impala from the Lake Victoria Basin, Kenya. J Mamm Evol. 2013;21:213–222.
- Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics. 2000;155:1405–1413.
- Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:e1000695.
- Han E, Sinsheimer JS, Novembre J. Characterizing bias in population genetic inferences from low-coverage sequencing data. Mol Biol Evol. 2013;31:723–735.
- Han E, Sinsheimer JS, Novembre J. Fast and accurate site frequency spectrum estimation from low coverage sequence data. Bioinformatics. 2014;31:720–727.
- Huang X, Fortier AL, Coffman AJ, Struck TJ, Irby MN, James JE, León-Burguete JE, Ragsdale AP, Gutenkunst RN. Inferring genome-wide correlations of mutation fitness effects between populations. Mol Biol Evol. 2021;38:4588–4602.
- Jain P, Kakade SM, Kidambi R, Netrapalli P, Pillutla VK, Sidford A. A markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). arXiv, 2018.
- Jouganous J, Long W, Ragsdale AP, Gravel S. Inferring the joint demographic history of multiple populations: beyond the diffusion approximation. Genetics. 2017;206:1549–1567.
- Kamm JA, Terhorst J, Song YS. Efficient computation of the joint sample frequency spectra for multiple populations. J Comput Graph Stat. 2017;26:182–194.
- Kelleher J, Thornton KR, Ashander J, Ralph PL. Efficient pedigree recording for fast population genetics simulation. PLoS Comput Biol. 2018;14:e1006581.
- Kim BY, Huber CD, Lohmueller KE. Inference of the distribution of selection coefficients for new nonsynonymous mutations using large samples. Genetics. 2017;206:345–361.
- Korneliussen TS, Albrechtsen A, Nielsen R. ANGSD: analysis of next generation sequencing data. BMC Bioinformatics. 2014;15.
- Korneliussen TS, Moltke I, Albrechtsen A, Nielsen R. Calculation of Tajima’s D and other neutrality test statistics from low depth next-generation sequencing data. BMC Bioinformatics. 2013;14:289. doi: 10.1186/1471-2105-14-289.
- Koster J, Rahmann S. Snakemake–a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2522.
- Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079.
- Liang P, Klein D. Online EM for unsupervised models. 2009. p. 611–619.
- Lorenzen ED, Arctander P, Siegismund HR. Regional genetic structuring and evolutionary history of the impala Aepyceros melampus. J Hered. 2006;97:119–132.
- Lorenzen ED, Heller R, Siegismund HR. Comparative phylogeography of African savannah ungulates. Mol Ecol. 2012;21:3656–3670.
- Lou RN, Jacobs A, Wilder AP, Therkildsen NO. A beginner’s guide to low-coverage whole genome sequencing for population genomics. Mol Ecol. 2021;30:5966–5993.
- Lu M, Zhao J, Luo Q, Wang B. Accelerating Minor Allele Frequency Computation with Graphics Processors. BigMine '12: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. ACM Press; 2012.
- Margaryan A, Lawson DJ, Sikora M, Racimo F, Rasmussen S, Moltke I, Cassidy LM, Jørsboe E, Ingason A, Pedersen MW, et al. Population genomics of the Viking world. Nature. 2020;585:390–396.
- Marth GT, Czabarka E, Murvai J, Sherry ST. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics. 2004;166:351–372.
- Martin SH, Amos W. Signatures of introgression across the allele frequency spectrum. Mol Biol Evol. 2020;38:716–726.
- Mas-Sandoval A, Pope NS, Nielsen KN, Altinkaya I, Fumagalli M, Korneliussen TS. Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data. GigaScience. 2022;11:giac032. doi: 10.1093/gigascience/giac032.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303.
- Meisner J, Albrechtsen A. Testing for Hardy–Weinberg equilibrium in structured populations using genotype or low-depth next generation sequencing data. Mol Ecol Resour. 2019;19:1144–1152.
- Momigliano P, Florin AB, Merilä J. Biases in demographic modelling affect our understanding of recent divergence. Mol Biol Evol. 2021;38:2967–2985.
- Neal RM, Hinton GE. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models. Netherlands: Springer; 1998. p. 355–368.
- Neu G, Rosasco L. Iterate averaging as regularization for stochastic gradient descent; 2018. p. 3222–3242.
- Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, Hubisz MJ, Fledel-Alon A, Tanenbaum DM, Civello D, White TJ, et al. A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 2005;3:e170.
- Nielsen R, Korneliussen T, Albrechtsen A, Li Y, Wang J. SNP calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data. PLoS One. 2012;7:e37558.
- Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011;12:443–451.
- Olalde I, Mallick S, Patterson N, Rohland N, Villalba-Mouco V, Silva M, Dulias K, Edwards CJ, Gandini F, Pala M, et al. The genomic history of the Iberian Peninsula over the past 8000 years. Science. 2019;363:1230–1234.
- Peter BM. Admixture, population structure, and f-statistics. Genetics. 2016;202:1485–1501.
- Polyak BT, Juditsky AB. Acceleration of stochastic approximation by averaging. SIAM J Control Optim. 1992;30:838–855.
- Portik DM, Leaché AD, Rivera D, Barej MF, Burger M, Hirschfeld M, Rödel MO, Blackburn DC, Fujita MK. Evaluating mechanisms of diversification in a Guineo-Congolian tropical forest frog using demographic model selection. Mol Ecol. 2017;26:5245–5263.
- Ruppert D. Efficient estimations from a slowly convergent Robbins-Monro process; 1988. Technical report. Cornell University.
- Sánchez-Barreiro F, Gopalakrishnan S, Ramos-Madrigal J, Westbury MV, Manuel M, Margaryan A, Ciucani MM, Vieira FG, Patramanis Y, Kalthoff DC, et al. Historical population declines prompted significant genomic erosion in the northern and southern white rhinoceros (Ceratotherium simum). Mol Ecol. 2021;30:6355–6369.
- Sato M, Ishii S. On-line EM algorithm for the normalized gaussian network. Neural Comput. 2000;12:407–432.
- Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585–595.
- The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.
- van der Valk T, Pečnerová P, del Molino DD, Bergström A, Oppenheimer J, Hartmann S, Xenikoudakis G, Thomas JA, Dehasque M, Sağlican E, et al. Million-year-old DNA sheds light on the genomic history of mammoths. Nature. 2021;591:265–269.
- Varadhan R, Roland C. Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand J Stat. 2008;35:335–353.