Inferring the Demographic History of Inbred Species from Genome-Wide SNP Frequency Data

Paul D Blischak; Michael S Barker; Ryan N Gutenkunst

doi:10.1093/molbev/msaa042

2020 Feb 18;37(7):2124–2136.


Editor: Daniel Falush

PMCID: PMC7828618 PMID: 32068861

Abstract

Demographic inference using the site frequency spectrum (SFS) is a common way to understand historical events affecting genetic variation. However, most methods for estimating demography from the SFS assume random mating within populations, precluding these types of analyses in inbred populations. To address this issue, we developed a model for the expected SFS that includes inbreeding by parameterizing individual genotypes using beta-binomial distributions. We then take the convolution of these genotype probabilities to calculate the expected frequency of biallelic variants in the population. Using simulations, we evaluated the model’s ability to coestimate demography and inbreeding using one- and two-population models across a range of inbreeding levels. We also applied our method to two empirical examples, American pumas (Puma concolor) and domesticated cabbage (Brassica oleracea var. capitata), inferring models both with and without inbreeding to compare parameter estimates and model fit. Our simulations showed that we are able to accurately coestimate demographic parameters and inbreeding even for highly inbred populations (F = 0.9). In contrast, failing to include inbreeding generally resulted in inaccurate parameter estimates in simulated data and led to poor model fit in our empirical analyses. These results show that inbreeding can have a strong effect on demographic inference, a pattern that was especially noticeable for parameters involving changes in population size. Given the importance of these estimates for informing practices in conservation, agriculture, and elsewhere, our method provides an important advancement for accurately estimating the demographic histories of these species.

Keywords: conservation, demography, domestication, inbreeding, site frequency spectrum

Introduction

Estimating the demographic history of closely related populations or species is an important first step in understanding the interplay of the evolutionary forces shaping genetic variation. Divergence, migration, changes in population size, and other historical events all contribute to population allele frequency dynamics over time, a process that can be modeled using a variety of approaches. Connecting the expectations from these models with observed genomic data is often achieved using the site frequency spectrum (SFS), a genome-wide summary of genetic polymorphism within and between populations (Sawyer and Hartl 1992; Adams and Hudson 2004; Caicedo et al. 2007; Gutenkunst et al. 2009; Nielsen et al. 2009). The ease and affordability of collecting genomic SNP data make inferences of demography using the SFS especially appealing, highlighting their importance in gaining insights into the historical factors affecting neutral variation in populations. Several recent analyses have also applied SFS-based methods to infer the fitness effects of mutations (Kim et al. 2017; Tataru et al. 2017; Fortier et al. 2019), allowing researchers to model patterns of selection while simultaneously controlling for demography (Williamson et al. 2005).

Generating the SFS from a demographic model is a well-studied problem, and several approaches based on different underlying methodologies are currently implemented (e.g., diffusion, Gutenkunst et al. 2009; spectral methods, Lukić and Hey 2012; the coalescent, Excoffier et al. 2013; moment closure, Jouganous et al. 2017). However, these methods generally assume panmixia or random mating within populations, which may not be a realistic assumption for many groups of organisms that are inbred. The reason for this assumption is that the approximations used by these approaches are all built on top of the Wright–Fisher model and rely on the simplicity of its binomial sampling scheme for deriving expectations. The excess of homozygosity caused by inbreeding deviates from binomial expectations, leading to changes in the observed SFS that cannot be captured by models assuming random mating and that may affect estimates of demography. Generalizations of the standard Wright–Fisher model have been made to include inbreeding through partial self-fertilization (Wright 1951). Nevertheless, these modifications have yet to be implemented in SFS-based methods for demographic inference.

Despite this lack of available SFS-based methods, previous approaches to infer demography from inbred samples have successfully used alternative representations of genomic data to capture the extent to which samples share blocks of their genome through nonrandom mating. This typically entails identifying parts of the genome that are identical by descent (IBD), or that contain runs of homozygosity, and using the length and distribution of these blocks to infer levels of inbreeding and past population size dynamics (Kirin et al. 2010; Kardos et al. 2017; Browning et al. 2018). Large IBD blocks are usually an indication of recent inbreeding, whereas the frequency and distribution of smaller IBD blocks, which are shared due to common ancestry rather than inbreeding, contain information about more long-term trends in population size (Kirin et al. 2010; Ceballos et al. 2018). However, these methods are generally only used to model size changes in single populations, so they cannot estimate other important demographic events such as population divergence or rates of gene flow. Furthermore, the reliance of these methods on fully sequenced genomes prevents them from being used in systems that lack such resources.

The ability to estimate demography in organisms that do not have a reference genome is a strength of SFS-based methods. This flexibility allows researchers using reduced representation methods (e.g., restriction enzyme-based approaches) to collect genomic data for demographic inference. A large motivating factor for the work that we have conducted here is to understand demography in domesticated crop species, which are often highly inbred due to how they are bred and propagated (Gaut et al. 2018). Inbreeding is also of great concern in threatened and endangered species (Shafer et al. 2015; Xue et al. 2015; Robinson et al. 2016, 2019). For many of the most economically or ecologically important species in these categories, full-genome sequences are typically available and can be used to guide estimates of genetic variation and past population dynamics that will help to inform breeding practices or management strategies, respectively. However, for less well-studied agricultural or threatened species, it is crucial to have tools available that can also provide this essential information without necessarily needing to obtain a fully sequenced genome.

In this article, we introduce a new method for including inbreeding in estimates of demography by modifying the sampling distribution used to generate the expected SFS for a given demographic model. We have implemented the approach in the Python package $\partial a \partial i$ (Gutenkunst et al. 2009), building on top of its existing machinery for estimating demography using the diffusion approximation. To assess our ability to coestimate inbreeding and demography, we generated frequency spectra in both $\partial a \partial i$ and SLiM (Haller and Messer 2019) and used the new model to make inferences from these simulated data. We also used simulated frequency spectra from $\partial a \partial i$ to see how inbreeding affects estimates of demography when it is ignored. Finally, we used genomic data from two empirical examples, American pumas (Puma concolor) and domesticated cabbage (Brassica oleracea var. capitata), and evaluated estimates of their demographic histories both with and without inbreeding. In general, our model is shown to be accurate even for highly inbred populations (F = 0.9). We also found that failing to account for inbreeding leads to inaccurate estimates of parameters and poor model fit. Taken together, the model we have developed provides a powerful tool to jointly estimate inbreeding and demography, and will help to facilitate evolutionary inferences in a wide range of species.

New Approaches

We start with a brief overview of the SFS and describe its derivation from the population distribution of allele frequencies (DAF), which can be obtained using the diffusion approximation as described previously (Gutenkunst et al. 2009). We then propose a probability model for calculating the number of derived mutations in an inbred population and provide an expression for the expected SFS incorporating this distribution. Using this expression for the expected SFS with inbreeding, we can perform parameter inference with a composite likelihood assuming a Poisson random field model (Sawyer and Hartl 1992).
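As a rough illustration of the Poisson random field composite likelihood mentioned above, the sketch below treats each SFS entry as an independent Poisson count whose mean is the corresponding model entry. The function names and the toy counts are made up for illustration and are not taken from $\partial a \partial i$; the neutral expectation being proportional to $1/d$ is the standard result for a constant-size population, used here only to have something to compare against.

```python
import math

def poisson_log_likelihood(model, data):
    """Composite log-likelihood of observed SFS counts given expected counts.

    Each entry is treated as an independent Poisson draw whose mean is the
    matching model entry (entries are paired by position).
    """
    ll = 0.0
    for m, d in zip(model, data):
        # log Poisson(d | m) = -m + d * log(m) - log(d!)
        ll += -m + d * math.log(m) - math.lgamma(d + 1)
    return ll

def optimal_scale(model, data):
    """Multiplicative scaling of the model that maximizes the Poisson likelihood."""
    return sum(data) / sum(model)

# Toy 1D spectrum over polymorphic entries d = 1..5 (hypothetical counts).
observed = [120, 61, 38, 33, 22]
neutral_shape = [1.0 / d for d in range(1, 6)]  # neutral expectation, up to scale
theta = optimal_scale(neutral_shape, observed)
expected = [theta * m for m in neutral_shape]
ll = poisson_log_likelihood(expected, observed)
print(round(theta, 3), round(ll, 3))
```

Scaling the model so that its total matches the observed total is the maximum-likelihood choice for the overall scale under this Poisson model, which is why only the shape of the spectrum needs to be optimized over the remaining parameters.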

The Site Frequency Spectrum

The SFS is a multidimensional summary of genetic variation within and across populations that records how often derived biallelic variants of different frequencies are observed in a sample of individuals. For example, given a sample of 20 chromosomes (10 diploid individuals) from each of three populations, the SFS entry at index [3,8,17] records how often we observe a variant in 3, 8, and 17 out of the 20 chromosomes in populations one, two, and three, respectively. In general, for P populations with sample sizes $n_{1}, n_{2}, \dots, n_{P}$, we index the SFS using $[d_{1}, d_{2}, \dots, d_{P}]$ to record how often we observe a variant with derived allele counts $d_{1}, d_{2}, \dots, d_{P}$ in populations one through P.

The observed SFS can be obtained from empirical data by tabulating derived SNP frequencies across sampled populations to generate the P-dimensional array described earlier. When a derived allele cannot be determined, we can instead record the frequency of the minor allele, effectively “folding” the spectrum in half by only considering the variants with frequency <0.5. Demographic inference can then be conducted by comparing the observed SFS with the SFS obtained from a demographic model (Sawyer and Hartl 1992).
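The tabulation and folding steps described above can be sketched for a single population as follows. The genotype matrix is a made-up toy example, and the folded spectrum is returned as a shorter list indexed by minor allele count, which is a simplification chosen here rather than the storage format of any particular library.

```python
def tabulate_sfs(derived_counts, n_chrom):
    """Unfolded 1D SFS: entry d counts the sites carrying d derived alleles."""
    sfs = [0] * (n_chrom + 1)
    for d in derived_counts:
        sfs[d] += 1
    return sfs

def fold_sfs(sfs):
    """Fold by minor allele count: entry d absorbs entry n - d."""
    n = len(sfs) - 1
    folded = [0] * (n // 2 + 1)
    for d, count in enumerate(sfs):
        folded[min(d, n - d)] += count
    return folded

# Rows are sites, columns are diploid individuals, values are the number of
# derived allele copies (0, 1, or 2). Hypothetical data for illustration.
genotypes = [
    [0, 1, 0, 2],
    [2, 2, 1, 2],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
n_chrom = 2 * len(genotypes[0])
derived_counts = [sum(site) for site in genotypes]
unfolded = tabulate_sfs(derived_counts, n_chrom)
folded = fold_sfs(unfolded)
print(unfolded)  # [0, 1, 1, 1, 0, 0, 0, 1, 0]
print(folded)    # [0, 2, 1, 1, 0]
```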

Given the P-dimensional DAF obtained from a given demographic model, $\phi$, the expected SFS can be obtained by calculating the probability of drawing $d_{1}, \dots, d_{P}$ derived alleles while integrating across the DAF in the populations. Within each population, the number of derived alleles has a binomial distribution under panmixia. We then integrate across all possible allele frequencies, weighting the binomial probability of drawing $d_{i}$ derived alleles by the density determined by $\phi$ within population i. Taking this P-dimensional integral across the weighted product of binomial probabilities gives us the expression for the joint expected SFS:

$$
E[d_{1}, \dots, d_{P}] = \int_{0}^{1} \dots \int_{0}^{1} \prod_{i = 1}^{P} \binom{n_{i}}{d_{i}} x_{i}^{d_{i}} (1 - x_{i})^{n_{i} - d_{i}} \, \phi(x_{1}, \dots, x_{P}) \, dx_{i}. \tag{1}
$$
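To make equation (1) concrete for a single population, the sketch below integrates the binomial sampling probability against an allele frequency density with a simple midpoint rule. It assumes the textbook equilibrium density $\theta / x$ for a neutral, constant-size population (an assumption made here, not a density computed by the diffusion machinery), for which the expected entry at count $d$ should be close to $\theta / d$. The function names and grid size are illustrative.

```python
import math

def neutral_density(x, theta=1.0):
    """Assumed equilibrium DAF for a neutral constant-size population."""
    return theta / x

def expected_sfs_binomial(n, density, grid_size=2000):
    """Equation (1) for one population, by midpoint-rule integration over x."""
    h = 1.0 / grid_size
    sfs = [0.0] * (n + 1)
    for k in range(grid_size):
        x = (k + 0.5) * h           # midpoint of the k-th grid cell
        weight = density(x) * h     # density times cell width
        for d in range(1, n):       # polymorphic entries only
            sfs[d] += math.comb(n, d) * x**d * (1 - x)**(n - d) * weight
    return sfs

sfs = expected_sfs_binomial(10, neutral_density)
for d in range(1, 10):
    print(d, round(sfs[d], 4), round(1.0 / d, 4))
```

The fixed entries (0 and n) are left at zero because they are not used for inference and the assumed density is not integrable at the boundary.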

The Expected SFS with Inbreeding

Through its use of binomial sampling, the preceding derivation for the expected SFS makes the assumption that matings within populations are random. When inbreeding has occurred, individual genotypes are more likely to be homozygous due to being IBD. One way to capture this excess in homozygosity is to incorporate the inbreeding coefficient F into a generalized form for the expected genotype frequencies under Hardy–Weinberg equilibrium (Wright 1951). Here, we use an alternative model that captures the fact that genotypes within populations will be correlated due to inbreeding, pushing the distribution of genotypes toward homozygotes. To capture this correlation among genotypes, Balding and Nichols (1995, 1997) proposed a probability model to incorporate inbreeding using a beta-binomial distribution. Under this model, the genotype of individual i ($i = 1, \dots, n$) is a random variable, $G_{i} \in \{0, 1, 2\}$, giving the number of copies of the derived allele it carries, such that $\Pr(G_{i} = g)$ at an individual locus with allele frequency $p \in (0, 1)$ and population inbreeding coefficient $F \in (0, 1)$ is beta-binomial with the following form:

$$
\begin{aligned}
\Pr(G_{i} = g \mid p, F) &= \mathrm{BB}\left(g, \alpha = p \left[\frac{1 - F}{F}\right], \beta = (1 - p) \left[\frac{1 - F}{F}\right]\right) \\
&= \binom{2}{g} \frac{B(g + \alpha, 2 - g + \beta)}{B(\alpha, \beta)}.
\end{aligned} \tag{2}
$$

Here, $\mathrm{BB}$ denotes the probability mass function for the beta-binomial distribution and B(x, y) is the beta function with dummy parameters x and y. The parameterization of $\alpha = p \left[\frac{1 - F}{F}\right]$ and $\beta = (1 - p) \left[\frac{1 - F}{F}\right]$ introduces the overdispersion of probability toward homozygous genotypes that is expected as inbreeding increases (Balding and Nichols 1995, 1997).
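A minimal sketch of equation (2) is given below, computing the beta function through log-gamma values for numerical stability. The function names are illustrative. As a sanity check, the sketch compares the result with the genotype frequencies $(1 - p)^{2} + F p (1 - p)$, $2 p (1 - p)(1 - F)$, and $p^{2} + F p (1 - p)$; with this parameterization the beta-binomial for a single diploid genotype reduces to these values, which follows from the identity $B(a + 1, b) = B(a, b) \, a / (a + b)$.

```python
import math

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def genotype_probs(p, F):
    """Equation (2): [P(G=0), P(G=1), P(G=2)] for allele frequency p and inbreeding F."""
    scale = (1.0 - F) / F
    alpha, beta = p * scale, (1.0 - p) * scale
    norm = log_beta(alpha, beta)
    return [
        math.comb(2, g) * math.exp(log_beta(g + alpha, 2 - g + beta) - norm)
        for g in (0, 1, 2)
    ]

p, F = 0.3, 0.5
probs = genotype_probs(p, F)
closed_form = [
    (1 - p) ** 2 + F * p * (1 - p),
    2 * p * (1 - p) * (1 - F),
    p ** 2 + F * p * (1 - p),
]
print([round(v, 4) for v in probs])        # [0.595, 0.21, 0.195]
print([round(v, 4) for v in closed_form])  # [0.595, 0.21, 0.195]
```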

To get the expected SFS, we need to be able to model the total number of derived alleles sampled in the population, which is the sum across the genotypes of all individuals. Given a sample of n diploid individuals (2n chromosomes), we use the random variable $D \in \{0, \dots, 2n\}$ to denote this quantity. The probability mass function for D is an n-fold convolution of beta-binomial distributions, which does not have a simple distributional form. However, we can obtain the probability mass function by considering all possible combinations of the probability of drawing D = d alleles across n beta-binomial distributions, giving us a closed-form expression for the convolution of n beta-binomial random variables:

$$
\begin{aligned}
P(D = d \mid p, F) &= \mathrm{BB}_{n}^{*}\left(d, \alpha = p \left[\frac{1 - F}{F}\right], \beta = (1 - p) \left[\frac{1 - F}{F}\right]\right) \\
&= \sum_{R \in p_{n}(d)} \frac{n!}{n_{0}! \, n_{1}! \, n_{2}!} \left[\prod_{r \in R} \mathrm{BB}(r, \alpha, \beta)\right].
\end{aligned} \tag{3}
$$

Breaking this down, we can think of it as enumerating all possible ways to generate genotypes in n individuals such that they sum to d, times the beta-binomial probability of sampling each genotype. More specifically, let $p_{n}(d)$ be the set of integer partitions with n entries that sum to d such that all entries in the partition are 0, 1, or 2 (corresponding to the possible genotype values). For example, the partitions defined by $p_{5}(4)$ are $[2, 2, 0, 0, 0]$, $[2, 1, 1, 0, 0]$, and $[1, 1, 1, 1, 0]$. Then, for each of these partitions, we use the multinomial coefficient $\frac{n!}{n_{0}! \, n_{1}! \, n_{2}!}$, with n₀, n₁, and n₂ corresponding to the number of partition entries equal to 0, 1, and 2, respectively, to account for all possible rearrangements of the partition entries. Next, we multiply the beta-binomial probabilities for the genotypes in a partition using equation (2). Taking the sum across all possible partitions gives us the full expression for the n-fold convolution, which we denote $\mathrm{BB}_{n}^{*}$ ($*$ is the mathematical operator for convolutions). Inserting this distribution into equation (1) gives us the final form for the expected SFS with inbreeding:

$$
E_{F}[d_{1}, \dots, d_{P}] = \int_{0}^{1} \dots \int_{0}^{1} \prod_{i = 1}^{P} \mathrm{BB}_{n_{i}}^{*}\left(d_{i}, x_{i} \left[\frac{1 - F_{i}}{F_{i}}\right], (1 - x_{i}) \left[\frac{1 - F_{i}}{F_{i}}\right]\right) \phi(x_{1}, \dots, x_{P}) \, dx_{i}. \tag{4}
$$
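The partition-based convolution in equation (3) can be sketched as follows. This is an illustrative reimplementation with made-up function names, not the code used in $\partial a \partial i$. Each partition is represented by how many individuals carry genotype 2, 1, and 0, and the result is checked by confirming that the probabilities sum to one and that the mean equals $2np$, the mean of a sum of n beta-binomial genotypes.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def genotype_probs(p, F):
    """Equation (2): beta-binomial probabilities of genotypes 0, 1, and 2."""
    scale = (1.0 - F) / F
    alpha, beta = p * scale, (1.0 - p) * scale
    norm = log_beta(alpha, beta)
    return [math.comb(2, g) * math.exp(log_beta(g + alpha, 2 - g + beta) - norm)
            for g in (0, 1, 2)]

def partitions(n, d):
    """Ways to write d as a sum of n genotype values in {0, 1, 2}, ignoring order."""
    out = []
    for n2 in range(d // 2 + 1):
        n1 = d - 2 * n2
        n0 = n - n1 - n2
        if n0 >= 0:
            out.append([2] * n2 + [1] * n1 + [0] * n0)
    return out

def convolution_pmf(d, n, p, F):
    """Equation (3): probability of d derived alleles among n diploid individuals."""
    bb = genotype_probs(p, F)
    total = 0.0
    for R in partitions(n, d):
        n0, n1, n2 = R.count(0), R.count(1), R.count(2)
        # Number of distinct orderings of this partition across individuals.
        multinomial = math.factorial(n) // (
            math.factorial(n0) * math.factorial(n1) * math.factorial(n2))
        total += multinomial * math.prod(bb[r] for r in R)
    return total

print(partitions(5, 4))  # [[1, 1, 1, 1, 0], [2, 1, 1, 0, 0], [2, 2, 0, 0, 0]]
pmf = [convolution_pmf(d, 5, 0.3, 0.5) for d in range(11)]
mean = sum(d * q for d, q in enumerate(pmf))
print(round(sum(pmf), 6), round(mean, 6))
```

Because each genotype takes only three values, a partition is fully determined by the number of 2s it contains, so the enumeration has at most $d/2 + 1$ terms.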

We have written a small R Shiny application illustrating the probability distribution for the beta-binomial convolution (available on GitHub). Figure 1 also shows a sample of example frequency spectra for different levels of inbreeding.
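For readers who want to reproduce the qualitative shape of such spectra without the diffusion machinery, the sketch below evaluates equation (4) for one population by midpoint integration. It assumes the textbook neutral equilibrium density $\theta / x$ in place of a density computed by $\partial a \partial i$, and all function names are illustrative. Under these assumptions, high inbreeding raises even-count entries relative to odd-count ones, whereas a very small F gives values close to the panmictic expectation $\theta / d$.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def genotype_probs(p, F):
    """Equation (2): beta-binomial probabilities of genotypes 0, 1, and 2."""
    scale = (1.0 - F) / F
    alpha, beta = p * scale, (1.0 - p) * scale
    norm = log_beta(alpha, beta)
    return [math.comb(2, g) * math.exp(log_beta(g + alpha, 2 - g + beta) - norm)
            for g in (0, 1, 2)]

def convolution_pmf(d, n, bb):
    """Equation (3), looping over the number of individuals with genotype 2."""
    total = 0.0
    for n2 in range(d // 2 + 1):
        n1 = d - 2 * n2
        n0 = n - n1 - n2
        if n0 < 0:
            continue
        coef = math.factorial(n) // (
            math.factorial(n0) * math.factorial(n1) * math.factorial(n2))
        total += coef * bb[0] ** n0 * bb[1] ** n1 * bb[2] ** n2
    return total

def expected_sfs_inbreeding(n, F, theta=1.0, grid_size=1000):
    """Equation (4) for one population of n diploids, assuming density theta / x."""
    h = 1.0 / grid_size
    sfs = [0.0] * (2 * n + 1)
    for k in range(grid_size):
        x = (k + 0.5) * h
        bb = genotype_probs(x, F)
        weight = (theta / x) * h
        for d in range(1, 2 * n):   # polymorphic entries only
            sfs[d] += convolution_pmf(d, n, bb) * weight
    return sfs

high = expected_sfs_inbreeding(5, 0.9)
low = expected_sfs_inbreeding(5, 0.001)
print([round(v, 3) for v in high[1:5]])
print([round(v, 3) for v in low[1:5]])
```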

Fig. 1. — Comparison of expected spectra for F = 0.5, 0.75, and 0.9 between $\partial a \partial i$ (dark gray) and SLiM (light gray) for the equilibrium and bottleneck+growth models.

Results

Comparison with SLiM

We used SLiM (Haller and Messer 2019) to validate the expectations of the SFS with inbreeding by simulating frequency spectra under three models (described in more detail in Simulations below): a simple equilibrium model (standard neutral model), a one-population bottleneck and growth model, and a two-population divergence and one-way migration model. Inbreeding was assumed to occur through selfing and expected frequency spectra were obtained by taking the mean of 5,000 simulations for each model. Figure 1 plots the comparison between the SFS obtained from $\partial a \partial i$ and SLiM for the equilibrium and bottleneck models with F = 0.5, 0.75, and 0.9. The frequency spectra for these models for F = 0.1 and F = 0.25 are presented in supplementary figure S1, Supplementary Material online and the comparisons for the two-population divergence model are in supplementary figure S2, Supplementary Material online. The percent differences between the frequency spectra from $\partial a \partial i$ and SLiM were between 0.1% and 0.2% for the one-population models and were between 0.02% and 0.03% for the two-population model, demonstrating that our results from modeling the expected SFS with beta-binomial distributions correspond well with the spectra simulated from SLiM.

We also used simulated frequency spectra from SLiM to estimate parameters for these three models in $\partial a \partial i$ . Figure 2 shows the distribution of estimated inbreeding coefficients for the bottleneck and growth model (root mean-squared deviation [RMSD] = 0.094) and divergence and one-way migration models (RMSD = 0.163). Similar plots for all other estimated parameters across all three models are presented in supplementary figures S3–S5, Supplementary Material online.

Fig. 2. — (a) Estimates of F from data generated with SLiM for the bottleneck and growth model (lower) plus an illustration of the model (upper). In this model, N_A is the ancestral population size, ν₀ is the size of the bottleneck (proportion of N_A remaining after population reduction), and T is the amount of time for the population to recover back to a size of N_A. (b) Estimates of F from data generated with SLiM for the divergence with one-way migration model (lower) plus an illustration of the model (upper). N_A in this model is the same as the bottleneck model, ν₂ is the size of the diverging population (again a proportion of N_A), T is the divergence time between populations, and M₂₁ is the one-way migration rate of individuals from population one into population two.

Simulations

Simulation 1: Coestimating Inbreeding and Demography

To assess our ability to estimate demographic parameters under increasing levels of inbreeding (F = 0.1, 0.25, 0.5, 0.75, and 0.9), as well as the inbreeding coefficient within a population itself, we performed demographic inference using simulated frequency spectra under three models: 1) a standard neutral model, 2) a one-population bottleneck and growth model, and 3) a two-population divergence model with unidirectional gene flow (models two and three are illustrated in fig. 2). For the standard neutral model, the inbreeding coefficient is the only parameter that needs to be estimated. The one-population bottleneck and growth model has three parameters: the inbreeding coefficient, the relative size of the bottlenecked population (ν₀ = 0.1, 0.25, and 0.5), and the recovery time back to the original size (T = 0.1, 0.2, and 0.3). The two-population model has four parameters: the inbreeding coefficient, the relative size of the diverging population (ν₂ = 0.1, 0.25, 0.5), the time of divergence from the main population (T = 0.1, 0.2, and 0.3), and the rate of gene flow from the main population into the diverged population (M₂₁ = 0.5, 1.0, and 1.5). All parameters are specified relative to the ancestral population size, which in $\partial a \partial i$ defaults to 1.0.

Supplementary figures S6–S8, Supplementary Material online, show the distribution of estimated inbreeding coefficients across 20 replicate experiments for every combination of simulation parameters for the equilibrium, bottleneck, and divergence models. For all three models, we are able to recover accurate estimates of F (Model 1 RMSD: 0.0139; Model 2 RMSD: 0.0176; Model 3 RMSD: 0.0406) even when inbreeding is quite high (F = 0.9). Supplementary figure S7, Supplementary Material online, also shows plots for estimates of bottleneck size and recovery time across inbreeding levels for model two. The RMSD values for these estimates across all simulated values were 0.0236 and 0.0184 for ν₀ and T, respectively. Supplementary figure S8, Supplementary Material online, shows similar plots for estimates of population size, divergence time, and one-way migration rate across inbreeding levels for model three. The RMSD values for these estimates across all simulated values were 0.0131 for ν₂, 0.0103 for T, and 0.158 for M₂₁.
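For reference, the RMSD values reported here can be understood through the usual definition of root mean-squared deviation between estimates and the simulated true values. The sketch below uses that standard formula with made-up numbers; how replicates were pooled across parameter combinations in the actual analysis is not specified in this section, so pooling all replicates into one list is an assumption.

```python
import math

def rmsd(estimates, truths):
    """Root mean-squared deviation between paired estimates and true values."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical estimates of F from replicates simulated at F = 0.5 and F = 0.9.
true_F = [0.5, 0.5, 0.5, 0.9, 0.9, 0.9]
estimated_F = [0.49, 0.52, 0.50, 0.88, 0.91, 0.90]
print(round(rmsd(estimated_F, true_F), 4))
```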

Simulation 2: Parameter Estimation When Inbreeding Is Ignored

To understand the impact of ignoring inbreeding on demographic inference, we simulated data sets with inbreeding under the same bottleneck and divergence models as above (models two and three) but performed inference under the assumption that inbreeding was absent. Because of initial issues with convergence in these analyses, particularly with the bottleneck model, and the fact that higher levels of inbreeding cause increasingly conspicuous changes to the SFS (e.g., see fig. 1), we used a smaller range for F in these simulations: 0.1, 0.2, 0.3, 0.4, and 0.5.

Parameter estimates for the bottleneck model had higher rates of error compared with when inbreeding was directly modeled. The RMSDs for ν₀ and T were 0.191 and 0.117, respectively. Estimates of these parameters also got worse as inbreeding increased (supplementary fig. S9, Supplementary Material online), clearly demonstrating the issues that can arise when inbreeding is ignored. In contrast, results for the divergence model were surprising in that they did not show the high levels of estimation error seen with the bottleneck model (supplementary fig. S10, Supplementary Material online). The RMSD values for the parameters of the divergence model were 0.0261 for ν₂, 0.0130 for T, and 0.142 for M₂₁. Interestingly, the RMSD for M₂₁ was actually lower in this simulation experiment than when inbreeding was modeled (0.158). However, the higher RMSD for the simulations where inbreeding is modeled is due to their inclusion of higher levels of inbreeding (F > 0.5). If we restrict the calculation of RMSD in the estimates including inbreeding to only those with $F \leq 0.5$, the RMSD is lower than when inbreeding is ignored, as expected (0.109). RMSD values for ν₂ and T were higher for model three than in Simulation 1, indicating that these parameters may be more sensitive to the effects of unmodeled inbreeding.

Simulation 3: Masking Rare Variants

Several techniques to “side-step” the impact of inbreeding have been used in empirical analyses. These include sampling only a single haplotype/chromosome per individual (e.g., Beissinger et al. 2016; Koenig et al. 2019) or masking rare variants (e.g., Cornejo et al. 2018), which are disproportionately affected at lower levels of inbreeding (fig. 1). One obvious effect of sampling only a single chromosome is that it cuts the sample size in half. However, both Pollak (1987) and Nordborg and Donnelly (1997) have described results for the diffusion and coalescent processes from a sample of single chromosomes, respectively, showing that inbreeding simply rescales the rate of these processes. The more important result of sampling a single chromosome per individual is that it discards information about levels of homozygosity, preventing us from being able to jointly estimate the level of inbreeding during a demographic analysis. Because of this, and the fact that the effect of sample size on demographic inference has already been investigated (Robinson et al. 2014), we instead focused on the effect of masking rare variants under increasing levels of inbreeding. For the bottleneck model, we masked the singleton and doubleton entries of the 1D-SFS, and for the divergence model, we masked the bottom corner of the 2D-SFS (i.e., singletons, doubletons, and their combinations across both populations). We then used the same range of parameters as in the previous simulations to see how much masking affected our inferences.
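One way to picture the masking described above is as a Boolean array that removes entries from the likelihood calculation. The sketch below builds such a mask for a 2D spectrum under the assumption that the "bottom corner" means every entry with at most two derived copies in both populations; that reading, the function names, and the toy arrays are all illustrative rather than taken from the analysis itself. The 1D case is analogous, with entries 1 and 2 masked.

```python
import math

def corner_mask(n1, n2, max_count=2):
    """True marks entries excluded from the likelihood of an (n1+1) x (n2+1) SFS."""
    mask = [[i <= max_count and j <= max_count for j in range(n2 + 1)]
            for i in range(n1 + 1)]
    mask[n1][n2] = True  # fixed in both samples, never informative
    return mask

def masked_log_likelihood(model, data, mask):
    """Poisson composite log-likelihood summed over unmasked entries only."""
    ll = 0.0
    for i, row in enumerate(mask):
        for j, skip in enumerate(row):
            if not skip:
                m, d = model[i][j], data[i][j]
                ll += -m + d * math.log(m) - math.lgamma(d + 1)
    return ll

mask = corner_mask(4, 4)
n_masked = sum(sum(row) for row in mask)

# Toy model and data with one expected and one observed SNP per entry.
model = [[1.0] * 5 for _ in range(5)]
data = [[1] * 5 for _ in range(5)]
ll = masked_log_likelihood(model, data, mask)
print(n_masked, ll)  # 10 -15.0
```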

For the bottleneck and growth model, data masking had a small but noticeable effect on parameter estimation. The bottleneck size was estimated with less accuracy than in the unmasked analyses that included inbreeding (RMSD = 0.0296) and estimates of recovery time also had higher error (RMSD = 0.0218), typically in the direction of underestimation (supplementary fig. S11, Supplementary Material online). The effect of masking was more pronounced in the divergence model (supplementary fig. S12, Supplementary Material online), particularly for the migration parameter, where the amount of gene flow was almost always underestimated across all parameter combinations (RMSD = 0.193). Population size and divergence time were also slightly underestimated when compared with models including inbreeding (RMSD = 0.0122 and 0.0103, respectively), but the effect was less pronounced.

Simulation 4: Misspecified Inbreeding

As a final test of the model for inbreeding, we simulated frequency spectra under the bottleneck and divergence models without inbreeding but included it as a parameter to be estimated. The expectation in this case is that inbreeding should be estimated close to 0 and that its inclusion in the model does not lead to poor estimates of other model parameters. However, for both models, the inbreeding parameter was always estimated to be >0. The mean estimates of F for the bottleneck and divergence models were 0.0934 and 0.212, respectively. Nevertheless, despite not estimating an absence of inbreeding, the other model parameters were estimated with only slightly higher levels of error (supplementary figs. S13 and S14, Supplementary Material online). For the bottleneck model, bottleneck size and duration had RMSD values of 0.0280 and 0.0268, respectively, which are both higher levels of error than the simulations where inbreeding was truly present. Parameters in the divergence model had RMSDs of 0.0183 for ν₂, 0.0110 for T, and 0.132 for M₂₁, showing that the two-population model was not strongly affected by the level of inbreeding estimated in population two.

Empirical Examples

American Puma

The American puma (P. concolor) is an iconic carnivore distributed primarily in western North America and South America, occupying a large diversity of habitats across its range. However, in the eastern United States, the only remnant population is the highly endangered Florida panther (Hansen 1992; Culver et al. 2000). Florida panthers have been the subject of large-scale conservation efforts aimed at ameliorating the adverse effects of small population size, including moving individuals from their closest sister population, the Texas puma, to introduce novel genetic variation (Seal and Lacy 1994; Johnson et al. 2010). Using genomic data from five individuals of Texas pumas and two individuals of “canonical” Florida panthers from Ochoa et al. (2019), we estimated the demographic history of these two populations to investigate their divergence time, changes in population size, and levels of inbreeding (see cartoon in fig. 3). More specifically, we fit a model that included an initial change in population size to mimic the colonization of North America by the Texas population (N_TX), the duration of time spent at the new population size (T₁), the divergence time between Texas pumas and Florida panthers (T₂), and the inbreeding coefficients for both the Texas and Florida populations (F_TX and F_FL).

Fig. 3. — The observed joint site frequency spectrum for *Puma concolor* in Texas and Florida, along with the model fit and residuals, for models with inbreeding (middle) and without inbreeding (right). Residuals for each model are plotted below their expected spectra and a cartoon representation of the proposed demographic model is given in the bottom left.

After processing (see Materials and Methods), 6,262,417 variant sites were retained for constructing the 2D-SFS. Because we lacked a suitable outgroup for determining ancestral versus derived allelic states, we used the folded SFS for all model fits. Table 1 lists parameter estimates and their 95% CI for models fit with and without inbreeding ($ϵ = 10^{-2}$). Uncertainty estimates across different step sizes for numerical differentiation using the Godambe information matrix (Coffman et al. 2016) are presented in supplementary tables S1 and S2, Supplementary Material online. In both models, the Texas and Florida populations are estimated to have diverged 7,000–8,000 years ago, with both also having similar estimates of the ancestral population size (120,000–130,000 individuals). As expected, the Florida population experienced a severe reduction in population size down to 1,200–1,600 individuals, as well as having a high estimate of F in the model including inbreeding (F_FL = 0.607). Texas pumas were also inferred to be inbred, though less so than Florida panthers (F_TX = 0.440). Estimates of population size for the Texas population differed between the models with and without inbreeding (70,800 individuals vs. 23,700 individuals), and the duration of the initial population size change (T₁) differed even more strongly (247,000 years vs. 26,800 years). The log-likelihoods for the model with and without inbreeding are −318,058.079 and −453,003.048, respectively, and the Godambe-adjusted likelihood ratio statistic is 425.489 (P value = ∼0.0; Coffman et al. 2016), demonstrating that the model with inbreeding has a significantly better fit to the data. In addition, when comparing the predicted SFS from the models with the observed SFS (fig. 3), the residuals for the model with inbreeding were lower overall, providing even more support for preferring the model with inbreeding. Uncertainty estimates were also typically more stable across step sizes for the model with inbreeding.

Table 1.

Parameter Estimates for Puma concolor from Demographic Models Estimated Both with and without Inbreeding.

Parameter	Estimate with Inbreeding	Estimate without Inbreeding
N_A	130,000 (129,000–132,000)	120,000 (92,400–157,000)
N_TX	70,800 (63,300–79,200)	23,700 (3,490–161,000)
N_FL	1,600 (128–19,100)	1,210 (118–12,500)
T₁	247,000 (169,000–359,000)	26,800 (504–1,420,000)
T₂	7,820 (650–94,200)	8,230 (784–86,500)
F_TX	0.440 (0.408–0.474)	—
F_FL	0.607 (0.588–0.626)	—


Note.—95% CI is given in parentheses and was estimated using a step size of $ϵ = 10^{- 2}$ for numerical differentiation. Population sizes are given in number of individuals and divergence time is given in years.

Domesticated Cabbage

Brassica oleracea is an agronomically important plant species cultivated primarily in Europe, Asia, and North America (Maggioni 2015). It is especially well known for its morphological diversity, having been domesticated into several different crops including broccoli, Brussels sprouts, cauliflower, cabbage, kale, and kohlrabi, among others. The timing and origin of domestication for these different B. oleracea crops are still uncertain, but several hypotheses have been proposed to explain their evolutionary history (Maggioni 2015). Cabbage, B. oleracea var. capitata, is thought to have been domesticated roughly 500 years ago in the Mediterranean region (Cheng, Sun, et al. 2016; Cheng, Wu, et al. 2016), providing an interesting hypothesis that we can test using demographic models.

To infer the demographic history of domesticated cabbage, we used SNP data from publicly available resequencing data for 45 individuals from Cheng, Sun, et al. (2016) and Cheng, Wu, et al. (2016). We then fit a demographic model for cabbage domestication that included two changes in population size (N₁ and N₂), the amount of time spent at these population sizes (T₁ and T₂), and the level of inbreeding (F) (see cartoon in fig. 4). We used 2,941,018 intergenic SNPs to build the folded SFS for B. oleracea var. capitata and fit models with and without inbreeding. Parameter estimates were obtained using newly implemented optimization routines in the $\partial a \partial i$ library built on top of the nlopt Python package (Johnson 2014). Parameter estimates and their 95% CI are listed in table 2 ( $ϵ = 10^{- 2}$ ). Uncertainty estimates across different step sizes for numerical differentiation using the Godambe information matrix (Coffman et al. 2016) are presented in supplementary tables S3 and S4, Supplementary Material online.

Fig. 4. — The observed site frequency spectrum for *Brassica oleracea* var. *capitata*, along with the model fit (light gray) and residuals (bottom panels), for models with inbreeding (middle) and without inbreeding (right). On the left is a cartoon of the proposed demographic model with parameters labeled.

Table 2.

Parameter Estimates for Brassica oleracea var. capitata from Demographic Models Estimated Both with and without Inbreeding.

Parameter	Estimate with Inbreeding	Estimate without Inbreeding
N_A	17,500 (16,900–18,100)	19,100 (18,500–19,800)
N₁	31,600 (28,900–34,700)	123,000 (80,400–190,000)
N₂	215,000 (4,910–9,370,000)	592 (547–641)
T₁	16,600 (12,900–21,200)	5,870 (5,200–6,620)
T₂	322 (94.2–1,097)	38.3 (32.5–45.1)*
F	0.578 (0.557–0.599)	—


Note.—95% CI is given in parentheses and was estimated using a step size of $ϵ = 10^{- 2}$ for numerical differentiation. Population sizes are given in number of individuals and times are given in years. Parameters estimated at the upper/lower bound of the given search space are marked with an asterisk (*).
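As a reading aid for the intervals in table 2, the short sketch below checks how the reported bounds sit around each point estimate. It uses only the values printed in the "with inbreeding" column; the dictionary and function names are hypothetical, and the use of the normal quantile 1.96 to back out a standard error is an assumption, not a description of the authors' procedure. The check suggests that the bounds are approximately symmetric on a logarithmic scale, which is what one would expect if uncertainties were computed for log-transformed parameters, although for the narrow intervals this is indistinguishable from symmetry on the natural scale.

```python
import math

# Values copied from table 2 (model with inbreeding): (estimate, lower, upper).
# The container and its key names are hypothetical, chosen for illustration.
table2_with_inbreeding = {
    "N_A": (17500, 16900, 18100),
    "N_1": (31600, 28900, 34700),
    "N_2": (215000, 4910, 9370000),
    "T_1": (16600, 12900, 21200),
    "T_2": (322, 94.2, 1097),
    "F": (0.578, 0.557, 0.599),
}


def log_scale_summary(estimate, lower, upper):
    """Return the geometric midpoint of the bounds and the implied log-scale SE.

    Assumes (for illustration only) that the interval is
    exp(log(estimate) +/- 1.96 * se_log).
    """
    geometric_mid = math.sqrt(lower * upper)
    implied_log_se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    return geometric_mid, implied_log_se


summary = {
    name: log_scale_summary(*values)
    for name, values in table2_with_inbreeding.items()
}

for name, (geometric_mid, implied_log_se) in summary.items():
    estimate = table2_with_inbreeding[name][0]
    print(f"{name}: estimate={estimate:g}, "
          f"geometric midpoint={geometric_mid:.4g}, "
          f"implied log-scale SE={implied_log_se:.3f}")
```

Under this reading, the very wide interval for N₂ corresponds to a large log-scale standard error, whereas F and N_A are tightly constrained.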

Much like the models inferred with and without inbreeding for American pumas, the estimates of demography for cabbage are markedly different between the two analyses. When inbreeding was not modeled, we inferred an ancestral population size for cabbage of 19,100 individuals, which expanded to a size of 123,000 individuals ∼6,000 years ago. This expanded population then experienced a very recent and severe bottleneck 38 years ago, dropping to a size of 592 individuals. The time estimate for the bottleneck consistently hit the lower bound of the parameter search space, however, suggesting that this estimate is likely not very reliable. The model with inbreeding inferred an ancestral population size of 17,500 individuals, which expanded to a size of 31,600 individuals ∼17,000 years ago. This population then experienced an even larger expansion to a size of 215,000 individuals 322 years ago. The same model inferred F to be 0.578, showing that inbreeding in these cabbage samples is fairly high. The log-likelihoods for the models with and without inbreeding were −4,281.145 and −24,330.403, respectively, and the Godambe-adjusted likelihood ratio statistic was 127.562 (P value ≈ 0.0; Coffman et al. 2016). Figure 4 also shows the observed and predicted SFS for each model plus their residuals. The residual plots clearly show that the model with inbreeding captures more of the “zig-zagging” pattern of the lower frequency variants than the model without inbreeding, demonstrating its overall better fit. Uncertainty estimates were again typically more stable across step sizes for the model with inbreeding.
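To make the scale of the likelihood comparison concrete, the sketch below reproduces the arithmetic from the reported values only. It computes the unadjusted likelihood ratio statistic from the two log-likelihoods, the adjustment factor implied by the reported Godambe-adjusted statistic, and a tail probability. The variable names are hypothetical, and the chi-square reference distribution with one degree of freedom is an assumption made for illustration; because F = 0 lies on the boundary of the parameter space, a mixture null distribution may be more appropriate and would only make the P value smaller.

```python
import math

# Values reported in the text for Brassica oleracea var. capitata.
ll_inbreeding = -4281.145       # log-likelihood, model with inbreeding
ll_no_inbreeding = -24330.403   # log-likelihood, model without inbreeding
adjusted_reported = 127.562     # Godambe-adjusted likelihood ratio statistic

# Unadjusted likelihood ratio statistic: twice the log-likelihood difference.
raw_lrt = 2.0 * (ll_inbreeding - ll_no_inbreeding)

# Factor by which the Godambe adjustment shrinks the raw statistic,
# backed out from the reported numbers (not recomputed from data).
implied_adjustment = adjusted_reported / raw_lrt


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Assumption: a plain chi-square(1) reference distribution.
p_value = chi2_sf_1df(adjusted_reported)

print(f"raw LRT            = {raw_lrt:.3f}")
print(f"implied adjustment = {implied_adjustment:.5f}")
print(f"P value (1 df)     = {p_value:.3e}")
```

The adjustment reduces the raw statistic by more than two orders of magnitude, yet the adjusted value remains far in the tail of the reference distribution, consistent with the reported P value of approximately zero.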

Discussion

The prevalence of inbreeding in nature, especially among plant lineages and small and endangered populations, makes it an important process to include in demographic models. Unlike previous approaches that rely on full-genome sequences to characterize patterns of identity by descent or the distribution of runs of homozygosity, our model uses the frequency spectrum of biallelic SNPs to infer demography, allowing it to be employed not only in model systems but also in organisms that lack a suitable reference genome. The impact of inbreeding on the SFS nonetheless has important consequences for demographic inference, a result that is well demonstrated by our simulations and example analyses. The relationship between inbreeding and population size is especially relevant for understanding inferences of past population dynamics. Below, we describe this connection in the context of our simulations and the results of our empirical analyses, drawing on previous theoretical work to help qualify our results. We then discuss the importance of considering how our current model behaves for recent versus sustained inbreeding.