Abstract
Many studies have quantified the distribution of heterozygosity and relatedness in natural populations, but few have examined the demographic processes driving these patterns. In this study, we take a novel approach by studying how population structure affects both pairwise identity and the distribution of heterozygosity in a natural population of the self-incompatible plant Antirrhinum majus. Excess variance in heterozygosity between individuals is due to identity disequilibrium, which reflects the variance in inbreeding between individuals; it is measured by the statistic g2. We calculated g2 together with FST and pairwise relatedness (Fij) using 91 SNPs in 22,353 individuals collected over 11 years. We find that pairwise Fij declines rapidly over short spatial scales, and the excess variance in heterozygosity between individuals reflects significant variation in inbreeding. Additionally, we detect an excess of individuals with around half the average heterozygosity, indicating either selfing or matings between close relatives. We use 2 types of simulation to ask whether variation in heterozygosity is consistent with fine-scale spatial population structure. First, by simulating offspring using parents drawn from a range of spatial scales, we show that the known pollen dispersal kernel explains g2. Second, we simulate a 1,000-generation pedigree using the known dispersal and spatial distribution and find that the resulting g2 is consistent with that observed from the field data. In contrast, a simulated population with uniform density underestimates g2, indicating that heterogeneous density promotes identity disequilibrium. Our study shows that heterogeneous density and leptokurtic dispersal can together explain the distribution of heterozygosity.
Keywords: heterozygosity, identity disequilibrium, population structure, isolation-by-distance
Surendranadh, Arathoon, et al. integrate a large dataset of genotypes and spatial locations with simulations to ask whether the observed distribution of heterozygosity is consistent with random mating with a patchy distribution and with various pollen dispersal scenarios. They find an excess variance in heterozygosity, which reflects significant variation in inbreeding. Simulated matings with leptokurtic pollen dispersal and a spatial pedigree conditional on actual plant locations are also consistent with the observed variation in heterozygosity, indicating that realistic population density and dispersal can explain isolation by distance and the distribution of heterozygosity.
Introduction
For most organisms, gene dispersal and therefore relatedness are spatially structured, such that individuals closer in space are more likely to mate, and be more closely related, than individuals further apart (Wright 1946; Vekemans and Hardy 2004). Such spatial population structure causes decreasing genetic similarity with geographic distance [isolation-by-distance; Wright (1943)]; this reduces the mean heterozygosity of the whole population relative to a well-mixed population. Despite the ubiquity of these patterns in nature, the role of demography and gene dispersal in determining the spatial pattern of genetic variation has not been thoroughly explored. Commonly used spatial models typically assume discrete demes and/or a uniform population density. However, natural populations are typically patchy, with heterogeneity in both the distribution and density of individuals. Patchy and heterogeneous spatial distributions within natural populations should result in spatial variation in inbreeding and, consequently, excess variance in heterozygosity. Despite this prediction, the effect of spatial heterogeneity on heterozygosity has rarely been examined in the population structure literature. Moreover, it is the interplay of heterogeneous density and dispersal that likely shapes the spatial structuring of genetic relatedness between individuals. This highlights the importance of understanding the factors (e.g. life history, demography, and population structure) that contribute to shaping the full distribution of heterozygosity and relatedness in a spatially structured population.
Understanding the drivers of variation in inbreeding within populations is fundamental, given its importance to genetic diversity and to fitness. Quantifying variation in inbreeding and combining this with measures of fitness (or fitness proxies) makes it possible, in principle, to estimate inbreeding depression either through pedigrees (Charlesworth and Charlesworth 1987; Lynch and Walsh 1998) or heterozygosity-fitness correlations (HFCs). For HFCs, inbreeding depression is estimated by comparing proxy measures of fitness against heterozygosity, with the expectation that offspring from related individuals will have lower heterozygosity. Variance in inbreeding is therefore essential for HFCs to be detected (Szulkin et al. 2010). In addition, variance in inbreeding is interesting per se because it depends on both demographic history [e.g. Sin et al. (2021)] and mating system (selfing, partial selfing, or outcrossing) (Winn et al. 2011). Outcrossing species, with generally low levels of inbreeding, provide an opportunity to examine factors other than mating system variation that may affect inbreeding variation, and thus, variance in heterozygosity.
If there is variation in inbreeding between individuals, heterozygosity at different loci will be correlated. The covariance between loci in heterozygous state is termed identity disequilibrium (ID), by analogy with linkage disequilibrium, which is the covariance in allelic state between loci. ID can be calculated across individuals and divided by the square of the mean heterozygosity to calculate the population statistic g2, which is a measure of variance in identity by descent (IBD) (Szulkin et al. 2010) amongst individuals. For an outcrossing organism with fine-scale population structure, spatial patterns of density and mating could have strong effects on the degree of mating with related individuals, and thus affect ID and g2. Furthermore, because plants are sessile, mating and offspring dispersal are mediated by external vectors (pollinators and seed dispersal mechanisms) (Loveless and Hamrick 1984). Consequently, the shape of the distribution of dispersal of both pollen and seed will also have an impact on g2. Additionally, as partial selfing will produce identity disequilibria across loci for selfed individuals, g2 can be used to estimate the selfing rate of a population, with this estimator being robust to null alleles and biparental inbreeding (David et al. 2007; Hardy 2016). If the sources of variation in inbreeding are better understood, we may be able to combine g2 with other statistics of population structure to improve inferences about demographic history (Milligan et al. 2018; Bradburd and Ralph 2019).
For over a decade, we have sampled a population of the self-incompatible plant Antirrhinum majus, the long-term aim being to build a pedigree that will allow us to estimate fitness and dispersal directly. Through that project, we have collected an exceptionally large sample of individuals with SNP genotypes that are spatially mapped. This dataset enables a powerful test of whether the observed density and dispersal in this population can account for both the decay of pairwise relatedness with distance, and for the distribution of heterozygosity across individuals. Here, we first verify that there is excess variance in heterozygosity, which reflects an underlying variance in inbreeding. Second, to understand the role of spatial patterns of dispersal in generating variance in heterozygosity, we compare the empirical distribution of heterozygosity with that of offspring from simulated matings where parents were drawn from different dispersal scales. Third, we ask whether heterogeneous population density promotes variation in inbreeding, by comparing simulated pedigrees conditioned on uniform density vs on the observed locations of plants. Taken together, addressing these questions provides insight into the underlying drivers of the distribution of heterozygosity and relatedness, and provides novel ways to study the effects of mating patterns and demography in nature.
Methods
Study system
Antirrhinum majus is a self-incompatible, hermaphroditic, short-lived perennial herb native to the Iberian Peninsula. It has a seed bank with most individuals’ parents recorded 3–4 years before they are sampled (D. Field, unpublished data). It grows in a variety of microhabitats with relatively bare soil or frequent disturbance, including rail embankments, rocky cliffs, and regularly mowed roadsides. Our study includes 2 “subspecies” that differ only in flower color: A. majus pseudomajus has magenta flowers and occurs in northern Spain and south-western France, including the Pyrenees. A. majus striatum has yellow flowers and a smaller range, encircled by A. m. pseudomajus. The subspecies are parapatric; narrow clines with intermediate color hybrids form wherever they meet, and there is no evidence for postzygotic reproductive barriers (Andalo et al. 2010). We focus on such a hybrid zone in the Vall de Ribès, Spain (Whibley et al. 2006), where we have collected demographic data annually since 2009. Across nearly all of the genome, there is little divergence within our study area between plants with different flower color, except for limited regions associated with floral pigmentation, which show steep clines (Tavares et al. 2018). Thus, the study area can be considered as a single population for studying neutral genetic variation.
Field sampling
Genetic samples were obtained annually from 2009 to 2019 from every accessible flowering individual in ∼5 km stretches of 2 parallel roads that cross the Vall de Ribès, dubbed the “lower road” (GIV-4016; ∼1,150 m elevation) and “upper road” (N-260; ∼1,350 m) (Fig. 1). We also sampled along small side roads, railroad embankments, rivers, and hiking trails. The plants grow preferentially in exposed areas such as roadsides; density was therefore very low away from these disturbed areas, including between the main sampling sites of the lower and upper roads. In some years, we were limited to genotyping only in the core area, ∼1 km along each road. The total genotyped sample summed over the 11 years is 22,353 plants, ranging from ∼750 plants in the smallest year (2018) to ∼5,500 plants in the largest year (2014). Eighteen percent of individuals were sampled in more than 1 year. Sampling was conducted during peak flowering (early June to late July). Each year there were fewer than 100 visible but inaccessible plants; consequently, we estimate that we found the majority of individuals in the sampled area.
Fig. 1.
Distribution of A. majus individuals (shown as circles) in Vall de Ribès, Spain from the years 2009 to 2019.
For each plant, we collected leaf samples for genotyping, and recorded spatial locations with GeoXT handheld GPS units (Trimble, Sunnyvale, CA, USA). These devices are accurate to within 3.7 m, determined by the mean distance between samples that had been inadvertently recorded twice in the field (individuals with similar geographic location and near-identical genotypes, allowing for SNP errors). Leaf samples were refrigerated upon return to the field station, dried in silica gel, and stored.
SNP panel
Previously, a panel of 248 SNPs spread throughout the genome was designed for the focal population [see methods in Ringbauer et al. (2018)]. We follow these methods but include an additional 5 years of data (2015–2019) and use a subset of 91 non-clinal SNPs; the mean sample size per SNP was 21,212, or ∼95% of the total [see Supplemental Material 1.1 (SM1.1) for SNP filtering methods].
IBD vs identity in state
Throughout this paper, it will be important to distinguish between IBD and identity in state. We denote the probability that 2 genes are identical by descent by F; this is defined relative to an ancestral reference population, and can in principle be calculated from the pedigree that descends from that population, independent of the actual allelic state. What we observe are biallelic SNP genotypes; the 2 homologous genes in a diploid individual will be identical in state if the genes are identical by descent, or if the ancestral genes carried the same allele. Thus, probabilities of IBD (F) can be estimated from observed identities in state. We denote the heterozygosity at locus i in a particular individual by $h_i$, with $h_i = 0$ if the genes are identical in state, and $h_i = 1$ otherwise. The mean heterozygosity of an individual is the average of $h_i$ over n loci, denoted multilocus heterozygosity $H = \frac{1}{n}\sum_{i=1}^{n} h_i$.
Isolation by distance
The panel of 91 SNPs was used to calculate FST and isolation by distance, both of which relate to the mean heterozygosity. We imputed the ∼5% missing genotypes for each SNP by randomly assigning genotypes according to the population-wide allele frequencies at each marker. FST is defined as the average IBD among individuals within a subpopulation, FS, relative to the total population, FT: $F_{ST} = (F_S - F_T)/(1 - F_T)$ (Wright 1931). These identities are estimated from SNP genotypes since we do not have the full pedigree. Two genes will have a different allelic state only if they are not identical by descent, and if they derive from different alleles in the ancestral population. Given overall ancestral allele frequencies p + q = 1, the expected heterozygosity ($E[H]$) of offspring from parents whose genes have a probability of IBD F is $2\overline{pq}(1 - F)$, where $\overline{pq}$ is an average over loci. Thus, there is a direct relation between FST and the mean heterozygosity: $1 - F_{ST} = H_S/H_T$. We use this relation to compute FST for this dataset (Jakobsson et al. 2013). (Note that here, H is the probability of nonidentity in state, which depends on the SNP genotype. The subscripts S and T refer to the specified quantity within subpopulation and total population, respectively.) Since we have a single continuous population, a subpopulation is defined as the set of pairs of individuals within a geographic separation of 20 m and total population denotes all distinct pairs of individuals in the population. Note that 20 m is an arbitrary choice of distance class used to define FST.
Isolation-by-distance, the decay of genetic similarity with geographic distance, can be observed by measuring pairwise relatedness between individuals. If individuals are separated by a distance r, then pairwise relatedness can be calculated as an extension of FST (which we refer to as pairwise Fij, denoting the relatedness between individuals i and j) by setting FS to be the probability of IBD and, correspondingly, $H_S$ to be the probability of nonidentity in state between genes which are at a distance r apart, thereby extending the idea of FS from subpopulation to a set of pairs of individuals separated by any geographic distance class. $H_S$ is calculated by finding the average pairwise heterozygosity between every pair of individuals which are within some interval $[r, r + \Delta r]$ of distance apart. This formulation is used to estimate Fij between every pair of individuals relative to the total population, as a function of their geographic separation. Pairs of individuals are binned into distance classes of 20 m each (i.e. individuals within 20 m, 21–40 m, and so on) and the average pairwise Fij and the distance corresponding to each bin is calculated. This was done for every year from 2009 to 2019, and the average calculated.
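The binning procedure described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: it assumes genotypes coded as allele counts (0, 1, 2), takes pairwise heterozygosity to be the probability that one gene drawn from each individual differs in state, and computes F for each distance class as 1 minus the ratio of the class heterozygosity to the heterozygosity over all pairs. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def pairwise_het(g1, g2):
    """Mean probability that one gene drawn from each individual differs in state.
    Genotypes are counts (0, 1, 2) of one allele at each biallelic SNP."""
    total = 0.0
    for a, b in zip(g1, g2):
        x, y = a / 2.0, b / 2.0
        total += x * (1.0 - y) + y * (1.0 - x)
    return total / len(g1)

def pairwise_f_by_distance(coords, genotypes, bin_width=20.0):
    """Return {bin_index: F}, with F = 1 - H_bin / H_total for each distance class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    total_sum, total_n = 0.0, 0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            h = pairwise_het(genotypes[i], genotypes[j])
            b = int(d // bin_width)  # bin 0 holds pairs closer than bin_width
            sums[b] += h
            counts[b] += 1
            total_sum += h
            total_n += 1
    h_total = total_sum / total_n  # heterozygosity over all distinct pairs
    return {b: 1.0 - (sums[b] / counts[b]) / h_total for b in sorted(sums)}

# Toy data: two distant clusters fixed for different alleles at both loci.
coords = [(0, 0), (5, 0), (1000, 0), (1005, 0)]
genotypes = [[2, 0], [2, 0], [0, 2], [0, 2]]
f_by_bin = pairwise_f_by_distance(coords, genotypes)
```

In this toy case the nearest distance class has a positive F and the distant classes have a negative F, because distant pairs are less similar than the average pair in the whole sample.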
Variation in inbreeding
We calculated multilocus heterozygosity for each individual pooling across all years, denoted here by H, defined as the fraction of heterozygous loci in an individual. In this system “generations” cannot be clearly defined because of seed dormancy and perenniality. However, pooling data across years only reduced H by 0.08%.
We observed an excess of individuals with around half the mean heterozygosity (see Results). To check whether the pattern was consistent with rare selfing, we compared the likelihood of a single Gaussian with a mixture of 2 Gaussian distributions, one with the observed mean and variance and the other with half its mean and variance.
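A minimal sketch of this likelihood comparison is given below. It assumes that the main component is held at the sample mean and variance, that the second component has half that mean and half that variance, and that only the mixing fraction is fitted, here by a simple grid search. The grid search, the function names, and the simulated values are illustrative assumptions, not the procedure used in the study.

```python
import math
import random

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def mixture_gain(h_values, grid_steps=100, max_fraction=0.05):
    """Gain in log likelihood of the 2-component mixture over a single Gaussian.
    Returns (gain, best_fraction); the fraction is fitted on a grid."""
    n = len(h_values)
    mean = sum(h_values) / n
    var = sum((h - mean) ** 2 for h in h_values) / n
    ll_single = sum(normal_logpdf(h, mean, var) for h in h_values)
    best_ll, best_frac = ll_single, 0.0
    for step in range(1, grid_steps + 1):
        frac = max_fraction * step / grid_steps
        ll = 0.0
        for h in h_values:
            main = (1.0 - frac) * math.exp(normal_logpdf(h, mean, var))
            low = frac * math.exp(normal_logpdf(h, mean / 2.0, var / 2.0))
            ll += math.log(main + low)
        if ll > best_ll:
            best_ll, best_frac = ll, frac
    return best_ll - ll_single, best_frac

# Hypothetical data: 2,000 typical individuals plus 40 with about half the heterozygosity.
rng = random.Random(1)
typical = [rng.gauss(0.44, 0.055) for _ in range(2000)]
sample = typical + [rng.gauss(0.22, 0.03) for _ in range(40)]
gain, fraction = mixture_gain(sample)
null_gain, _ = mixture_gain(typical)
```

A gain in log likelihood well above 2 for one extra parameter would favor the mixture; the sample without the low-heterozygosity group should show little or no gain.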
The variance in individual heterozygosity consists of 2 components. The first is due to the variance in whether an individual locus is heterozygous, and decreases in inverse proportion to the number of SNPs, n: it equals $\frac{1}{n^2}\sum_{i}\mathrm{var}(h_i)$. The second is due to covariance in heterozygosity between loci, which is termed the ID. For a given pedigree, unlinked genes flow independently. Thus, given the pedigree, heterozygosity is independent across unlinked loci, and so this second component is proportional to the variance in inbreeding across individuals, var(F). The first component can be estimated from the allele frequencies, or simply by shuffling the data across individuals within loci, to eliminate ID. The excess variance is then proportional to the variance in F across individuals, and is measured by the statistic g2:
$$g_2 = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{\mathrm{cov}[h_i, h_j]}{E[h]^2}$$
[from Equation (1) in Szulkin et al. (2010)]. Here, $\mathrm{cov}[h_i, h_j]$ is the ID between loci i and j, and the sum over all distinct i, j is the excess variance in H due to ID. Dividing by the square of the mean heterozygosity $E[h]^2$ eliminates dependence on allele frequency, such that g2 estimates the variance in F across individuals.
To describe the variance of inbreeding across individuals, we first checked whether the variance in individual heterozygosity was significantly greater than the average variance across 100 shuffled replicates. In each replicate, heterozygous status was shuffled randomly across individuals within loci, which eliminates correlations between loci generated by ID. We then computed g2 using the g2_snps function from the R package inbreedR [in R version 3.6.1 (R Core Team 2014)], which implements a modified formula for large data sets to estimate g2, and provides confidence intervals via bootstrapping to account for the finite number of individuals sampled (Stoffel et al. 2016). We decomposed ID into components due to linked and unlinked SNPs by comparing correlations of H for all individuals to those with low H, at several scales: across all pairs, within linkage groups, and between adjacent SNPs (SM1.2: Supplementary Table 1).
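The logic of the shuffling test and of g2 can be illustrated with a simplified moment estimator. The sketch below is not the estimator implemented in inbreedR; it assumes complete data (no missing genotypes), takes g2 as the ratio of the summed products of heterozygosity across distinct locus pairs to the summed products of locus means, minus 1, and uses simulated individuals whose inbreeding coefficient is either 0 or 0.5. All names and parameter values are hypothetical.

```python
import random

def g2_moment(het):
    """Simplified g2 for complete data; het[k][i] is 1 if individual k is
    heterozygous at locus i, else 0."""
    n_ind = len(het)
    n_loci = len(het[0])
    locus_means = [sum(row[i] for row in het) / n_ind for i in range(n_loci)]
    # For binary h, the sum over ordered pairs i != j of h_i * h_j is S^2 - S,
    # where S is the number of heterozygous loci in the individual.
    num = sum(s * s - s for s in (sum(row) for row in het)) / n_ind
    total = sum(locus_means)
    den = total * total - sum(m * m for m in locus_means)
    return num / den - 1.0

def shuffle_within_loci(het, rng):
    """Permute heterozygous status across individuals separately for each locus."""
    n_loci = len(het[0])
    columns = []
    for i in range(n_loci):
        col = [row[i] for row in het]
        rng.shuffle(col)
        columns.append(col)
    return [[columns[i][k] for i in range(n_loci)] for k in range(len(het))]

# Hypothetical population: half the individuals have F = 0.5, the rest F = 0.
rng = random.Random(42)
het = []
for _ in range(2000):
    f = 0.5 if rng.random() < 0.5 else 0.0
    het.append([1 if rng.random() < 0.5 * (1.0 - f) else 0 for _ in range(50)])

g2_observed = g2_moment(het)
g2_shuffled = g2_moment(shuffle_within_loci(het, rng))
```

With variation in F among individuals the estimate is clearly positive, whereas shuffling within loci preserves each locus mean but removes the correlations between loci, so the estimate falls to approximately zero.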
Additionally, g2 can be used to estimate selfing rate within a population (David et al. 2007). Using the software SPAGeDi (Hardy and Vekemans 2002), which implements the g2-based selfing rate calculation described in David et al. (2007), the selfing rate was estimated for the full population using the 91 SNP data.
Effects of pollen dispersal on heterozygosity
With isolation by distance, the distribution of heterozygosity is expected to depend on the distance between parents: heterozygosity of offspring from nearby parents will have a lower mean and higher variance compared with offspring from distant parents. To test this prediction, we simulated offspring using all field individuals as mothers and choosing fathers from a given distance away (detail in SM1.3). Then we compared the distribution of H between the field data and offspring simulated from matings with 3 models of pollen dispersal: the nearest neighbor to the mother, a Gaussian distribution (σ = 300 m), and a leptokurtic dispersal kernel sampled from 1,463 empirical measurements of pollen dispersal, estimated as the distances between assigned parents (Supplementary material; D. Field, unpublished data). The CDF of the latter distribution (SM1.3: Supplementary Fig. 3) shows that 75% of the matings occur within 60 m; the distribution has a kurtosis of 16.5. The genotype of the offspring was assigned using Mendelian inheritance, either without linkage between markers, or using the known linkage map (Supplementary material; courtesy of Yongbiao Xue, Beijing Institute of Genomics). Including linkage did not substantially change results, so we mainly show results for simulations without linkage. We compared distributions, means, and variance of H using Kolmogorov–Smirnov tests, t-tests, and F-tests, respectively. For the leptokurtic pollen dispersal simulation, we checked for an excess of low-heterozygosity individuals generated by mating between close relatives by asking whether a mixture of 2 Gaussian distributions is more likely than a single Gaussian distribution.
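A single simulated mating of this kind might look like the sketch below, for unlinked markers only. The rule used to pick a father (the plant whose distance from the mother is closest to a drawn dispersal distance) is an assumption made for illustration, since the actual procedure is described in SM1.3; the function names and toy data are likewise hypothetical.

```python
import math
import random

def transmit(genotype, rng):
    """Allele (0 or 1) passed on by a parent whose genotype is coded as 0, 1 or 2 copies."""
    if genotype == 1:
        return rng.randint(0, 1)  # heterozygote transmits either allele with probability 1/2
    return genotype // 2          # homozygote always transmits the allele it carries

def make_offspring(mother, father, rng):
    """Mendelian inheritance at unlinked loci."""
    return [transmit(m, rng) + transmit(f, rng) for m, f in zip(mother, father)]

def choose_father(mother_idx, coords, distance):
    """Assumed rule: the other plant whose distance from the mother is closest to `distance`."""
    best, best_gap = None, float("inf")
    for j, xy in enumerate(coords):
        if j == mother_idx:
            continue
        gap = abs(math.dist(coords[mother_idx], xy) - distance)
        if gap < best_gap:
            best, best_gap = j, gap
    return best

def heterozygosity(genotype):
    """Multilocus heterozygosity H: the fraction of heterozygous loci."""
    return sum(1 for g in genotype if g == 1) / len(genotype)

rng = random.Random(0)
coords = [(0, 0), (10, 0), (100, 0)]
near = choose_father(0, coords, 5.0)
far = choose_father(0, coords, 90.0)
child = make_offspring([0, 0, 2], [2, 2, 2], rng)
```

Drawing `distance` from the nearest-neighbor, Gaussian, or empirical leptokurtic distribution, and repeating over all mothers, would give one replicate distribution of H per dispersal model.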
Heterozygosity in a simulated spatial pedigree
In order to compare the actual distribution of heterozygosity with that expected for a spatially structured population, we simulated a continuous 2D population, conditioned on the known locations of the individuals and the empirically measured seed and pollen dispersal distances (Supplementary material; D. Field, unpublished data), using Mathematica 12.0 (Wolfram Research, Inc 2019). Our simulation differs from commonly used models [e.g. island (Wright 1931), stepping stone (Kimura and Weiss 1964), and continuous Wright–Malécot model (Wright 1943; Malécot 1948)] in that we include heterogeneity in density by specifying actual locations to determine relationships in the pedigree. Thus, our simulation parameters should be seen as “effective” values, analogous to the traditional Ne. Additionally, we also validated our simulation by comparing pairwise relatedness directly from the simulated pedigree and from replicate genotypes, and compared the realized and proposed dispersal kernels (SM1.4: Supplementary Figs. 6 and 7).
First, we simulated a population with uniform density (the continuous Wright–Malécot model) as a null model, to compare expected heterozygosity with and without heterogeneous spatial structure. We simulate a region of ∼1.1 × 1.8 km that was sampled consistently in the A. majus focal population (SM1.4: Supplementary Fig. 4). Locations were assigned by randomly sampling N points from a uniform distribution each generation, for 1,000 generations. Genetic diversity is shaped over the coalescent timescale [2Ne, ∼170,000 generations in A. majus (Tavares et al. 2018)], which is far longer than the 1,000 generations that we simulate. However, we are concerned here with the local population structure that determines the variation in inbreeding amongst individuals within an area of a few square kilometers, which will equilibrate rapidly (Malécot 1948). The spatial pedigree was generated by choosing parents for each individual according to a backwards dispersal distribution measured empirically. The seed and pollen dispersal distances are estimated respectively as the distance between offspring and nearest parent (assumed to be the mother) and between parents (Supplementary material; D. Field, unpublished data). For every offspring, the mother and father are chosen from randomly drawn distances from the seed and pollen dispersal distributions. To choose a parent from a distance r, 6 points are assigned randomly on a circle of radius r centered at the focal individual and the nearest individual to each of them is found. The closest individual to any of these points is then chosen as the parent. The same procedure is repeated for the father, taking the mother as the starting point. Since A. majus is self-incompatible, the mother and father are not allowed to be the same individual. The accuracy of our algorithm is verified by comparing the specified and realized seed and pollen dispersal distributions for the simulated pedigrees (SM1.4: Supplementary Fig. 7 and Table 4).
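The parent-choice step can be sketched as below. This is an illustrative reading of the description, written in Python rather than the Mathematica used for the study: 6 random points are placed on a circle of radius r around the focal location, and the individual closest to any of those points is returned. The brute-force nearest-neighbor search, the `exclude` argument (used to prevent selfing when choosing the father), and the grid of locations are assumptions made for the example.

```python
import math
import random

def choose_parent(focal_xy, r, coords, rng, exclude=(), n_points=6):
    """Return the index of the individual closest to any of n_points random
    points on a circle of radius r centered on focal_xy."""
    best_idx, best_dist = None, float("inf")
    for _ in range(n_points):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        target = (focal_xy[0] + r * math.cos(angle),
                  focal_xy[1] + r * math.sin(angle))
        for idx, xy in enumerate(coords):
            if idx in exclude:
                continue  # e.g. the mother, when choosing the father
            d = math.dist(target, xy)
            if d < best_dist:
                best_idx, best_dist = idx, d
    return best_idx

# Hypothetical parental generation on a 5-m grid; the focal location is (50, 50).
coords = [(x, y) for x in range(0, 101, 5) for y in range(0, 101, 5)]
focal_idx = coords.index((50, 50))
rng = random.Random(3)
parent = choose_parent((50.0, 50.0), 20.0, coords, rng, exclude={focal_idx})
realized = math.dist(coords[parent], (50, 50))
```

In a dense population the realized distance is close to the requested r, whereas in a sparse or patchy one it can deviate, which is why comparing the specified and realized dispersal distributions is a useful check.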
Once the spatial pedigree is generated, 10 replicate sets of genotypes are assigned by dropping genes down the pedigree, starting with equal expected frequencies of both alleles at each of 91 loci. In fact, one could start with any initial frequencies, since FST-like measures are independent of them. Population size was adjusted so that FST matched the empirical data for the simulated sampling area. This was done by first simulating the population with an initial population size (N) and then repeating the process with higher or lower N until the desired FST is attained.
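One way to organize the search for N is a bisection, sketched below under the assumption that the simulated FST decreases as N increases. The bisection itself, the function name `tune_population_size`, and the toy stand-in for the simulation are hypothetical; in practice each evaluation would require generating a pedigree and replicate genotypes, and the result would be stochastic.

```python
def tune_population_size(simulate_fst, target_fst, n_low, n_high, tol=0.001, max_iter=30):
    """Bisection on N, assuming simulated FST decreases as N increases.
    simulate_fst is any callable that maps N to an FST estimate."""
    n_mid, fst = n_low, simulate_fst(n_low)
    for _ in range(max_iter):
        n_mid = (n_low + n_high) // 2
        fst = simulate_fst(n_mid)
        if abs(fst - target_fst) <= tol or n_high - n_low <= 1:
            break
        if fst > target_fst:
            n_low = n_mid   # too much structure: try a larger population
        else:
            n_high = n_mid  # too little structure: try a smaller population
    return n_mid, fst

def toy_fst(n):
    """Purely illustrative stand-in for a full pedigree simulation."""
    return 1.0 / (1.0 + n / 350.0)

n_fit, fst_fit = tune_population_size(toy_fst, 0.022, 1000, 100000)
```

The constant in `toy_fst` has no biological meaning; it is chosen only so that the example converges within the search interval.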
Next, we simulated a population with realistic heterogeneous spatial structure by using the individual locations available for the years 2009 to 2019 in the A. majus focal population (SM1.4: Supplementary Fig. 5). There were fewer individuals from 2017 to 2018, so these were merged, giving distribution data for 10 time points. We randomly sample from the 10 consecutive time points, and repeat for 100 cycles, thus iterating for 1,000 generations. We subsample from these locations to maintain a constant population size (N). If N is greater than the number of plants available in a given time point, say k, all k plants are first included and the remaining N − k locations are resampled from the same time point, each displaced at a random angle on a circle of radius 3 m to avoid having plants in the same location. This naïve approach allows us to simulate a spatial structure that is realistic over at least small scales. We then generated a pedigree following the procedure used for the uniform population, again adjusting population size to match the empirically observed FST. Ten replicate sets of genotypes were run for each of 5 replicate pedigrees.
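The resampling of locations for one time point can be sketched as follows. This is a minimal illustration of the rule described above, with a hypothetical function name and toy coordinates; it assumes that when N does not exceed k, a random subset of N observed locations is used.

```python
import math
import random

def sample_locations(points, n, rng, jitter_radius=3.0):
    """Return n locations for one generation from the k observed points.
    If n <= k, take a random subset; otherwise keep all k points and add
    n - k resampled points, each displaced by jitter_radius at a random angle."""
    k = len(points)
    if n <= k:
        return rng.sample(points, n)
    locations = list(points)
    for _ in range(n - k):
        x, y = rng.choice(points)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        locations.append((x + jitter_radius * math.cos(angle),
                          y + jitter_radius * math.sin(angle)))
    return locations

observed = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
rng = random.Random(1)
padded = sample_locations(observed, 5, rng)
subset = sample_locations(observed, 2, rng)
```

Each added location sits exactly 3 m from an observed plant, so the patchiness of the observed distribution is retained while duplicate coordinates are avoided.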
Patterns of isolation by distance, heterozygote deficit (FIS), and ID were compared between the 2 simulation types and the field data (calculated from the simulated subarea of the field site). As the fitted population sizes were large (see Results), obtaining direct estimates of IBD and thus FST from the pedigrees was not feasible. Instead, FST was obtained for a pedigree as the average of replicate genotype sets generated from that pedigree. FIS was calculated from the observed and expected heterozygosity. Values of g2 were calculated for each replicate from each pedigree using inbreedR [in R version 3.6.1 (R Core Team 2014)].
Results
Isolation by distance
If we consider pairs of individuals within 20 m of each other, the average FST over the 11 years is 0.0244; however, this is an average over a quantity that depends strongly on distance. The average pairwise Fij was calculated each year for individuals separated by different distance classes and then averaged across years. Pairwise relatedness (pairwise Fij) between individuals decreased rapidly with geographic distance, showing isolation by distance (Fig. 2a). The sharp decline in pairwise identity over short spatial scales corresponds precisely to a rapid increase in H with distance between parents (SM1.3: Supplementary Fig. 1), since heterozygosity is determined by the probability of IBD between the genes from each parent. Note that over large separations (>1 km), pairwise Fij values are necessarily negative, because distant individuals are less closely related than the average for the whole population.
Fig. 2.
a) Pairwise relatedness (pairwise Fij) between individuals decreases rapidly with geographic distance showing isolation-by-distance in the field data. b) Probit transform of the cumulative distribution function (CDF) of the distribution of individual heterozygosity (H). A Gaussian appears as a straight line on a probit scale, and the y-axis is the number of SDs of the standard normal distribution.
Variation in inbreeding
Excess variance in the distribution of individual heterozygosity (H) in the field data shows that there is variance in inbreeding in the population (Fig. 2b). Furthermore, there is an excess of individuals with around half the mean heterozygosity (i.e. with H∼0.22, rather than 0.44; Fig. 2b, blue, lower left). These might be due to a low rate of selfing, and using the g2 estimator calculated with SPAGeDi, the selfing rate for the population is estimated to be 1.2%. Indeed, a mixture between 2 Gaussians with means ∼ 0.22 and 0.44, and variances in the same ratio, fits significantly better than a single Gaussian (Fig. 2b, compare red and black to blue) with an increased likelihood of 11.3. However, we shall see in the next section that this excess is also consistent with matings between close relatives, without the need to invoke a breakdown in self-incompatibility.
To examine whether the observed distribution of heterozygosity differs significantly from that of a population with zero ID, we compared the field data with heterozygous values shuffled across individuals, which eliminates ID by removing correlations between loci. We found greater variance in heterozygosity in the observed compared with the randomly shuffled field data (Fig. 2b, gray). For both data sets, the mean heterozygosity (0.44602) necessarily remains the same, but the observed variance in the field data [var(H) = 0.00336] was significantly higher than the average variance in 100 shuffled replicates [mean var(H) = 0.00282, SD 0.000029]. This excess variance between the observed and shuffled data implies that the mean standardized ID is g2 = 0.0029 (95% CI: 0.0026–0.0033), representing a significant variance in inbreeding between individuals.
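As a rough consistency check on these figures (a back-of-envelope sketch, not the inbreedR estimator), suppose that g2 is approximately the excess variance in H divided by the squared mean heterozygosity, scaled by n/(n − 1) for n = 91 loci. This treats all loci as having the same expected heterozygosity, which is an assumption made here for illustration:

```latex
\begin{align*}
\mathrm{var}(H)_{\text{observed}} - \mathrm{var}(H)_{\text{shuffled}} &= 0.00336 - 0.00282 = 0.00054 \\
g_2 &\approx \frac{n}{n-1}\cdot\frac{0.00054}{0.44602^2}
     = \frac{91}{90}\cdot\frac{0.00054}{0.19893} \approx 0.0027
\end{align*}
```

This value lies within the reported 95% confidence interval (0.0026–0.0033) and slightly below the point estimate of 0.0029; a small discrepancy is unsurprising given the rounding of the reported variances and the simplifications in this approximation.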
The overall ID, as measured by g2, is due to correlations in heterozygosity between all pairs of loci, most of which are unlinked. We expect stronger correlations between linked loci, because relatives will share blocks of genome. We found that the mean covariance in heterozygosity between SNPs on the same linkage group is substantially stronger than the overall mean (0.00265 vs 0.00056). If we restrict attention to those individuals with H < 0.3, we find that the covariance in heterozygosity between SNPs on the same linkage group is higher still (0.00649), as expected if close relatives share long blocks of genome IBD. This higher covariance in heterozygosity translates to higher mean g2, which is seen within linkage groups compared with the overall value (SM1.2: Supplementary Table 1).
Effects of pollen dispersal on heterozygosity
The heterozygosity of simulated offspring depends on distance between their parents, with a rapid increase in mean H with distance (SM1.3: Supplementary Fig. 1). We compared the observed distribution of heterozygosity with 3 alternative scenarios for pollen dispersal. There was no significant difference in either the mean or the variance of heterozygosity between the field data and offspring simulated from the observed leptokurtic dispersal. However, the mean and variance of heterozygosity differed between the field data and simulated matings with either nearest neighbors, or with Gaussian dispersal (Fig. 3a; SM1.3: Supplementary Tables 2 and 3). Although all 3 dispersal schemes differed from the field data in the tails of the distribution, as assessed by Kolmogorov–Smirnov tests, Gaussian and nearest neighbor matings deviated far more from the field data than leptokurtic dispersal did (SM1.3: Supplementary Table 3). These comparisons were made for a single replicate, but because each involves 22,353 individuals, there was little variation in the mean and variance between replicates.
Fig. 3.
a) Probit transform of the CDF of multilocus heterozygosity, H, for the field data vs a single replicate of offspring simulated from Gaussian pollen dispersal, nearest neighbor matings, and leptokurtic pollen dispersal. A normal distribution (straight line) with the same mean and SD as the field data is included for comparison. b) ID (g2) for the same data as above, indicating mean and 95% CI.
We next examined deviations in the left tail of the distribution, where an excess of low heterozygosity individuals might arise from selfing or from matings between close relatives. We focused on the leptokurtic dispersal curve, which was the distribution closest to the field data. We estimated the increase in likelihood between fitting a single vs mixed Gaussian distribution (see Variation in inbreeding) for 100 replicate simulations. We found that the mixed Gaussian was a better fit than a single Gaussian, with an increase in log likelihood >2 for 69 of 100 replicates. The estimated fraction of putatively “selfed” individuals was 0.00043, averaged over replicates, which is about half the estimate from the actual data, 0.00086. In comparison, only 4/100 replicates gave higher estimates than that observed (SM1.3: Supplementary Fig. 2). This suggests that the excess of individuals with low heterozygosity can to a large extent be explained by matings between relatives under leptokurtic pollen dispersal. Nevertheless, there is a marginally significant excess of such individuals, with twice as many being seen as expected from our simulations. There is considerable variation in fit between replicates, simply because deviations in the tail involve few individuals.
The coefficient g2 reflects excess variation due to ID, and showed patterns similar to those for the variance in H. We found no significant difference between g2 from field data and from offspring of simulated matings with leptokurtic pollen dispersal. However, g2 values from Gaussian and nearest neighbor matings were 80% higher than those from field data and leptokurtic matings. This nominally represents a significant difference, given that the 95% confidence intervals of these groups do not overlap (Fig. 3b). However, as we discuss below, these confidence intervals only include sampling error, and not the additional variance due to random evolutionary realizations.
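For readers unfamiliar with g2, the sketch below computes it as the excess of joint heterozygosity at pairs of loci over the product of single-locus heterozygosities, in the spirit of the estimator of David et al. (2007). It assumes a complete 0/1 heterozygosity matrix with no missing genotypes; the function name is hypothetical, and the inbreedR implementation used for the analyses handles missing data and may differ in detail.

```python
def g2_estimate(het):
    """Estimate g2 from heterozygosity indicators.

    het: list of individuals, each a list of 0/1 values per locus
    (1 = heterozygous). Assumes no missing data.
    """
    n = len(het)
    n_loci = len(het[0])

    # Numerator: sum over ordered locus pairs (i != j) of the mean,
    # across individuals, of h_i * h_j within the same individual.
    num = sum(sum(row) ** 2 - sum(h * h for h in row) for row in het) / n

    # Denominator: the same sum, but with locus i and locus j taken from
    # different individuals, which estimates E[h_i] * E[h_j] without the
    # within-individual association.
    locus_tot = [sum(row[i] for row in het) for i in range(n_loci)]
    cross = sum(locus_tot) ** 2 - sum(t * t for t in locus_tot)
    den = (cross - n * num) / (n * (n - 1))

    return num / den - 1.0


# Heterozygosity perfectly correlated across 2 loci vs independent.
print(g2_estimate([[1, 1], [1, 1], [0, 0], [0, 0]]))  # 2.0
print(g2_estimate([[1, 0], [0, 1], [1, 1], [0, 0]]))  # 0.0
```

A value of zero indicates that heterozygosity is uncorrelated across loci, while positive values indicate that some individuals are more inbred than others.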
Heterozygosity in a simulated spatial pedigree
In the previous section, we simulated offspring across one generation. To examine whether the observed heterozygosity is consistent with a spatially structured model, we simulated pedigrees over 1,000 generations with uniform and heterogeneous density, conditioned on the locations of individuals observed over 10 years, repeated over 100 cycles for the latter case. The realized seed and pollen dispersal matched the empirical seed and pollen dispersal distribution for both density types (SM1.4: Supplementary Fig. 7 and Table 4). We required N = 15,500 individuals for the heterogeneous density model and 40,000 individuals for the uniform density, in order to match the observed FST ∼ 0.022 calculated over a 20-m scale from the simulated subarea of the field site (SM1.4: Supplementary Table 5). Up to distances of 1 km, the decline in pairwise identity with distance matched between the field data and the 5 replicate pedigrees simulated with heterogeneous density (Fig. 4a; SM1.4: Supplementary Fig. 8a). High variation among replicates suggests that many more SNPs would be needed to match the pattern from the pedigree (SM1.4: Supplementary Fig. 8b); moreover, linkage would increase this variance to some extent. We also compared the pattern of isolation by distance from the field data with that from the pedigrees generated for both the heterogeneous and uniform density scenarios (Fig. 4b; SM1.4: Supplementary Fig. 9); the heterogeneous density is a much better fit than the uniform density (SM1.4: Supplementary Table 5).
Fig. 4.
a) Isolation by distance compared between the field data and 5 replicate simulated pedigrees based on a heterogeneous population density. b) Isolation by distance from the field data compared between the simulated pedigree with a heterogeneous and uniform population density.
ID (g2) estimates from the genotypes from pedigrees simulated with heterogeneous density showed substantial variation between the 5 simulated pedigrees, and between the 10 draws of 91 SNPs from each pedigree (Fig. 5). The average g2 estimated from the 5 pedigrees (each with 10 replicates) is 0.00264, which is consistent with the observed mean annual g2 from the field of 0.00262. On the other hand, when assuming a uniform density, the average g2 of 0.00171 is significantly lower than that of the field data. Note that the confidence limits for the field data, generated by inbreedR, only include error due to sampling a limited number of individuals. These errors do not account for sampling a limited number of SNPs, or the random variation between evolutionary realizations (see Discussion).
Fig. 5.
ID (g2) calculated from field data vs simulated pedigrees. Five of 10 replicates per pedigree are shown (‘Het.Dens. 1-5’: 5 simulated pedigrees with heterogeneous density; ‘Uniform’: 1 simulated pedigree with uniform density). Mean from the field (‘Mean Field’) is across 2009–2019, while mean from the heterogeneous (‘Mean Het.Dens.’) and uniform (‘Mean Unif.’) simulations is across all replicates. The final year of field data (‘Field 2019’) is comparable to g2 calculated from the final year of pedigree replicates.
Discussion
An enduring problem in evolutionary biology is understanding how demographic processes, such as heterogeneous density and dispersal, interact with spatial structure to determine the distribution of heterozygosity within populations. In this study of a long-term dataset, including more than 20,000 plants sampled over 11 years, we combine field data and simulations to address questions central to understanding how demography can influence patterns of heterozygosity. Namely, can we predict the distribution of heterozygosity for an outcrossing species from key demographic parameters? To address this question, we first confirmed that there was significant correlation in heterozygosity between markers (g2, a measure of ID), which implies variation in inbreeding. By simulating offspring from matings between geo-referenced, genotyped individuals, we show that the mean heterozygosity increases, and the variance of heterozygosity decreases, with increasing distance between parents; strikingly, these changes occur over very short scales (∼10 m, SM1.3: Supplementary Fig. 1). We found that the observed distribution of heterozygosity is consistent with the known leptokurtic distribution of pollen dispersal. We also simulate the population over 1,000 generations using the actual seed and pollen dispersal kernels, and the observed heterogeneous density. We found that this model matches the observed ID, whereas a model with uniform density substantially underestimates the observed patterns. Thus, we explain the distribution of heterozygosity (mean, variance, and tails) using known features of the population. Moreover, our results also highlight the limitations of making theoretical predictions from simulations that only assume simple demographies. Taken together, our findings highlight the potential for using the observed demography to explain the distribution of genetic diversity, and specifically the variance in inbreeding in spatially continuous populations.
Variation in heterozygosity within populations provides the potential for selection to reduce the frequency of less fit, inbred individuals. The association between inbreeding and fitness is often tested through heterozygosity-fitness correlations (HFCs), which quantify inbreeding depression in natural populations by correlating measures of fitness with heterozygosity (Szulkin et al. 2010). Many studies that test for HFCs find that the excess variation in heterozygosity, g2, which arises from ID, is low and rarely significant (Miller and Coltman 2014). In our study, we estimate a significant g2 of 0.0029 (95% CI: 0.0026–0.0033). Although low, this estimate is of the same order as most of the g2 values found across 105 vertebrate populations in a meta-analysis of 50 HFC studies (average of 0.007) (Miller and Coltman 2014), and of the same order as ∼60% of the local populations surveyed in a long-lived tree (Rodríguez-Quilón et al. 2015). Our estimate of significant variation in heterozygosity provides the opportunity to examine potential drivers of this variance and to examine how density, spatial structure, and dispersal contribute to a non-uniform distribution of heterozygosity.
In our study, beyond simply estimating ID, we use 2 types of simulation to explore how demography shapes variation in inbreeding. The first simulation shows how the spatial pattern of pollen dispersal affects the distribution of heterozygosity. Simulated matings with the empirically measured leptokurtic pollen dispersal curve were consistent with the actual g2, unlike matings with nearest neighbors or with Gaussian pollen dispersal. This result is somewhat surprising because we did not include the complexities of the mating system of A. majus. Antirrhinum majus has a gametophytic self-incompatibility system (McCubbin et al. 1992), whereby the pollen detoxifies secretions from the style unless the pollen and style genotypes share alleles at the S-locus (Fujii et al. 2016). This system not only prevents selfing, but also reduces mating among relatives (i.e. biparental inbreeding) because related plants are more likely to share S-alleles (Charlesworth and Charlesworth 1987; Cartwright 2009). Thus, we might expect that our simulated matings would have lower mean heterozygosity than the empirical measurement; yet we found no evidence for such an effect. Indeed, we found that the excess of individuals with low heterozygosity, around half the mean, can be explained largely by a small amount of biparental inbreeding with leptokurtic pollen dispersal (Fig. 3; SM1.3: Supplementary Fig. 2). However, we have little statistical power to distinguish this from rare selfing, which can occur in self-incompatible species. In fact, using the g2 estimator of selfing rate from Equation (9) in David et al. (2007), our significant g2 value would imply a selfing rate of 1.2% for this population. However, as shown by Hardy (2016), this estimate could be within the bounds of the upward bias of the estimator if strong biparental inbreeding is present. Hence, it does not necessarily imply a breakdown of self-incompatibility.
We believe that our method, fitting a model of 2 Gaussians, is a more robust way to estimate selfing than using g2, since it focuses on the low-H individuals rather than the whole variance. However, it is still challenging to distinguish selfing from close inbreeding.
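To make the link between g2 and selfing rate concrete, the sketch below inverts the relationship that follows from a simple mixed-mating model. The assumptions are ours for illustration: unlinked loci, a constant selfing rate s at equilibrium, and no biparental inbreeding. An individual descended through k consecutive generations of selfing (probability (1 − s)s^k) keeps each ancestral heterozygous locus with probability (1/2)^k, which gives g2 = s / ((4 − s)(1 − s)). We have not checked that this is written in the same form as Equation (9) of David et al. (2007), and the function names are hypothetical.

```python
import math


def g2_from_selfing(s):
    """g2 expected under the assumed mixed-mating model with selfing rate s."""
    return s / ((4 - s) * (1 - s))


def selfing_from_g2(g2):
    """Invert g2 = s / ((4 - s)(1 - s)) for the root with 0 <= s < 1.

    Solves g2*s^2 - (1 + 5*g2)*s + 4*g2 = 0, written in a form that
    avoids cancellation when g2 is small.
    """
    if g2 <= 0:
        return 0.0
    b = 1 + 5 * g2
    return 8 * g2 / (b + math.sqrt(b * b - 16 * g2 * g2))


s = selfing_from_g2(0.0029)
print(round(s, 4))  # 0.0114
```

For small values, s is approximately 4 × g2, so a g2 of 0.0029 corresponds to a selfing rate of about 1.1 to 1.2% under these assumptions, in line with the value quoted above.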
Our second simulation approach asked whether heterogeneous density promotes variation in inbreeding, given strong fine-scale population structure indicated by a rapid decay in pairwise Fij (over a few meters, Fig. 2a). We only provide a proof-of-principle, by asking whether a plausible model of spatial structure can explain the observed heterozygosity. We do not include all features of the actual population: in particular, we extrapolate by repeatedly sampling 10 years of spatial distributions; we ignore linkage; we simplify the self-incompatibility system; and we assume an annual life cycle (no perenniality or seed bank). Indeed, simulated pedigrees with uniformly distributed plants gave less ID than we observed. In contrast, simulated pedigrees conditioned on the actual, heterogeneous density of plants were consistent with ID measured in the field. This indicates that patchiness combined with leptokurtic dispersal shapes the distribution of heterozygosity. Simulations with heterogeneous density also better capture empirical isolation-by-distance patterns than those with a uniform density (Fig. 4b; SM1.4: Supplementary Fig. 9). However, the effective population size of 15,500 individuals in the heterogeneous-density simulations is an order of magnitude larger than the average number of plants observed in a year (∼2,500). We believe that most plants are sampled each year, so that this discrepancy is more likely to be due to a seed bank, which is expected to substantially increase the effective population size (Heinrich et al. 2018). Nevertheless, despite simplifications such as nonoverlapping generations, no seed bank, and a simple SI system, the heterogeneous-density simulation accurately captures patterns of ID and isolation-by-distance.
Our estimation of ID illustrates a general problem with statistical comparisons in evolutionary biology. There are 3 sources of error in estimating g2: first, sampling a limited number of individuals; second, sampling a limited number of SNPs; and third, random variation between evolutionary realizations or trajectories. In our study, the first source (a limited number of individuals) is shown by the confidence intervals in Fig. 5, obtained by bootstrapping across individuals (Stoffel et al. 2016). The second source of error (a limited number of SNPs) is shown by the substantial variation in g2 among the 10 replicates of each of the 5 pedigrees. Here, variation is generated by random meiosis amongst unlinked markers on a fixed pedigree. This variation could be reduced by increasing the number of SNPs, but the effective number of segregating sites that can be included in the analysis is fundamentally limited by the length of the genetic map. Finally, there is additional variation between pedigrees, due to the random assignment of parents in the simulations, which generates a random pedigree. The wide variation in estimates of g2 due to random meiosis and to the random generation of the pedigree (Fig. 5) is an important reminder that estimates of parameters are typically limited by the randomness of evolution. The stochasticity of evolution can potentially generate error variance far higher than that due to the limited number of individuals or SNPs sampled.
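The first source of error can be illustrated with a generic percentile bootstrap across individuals. This is a minimal sketch with hypothetical function names, not the inbreedR routine; mean heterozygosity is used as a stand-in statistic, and any estimator that takes a list of individuals could be passed instead. Because only individuals are resampled, the resulting interval says nothing about the second and third sources of error, which in the simulations come from redrawing SNPs and regenerating the pedigree.

```python
import random


def bootstrap_ci(individuals, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(individuals)
    estimates = sorted(
        statistic([individuals[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx  # symmetric upper index
    return estimates[lo_idx], estimates[hi_idx]


def mean_heterozygosity(het):
    """Mean over individuals of the fraction of heterozygous loci."""
    return sum(sum(row) / len(row) for row in het) / len(het)


# Made-up 0/1 heterozygosity indicators for 6 individuals at 4 loci.
het = [[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 1, 0],
       [1, 1, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
print(bootstrap_ci(het, mean_heterozygosity))
```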
In addition to analyzing the effect of population structure on the distribution of heterozygosity, our study highlights the potential of utilizing multiple statistics to estimate population structure. We have shown that the variance of heterozygosity due to ID can distinguish alternative dispersal and density distributions, which implies that in combination with pairwise Fij as a function of distance, g2 can help estimate the demography. Genetic data contain far more information than is described by FST and g2; for example, the mean squared disequilibrium can be used to estimate effective population size (Hill 1981; Vitalis and Couvet 2001), and this extends naturally to the covariance of pairwise linkage disequilibrium as a function of distance. We could simply use a set of such statistics to inform demographic inference via ABC (Beaumont 2010). However, our preference would be to first develop a theoretical understanding of how realistic demographies influence statistical measures of spatial covariance in allele frequency, identity disequilibria, and linkage disequilibria.
The distribution of heterozygosity has often been measured to estimate inbreeding depression and examine correlation with fitness. Yet, this type of data has rarely been used to investigate population structure per se and as a complement to the more widely used pairwise identity, FST. By bringing together local inbreeding and isolation-by-distance, our study provides a novel assessment of how dispersal and population density can explain both pairwise identity and the distribution of heterozygosity in spatially continuous populations. However, we have only begun to investigate how the distribution of heterozygosity can be shaped by population structure and demographic parameters. Our future work will focus on understanding how other features such as a seed bank influence genetic diversity, with the ultimate goal of deriving information about demographic history from the distribution of heterozygosity in populations that have fewer measured parameters. New models that include these complexities, as well as ecological, mating system and life history factors are required to extend our understanding of the drivers of population structure in natural populations.
Data availability
All data and code used to generate simulated data and carry out analyses are available at https://doi.org/10.15479/AT:ISTA:11321. Data include processed field data for 11 years of A. majus sample collection, including SNP values, GPS locations, and trait measurement values for each plant. Also included are dispersal data and a linkage map of 91 SNPs.
Supplemental material is available at GENETICS online.
Acknowledgments
We thank the many volunteers and friends who have contributed to data collection in the field site over the years, in particular those who have managed field seasons: Barbora Trubenova, Maria Clara Melo, Tom Ellis, Eva Cereghetti, Lenka Matejovicova, Beatriz Pablo Carmona. Frederic Ferrer and Eva Salmerón Mateu have been immensely helpful with logistics at our informal field station, El Serrat de Planoles. We thank Sean Stankowski for technical help in producing Fig. 1. This research was also supported by the Scientific Service Units (SSU) of IST Austria through resources provided by Scientific Computing (SciComp).
Funding
Part of this work was funded by Marie Curie COFUND Doctoral Fellowship and Austrian Science Fund FWF (grant P32166).
Conflicts of interest
None declared.
Contributor Information
Parvathy Surendranadh, IST Austria, 3400 Klosterneuburg, Austria.
Louise Arathoon, IST Austria, 3400 Klosterneuburg, Austria.
Carina A Baskett, IST Austria, 3400 Klosterneuburg, Austria.
David L Field, School of Science, Edith Cowan University, Joondalup WA 6027, Australia.
Melinda Pickup, IST Austria, 3400 Klosterneuburg, Austria; Greening Australia, Perth, WA 6000, Australia.
Nicholas H Barton, IST Austria, 3400 Klosterneuburg, Austria.
Literature cited
- Andalo C, Cruzan MB, Cazettes C, Pujol B, Burrus M, Thébaud C. Post-pollination barriers do not explain the persistence of two distinct Antirrhinum subspecies with parapatric distribution. Plant Syst Evol. 2010;286(3–4):223–234. doi: 10.1007/s00606-010-0303-4.
- Beaumont MA. Approximate Bayesian computation in evolution and ecology. Annu Rev Ecol Evol Syst. 2010;41(1):379–406. doi: 10.1146/annurev-ecolsys-102209-144621.
- Bradburd GS, Ralph PL. Spatial population genetics: it’s about time. Annu Rev Ecol Evol Syst. 2019;50(1):427–449. doi: 10.1146/annurev-ecolsys-110316-022659.
- Cartwright RA. Antagonism between local dispersal and self-incompatibility systems in a continuous plant population. Mol Ecol. 2009;18(11):2327–2336. doi: 10.1111/J.1365-294X.2009.04180.X.
- Charlesworth D, Charlesworth B. Inbreeding depression and its evolutionary consequences. Annu Rev Ecol Syst. 1987;18:237–268. doi: 10.1146/annurev.es.18.110187.001321.
- David P, Pujol B, Viard F, Castella V, Goudet J. Reliable selfing rate estimates from imperfect population genetic data. Mol Ecol. 2007;16(12):2474–2487. doi: 10.1111/J.1365-294X.2007.03330.X.
- Fujii S, Kubo KI, Takayama S. Non-self- and self-recognition models in plant self-incompatibility. Nat Plants. 2016;2(9):1–9. doi: 10.1038/nplants.2016.130.
- Hardy OJ, Vekemans X. SPAGeDI: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Mol Ecol Notes. 2002;2(4):618–620. doi: 10.1046/J.1471-8286.2002.00305.X.
- Hardy OJ. Population genetics of autopolyploids under a mixed mating model and the estimation of selfing rate. Mol Ecol Resour. 2016;16(1):103–117. doi: 10.1111/1755-0998.12431.
- Heinrich L, Müller J, Tellier A, Živković D. Effects of population- and seed bank size fluctuations on neutral evolution and efficacy of natural selection. Theor Popul Biol. 2018;123:45–69. doi: 10.1016/J.TPB.2018.05.003.
- Hill WG. Estimation of effective population size from data on linkage disequilibrium. Genet Res. 1981;38(3):209–216. doi: 10.1017/S0016672300020553.
- Jakobsson M, Edge MD, Rosenberg NA. The relationship between FST and the frequency of the most frequent allele. Genetics. 2013;193(2):515–528. doi: 10.1534/genetics.112.144758.
- Kimura M, Weiss GH. The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics. 1964;49(4):561–576. doi: 10.1093/genetics/49.4.561.
- Loveless MD, Hamrick JL. Ecological determinants of genetic structure in plant populations. Annu Rev Ecol Syst. 1984;15(1):65–95. doi: 10.1146/annurev.es.15.110184.000433.
- Lynch M, Walsh B. Inbreeding depression. In: Genetics and Analysis of Quantitative Traits. Sunderland (MA): Sinauer Associates, Inc.; 1998. p. 251–291.
- Malécot G. The Mathematics of Heredity (English Translation, 1969). San Francisco: WF Freeman; 1948.
- McCubbin A, Carpenter R, Coen E, Dickinson H. Self-incompatibility in Antirrhinum. In: Ottaviano E, Gorla MS, Mulcahy DL, Mulcahy GB, editors. Angiosperm Pollen and Ovules. New York (NY): Springer; 1992. p. 104–109. doi: 10.1007/978-1-4612-2958-2_16.
- Miller JM, Coltman DW. Assessment of identity disequilibrium and its relation to empirical heterozygosity fitness correlations: a meta-analysis. Mol Ecol. 2014;23(8):1899–1909. doi: 10.1111/MEC.12707.
- Milligan BG, Archer FI, Ferchaud AL, Hand BK, Kierepka EM, Waples RS. Disentangling genetic structure for genetic monitoring of complex populations. Evol Appl. 2018;11(7):1149–1161. doi: 10.1111/eva.12622.
- R Core Team. R: A Language and Environment for Statistical Computing. Vienna (Austria): R Foundation for Statistical Computing; 2014.
- Ringbauer H, Kolesnikov A, Field DL, Barton NH. Estimating barriers to gene flow from distorted isolation-by-distance patterns. Genetics. 2018;208(3):1231–1245. doi: 10.1534/genetics.117.300638.
- Rodríguez-Quilón I, Santos-del-Blanco L, Grivet D, Jaramillo-Correa JP, Majada J, Vendramin GG, Alía R, González-Martínez SC. Local effects drive heterozygosity-fitness correlations in an outcrossing long-lived tree. Proc Biol Sci. 2015;282(1820):20152230. doi: 10.1098/rspb.2015.2230.
- Sin SYW, Hoover BA, Nevitt GA, Edwards SV. Demographic history, not mating system, explains signatures of inbreeding and inbreeding depression in a large outbred population. Am Nat. 2021;197(6):658–676. doi: 10.1086/714079.
- Stoffel MA, Esser M, Kardos M, Humble E, Nichols H, David P, Hoffman JI. inbreedR: an R package for the analysis of inbreeding based on genetic markers. Methods Ecol Evol. 2016;7(11):1331–1339. doi: 10.1111/2041-210X.12588.
- Szulkin M, Bierne N, David P. Heterozygosity-fitness correlations: a time for reappraisal. Evolution. 2010;64(5):1202–1217. doi: 10.1111/j.1558-5646.2010.00966.x.
- Tavares H, Whibley A, Field DL, Bradley D, Couchman M, Copsey L, Elleouet J, Burrus M, Andalo C, Li M, et al. Selection and gene flow shape genomic islands that control floral guides. Proc Natl Acad Sci U S A. 2018;115(43):11006–11011. doi: 10.1073/pnas.1801832115.
- Vekemans X, Hardy OJ. New insights from fine-scale spatial genetic structure analyses in plant populations. Mol Ecol. 2004;13(4):921–935. doi: 10.1046/J.1365-294X.2004.02076.X.
- Vitalis R, Couvet D. Estimation of effective population size and migration rate from one- and two-locus identity measures. Genetics. 2001;157(2):911–925. doi: 10.1093/genetics/157.2.911.
- Whibley AC, Langlade NB, Andalo C, Hanna AI, Bangham A, Thébaud C, Coen E. Evolutionary paths underlying flower color variation in Antirrhinum. Science. 2006;313(5789):963–966. doi: 10.1126/science.1129161.
- Winn AA, Elle E, Kalisz S, Cheptou P-O, Eckert CG, Goodwillie C, Johnston MO, Moeller DA, Ree RH, Sargent RD, et al. Analysis of inbreeding depression in mixed-mating plants provides evidence for selective interference and stable mixed mating. Evolution (NY). 2011;65(12):3339–3359. doi: 10.1111/j.1558-5646.2011.01462.x.
- Wolfram Research, Inc. Mathematica. Champaign (IL): Wolfram Research, Inc; 2019.
- Wright S. Evolution in Mendelian populations. Genetics. 1931;16(2):97–159. doi: 10.1093/genetics/16.2.97.
- Wright S. Isolation by distance under diverse systems of mating. Genetics. 1946;31(1):39–59. doi: 10.1093/genetics/31.1.39.
- Wright S. Isolation by distance. Genetics. 1943;28(2):114–138. doi: 10.1016/B978-0-12-374984-0.00820-2.