Abstract
Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
Author summary
Samples of genome sequences can be informative about the history of the population from which they were drawn, and about mutation and other processes that led to the observed sequences. However, obtaining reliable inferences is challenging, because of the complexity of the underlying processes and the large amounts of sequence data that are often now available. A common approach to simplifying the data is to use only genome segments that are very similar between two sequences, called identical-by-descent (IBD). The longer the IBD segment the more informative about recent shared ancestry, and current approaches restrict attention to IBD segments above a length threshold. We instead are able to use IBD segments of any length, allowing us to extract much more information from the sequence data. To reduce the computation burden we identify subsets of the available sequence pairs that lead to little information loss. Our approach exploits recent advances in inferring aspects of the ancestral recombination graph (ARG) underlying the sample of sequences. Computational cost still limits the size and complexity of problems our method can handle, but where feasible we obtain dramatic improvements in the power of inferences.
Introduction
A common data-reduction technique when analysing samples of genome sequences is to identify identical-by-descent (IBD) genome segments [1–5]. In practice IBD is often identified by searching for regions with no evidence for recombination along two sequences since their most recent common ancestor (MRCA). Further, only IBD segments (IBDs) above a given length threshold, often 2 to 4 cM, are retained. This practice wastes valuable information, but has been necessary because the inference of short IBDs is too noisy to be useful for downstream analyses.
The ancestral recombination graph (ARG) is widely used to represent the genealogical history of a sample [6–8] and recent developments in inferring aspects of the ARG [9–13] now permit us to rapidly extract IBD directly from inferred shared ancestors, without requiring zero recombination. Further, computationally fast ARG inference and extraction of IBD can be implemented within an approximate Bayesian computation (ABC) algorithm which removes the need for an information-wasteful length threshold. Instead, we reduce computational cost by using an efficient subset of IBDs that scales linearly with sample size with little information loss relative to using all IBDs.
Our approach relies on an efficient data structure encoding features of an ARG underlying a sample of genome sequences, called the succinct tree sequence (TS) [14]. The TS minimises redundant storage of subsequences that are similar due to shared ancestry. It has led to spectacular improvements in storage and simulation of large genome datasets [15], and has recently been applied to IBD-based inferences about demographic history and evolutionary parameters [16].
We first demonstrate powerful inferences of mutation and sequencing error rates, TMRCA (time since the MRCA), and past and present population sizes, given true IBD information in simulation studies. For real datasets, we propose TSABC: ABC with statistics computed from IBDs extracted from an inferred TS. We demonstrate the performance of TSABC with inferences of the mutation rate and population size in simulation studies and real data, and we compare mutation rate estimates with previously-published results and with analyses using a range of IBD length thresholds.
We find that using IBDs extracted from an inferred ARG leads to a surprisingly small loss of precision relative to use of true IBDs. Further, even a low threshold on IBD length reduces the quality of inferences, despite the fact that short IBDs are poorly inferred. TSABC is computationally demanding, which limits the size and complexity of inference problems that can be tackled. However, TSABC can achieve comparable results to previous estimators using much smaller data sets: we show similar precision to a previously-published estimator despite a 4 000-fold reduction in data available for inference.
Methods
Definition and notations
The TS encodes genome sequence data efficiently by storing subsequences that are similar due to shared ancestry as variations of an ancestral sequence. It is defined [17] as {C, P, E, M}, where C = {1,…,m} is the set of leaf (or tip) nodes corresponding to m observed sequences each of length ℓ, and P = {m+1,…,n} is the set of internal (ancestral) nodes of the TS ordered backwards in time from the present. An edge in E = {(ci, pi, li, ri) : i = 1, 2,…, I} represents inheritance of sites in the segment [li, ri], with 1 ≤ li ≤ ri ≤ ℓ, from internal node pi ∈ P to its child ci ∈ {1,…,pi−1}, while M = {(cj, sj) : j = 1, 2,…} stores the set of sites sj at which there is a sequence difference between cj and its parent, due either to a mutation or, if cj is a leaf node, sequencing error. The TS has the “succinct” property that any tree component conserved over a genome segment is stored only once, which greatly reduces data storage requirements compared with retaining all distinct marginal trees.
Identity by descent and efficient subsets
We denote the ith IBD segment in the TS by IBDi = (ci1, ci2, li, ri, pi,Mi), i = 1,…, I, ordered such that ci1 is non-decreasing in i. Here ci1 and ci2 are the leaf nodes of the two sequences, [li, ri] is the IBD genome segment, pi is the MRCA node of ci1 and ci2 for this segment, and Mi denotes the set of sites in [li, ri] at which ci1 and ci2 differ. As there is no length threshold, the IBDs of any sequence pair partition the genome: every sequence site is included in exactly one of the IBD segments.
Each IBDi has the same MRCA at each site in [li, ri], and a different MRCA at adjacent sites. Imposing a no-recombination requirement as part of the definition of IBD would be more restrictive, since the absence of recombination implies a common MRCA but the reverse does not hold (see Figure 1, left, for examples of recombinations that do not change the MRCA).
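The MRCA-based definition can be illustrated with a minimal sketch in plain Python. It assumes the local trees are already available as parent maps; the three trees in `example_trees` are hypothetical (they are not the topology of Figure 1), and the function names are ours. Adjacent intervals are merged whenever the MRCA of the pair is unchanged, so a recombination that alters the tree but not the MRCA does not end the segment.

```python
from typing import Dict, List, Tuple

# A local tree is (left, right, parent): it covers sites left..right inclusive,
# and parent maps every non-root node to its parent node.
LocalTree = Tuple[int, int, Dict[int, int]]


def ancestors(node: int, parent: Dict[int, int]) -> List[int]:
    """Path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def mrca(a: int, b: int, parent: Dict[int, int]) -> int:
    """First node on the ancestral path of b that is also an ancestor of a."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes share no ancestor in this tree")


def ibd_segments(a: int, b: int, trees: List[LocalTree]) -> List[Tuple[int, int, int]]:
    """Return (left, right, mrca) segments for the pair (a, b).

    Consecutive trees are merged when the MRCA of the pair is the same,
    so the segments partition the genome with no length threshold.
    """
    segments: List[Tuple[int, int, int]] = []
    for left, right, parent in trees:
        node = mrca(a, b, parent)
        if segments and segments[-1][2] == node and segments[-1][1] + 1 == left:
            segments[-1] = (segments[-1][0], right, node)  # extend previous segment
        else:
            segments.append((left, right, node))
    return segments


# Hypothetical local trees on leaves 1..4 and ancestors 5..8, sequence length 100.
example_trees: List[LocalTree] = [
    (1, 42, {1: 6, 2: 6, 3: 5, 4: 5, 5: 7, 6: 7}),
    (43, 60, {3: 5, 4: 5, 2: 7, 5: 7, 1: 8, 7: 8}),
    (61, 100, {3: 5, 4: 5, 1: 7, 5: 7, 2: 8, 7: 8}),  # tree changes, MRCA(1, 2) does not
]

pair_12 = ibd_segments(1, 2, example_trees)
pair_34 = ibd_segments(3, 4, example_trees)
```

In this toy input the pair (1, 2) yields two segments, while the pair (3, 4) yields a single segment spanning all 100 sites even though the trees change twice.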
Fig 1.
An ancestral recombination graph (ARG) spanning a genome sequence of length ℓ = 100 (left), the corresponding sequence of local trees (middle) and efficient IBD subset (right). The ARG has leaf nodes {1, 2, 3, 4} = C, named ancestral nodes {5, 6, 7, 8} = P, and a recombination at site 42 of an internal ancestral node (red dot). The two dashed lines in the ARG represent inheritance paths due to two ineffective recombination events, which are not represented in the TS. The efficient IBD subset includes two IBD segments for the node pair (1, 2), corresponding to intervals [1, 42] and [43, 100] which have MRCA 6 and 8, respectively, and one IBD segment spanning the whole sequence for pairs (3, 4) and (4, 1).
To reduce computational effort, we use for inference only an “efficient” subset of IBDs. After fixing an arbitrary order for the sequences, we include in the subset only the IBDs of the sequence pairs (1,m) and (c, c+1) for c = 1,…,m−1 (see Figure 1, right, and Appendix S1). An efficient subset has the property that each edge of the TS is included in a descent path from the MRCA for at least one IBD segment in the subset, which ensures that information is retained in the subset about every mutation.
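As a small illustration of the pair selection just described, the sketch below lists the sequence pairs of an efficient subset for a given sample size and compares the count with the number of all pairs. The function names are ours, and the special case m = 2 (where the pair (1, m) coincides with (1, 2)) is handled as an assumption, since the text does not discuss it.

```python
from typing import List, Tuple


def efficient_pairs(m: int) -> List[Tuple[int, int]]:
    """Sequence pairs (c, c+1) for c = 1..m-1, plus the closing pair (1, m)."""
    if m < 2:
        raise ValueError("need at least two sequences")
    pairs = [(c, c + 1) for c in range(1, m)]
    if m > 2:  # for m = 2 the pair (1, m) duplicates (1, 2)
        pairs.append((1, m))
    return pairs


def all_pairs_count(m: int) -> int:
    """Number of unordered sequence pairs, m(m-1)/2."""
    return m * (m - 1) // 2


pairs_m4 = efficient_pairs(4)        # [(1, 2), (2, 3), (3, 4), (1, 4)]
n_efficient_m80 = len(efficient_pairs(80))
n_all_m80 = all_pairs_count(80)
```

For m = 80, the sample size used in several of the simulations, this gives 80 pairs instead of 3 160.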
Imposing a length threshold on IBDs is also a form of data reduction but we show below that it can lead to high information loss, because mutations are ignored if they occur at sites not contained in a sufficiently long IBD segment.
Estimation
Let μ and ϵ be the per-site per-generation mutation rate and the per-site sequencing error rate, both assumed constant over sites. For i = 1,…, I, let gi denote the age of pi in generations, and let N(g), g = 0, 1, 2,…, be the population size g generations in the past. In Appendix S2 we derive method-of-moments estimators for μ and ϵ, and non-parametric estimators of gi, i = 1,…, I, and N(g), g = 0, 1, 2,…, based on statistics computed from IBD lengths. We investigate the performance of these estimators when true IBD information is available in simulation studies. The recombination rate r is assumed constant over sites and known for all inferences; the extension to a known recombination map is straightforward (a recombination at site s means between sites s and s + 1).
For observed sequence data, true IBD information is not available and we extract IBDs from an inferred TS. TSABC uses summary statistics derived from these IBDs and related to the method-of-moments estimators. For inference of μ and ϵ, we use two statistics, one of which is C1 (Appendix S2.1), that are linear transformations of the method-of-moments estimators of μ and ϵ. Nonparametric estimation of N(g) is not feasible, but we can estimate the parameters of a demographic model, which allows powerful inference provided that the model is adequate. We use as statistics the mean and standard deviation (SD) of IBD lengths ri − li, i = 1,…, I.
Simulation study design: true IBD available
We jointly estimated μ, ϵ, gi, i = 1,…,I, and N(g), g ≥ 0, using our novel estimators. We used msprime [19] to generate TS under the coalescent with recombination model [20, 21], assuming demographic models C, Ga and S (Table 1). From each TS we extracted an efficient subset of IBDs (Algorithm 1). Sequencing error was simulated by adding elements to M at leaf nodes of the generated TS. At the largest error rate (ϵ = 10⁻³), any singleton variant is a few times more likely to arise from sequencing error than from a mutation.
Table 1.
Parameter values, sample properties and demographic models for the simulation study. Unless otherwise stated, 25 simulation replicates were generated in each scenario. Model Ga is used for inferences given true IBD and Model Gb is used for inferences from inferred IBD. The value of r is assumed known for all inferences, whereas μ, ϵ and N(g), g ≥ 0, are targets of inference.
| Evolutionary parameters and sample properties | | |
|---|---|---|
| Symbol | Definition | Value(s) in simulations |
| ℓ | sequence length | 10⁶, 10⁷ or 10⁸ sites |
| Demographic models (N(g) = population size g generations ago) | | |
| Model C | N(g) constant | N(g) = 2 × 10⁴ |
| Model G | N(g) = N(0) × exp(−τg) | N(0) = 10⁶, τ = 10⁻⁴ (Model Ga); N(0) = 2 × 10⁵, τ = 10⁻³ (Model Gb) |
| Model S | N(g) = N(0) for 0 < g < G; N(g) = N(G) for g ≥ G | N(0) = G = 4 × 10⁴; N(G) = 10⁴ |
| Model EA | European-American demographic history [18] | See Figure S2 for N(g) values; gene conversion: tract length 100 bp, rate 10⁻⁸/site/generation. |
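The three parametric models in Table 1 can be written as short functions. This is a minimal sketch with names of our choosing; the default values are those listed in the table. Two details are assumptions: Model S is taken to return N(0) at g = 0 (the table states the range 0 < g < G), and the optional `cap_factor` argument reproduces the limit N(g) ≤ 2 × N(0) that the TSABC analyses impose when τ < 0.

```python
import math
from typing import Optional


def model_c(g: float, n: float = 2e4) -> float:
    """Model C: constant population size."""
    return n


def model_g(g: float, n0: float, tau: float, cap_factor: Optional[float] = None) -> float:
    """Model G: N(g) = N(0) * exp(-tau * g), g generations in the past.

    If cap_factor is given, N(g) is limited to cap_factor * N(0); this matters
    only when tau < 0, where the size would otherwise grow backwards in time.
    """
    size = n0 * math.exp(-tau * g)
    if cap_factor is not None:
        size = min(size, cap_factor * n0)
    return size


def model_s(g: float, n0: float = 4e4, g_change: float = 4e4, n_old: float = 1e4) -> float:
    """Model S: step change from N(0) to N(G) at g = G generations ago."""
    return n0 if g < g_change else n_old


size_ga_now = model_g(0, n0=1e6, tau=1e-4)          # Model Ga at the present
size_gb_1000 = model_g(1000, n0=2e5, tau=1e-3)      # Model Gb, 1 000 generations ago
size_capped = model_g(1e6, n0=2e4, tau=-2e-5, cap_factor=2.0)
```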
Simulation study design: inferred IBD
We used msprime to generate simulated sequences, recoded them as binary strings using 0 and 1 for the ancestral and derived alleles, and added sequencing errors by assigning 1 to randomly selected sites at rate ϵ (see [22] for alternative models of sequencing error). We chose tsinfer [10] to infer the TS from the resulting sequence data; speed is critical for an ABC algorithm, and tsinfer is the fastest of the current methods, while retaining high accuracy [13,23]. Unless otherwise stated, in each scenario we used η = 2500 simulations with ABC acceptance rate 0.05 (125 acceptances).
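The rejection step with η = 2500 simulations and acceptance rate 0.05 can be sketched as follows. This is a generic ABC rejection sampler under stated assumptions, not the TSABC implementation: `toy_simulate` is a hypothetical stand-in for the real pipeline (simulate sequences, infer a TS, extract IBDs, compute statistics), it returns a single noisy count whose scaling constant is arbitrary, and the distance is the absolute difference of one scalar statistic.

```python
import math
import random
from typing import Callable, List


def abc_rejection(
    observed: float,
    simulate: Callable[[float, random.Random], float],
    sample_prior: Callable[[random.Random], float],
    rng: random.Random,
    n_sims: int = 2500,
    accept_rate: float = 0.05,
) -> List[float]:
    """Keep the parameter draws whose simulated statistic is closest to observed."""
    scored = []
    for _ in range(n_sims):
        theta = sample_prior(rng)
        scored.append((abs(simulate(theta, rng) - observed), theta))
    scored.sort(key=lambda pair: pair[0])
    n_accept = round(n_sims * accept_rate)  # 2500 * 0.05 = 125
    return [theta for _, theta in scored[:n_accept]]


def toy_simulate(mu: float, rng: random.Random) -> float:
    """Hypothetical simulator: a mutation count with mean proportional to mu."""
    expected = mu * 1e11  # arbitrary scaling chosen for illustration only
    return rng.gauss(expected, math.sqrt(expected))


def uniform_prior(rng: random.Random) -> float:
    """Uniform(1e-8, 2e-8) prior for the mutation rate."""
    return rng.uniform(1e-8, 2e-8)


rng = random.Random(1)
observed_stat = toy_simulate(1.3e-8, rng)  # pseudo-observed data with mu = 1.3e-8
accepted = abc_rejection(observed_stat, toy_simulate, uniform_prior, rng)
posterior_mean = sum(accepted) / len(accepted)
```

The accepted values approximate a posterior sample, and their average plays the role of the posterior mean estimate reported in the results.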
We first use simulations to confirm previous reports [24] that the quality of IBD inference is often poor, particularly for short IBDs. We compared the number of true and inferred IBDs for datasets simulated under Model C with μ ranging from 1 to 20 units of 10⁻⁸ per site per generation and m = 10, 20 and 160. We also compared the length distribution of true and inferred IBDs for m = 160 and μ = 1.3 × 10⁻⁸.
To investigate the effect of including short IBDs, both true and inferred, we also modified TSABC to include only IBDs with length greater than a threshold of 1, 2 or 4 units of 10⁴ bp. When a threshold was applied, we included all IBDs satisfying the threshold, rather than using only the efficient subset of IBDs.
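The length statistics and the optional threshold can be sketched in a few lines. The function name and the toy segments are ours, lengths are taken as ri − li as in the Estimation section, and the use of the sample (n − 1) standard deviation is an assumption because the text does not say which form is used.

```python
from statistics import mean, stdev
from typing import List, Tuple


def ibd_length_stats(
    segments: List[Tuple[int, int]], threshold: int = 0
) -> Tuple[int, float, float]:
    """Count, mean and SD of IBD lengths r - l, keeping lengths above threshold.

    threshold = 0 corresponds to standard TSABC (no length filtering).
    """
    lengths = [right - left for left, right in segments if right - left > threshold]
    if len(lengths) < 2:
        raise ValueError("need at least two segments to compute an SD")
    return len(lengths), mean(lengths), stdev(lengths)


# Hypothetical (left, right) intervals, for illustration only.
toy_segments = [(1, 42), (43, 100), (1, 100)]
stats_all = ibd_length_stats(toy_segments)                  # lengths 41, 57, 99
stats_long = ibd_length_stats(toy_segments, threshold=50)   # lengths 57, 99
```

Raising the threshold discards the shortest segment here, which changes both statistics; in the analyses above the corresponding effect is a loss of the mutations that fall in short IBDs.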
We next investigated TSABC estimation of μ under Model C and Model Gb with ℓ = 10⁷. The N(g) values and ϵ = 0 were assumed known for the inference and we adopted a Uniform(10⁻⁸, 2 × 10⁻⁸) prior distribution for μ. For the Model C simulations with m = 10, we also applied TSABC after thresholding on IBD length and repeated using true IBD extracted from the msprime simulations.
To study TSABC estimation of the population size N(g), we used m = 200 and ℓ = 10⁶ under each of Model C and Model Gb. For both data simulation models, the TSABC inference used Model G but with different prior distributions. When the simulation model was Model C, we fitted Model G with independent prior distributions Uniform(10⁴, 3 × 10⁴) for N(0) and Uniform(−2 × 10⁻⁵, 2 × 10⁻⁵) for τ. Whenever τ < 0, we impose a population size limit N(g) ≤ 2 × N(0). With simulation model Model Gb, the independent prior distributions were Uniform(10⁵, 3 × 10⁵) for N(0), and Uniform(0, 0.002) for τ. All parameters were treated as known except the targets of inference N(0) and τ.
We performed additional simulations to allow comparison with the inferences of μ reported by [18]. Data were simulated under Model EA, which aims to capture key features of the demographic history of European-Americans (Table 1), and Model C modified to include sequencing errors. The Model C simulations of [18] used ϵ = 10⁻⁴ but no gene conversion, while for Model EA they set ϵ = 0 and included gene conversion. We include both sequencing error and gene conversion in both Model C and Model EA simulations. We used a 400-fold smaller sample size than [18] (m = 10 versus m = 4 × 10³) and 10-fold smaller genome length (ℓ = 10⁷ per chromosome, versus ℓ = 10⁸).
We did not include gene conversion in the TSABC inference, thus challenging it with model misspecification. As a further challenge, we treated N(g) as unknown when inferring μ, and misspecified the model for N(g) in the TSABC simulations.
When the data were simulated under Model C, TSABC used independent prior distributions Uniform(10⁻⁸, 2 × 10⁻⁸) for μ and Uniform(0.6 × 10⁻⁴, 1.6 × 10⁻⁴) for ϵ. For N(g), we adopted Model G with independent priors N(0) ~ Uniform(14 000, 30 000) and τ ~ Uniform(−2 × 10⁻⁵, 10⁻⁵).
When the data were simulated under Model EA, TSABC used a Uniform(10⁻⁸, 2 × 10⁻⁸) prior distribution for μ. For inference of N(g), we adopted Model S with independent prior distributions N(0) ~ Uniform(11 000, 15 000), G ~ Uniform(4500, 6500) and N(G) ~ Uniform(45 000, 49 000).
Mutation and growth rates in the 1000 Genomes Project
We analyse chromosomes 20 and 21 from 8 of the 26 human populations of the 1000 Genomes Project (1KGP) [25] making use of the demographic model of [26] which we refer to as the 1KGP model. See Figure S2 for plots of the 1KGP model and Appendix S3 for details of the data analysis. Separately for each chromosome, we use TSABC to infer μ assuming the prior Uniform(10⁻⁸, 2 × 10⁻⁸) and the 1KGP model. The 16 sets of 125 accepted values were analysed in a two-way ANOVA to assess differences in μ across chromosomes and over populations.
Next, we use chromosome 20 and 21 data to estimate population size N(g) assuming the 1KGP model for g ≥ 1000 and fitting demographic Model G for 0 ≤ g ≤ 1 000, constrained such that N(1000) in Model G matches the 1KGP model value. The constrained Model G has one free parameter N(0), for which we adopt a Uniform(10000, 240000) prior distribution. To reduce computational effort with little loss of information, in both the observed dataset and TSABC simulations we removed SNPs with a minor allele count > 40, which typically arose at g ≫ 1 000. We estimate N(g) from each chromosome separately and average the results.
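The constraint leaves one free parameter because, under Model G, fixing both N(0) and N(1000) determines the growth rate: N(1000) = N(0) × exp(−1000 τ) gives τ = ln(N(0) / N(1000)) / 1000. The sketch below encodes this relation; the function names and the example value N(1000) = 10 000 are hypothetical and are not taken from the 1KGP model.

```python
import math


def constrained_tau(n0: float, n_join: float, g_join: float = 1000.0) -> float:
    """Growth rate tau for which Model G passes through N(g_join) = n_join."""
    return math.log(n0 / n_join) / g_join


def recent_size(g: float, n0: float, n_join: float, g_join: float = 1000.0) -> float:
    """Population size for 0 <= g <= g_join under the constrained Model G."""
    if not 0 <= g <= g_join:
        raise ValueError("the constrained model covers only 0 <= g <= g_join")
    return n0 * math.exp(-constrained_tau(n0, n_join, g_join) * g)


# Hypothetical values: N(0) = 100 000 drawn from the prior, N(1000) = 10 000.
tau_example = constrained_tau(1e5, 1e4)
size_at_join = recent_size(1000, 1e5, 1e4)
size_now = recent_size(0, 1e5, 1e4)
```

With this parameterisation each prior draw of N(0) maps to a single curve that joins the older part of the demographic model continuously at g = 1 000.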
Results
Simulation study results: true IBD available
While use of the efficient subset of IBDs reduces computational cost in proportion to the reduction in sequence pairs from m(m−1)/2 to m, the average estimated SD of the estimator of μ in our simulation study increased only slightly, from 0.017 to 0.019 units of 10⁻⁸ (see also Figure S3, left panel). This gain in computation time is typically worth the small loss of statistical efficiency.
Both μ and ϵ are well estimated in all demographic models, with no indication of bias (Figure 2). Increasing m has only a modest effect on the SD of the estimators, whereas ℓ has a larger effect (Figures 2, S3 (right) and S4 show how the SD scales with ℓ). Sequencing errors only inflate the number of singleton variants, so the estimator of μ is little affected by increasing ϵ (Figure 2).
Fig 2.
Inference of mutation rate μ and sequencing error rate ϵ with two sequence lengths (columns), when true IBD was available for inference. Line segments show indicative 95% CIs computed from the average estimate (indicated by a symbol, see legend box) and the empirical SD of the estimates from 25 simulated datasets in each scenario. Bottom left panel shows the impact of ϵ on the estimate of μ when m = 10; in the other three panels ϵ = 10⁻⁴.
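One way to form such an indicative interval from replicate estimates is sketched below. The formula mean ± 1.96 × SD is our assumption about what "indicative 95% CI" means here, and the three input values are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def indicative_ci(estimates: List[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (lower, centre, upper) using the empirical SD of the estimates."""
    centre = mean(estimates)
    spread = stdev(estimates)  # sample SD over simulation replicates
    return centre - z * spread, centre, centre + z * spread


# Made-up replicate estimates of mu, in units of 1e-8.
lower, centre, upper = indicative_ci([1.28, 1.30, 1.32])
```

Note that this interval describes the spread of single-dataset estimates, not the uncertainty of their average over replicates.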
Although individual estimates of gi are not precise, the empirical and theoretical densities obtained from all estimates of gi, i = 1,…, I, are close (Figure 3) despite the TS used for input only including information about the order of the coalescent events, and not their times. The population size estimator is accurate under all models, at least for g ≤ 5 × 10⁵ (Figure 4). Figures S3 (right) and S4 show that estimates, including those of N(g), are more precise with a longer sequence length.
Fig 3.
Histogram of the estimates of gi, obtained from one sample simulated under each of Model C (left) and Model Ga (right), with sample size m = 80, sequence length ℓ = 10⁸ and sequencing error rate ϵ = 10⁻³. Also shown is a probability density obtained by kernel smoothing of these estimates, together with the true density. True IBD was available for inference but no time information.
Fig 4.
Estimates of the population size N(g), g ≥ 0, from each of 25 simulation replicates under Model C, Model Ga and Model S, when true IBD was available for inference. Sequence length is ℓ = 10⁸, sequencing error rate is ϵ = 10⁻³ and sample size is m = 80.
Simulation study results: inferred IBD
The number of inferred IBDs tends to increase with both μ and m, but except for very high μ (over 10 times the average human value when m = 160) it remains well below the true number of IBDs (Figure 5, left). Correspondingly, the length distribution of inferred IBDs is highly skewed towards larger values relative to the true distribution (Figure 5, right), as previously reported [27,28]. Despite this poor detection of small IBDs, and consequent tendency for inferred IBDs to be longer than the true IBDs, Table 2 shows that each increase in the length threshold reduced the precision of inference, both for inferred and true IBD, so that even poorly-inferred short IBDs do contribute useful information for inference. We also see in Table 2 (final column) further evidence that use of the efficient subset of IBDs leads to only a small loss of statistical efficiency. As expected, the use of true IBD improves TSABC compared with using inferred IBD, but the magnitude of the improvement is modest in the case of standard TSABC (threshold = 0). For higher thresholds, bias can be high due to low precision of inference and the prior boundary at 10⁻⁸.
Fig 5.
Comparison of true and inferred IBDs. Left: each symbol and vertical line segment shows the mean and 95% CI of the mean ratio of IBD counts over 25 Model C simulations with sample sizes m = 10, 20 and 160. Right: histograms of true and inferred IBD length distributions for a Model C simulated dataset with m = 160 and sequence length ℓ = 10⁶.
Table 2.
Comparison of TSABC inference for μ using different IBD length thresholds. Each result is an average over 25 Model C simulation replicates with m = 10 and ϵ = 0. In the last column, values based only on IBDs in the efficient subset are given in ().
| Threshold (10⁴ bp) | 4 | 2 | 1 | 0 |
|---|---|---|---|---|
| Inferred IBD | | | | |
| # IBD | 1 001 | 7 033 | 24 683 | (10 667) |
| Estimate of μ (10⁻⁸) | 1.51 | 1.44 | 1.32 | (1.31) |
| SD (10⁻⁸) | 0.219 | 0.167 | 0.067 | (0.043) |
| True IBD | | | | |
| # IBD | 478 | 1 803 | 6 940 | (26 394) |
| Estimate of μ (10⁻⁸) | 1.33 | 1.30 | 1.31 | (1.29) |
| SD (10⁻⁸) | 0.141 | 0.089 | 0.064 | (0.034) |
Although TSABC can provide approximations to the full posterior distribution, we report here only posterior mean estimates of unknown parameters. For inference of μ, it appears that any bias of TSABC is small for both models (Figure 6). Some under-estimation is expected because the binarisation of the sequence data obscures instances of multiple mutations at the same site, but this effect is negligible.
Fig 6.
TSABC estimation of mutation rate μ. Symbols and line segments show mean and 95% CI over 25 simulations with no sequencing error (ϵ = 0) and sequence length ℓ = 10⁷.
Parametric estimation of N(g) also performs well (Figure 7). When the data simulation model was Model C, the average estimate of N(0) (true value 20 000) over the 25 replicates is 20 931 with standard error (SE) , while for the growth rate τ (true value 0) the average estimate ± SE is (in units of 10⁻⁶). When the data simulation model was Model Gb, for N(0) (true value 200 000) we obtained 202 534 ± 2 173 while for τ (true value 1) we obtained 1.08 ± 0.07 (in units of 10⁻³).
Fig 7.
Fitted exponential curves for the population size N(g) obtained using TSABC. Each of the 25 curves corresponds to a dataset simulated under Model C (left) and Model Gb (right) with no sequencing error (ϵ = 0), sample size m = 200 and sequence length ℓ = 10⁶.
Table 3 shows that TSABC performs similarly to the results reported by [18] despite a 4 000-fold reduction in data used for inference, and despite the challenges we imposed on TSABC: gene conversion was incorporated in data simulation models but not the ABC inference simulations, and the latter also used a misspecified demographic model.
Table 3.
Comparison of TSABC inference of μ (in units of 10⁻⁸) with results reported in [18]. TSABC results are obtained from 25 simulated datasets under each model.
| Method | Model C: estimate of μ | Model C: SD | Model EA: estimate of μ | Model EA: SD | m | ℓ |
|---|---|---|---|---|---|---|
| [18] | 1.30 | 0.020 | 1.34 | 0.007 | 4 000 | 10⁸ |
| TSABC | 1.30 | 0.017 | 1.28 | 0.007 | 10 | 10⁷ |
1000 Genomes data analysis
The global mean over the two chromosomes and eight populations is 1.27 × 10−8 (Table 4), similar to previous estimates assuming μ to be constant over populations [26,29,30], and also those finding small between-family differences in μ [31,32]. A two-way ANOVA revealed no significant difference between the two chromosomes, but highly significant differences across populations, which may be due to differences in heritable factors or environmental exposures.
Table 4.
Estimates of the posterior mean and SE of the mutation rate per site per generation (in units of 10⁻⁸) on human chromosomes 20 and 21 for populations MSL (Mende in Sierra Leone), LWK (Luhya in Webuye, Kenya), BEB (Bengali from Bangladesh), ITU (Indian Telugu from the UK), FIN (Finnish in Finland), GBR (British in England and Scotland), JPT (Japanese in Tokyo, Japan), and CHB (Han Chinese in Beijing, China). The TSABC analysis assumes the 1KGP demographic model in each population.
| | MSL | LWK | BEB | ITU | FIN | GBR | JPT | CHB |
|---|---|---|---|---|---|---|---|---|
| Sample size | 170 | 198 | 172 | 204 | 198 | 182 | 208 | 206 |
| Chr 20 | 1.27 | 1.24 | 1.23 | 1.22 | 1.32 | 1.36 | 1.32 | 1.20 |
| SE | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.011 | 0.004 | 0.004 |
| Chr 21 | 1.25 | 1.26 | 1.21 | 1.29 | 1.29 | 1.35 | 1.33 | 1.24 |
| SE | 0.004 | 0.005 | 0.006 | 0.005 | 0.006 | 0.006 | 0.006 | 0.006 |
| Combined | 1.26 | 1.25 | 1.22 | 1.25 | 1.31 | 1.36 | 1.33 | 1.22 |
| SE | 0.003 | 0.003 | 0.003 | 0.004 | 0.004 | 0.006 | 0.004 | 0.004 |
Figure 8 shows positive growth in the past 1 000 generations for all eight populations. CHB and BEB (both in Asia) have the highest N(0) while MSL and LWK (both in Africa) have the lowest N(0) despite having the highest values of N(1000). These findings are consistent with the results of [26] for 400 ≤ g ≤ 1 000, and the recent growth estimates obtained using Relate [11, Figure 3].
Fig 8.
Estimates of recent population sizes for eight populations sampled in the 1000 Genomes Project (curves are shown in order of decreasing N(0)). See Table 4 caption for explanation of the population labels.
Discussion
We have shown that ARG-derived IBD combined with ABC can deliver big advantages over previous IBD-based methods for inferring evolutionary and demographic parameters from a sample of genome-wide sequences, including nonparametric estimation of past population sizes. Despite verifying that IBD extracted from an inferred TS is often inaccurate, we have shown that it provides powerful inferences for the mutation rate and historic population sizes. For example, we obtained similar estimation results to a previous study that used 4 000 times more data for inference. These advantages arise because we can define IBD in terms of a common MRCA, avoiding both the problem of detecting recombinations and the need for a minimum IBD length. Further, we require only IBDs from m sequence pairs, rather than all m(m−1)/2 pairs, which reduces computational effort with little loss of statistical efficiency.
We illustrated our TSABC approach in simple scenarios, finding that it suffers only modest loss of efficiency relative to using true IBD. Importantly, removing IBDs with length below even a low threshold reduces the precision of inferences despite the poor quality of ARG-based IBD inferences.
TSABC can be computationally demanding for complex demographic models, and the results presented here are limited to inferring the mutation rate and two parameters of a demographic model. However, we were able to incorporate unknown nuisance parameters such as the sequence error rate and misspecification of the demographic model to challenge TSABC inference without substantial detriment to inference quality.
Our results open the way for more powerful demographic and evolutionary inferences from samples of genome sequences than have previously been available.
Supplementary Material
Acknowledgment
ZH is funded by Australian Research Council grant DP210102168 awarded to YBC, DJB and JK. JK acknowledges support from the Robertson Foundation, US National Institutes of Health (grants HG011395 and HG012473) and UK Engineering and Physical Sciences Research Council (grant EP/X024881/1).
Data availability
Data and code used here are available at: github.com/ZhendongHuang/Estimating_evolutionary_and_demographic_parameters_Huang
References
- 1. Browning SR, Browning BL. Identity by descent between distant relatives: detection and applications. Annual Review of Genetics. 2012;46:617–633.
- 2. Palamara PF, Pe’er I. Inference of historical migration rates via haplotype sharing. Bioinformatics. 2013;29(13):i180–i188.
- 3. Sticca EL, Belbin GM, Gignoux CR. Current developments in detection of identity-by-descent methods and applications. Frontiers in Genetics. 2021; p. 1725.
- 4. Tang K, Naseri A, Wei Y, Zhang S, Zhi D. Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts. GigaScience. 2022;11:giac111.
- 5. Chen H, Naseri A, Zhi D. FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts. PLoS Genetics. 2023;19(12):e1011057.
- 6. Griffiths RC, Marjoram P. An ancestral recombination graph. In: Donnelly P, Tavare S, editors. IMA volume on Mathematical Population Genetics. New York: Springer–Verlag; 1997. p. 257–270.
- 7. Lewanski A, Grundler M, Bradburd G. The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLoS Genetics. 2024;20(1):e1011110.
- 8. Brandt DY, Huber CD, Chiang CW, Ortega-Del Vecchyo D. The promise of inferring the past using the ancestral recombination graph. Genome Biology and Evolution. 2024;16(2):evae005.
- 9. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genetics. 2014;10(5):e1004342.
- 10. Kelleher J, Wong Y, Wohns AW, Fadil C, Albers PK, McVean G. Inferring whole-genome histories in large population datasets. Nature Genetics. 2019;51(9):1330–1338.
- 11. Speidel L, Forest M, Shi S, Myers SR. A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics. 2019;51(9):1321–1329.
- 12. Mahmoudi A, Koskela J, Kelleher J, Chan Y, Balding D. Bayesian inference of ancestral recombination graphs. PLoS Computational Biology. 2022;18(3):e1009960.
- 13. Zhang BC, Biddanda A, Gunnarsson ÁF, Cooper F, Palamara PF. Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nature Genetics. 2023; p. 1–9.
- 14. Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. BioRxiv. 2023; p. 2023–11.
- 15. Kelleher J, Etheridge AM, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Computational Biology. 2016;12(5):e1004842.
- 16. Silcocks M, Farlow A, Hermes A, Tsambos G, Patel H, Huebner S, et al. Indigenous Australian genomes show deep structure and rich novel variation. Nature. 2023;624(7992):593–601.
- 17. Kelleher J, Lohse K. Coalescent simulation with msprime. Statistical Population Genomics. 2020;986:191–230.
- 18. Tian X, Browning BL, Browning SR. Estimating the genome-wide mutation rate with three-way identity by descent. The American Journal of Human Genetics. 2019;105(5):883–893.
- 19. Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, et al. Efficient ancestry and mutation simulation with msprime 1.0. Genetics. 2022;220(3):iyab229.
- 20. Griffiths R. Neutral two-locus multiple allele models with recombination. Theoretical Population Biology. 1981;19(2):169–186.
- 21. Hudson RR. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology. 1983;23(2):183–201.
- 22. Albers PK, McVean G. Dating genomic variants and shared ancestry in population-scale sequencing data. PLoS Biology. 2020;18(1):e3000586.
- 23. Brandt DYC, Wei X, Deng Y, Vaughn AH, Nielsen R. Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics. 2022;221(1):iyac044.
- 24. Chiang CW, Ralph P, Novembre J. Conflation of short identity-by-descent segments bias their inferred length distribution. G3: Genes, Genomes, Genetics. 2016;6(5):1287–1296.
- 25. Fairley S, Lowy-Gallego E, Perry E, Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Research. 2020;48(D1):D941–D947.
- 26. 1000 Genomes Project Consortium and others. A global reference for human genetic variation. Nature. 2015;526(7571):68.
- 27. Deng Y, Song YS, Nielsen R. The distribution of waiting distances in ancestral recombination graphs. Theoretical Population Biology. 2021;141:34–43.
- 28. Ignatieva A, Favero M, Koskela J, Sant J, Myers SR. The distribution of branch duration and detection of inversions in ancestral recombination graphs. BioRxiv. 2023; p. 2023–07.
- 29. Tian X, Cai R, Browning SR. Estimating the genome-wide mutation rate from thousands of unrelated individuals. The American Journal of Human Genetics. 2022;109(12):2178–2184.
- 30. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636–639.
- 31. Conrad DF, Keebler JEM, DePristo MA, Lindsay SJ, Zhang Y, Cassals F, et al. Variation in genome-wide mutation rates within and between human families. Nature Genetics. 2011;43(7):712–714.
- 32. Harris K. Evidence for recent, population-specific evolution of the human mutation rate. Proceedings of the National Academy of Sciences. 2015;112(11):3439–3444.
- 33. Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics. 2002;162(4):2025–2035.
- 34. Savitzky A, Golay MJ. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry. 1964;36(8):1627–1639.
- 35. Wohns AW, Wong Y, Jeffery B, Akbari A, Mallick S, Pinhasi R, et al. A unified genealogy of modern and ancient genomes. Science. 2022;375(6583):eabi8264.