Abstract
Genotype imputation is an indispensable step in human genetic studies. Large reference panels with deeply sequenced genomes now allow interrogating variants with minor allele frequency < 1% without sequencing. Although it is critical to consider limits of this approach, imputation methods for rare variants have only done so empirically; the theoretical basis of their imputation accuracy has not been explored. To provide theoretical consideration of imputation accuracy under the current imputation framework, we develop a coalescent model of imputing rare variants, leveraging the joint genealogy of the sample to be imputed and reference individuals. We show that broadly used imputation algorithms include model misspecifications about this joint genealogy that limit the ability to correctly impute rare variants. We develop closed-form solutions for the probability distribution of this joint genealogy and quantify the inevitable error rate resulting from the model misspecification across a range of allele frequencies and reference sample sizes. We show that the probability of a falsely imputed minor allele decreases with reference sample size, but the proportion of falsely imputed minor alleles mostly depends on the allele count in the reference sample. We summarize the impact of this error on genotype imputation on association tests by calculating the r2 between imputed and true genotype and show that even when modeling other sources of error, the impact of the model misspecification has a significant impact on the r2 of rare variants. To evaluate these predictions in practice, we compare the imputation of the same dataset across imputation panels of different sizes. Although this empirical imputation accuracy is substantially lower than our theoretical prediction, modeling misspecification seems to further decrease imputation accuracy for variants with low allele counts in the reference. These results provide a framework for developing new imputation algorithms and for interpreting rare variant association analyses.
Keywords: coalescent model, genotype imputation, rare variants, HMM, population
Introduction
Genome sequencing studies with deep coverage allow directly assessing rare genetic variants (Lee et al.2014) to elucidate their impact on the etiology of complex diseases (Timpson et al. 2018). However, such studies are still expensive. As an alternative, array-based technologies can be employed at a substantially lower cost. Commercial genotyping arrays cover a pre-selected set of common variants, and in some cases low-frequency variants known to be of interest from previous studies. To recover high-resolution genetic information, genotype imputation compares assayed genotypes to a sequenced reference panel, thus leveraging the shared genealogy between genotyped (target) individuals and reference panel to infer unobserved genotypes (Servin and Stephens 2007).
Modern imputation methods combined with large reference panels have achieved high accuracy among even low-frequency variants. For example, using the TOPMed data as reference, the average imputation quality (r2) of variants with frequency 0.1% is over 0.90 for both African and European ancestry genomes (Taliun et al. 2021). Such high resolution improves the power of genome-wide association studies for low frequency and rare variants, and enables joint analysis across studies with different sets of genotyped variants (Das et al. 2016): Recently, the TOPMed consortium identified a new risk variant for breast cancer by imputing rare variants with minor allele frequency (MAF) < 0.5% into the UK Biobank (Taliun et al. 2021). The advent of affordable whole-genome sequencing generating large collections of reference haplotypes combined with efficient imputation algorithms will power more of such discoveries.
Most of these modern imputation methods are based on the Li and Stephens’ (2003) model. They leverage that haplotypes from unrelated individuals share chromosome segments from a common ancestor. These segments are more similar to each other if their common ancestor is more recent. Among haplotypes in a large reference panel, the haplotype that has the most recent common ancestor (MRCA) with the target haplotype can thus provide information about the unobserved genotypes of the target. Using a Hidden Markov Model (HMM), the Li and Stephens’ algorithm models each target haplotype in the study as an imperfect mosaic of haplotypes from the reference panel. The haplotypes making up this mosaic are inferred to be the most closely related to the target, and thus provide information for unobserved genotype information. The HMM framework is computationally tractable and naturally approximates recombination and mutation in its transition and emission probabilities.
Simulation studies have shown that imputation quality depends on the imputed variants’ allele frequency, genomic context, the size of the reference panel, and population demographics (Huang et al. 2013; Das et al. 2016). Such studies illustrate that, despite significant improvements, imputation of rare variants still has high uncertainty: The average squared correlation (r2) for variants with MAF 0.01% is below 0.5 even with the largest reference from the same continental population (Taliun et al. 2021). But, simulation studies only provide a limited opportunity to understand these limits. Although they have the flexibility to mimic a particular imputation setting, they are often computationally expensive and hard to generalize. Alternatively, probabilistic models capturing the basic properties of the imputation process can make predictions before data collection and are easily generalizable.
The key in such a probabilistic model is the relatedness between reference and target haplotypes, which explains how the reference informs unobserved genotypes in target haplotypes. Kingman’s (1982) coalescent provides a suitable theoretical framework for modeling the shared genealogy of these haplotypes. The coalescent traces the genealogy back in time, modeling a sequence of events where individuals find their common ancestors. Mutations resulting in polymorphisms can be mapped to branches on the coalescence tree at each locus, with all its descendants carrying the derived allele (Figure 1). The coalescent time (the time it takes for two or more haplotypes to find their MRCA and TMRCA) thus gives a measure of expected genetic dissimilarity, since only mutations occurring more recent than the coalescent time can result in different alleles among those haplotypes.
Some aspects of imputation accuracy have already been explored using coalescent theory. Jewett et al. (2012) studied the scenario where the target and reference are sampled from two populations diverged in the past, and derived expected imputation error rate as a function of reference size and divergence time. Huang et al. (2013) further included mutation rate and marker density as factors; and analyzed the potential gain in accuracy by choosing the reference panel from a more closely related population.
Here, we develop a coalescent approach to understand imputation accuracy within population, focused on rare variants. It is useful to recognize that imputation error has two types of sources: (1) failure to identify haplotypes most closely related to the target as the template and (2) true differences between the template and the target haplotype due to recent mutation events. Although improving imputation algorithms may reduce error of the first type, error of the second type is a result of the underlying genealogical relation between the target and the reference sample and how this relationship is modeled. Here, we focus on modeling the second type of error, where we determine the error that is immanent to the Li and Stephens’ model.
For this purpose, we consider one target haplotype with missing genotype and n fully sequenced haplotypes as references. These n + 1 observed lineages form the leaves of a binary tree with their (unobserved) ancestors as internal nodes. Intuitively, if the target first finds its MRCA with a set of reference haplotypes, their genotypes are the most similar to the target thus the most informative for imputation. The Li and Stephens’ algorithm assumes that there is exactly one such most closely related reference haplotype, but we show that this assumption is wrong with probability 1/3.
We compute how often this misspecification leads to an ambiguous or wrongly imputed genotype, assuming the imputation algorithm correctly identifies exactly those haplotypes that are most closely related to the target sample in the reference. We then provide the probability of generating a particular imputed dosage conditional on allele frequency of the variant and the size of reference panel. We also quantify the imputation accuracy in terms of the r2 between the imputed dosage and the true genotype, and show that, as a result of this misspecification, the r2 largely depends on the allele count in the reference panel, improving only marginally with increased reference size. We assess the impact of population history on these results and use coalescent simulation to confirm our analytic results. Taken together, our approach provides the minimum size of the reference panel necessary to achieve the desired imputation accuracy for a given allele frequency.
We evaluate the upper bounds predicted with this model where the only source of error is model misspecification by imputing a sample of 56,984 individuals using reference panels of different sizes and evaluating the empirical imputation accuracy > 60,000 variants. We observe substantially higher imputation error in the empirical results than the theoretical bound in our model. Moreover, when conditioning on minor allele count (MAC), imputation error is larger in larger reference panels. We also observe that for MAC 2–10, imputation error increases much faster with decreasing MAC, consistent with our result that especially for these MACs, model misspecification reduces imputation accuracy.
The model we develop here can also be leveraged to improve current imputation algorithms. For example, most imputation algorithms assume that the distance between switching template haplotypes reflects recombination events and is exponentially distributed (Paul et al. 2011). We derive the length distribution of contiguous segments without observable recombination breakpoints, and show that it differs substantially from the exponential assumption by having thicker tails for both very short and long segments.
Methods
We consider imputing a single-target variant for one target haplotype with a reference panel consisting of n reference haplotypes. For simplicity, we only discuss biallelic sites denoting the ancestral allele as 0 and derived allele as 1. At each SNP site, all reference haplotypes and the target form a coalescent tree with n + 1 leaves. This tree is unknown but represents the complete information that imputation can possibly use (Figure 1). If the target’s MRCA with the reference sample occurs on a branch that is ancestral to reference haplotypes, all u present day descendants of that branch are the most closely related reference haplotypes. We assume that the imputation algorithm identifies these most closely related reference haplotypes, then assigns the mean genotype of them to the target haplotype as the imputation dosage.
Under this assumption, we consider three scenarios: (1) The mutation generating the target variant is ancestral to the TMRCA of the target haplotype and all its most closely related reference haplotypes. In this case, it will be imputed correctly with dosage 1 (Figure 2a). (2) The mutation occurred more recently than the TMRCA and the mutation occurred on the branch to the target haplotype. Then the target variant is not polymorphic in the reference sample and will always be falsely imputed to be the ancestral allele (Figure 2b). (3) The mutation occurred more recently than the TMRCA and the mutation occurred on the branches to the reference haplotype. Then some or all of the most closely related reference haplotypes carry the derived allele while the template haplotype carries the ancestral allele and the reference sample will be imputed to carry the derived allele with some dosage > 0 (Figure 2, c and d).
We focus on the third scenario where j of the u most closely related templates carry the derived allele while the target does not. The Li and Stephens model assumes a single closest haplotype, and in the HMM implementation the imputed fractional genotype (dosage) is a weighted average of multiple reference haplotypes based on their posterior probabilities of being that right template. When the u is equivalently close to the target, they are expected to have equal posterior probabilities. Thus the imputed dosage is while the true genotype g is g = 0 (“false positive”), resulting in loss of information in downstream analysis.
In the following sections, we derive the probability for all possible (j, u) configuration, conditional on observing the derived allele count (DAC)j in the whole reference. For simplicity, we distinguish the one target haplotype from the rest n reference ones, making the whole tree size n + 1, although they are exchangeable under the assumption of homogeneity and random mating. We will always consider time backward with t = 0 being the current generation. Notations are listed in Table 1.
Table 1.
Symbol | Meaning |
---|---|
g | True genotype at the imputed locus |
Estimated genotype at the imputed locus | |
n | Number of haplotypes in the reference panel. |
u | Number of most closely related haplotypes (MCRH) in the reference panel |
j | Number of the rare allele in the reference panel |
k | Number of lines remaining on the tree just after all MCRH have reached their MRCA |
d | Number of lines remaining on the tree when the target haplotype coalesces with one ancestor of the reference haplotypes |
l | Length of an internal branch |
m | Number of mutation events on a particular branch |
Number of most closely related templates
We first give the probability of having u equally good templates at any random position for a target haplotype: . This probability depends only the topology of the genealogy, independent from mutation events. We leverage that the probability that a set of k lines coalesce before they coalesce with any line among the rest n—k (Wiuf and Donnelly 1999) is:
(1) |
The generative process can then be imagined as three steps: u templates coalesce first before their MRCA meets the target; then the resulting branch of size u + 1 meets the rest of the tree. Finally, we sum over all possible sets of u templates:
(2) |
The expected number of best templates is close to 2 when the reference is large:
(3) |
High certainty error
We now derive the probability of imputing the target to carry the mutation with dosage while the truth is g = 0, conditional on the observed DAC in the reference panel: . This happens when the target haplotype first coalesces with a branch (l) of size j carrying a mutation (Figure 2C).
Throughout the following derivation, we use the number of ancestral lines k for the current day sample to keep track of coalescent time and to connect topology and branch lengths. We first introduce some quantities useful for our derivation:
Let be the probability for j lines to reach their MRCA at the coalescent event that reduces the overall number of ancestral lines from k + 1 to k, without any of the j lines coalescing with any of the n−j other lines first. Let be the probability for one line to encounter no coalescent event for (at least) the first n−d events. Let be the probability of having at least one mutation event on a branch with length l, and let be the probability density function for the length of one internal branch starting when there are k lines left and ending when there are d lines left, in a tree of size n. Using these terms, we can now calculate:
We calculate the joint probability in the following three steps by conditioning on d and k (Equation 4 and Figure 2C). We calculate by adapting the second step (Equation 5).
To coalesce in the th coalescent event, when d lines remain on the tree, the target haplotype cannot coalesce in the first n−d events. By definition of P0, this is .
A branch of size j ancestral to all most closely related reference haplotypes arises in the reference at the th event () and encounters a mutation before it coalesces with the target branch in event (). Here, the probability for a mutation to occur on that particular branch is computed by integrating over all possible branch length l given k and d: .
The size-j branch does not coalesce till the th event and then coalesces with the target: .
Multiplying the probabilities of these sequential events and summing over all possible values of d, k give the joint probability:
(4) |
(5) |
To compute , define , the number of possible configurations of the topological history for n lines to coalesce till there are k ancestral lines left. Then:
(6) |
Next we consider the integral corresponding to the mutation event (Equation 7). Let, denote the expected time for the number of ancestral lines to go from i to i−1 in the coalescence process for a sample of n current haplotypes, which depends on the population history model. Thus, depends on mutation rate μ and expected coalescent time intervals , which can be numerically computed (Polanski et al. 2017) or approximated by Monte Carlo.
(7) |
Substituting Equations (6), (7), and into Equations (4) and (7), we now have the conditional probability fully defined:
(8) |
Following similar logic as above, we then consider the conditional probability distribution for fractional dosages. We derive , the probability of having u equally closely related best templates, j of which carry the mutation, for a target actually carrying the ancestral allele, conditional on observing the DAC as j in a reference of size n (Figure 2D). Details of derivation are in the Appendix.
Modeling r2 as a function of misidentification proportion
The above derivation fully characterizes the distribution of the imputed dosages assuming we perfectly identify the optimal templates. Although it is difficult to model all the possible sources of error in the real imputation process, we can use the total weight (q) attributed to haplotypes outside the set of most closely related templates to measure the identification error. Here, we derive its influence on the square of correlation coefficient (r2) between the true genotype and imputed dosage.
Let g be the true genotype, 1 for the derived allele and 0 for the ancestral allele; let f be the derived allele frequency (DAF), so . Let and be the imputed dosage with and without misidentification, respectively.
Let z be the random variable representing the proportion of carriers of the minor allele among the suboptimal templates contributing to the imputed dosage, so that can be modeled as a mixture . If we assume that the suboptimal templates are sampled at random from the population and each of the suboptimal templates is given equal weight, z is the average genotype of m random haplotypes, so E(z) = f and .
When the optimal templates for a target sample provide the perfect information , misidentified templates may be the alternate allele and therefore increase imputation error. However, where the target sample carries the ancestral allele but some of its optimal templates carry the derived allele, suboptimal templates attenuate the resulting error. Nevertheless, here we show that in expectation imputation error will always reduce the correlation between the true and the imputed genotype. Let’s express in terms of quantities regarding the theoretical imputed dosage .
(9) |
Comparing to the theoretical :
(10) |
Since is based on a few reference haplotypes and equal to g with large probability, we can reasonably assume and (misidentified haplotypes are more distant). Therefore the middle terms in the numerator and denominator of Equation (10) are approximately equal; the difference between the r2’s is governed by the last terms, which have a ratio strictly smaller than one: .
If misidentified haplotypes are close to random draws from all reference haplotypes (), the ratio simplifies to:
(11) |
In practice, the suboptimal templates contributing to the imputed dosage are more likely to be from lineages closer to the target, so their effect on imputation accuracy would be smaller than that from the above assumptions.
Length of haplotype before the next recombination breakpoint
Next we aim to derive the distance between consecutive recombination events on an external branch. For the purpose of modeling imputation, these distances represent the lengths of segments that are copied from the same haplotype. Let X be the genetic distance to the next observed recombination event on the target, and T be the length of its corresponding external branch. We calculate (we will use for continuous and for discrete distributions). Similar to above, we compute the probability density function of T by considering it first coalesces at the ()th event. is the sum of coalescent time intervals, thus following a convolution distribution as the sum of exponential with different rates. Conditional on T, X follows an exponential distribution with rate 2T since a recombination event on either the current template or the target haplotype results in a switch of template:
(12) |
When population size is constant, , where MGFT is the moment generating function of the hypoexponential distribution . When population size changes through time so time intervals are correlated, we can approximate the expected value with and compute with Monte Carlo.
The probability for a haplotype to first coalesce at the ()th event is , with being the probability of not coalescing with any lineage in the first (n−k) events (starting from total sample size n + 1). Therefore, we have the distribution of X is:
(13) |
(14) |
Simulation
We performed a standard Kingman coalescent simulation under population history models suggested by recent studies (detailed in ‘Results’). We simulate size-(n + 1) trees with n varying from 500 to 50k. For each generated tree, we keep track for all subtree configurations () (one external branch first coalesces with a branch of size u, which contains a branch of size j) of interests, and record the length of this size j branch. The sum of all such relevant branch lengths in the tree divided by the number of present haplotypes n + 1 is one realization of the joint probability , which will then be approximated by averaging such realizations over all simulated trees. To approximate the probability of observing j derived alleles in the reference panel, we simulated another independent set of trees of size n and summed over the lengths of all size j branches to get sample frequency spectrum. We present results from 50k independent simulations under each parameter setting throughout this article.
Empirical evaluation
To evaluate the predictions of our model, we imputed genotypes from the Michigan Genomics Initiative (MGI; Fritsche et al. 2018) using different imputation panels. To assess imputation quality, we masked some genotypes and compared the imputed genotypes to the genotypes generated by the array. MGI is collected from the patient population of the University of Michigan Hospitals and thus mostly () consists of individuals of European decent. The MGI individuals analyzed here are genotyped on a Illumina Infinium CoreExome chip with custom markers, providing sufficient low-frequency variants to assess imputation performance for rare variants.
We considered five reference panels: (1) The Haplotype Reference Consortium (HRC) r1.1 reference panel (n = 32,470 individuals of mostly European descent, variants; McCarthy et al. 2016), (2) the 1000 Genomes (1KG) Project variant calls from low-coverage sequencing (n = 2548, variants; Lowy-Gallego et al. 2019), (3) the 1KG variant calls from high-coverage sequencing (n = 2504, variants; Byrska-Bishop et al. 2021), (4 and 5) subsets of the high- and low-coverage 1KG variant calls containing only samples from the European super population (n = 503). Note that the HRC reference panel only provides variants with MAC 5 (McCarthy et al. 2016). We subsetted all 1KG variant reference panels to the set of overlapping individuals and markers and filtered out monomorphic and singleton sites, resulting in 2503 samples and variants for panels (2) and (3) and 503 samples and variants for panels (4) and (5). We also re-phased all 1KG panels using Eagle (v2.4.1; Loh et al. 2016) without a reference panel to avoid confounding due to differing phasing quality.
Using each reference panel, we imputed all autosomal variants by running Minimac4 (v1.0.0; Howie et al. 2012) on a computing cluster maintained by the Center for Statistical Genetics at the University of Michigan in Ann Arbor. We set Minimac4 parameters to use 16 cpus and output data in genotype, estimated alternate allele dosage, and estimated haploid alternate allele dosage formats. For all rare variants genotyped in MGI, we estimated imputation quality using the squared Pearson correlation coefficient between known and imputed genotypes.
Data availability
1000 Genomes low-coverage sequence data were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/. 1000 Genomes high-coverage sequence data were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/.
Results
We derived an analytical approach to calculate the impact of model misspecification of the current imputation framework with our coalescent model on rare variants (see ‘Methods’). Assuming a Li and Stephens (2003) model-based imputation algorithm correctly identifies all and only those haplotypes in the reference that are most closely related to the target sample (“closest templates”), we calculated the theoretical error rate as a function of reference size n and the DAC. Using coalescent simulations, we confirm the analytical results, and incorporate general population models. We compared our theoretical predictions with empirical imputation accuracy observed in the MGI, with the 1000 Genomes (Abecasis et al. 2010) or the Haplotye Reference Consortium (McCarthy et al. 2016) as reference.
The number of closest template haplotypes
In the coalescence context, identifying a single-reference haplotype that is most similar to the target haplotype is equivalent to selecting the reference haplotype that has the MRCA with the target haplotype or, equivalently, whose lineage is the first to coalesce with the target. However, by the time this lineage coalesces with the target, it may be ancestral to multiple reference haplotypes (Figure 1). In this case, those reference haplotypes are equally closely related with the target, each of them in expectation providing the same amount of information for imputing the target sample. Thus, the model assuming one single best template is misspecified, and imputation is more likely to be ambiguous when the number of closest templates is larger.
The probability for the target haplotype to first coalesce with a lineage having u descendants in the reference is:
(15) |
Note that the derivations of this probability do not condition on the presence of a variable site. Thus, this probability depends only on the topology of the coalescent tree, independent of population history (see ‘Methods’). Interestingly, the probability distribution of the number of best matches does not depend on the size of the reference panel (n), except for the extreme case u = n, where the entire reference has equal genetic distance to the target. In that case, . The expected number of closest templates , approaching 2 as the size of the reference panel increases.
From Equation (15), we see that with probability , the target haplotype first coalesces with an external branch (u = 1) and the reference contains exactly one best match. Thus, the Li and Stephens model of exactly one best template is misspecified with probability . The probability of u equally good templates drops rapidly with increasing u (Table 2). With probability , the target haplotype first coalesces with an internal branch with two descendants in the reference (u = 2), while the probability is 0.0108 that the target haplotype first coalesces with an internal branch with more than 10 descendants in the reference (u > 10).
Table 2.
No. templates | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Probability | 0.6667 | 0.1667 | 0.0667 | 0.0333 | 0.0190 | 0.0119 | 0.0079 | 0.0056 | 0.0040 | 0.0030 |
Impact on imputation accuracy
As we have just demonstrated, the model assumption that each target haplotype has a single most closely related template haplotype is a model misspecification for 1/3 of the genome where there are multiple closest templates. This model misspecification contributes to imputation error beyond the typical source of error created by failing to identify the most closely related haplotype. To isolate this additional error, we now evaluate the imputation of a missing genotype assuming that all closest templates are correctly identified but analyzed under a model that assumes a single best haplotype. In this scenario, imputation algorithms will correctly impute all non-singleton variants if the target haplotype has a single most closely related template haplotype. Imputation errors can only occur if the target haplotype has more than one closest template, i.e. a model misspecification. In this case, imputation algorithms consider all these template haplotypes to be equally likely to be the“best” template. Accordingly, it will interpret identifying multiple equally close templates as uncertainty in identifying the best template. For variants that differ between these template haplotypes, the imputed genotype is then usually the average of the DAC of those templates, a fractional genotype (dosage). Thus, for one given target haplotype carrying the ancestral allele to be imputed with a nonzero DAC dosage (“false positive”), some of its templates have to carry the derived allele.
Although the probability of this misspecification is independent of population history, the probability of a mutation event causing an imputation error depends on the history of the population template and target haplotypes are sampled from. In our primary analysis, we consider a population history model approximating European population history (Terhorst et al. 2017; Speidel et al. 2019). Starting from an ancestral population with effective population size 104, it undergoes a bottleneck with for 100k years (∼3450 generations), then grows with an accelerated rate (faster than exponential growth;Reppell et al. 2014) to during the most recent 10k years. The effect of these model parameters on the following results is marginal within a model space reasonable for human population (Appendix).
We first consider the special case where all the closest templates for a target carry the derived allele, while the target haplotype carries the ancestral allele. In this case, the imputed dosage of the derived allele is 1 when the truth is 0 (Figure 2A). By integrating over all possible tree shapes and branching times (see ‘Methods’), we calculated the probability for this configuration for reference sample sizes of 500, 5000, 20,000, and 50,000 individuals. We verified all results using computer simulations. Across all considered reference sizes, the probability of this completely wrong imputation for one imputed haplotype is small ranging from to (Figure 3A andTable 3). Holding reference size constant, this probability decreases with rising DAC. Similarly, holding DAC constant, this probability of imputation error decreases with increasing number of templates.
Table 3.
Reference size (individuals) | DAC in the reference panel |
||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 |
3 |
5 |
10 |
20 |
|||||||||||
Imputed dosage for the derived allele | |||||||||||||||
500 | 2145 | 1122 | 848 | 1802 | 1006 | 653 | 1361 | 815 | 420 | 840 | 541 | 153 | 435 | 292 | 57 |
5000 | 214 | 109 | 80 | 179 | 97 | 60 | 136 | 78 | 41 | 84 | 52 | 23 | 49 | 32 | 13 |
20,000 | 53 | 27 | 18 | 44 | 24 | 15 | 34 | 19 | 11 | 21 | 13 | 7 | 12 | 7 | 4 |
50,000 | 21 | 10 | 7 | 17 | 9 | 6 | 13 | 8 | 4 | 8 | 5 | 3 | 5 | 3 | 2 |
Consider imputing a sample dataset with one million individuals using a reference panel containing 500–50,000 individuals from the same population as the sample, the expected number of individuals who are homozygous for the ancestral allele (g = 0) but have an imputed genotype dosage as or . Each combined column represents one DAC in the reference. Results are from 50k independent coalescence simulations, each data point is an average of over 107 loci.
To put these error rates in relationship with the number of true carriers of the derived allele, we assume that the DAF in the target population is the same as that in the reference population. For instance, consider a site with two copies of the derived allele in a reference of 500 individuals (MAF ). For this site, we expect to impute one individual as false carrier for every 1210 individuals (Figure 3). Among these 1210 individuals, we also expect 4.84 true carriers of this variant assuming the same MAF in the target population. Thus, for variants that are doubletons in a reference sample of 500 individuals, about one in six imputed rare alleles will be a false positive (Table 3). More generally, for a fixed DAC in the reference, the probability of falsely imputing the derived allele decreases with increasing reference size while the number of true carriers of the derived allele also decreases. As a result, the proportion of falsely imputed derived alleles decreases only moderately with increasing reference size. For example, for variants that are doubletons in the reference, the proportion of false positive-derived alleles decreases only from 17.5% for a reference size of 500 to 14.7% for a reference size of 50,000.
If we now consider the more general case where at least one of the closest templates carry the derived allele while the target carries the ancestral allele (dosage 0). In this case the imputed dosage (Figure 3). Across reference sizes of 500, 5000, 20,000, and 50,000 individuals, we calculated the probability of for each imputed haplotype. This probability is notably larger than the probability of falsely imputing dosage 1, ranging from to for doubletons (with DAC 2;Figure 3B). This probability decreases with rising DAC and increasing reference size.
From these probabilities, we calculate the expected number of haplotypes carrying the ancestral allele (g = 0) that are falsely imputed to have either a nonzero derived allele dosage (), a higher dosage or a dosage of 1 for every million target individuals (Table 3). Haplotypes with represent cases where a“best-guess” imputation algorithm would infer the alternate allele. Beyond the previously described impact of DAC and reference size, these results show that for DAC < 5, about half of all falsely imputed derived alleles have a dosage > 0.5. This observation can be explained by the fact that for a given DAC, observing a dosage < 0.5 requires a larger number of equally good templates than observing a larger dosage. As large numbers of equally good templates are rare (Table 2), the proportion of higher dosage among all falsely imputed nonzero dosages increases with the DAC.
To summarize the impact of this error on association tests, we calculated the squared correlation coefficient (r2) between the imputed dosages and the true genotypes using simulations, both under the assumption that model misspecification is the only source of error and under the assumption that some reference haplotypes are falsely identified as being most closely related. r2 is a commonly used measure for imputation quality, as it is directly related to statistical power in downstream association tests (Guan and Stephens 2008; Huang et al. 2009; Wojcik et al. 2018).
Still assuming that closest templates are identified perfectly, we generated the distribution of expected r2 for a range of MACs and reference sizes using simulations (see ‘Methods’). The expected r2 increases monotonically with DAF in the reference and reference size (Figure 4). If we consider the increase with DAF across reference sizes, we observe that the curves of expected r2 values look very similar for each reference panel size, only shifted by a factor of 1/panel size. In other words, r2 of variants with the same MAC remains almost the same across all considered reference sizes. For example, the average r2 among doubletons is 0.822 in a reference size of 500 and 0.831 in a reference sample size of 50,000, where its frequency is only 1/100 of the former. For variants observed 10 or more times in the reference, the expected r2 is > 0.97 regardless of the size of the reference, as model misspecification do not play a major role for the imputation quality of these variants, but for variants observed less often, r2 decreases rapidly with decreasing allele count.
To include other sources of error in the imputation process, we model the false identification of reference haplotypes that are more distantly related to the target as templates. Those distantly related haplotypes are less likely to carry the same allele as the target haplotype and thus introduce an additional source of error. We parameterize this identification error as the sum of all probabilities assigned to falsely identified templates, denoted as q (see ‘Methods’). In practice, this imputation error depends on the choice of imputation algorithm, marker density, quality of genotyping and statistical phasing; detailed modeling of these factors is beyond the scope of this article.
As expected, imputation error of the variant increases with increasing q shown as the mean squared error (MSE) of the imputed genotype (Figure 4B). As the imputed genotype for all individuals converges to the allele frequency f as q goes to 1, the MSE increases faster with higher f. Similarly, the r2 between the true genotype and the imputed genotype decreases with increasing q. However, including some suboptimal haplotypes only has a small effect for rare variants. If we assume a single suboptimal template, only has marginal effect on r2, when q increases beyond this threshold the decrease of r2 becomes almost linear. This threshold depends on the number of suboptimal templates: with a larger number of templates, the variance of the imputed genotype decreases and the effect of including suboptimal templates is more deterministic (see ‘Methods’). Accordingly, if the number of suboptimal templates is larger, the threshold where r2 starts to decrease with q is larger while the MSE is lower (Figure 4B). For example, consider imputing a variant that is a doubleton in the reference with 20k haplotypes. Assuming no identification error, . If we identify one suboptimal templates with posterior probability 0.3 as well as all the closest templates, the expected r2 is 0.70 and the MSE is . If we instead identify 5 suboptimal templates with total posterior probability 0.3 as well as all the closest templates, the expected r2 increases to 0.80 while the MSE decreases to .
For variants with DAC < 5 in the reference, the error caused by model misspecification dominates for a wide range of q; while for variants of DAC over 20 the error caused by model misspecification is negligible and the with perfectly identified optimal templates is above 0.99 regardless of the reference size (Table 4), thus the identification error dominates the empirical imputation error.
Table 4.
Reference size (individuals) | DAC in the reference |
|||||
---|---|---|---|---|---|---|
2 | 3 | 4 | 5 | 10 | 20 | |
500 | 0.822 (0.089) | 0.885 (0.067) | 0.919 (0.051) | 0.940 (0.041) | 0.979 (0.017) | 0.995 (0.0055) |
2000 | 0.823 (0.089) | 0.886 (0.066) | 0.920 (0.051) | 0.940 (0.040) | 0.979 (0.016) | 0.994 (0.0054) |
5000 | 0.826 (0.088) | 0.887 (0.065) | 0.921 (0.050) | 0.941 (0.040) | 0.979 (0.016) | 0.994 (0.0055) |
20,000 | 0.828 (0.087) | 0.889 (0.065) | 0.922 (0.050) | 0.942 (0.040) | 0.980 (0.016) | 0.994 (0.0053) |
50,000 | 0.831 (0.087) | 0.891 (0.065) | 0.923 (0.050) | 0.943 (0.039) | 0.980 (0.016) | 0.994 (0.0053) |
Numbers in the parentheses are SD summarized over 107 loci, representing the variation among variants in a population sample. Results are from 50 k independent coalescence simulations.
Comparison with empirical imputation accuracy
We imputed genotypes of 56,984 participants in MGI using two standard reference panels, the 1KG (2504 individuals) and the HRC (32,470 individuals). For variants present in the MGI dataset, we calculated r2 between observed genotypes and imputed genotypes and stratified these by the MAC of the variant in the reference panel (Figure 5).
We recapitulate previous results (e.g.Das et al. 2016) that panel size had a notable effect on imputation accuracy. For a given MAC, r2 for variants imputed with the largest panel (HRC) is lower than that for variants imputed with the full 1KG panel by , the later is in turn lower than that for variants imputed with the European ancestry individuals from 1KG by (difference relatively stable across MAC). However, when conditioning on MAF, the order of panels flips, as the same MAC reflects much smaller MAF in larger panels. Here, r2 for variants imputed with HRC is higher than that for variants imputed with the full 1KG panel by , the later is in turn higher than that for variants imputed with the European ancestry individuals from 1KG by .
Sequencing depth of the reference panels on the other hand had only a very small effect. The use of deep coverage 1KG panels increased imputation accuracy typically by < 0.01, compared with the low-coverage panel. Note that markers called only in the deep coverage panel are not included in this comparison.
Comparing these empirical result with the theoretical upper bound from our model (Figure 4), we observe that the empirical r2 is much lower than what is predicted under the assumption that only model misspecification affects imputation accuracy. For doubletons imputed with the whole 1KG, the average r2 is 0.32; for doubletons imputed with the European subset (504 individuals) the average r2 is 0.41, while the theoretical upper bound is above 0.82 for both sample sizes. Although empirical imputation accuracy of rare variants decreases with decreasing MAC across all reference panels, it decreases more rapidly for MAC < 10. This more rapid decrease is mirrored by the rapid decrease of the theoretical upper bound for MAC < 10 observed in the theoretical prediction.
Length between template switches
We now consider another potential source of error in current imputation model: the switch between templates. When we model the target haplotype as a mosaic of templates from the reference, switching between templates can be interpreted as a historical recombination event that breaks the genealogical bond between the target and its current template, i.e. the branches connecting the two leaves in a coalescent tree. The two haplotypes will become practically independent beyond the recombination break point.
Conditional on a known local genealogy, the length to the next recombination break point follows an exponential distribution, with the rate proportional to twice the TMRCA between the target and the template. In practice, the genealogy is unknown, and the length to the next recombination break point is a combination of the distribution of the conditional length and the time to the TMRCA (detailed in ‘Methods’). Here we compare this mixture distribution with the exponential distribution that is typically assumed in current imputation methods (Li and Stephens 2003; Browning et al. 2018).
Comparing the mixture distribution to an exponential distribution with the same mean shows that with genealogy and population history aware modeling, the distribution of length between switches is very similar for small reference samples (n = 500; Figure 6). For larger reference samples (n = 20,000) the mixture distribution has larger variance, with higher density in both extremely short and long intervals but lower density for intermediate ones. If the excess of short no-recombination intervals is not well captured due to model misspecification, we expect to have more suboptimal templates thus higher imputation error rate. When the reference size increases, switches are less frequent as it is more likely to find a template sharing a very recent MRCA with the target.
Discussion
The empirical performance of imputation methods has been extensively studied (Chan et al. 2016; Das et al. 2016; Shi et al. 2018; Taliun et al. 2021), while the theoretical behavior and limit of the underlying framework has not been well characterized, especially for rare variants. We formulate this problem of imputation accuracy into a coalescent model, considering imputing the missing genotypes on one target haplotype by copying from a set of reference haplotypes that share the MRCA with the target. Our approach identifies two model misspecifications in modern imputation algorithms and explores their impact on our ability to impute rare variants.
First, most imputation algorithms model a single best template as the hidden variable to infer. We show that in 1/3 of the cases, multiple haplotypes are equally good templates for imputing one target in any given reference panel, independent of reference sample size or population history. The resulting model misspecification leads to imputation error when genetic variants are shared by some of the optimal templates but not with the target. We develop analytical expressions for the imputation error resulting from this model misspecification as a function of the DAF and reference size.
For this purpose, we assume that the imputation algorithm correctly identifies the most closely related haplotypes so that this model misspecification is the only source of error. In this idealized scenario, we observe that for variants observed five or less times in the reference panel, of variants with nonzero dosage are noncarriers; up to MAC 10, the mean r2 between the imputed dosage and true genotypes < 0.98. Conditional on the DAC, this effect is broadly independent of reference panel size: although the probability of falsely imputing carriers of the rare allele decreases as the reference panel size becomes large, the number of true carriers also decreases and the proportion of false carriers over true carriers stays about the same. Including other sources of error that occur in practice further decreases imputation accuracy. Thus the expected r2 we present here designate the upper bound of achievable mean imputation accuracy for rare variants in current imputation framework.
Such imputation error substantially reduces the power for detecting novel risk variants in an association study based on imputed genotypes. Two scenarios can be considered here: First, single-marker tests of imputed rare variants can be powerful if the case-control dataset is much larger than the imputation panel (Taliun et al. 2021). In this scenario, inaccurately inferred genotypes at a disease-related locus attenuate the allele frequency difference between cases and controls, and r2 is directly related to this loss of information that compromises the statistical power in association tests. Imputation error rate of 2–6% leads to 10–60%increase in required sample size in a single-marker test (Huang et al. 2009). As a second scenario, imputed rare variants can be aggregated into a single-test statistic (Lee et al. 2014). In this study design, poorly imputed variants will have an attenuated signal, potentially diluting the signal from better imputed variants. This loss of power can be limited by focusing on imputed variants where misspecification will not impede imputation accuracy, e.g. variants that occur more than five times in the reference.
We consider a second model misspecification: the use of exponential distributions to model the length of contiguous haplotypes without template switches. The true distribution of this length is driven both by the recombination rate and the relationship between the template haplotype and the target haplotype. For smaller reference such as the 1000 Genomes (Abecasiset al. 2010; 2504 genomes) this model misspecification has a negligible effect, but for reference samples of the scale of TopMed (Zachary ; 53,831 genomes), gnomAD (Karczewski et al. 2020; 15,708 genomes), or the Haplotye Reference Consortium (McCarthy et al. 2016; 38,821 genomes), the length distribution of shared segments between the target and a single template has a much heavier tails than modeled. This misspecification penalizes both extremely short and extremely long switching intervals, decreasing the probability of finding the optimal templates. This in turn reduces imputation accuracy, especially for low-frequency variants beyond the effect of the first model misspecification described earlier.
The theoretical results we present here are broadly robust to assumptions about population size history in a range reasonable for major human populations. We assume that the imputed region is evolutionary neutral; for loci under selection the impact of the described model misspecifications would likely depend both on the selection model and on the population history. Further, we assume that the reference is from the same homogeneous population as the target sample. If we instead modeled a diverse reference sample, results would depend on the frequency distribution of the imputed variant among the reference samples. For rare alleles, which are typically private to a single population, only members of that single population among the reference would affect imputation accuracy. Our results can be extended to including migration, admixture, or selection, as the mathematical derivation and simulation scheme in this work are general for coalescence at a single locus.
In our empirical evaluation of imputation quality using two commonly used reference panel, the accuracy of rare variants is well below theoretical upper bound, suggesting that we have not reached the limit of the Li and Stephen’s framework. Conditional on MAC, imputation becomes less accurate with increasing reference size, suggesting that failure to identify the genealogically best templates is likely the biggest source of error. Identifying these best templates may not always be possible with available marker data, whose resolution is limited by the observed polymorphic sites and local variation of recombination rate. Other factors that are likely to reduce imputation accuracy include error in statistical phasing in the reference or the target haplotypes and departures from the infinite sites assumption (parallel mutations or back mutations). However, as our model predicts, for all reference panels, empirical imputation accuracy decreases much faster for MAC < 10, suggesting that for these low counts the model misspecification modeled here further reduces the ability to correctly impute rare variants.
Our results suggest that potential improvement of the imputation framework may lie in more detailed modeling of the underlying genealogy, especially for extremely rare variants where only a small subset of the reference contributes information about the imputed genotype. For example, suppose the ancestral recombination graph (ARG) including both the target and the reference haplotypes is constructed, the scenarios resulting in fractional dosage could be avoided (up to uncertainty in ARG). However, alleles absent from the reference would still be missed, and there would be uncertainty when all and only the closest reference haplotypes carry an allele: we can only probabilistically decide whether the mutation event or the coalescence event with the target haplotype happens first.
Overall, we identify that the model misspecifications in imputation algorithms limit our ability for imputing rare variants. Such inherent errors reduce the power of single-variant tests as well as aggregation tests in studies that impute genotypes in large cohorts, especially if they focus on alleles that are observed only a few times in the reference panel. Beyond improving our understanding of the performance of imputation algorithms, these results point to potential new imputation strategies that help identify new risk variants.
Software availability
Codes used to perform analytical calculation and coalescent simulation are available from https://github.com/Yichen-Si/ImputationBound.
Acknowledgments
We thank Noah Rosenberg and Jonathan Terhorst for thoughtful comments, and two anonymous reviewers for helpful feedback.The authors acknowledge the Michigan Genomics Initiative participants, Precision Health at the University of Michigan, the University of Michigan Medical School Central Biorepository, and the University of Michigan Advanced Genomics Core for providing data and specimen storage, management, processing, and distribution services, and the Center for Statistical Genetics in the Department of Biostatistics at the School of Public Health for genotype data curation, imputation, and management in support of the research reported in this publication.
Funding
Funding for this research was provided by US National Institutes of Health (NIH) grant R01 HG005855. The following cell lines/DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research [‘NA06984, NA06985, NA06986, NA06989, NA06994, NA07000, NA07037, NA07048, NA07051, NA07056, NA07347, NA07357, NA10847, NA10851, NA11829, NA11830, NA11831, NA11832, NA11840, NA11843, NA11881, NA11892, NA11893, NA11894, NA11918, NA11919, NA11920, NA11930. NA11931, NA11932, NA11933, NA11992, NA11994, NA11995, NA12003, NA12004, NA12005, NA12006, NA12043, NA12044, NA12045, NA12046, NA12058, NA12144, NA12154, NA12155, NA12156, NA12234, NA12249, NA12272, NA12273, NA12275, NA12282, NA12283, NA12286, NA12287, NA12340, NA12341, NA12342, NA12347, NA12348, NA12383, NA12399, NA12400, NA12413, NA12414, NA12489, NA12546, NA12716, NA12717, NA12718, NA12748, NA12749, NA12750, NA12751, NA12760, NA12761, NA12762, NA12763, NA12775, NA12776, NA12777, NA12778, NA12812, NA12813, NA12814, NA12815, NA12827, NA12828, NA12829, NA12830, NA12842, NA12843, NA12872, NA12873, NA12874, NA12878, NA12889, NA12890, and NA12890]. These data were generated at the New York Genome Center with funds provided by NHGRI Grant [3UM1HG008901-03S1 (Michael Zody, personal communication).
Conflicts of interest
None declared.
Appendix
Probability distribution of fractional dosages
In the ‘Methods’ section, we derived the probability of having even the best possible evidence suggesting a wrong allele type. Here we will generalize it to the cases where the most closely related templates have different allele types, resulting in a fractional estimated dosage (Figure 2D).
Let be the probability of having u equally good optimal templates, j of which carrying the mutation, for a target actually carrying the ancestral allele; conditional on observing the DAC as j. Let denote the numbers of ancestral lines within the size u subtree when the branch carrying the mutation starts and ends; denote the numbers of lines in the whole tree at the corresponding time point.
We will outline the calculation of the joint probability (the numerator), while the denominator is the same unfolded frequency spectrum as before. We proceed by the following steps, each subsequent probability is conditional on the previous one, in parallel with that for the special case in ‘Methods’.
The target haplotype does not coalesce in the first events: .
A branch of size u arises in the reference at the th event, then remains alone till the th event to coalesce with that external branch: .
The subtree of size u contains a branch of size j: (this step only concerns the topology within the subtree).
The size-j branch encounters a mutation: (here, we need to put the subtree back to the whole size-(n + 1) tree to get the branch length and introduce mutation).
Steps (1) and (2) are similar to the three steps in the previous section except for not involving mutation events, we combine them to : an external branch first coalesces at the th event with an internal branch of size u which starts at the th event. In step (3), is similar to in step (2), as : the probability for a branch of size j to start at the th event and end at the th event.
(16) |
The only component left is , the link between the topology within the subtree and the branch lengths, which is relative to the whole sample and involves the population size history model. Since coalescent time intervals are measured for the whole tree, tells us when the branch of size j starts and ends, which leads to the probability for a mutation to occur.
We divide the coalescent process among the whole reference from n lines to k lines into three parts: . There are unfixed events (since we are conditioning on the three time points, and (k)). Within each part, we consider the number of events that has to happen inside the subtree: . With the number of lines fixed, is calculated by considering the possible ways to arrange those events in the three time intervals:
(17) |
The denominator in Equation (10) comes from removing the constrains introduced by the mutation event. Conditional on (u, k), there are in total unfixed events in the whole tree, including in the subtree. Since defines the coalescent time in the whole size-(n + 1) tree, we can calculate the length of the branch carrying mutation by Equation (6): .
Comparison of population growth models (Figures A1 and A2)
Genotyping and sample quality control
DNA samples from the blood of MGI participants were processed on one of two production batches of a customized Illumina Infinium CoreExome-24 bead array. Genotype calls were produced with the Illumina GenomeStudio 2.0 software operating the Genotyping Module v2.0.4 and the Gentrain clustering algorithm v3.0. Variant- and sample-level quality control (QC) is detailed as following.
Bead array probe mapping and genotype calling were performed according to the Genome Reference Consortium Human Build 37 (GRCh37), 502,255 variants passed variant-level QC. Variants were excluded if (1) bead array probe sequences did not perfectly and uniquely map to the reference (BLAT v.351; Kent 2002), (2) variant-level call-rate was below 99%, (3) variant fail Hardy–Weinberg equilibrium exact test with P < 10−6 in unrelated European samples (PLINK v1.90; Purcell et al. 2007), (4) GenomeStudio GenTrain score < 0.15 or Cluster Separation score < 0.3, or (5) allele frequency differed between bead array production batches (P < 10−3), Fisher’s exact test). The genetic ancestry of MGI participants was inferred by projecting MGI samples onto the space created by the first two principal components (PCs) of 938 unrelated samples of the Human Genome Diversity Project (HGDP) reference panel (Li et al. 2008; PLINK). MGI samples were inferred to belong to a HGDP reference population if they fell within a circle drawn around that population in a plot of the PCs. We also lift it over to GRCh38, where 501,607 variants remained.
Only consented individuals were included, and 56,984 samples passed sample-level QC. Samples were excluded in QC if (1) genotype-inferred sex is abnormal or did not match the self-reported gender of the participant or the self-reported gender was missing, (2) sample shared a kinship coefficient > 0.45 with another sample with a different identification tag (KING v2.1.3; Manichaikul et al. 2010), (3) sample-level call-rate was below 99% or any chromosome had a call-rate below 1/5 of the average, or (4) estimated contamination level exceeded 2.5% (VICES; Zajac et al. 2019).
Literature Cited
- Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. ;1000 Genomes Project Consortium. 2010. A map of human genome variation from population-scale sequencing. Nature. 467:1061–1073. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Browning BL, Zhou Y, Browning SR.. 2018. A one-penny imputed genome from next-generation reference panels. Am JHum Genet. 103:338–348. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Byrska-Bishop M, , Evani US, , Zhao X, , Basile AO, , Abel HJ, et al. 2021. High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. bioRxiv doi: 10.1101/2021.02.06.430068 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chan AW, Hamblin MT, Jannink J-L.. 2016. Evaluating imputation algorithms for low-depth genotyping-by-sequencing (GBS) data. PLoS One. 11:e0160733. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Das S, Forer L, Schönherr S, Sidore C, Locke AE, et al. 2016. Next-generation genotype imputation service and methods. Nat Genet. 48:1284–1287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fritsche LG, Gruber SB, Wu Z, Schmidt EM, Zawistowski M, et al. 2018. Association of polygenic risk scores for multiple cancers in a phenome-wide study: results from the Michigan Genomics Initiative. Am JHum Genet. 102:1048–1061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guan Y, Stephens M.. 2008. Practical issues in imputation-based association mapping. PLoS Genet. e1000279.4: [DOI] [PMC free article] [PubMed] [Google Scholar]
- Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR.. 2012. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 44:955–959. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang L, Buzbas EO, Rosenberg NA.. 2013. Genotype imputation in a coalescent model with infinitely-many-sites mutation. Theor Popul Biol. 87:62–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang L, Wang C, Rosenberg NA.. 2009. The relationship between imputation error and statistical power in genetic association studies in diverse populations. Am JHum Genet. 85:692–698. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jewett EM, Zawistowski M, Rosenberg NA, Zöllner S.. 2012. A coalescent model for genotype imputation. Genetics. 191:1239–1255. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, et al. ; Genome Aggregation Database Consortium. 2020. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 581:434–443. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kent WJ. 2002. BLAT—The BLAST-like alignment tool. Genome Res. 12:656–664. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kingman JFC. 1982. On the genealogy of large populations. JAppl Prob. 19:27–43. [Google Scholar]
- Lee S, Abecasis GR, Boehnke M, Lin X.. 2014. Rare-variant association analysis: study designs and statistical tests. Am JHum Genet. 95:5–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, et al. 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 319:1100–1104. [DOI] [PubMed] [Google Scholar]
- Li N, Stephens M.. 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 165:2213–2233. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Loh PR, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, et al. 2016. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 48:1443–1448. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lowy-Gallego E, Fairley S, Zheng-Bradley X, Ruffier M, Clarke L,. et al. ;The 1000 Genomes Project Consortium. 2019. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res. 4:50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, et al. 2010. Robust relationship inference in genome-wide association studies. Bioinformatics. 26:2867–2873. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, et al. ;Haplotype Reference Consortium.2016. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet. 48:1279–1283. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Paul JS, Steinrucken M, Song YS.. 2011. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics. 187: 1115–1128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Polanski A, Szczesna A, Garbulowski M, Kimmel M.. 2017. Coalescence computations for large samples drawn from populations of time-varying sizes. PLoS One. 12:e0170701. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am JHum Genet. 81:559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reppell M, Boehnke M, Zollner S.. 2014. The impact of accelerating faster than exponential population growth on genetic variation. Genetics. 196:819–828. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Servin B, Stephens M.. 2007. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 3:e114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shi S, Yuan N, Yang M, Du Z, Wang J, et al. 2018. Comprehensive assessment of genotype imputation performance. Hum Hered. 83:107–116. [DOI] [PubMed] [Google Scholar]
- Speidel L, Forest M, Shi S, Myers SR.. 2019. A method for genome-wide genealogy estimation for thousands of samples. Nat Genet. 51:1321–1333. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Terhorst J, Kamm JA, Song YS.. 2017. Robust and scalable inference of population history from hundreds of unphased whole-genomes. Nat Genet. 49:303–309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Timpson NJ, Greenwood CM, Soranzo N, Lawson DJ, Richards JB.. 2018. Genetic architecture: the shape of the genetic contribution to human traits and disease. Nat Rev Genet. 19:110–124. [DOI] [PubMed] [Google Scholar]
- Wiuf C, Donnelly P.. 1999. Conditional genealogies and the age of a neutral mutant. Theor Popul Biol. 56:183–201. [DOI] [PubMed] [Google Scholar]
- Wojcik GL, Fuchsberger C, Taliun D, Welch R, Martin AR, et al. 2018. Imputation-aware tag SNP selection to improve power for large-scale, multi-ethnic association studies. G3 (Baltimore).8:3255–3267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Taliun DHarris D NKessler M DCarlson JSzpiech Z A, . et al. 2021. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 590:290–299. 10.1038/s41586-021-03205-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zajac GJM, Fritsche LG, Weinstock JS, Dagenais SL, Lyons RH, et al. 2019. Estimation of DNA contamination and its sources in genotyped samples. Genet Epidemiol. 43:980–995. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
1000 Genomes low-coverage sequence data were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/. 1000 Genomes high-coverage sequence data were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/.