Probabilistic Estimation of Identity by Descent Segment Endpoints and Detection of Recent Selection

Sharon R Browning; Brian L Browning

doi:10.1016/j.ajhg.2020.09.010

2020 Oct 13;107(5):895–910. doi: 10.1016/j.ajhg.2020.09.010


PMCID: PMC7553009 PMID: 33053335

Summary

Most methods for fast detection of identity by descent (IBD) segments report identity by state segments without any quantification of the uncertainty in the endpoints and lengths of the IBD segments. We present a method for determining the posterior probability distribution of IBD segment endpoints. Our approach accounts for genotype errors, recent mutations, and gene conversions which disrupt DNA sequence identity within IBD segments, and it can be applied to large cohorts with whole-genome sequence or SNP array data. We find that our method’s estimates of uncertainty are well calibrated for homogeneous samples. We quantify endpoint uncertainty for 77.7 billion IBD segments from 408,883 individuals of white British ancestry in the UK Biobank, and we use these IBD segments to find regions showing evidence of recent natural selection. We show that many spurious selection signals are eliminated by the use of unbiased estimates of IBD segment endpoints and a pedigree-based genetic map. Eleven of the twelve regions with the greatest evidence for recent selection in our scan have been identified as selected in previous analyses using different approaches. Our computationally efficient method for quantifying IBD segment endpoint uncertainty is implemented in the open source ibd-ends software package.

Keywords: identity by descent, natural selection, recent positive selection

Introduction

Pairs of individuals within a population can share one or more long segments of their genomes identical by descent due to inheritance from common ancestors. Identity by descent (IBD) segments are used in many applications, including estimation of kinship,¹,²,³ recent demography,⁴,⁵,⁶,⁷,⁸ mutation rates,⁹,¹⁰,¹¹,¹² and recombination rates¹³ and detection of recent selection.¹⁴,¹⁵,¹⁶

The three main types of test for recent positive selection are based on population differentiation, admixture proportions, and haplotype structure. The first type looks for variants that differ markedly in frequency between populations.¹⁷,¹⁸ The second type looks for regions in which the sample ancestry proportions in admixed individuals differ from those elsewhere in the genome.¹⁹,²⁰ One subtype of this test involves archaic admixture, such as introgression from Neanderthals into modern humans, and searches for regions in which the frequency of the archaic haplotype in a modern population is unusually high.²¹ The third type looks for high-frequency haplotypes that are unusually long.²²,²³ IBD-based selection scans fall into this category.¹⁴ IBD scans look for genomic regions that have a significantly higher than average number of IBD segments. If the genome were completely neutral, and there were no biases in detecting IBD segments or estimating their centiMorgan (cM) lengths, the expected number of IBD segments exceeding some cM length threshold would be constant across the genome. In contrast, if certain haplotypes in a genomic region have a selective advantage, the effective size of the population is reduced in that region, which leads to a higher than expected number of IBD segments. IBD-based tests can also detect the effects of negative selection and balancing selection, since any type of selection will tend to decrease the effective population size within the genomic region.

An IBD segment for a pair of haplotypes is a segment of DNA inherited from a single common ancestor, with no crossovers occurring within the segment in the lineages of the two haplotypes since the common ancestor.⁴,⁶ Within a shared IBD segment, sequence identity can be disrupted by mutation and gene conversion. In addition, genotype error can cause two haplotypes to appear to be discordant at a position. At such positions, two “identical by descent” haplotypes are in fact not identical. This non-identity needs to be considered when detecting IBD segments.

In the human genome, de novo single-nucleotide mutations occur at an average rate of around 1.3 × 10⁻⁸ per base pair per meiosis,¹² which is similar to the average rate of crossing over per base pair per meiosis. Thus, regardless of the number of generations since the most recent common ancestor, an average of approximately one mutation is expected in the lineage of an IBD segment.
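The cancellation behind this statement can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a crossover rate of 1 × 10⁻⁸ per base pair per meiosis (the value used in the simulations described later, taken here as a stand-in for the genome-wide average) and the simplification that a segment whose common ancestor lived g generations ago has an expected length inversely proportional to the 2g meioses separating the two haplotypes. The function name is hypothetical.

```python
MUTATION_RATE = 1.3e-8   # per base pair per meiosis (value quoted in the text)
CROSSOVER_RATE = 1.0e-8  # per base pair per meiosis (assumed for illustration)


def expected_mutations(generations: int) -> float:
    """Expected number of mutations on the lineage of one IBD segment."""
    meioses = 2 * generations  # meioses separating the two haplotypes
    # Simplifying assumption: expected segment length shrinks as 1 / meioses.
    expected_length_bp = 1.0 / (meioses * CROSSOVER_RATE)
    # Mutations accumulate over every base pair and every meiosis.
    return MUTATION_RATE * expected_length_bp * meioses


for g in (10, 100, 1000):
    # The number of generations cancels, leaving the ratio of the two rates.
    print(g, round(expected_mutations(g), 3))
```

Under these assumptions the expected count is simply the ratio of the mutation rate to the crossover rate, which is why it does not depend on the age of the common ancestor.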

For a pair of haplotypes drawn at random from an outbred population, most of the genome is comprised of very short segments of IBD, with a very large number of generations to the most recent common ancestor. Since each short segment contains an average of approximately one discordance caused by mutation in addition to discordances caused by gene conversion, a series of closely spaced discordances is a clear indication that the genomic interval containing the discordances is comprised of a sequence of short IBD segments. In contrast, when one observes a long segment without discordances, it is usually (depending on the population’s demographic history) highly probable that this segment is primarily comprised of a single long IBD segment resulting from recent common ancestry.

There are three primary paradigms for IBD segment detection. The first paradigm considers a pair of haplotypes to be either “IBD” or “not IBD” at each position in the genome. A hidden Markov model, with pre-determined IBD proportion and rates of transition between the IBD and non-IBD states, may be used to obtain posterior probabilities of IBD and non-IBD at each position.²⁴,²⁵,²⁶,²⁷,²⁸,²⁹ This paradigm developed out of the analysis of pedigree data and is very natural in that setting.³⁰ However, for population data with unknown relationships, the dichotomy into IBD and non-IBD is artificial and ignores the fact that each pair of haplotypes has a common ancestor at each position in the genome, although that ancestor may have lived a long time ago.

The second paradigm considers the length of shared segments. A segment is identical by descent if it is inherited from a common ancestor and exceeds a length threshold. In practice, if identity by state (IBS) sharing extends beyond some threshold, the segment is reported as identical by descent.³¹,³²,³³ This paradigm recognizes the potential existence of IBD segments that are shorter than the threshold, but does not try to find them. The threshold is typically chosen to be a length above which the accuracy of the reported segments is high.³¹,³²

The third paradigm considers two haplotypes to be identical by descent if their time to most recent common ancestor (TMRCA) is less than some specified number of generations.¹⁶,³⁴

In this work we take a different perspective. We recognize that a pair of haplotypes is, strictly speaking, identical by descent at every point in the genome. However, for any given point in the genome, the endpoints of the IBD segment containing that point are unknown (Figure 1A). It is possible that one or more discordances at the end of the segment are actually contained within the long IBD segment (Figure 1B). It is also possible that IBD ends before IBS ends, so that the end of the IBS segment is not part of the long IBD segment, but instead contains one or more neighboring short IBD segments (Figure 1C). In some cases, two or more long IBD segments in a region can be mistaken for a single long IBD segment (Figure 1D).³⁵ Our approach quantifies this uncertainty. In Results, we show that our quantification is well calibrated, and we apply our method to perform an IBD-based selection scan in the UK Biobank.

Figure 1. Uncertainty in IBD Endpoints

(A) Allele discordances between two haplotypes are represented as crosses. We wish to estimate the endpoints of the IBD segment that covers the focal position in the middle of the longest identity by state (IBS) interval.

(B–D) Three of the possibilities for the shared IBD segment that covers the focal position.

(B) The IBD segment contains the first discordance to the right of the focal position.

(C) The IBD segment does not extend all the way to the discordances and has short flanking segments of IBS.

(D) Two moderately long IBD segments are adjacent. In this case, the second IBD segment is not of direct interest because it does not cover the focal position.

Material and Methods

Overview of Method

The input data for our method are phased genotypes and candidate IBD segments. Highly accurate phased genotypes can be obtained from statistical phasing in large cohorts of accurately genotyped individuals.³⁶ The candidate IBD segments may be obtained using a length-based IBD detection method such as hap-ibd.³³ Our method estimates the posterior probability distributions of the endpoints of the candidate IBD segments and outputs quantiles and samples from these posterior distributions.

Before estimating segment endpoints, we apply a minor allele frequency (MAF) filter to the phased genotypes that excludes variants with frequency less than 0.1%. In Results we show that these rare variants are not modeled as well as the more common variants and that including these rare variants negatively impacts the accuracy of the endpoint estimates.

We model allele discordance within an IBD segment using a user-specified error rate. Analysis results are not overly sensitive to the exact choice of error rate (see Results). Discordances within a segment are assumed to occur independently except when two or more closely spaced discordances could have originated from the same gene conversion event.

We also model IBS extending beyond the end of an IBD segment. IBS segments can be comprised of multiple IBD segments. We do not try to directly model each of these IBD segments individually, but instead model the distribution of IBS segments found in the data. Short regions of IBS are modeled using the local context, because the IBS length distribution varies across the genome due to factors such as mutation rate and selection. Longer segments of IBS are modeled using chromosome-wide data because there is limited information about longer IBS segments from the local context.

We estimate the probability of the observed discordance data as a function of the IBD endpoints. We then use Bayes’ rule to obtain the probability distribution of each IBD endpoint. We work from a focal position within an IBS segment (Figure 1A) and estimate the probability distributions for the positions of the left and right endpoints of the IBD segment that covers the focal position.

Notation

We wish to estimate the endpoints of the IBD segment covering a position x₀ for a given pair of haplotypes, H₁ and H₂. All positions are measured in terms of genetic distance in Morgans, and haplotype phase is assumed to be known. In this description we are only concerned with the estimation of the right endpoint of the IBD segment covering x₀. The estimation of the left endpoint is similar. Index the markers to the right of x₀ by $1, 2, 3, \dots, M - 1$ , where $M - 1$ is the last marker on the chromosome. Let the positions (in Morgans) of these markers be $x_{1}, x_{2}, x_{3}, \dots, x_{M - 1}$ . In addition, we add a nominal position $x_{M} = x_{M - 1} + 1$ , which is located 1 Morgan beyond the last marker on the chromosome. This additional position is used to model IBD that extends beyond the end of the chromosome. We define haplotypes H₁ and H₂ to have discordant alleles at $x_{M}$ .

The IBS data, D, are the observed IBS status (identical or discordant) for the alleles on haplotypes H₁ and H₂ at the markers to the right of x₀. Let D[a,b] denote the IBS status at markers with indices $a \leq i \leq b$ .

Let $ε$ be the average proportion of discordant markers within IBD segments (the “error rate” mentioned above). We approximate $(1 - ε)$ with 1 and thus omit terms of $(1 - ε)$ in our calculations.

Modeling the IBS Data for the IBD Segment

We model the IBS data, D, from the focal point x₀ rightward as being generated by two processes. The first, up to the IBD endpoint R, requires that alleles should be identical except at a small number of discordances due to mutation, gene conversion, or genotype error. If discordances in the IBD segment are independent and the right endpoint is in the interval $(x_{i}, x_{i + 1})$ , the probability of the data in the part of the IBD segment to the right of the focal point (i.e., $P (D [1, i])$ ) is $ε^{n_{i}}$ , where $n_{i}$ is the number of discordances between the first marker after the focal point and the $i$th marker (inclusive) and factors of $(1 - ε)$ are approximated by 1.

We use Bayes’ rule to obtain the posterior distribution of R, the position of the right endpoint $(R > x_{0})$ . For each inter-marker interval $(x_{i}, x_{i + 1})$ , $i = 0, 1, \dots, M - 1$ , the probability that the right endpoint is contained in the open interval $(x_{i}, x_{i + 1})$ satisfies:

$$P (R \in (x_{i}, x_{i + 1}) \mid D) \propto P (D \mid R \in (x_{i}, x_{i + 1})) \, P (R \in (x_{i}, x_{i + 1})) = ε^{n_{i}} \, P (D [i + 1, M] \mid R \in (x_{i}, x_{i + 1})) \, P (R \in (x_{i}, x_{i + 1}))$$

(Equation 1)

where “ $\propto$ ” denotes proportionality. Normalizing the probabilities in Equation 1 to sum to 1 over the i gives the posterior probability that the endpoint occurs in each interval $(x_{i}, x_{i + 1})$ .
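The normalization in Equation 1 can be illustrated with a short sketch. It is not the ibd-ends implementation: the function name is hypothetical, and the prior and the tail probabilities $P (D [i + 1, M] \mid R \in (x_{i}, x_{i + 1}))$ are taken as already-computed inputs (their calculation is described in the appendices, not here).

```python
from typing import List, Sequence


def interval_posterior(discordant: Sequence[bool],
                       prior: Sequence[float],
                       tail_prob: Sequence[float],
                       eps: float = 0.0005) -> List[float]:
    """Posterior probability that R lies in (x_i, x_{i+1}) for i = 0..M-1.

    discordant[i - 1] is True when marker i is discordant (i = 1..M-1).
    prior[i] stands in for P(R in (x_i, x_{i+1})).
    tail_prob[i] stands in for P(D[i+1, M] | R in (x_i, x_{i+1})).
    """
    m = len(prior)
    if len(tail_prob) != m or len(discordant) != m - 1:
        raise ValueError("inconsistent input lengths")
    unnormalized = []
    n_i = 0  # number of discordances among markers 1..i
    for i in range(m):
        if i >= 1 and discordant[i - 1]:
            n_i += 1
        # Right-hand side of Equation 1 for interval i.
        unnormalized.append(eps ** n_i * tail_prob[i] * prior[i])
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Toy input: three markers to the right of the focal point, marker 2 discordant.
posterior = interval_posterior([False, True, False],
                               prior=[0.25, 0.25, 0.25, 0.25],
                               tail_prob=[0.1, 0.2, 0.3, 0.4])
print(posterior)
```

In this toy input, intervals beyond the discordant marker are down-weighted by a factor of $ε$, so most of the posterior mass falls in the intervals before it.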

The probability $P (R \in (x_{i}, x_{i + 1}))$ is a prior probability for the position of the right endpoint, given the position L₀ of the left endpoint (see Iterative Updating of Endpoints and Focal Point below). We model the population as having constant effective size 10,000 to obtain these probabilities. Details are given in Appendix A.

The remaining component in Equation 1 is the probability $P (D [i + 1, M] | R \in (x_{i}, x_{i + 1}))$ , which is the probability of the IBS data to the right of the right IBD endpoint. We model this by considering the points of discordance as being the points of renewal in a renewal process. That is, we obtain probabilities of the length of each segment that contains all of the non-discordant positions up to and including the next discordance, with each such segment being treated as independent. The probability of each of these segments is obtained empirically from the observed data. Details are given in Appendix B.

Appendix C describes how to obtain the posterior cumulative distribution function for the endpoint from the interval probabilities given in Equation 1.

Iterative Updating of Endpoints and Focal Point

In the preceding section, we assumed that the left endpoint L₀ was known, but it also needs to be estimated, and we do this estimation iteratively. We start by using the left endpoint of the input candidate IBD segment as the value of L₀. After estimating the posterior distribution of the right endpoint, we use this distribution to obtain a new “right endpoint” R₀ that is set equal to the 5th percentile of this distribution. Percentiles are referenced by distance from the focal point x₀. Thus, small percentiles are located closer to the focal point than larger percentiles. This choice of percentile is conservative (it reduces the estimated length of the IBD segment for the purpose of these calculations) and thus is likely to speed the convergence of the iterative approach. After estimating the right endpoint distribution, we estimate the left endpoint using the newly calculated value of R₀ and obtain a new value of L₀ as the 5th percentile of the left endpoint distribution. Then if L₀ − x₀ is significantly altered (>10% change in length) from the previous value, we use the new value of L₀ to re-estimate the right endpoint distribution and obtain a new value of R₀. If R₀ is significantly altered from the previous value, we use the new value of R₀ to re-estimate the left endpoint, and so on. Whenever we change the value of L₀ or R₀, we update the focal point x₀, which is located half-way between L₀ and R₀ in base coordinates. We perform a maximum of 10 updates of each endpoint. In order to prevent the focal point from moving outside the input candidate IBD segment, we constrain L₀ and R₀ to stay within the input candidate IBD segment.
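The alternating updates can be sketched as follows. This is one possible reading of the procedure, not the ibd-ends code: the two callbacks that return the 5th percentile of an endpoint distribution are hypothetical placeholders, positions are in base pairs, and the >10% criterion is interpreted as a change in the endpoint's distance from the current focal point.

```python
from typing import Callable, Tuple

# Hypothetical callback: given the fixed opposite endpoint and the focal point
# (both in base pairs), return the 5th percentile of the endpoint being estimated.
FifthPercentile = Callable[[float, float], float]


def refine_endpoints(cand_left: float, cand_right: float,
                     right_p05: FifthPercentile, left_p05: FifthPercentile,
                     max_updates: int = 10,
                     rel_tol: float = 0.10) -> Tuple[float, float, float]:
    """Alternate right and left updates; return (L0, R0, focal point)."""

    def clamp(pos: float) -> float:
        # Keep endpoints inside the input candidate segment.
        return min(max(pos, cand_left), cand_right)

    def moved(old: float, new: float, focal: float) -> bool:
        # Assumed reading of ">10% change in length".
        return abs(new - old) > rel_tol * abs(old - focal)

    left, right = cand_left, cand_right
    focal = 0.5 * (left + right)
    for update in range(max_updates):
        new_right = clamp(right_p05(left, focal))
        right_moved = update == 0 or moved(right, new_right, focal)
        right = new_right
        focal = 0.5 * (left + right)  # midpoint in base coordinates
        if not right_moved:
            break
        new_left = clamp(left_p05(right, focal))
        left_moved = moved(left, new_left, focal)
        left = new_left
        focal = 0.5 * (left + right)
        if not left_moved:
            break
    return left, right, focal


# Toy callbacks that always return the same percentile positions.
print(refine_endpoints(0.0, 1000.0,
                       right_p05=lambda left, focal: 900.0,
                       left_p05=lambda right, focal: 100.0))
```

With these constant toy callbacks the loop stops on the second pass, because the re-estimated right endpoint no longer changes.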

Estimation of Error Rate

After running the endpoint estimation algorithm, our ibd-ends software estimates the error parameter $ε$ by measuring the rate of discordant alleles within inferred IBD segments. The procedure is as follows. For each IBD segment that has been analyzed with the endpoint estimation algorithm, take the interval bounded by the posterior 5th percentile of the left and right endpoint distributions. We use the 5th percentiles so that we are unlikely to include alleles beyond the end of the IBD segment. If the length of this interval is <2 cM, ignore the segment. Within the genomic region bounded by these endpoints, examine the alleles on the two IBD haplotypes. Count the number of mismatches and the total number of positions examined. Across all segments, report the total number of mismatches, divided by the sum of the number of positions examined in each segment. If the estimated error rate does not differ significantly from the error rate used in the analysis (e.g., less than a 3-fold difference), it is not necessary to re-run the analysis with the new value (see Results). For a large study, a pilot analysis on a small chromosome can be used to determine the error rate that should be used in the full analysis.
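A minimal sketch of this pooled estimate is given below. It assumes that the per-segment mismatch and position counts have already been tabulated within the 5th-percentile interval; the record type and function names are illustrative and are not part of ibd-ends.

```python
from typing import Iterable, NamedTuple, Optional


class SegmentSummary(NamedTuple):
    length_cm: float  # length of the interval bounded by the 5th-percentile endpoints
    mismatches: int   # discordant alleles inside that interval
    positions: int    # markers examined inside that interval


def estimate_error_rate(segments: Iterable[SegmentSummary],
                        min_cm: float = 2.0) -> Optional[float]:
    """Pooled mismatch rate over segments whose trimmed interval is >= min_cm."""
    mismatches = 0
    positions = 0
    for seg in segments:
        if seg.length_cm < min_cm:
            continue  # short intervals are ignored
        mismatches += seg.mismatches
        positions += seg.positions
    return mismatches / positions if positions else None


def differs_by_fold(estimated: float, used: float, fold: float = 3.0) -> bool:
    """True when the estimate is at least `fold` times larger or smaller."""
    ratio = estimated / used
    return ratio >= fold or ratio <= 1.0 / fold


toy = [SegmentSummary(3.0, 2, 4000),
       SegmentSummary(1.5, 50, 100),   # skipped: shorter than 2 cM
       SegmentSummary(2.5, 1, 2000)]
print(estimate_error_rate(toy))
```

The counts are pooled across segments rather than averaged per segment, so longer segments contribute in proportion to the number of positions they contain.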

Modified Error Rate to Account for Gene Conversion

Gene conversions copy material from one haplotype to the other during meiosis and can thus result in discordant alleles between IBD haplotypes. The typical length of a gene conversion tract is around 300 base pairs.³⁷ Changes will occur only at positions at which the individual in whom the gene conversion occurred was heterozygous. Thus, many gene conversions have no effect on allele discordance, but some gene conversions can result in more than one allele discordance occurring in proximity. Since these discordances are not independent events, we do not include an error term $ε$ for each one, since that would be overly harsh and tend to result in premature truncation of the IBD segment. Instead, when more than one discordance occurs within 1 kb, we apply the error rate $ε$ for the first discordance, and a less severe gene conversion error rate of $ε'$ for each successive discordance within 1 kb of the first discordance (by default, $ε = 0.0005$ and $ε' = 0.1$ ).
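The effect of the modified error rate on the likelihood can be illustrated with a small sketch. The function below is hypothetical and simplified: it takes the base-pair positions of the discordances inside a putative IBD segment and treats "within 1 kb" as a distance of at most 1,000 bp from the first discordance of a cluster, which is an assumption about the boundary case.

```python
from typing import Iterable


def discordance_likelihood(positions_bp: Iterable[int],
                           eps: float = 0.0005,
                           eps_gc: float = 0.1,
                           window_bp: int = 1000) -> float:
    """Product of error terms for the discordances inside an IBD segment."""
    likelihood = 1.0
    cluster_start = None  # position of the first discordance in the current cluster
    for pos in sorted(positions_bp):
        if cluster_start is not None and pos - cluster_start <= window_bp:
            # Successive discordance near the first one: possible gene conversion.
            likelihood *= eps_gc
        else:
            # Isolated discordance, or the first discordance of a new cluster.
            likelihood *= eps
            cluster_start = pos
    return likelihood


clustered = discordance_likelihood([10_000, 10_300, 10_900, 50_000])
independent = 0.0005 ** 4  # the same four discordances, each penalized by eps
print(clustered, independent)
```

In this toy example the three clustered discordances are penalized far less than three independent errors would be, which is what prevents premature truncation of the segment.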

Analysis Pipeline

Our software, ibd-ends, requires the input of candidate segments for which endpoints will be evaluated. In this work, we use hap-ibd³³ to find the candidate segments. For many applications, one wishes to assess endpoint uncertainty for all IBD segments that exceed a length threshold. In that case, the key consideration is to avoid false negatives when detecting candidate IBD segments. If a potential IBD segment is not included in the input data to ibd-ends, it will not be included in the results. False positives (candidates for which the true IBD segment is actually shorter than the threshold) are less serious—they increase compute time but will be shown to be unlikely to be true long IBD segments when the segment endpoints are estimated. Thus, one should try to cast a wide net when identifying candidate IBD segments.

When analyzing sequence data, the high density of variants and the presence of genotype error can cause a high rate of discordances between IBD haplotypes. The hap-ibd method permits some discordances in a segment. It does so by finding seed IBS segments that exceed a certain length, and then extending these segments if there is another IBS segment that exceeds a minimum extension length and that is separated from the seed segment by a short non-IBS gap. One way to apply hap-ibd to sequence data is to reduce the seed and extension lengths, which effectively increases the permitted density of discordances, but this can significantly increase computation time. Here we take a different approach. We reduce the marker density of the sequence data for the hap-ibd analysis (but not for the ibd-ends analysis). We choose a minimum MAF for hap-ibd that reduces the marker density to approximately that of a 600k SNP array, and we apply this threshold using hap-ibd’s “min-mac” parameter. We retain the highest MAF variants because these are the most informative for detecting the candidate segments. This approach greatly reduces the density of variants, and thus reduces the number of IBD segments that would otherwise go undetected due to genotype error.

Except as otherwise noted, genotype data were phased using Beagle 5.1,³⁸ input IBD segments for ibd-ends were obtained using hap-ibd,³³ and default parameters were used for all programs.

Simulation Overview

We generated three sets of simulated data to investigate three conditions under which estimation of endpoints could be challenging: gaps in marker coverage, non-constant population size, and heterogeneous samples.

We added genotype error to the simulated data for each marker at a rate equal to the minimum of 0.02% and one-half the MAF for the marker. This error rate produces a discordance rate of 0.04% in markers with MAF > 0.04%, which matches the discordance rate seen in the TOPMed data³⁹ and is six times higher than the 0.0067% discordance rate seen in the UK Biobank data.⁴⁰ We also wanted to confirm that the method produces accurate results with higher error rates, so we added error to one dataset at a rate equal to the minimum of 0.1% and one-half the MAF for the marker.
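The relationship between the per-genotype error rate and the resulting discordance rate can be made explicit with a short calculation. The sketch assumes that a comparison between two haplotypes is affected by an error in either of the two underlying genotypes, so that the discordance rate is roughly twice the per-genotype error rate; the function names are illustrative.

```python
def genotype_error_rate(maf: float, cap: float = 0.0002) -> float:
    """Per-marker error rate: the minimum of the cap (0.02%) and one-half the MAF."""
    return min(cap, 0.5 * maf)


def discordance_rate(maf: float, cap: float = 0.0002) -> float:
    """Approximate discordance rate, assuming errors in either haplotype count."""
    return 2.0 * genotype_error_rate(maf, cap)


for maf in (0.0001, 0.0004, 0.01, 0.3):
    print(f"MAF {maf:.4%}: discordance {discordance_rate(maf):.4%}")

# Ratio of the simulated discordance rate to the UK Biobank rate quoted in the text.
print(round(discordance_rate(0.01) / 0.000067, 1))
```

For markers with MAF below 0.04% the cap is not reached, so the error rate scales with the MAF instead.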

True IBD segments of length ≥c cM were obtained from the ancestral recombination graph. For a pair of haplotypes, we note their most recent common ancestor (MRCA) at grid-point positions located every $c / 10$ cM along the simulated chromosome. If the same MRCA is found at ≥7 grid-point positions, with gaps comprising another MRCA (potentially due to gene conversion) occurring only at isolated (i.e., not adjacent) grid-point positions, we consider the pair to be identical by descent across the region. We then check all MRCAs at positions located near the first and last of the grid-points having the common MRCA to determine the exact endpoint of the IBD segment. We discard any such segments with length <c cM.

Simulation of Constant Size Population with Variable Marker Density

We generated 60 Mb of data for 2,000 individuals from a population with constant size of 10,000 diploid individuals. The recombination rate is 1 × 10⁻⁸ per base pair per meiosis. During the most recent 5,000 generations, gene conversion initiations occurred at a rate of 2 × 10⁻⁸ and gene conversion tracts had a mean length of 300 base pairs. We used SLiM v.3.3 to simulate the past 5,000 generations⁴¹ and msprime to add mutations and simulate the more distant past.⁴²,⁴³ The mutation rate varied along the simulated region, with a new mutation rate each 100 kb that was uniformly distributed between 0 and 3 × 10⁻⁸ per bp per meiosis. In addition, we made a 3 Mb gap by removing genetic markers between positions 20 and 23 Mb, to represent a centromeric region. We added genotype error at a rate of 0.02% as described above. We used the simulated ancestral recombination graph to determine the true endpoints of IBD segments of length >1 cM for all pairs of individuals within a subset of 500 individuals so that we could evaluate the accuracy of the inferred IBD segment endpoints.

We used the true (simulated) haplotype phase, including any alleles changed by the addition of genotype error, for all analyses of these data. When detecting candidate IBD segments with hap-ibd, we used a minor allele count threshold of 1,700 (minor allele frequency of 0.425; see Analysis Pipeline), resulting in 10,760 markers after excluding markers in the 3 Mb gap region, which corresponds to a mean density of one marker per 5.3 kb in the remaining 57 Mb. All markers with MAF > 0.1% (241,010 markers) were used in the ibd-ends analysis.

Simulation of UK-like Population with Non-constant Size

We generated 60 Mb of data for 50,000 individuals from a UK-like population. These simulated data have been described previously.³³ The demographic model has a population size of 24,000 in the distant past, a reduction to 3,000 occurring 5,000 generations ago, growth at rate 1.4% per generation starting 300 generations ago, and growth at rate 25% beginning 10 generations ago. The mutation rate is 1.3 × 10⁻⁸ per base pair per meiosis, while the recombination rate is 1 × 10⁻⁸ per base pair per meiosis. During the most recent 5,000 generations, gene conversion initiations occurred at a rate of 2 × 10⁻⁸ and gene conversion tracts had a mean length of 300 base pairs. We used SLiM v.3.3 to simulate the most recent 5,000 generations⁴¹ and msprime to add mutations and simulate the more distant past.⁴²,⁴³ We generated two copies of the data: one with 0.02% added genotype error and one with 0.1% added genotype error as described above. We also created a SNP-array version of the data with 0.02% genotype error and 10,000 randomly selected markers with minor allele frequency >5% (1 marker per 6 kb on average, corresponding to approximately 500k markers genome-wide). We used the simulated ancestral recombination graph to determine the true endpoints of IBD segments of length >1 cM for all pairs of individuals within a subset of 1,000 individuals so that we could evaluate the accuracy of the inferred IBD segment endpoints.

When applying hap-ibd to the UK-like sequence data to provide candidate IBD segments for ibd-ends, we applied a minor allele frequency threshold of 0.45 (see Analysis Pipeline), resulting in 11,524 markers across the 60 Mb with a mean density of one marker per 5.2 kb. All markers with MAF > 0.1% (198,566 markers) were used in the ibd-ends analysis of these candidate IBD segments.
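The marker densities quoted for the UK-like data follow directly from the marker counts and the length of the simulated region. The short calculation below reproduces them; the genome length of 3,000 Mb used for the genome-wide extrapolation is an assumption made here for illustration, and the function names are hypothetical.

```python
def mean_spacing_kb(region_mb: float, n_markers: int) -> float:
    """Mean distance between markers, in kb."""
    return region_mb * 1000.0 / n_markers


def genome_wide_markers(spacing_kb: float, genome_mb: float = 3000.0) -> float:
    """Marker count at the same density across a genome of the assumed length."""
    return genome_mb * 1000.0 / spacing_kb


# Sequence data thinned with the 0.45 MAF threshold for the hap-ibd analysis.
print(round(mean_spacing_kb(60, 11_524), 1))

# SNP-array version: 10,000 markers in 60 Mb, extrapolated genome-wide.
array_spacing = mean_spacing_kb(60, 10_000)
print(array_spacing, genome_wide_markers(array_spacing))
```

The same calculation can be used to choose a MAF or minor allele count threshold that thins sequence data to roughly the density of a SNP array.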

We also ran hap-ibd on the UK-like sequence data with the hap-ibd parameters that are suggested for sequence data (min-seed = 1.0, min-extend = 0.2, and a minor allele frequency filter of 10%)³³ for comparison with the ibd-ends analysis.

Simulation of a Heterogeneous Population

We simulated 10 Mb of data for 500 individuals of African-like ancestry and 500 individuals of European-like ancestry. The demographic history is the two-population model of Tennessen et al.,⁴⁴,⁴⁵ implemented in stdpopsim.⁴⁶ The combined sample represents an ancestrally heterogeneous population, which violates the assumption that all pairs of individuals in the sample have the same distribution of IBS segment lengths. The recombination rate and mutation rate are both 1 × 10⁻⁸ per base pair per meiosis. We did not include gene conversion in the simulation. We simulated the data with msprime,⁴² and we added genotype error at a rate of 0.02% as described above. We used the simulated ancestral recombination graph to determine the true endpoints of IBD segments of length >0.5 cM for all pairs of individuals so that we could evaluate the accuracy of the inferred IBD segment endpoints.

When applying hap-ibd to these data, we applied a MAF threshold of 0.35 (see Analysis Pipeline), resulting in 2,215 markers across the 10 Mb with a mean density of one marker per 4.5 kb. Although we used default settings for hap-ibd in other analyses (except as noted), in this analysis we set the hap-ibd min-seed and min-output parameters to 1 cM since there are very few IBD segments of length >2 cM in these data. All markers with MAF > 0.1% were used in the ibd-ends analyses (48,074 markers).

UK Biobank Data

We phased QC-filtered UK Biobank data (487,373 individuals) using Beagle 5.1,³⁸ and then used hap-ibd with default settings to find candidate IBD segments among 408,883 white British individuals identified by the UK Biobank.⁴⁰ We ran ibd-ends with default settings on the candidate IBD segments from the white British individuals to estimate the uncertainty in the endpoints of these IBD segments. We used Bherer et al.’s European genetic map which is based on family data from Iceland and other European populations.⁴⁷ Variants located outside the bounds of the map are excluded from the ibd-ends analyses because extrapolated cM positions for markers outside the map can differ significantly from their true cM positions, leading to substantial under- or over-estimation of IBD segment lengths.

Results

Simulated Data with Variable Marker Density

Figure 2 shows the results of ibd-ends analysis on the simulated sequence data with constant population size, variable marker density, and a 3 Mb gap in marker coverage (see Material and Methods for details). Even with the large gap and uneven marker density, the endpoint uncertainty is well calibrated (Figure 2A), and coverage of IBD segments across the simulated region is even except for dips in coverage across the gap and at the chromosome ends (Figure 2B). 61% of sampled endpoints are located within 5 kb of the true endpoint, as are 72% of posterior median endpoints.

Figure 2. Method Performance with Uneven Marker Density

Sequence data on 2,000 individuals were simulated under a constant effective population size. Markers located between 20 and 23 Mb were removed, and marker density varies every 100 kb (see Material and Methods). The true haplotype phase is used in the analysis.

(A) Quantile-quantile plot assessing the calibration of the estimated endpoint uncertainty. The actual quantile (y axis) corresponding to a given nominal quantile (x axis) is the proportion of segments for which the reported nominal quantile of the right endpoint is greater than the true right endpoint (points on the plot). The $y = x$ line is shown for comparison. Results for the left endpoints are similar but are not shown.

(B) The y axis is the IBD rate, which is the percentage of pairs of haplotypes for which the position on the chromosome is covered by an estimated IBD segment with length >2 cM for the haplotype pair. Estimated IBD segment endpoints are the posterior medians. The IBD rate is calculated at 10 kb intervals.
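The calibration check in panel (A) reduces to a simple proportion for each nominal quantile. The sketch below shows that calculation on toy numbers; the function name and the inputs are illustrative and do not come from the analysis code.

```python
from typing import Sequence


def actual_quantile(reported: Sequence[float], truth: Sequence[float]) -> float:
    """Proportion of segments whose reported quantile exceeds the true endpoint.

    reported[k] is the position of a given nominal quantile of the right
    endpoint for segment k; truth[k] is the true right endpoint of segment k.
    """
    if len(reported) != len(truth) or not reported:
        raise ValueError("inputs must be non-empty and of equal length")
    exceed = sum(1 for r, t in zip(reported, truth) if r > t)
    return exceed / len(reported)


# Toy example: four segments, all with a true right endpoint at position 100.
print(actual_quantile([105.0, 98.0, 110.0, 101.0], [100.0, 100.0, 100.0, 100.0]))
```

For well-calibrated uncertainty, the value returned for the reported 0.75 quantiles should be close to 0.75, which places the point near the $y = x$ line.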

The ibd-ends analyses of these data used the default error rate of 0.0005. The error rate estimated by ibd-ends was 0.00039.

UK-like Simulated Data

Figure 3 shows calibration and distribution of uncertainty for the simulated UK-like data with a 0.02% genotype error rate. The estimated uncertainty is well calibrated (upper row of Figure 3), even when using inferred haplotype phase and when using data thinned to represent a 500k SNP array. Calibration is a little better with the inferred-phase SNP array data than with the inferred-phase sequence data, which may be because phasing of common variants is more accurate than that of rare variants. As expected, average endpoint uncertainty, as measured by the difference between the endpoint sampled from the uncertainty distribution and the median endpoint, is much higher when analyzing SNP array data than when analyzing full sequence data (lower row of Figure 3, right versus left and middle columns). Results are also well calibrated with smaller sample sizes (200 or 1,000 individuals; Figure S1). We performed further analyses to evaluate endpoint estimation accuracy with a higher genotype error rate (Figure S2) and with different MAF thresholds (Figure S3). The results are well calibrated with the higher genotype error rate (0.1%) when using true haplotype phase. When using inferred haplotype phase, the higher genotype error rate results in phase errors that reduce accuracy. The results are not particularly sensitive to the choice of MAF threshold, but some miscalibration is observed when very rare variants are included (0.01% MAF), and uncertainty increases when a high MAF threshold is used.

Figure 3. Method Performance on UK-like Simulated Sequence and SNP Array Data

The data comprise 50,000 individuals simulated from a UK-like demographic history (see Material and Methods), with a genotype error rate of 0.02%. True IBD segment endpoints were determined for 1,000 individuals, and these individuals were used to generate the results in this figure. The top row shows quantile-quantile plots that assess the calibration of the estimated endpoint uncertainty. The $y = x$ line is shown for comparison. The actual quantile (y axis) corresponding to a given nominal quantile (x axis) is the proportion of segments for which the reported nominal quantile of the right endpoint is greater than the true right endpoint. The bottom row shows histograms of the right endpoint sampled from the estimated posterior distribution minus the posterior median right endpoint. The histograms represent the distribution of uncertainty, averaged over segments. Histogram bin widths are 5 kb. Results for the left endpoints are similar but are not shown. The left column is for analysis using the true haplotype phase. The middle column is for analysis using haplotype phase inferred using Beagle 5.1. The right column is for data thinned to match a SNP array with 500,000 markers genome-wide (10,000 markers in the simulated 60 Mb interval), and with haplotype phase inferred using Beagle 5.1.
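The calibration measure described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce Figure 3: the function name `actual_quantiles` and the toy data are assumptions, and the toy posterior is a normal distribution chosen only because its quantiles are exact by construction.

```python
import numpy as np
from statistics import NormalDist

def actual_quantiles(reported, truth):
    """Proportion of segments whose reported nominal quantile exceeds the true endpoint.

    reported: array of shape (n_segments, n_levels), one column per nominal level
    truth:    array of shape (n_segments,), the true right endpoints
    """
    reported = np.asarray(reported, dtype=float)
    truth = np.asarray(truth, dtype=float)[:, None]
    return (reported > truth).mean(axis=0)

# Toy check with a posterior that is correct by construction (hypothetical data):
# each true endpoint is drawn from Normal(center, sd), and the reported quantiles
# are the exact quantiles of that same distribution.
rng = np.random.default_rng(0)
nominal = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
z = np.array([NormalDist().inv_cdf(q) for q in nominal])
center = rng.uniform(10.0, 50.0, size=20_000)   # arbitrary positions
sd = 0.05                                        # arbitrary spread
truth = rng.normal(center, sd)
reported = center[:, None] + sd * z[None, :]
actual = actual_quantiles(reported, truth)       # close to `nominal` when calibrated
```

Plotting `actual` against `nominal` gives a quantile-quantile plot of the kind described above; points on the $y = x$ line indicate good calibration.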

In a recent analysis, hap-ibd and GERMLINE gave the highest accuracy among competing IBD segment detection methods on sequence data,³³ so we compared the precision of estimated IBD segment endpoints between hap-ibd and ibd-ends for data with true haplotype phase (Figure S4). Using recommended parameters for sequence data for hap-ibd (min-output = 2.0, min-seed = 1.0, min-extend = 0.2, and a minimum minor allele frequency of 10%),³³ 22% of estimated endpoints were more than 50 kb from the true value when the genotype error rate was 0.02% and 43% were more than 50 kb from the true value when the genotype error rate was 0.1%; the corresponding percentages with ibd-ends were 10% and 12%. Further, hap-ibd missed many segments when the error rate was high. Only 58% of the IBD segments with true simulated length >3 cM were found by hap-ibd with the recommended parameters when the genotype error rate was 0.1%. Increasing the minor allele frequency to a high value, as we did when running hap-ibd to provide candidate segments to ibd-ends, reduces loss of IBD segments. With hap-ibd with a high minor allele frequency threshold (45%), 88% of the segments with true length >3 cM were found when the genotype error rate was 0.1%. However, using hap-ibd with a high minor allele frequency filter without estimating the segment endpoints with ibd-ends significantly reduces precision (Figure S4).

We found that using an analysis error rate that is up to three times higher or lower than the estimated error rate gave well-calibrated results (Figure S5). When using an analysis error rate that is outside this range, the estimated error rates produced by ibd-ends are within the range of error rates that will provide good results in a subsequent analysis. For example, analysis of 1,000 UK-like simulated samples using an analysis error rate of 5 × 10⁻⁵ (ten times lower than optimal) led to an estimated error rate of 4.2 × 10⁻⁴, while using an analysis error rate of 5 × 10⁻³ (ten times higher than optimal) led to an estimated error rate of 6.2 × 10⁻⁴, with both estimated error rates being close to the optimal rate for these data of around 5 × 10⁻⁴. If the estimated error rate differs from the analysis error rate by more than 3-fold, we recommend repeating the analysis using the estimated rate. Pilot results on a small chromosome can be used to determine whether the analysis error rate needs to be changed from the default value. When we initially ran the analysis of the SNP-array data using the default error rate of 0.0005, we obtained an estimated error rate of 0.00016, which is 4-fold lower, so we re-ran the analysis with this new rate, and the results shown in Figure 3 reflect this second analysis.
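The 3-fold rule in this paragraph can be written as a small check. The function name `needs_rerun` and its `max_fold` parameter are hypothetical and are not part of the ibd-ends interface; the sketch only encodes the recommendation stated in the text.

```python
def needs_rerun(analysis_rate, estimated_rate, max_fold=3.0):
    """Return True if the estimated error rate differs from the analysis
    error rate by more than `max_fold` in either direction."""
    fold = max(analysis_rate / estimated_rate, estimated_rate / analysis_rate)
    return fold > max_fold

# Values quoted in the text
snp_array = needs_rerun(0.0005, 0.00016)   # thinned SNP-array data
sequence = needs_rerun(0.0005, 0.00039)    # simulated sequence data
```

When the check returns True, the text recommends repeating the analysis with the estimated rate as the new analysis error rate.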

Compute times for ibd-ends analyses with default settings were 1.3 h for the full UK-like data with 50,000 individuals (17 million IBD segments) and 0.5 h for 1,000 individuals (7,000 IBD segments) using a 24-core compute node with 24 Intel Xeon Silver 4214 2.2 GHz processors and 382 GB of memory.

Heterogeneous Simulated Data

In the heterogeneous simulation, half of the simulated individuals are from a population with an African demographic history, while the other half are from a population with a European demographic history (see Material and Methods). Analyzing these data together violates the assumption that all pairs of individuals have the same distribution of IBS segment lengths, but the results are not excessively mis-calibrated (Figure S6A). For example, 18% of the true endpoints are closer to the center of the IBD segment than the nominal 10th percentile. When analyzing the African individuals separately (Figure S6B) or the European individuals separately (Figure S6C), the results are well calibrated.

UK Biobank Data

We determined the endpoint uncertainty distributions of 77.7 billion candidate autosomal IBD segments of length 2 cM or larger that hap-ibd detected in the 408,883 white British UK Biobank individuals. In the downstream analyses, we use the posterior medians to define the IBD segment endpoints. Analysis was parallelized by chromosome. Total wall clock computing time across all chromosomes was 7.5 h for hap-ibd and 144 h for ibd-ends using a 24-core Intel Xeon Silver 4214 2.2 GHz compute node. The estimated error rates from each chromosome varied from 0.00027 to 0.00034 when using the default analysis error rate of 0.0005.

There were 54.7 billion ibd-ends IBD segments with length >2 cM. Every 10 kb along each chromosome, we computed the number of IBD segments with length >2 cM covering the position (Figure 4). The IBD rate is the number of IBD segments covering a position divided by the number of haplotype pairs. Each individual contributes two haplotypes, and all haplotype pairs are considered except those pairs within the same individual. A high rate of IBD at a position is a signal of possible recent strong natural selection.¹⁴^,¹⁵^,¹⁶
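The IBD rate defined in this paragraph can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' implementation: the function names are hypothetical, segments are assumed to have already been filtered to length >2 cM, and segment endpoints are treated as inclusive base-pair coordinates.

```python
import numpy as np

def n_haplotype_pairs(n_individuals):
    """All pairs among 2n haplotypes, excluding the n within-individual pairs."""
    h = 2 * n_individuals
    return h * (h - 1) // 2 - n_individuals

def ibd_rate_percent(starts_bp, ends_bp, chrom_length_bp, n_individuals, step_bp=10_000):
    """IBD rate (in percent) on a grid of positions spaced `step_bp` apart."""
    grid = np.arange(0, chrom_length_bp + 1, step_bp)
    starts = np.sort(np.asarray(starts_bp))
    ends = np.sort(np.asarray(ends_bp))
    n_started = np.searchsorted(starts, grid, side="right")  # segments with start <= x
    n_ended = np.searchsorted(ends, grid, side="left")       # segments with end < x
    coverage = n_started - n_ended                           # segments covering x
    return grid, 100.0 * coverage / n_haplotype_pairs(n_individuals)

# Toy example: 3 individuals (12 eligible haplotype pairs) and two segments
grid, rate = ibd_rate_percent([0, 10_000], [25_000, 40_000], 50_000, n_individuals=3)
```

Sorting the start and end coordinates separately lets the coverage at every grid point be obtained with two binary searches, which avoids looping over segments for each position.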

Rate of IBD Segments along the Autosomes in UK Biobank White British Data

The x axis shows position along each chromosome. Chromosomes alternate in color. Notable genes and regions (*LCT*, MHC, OAS, and *TRPM1*) located within the four highest peak regions are labeled. The y axis is the IBD rate, which is the percentage of pairs of haplotypes for which the position on the chromosome is covered by an IBD segment with length >2 cM for the haplotype pair. IBD segment endpoints are posterior medians. The IBD rate is calculated at 10 kb intervals. The black dashed lines show the thresholds of 0.025% and 0.021% used for the results in Tables 1 and 2, respectively.

The median IBD rate is 0.0132%. The standard deviation of the IBD rate varies by chromosome from 0.0020% to 0.019%; the chromosome with the highest standard deviation is chromosome 2 because of the large spike in IBD rate around LCT (MIM: 603202) (see Figure 4), and the median of the per-chromosome standard deviations is $\sigma_{\mathrm{med}} = 0.0027\%$. There are 12 regions with an IBD rate higher than 0.025% (Table 1), which is more than $4 \times \sigma_{\mathrm{med}}$ above the median IBD rate. Eleven of the twelve regions are known to be undergoing significant levels of selection, indicating the success of this approach in finding real signals of selection. Four of the regions have been shown to have adaptive introgression from Neanderthals (OAS locus, CCR9/CXCR6 [MIM: 604738, 605163], TLR1/6/10 [MIM: 601194, 605403, 606270], Type II Keratins). Five of the regions of selection play a role in immunity (MHC locus, OAS locus, CCR9/CXCR6, TLR1/6/10, PRDM1 [MIM: 603423]). Other regions are involved in nutrition (LCT), skin and hair traits (SLC45A2 [MIM: 606202], Type II Keratins, and TRPM1 [MIM: 603576]), and fertility (MAPT [MIM: 157140] inversion). In all cases, the genes discussed below are completely within the region for which the IBD rate is within 80% of the corresponding peak rate (column 3 of Table 1).
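As a worked check of the threshold, assuming the standard deviations are expressed in the same percentage units as the IBD rate, the value four median standard deviations above the median IBD rate is

$$0.0132\% + 4 \times 0.0027\% = 0.0240\% < 0.025\%,$$

so every region that exceeds the 0.025% threshold lies more than four median standard deviations above the median rate.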

Table 1.

Regions of Highest IBD Rate in UK Biobank White British Analysis

Chromosome	Peak position (Mb)	Region (Mb)^a	Max IBD %	Notes^b
2	135.78	135.05–138.42 (2q21-22)	0.1702	LCT (136.55–136.59) is known to be subject to recent positive selection⁴⁸
3	47.60	45.18–53.19 (3p21)	0.0264	chemokine receptor genes including CCR9 (45.93–45.94) and CXCR6 (45.98–45.99); this region shows adaptive introgression from Neanderthals⁴⁹
3	123.44	122.36–124.69 (3q21)	0.0251	ADCY5 (123.00–123.17) is under long-term balancing selection⁵⁰
4	38.84	38.25–40.18 (4p14)	0.0270	Toll-like receptor genes including TLR1, TLR6, and TLR10 (38.77–38.86); this locus is known to be under positive selection⁵¹
5	33.89	32.66–34.57 (5p13)	0.0262	SLC45A2 (33.94–33.98, previously known as MATP) is a pigmentation gene that has undergone recent positive selection in Europe⁵²
6	24.87	24.01–28.96	0.0267	MHC locus containing HLA genes (28.48–33.45) is known to be subject to strong selection⁵³
6	33.96	32.70–36.27 (6p21-22)	0.0298
6	106.33	105.97–107.31 (6q21)	0.0271	PRDM1 (106.53–106.56); this locus shows a signal of recent selection⁵⁴
12	53.59	52.62–54.60 (12q13)	0.0260	Type II Keratins (KRT1-8; 52.63–53.32); this locus has experienced adaptive introgression from Neanderthals⁵⁵
12	113.46	111.64–114.11 (12q24)	0.0316	OAS locus (OAS1-3) (113.34–113.45); this locus has experienced adaptive introgression from Neanderthals⁵⁶
15	31.59	30.76–32.58 (15q13)	0.0542	TRPM1 (31.29–31.45), a pigmentation gene under selection in non-Africans⁵⁷
17	44.75	42.17–45.88 (17q21)	0.0274	17q21.31 MAPT inversion (43.44–44.85);^c H2 form has been positively selected in Europe⁵⁸
22	22.56	21.68–22.99 (22q11)	0.0267	UBE2L3 (21.90–21.98) is associated with multiple auto-immune diseases⁵⁹


Regions in which the maximum IBD percentage is at least 0.025% are shown. Positions are in GRCh37 coordinates.

^a Region in which the IBD rate is at least 80% of the value at the peak.

^b See the main text for further notes and references. Genes listed in this column are contained within the region given in the third column.

^c Positions from Zody et al.⁶⁰ lifted over from build 36 (chromosome 17: 40.80–42.20 Mb) to build 37.

The highest selection signal, with an IBD rate of 0.17%, comes from a chromosome 2 region containing LCT which has a variant selected for lactose tolerance in Europeans.⁴⁸ The selected variant is thought to have arisen, or at least begun to increase in frequency, around the time of the advent of cattle farming in Europe, around 7,500 years ago,⁶¹ and selection has been so strong that the selected variant allele frequency is now around 75% in individuals of British descent.⁶² Since IBD is affected by recent selection, it is not surprising that this signal is most prominent.

In contrast, the immunity-related HLA genes in the major histocompatibility complex (MHC) on chromosome 6 have been under selection over a much longer time period,⁵³^,⁶³ and this region has a much lower peak IBD rate (0.030%) than the LCT region. Various sub-regions within the MHC appear to have been subject to adaptive introgression from Neanderthals and Denisovans, but it is difficult to be certain because long-term balancing selection across the region can produce signals that look like adaptive introgression.⁶⁴

The second-highest signal, with an IBD rate of 0.054%, comes from a chromosome 15q13 region containing TRPM1, a pigmentation gene that has been shown to have been subject to selection in non-Africans.⁵⁷^,⁶⁵

The high IBD rate region on chromosome 12q24 (0.032% IBD rate) encompasses the OAS locus (OAS1-3 [MIM: 164350, 603350, 603351]) which is involved in immunity.⁶⁶ This locus has a Neanderthal haplotype present at high frequency in non-Africans that has been subject to positive selection.⁵⁶

The high IBD rate region on chromosome 17q21 (0.027% IBD rate) encompasses the 17q21.31 MAPT inversion, for which the H2 form has undergone positive selection in Europe and is associated with increased fertility and higher recombination rates in females in Iceland.⁵⁸

The high IBD rate region on chromosome 4p14 (0.027% IBD rate) contains several toll-like receptor genes (TLR1, TLR6, and TLR10) that are involved in immunity, and this region has experienced adaptive introgression from Neanderthals.⁵⁵ As well as the adaptive introgression, this region shows other signals of selection, including geographic differentiation within the UK⁵¹ and signs of recent positive selection among non-Africans.⁶⁷

The high IBD rate region on chromosome 6q21 (0.027% IBD rate) contains the PRDM1 and ATG5 (MIM: 604261) genes. This pair of genes is associated with autoimmune diseases and cancer⁶⁸^,⁶⁹^,⁷⁰ and shows a signal of recent selection in HapMap data.⁵⁴

The high IBD rate region on chromosome 3p21 (0.026% IBD rate) is a region of adaptive introgression from Neanderthals.⁴⁹^,⁵⁵ It contains chemokine receptor genes, including CCR9 and CXCR6, that are involved in immunity.⁷¹^,⁷² This region is also associated with COVID-19 disease severity.⁷³^,⁷⁴

The high IBD rate region on chromosome 5p13 (0.026% IBD rate) contains SLC45A2 (formerly known as MATP), a pigmentation gene that has undergone recent positive selection in Europe.⁵²^,⁷⁵

The high IBD rate region on chromosome 12q13 (0.026% IBD rate) contains the Type II Keratin genes, which code filament proteins that provide a major structural role in epithelial cells.⁷⁶ This region has experienced adaptive introgression from Neanderthals.⁵⁵

The high IBD rate region on chromosome 3q21 (0.025% IBD rate) contains ADCY5 (MIM: 600293). This gene is associated with birth weight⁷⁷ and type II diabetes⁷⁸ and is under long-term balancing selection.⁵⁰

The remaining locus on chromosome 22q11 (0.027% IBD rate) has not previously been highlighted as being under selection, to the best of our knowledge. This locus includes UBE2L3 (MIM: 603721) which is associated with multiple auto-immune diseases.⁵⁹ These associations make this locus a strong candidate for natural selection.⁷⁹

We also investigated the next six regions with highest IBD rates (Table 2; IBD rate between 0.021% and 0.025%). One of these regions is the epidermal differentiation complex locus on chromosome 1q21.3 (0.021% IBD rate), which is known to have undergone recent positive selection.⁸⁰

Table 2.

Secondary Regions from UK Biobank Selection Scan

Chromosome	Peak Position (Mb)	Region (Mb)^a	Max IBD %	Notes^b
1	76.61	74.86–77.27 (1p31.1)	0.0218	ST6GALNAC3 (MIM: 610133) (76.54–77.10) is subject to virus-driven selective pressure⁸¹
1	152.93	151.75–153.65 (1q21.3)	0.0214	epidermal differentiation complex locus (151.96–153.60);⁸² the locus with the greatest differentiation between humans and chimpanzees;⁸³ recent positive selection in humans⁸⁰
4	184.59	184.30–185.39 (4q35.1)	0.0217	–
5	2.72	2.32–2.98 (5p15.33)	0.0221	–
16	18.08	17.36–18.58 (16p12.3)	0.0244	–
17	36.05	35.18–36.51 (17q12)	0.0239	HNF1B (MIM: 189907) (36.05–36.10, previously known as TCF2) is associated with diabetes⁸⁴^,⁸⁵


Regions with maximum IBD rate between 0.021% and 0.025% are shown. Positions (in Mb) are in GRCh37 coordinates.

^a Region in which the IBD rate is at least 80% of the value at the peak.

^b Genes listed in this column are contained within the region given in the third column.

The variability of IBD rate across the genome is much lower for ibd-ends than for hap-ibd (Figure S7). The standard deviation of the ibd-ends IBD rate is 0.0064%, whereas the standard deviation for hap-ibd is more than three times higher at 0.022%. Ibd-ends is a much better tool for investigating regions of potential selection than length-based methods because length-based methods can report artifactually high rates of IBD in some regions, such as regions containing large gaps in marker coverage. For example, IBD segment detection with four length-based IBD detection methods produces a 40- to 3,000-fold higher IBD rate at the chromosome 1 centromere in UK Biobank data compared to the background rate.³³

Misspecification of the genetic map can also lead to spurious regions of high IBD rate in IBD selection scans, because overestimation of genetic length will result in a larger number of IBD segments that pass the length threshold. To investigate the impact of map choice on the IBD selection analysis, we analyzed a subset of 50,000 white British individuals with three maps. We observed considerable differences in the regions of highest IBD rate when using different maps (Figure S8). With the HapMap linkage-disequilibrium (LD)-based map,⁸⁶ there are 20 regions with IBD rate >0.05% (compared to 2 with Bherer et al.’s family-based map),⁴⁷ with 3 of these occurring at centromeres. Furthermore, the IBD rate has higher variability when using the HapMap map, with a standard deviation of 0.018% (compared to 0.0064% with Bherer et al.’s map in this 50,000-individual subset). LD-based maps are known to be biased in regions of selection, with genetic distances underestimated in these regions,⁸⁷ which would lead to an apparent decrease in IBD rate in these regions. Thus, increases in apparent IBD rate with the LD-based HapMap map compared to Bherer et al.’s pedigree-based map must be due to other effects, such as unmodelled features of the HapMap data. The IBDrecomb map is designed to obtain uniform IBD rate across the genome.¹³ Thus regions of likely selection such as LCT do not have high IBD rates when using this map. The analysis with IBDrecomb still has some regions of high IBD rate (21 regions with IBD rate > 0.05%), which are concentrated at chromosome ends (where inflation of map distances is known to occur¹³) and at centromeric gaps. These regions are likely to be artifacts.

The MHC region has a much higher selection signal from analysis with the HapMap map than with Bherer et al.’s map. The genetic lengths of the MHC are 3.29 cM for the HapMap map, 2.20 cM for Bherer et al.’s map, and 1.44 cM for the IBDrecomb map. Three previous IBD-based selection analyses have used the HapMap map,¹⁴^,¹⁵^,¹⁶ and two of these analyses also had very strong MHC signals.¹⁴^,¹⁶

Discussion

We presented a method for calculating the posterior probability distributions of IBD segment endpoints. We showed that the method can be applied to large datasets, such as the UK Biobank SNP array data on 408,883 white British individuals and simulated sequence data on 50,000 individuals. In the UK Biobank data, we analyzed 77.7 billion candidate IBD segments and found 54.7 billion IBD segments for which the length based on posterior median endpoints is greater than 2 cM.

In addition to quantifying endpoint uncertainty, a major advantage of our method is that it handles genotype errors and other discordances within IBD segments in a principled way, in contrast to many other methods for IBD segment detection which use ad hoc approaches. Our method does not directly account for haplotype phase uncertainty, but statistical phasing of non-rare variants in large array-typed cohorts is now extremely accurate,³⁶ and technologies for generating highly accurate, phase-resolved sequence data are becoming available.⁸⁸

Unlike most existing methods for IBD segment estimation, ibd-ends is designed for sequence data. We found that a length-based IBD detection method could not simultaneously detect IBD segments with high power and precisely determine endpoints, but the ibd-ends method detects nearly all segments exceeding a specified length and accurately determines the endpoints, even in the presence of relatively high rates of genotype error.

IBD segments defined by the sampled endpoints or posterior quantiles such as the median can be used in downstream analyses. This can enable the inclusion of more data in analyses. For example, when estimating genome-wide mutation rates from IBD segments, it is important to be confident that one does not count mutations that actually lie outside the IBD segment. In the past, this has been achieved by trimming 0.5 cM from the putative IBD endpoints,¹⁰^,¹² but trimming using a small quantile of the uncertainty distribution would be expected to result in less trim (and hence more data), while maintaining accuracy. Alternatively, methods could be developed that directly account for endpoint uncertainty without trimming.

Another analysis that could incorporate estimated IBD segment end point uncertainty is IBD-based estimation of recombination maps. Endpoints of IBD segments are points of past recombination which are used to estimate the map. Misspecification of IBD endpoints adds noise and bias to the resulting map. Thus, the higher precision of posterior median endpoints relative to endpoints from other IBD detection methods could improve estimation of recombination rate. One could also improve recombination map estimation by incorporating endpoint uncertainty into an iterative procedure that uses the current estimated recombination map to refine the estimates of IBD segment endpoints.

IBD segment endpoint uncertainty could also be used to improve IBD-based estimation of recent demographic history. For this application, it is not the actual IBD endpoints that matter, but rather the distribution of IBD segment lengths, which one could estimate by sampling endpoints from the posterior distribution. A threshold of 2 or 3 cM on IBD segment length is usually applied with current methods since lengths of shorter segments have higher relative uncertainty.⁴^,⁵^,⁶^,⁷^,⁸ With sampled endpoints, it may be possible to use a lower length threshold and thus estimate demographic history further into the past.

We found that the IBD segments obtained from sampled endpoints provide excellent input data for an IBD-based selection scan. Eleven of the twelve top regions in our UK Biobank analysis are regions of known selection, and the remaining region is a good candidate for selection. Two particular features of our IBD-based selection analysis contribute to its success. The first is accurate estimation of IBD segment lengths, even in the presence of gaps in marker coverage, which eliminates many spurious signals. In contrast, length-based IBD detection methods tend to have regions with inflated IBD rates,³³ which would cause spurious signals if used in a selection scan. The second is the choice of genetic map. We found that for IBD-based selection scans, pedigree-based maps based on actual observations of recombination produce more accurate results than maps based on LD or on IBD sharing, because effects of selection and other unmodelled features in local genomic regions can bias LD-based and IBD-based maps, and this bias results in spurious signals of selection.

The modeling underlying our method assumes a homogeneous population with the same distribution of IBS segment lengths for all pairs of individuals. Analysis of a simulated combined African and European ancestry sample indicates that violation of this assumption introduces a little mis-calibration in the posterior endpoint probabilities. One solution for future work would be to adjust for ancestry of the samples, or for local ancestry in the case of admixed samples.

Declaration of Interests

The authors declare no competing interests.

Acknowledgments

Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award numbers HG005701 and HG008359. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research has been conducted using the UK Biobank Resource under Application Number 19934.

Published: October 13, 2020

Footnotes

Supplemental Data can be found online at https://doi.org/10.1016/j.ajhg.2020.09.010.

Appendix A: Prior for IBD Length Distribution

In this appendix we derive a formula for the prior probabilities $P (R \in (x_{i}, x_{i + 1}))$ in Equation 1. Let Y = R − L₀ be the length (measured in Morgans) from the left endpoint, L₀, of the segment to the right endpoint R. Here we assume that the left endpoint is known, but in practice we iteratively update its estimated value as described in the main text. We write $F (y) = P (Y \leq y)$ for the prior probability distribution on Y. We model the population size as constant diploid size N, with N = 10,000, which reflects the approximate average historical size of out-of-Africa populations.

Summing over possible values for G, the number of generations to the common ancestor of the IBD segment, we obtain

$$F (y) = P (Y \leq y) = \sum_{g = 1}^{\infty} P (Y \leq y \mid G = g) P (G = g) = \sum_{g = 1}^{\infty} (1 - e^{- 2 g y}) {\left(1 - \frac{1}{2 N}\right)}^{g - 1} \frac{1}{2 N} = 1 - {(2 N e^{2 y} - 2 N + 1)}^{- 1}$$

In these calculations, we use the fact that the IBD segment length conditional on the number of generations $g$ to the common ancestor is exponentially distributed with rate parameter $2 g$,⁷ and the fact that the number of generations to the most recent common ancestor in a population of constant size $N$ is geometrically distributed.⁴ The final expression for $F (y)$ in the calculation is obtained by applying the formula for a geometric series.

The prior distribution for the right endpoint is:

$$P (R \leq x) = P (Y \leq x - L_{0}) = F (x - L_{0}) \quad \text{for } L_{0} < x \leq x_{M}$$

and the probability in Equation 1 that R falls in a certain interval can be calculated as: $P (R \in (x_{i}, x_{i + 1})) = F (x_{i + 1} - L_{0}) - F (x_{i} - L_{0}) .$

In Appendix C, we will require the inverse of F, which is $F^{- 1} (p) = \frac{1}{2} \log (\frac{p + 2 N (1 - p)}{2 N (1 - p)})$ .
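The closed forms in this appendix are easy to check numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the truncated series is included purely to confirm that the geometric-series simplification of $F (y)$ agrees with the sum over generations.

```python
import math

N_DIPLOID = 10_000  # constant diploid population size assumed in this appendix

def prior_cdf(y, n=N_DIPLOID):
    """F(y) = 1 - 1 / (2N e^{2y} - 2N + 1) for a length y in Morgans."""
    return 1.0 - 1.0 / (2.0 * n * math.expm1(2.0 * y) + 1.0)

def prior_cdf_series(y, n=N_DIPLOID, max_g=400_000):
    """Truncated sum over generations g, used only as a numerical check."""
    q = 1.0 - 1.0 / (2.0 * n)
    weight = 1.0 / (2.0 * n)            # P(G = 1)
    total = 0.0
    for g in range(1, max_g + 1):
        total += (1.0 - math.exp(-2.0 * g * y)) * weight
        weight *= q                      # becomes P(G = g + 1)
    return total

def prior_inv_cdf(p, n=N_DIPLOID):
    """F^{-1}(p) = (1/2) log((p + 2N(1 - p)) / (2N(1 - p)))."""
    return 0.5 * math.log((p + 2.0 * n * (1.0 - p)) / (2.0 * n * (1.0 - p)))

def prior_interval(x_i, x_next, left):
    """P(R in (x_i, x_next)) = F(x_next - L0) - F(x_i - L0), positions in Morgans."""
    return prior_cdf(x_next - left) - prior_cdf(x_i - left)
```

Writing $2 N e^{2 y} - 2 N$ as $2 N (e^{2 y} - 1)$ and evaluating it with `math.expm1` is a numerical convenience for small $y$; it does not change the formula.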

Appendix B: Modeling the IBS Data beyond the IBD Segment

In this appendix we describe how to calculate the conditional probability $P (D [i + 1, M] | R \in (x_{i}, x_{i + 1}))$ in Equation 1. This is the probability of the IBS data to the right of the right IBD endpoint, including a nominal discordant position indexed by M that is 1 Morgan beyond the last marker on the chromosome.

Let $m_{i} (j)$, $j \geq 1$, be the ordered indices of the discordant markers to the right of $x_{i}$. The final value of this sequence is index $M$. We approximate the IBS process to the right of the IBD endpoint as a renewal process with a renewal every time there is a discordant marker. Then the probability of the IBS data to the right of $x_{i}$ given that the IBD endpoint occurred in the interval $(x_{i}, x_{i + 1})$ is:

$$P (D [i + 1, M] \mid R \in (x_{i}, x_{i + 1})) \approx P (D [i + 1, m_{i} (1)]) \prod_{j > 1} P (D [m_{i} (j - 1) + 1, m_{i} (j)])$$

Each interval of IBS data in the preceding equation has the form $D [a, b]$, where $a$ and $b$ are marker indices, and has the property that the alleles on the two haplotypes $H_{1}$ and $H_{2}$ are identical at all positions $i$ satisfying $a \leq i < b$ but are discordant at position $b$. Our approach to estimating the probabilities of these IBS interval data is to estimate the probability $p_{a, b}$ that two randomly chosen haplotypes have identical alleles over the interval $[a, b]$ (i.e., for all marker indices $i$ with $a \leq i \leq b$). Then we set $P (D [a, b]) = p_{a, (b - 1)} - p_{a, b}$. In the case where $a = b$ (due to two consecutive discordances), we define $p_{a, (b - 1)} = 1$, so that $P (D [b, b]) = 1 - p_{b, b}$, which is the probability of discordance at marker $b$. We estimate $p_{a, b}$ empirically from the data.
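The renewal approximation amounts to cutting the markers to the right of $x_{i}$ into consecutive intervals that each end at a discordant marker, and multiplying the interval probabilities. A minimal Python sketch follows; the function names and the `interval_prob` callback are hypothetical, and the callback stands in for whatever estimate of $P (D [a, b])$ is available.

```python
import math

def renewal_intervals(i, discordant):
    """Split markers i+1..M into intervals [a, b] that each end at a discordance.

    i:          index of the marker at the left edge of the endpoint interval
    discordant: sorted indices of discordant markers to the right of marker i;
                the last entry is the nominal discordant index M
    """
    intervals = []
    start = i + 1
    for m in discordant:
        intervals.append((start, m))
        start = m + 1                     # renewal: the next interval starts afresh
    return intervals

def log_prob_beyond_endpoint(i, discordant, interval_prob):
    """Log of the product of P(D[a, b]) over the renewal intervals."""
    return sum(math.log(interval_prob(a, b))
               for a, b in renewal_intervals(i, discordant))

# Toy example: discordances at markers 7, 8, and 12 to the right of marker 4
example = renewal_intervals(4, [7, 8, 12])
```

Working on the log scale is a design choice made here to avoid underflow when many intervals are multiplied; the text itself states the result as a product.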

Our method for estimating $p_{a, b}$ depends on the length of the interval $[a, b]$. For short intervals, we use data in the interval so that the estimated probability incorporates the local genomic context, such as high or low marker density, high or low heterozygosity, and high or low levels of LD. For long intervals, we use chromosome-wide data. As the interval length becomes longer, $p_{a, b}$ will tend to decrease because the probability of observing a long IBS interval for a random pair of haplotypes is small. Small probabilities are more difficult to estimate than large ones, so we need to bring in data from the rest of the genome. Furthermore, long intervals represent long IBS (alleles at all markers in the interval are identical except for the final marker), and long IBS is likely to be the result of a long IBD segment. Excluding the effects of selection, the distribution of IBD lengths should be uniform across the genome, so estimating the frequency of such long IBS segments from data across the genome is appropriate.

For short intervals, we estimate $p_{a, b}$ by the proportion of pairs of haplotypes that have identical-by-state alleles for markers in the interval $[a, b]$. We consider an interval $[a, b]$ to be a short interval if the estimated ${\hat{p}}_{a, b}$ satisfies ${\hat{p}}_{a, b} \geq 0.001$ when it is estimated from the haplotypes in the interval. We use 10,000 randomly sampled haplotypes (or all haplotypes if the sample size is 5,000 or fewer individuals) to estimate the short interval IBS probabilities.
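The short-interval estimate can be illustrated with a small haplotype matrix. The sketch below is an assumption-laden toy, not the ibd-ends implementation: the function names are hypothetical, all pairs of rows are counted (including any pairs from the same individual), and identical sub-haplotypes are grouped with `numpy.unique` rather than compared pair by pair.

```python
import numpy as np

def ibs_probability(haps, a, b):
    """Proportion of haplotype pairs with identical alleles at markers a..b inclusive.

    haps: integer array of shape (n_haplotypes, n_markers)
    """
    window = haps[:, a:b + 1]
    # Haplotypes that share the same allele sequence in the window form a group;
    # a group of size c contributes c * (c - 1) / 2 identical pairs.
    _, counts = np.unique(window, axis=0, return_counts=True)
    identical_pairs = (counts * (counts - 1) // 2).sum()
    n = haps.shape[0]
    return identical_pairs / (n * (n - 1) / 2)

def interval_probability(haps, a, b):
    """P(D[a, b]) = p_{a,b-1} - p_{a,b}, with p_{a,a-1} defined as 1."""
    p_prev = 1.0 if a == b else ibs_probability(haps, a, b - 1)
    return p_prev - ibs_probability(haps, a, b)

# Toy data: 4 haplotypes, 3 markers (hypothetical)
haps = np.array([[0, 0, 1],
                 [0, 0, 1],
                 [0, 1, 1],
                 [1, 1, 0]])
p_00 = ibs_probability(haps, 0, 0)        # 3 of 6 pairs match at marker 0
p_d01 = interval_probability(haps, 0, 1)  # match at marker 0, discordant at marker 1
```

Under the rule stated in the text, an interval would be treated as short only when the estimate returned by `ibs_probability` is at least 0.001.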

For longer intervals, we estimate $p_{a, b}$ using the global distribution of sampled one-sided IBS lengths. A one-sided IBS length is the distance in Morgans from a random starting position to the first non-IBS marker in the direction toward the center of the chromosome for a random pair of haplotypes. We estimate $p_{a, b}$ by the proportion of sampled one-sided IBS lengths that are greater than $(x_{b} - x_{a})$. When estimating the global IBS length distribution, we randomly select 1,000 positions and randomly sample 2,000 one-sided IBS lengths for each position. We then exclude the sampled lengths at positions for which the IBS lengths are significantly longer than average. We do this by calculating the 90th percentile of the 2,000 segment lengths for each position. If the 90th percentile at a position is more than 3 times the median 90th percentile from all 1,000 positions, we discard the position. This filtering protects against selecting positions near gaps in marker coverage.
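The filtering and tail-proportion steps can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the sampled lengths are assumed to be supplied as a positions-by-samples array, and the toy data are exponential draws with one artificially inflated position rather than real one-sided IBS lengths.

```python
import numpy as np

def pooled_ibs_lengths(lengths, max_ratio=3.0):
    """Pool one-sided IBS lengths after discarding outlying positions.

    lengths: array of shape (n_positions, n_samples_per_position), in Morgans.
    A position is discarded if its 90th percentile exceeds `max_ratio` times
    the median of the per-position 90th percentiles.
    """
    lengths = np.asarray(lengths, dtype=float)
    p90 = np.percentile(lengths, 90, axis=1)
    keep = p90 <= max_ratio * np.median(p90)
    return lengths[keep].ravel()

def long_interval_ibs_probability(pooled, interval_length):
    """Estimate p_{a,b} as the proportion of pooled lengths exceeding x_b - x_a."""
    return float(np.mean(np.asarray(pooled) > interval_length))

# Toy data: 50 positions x 200 samples, with position 7 inflated 10-fold
rng = np.random.default_rng(1)
toy = rng.exponential(0.001, size=(50, 200))
toy[7] *= 10.0
pooled = pooled_ibs_lengths(toy)
p_long = long_interval_ibs_probability(pooled, 0.005)
```

The text uses 1,000 positions and 2,000 sampled lengths per position; the smaller toy dimensions here are only to keep the example quick to run.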

Appendix C: Obtaining the Posterior Cumulative Distribution Function for the Endpoint

In this appendix we describe how to calculate the cumulative distribution function of the posterior distribution for the right endpoint of the IBD segment, as well as how to sample from this posterior distribution.

We assume that the data D are not informative about the location of the endpoint within the inter-marker interval given that the endpoint occurs within the interval, i.e., we assume that $P (x_{i} < R \leq x | x_{i} < R < x_{i + 1}, D) = P (x_{i} < R \leq x | x_{i} < R < x_{i + 1})$ for $x_{i} < x < x_{i + 1}$ , since there are no data within the interval. Write $p_{i} = P (R \leq x_{i} | D)$ , which can be estimated using Equation 1 with the procedures described above. As in Appendix A, we write $Y = R - L_{0}$ for the length of the segment (in Morgans), and $F (y) = P (Y \leq y)$ for the prior on IBD lengths. For $x$ satisfying $x_{i} < x < x_{i + 1}$ :

$$\begin{aligned}
P (R \leq x \mid D) &= P (R \leq x_{i} \mid D) + P (x_{i} < R \leq x \mid D) \\
&= P (R \leq x_{i} \mid D) + P (x_{i} < R \leq x \mid x_{i} < R \leq x_{i + 1}, D) \, P (x_{i} < R \leq x_{i + 1} \mid D) \\
&= P (R \leq x_{i} \mid D) + P (x_{i} < R \leq x \mid x_{i} < R \leq x_{i + 1}) \, P (x_{i} < R \leq x_{i + 1} \mid D) \\
&= P (R \leq x_{i} \mid D) + \frac{P (x_{i} < R \leq x)}{P (x_{i} < R \leq x_{i + 1})} \, P (x_{i} < R \leq x_{i + 1} \mid D) \\
&= p_{i} + \frac{P (x_{i} - L_{0} < Y \leq x - L_{0})}{P (x_{i} - L_{0} < Y \leq x_{i + 1} - L_{0})} (p_{i + 1} - p_{i}) \\
&= p_{i} + \frac{F (x - L_{0}) - F (x_{i} - L_{0})}{F (x_{i + 1} - L_{0}) - F (x_{i} - L_{0})} (p_{i + 1} - p_{i})
\end{aligned}$$

Note that $p_{i} = P (R \leq x_{i} | D)$ is conditional on the data, whereas $F (x_{i} - L_{0}) = P (Y \leq x_{i} - L_{0}) = P (R \leq x_{i})$ is a prior probability and is not conditional on the data.

Then to find the $p$th quantile, we want to find $x^{(p)}$ such that $P (R < x^{(p)} | D) = p$. Taking $i$ to be the index for which $p_{i} \leq p < p_{i + 1}$ and solving the above equation for $x = x^{(p)}$, we obtain

$$
x^{(p)} = L_{0} + F^{- 1} \left( \left( F (x_{i + 1} - L_{0}) - F (x_{i} - L_{0}) \right) \frac{p - p_{i}}{p_{i + 1} - p_{i}} + F (x_{i} - L_{0}) \right) .
$$

The formula for $F^{- 1} (p)$ can be found in Appendix A.

To sample a value from the posterior distribution of the endpoint, we first generate a realization $u$ from the Uniform(0,1) distribution. We then obtain $x^{(u)}$ using the above formula with $u$ in place of $p$. The value $x^{(u)}$ is a realization from the desired distribution. This is an example of using the inverse transform principle to sample from a distribution.⁸⁹
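The quantile formula and the inverse transform step can be sketched as below. The names are hypothetical, and the exponential prior and its inverse are placeholders standing in for the $F$ and $F^{- 1}$ of Appendix A. The sketch also assumes that the supplied markers span the whole posterior, so that the first and last $p_{i}$ are 0 and 1.

```python
import bisect
import math
import random


def posterior_quantile(p, marker_pos, p_at_marker, left_start,
                       prior_cdf, prior_inv_cdf):
    """Return x such that P(R <= x | D) = p, using the interpolation formula."""
    if not p_at_marker[0] <= p <= p_at_marker[-1]:
        raise ValueError("p lies outside the range covered by the markers")
    j = bisect.bisect_left(p_at_marker, p)  # smallest j with p_j >= p
    if p_at_marker[j] == p:
        return marker_pos[j]  # the quantile falls exactly on a marker
    i = j - 1  # now p_i < p < p_{i+1}
    f_lo = prior_cdf(marker_pos[i] - left_start)
    f_hi = prior_cdf(marker_pos[i + 1] - left_start)
    frac = (p - p_at_marker[i]) / (p_at_marker[i + 1] - p_at_marker[i])
    return left_start + prior_inv_cdf((f_hi - f_lo) * frac + f_lo)


def sample_endpoint(rng, marker_pos, p_at_marker, left_start,
                    prior_cdf, prior_inv_cdf):
    """Draw one endpoint by inverse transform sampling: u ~ Uniform(0,1), x = x^(u)."""
    u = rng.random()
    return posterior_quantile(u, marker_pos, p_at_marker, left_start,
                              prior_cdf, prior_inv_cdf)


RATE = 100.0  # placeholder rate for the illustrative exponential prior


def placeholder_prior_cdf(y):
    """Exponential CDF used only for illustration (an assumption)."""
    return 1.0 - math.exp(-RATE * y)


def placeholder_prior_inv_cdf(q):
    """Inverse of the placeholder exponential CDF."""
    return -math.log1p(-q) / RATE


# Hypothetical markers (Morgans) and posterior values at those markers.
xs = [0.010, 0.012, 0.015]
ps = [0.0, 0.4, 1.0]
rng = random.Random(1)
draws = [
    sample_endpoint(rng, xs, ps, 0.0, placeholder_prior_cdf, placeholder_prior_inv_cdf)
    for _ in range(1000)
]
```

Because `random.Random.random()` returns values in [0, 1), every draw falls within the marker range under the stated assumption on the first and last $p_{i}$. A caller whose markers do not span the full posterior would need a rule for the tails, which this sketch does not supply.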

Data and Code Availability

The ibd-ends software is freely available under the open source Apache License 2.0 (see Web Resources).

The UK Biobank IBD rates shown in Figure 4 are available from Mendeley Data at https://doi.org/10.17632/m4nxyv6rw8.1

Web Resources

Bherer et al.’s refined European map, https://github.com/cbherer/Bherer_etal_SexualDimorphismRecombination/blob/master/Refined_EUR_genetic_map_b37.tar.gz
hap-ibd, https://github.com/browning-lab/hap-ibd
HapMap map, ftp://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz
ibd-ends, https://github.com/browning-lab/ibd-ends
IBDrecomb European-American map, https://github.com/YingZhou001/IBDrecomb/tree/master/maps.b37/FHS.map.tar.gz
OMIM, https://www.omim.org/

Supplemental Information

Document S1. Figures S1–S8

mmc1.pdf (631.1 KB, pdf)

Document S2. Article plus Supplemental Information

mmc2.pdf (1.7 MB, pdf)

References

1. Browning B.L., Browning S.R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 2011;88:173–182. doi: 10.1016/j.ajhg.2011.01.010.
2. Huff C.D., Witherspoon D.J., Simonson T.S., Xing J., Watkins W.S., Zhang Y., Tuohy T.M., Neklason D.W., Burt R.W., Guthery S.L. Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res. 2011;21:768–774. doi: 10.1101/gr.115972.110.
3. Ramstetter M.D., Dyer T.D., Lehman D.M., Curran J.E., Duggirala R., Blangero J., Mezey J.G., Williams A.L. Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives. Genetics. 2017;207:75–82. doi: 10.1534/genetics.117.1122.
4. Palamara P.F., Lencz T., Darvasi A., Pe’er I. Length distributions of identity by descent reveal fine-scale demographic history. Am. J. Hum. Genet. 2012;91:809–822. doi: 10.1016/j.ajhg.2012.08.030.
5. Palamara P.F., Pe’er I. Inference of historical migration rates via haplotype sharing. Bioinformatics. 2013;29:i180–i188. doi: 10.1093/bioinformatics/btt239.
6. Ralph P., Coop G. The geography of recent genetic ancestry across Europe. PLoS Biol. 2013;11:e1001555. doi: 10.1371/journal.pbio.1001555.
7. Browning S.R., Browning B.L. Accurate non-parametric estimation of recent effective population size from segments of identity by descent. Am. J. Hum. Genet. 2015;97:404–418. doi: 10.1016/j.ajhg.2015.07.012.
8. Browning S.R., Browning B.L., Daviglus M.L., Durazo-Arvizu R.A., Schneiderman N., Kaplan R.C., Laurie C.C. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 2018;14:e1007385. doi: 10.1371/journal.pgen.1007385.
9. Campbell C.D., Chong J.X., Malig M., Ko A., Dumont B.L., Han L., Vives L., O’Roak B.J., Sudmant P.H., Shendure J. Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 2012;44:1277–1281. doi: 10.1038/ng.2418.
10. Palamara P.F., Francioli L.C., Wilton P.R., Genovese G., Gusev A., Finucane H.K., Sankararaman S., Sunyaev S.R., de Bakker P.I., Wakeley J., Genome of the Netherlands Consortium. Leveraging Distant Relatedness to Quantify Human Mutation and Gene-Conversion Rates. Am. J. Hum. Genet. 2015;97:775–789. doi: 10.1016/j.ajhg.2015.10.006.
11. Narasimhan V.M., Rahbari R., Scally A., Wuster A., Mason D., Xue Y., Wright J., Trembath R.C., Maher E.R., van Heel D.A. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat. Commun. 2017;8:303. doi: 10.1038/s41467-017-00323-y.
12. Tian X., Browning B.L., Browning S.R. Estimating the Genome-wide Mutation Rate with Three-Way Identity by Descent. Am. J. Hum. Genet. 2019;105:883–893. doi: 10.1016/j.ajhg.2019.09.012.
13. Zhou Y., Browning B.L., Browning S.R. Population-specific recombination maps from segments of identity by descent. Am. J. Hum. Genet. 2020;107:137–148. doi: 10.1016/j.ajhg.2020.05.016.
14. Albrechtsen A., Moltke I., Nielsen R. Natural selection and the distribution of identity-by-descent in the human genome. Genetics. 2010;186:295–308. doi: 10.1534/genetics.110.113977.
15. Han L., Abney M. Using identity by descent estimation with dense genotype data to detect positive selection. Eur. J. Hum. Genet. 2013;21:205–211. doi: 10.1038/ejhg.2012.148.
16. Palamara P.F., Terhorst J., Song Y.S., Price A.L. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat. Genet. 2018;50:1311–1317. doi: 10.1038/s41588-018-0177-x.
17. Lewontin R.C., Krakauer J. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics. 1973;74:175–195. doi: 10.1093/genetics/74.1.175.
18. Akey J.M., Zhang G., Zhang K., Jin L., Shriver M.D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 2002;12:1805–1814. doi: 10.1101/gr.631202.
19. Workman P.L., Blumberg B.S., Cooper A.J. Selection, gene migration and polymorphic stability in a US White and Negro population. Am. J. Hum. Genet. 1963;15:429–437.
20. Tang H., Choudhry S., Mei R., Morgan M., Rodriguez-Cintron W., Burchard E.G., Risch N.J. Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 2007;81:626–633. doi: 10.1086/520769.
21. Green R.E., Krause J., Briggs A.W., Maricic T., Stenzel U., Kircher M., Patterson N., Li H., Zhai W., Fritz M.H. A draft sequence of the Neandertal genome. Science. 2010;328:710–722. doi: 10.1126/science.1188021.
22. Sabeti P.C., Reich D.E., Higgins J.M., Levine H.Z., Richter D.J., Schaffner S.F., Gabriel S.B., Platko J.V., Patterson N.J., McDonald G.J. Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002;419:832–837. doi: 10.1038/nature01140.
23. Voight B.F., Kudaravalli S., Wen X., Pritchard J.K. A map of recent positive selection in the human genome. PLoS Biol. 2006;4:e72. doi: 10.1371/journal.pbio.0040072.
24. Leutenegger A.L., Prum B., Génin E., Verny C., Lemainque A., Clerget-Darpoux F., Thompson E.A. Estimation of the inbreeding coefficient through use of genomic data. Am. J. Hum. Genet. 2003;73:516–523. doi: 10.1086/378207.
25. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., de Bakker P.I.W., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
26. Browning S.R. Estimation of pairwise identity by descent from dense genetic marker data in a population sample of haplotypes. Genetics. 2008;178:2123–2132. doi: 10.1534/genetics.107.084624.
27. Thompson E.A. The IBD process along four chromosomes. Theor. Popul. Biol. 2008;73:369–373. doi: 10.1016/j.tpb.2007.11.011.
28. Browning S.R., Browning B.L. High-resolution detection of identity by descent in unrelated individuals. Am. J. Hum. Genet. 2010;86:526–539. doi: 10.1016/j.ajhg.2010.02.021.
29. Albrechtsen A., Sand Korneliussen T., Moltke I., van Overseem Hansen T., Nielsen F.C., Nielsen R. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet. Epidemiol. 2009;33:266–274. doi: 10.1002/gepi.20378.
30. Han L., Abney M. Identity by descent estimation with dense genome-wide genotype data. Genet. Epidemiol. 2011;35:557–567. doi: 10.1002/gepi.20606.
31. Kong A., Masson G., Frigge M.L., Gylfason A., Zusmanovich P., Thorleifsson G., Olason P.I., Ingason A., Steinberg S., Rafnar T. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 2008;40:1068–1075. doi: 10.1038/ng.216.
32. Gusev A., Lowe J.K., Stoffel M., Daly M.J., Altshuler D., Breslow J.L., Friedman J.M., Pe’er I. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19:318–326. doi: 10.1101/gr.081398.108.
33. Zhou Y., Browning S.R., Browning B.L. A fast and simple method for detecting identity by descent segments in large-scale data. Am. J. Hum. Genet. 2020;106:426–437. doi: 10.1016/j.ajhg.2020.02.010.
34. Thompson E.A. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics. 2013;194:301–326. doi: 10.1534/genetics.112.148825.
35. Chiang C.W., Ralph P., Novembre J. Conflation of Short Identity-by-Descent Segments Bias Their Inferred Length Distribution. G3 (Bethesda). 2016;6:1287–1296. doi: 10.1534/g3.116.027581.
36. Delaneau O., Zagury J.-F., Robinson M.R., Marchini J.L., Dermitzakis E.T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 2019;10:5436. doi: 10.1038/s41467-019-13225-y.
37. Williams A.L., Genovese G., Dyer T., Altemose N., Truax K., Jun G., Patterson N., Myers S.R., Curran J.E., Duggirala R., T2D-GENES Consortium. Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. eLife. 2015;4:4. doi: 10.7554/eLife.04637.
38. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
39. Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv. 2019. doi: 10.1101/563866.
40. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
41. Haller B.C., Messer P.W. SLiM 3: Forward genetic simulations beyond the Wright–Fisher model. Mol. Biol. Evol. 2019;36:632–637. doi: 10.1093/molbev/msy228.
42. Kelleher J., Etheridge A.M., McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput. Biol. 2016;12:e1004842. doi: 10.1371/journal.pcbi.1004842.
43. Haller B.C., Galloway J., Kelleher J., Messer P.W., Ralph P.L. Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol. Ecol. Resour. 2019;19:552–566. doi: 10.1111/1755-0998.12968.
44. Tennessen J.A., Bigham A.W., O’Connor T.D., Fu W., Kenny E.E., Gravel S., McGee S., Do R., Liu X., Jun G., Broad GO. Seattle GO. NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240.
45. Fu W., O’Connor T.D., Jun G., Kang H.M., Abecasis G., Leal S.M., Gabriel S., Rieder M.J., Altshuler D., Shendure J., NHLBI Exome Sequencing Project. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493:216–220. doi: 10.1038/nature11690.
46. Adrion J.R., Cole C.B., Dukler N., Galloway J.G., Gladstein A.L., Gower G., Kyriazis C.C., Ragsdale A.P., Tsambos G., Baumdicker F. A community-maintained standard library of population genetic models. eLife. 2020;9:e54967. doi: 10.7554/eLife.54967.
47. Bhérer C., Campbell C.L., Auton A. Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nat. Commun. 2017;8:14994. doi: 10.1038/ncomms14994.
48. Bersaglieri T., Sabeti P.C., Patterson N., Vanderploeg T., Schaffner S.F., Drake J.A., Rhodes M., Reich D.E., Hirschhorn J.N. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 2004;74:1111–1120. doi: 10.1086/421051.
49. Ding Q., Hu Y., Xu S., Wang J., Jin L. Neanderthal introgression at chromosome 3p21.31 was under positive natural selection in East Asians. Mol. Biol. Evol. 2014;31:683–695. doi: 10.1093/molbev/mst260.
50. Bitarello B.D., de Filippo C., Teixeira J.C., Schmidt J.M., Kleinert P., Meyer D., Andrés A.M. Signatures of long-term balancing selection in human genomes. Genome Biol. Evol. 2018;10:939–955. doi: 10.1093/gbe/evy054.
51. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
52. Graf J., Hodgson R., van Daal A. Single nucleotide polymorphisms in the MATP gene are associated with normal human pigmentation variation. Hum. Mutat. 2005;25:278–284. doi: 10.1002/humu.20143.
53. Meyer D., Single R.M., Mack S.J., Erlich H.A., Thomson G. Signatures of demographic history and natural selection in the human major histocompatibility complex Loci. Genetics. 2006;173:2121–2142. doi: 10.1534/genetics.105.052837.
54. Ramos P.S. Population genetics and natural selection in rheumatic disease. Rheum. Dis. Clin. North Am. 2017;43:313–326. doi: 10.1016/j.rdc.2017.04.001.
55. Browning S.R., Browning B.L., Zhou Y., Tucci S., Akey J.M. Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture. Cell. 2018;173:53–61. doi: 10.1016/j.cell.2018.02.031.
56. Sams A.J., Dumaine A., Nédélec Y., Yotova V., Alfieri C., Tanner J.E., Messer P.W., Barreiro L.B. Adaptively introgressed Neandertal haplotype at the OAS locus functionally impacts innate immune responses in humans. Genome Biol. 2016;17:246. doi: 10.1186/s13059-016-1098-6.
57. Storz J.F., Payseur B.A., Nachman M.W. Genome scans of DNA variability in humans reveal evidence for selective sweeps outside of Africa. Mol. Biol. Evol. 2004;21:1800–1811. doi: 10.1093/molbev/msh192.
58. Stefansson H., Helgason A., Thorleifsson G., Steinthorsdottir V., Masson G., Barnard J., Baker A., Jonasdottir A., Ingason A., Gudnadottir V.G. A common inversion under selection in Europeans. Nat. Genet. 2005;37:129–137. doi: 10.1038/ng1508.
59. Lewis M.J., Vyse S., Shields A.M., Boeltz S., Gordon P.A., Spector T.D., Lehner P.J., Walczak H., Vyse T.J. UBE2L3 polymorphism amplifies NF-κB activation and promotes plasma cell development, linking linear ubiquitination to multiple autoimmune diseases. Am. J. Hum. Genet. 2015;96:221–234. doi: 10.1016/j.ajhg.2014.12.024.
60. Zody M.C., Jiang Z., Fung H.-C., Antonacci F., Hillier L.W., Cardone M.F., Graves T.A., Kidd J.M., Cheng Z., Abouelleil A. Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 2008;40:1076–1083. doi: 10.1038/ng.193.
61. Itan Y., Powell A., Beaumont M.A., Burger J., Thomas M.G. The origins of lactase persistence in Europe. PLoS Comput. Biol. 2009;5:e1000491. doi: 10.1371/journal.pcbi.1000491.
62. Ho M.W., Povey S., Swallow D. Lactase polymorphism in adult British natives: estimating allele frequencies by enzyme assays in autopsy samples. Am. J. Hum. Genet. 1982;34:650–657.
63. Hughes A.L., Yeager M. Natural selection at major histocompatibility complex loci of vertebrates. Annu. Rev. Genet. 1998;32:415–435. doi: 10.1146/annurev.genet.32.1.415.
64. Racimo F., Sankararaman S., Nielsen R., Huerta-Sánchez E. Evidence for archaic adaptive introgression in humans. Nat. Rev. Genet. 2015;16:359–371. doi: 10.1038/nrg3936.
65. Hider J.L., Gittelman R.M., Shah T., Edwards M., Rosenbloom A., Akey J.M., Parra E.J. Exploring signatures of positive selection in pigmentation candidate genes in populations of East Asian ancestry. BMC Evol. Biol. 2013;13:150. doi: 10.1186/1471-2148-13-150.
66. Hu J., Wang X., Xing Y., Rong E., Ning M., Smith J., Huang Y. Origin and development of oligoadenylate synthetase immune system. BMC Evol. Biol. 2018;18:201. doi: 10.1186/s12862-018-1315-x.
67. Barreiro L.B., Ben-Ali M., Quach H., Laval G., Patin E., Pickrell J.K., Bouchier C., Tichit M., Neyrolles O., Gicquel B. Evolutionary dynamics of human Toll-like receptors and their different contributions to host defense. PLoS Genet. 2009;5:e1000562. doi: 10.1371/journal.pgen.1000562.
68. Raychaudhuri S., Thomson B.P., Remmers E.F., Eyre S., Hinks A., Guiducci C., Catanese J.J., Xie G., Stahl E.A., Chen R., BIRAC Consortium. YEAR Consortium. Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk. Nat. Genet. 2009;41:1313–1318. doi: 10.1038/ng.479.
69. Zhou X.J., Lu X.L., Lv J.C., Yang H.Z., Qin L.X., Zhao M.H., Su Y., Li Z.G., Zhang H. Genetic association of PRDM1-ATG5 intergenic region and autophagy with systemic lupus erythematosus in a Chinese population. Ann. Rheum. Dis. 2011;70:1330–1337. doi: 10.1136/ard.2010.140111.
70. Best T., Li D., Skol A.D., Kirchhoff T., Jackson S.A., Yasui Y., Bhatia S., Strong L.C., Domchek S.M., Nathanson K.L. Variants at 6q21 implicate PRDM1 in the etiology of therapy-induced second malignancies after Hodgkin’s lymphoma. Nat. Med. 2011;17:941–943. doi: 10.1038/nm.2407.
71. Paust S., Gill H.S., Wang B.Z., Flynn M.P., Moseman E.A., Senman B., Szczepanik M., Telenti A., Askenase P.W., Compans R.W., von Andrian U.H. Critical role for the chemokine receptor CXCR6 in NK cell-mediated antigen-specific memory of haptens and viruses. Nat. Immunol. 2010;11:1127–1135. doi: 10.1038/ni.1953.
72. Papadakis K.A., Prehn J., Nelson V., Cheng L., Binder S.W., Ponath P.D., Andrew D.P., Targan S.R. The role of thymus-expressed chemokine and its receptor CCR9 on lymphocytes in the regional specialization of the mucosal immune system. J. Immunol. 2000;165:5069–5076. doi: 10.4049/jimmunol.165.9.5069.
73. Ellinghaus D., Degenhardt F., Bujanda L., Buti M., Albillos A., Invernizzi P., Fernandez J., Prati D., Baselli G., Asselta R. The ABO blood group locus and a chromosome 3 gene cluster associate with SARS-CoV-2 respiratory failure in an Italian-Spanish genome-wide association analysis. medRxiv. 2020. doi: 10.1101/2020.05.31.20114991.
74. Shelton J.F., Shastri A.J., Ye C., Weldon C.H., Filshtein-Somnez T., Coker D., Symons A., Esparza-Gordillo J., Aslibekyan S., Auton A. Trans-ethnic analysis reveals genetic and non-genetic associations with COVID-19 susceptibility and severity. medRxiv. 2020. doi: 10.1101/2020.09.04.20188318.
75. Mathieson I., Lazaridis I., Rohland N., Mallick S., Patterson N., Roodenberg S.A., Harney E., Stewardson K., Fernandes D., Novak M. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015;528:499–503. doi: 10.1038/nature16152.
76. Rogers M.A., Edler L., Winter H., Langbein L., Beckmann I., Schweizer J. Characterization of new members of the human type II keratin gene family and a general evaluation of the keratin gene domain on chromosome 12q13.13. J. Invest. Dermatol. 2005;124:536–544. doi: 10.1111/j.0022-202X.2004.23530.x.
77. Freathy R.M., Mook-Kanamori D.O., Sovio U., Prokopenko I., Timpson N.J., Berry D.J., Warrington N.M., Widen E., Hottenga J.J., Kaakinen M., Genetic Investigation of ANthropometric Traits (GIANT) Consortium. Meta-Analyses of Glucose and Insulin-related traits Consortium. Wellcome Trust Case Control Consortium. Early Growth Genetics (EGG) Consortium. Variants in ADCY5 and near CCNL1 are associated with fetal growth and birth weight. Nat. Genet. 2010;42:430–435. doi: 10.1038/ng.567.
78. Dupuis J., Langenberg C., Prokopenko I., Saxena R., Soranzo N., Jackson A.U., Wheeler E., Glazer N.L., Bouatia-Naji N., Gloyn A.L., DIAGRAM Consortium. GIANT Consortium. Global BPgen Consortium. Anders Hamsten on behalf of Procardis Consortium. MAGIC investigators. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 2010;42:105–116. doi: 10.1038/ng.520.
79. Ramos P.S., Shedlock A.M., Langefeld C.D. Genetics of autoimmune diseases: insights from population genetics. J. Hum. Genet. 2015;60:657–664. doi: 10.1038/jhg.2015.94.
80. Mathyer M.E., Brettmann E.A., Schmidt A.D., Goodwin Z.A., Quiggle A.M., Oh I.Y., Tycksen E., Zhou L., Estrada Y.D., Wong X.C.C. An enhancer: involucrin regulatory module impacts human skin barrier adaptation out-of-Africa and modifies atopic dermatitis risk. bioRxiv. 2019. doi: 10.1101/816520.
81. Fumagalli M., Pozzoli U., Cagliani R., Comi G.P., Bresolin N., Clerici M., Sironi M. Genome-wide identification of susceptibility alleles for viral infections through a population genetics approach. PLoS Genet. 2010;6:e1000849. doi: 10.1371/journal.pgen.1000849.
82. Mischke D., Korge B.P., Marenholz I., Volz A., Ziegler A. Genes encoding structural proteins of epidermal cornification and S100 calcium-binding proteins form a gene complex (“epidermal differentiation complex”) on human chromosome 1q21. J. Invest. Dermatol. 1996;106:989–992. doi: 10.1111/1523-1747.ep12338501.
83. Mikkelsen T., Hillier L., Eichler E., Zody M., Jaffe D., Yang S.-P., Enard W., Hellmann I., Lindblad-Toh K., Altheide T., Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87. doi: 10.1038/nature04072.
84. Horikawa Y., Iwasaki N., Hara M., Furuta H., Hinokio Y., Cockburn B.N., Lindner T., Yamagata K., Ogata M., Tomonaga O. Mutation in hepatocyte nuclear factor-1 β gene (TCF2) associated with MODY. Nat. Genet. 1997;17:384–385. doi: 10.1038/ng1297-384.
85. Gudmundsson J., Sulem P., Steinthorsdottir V., Bergthorsson J.T., Thorleifsson G., Manolescu A., Rafnar T., Gudbjartsson D., Agnarsson B.A., Baker A. Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nat. Genet. 2007;39:977–983. doi: 10.1038/ng2062.
86. Frazer K.A., Ballinger D.G., Cox D.R., Hinds D.A., Stuve L.L., Gibbs R.A., Belmont J.W., Boudreau A., Hardenbol P., Leal S.M., International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258.
87. O’Reilly P.F., Birney E., Balding D.J. Confounding between recombination and selection, and the Ped/Pop method for detecting selection. Genome Res. 2008;18:1304–1313. doi: 10.1101/gr.067181.107.
88. Wenger A.M., Peluso P., Rowell W.J., Chang P.-C., Hall R.J., Concepcion G.T., Ebler J., Fungtammasan A., Kolesnikov A., Olson N.D. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 2019;37:1155–1162. doi: 10.1038/s41587-019-0217-9.
89. Devroye L. Non-uniform random variate generation. Springer-Verlag; New York: 1986.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Document S1. Figures S1–S8

mmc1.pdf^{(631.1KB, pdf)}

Document S2. Article plus Supplemental Information

mmc2.pdf^{(1.7MB, pdf)}

Data Availability Statement

The ibd-ends software is freely available under the open source Apache License 2.0 (see Web Resources).

The UK Biobank IBD rates shown in Figure 4 are available from Mendeley Data https://doi.org/10.17632/m4nxyv6rw8.1

[bib1] 1.Browning B.L., Browning S.R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 2011;88:173–182. doi: 10.1016/j.ajhg.2011.01.010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.Huff C.D., Witherspoon D.J., Simonson T.S., Xing J., Watkins W.S., Zhang Y., Tuohy T.M., Neklason D.W., Burt R.W., Guthery S.L. Maximum-likelihood estimation of recent shared ancestry (ERSA) Genome Res. 2011;21:768–774. doi: 10.1101/gr.115972.110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Ramstetter M.D., Dyer T.D., Lehman D.M., Curran J.E., Duggirala R., Blangero J., Mezey J.G., Williams A.L. Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives. Genetics. 2017;207:75–82. doi: 10.1534/genetics.117.1122. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Palamara P.F., Lencz T., Darvasi A., Pe’er I. Length distributions of identity by descent reveal fine-scale demographic history. Am. J. Hum. Genet. 2012;91:809–822. doi: 10.1016/j.ajhg.2012.08.030. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Palamara P.F., Pe’er I. Inference of historical migration rates via haplotype sharing. Bioinformatics. 2013;29:i180–i188. doi: 10.1093/bioinformatics/btt239. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Ralph P., Coop G. The geography of recent genetic ancestry across Europe. PLoS Biol. 2013;11:e1001555. doi: 10.1371/journal.pbio.1001555. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Browning S.R., Browning B.L. Accurate non-parametric estimation of recent effective population size from segments of identity by descent. Am. J. Hum. Genet. 2015;97:404–418. doi: 10.1016/j.ajhg.2015.07.012. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Browning S.R., Browning B.L., Daviglus M.L., Durazo-Arvizu R.A., Schneiderman N., Kaplan R.C., Laurie C.C. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 2018;14:e1007385. doi: 10.1371/journal.pgen.1007385. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] 9.Campbell C.D., Chong J.X., Malig M., Ko A., Dumont B.L., Han L., Vives L., O’Roak B.J., Sudmant P.H., Shendure J. Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 2012;44:1277–1281. doi: 10.1038/ng.2418. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Palamara P.F., Francioli L.C., Wilton P.R., Genovese G., Gusev A., Finucane H.K., Sankararaman S., Sunyaev S.R., de Bakker P.I., Wakeley J., Genome of the Netherlands Consortium Leveraging Distant Relatedness to Quantify Human Mutation and Gene-Conversion Rates. Am. J. Hum. Genet. 2015;97:775–789. doi: 10.1016/j.ajhg.2015.10.006. [DOI] [PMC free article] [PubMed] [Google Scholar]

11. Narasimhan V.M., Rahbari R., Scally A., Wuster A., Mason D., Xue Y., Wright J., Trembath R.C., Maher E.R., van Heel D.A. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat. Commun. 2017;8:303. doi: 10.1038/s41467-017-00323-y.

12. Tian X., Browning B.L., Browning S.R. Estimating the Genome-wide Mutation Rate with Three-Way Identity by Descent. Am. J. Hum. Genet. 2019;105:883–893. doi: 10.1016/j.ajhg.2019.09.012.

13. Zhou Y., Browning B.L., Browning S.R. Population-specific recombination maps from segments of identity by descent. Am. J. Hum. Genet. 2020;107:137–148. doi: 10.1016/j.ajhg.2020.05.016.

14. Albrechtsen A., Moltke I., Nielsen R. Natural selection and the distribution of identity-by-descent in the human genome. Genetics. 2010;186:295–308. doi: 10.1534/genetics.110.113977.

15. Han L., Abney M. Using identity by descent estimation with dense genotype data to detect positive selection. Eur. J. Hum. Genet. 2013;21:205–211. doi: 10.1038/ejhg.2012.148.

16. Palamara P.F., Terhorst J., Song Y.S., Price A.L. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat. Genet. 2018;50:1311–1317. doi: 10.1038/s41588-018-0177-x.

17. Lewontin R.C., Krakauer J. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics. 1973;74:175–195. doi: 10.1093/genetics/74.1.175.

18. Akey J.M., Zhang G., Zhang K., Jin L., Shriver M.D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 2002;12:1805–1814. doi: 10.1101/gr.631202.

19. Workman P.L., Blumberg B.S., Cooper A.J. Selection, gene migration and polymorphic stability in a US White and Negro population. Am. J. Hum. Genet. 1963;15:429–437.

20. Tang H., Choudhry S., Mei R., Morgan M., Rodriguez-Cintron W., Burchard E.G., Risch N.J. Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 2007;81:626–633. doi: 10.1086/520769.

21. Green R.E., Krause J., Briggs A.W., Maricic T., Stenzel U., Kircher M., Patterson N., Li H., Zhai W., Fritz M.H. A draft sequence of the Neandertal genome. Science. 2010;328:710–722. doi: 10.1126/science.1188021.

22. Sabeti P.C., Reich D.E., Higgins J.M., Levine H.Z., Richter D.J., Schaffner S.F., Gabriel S.B., Platko J.V., Patterson N.J., McDonald G.J. Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002;419:832–837. doi: 10.1038/nature01140.

23. Voight B.F., Kudaravalli S., Wen X., Pritchard J.K. A map of recent positive selection in the human genome. PLoS Biol. 2006;4:e72. doi: 10.1371/journal.pbio.0040072.

24. Leutenegger A.L., Prum B., Génin E., Verny C., Lemainque A., Clerget-Darpoux F., Thompson E.A. Estimation of the inbreeding coefficient through use of genomic data. Am. J. Hum. Genet. 2003;73:516–523. doi: 10.1086/378207.

25. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., de Bakker P.I.W., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.

26. Browning S.R. Estimation of pairwise identity by descent from dense genetic marker data in a population sample of haplotypes. Genetics. 2008;178:2123–2132. doi: 10.1534/genetics.107.084624.

27. Thompson E.A. The IBD process along four chromosomes. Theor. Popul. Biol. 2008;73:369–373. doi: 10.1016/j.tpb.2007.11.011.

28. Browning S.R., Browning B.L. High-resolution detection of identity by descent in unrelated individuals. Am. J. Hum. Genet. 2010;86:526–539. doi: 10.1016/j.ajhg.2010.02.021.

29. Albrechtsen A., Sand Korneliussen T., Moltke I., van Overseem Hansen T., Nielsen F.C., Nielsen R. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet. Epidemiol. 2009;33:266–274. doi: 10.1002/gepi.20378.

30. Han L., Abney M. Identity by descent estimation with dense genome-wide genotype data. Genet. Epidemiol. 2011;35:557–567. doi: 10.1002/gepi.20606.

31. Kong A., Masson G., Frigge M.L., Gylfason A., Zusmanovich P., Thorleifsson G., Olason P.I., Ingason A., Steinberg S., Rafnar T. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 2008;40:1068–1075. doi: 10.1038/ng.216.

32. Gusev A., Lowe J.K., Stoffel M., Daly M.J., Altshuler D., Breslow J.L., Friedman J.M., Pe’er I. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19:318–326. doi: 10.1101/gr.081398.108.

33. Zhou Y., Browning S.R., Browning B.L. A fast and simple method for detecting identity by descent segments in large-scale data. Am. J. Hum. Genet. 2020;106:426–437. doi: 10.1016/j.ajhg.2020.02.010.

34. Thompson E.A. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics. 2013;194:301–326. doi: 10.1534/genetics.112.148825.

35. Chiang C.W., Ralph P., Novembre J. Conflation of Short Identity-by-Descent Segments Bias Their Inferred Length Distribution. G3 (Bethesda). 2016;6:1287–1296. doi: 10.1534/g3.116.027581.

36. Delaneau O., Zagury J.-F., Robinson M.R., Marchini J.L., Dermitzakis E.T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 2019;10:5436. doi: 10.1038/s41467-019-13225-y.

37. Williams A.L., Genovese G., Dyer T., Altemose N., Truax K., Jun G., Patterson N., Myers S.R., Curran J.E., Duggirala R., T2D-GENES Consortium. Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. eLife. 2015;4:e04637. doi: 10.7554/eLife.04637.

38. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.

39. Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv. 2019. doi: 10.1101/563866.

40. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.

41. Haller B.C., Messer P.W. SLiM 3: Forward genetic simulations beyond the Wright–Fisher model. Mol. Biol. Evol. 2019;36:632–637. doi: 10.1093/molbev/msy228.

42. Kelleher J., Etheridge A.M., McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput. Biol. 2016;12:e1004842. doi: 10.1371/journal.pcbi.1004842.

43. Haller B.C., Galloway J., Kelleher J., Messer P.W., Ralph P.L. Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol. Ecol. Resour. 2019;19:552–566. doi: 10.1111/1755-0998.12968.

44. Tennessen J.A., Bigham A.W., O’Connor T.D., Fu W., Kenny E.E., Gravel S., McGee S., Do R., Liu X., Jun G., Broad GO; Seattle GO; NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240.

45. Fu W., O’Connor T.D., Jun G., Kang H.M., Abecasis G., Leal S.M., Gabriel S., Rieder M.J., Altshuler D., Shendure J., NHLBI Exome Sequencing Project. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493:216–220. doi: 10.1038/nature11690.

46. Adrion J.R., Cole C.B., Dukler N., Galloway J.G., Gladstein A.L., Gower G., Kyriazis C.C., Ragsdale A.P., Tsambos G., Baumdicker F. A community-maintained standard library of population genetic models. eLife. 2020;9:e54967. doi: 10.7554/eLife.54967.

47. Bhérer C., Campbell C.L., Auton A. Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nat. Commun. 2017;8:14994. doi: 10.1038/ncomms14994.

48. Bersaglieri T., Sabeti P.C., Patterson N., Vanderploeg T., Schaffner S.F., Drake J.A., Rhodes M., Reich D.E., Hirschhorn J.N. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 2004;74:1111–1120. doi: 10.1086/421051.

49. Ding Q., Hu Y., Xu S., Wang J., Jin L. Neanderthal introgression at chromosome 3p21.31 was under positive natural selection in East Asians. Mol. Biol. Evol. 2014;31:683–695. doi: 10.1093/molbev/mst260.

50. Bitarello B.D., de Filippo C., Teixeira J.C., Schmidt J.M., Kleinert P., Meyer D., Andrés A.M. Signatures of long-term balancing selection in human genomes. Genome Biol. Evol. 2018;10:939–955. doi: 10.1093/gbe/evy054.

51. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.

52. Graf J., Hodgson R., van Daal A. Single nucleotide polymorphisms in the MATP gene are associated with normal human pigmentation variation. Hum. Mutat. 2005;25:278–284. doi: 10.1002/humu.20143.

53. Meyer D., Single R.M., Mack S.J., Erlich H.A., Thomson G. Signatures of demographic history and natural selection in the human major histocompatibility complex Loci. Genetics. 2006;173:2121–2142. doi: 10.1534/genetics.105.052837.

54. Ramos P.S. Population genetics and natural selection in rheumatic disease. Rheum. Dis. Clin. North Am. 2017;43:313–326. doi: 10.1016/j.rdc.2017.04.001.

55. Browning S.R., Browning B.L., Zhou Y., Tucci S., Akey J.M. Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture. Cell. 2018;173:53–61. doi: 10.1016/j.cell.2018.02.031.

56. Sams A.J., Dumaine A., Nédélec Y., Yotova V., Alfieri C., Tanner J.E., Messer P.W., Barreiro L.B. Adaptively introgressed Neandertal haplotype at the OAS locus functionally impacts innate immune responses in humans. Genome Biol. 2016;17:246. doi: 10.1186/s13059-016-1098-6.

57. Storz J.F., Payseur B.A., Nachman M.W. Genome scans of DNA variability in humans reveal evidence for selective sweeps outside of Africa. Mol. Biol. Evol. 2004;21:1800–1811. doi: 10.1093/molbev/msh192.

58. Stefansson H., Helgason A., Thorleifsson G., Steinthorsdottir V., Masson G., Barnard J., Baker A., Jonasdottir A., Ingason A., Gudnadottir V.G. A common inversion under selection in Europeans. Nat. Genet. 2005;37:129–137. doi: 10.1038/ng1508.

59. Lewis M.J., Vyse S., Shields A.M., Boeltz S., Gordon P.A., Spector T.D., Lehner P.J., Walczak H., Vyse T.J. UBE2L3 polymorphism amplifies NF-κB activation and promotes plasma cell development, linking linear ubiquitination to multiple autoimmune diseases. Am. J. Hum. Genet. 2015;96:221–234. doi: 10.1016/j.ajhg.2014.12.024.

60. Zody M.C., Jiang Z., Fung H.-C., Antonacci F., Hillier L.W., Cardone M.F., Graves T.A., Kidd J.M., Cheng Z., Abouelleil A. Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 2008;40:1076–1083. doi: 10.1038/ng.193.

61. Itan Y., Powell A., Beaumont M.A., Burger J., Thomas M.G. The origins of lactase persistence in Europe. PLoS Comput. Biol. 2009;5:e1000491. doi: 10.1371/journal.pcbi.1000491.

62. Ho M.W., Povey S., Swallow D. Lactase polymorphism in adult British natives: estimating allele frequencies by enzyme assays in autopsy samples. Am. J. Hum. Genet. 1982;34:650–657.

63. Hughes A.L., Yeager M. Natural selection at major histocompatibility complex loci of vertebrates. Annu. Rev. Genet. 1998;32:415–435. doi: 10.1146/annurev.genet.32.1.415.

64. Racimo F., Sankararaman S., Nielsen R., Huerta-Sánchez E. Evidence for archaic adaptive introgression in humans. Nat. Rev. Genet. 2015;16:359–371. doi: 10.1038/nrg3936.

65. Hider J.L., Gittelman R.M., Shah T., Edwards M., Rosenbloom A., Akey J.M., Parra E.J. Exploring signatures of positive selection in pigmentation candidate genes in populations of East Asian ancestry. BMC Evol. Biol. 2013;13:150. doi: 10.1186/1471-2148-13-150.

66. Hu J., Wang X., Xing Y., Rong E., Ning M., Smith J., Huang Y. Origin and development of oligoadenylate synthetase immune system. BMC Evol. Biol. 2018;18:201. doi: 10.1186/s12862-018-1315-x.

67. Barreiro L.B., Ben-Ali M., Quach H., Laval G., Patin E., Pickrell J.K., Bouchier C., Tichit M., Neyrolles O., Gicquel B. Evolutionary dynamics of human Toll-like receptors and their different contributions to host defense. PLoS Genet. 2009;5:e1000562. doi: 10.1371/journal.pgen.1000562.

68. Raychaudhuri S., Thomson B.P., Remmers E.F., Eyre S., Hinks A., Guiducci C., Catanese J.J., Xie G., Stahl E.A., Chen R., BIRAC Consortium; YEAR Consortium. Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk. Nat. Genet. 2009;41:1313–1318. doi: 10.1038/ng.479.

69. Zhou X.J., Lu X.L., Lv J.C., Yang H.Z., Qin L.X., Zhao M.H., Su Y., Li Z.G., Zhang H. Genetic association of PRDM1-ATG5 intergenic region and autophagy with systemic lupus erythematosus in a Chinese population. Ann. Rheum. Dis. 2011;70:1330–1337. doi: 10.1136/ard.2010.140111.

70. Best T., Li D., Skol A.D., Kirchhoff T., Jackson S.A., Yasui Y., Bhatia S., Strong L.C., Domchek S.M., Nathanson K.L. Variants at 6q21 implicate PRDM1 in the etiology of therapy-induced second malignancies after Hodgkin’s lymphoma. Nat. Med. 2011;17:941–943. doi: 10.1038/nm.2407.

71. Paust S., Gill H.S., Wang B.Z., Flynn M.P., Moseman E.A., Senman B., Szczepanik M., Telenti A., Askenase P.W., Compans R.W., von Andrian U.H. Critical role for the chemokine receptor CXCR6 in NK cell-mediated antigen-specific memory of haptens and viruses. Nat. Immunol. 2010;11:1127–1135. doi: 10.1038/ni.1953.

72. Papadakis K.A., Prehn J., Nelson V., Cheng L., Binder S.W., Ponath P.D., Andrew D.P., Targan S.R. The role of thymus-expressed chemokine and its receptor CCR9 on lymphocytes in the regional specialization of the mucosal immune system. J. Immunol. 2000;165:5069–5076. doi: 10.4049/jimmunol.165.9.5069.

73. Ellinghaus D., Degenhardt F., Bujanda L., Buti M., Albillos A., Invernizzi P., Fernandez J., Prati D., Baselli G., Asselta R. The ABO blood group locus and a chromosome 3 gene cluster associate with SARS-CoV-2 respiratory failure in an Italian-Spanish genome-wide association analysis. medRxiv. 2020. doi: 10.1101/2020.05.31.20114991.

74. Shelton J.F., Shastri A.J., Ye C., Weldon C.H., Filshtein-Somnez T., Coker D., Symons A., Esparza-Gordillo J., Aslibekyan S., Auton A. Trans-ethnic analysis reveals genetic and non-genetic associations with COVID-19 susceptibility and severity. medRxiv. 2020. doi: 10.1101/2020.09.04.20188318.

75. Mathieson I., Lazaridis I., Rohland N., Mallick S., Patterson N., Roodenberg S.A., Harney E., Stewardson K., Fernandes D., Novak M. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015;528:499–503. doi: 10.1038/nature16152.

76. Rogers M.A., Edler L., Winter H., Langbein L., Beckmann I., Schweizer J. Characterization of new members of the human type II keratin gene family and a general evaluation of the keratin gene domain on chromosome 12q13.13. J. Invest. Dermatol. 2005;124:536–544. doi: 10.1111/j.0022-202X.2004.23530.x.

77. Freathy R.M., Mook-Kanamori D.O., Sovio U., Prokopenko I., Timpson N.J., Berry D.J., Warrington N.M., Widen E., Hottenga J.J., Kaakinen M., Genetic Investigation of ANthropometric Traits (GIANT) Consortium; Meta-Analyses of Glucose and Insulin-related traits Consortium; Wellcome Trust Case Control Consortium; Early Growth Genetics (EGG) Consortium. Variants in ADCY5 and near CCNL1 are associated with fetal growth and birth weight. Nat. Genet. 2010;42:430–435. doi: 10.1038/ng.567.

78. Dupuis J., Langenberg C., Prokopenko I., Saxena R., Soranzo N., Jackson A.U., Wheeler E., Glazer N.L., Bouatia-Naji N., Gloyn A.L., DIAGRAM Consortium; GIANT Consortium; Global BPgen Consortium; Anders Hamsten on behalf of Procardis Consortium; MAGIC investigators. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 2010;42:105–116. doi: 10.1038/ng.520.

79. Ramos P.S., Shedlock A.M., Langefeld C.D. Genetics of autoimmune diseases: insights from population genetics. J. Hum. Genet. 2015;60:657–664. doi: 10.1038/jhg.2015.94.

80. Mathyer M.E., Brettmann E.A., Schmidt A.D., Goodwin Z.A., Quiggle A.M., Oh I.Y., Tycksen E., Zhou L., Estrada Y.D., Wong X.C.C. An enhancer: involucrin regulatory module impacts human skin barrier adaptation out-of-Africa and modifies atopic dermatitis risk. bioRxiv. 2019. doi: 10.1101/816520.

81. Fumagalli M., Pozzoli U., Cagliani R., Comi G.P., Bresolin N., Clerici M., Sironi M. Genome-wide identification of susceptibility alleles for viral infections through a population genetics approach. PLoS Genet. 2010;6:e1000849. doi: 10.1371/journal.pgen.1000849.

82. Mischke D., Korge B.P., Marenholz I., Volz A., Ziegler A. Genes encoding structural proteins of epidermal cornification and S100 calcium-binding proteins form a gene complex (“epidermal differentiation complex”) on human chromosome 1q21. J. Invest. Dermatol. 1996;106:989–992. doi: 10.1111/1523-1747.ep12338501.

83. Mikkelsen T., Hillier L., Eichler E., Zody M., Jaffe D., Yang S.-P., Enard W., Hellmann I., Lindblad-Toh K., Altheide T., Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87. doi: 10.1038/nature04072.

84. Horikawa Y., Iwasaki N., Hara M., Furuta H., Hinokio Y., Cockburn B.N., Lindner T., Yamagata K., Ogata M., Tomonaga O. Mutation in hepatocyte nuclear factor-1 β gene (TCF2) associated with MODY. Nat. Genet. 1997;17:384–385. doi: 10.1038/ng1297-384.

85. Gudmundsson J., Sulem P., Steinthorsdottir V., Bergthorsson J.T., Thorleifsson G., Manolescu A., Rafnar T., Gudbjartsson D., Agnarsson B.A., Baker A. Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nat. Genet. 2007;39:977–983. doi: 10.1038/ng2062.

86. Frazer K.A., Ballinger D.G., Cox D.R., Hinds D.A., Stuve L.L., Gibbs R.A., Belmont J.W., Boudreau A., Hardenbol P., Leal S.M., International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258.

87. O’Reilly P.F., Birney E., Balding D.J. Confounding between recombination and selection, and the Ped/Pop method for detecting selection. Genome Res. 2008;18:1304–1313. doi: 10.1101/gr.067181.107.

88. Wenger A.M., Peluso P., Rowell W.J., Chang P.-C., Hall R.J., Concepcion G.T., Ebler J., Fungtammasan A., Kolesnikov A., Olson N.D. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 2019;37:1155–1162. doi: 10.1038/s41587-019-0217-9.

89. Devroye L. Non-uniform random variate generation. New York: Springer-Verlag; 1986.
