Abstract
Segments of identity by descent (IBD) are used in many genetic analyses. We present a method for detecting identical-by-descent haplotype segments in phased genotype data. Our method, called hap-IBD, combines a compressed representation of haplotype data, the positional Burrows-Wheeler transform, and multi-threaded execution to produce very fast analysis times. An attractive feature of hap-IBD is its simplicity: the input parameters clearly and precisely define the IBD segments that are reported, so that program correctness can be confirmed by users. We evaluate hap-IBD and four state-of-the-art IBD segment detection methods (GERMLINE, iLASH, RaPID, and TRUFFLE) using UK Biobank chromosome 20 data and simulated sequence data. We show that hap-IBD detects IBD segments faster and more accurately than competing methods, and that hap-IBD is the only method that can rapidly and accurately detect short 2–4 centiMorgan (cM) IBD segments in the full UK Biobank data. Analysis of 485,346 UK Biobank samples through the use of hap-IBD with 12 computational threads detects 231.5 billion autosomal IBD segments with length ≥2 cM in 24.4 h.
Keywords: identity by descent, UK Biobank, haplotype
Introduction
Segments of identity by descent (IBD) are genomic regions over which a pair of individuals share a haplotype due to inheritance from a recent common ancestor. IBD segments are useful in a wide variety of applications because they capture information about genetic relationships between individuals. Correlation between pairwise IBD and phenotypic similarity can be used to detect genomic regions harboring trait-affecting variants1, 2, 3, 4, 5, 6 and to estimate heritability.7, 8, 9, 10 IBD segments are also used to estimate kinship coefficients,11 detect close relationships,12, 13, 14 and identify fine-scale population structure.15, 16, 17, 18, 19, 20
Recent demographic history can be inferred from IBD segments.15,16,21, 22, 23, 24 Populations with smaller effective population size have more IBD sharing because individuals are more closely related on average. Short segments have a larger time to the most recent common ancestor (TMRCA), and thus are informative about less recent effective size, while long segments have a smaller TMRCA and are informative about very recent effective size. Similarly, IBD segments shared between populations are informative about migration rates. Approximately the past 100 generations of demographic history can be inferred from IBD segments.23
IBD segments are also useful for estimating population genetic parameters, including mutation rates25, 26, 27, 28 and recombination rates,29 and for detecting regions undergoing recent selection.10,20,30, 31, 32 The mutation rate is estimated from the observed discordance rate in IBD haplotypes. Recombination rate maps can be estimated using the rate of IBD segment endpoints. Selection is detected by looking for genomic regions with higher rates of IBD sharing.
There are several classes of methods for detecting IBD segments. The first class of methods are probabilistic. These methods include PLINK,2 Beagle IBD,33 and others.4,10,34, 35, 36, 37, 38, 39 For these methods, the unobserved IBD status for a pair (or set) of individuals at a locus takes two (IBD/non-IBD) or more possible states. Typically, a hidden Markov model is used to infer the IBD state at each marker. In the context of a pedigree, with shared haplotypes inherited only through pedigree founders, this IBD-state approach makes sense. However, in population samples, the concept of “IBD state” is ill defined. Two haplotypes are identical by descent if they are descended from a common ancestor, which is trivially true for all pairs of haplotypes at each position in the genome.
The second class of methods, which includes all the methods presented in this paper, looks for long segments of identical-by-state allele sharing either in phased or in unphased genotype data. These identity-by-state (IBS) methods include GERMLINE40 and others.41, 42, 43, 44 In contrast to most of the probabilistic methods, these methods do not dichotomize pairwise haplotypic sharing into “IBD” and “non-IBD,” but instead dichotomize it into “long-IBD” and “not-long-IBD,” which better fits the realities of population-based IBD sharing. Ideally, reported IBD segments should primarily represent IBD from a single common ancestor, rather than a conflation of segments from multiple ancestors, and this is achieved when the length threshold is relatively long.45 A drawback to these methods is that the handling of allelic discordances within IBD segments tends to be ad hoc.
For IBS methods, the requirement that two individuals share a haplotype is more stringent than the requirement that the two individuals share at least one allele in their genotypes across a given region. Thus, haplotype-based methods can detect short IBD segments (e.g., 2–10 centiMorgans [cM] in length) with much greater accuracy than genotype-based methods can. However, haplotype-based methods can break up a long-IBD segment into a sequence of shorter IBD segments if there are haplotype phasing errors in the long-IBD segment. For some downstream applications, after detecting IBD segments, it is necessary to perform a merging step in order to recover the original long-IBD haplotype. On the other hand, genotype-based methods do not require accurately phased genotype data, and they can detect long segments (≥15 cM) with high accuracy, which is sufficient for highly accurate detection of first- and second-degree relatives.13
A third class of methods combines aspects of probabilistic modeling and length-based thresholding on IBS. Typically, these methods detect candidate long shared segments and then form a likelihood ratio for IBD versus non-IBD.11,46, 47, 48 These methods tend to be more computationally efficient than the full probabilistic methods, but unlike some of the purely length-based IBS methods, they cannot analyze biobank-scale datasets.
Although “identity by descent” implies allelic identity, in fact, positions of discordance will be observed. Causes of this discordance include mutation or gene conversion since the common ancestor, and genotype error. Probabilistic methods allow for these discordances via an error term in the modeling, whereas length-based methods allow for short, infrequent gaps in allele sharing.
Genotype error rates vary greatly across datasets. Data from two recent studies give genotype error rate estimates of 0.008 per Mb per individual in a large SNP array study49 and 25 per Mb per individual for single-nucleotide variants in a large sequencing study50 (with error rates estimated as half the discordance between duplicate samples after quality control filtering, multiplied by the average number of called and/or assayed variants per Mb). Exclusion of rare variants can decrease the genotype error rate,51 particularly for sequence data.50
With increasingly large datasets, computational issues become significant. The detection of sets of shared haplotypes can be reduced to linear computational complexity by means of hashing40,44 or by use of the positional Burrows-Wheeler transform (PBWT).43 However, the generation of pairwise IBD segments from these sets scales quadratically with sample size, because the number of pairs of individuals grows quadratically with sample size. Consequently, detecting IBD segments in biobank-scale data is challenging. As well as computation time being an issue, some algorithms require unfeasibly large amounts of computer memory to analyze such datasets.
In this work, we present hap-IBD, a method which scales to biobank-sized data, has greater accuracy than competing methods, and is notable for the simplicity of its algorithm and tuning parameters. Hap-IBD utilizes the PBWT52 and parallel computation to reduce computing time, and it uses data compression to reduce memory requirements.53 It addresses the issue of allele discordance in IBD segments by requiring that a reported segment have a central core (the “seed”) that is free of discordance, while allowing extension beyond the seed after a short gap containing discordance. The key parameters for hap-IBD are the minimum seed length, the minimum extension length, the maximum gap length, and the minimum length of reported IBD segments. These parameters directly control which IBD segments are detected and reported. The hap-IBD program is open-source and freely available for academic and commercial use.
Material and Methods
The Hap-IBD Algorithm
The hap-IBD method employs a simple seed-and-extend algorithm. A seed is an IBS segment with genetic length (i.e., length in cM) greater than a specified minimum length (2 cM by default). The hap-IBD algorithm finds all seed segments, and it extends each seed if possible (Figure 1). A seed segment is extended if there is another long-IBS segment for the same pair of haplotypes that is separated from the seed segment by a short non-IBS gap. The maximum number of base pairs between the first and last markers in the non-IBS gap and the minimum cM length of the extension IBS segment can be specified by the user, and these are 1,000 base pairs and 1 cM, respectively, by default. A segment may be extended multiple times. When it is no longer possible to extend the segment, the segment is written to the output file if its centiMorgan length is greater than a specified minimum output length (2 cM by default).
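The seed-and-extend rule described above can be illustrated for a single pair of haplotypes. The following Python code is a minimal sketch under stated assumptions, not the hap-IBD implementation: the function names, the use of "at least" rather than "greater than" at the length thresholds, and the removal of duplicate segments through a set are all illustrative choices.

```python
def ibs_runs(h1, h2):
    """Return maximal runs (start, end), inclusive marker indices, where h1 and h2 match."""
    runs, start = [], None
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a == b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(h1) - 1))
    return runs


def seed_and_extend(h1, h2, bp, cm, min_seed=2.0, min_extend=1.0,
                    max_gap=1000, min_output=2.0):
    """Report IBD segments (start, end) for one haplotype pair.

    bp and cm give the base-pair and cM position of each marker.
    Thresholds are applied as >=, which is an assumption of this sketch.
    """
    runs = ibs_runs(h1, h2)

    def cm_length(run):
        return cm[run[1]] - cm[run[0]]

    def gap_ok(left, right):
        # Base pairs between the first and last markers of the non-IBS gap.
        return bp[right[0] - 1] - bp[left[1] + 1] <= max_gap

    segments = set()  # the set removes segments reached from more than one seed
    for k, run in enumerate(runs):
        if cm_length(run) < min_seed:
            continue  # not a seed
        lo = hi = k
        # Extend left, then right, across short gaps into long enough IBS runs.
        while lo > 0 and cm_length(runs[lo - 1]) >= min_extend and gap_ok(runs[lo - 1], runs[lo]):
            lo -= 1
        while hi + 1 < len(runs) and cm_length(runs[hi + 1]) >= min_extend and gap_ok(runs[hi], runs[hi + 1]):
            hi += 1
        start, end = runs[lo][0], runs[hi][1]
        if cm[end] - cm[start] >= min_output:
            segments.add((start, end))
    return sorted(segments)
```

In this sketch a single discordant marker between a 2 cM seed and a 1.4 cM IBS run yields one merged segment, whereas two discordant markers more than 1,000 base pairs apart leave the seed unextended.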
Allowing short non-IBS gaps provides robustness to three sources of discordant alleles in IBD segments: genotype error, gene conversion, and mutation since the most recent common ancestor. Genotype error and mutation will typically introduce a single discordant allele in an IBD segment. Gene conversion will generally produce a very short interval containing one or a few discordant alleles in an IBD segment. When the phasing of the surrounding alleles is correct, the mismatching alleles on the pair of IBD haplotypes result in two IBS segments for the same pair of haplotypes, separated by a single marker, or at most a few markers in the case of gene conversion. Our method allows these breaks in IBS sharing to be detected and the IBS segments on each side of the break to be included in the same reported IBD segment. The maximum length is specified in base pairs rather than centiMorgans, because breaks due to mutation or genotype error typically involve a single marker, and breaks due to gene conversion are typically less than 1,000 base pairs.54
Two or more distinct IBS seed segments can result in the same IBD haplotype after each seed is extended. If an IBS segment that extends the seed segment to the left is itself a valid seed segment, we stop the extension process and discard the seed segment that is being extended, because the same IBD haplotype will be generated by a seed segment that occurs earlier on the chromosome.
The hap-IBD algorithm also has an optional min-markers parameter that requires seed IBS segments to have a minimum number of markers. The min-markers parameter can be useful for ensuring a minimum level of evidence for IBD in genomic regions having low marker density. When a min-markers parameter is specified, IBS segments that extend a seed are also required to have a minimum number of markers. We set the minimum number of markers in an extension to be the product of the min-markers parameter and the ratio of the minimum extension length to the minimum seed length.
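The scaling rule for extensions can be written as a formula. The symbols below are notation introduced here for illustration, and the numeric values in the example are hypothetical; the text does not state how a non-integer product is rounded.

```latex
\[
m_{\text{ext}} \;=\; m_{\text{seed}} \times \frac{\ell_{\text{ext}}}{\ell_{\text{seed}}},
\qquad
\text{e.g. } m_{\text{seed}} = 100,\;
\ell_{\text{ext}} = 1\ \text{cM},\;
\ell_{\text{seed}} = 2\ \text{cM}
\;\Rightarrow\;
m_{\text{ext}} = 100 \times \tfrac{1}{2} = 50 .
\]
```

Here $m_{\text{seed}}$ is the min-markers parameter, $\ell_{\text{seed}}$ and $\ell_{\text{ext}}$ are the minimum seed and extension lengths, and $m_{\text{ext}}$ is the resulting minimum number of markers in an extension.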
We first describe a single-threaded implementation of the preceding algorithm and then describe how the single-threaded implementation is modified to permit parallel computation.
Computationally Efficient Detection of Seed Segments
After the genotype data for a chromosome are read into memory, we apply the PBWT.52 The PBWT sweeps through the markers in chromosome order, and at each marker, it sorts the reverse haplotype prefixes in lexicographic order (the reverse haplotype prefix at the m-th marker is the sequence of alleles at markers m, m − 1, …, 1). At marker m, we generate a “divergence” array that stores the first marker of the IBS segment containing marker m for each pair of haplotypes that are adjacent after sorting.52 The divergence array is used to efficiently identify all seed IBS segments that end at marker m (see Durbin’s Algorithm 3).52 After a seed is identified, it is extended (if possible) by comparing the alleles on the two IBS haplotypes in the regions preceding and succeeding the seed segment as described above.
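The sorting and divergence-array update can be sketched in a few lines of Python. This is a minimal illustration of the PBWT update step following the structure of Durbin's published algorithm, restricted to biallelic 0/1 alleles and 0-based marker indices; it is not the hap-IBD code, and the function names are hypothetical.

```python
def pbwt_step(column, a, d, k):
    """Advance the PBWT by one marker.

    column[h] is the allele (0 or 1) of haplotype h at marker k.
    a is the haplotype order sorted by reverse prefix before marker k.
    d[i] is the first marker of the IBS match between a[i] and a[i-1].
    Returns the updated (a, d) after marker k has been processed.
    """
    p = q = k + 1  # k + 1 means "no match that includes marker k"
    a0, a1, d0, d1 = [], [], [], []
    for hap, start in zip(a, d):
        p = max(p, start)
        q = max(q, start)
        if column[hap] == 0:
            a0.append(hap)
            d0.append(p)
            p = 0
        else:
            a1.append(hap)
            d1.append(q)
            q = 0
    return a0 + a1, d0 + d1


def pbwt_arrays(haps):
    """Yield (k, a, d) after each marker k for a list of 0/1 haplotypes."""
    a = list(range(len(haps)))
    d = [0] * len(haps)
    for k in range(len(haps[0])):
        a, d = pbwt_step([h[k] for h in haps], a, d, k)
        yield k, a, d


def match_start(d, i, j):
    """First marker of the IBS match ending at the current marker for sorted positions i < j."""
    return max(d[i + 1:j + 1])
```

For haplotypes at sorted positions i < j, the match that ends at the current marker starts at the maximum divergence value over positions i + 1 through j, which is what allows long IBS segments to be enumerated without comparing all pairs directly.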
Memory-Efficient Computation
The hap-IBD program takes phased genotype data in variant call format (VCF) as input.55 As the genotype data are read into memory, the data are immediately converted to binary reference format (version 3).53 Binary reference format compresses low-frequency variants by storing only the indices of the haplotypes carrying non-major alleles. Higher-frequency variants are compressed by storing unique allele sequences in a region, along with a vector that maps haplotype indices to the allele sequence carried by the haplotype. We use binary reference format because it permits data for an entire chromosome to be stored compactly in memory, and it allows rapid queries of alleles carried by haplotypes at each marker.
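The two compression ideas can be illustrated with a small Python sketch. It shows only the logical structure described above (carrier indices for low-frequency variants, and unique allele sequences plus a haplotype-to-sequence map for a region of markers); it does not reproduce the actual byte layout of binary reference format, assumes biallelic 0/1 alleles, and uses hypothetical function names.

```python
from bisect import bisect_left


def compress_sparse(alleles):
    """Store a low-frequency marker as (major allele, sorted carrier indices)."""
    major = 1 if 2 * sum(alleles) > len(alleles) else 0
    carriers = [i for i, x in enumerate(alleles) if x != major]
    return major, carriers


def allele_sparse(major, carriers, hap):
    """Look up the allele of one haplotype by binary search over the carriers."""
    i = bisect_left(carriers, hap)
    is_carrier = i < len(carriers) and carriers[i] == hap
    return 1 - major if is_carrier else major


def compress_block(columns):
    """Store a region as unique allele sequences plus a haplotype-to-sequence map.

    columns[m][h] is the allele of haplotype h at the m-th marker of the region.
    """
    seq_index, unique, hap_to_seq = {}, [], []
    for h in range(len(columns[0])):
        seq = tuple(col[h] for col in columns)
        if seq not in seq_index:
            seq_index[seq] = len(unique)
            unique.append(seq)
        hap_to_seq.append(seq_index[seq])
    return unique, hap_to_seq


def allele_block(unique, hap_to_seq, marker, hap):
    """Look up the allele of one haplotype at one marker of a compressed region."""
    return unique[hap_to_seq[hap]][marker]
```

Both representations support direct allele lookup for a given haplotype and marker, which is the property the seed extension step relies on.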
The PBWT requires only two additional arrays of stored information, each with length equal to the number of haplotypes. Seed IBS segments are extended as soon as they are identified by the PBWT. After extension, segments that are longer than the minimum output segment length are immediately printed to an output buffer, which is flushed to disk when full. Consequently, only a limited number of IBD segments are stored in memory at any time.
Parallelization
The hap-IBD algorithm is parallelized by applying the PBWT concurrently in overlapping marker windows. If is the genetic distance between the first and last markers on the chromosome, is the minimum seed genetic length, and is the number of computational threads, we sequentially define overlapping marker windows that each have length approximately equal to cM, and that have approximately cM overlap between adjacent windows. The first window begins at the first marker on the chromosome and ends at the first marker after genetic position whose index is greater than the minimum number of markers required for an IBS seed segment. The first marker in is the first marker in that cannot be the start of a seed IBS segment contained within because the number of markers or genetic distance separating the marker from the last marker in is too small. The last marker in is the first marker that is cM away from the last marker in window . With these definitions, every seed IBS segment will be detected in at least one of the overlapping windows.
We run the PBWT algorithm in each overlapping window in parallel. When a seed IBS segment is found, we ignore the window boundaries when we extend the segment, so that the extension process is the same as for the single-threaded case. If multiple seeds result in the same maximal IBD segment after extension, we keep the maximal IBD segment generated by the first seed, and we discard the duplicate IBD segments generated by later seeds.
Input and Output Data
The input data are a VCF file55 with phased genotypes and no missing genotypes, and a PLINK-format genetic map.2 Linear interpolation is used to estimate the genetic map positions for any marker whose position is not on the genetic map. Although the use of a genetic map is recommended, hap-IBD can be used with Mb units simply by supplying a genetic map with a constant recombination rate of 1 cM per Mb.
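Linear interpolation of genetic map positions can be sketched as follows. This Python example is illustrative only: the function name is hypothetical, and the handling of markers outside the range of the map (linear extrapolation from the nearest map interval) is an assumption, since the text does not specify it.

```python
from bisect import bisect_left


def interpolate_cm(pos, map_bp, map_cm):
    """Estimate the cM position of base position pos.

    map_bp is a sorted list of base positions on the genetic map and
    map_cm gives the corresponding cM positions (at least two entries).
    """
    i = bisect_left(map_bp, pos)
    if i < len(map_bp) and map_bp[i] == pos:
        return map_cm[i]  # the marker is on the map
    # Use the interval [i - 1, i]; clamp so that positions beyond either end
    # are extrapolated from the nearest interval (assumption of this sketch).
    i = min(max(i, 1), len(map_bp) - 1)
    x0, x1 = map_bp[i - 1], map_bp[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (pos - x0) * (y1 - y0) / (x1 - x0)
```

A two-point map such as `map_bp = [0, 1_000_000]` with `map_cm = [0.0, 1.0]` corresponds to the constant rate that makes cM and Mb units coincide.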
Two output files are produced: one containing within-individual segments of homozygosity by descent (HBD) and one containing between-individual IBD segments. Each output line contains the pair of samples, the specific haplotypes, the starting base position, the ending base position, and the genetic length of the HBD or IBD segment.
Hap-IBD Parameters
The minimum seed length parameter has a large influence on computation time. Increasing the minimum seed length reduces computation time because fewer seed IBS segments will be considered. Decreasing the minimum seed length can increase power to detect short IBD segments that have discordant alleles on the pair of shared haplotypes.
The maximum gap length and minimum extension length allow reported IBD segments to contain discordant alleles due to genotype error, mutation, or gene conversion. The hap-IBD software also has an option for excluding input markers that have low minor allele counts.
The minimum markers parameter controls the minimum number of markers in IBS seed and extension segments. The number of reported IBD segments should be approximately constant throughout the genome; however, regions with low marker density can produce local spikes in the number of reported IBD segments (see Results). These spikes contain many IBS segments that satisfy the genetic length requirements, but contain relatively few markers. The spikes can be reduced or eliminated by post-processing23,56 or by requiring seed and extension IBS segments to contain a minimum number of markers.
UK Biobank Genotype Data
We downloaded the UK Biobank genotype data from the European Genome-phenome Archive57 (dataset accession: EGAD00010001497). The UK Biobank data contain 488,377 individuals and 784,256 autosomal markers.49 We excluded markers with more than 5% missing genotypes (n = 70,246), markers that had only one individual carrying a minor allele (n = 5,123), and markers that failed one or more of the UK Biobank’s batch quality control tests (n = 1,527).49 After we excluded 72,601 markers that failed one or more of these filters, there were 711,655 autosomal markers.
We then excluded 968 individuals that were identified by the UK Biobank as being outliers for their proportion of missing genotypes or proportion of heterozygous genotypes, and we excluded nine individuals that were identified by the UK Biobank as showing third-degree or closer relationships with more than 200 individuals (indicating sample contamination).49 After these exclusions, there were 487,400 individuals.
We identified parent-offspring trios using the kinship coefficients and the proportion of markers that share no alleles (IBS0) that are reported by the UK Biobank.49,58 First-degree relatives were considered to be pairs of individuals with a kinship coefficient between and . Among first-degree relatives, parent-offspring relationships were assumed to be the first-degree relative pairs with IBS0 < 0.0012. These are the same kinship coefficient and IBS0 thresholds used by the UK Biobank to identify parent-offspring relationships.49 We considered an individual to be the offspring in a parent-offspring trio if the individual had a parent-offspring relationship with exactly one male and one female individual, and if the male and female first-degree relatives were not in the set of related pairs of individuals reported by the UK Biobank, which is the set of pairs of individuals with estimated kinship coefficient greater than . In this case, we considered the male and female first-degree relatives to be the offspring’s parents. Using this procedure, we identified 1,064 parent-offspring trios.
The 1,064 trio offspring have 2,054 distinct parents. We excluded these parents from the data before phasing and IBD segment detection so that phasing accuracy in the trio offspring would more closely match phasing accuracy in unrelated individuals. After we excluded the trio parents, there were 485,346 remaining individuals. We listed the 1,064 trio offspring followed by the remaining samples in random order. By taking the corresponding number of samples from the top of this list, we created five telescoping genotype datasets that included 5,000, 15,000, 50,000, 150,000, and all 485,346 individuals. We then used Beagle 5.1 to phase each dataset.59
Because we use trio genotypes to determine the genotype phase in the offspring, we selected the 850 trios that had the lowest genotype error rate, as measured by the number of autosomal sites with Mendelian inconsistent genotypes.60 The number of inconsistent sites in the 1,064 trios ranged from 57 to 5,102 sites per trio, and the number of inconsistent sites in the 850 trios with the lowest genotype error rate ranged from 57 to 456 sites per trio. We phased the 850 trio offspring at all heterozygous genotypes for which phase could be determined from parental genotypes and Mendelian inheritance constraints (82.4% of heterozygous genotypes), and we masked genotypes at Mendelian inconsistent sites in this phased data. We used these estimated haplotypes to evaluate false-positive and false-negative rates for IBD segment detection as described below.
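The Mendelian phasing rule for a heterozygous offspring genotype can be sketched in Python. This is a simplified illustration for a biallelic marker with non-missing parental genotypes; the function name, the genotype coding (count of allele 1), and the status labels are assumptions of the sketch rather than details taken from the analysis.

```python
def phase_het_child(father, mother):
    """Phase a heterozygous child genotype from parental genotypes.

    father and mother are counts of allele 1 (0, 1, or 2).
    Returns (status, phase), where phase is (paternal_allele, maternal_allele)
    when status is "phased" and None otherwise.
    """
    def can_transmit(parent, allele):
        # A heterozygous parent can transmit either allele; a homozygous
        # parent can transmit only the allele it carries.
        return parent == 1 or parent == 2 * allele

    options = [(p, 1 - p) for p in (0, 1)
               if can_transmit(father, p) and can_transmit(mother, 1 - p)]
    if not options:
        return "inconsistent", None  # Mendelian inconsistency: mask the site
    if len(options) == 2:
        return "ambiguous", None     # both parents heterozygous
    return "phased", options[0]
```

Phase is undetermined only when both parents are heterozygous, which is why a fraction of heterozygous offspring genotypes cannot be phased from the trio alone.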
After we excluded trio parents, there were 43 remaining parent-offspring pairs who were not part of a trio in the 50,000 individual subset of the UK Biobank data. We used these 43 remaining pairs to compute the mean proportion of chromosome 20 covered by detected IBD segments in parent-offspring pairs.
Simulated Data
In order to test the performance of hap-IBD and other methods on sequence data, we generated 60 Mb of data for 50,000 individuals from a demographic model that simulates the present UK European population.47 This model has a population size of 24,000 in the distant past, a reduction to 3,000 occurring 5,000 generations ago, growth at rate 1.4% per generation starting 300 generations ago, and growth at rate 25% beginning 10 generations ago.
We used forward simulation with SLiM v3.361,62 to simulate the ancestral recombination graph for the most recent 5,000 generations. Gene conversion tracts were initiated at a rate of per base pairs (bp) per generation, and had geometrically distributed lengths with mean 300 bp, giving an overall gene conversion rate of .27,63 A constant recombination rate of was used. We then used coalescent simulation in msprime (v0.7.1) to add mutations (at rate and simulate the more distant past.64 This hybrid strategy of using SLiM and msprime enables utilization of msprime’s computational efficiency for large datasets, while incorporating biologically realistic settings, such as gene conversion, that are implemented in SLiM but not currently implemented in msprime.65 Our simulation only includes gene conversion events in the most recent 5,000 generations, but it is the more recent gene conversions that have the greatest potential impact on haplotype phase accuracy and that can create discordances between identical-by-descent haplotypes.
We determined the true IBD segments for 1,000 simulated individuals from the simulated ancestral recombination graphs. IBD segments are required to have the same ancestral node along their length, except for short breaks due to gene conversion.
We added genotype error at a rate of 0.02%, which is the error rate that produces the observed 0.04% rate of discordance at SNVs that pass quality control in the TOPMed Freeze 5 whole-genome-sequence data.50 We then removed variants with frequencies less than 10% and used Beagle 5.1 to phase the remaining genotypes.59 We also separately phased a subset of 5,000 individuals with the same minor allele frequency threshold of 10%. Low-frequency variants are not very informative for IBD because most individuals are homozygous for the major allele, and because allele discordance at low-frequency variants in IBD segments could be due to genotype error, recent mutation, or phasing error, rather than indicating that they are non-IBD variants. Other methods for IBD detection in sequence data have used a minor allele frequency filter. The application of GERMLINE to the Genomes of the Netherlands whole-genome sequence data used a minor allele frequency filter of 1%.56 The TRUFFLE analysis of 1000 Genomes sequence data used minor allele frequency filters of 5% and 10%.42
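The relationship between the per-genotype error rate and the discordance rate can be checked with a short calculation. The following is a sketch under the assumption that errors in duplicate genotype calls are independent, each occurring with probability $e$, and that the case in which both calls are wrong is negligible.

```latex
\[
P(\text{discordant}) \;\approx\; 2e(1-e) \;\approx\; 2e,
\qquad
e = 0.0002 \;\Rightarrow\; 2 \times 0.0002 \times 0.9998 \approx 0.0004 = 0.04\% .
\]
```

This agrees with estimating the error rate as half the discordance between duplicate samples.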
Parameter Settings
Each method has an option for setting the minimum length of reported IBD segments. All methods, except TRUFFLE, measure distance in cM units. For TRUFFLE, we substituted Mb units for cM units. All genetic distances are interpolated from the HapMap genetic map.66
We required all UK Biobank chromosome 20 analyses to complete within two days of wall-clock time on the compute nodes used for these analyses. Parameter settings for analysis of UK Biobank and simulated sequence data are based on previously published analyses of SNP array5,42, 43, 44 and sequence data.42,43,56 Parameter settings for each method are reported in Tables S1 and S2.
The hap-IBD parameter settings in Tables S1 and S2 are recommended parameter settings for analysis of SNP array data (Table S1) and sequence data (Table S2) that have marker densities, rates of genotype error per Mb, and effective population sizes similar to those of the data considered in this paper. We used a lower minimum seed length and extension length for analysis of the sequence data because IBS segments are shorter in the sequence data due to the much higher number of genotype errors per Mb. We use the same 2.0 cM minimum output IBD length for both SNP array data and sequence data because both datasets are derived from outbred, human populations, and previous work has shown that allele identity at this scale is an accurate proxy for identifying IBD segments in outbred populations.45 If one is analyzing data from an inbred population, then use of a longer minimum output IBD length could be appropriate due to the increased probability in an inbred population that a 2.0 cM IBS segment is actually a conflation of multiple shorter IBD segments.
Comparison of Methods
For the simulated data, coalescent trees for 1,000 simulated samples were used to determine true IBD segments exceeding 1.5 cM in length for those samples. For the UK Biobank data, we considered true IBD segments to be IBS segments exceeding 1.5 cM in length among the 850 trio offspring which were phased using parental genotypes and Mendelian inheritance rules.
False-Positive Rate Estimation
We divided detected IBD segments into bins according to the detected segment length (2–3, 3–4, 4–6, 6–10, 10–18, and >18 cM). For each detected IBD segment, we identified the cM length of the portion of the detected segment that is not covered by any true IBD segment with length >1.5 cM, and we calculated the sum of these false-positive segment lengths. The false-positive rate for a bin is the sum of the false-positive segment lengths divided by the sum of the detected segments’ lengths.
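The false-positive rate calculation reduces to measuring how much of each detected segment is left uncovered by true segments. The Python code below is a minimal sketch for the segments of a single haplotype pair, with segments given as (start, end) intervals in cM; the function names are hypothetical, and binning by detected length is omitted for brevity.

```python
def uncovered_length(segment, cover):
    """Length of segment = (start, end) not covered by the union of intervals in cover."""
    start, end = segment
    uncovered, cursor = 0.0, start
    for s, e in sorted(cover):
        if e <= cursor:
            continue          # interval lies entirely before the cursor
        if s >= end:
            break             # remaining intervals start after the segment
        if s > cursor:
            uncovered += s - cursor
        cursor = max(cursor, e)
        if cursor >= end:
            break
    if cursor < end:
        uncovered += end - cursor
    return uncovered


def false_positive_rate(detected, true_segments, min_true=1.5):
    """Sum of uncovered detected length divided by the sum of detected length."""
    long_true = [t for t in true_segments if t[1] - t[0] > min_true]
    total = sum(e - s for s, e in detected)
    false_positive = sum(uncovered_length(seg, long_true) for seg in detected)
    return false_positive / total
```

The false-negative rate has the same form with the roles reversed: the uncovered length of each true segment is measured against the detected segments that pass the output length threshold.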
False-Negative Rate Estimation
We divided true IBD segments into bins according to the true segment length. The length bins and number of true UK Biobank IBD segments in each length bin are: 2.5–3 cM (2,492), 3–4 cM (1,360), 4–6 cM (551), 6–10 cM (160), 10–18 cM (55), and >18 cM (64). For each true IBD segment, we identified the cM length of the portion of the true segment that is not covered by any detected IBD segment with length ≥2.0 cM, and we calculated the sum of these false-negative segment lengths. The false-negative rate for the bin is the sum of the false-negative segment lengths divided by the sum of the true segments’ lengths.
ROC Analysis
In order to account for inter-method differences in determining IBD end-points, differences which affect the reported lengths of IBD segments, we calculated false-positive and false-negative rates for each method over a range of detected segment length thresholds (1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, and 2.4 cM). We calculated the false-positive rate for each threshold as described above using all true segments having length >1.5 cM, and we calculated the false-negative rate as described above using all true segments having length >2.5 cM. For each method, we then generated a receiver operating characteristic (ROC) curve that shows the true-positive rate (which is one minus the false-negative rate) and false-positive rate for each detected segment length threshold. For length thresholds <2.0 cM, the hap-IBD minimum seed length was set to 1.6 cM.
We also generated ROC curves for each method for 5 cM segments. For this analysis, we used detected segment length thresholds of 4.6, 4.8, 5.0, 5.2, and 5.4 cM. We calculated false-positive rates using all true segments having length >1.5 cM, and we calculated false-negative rates using all true segments having length >5.5 cM.
Computation Time
All analyses were run on a 12-core 2.6 GHz computer with Intel Xeon E5-2630 processors and 128 GB of memory. Computation time was measured using the Unix time command, which returns a “real,” a “system,” and a “user” time. The wall-clock time is the “real” time, which is the length of time the program was running. The central processing unit (CPU) time is the sum of the “system” and “user” times. For multi-threaded compute jobs, the CPU time includes the sum of the CPU time for each computational thread, so that it represents the total CPU resources consumed by the program. A maximum of 2 days of wall-clock time was allowed for each analysis of UK Biobank chromosome 20 data or of the 60 Mb of simulated sequence data, with no results reported if the analysis did not complete within this time frame.
Results
Computational Feasibility
Figure 2 shows CPU times for subsets of the UK Biobank chromosome 20 data (5,000 to 485,346 individuals) for 2 cM and 5 cM output length thresholds. For the full UK Biobank data with 485,346 individuals, hap-IBD detected 3.43 billion IBD segments on chromosome 20 at the 2 cM threshold, and 106 million segments at the 5 cM threshold. GERMLINE could not analyze subsets of 50,000 or more individuals because it required more than 128 GB of memory. TRUFFLE could not analyze subsets of 150,000 or more individuals on our compute nodes. iLASH could not analyze subsets of 150,000 or more individuals at the 2 cM output threshold because it needed more than 128 GB of memory, but it could analyze the full dataset at the 5 cM output threshold. RaPID could not analyze the full chromosome 20 dataset at the 2 cM threshold within the permitted two days of wall-clock time, but it could analyze the full dataset at the 5 cM output threshold.
Three of the five methods can use multiple computational threads, and running these methods on a 12-core computer leads to an approximate 10-fold reduction in computing time compared to single-threaded analysis (Figure S1). This degree of speedup is important for analysis of large datasets. For example, the single-threaded RaPID program required 223.6 min of wall-clock time to output IBD segments ≥5 cM for all samples on chromosome 20, but hap-IBD required only 13.4 min when using 12 computational threads.
Overall, we see that hap-IBD is the fastest program except when analyzing the smallest sample size (5,000 individuals) using the largest output threshold (5 cM output threshold); for this combination, iLASH is faster. In our experiments, hap-IBD was the only method that could analyze the full UK Biobank chromosome 20 data on our compute servers in less than 2 days when using a 2 cM output length threshold.
We also performed a genome-wide analysis of the 22 autosomes for the UK Biobank data. Genome-wide analysis of 485,346 UK Biobank samples using hap-IBD with 12 computational threads detected 231.5 billion autosomal IBD segments in 24.4 h.
Accuracy
Some methods have a low false-positive rate (Figure 3 and Figure S2) but a high false-negative rate (Figure 4 and Figure S3), or vice versa. The methods apply different algorithms for determining the end points of IBD segments, and this results in different methods reporting different lengths for a true IBD segment. Because false-positive rates and false-negative rates can be traded off by changing the output length threshold, we constructed ROC curves by varying the output IBD segment length threshold for each method in order to assess the true-positive versus false-positive trade-off. The true-positive rate is one minus the false-negative rate. An ideal method would have a true-positive rate of 1 and a false-positive rate of 0. For 2 cM IBD segments (Figure 5) and for 5 cM IBD segments (Figure S4) hap-IBD shows the best performance on these ROC curves. In particular, hap-IBD has much lower false-positive rates than RaPID and much higher true-positive rates than iLASH. The IBD segment detection method for unphased genotype data (TRUFFLE) has high error rates for these short IBD segments.
We also investigated the proportion of chromosome 20 in parent-offspring pairs that was covered by detected IBD segments with length ≥2 cM in 43 parent-offspring pairs in the set of 50,000 UK Biobank samples. The proportions were 0.978 for iLASH, 0.987 for hap-IBD, 0.994 for RaPID, and 1.0 for TRUFFLE. GERMLINE was not evaluated because it could not analyze 50,000 individuals on our compute server. All methods detected IBD across all or nearly all of the chromosome in the parent-offspring pairs. For haplotype-based methods, the methods with higher false-positive rates (Figure 3) detected slightly higher numbers of IBD segments in the parent-offspring pairs. Genotype-based methods are not affected by haplotype phase errors, and the genotype-based method (TRUFFLE) had the highest detection rate for these chromosome-length shared haplotypes.
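A coverage proportion of this kind can be computed by taking the union of the detected segments for a pair and dividing by the chromosome length. The sketch below assumes segments are given as (start, end) coordinates in the same units as the chromosome bounds; the function name `covered_fraction` is illustrative and is not part of any of the evaluated programs.

```python
from typing import Iterable, Tuple


def covered_fraction(
    segments: Iterable[Tuple[float, float]],
    chrom_start: float,
    chrom_end: float,
) -> float:
    """Fraction of [chrom_start, chrom_end] covered by the union of segments."""
    covered = 0.0
    cur_start = cur_end = None
    for start, end in sorted(segments):
        # Clip each segment to the chromosome bounds.
        start, end = max(start, chrom_start), min(end, chrom_end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent segment: extend the merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (chrom_end - chrom_start)
```

Merging overlapping segments before summing matters here because a pair can have several overlapping reported segments, and a naive sum of lengths could exceed the chromosome length.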
We examined the effect of varying the hap-IBD min-seed and min-extend parameters when detecting 5 cM segments in the UK Biobank chromosome 20 data for 5,000 individuals. Accuracy was unchanged when using min-seed values between 1 cM and 3 cM (Figure S5). The false-positive rate doubled when the min-extend parameter decreased from 1.0 cM to 0.1 cM (Figure S6), although the absolute increase in the false-positive rate is small (approximately 0.5%).
In genome-wide analysis of the UK Biobank data, we found regions in which IBD detection methods reported inflated levels of IBD segments. These are generally regions with large gaps in marker coverage or very low marker density, and they often occur around centromeres. Figure 6 shows results for chromosomes 1 and 20 for the methods with the highest accuracy for short IBD segments (the four haplotype-based methods) in the UK Biobank data for 5,000 individuals. Around the chromosome 1 centromere, the methods found IBD segments at a rate 40 to 3,000 times greater than the baseline level. The inflation was worse for RaPID and iLASH than for GERMLINE and hap-IBD. Figure S7 shows that the inflated detection can be reduced by increasing hap-IBD’s min-markers parameter. However, the use of overly high values of this parameter will reduce power to detect short IBD segments. Alternatively, regions with high rates of IBD segment discovery can be identified after IBD segment detection and excluded.67
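One simple way to identify regions with inflated IBD rates after detection is to count the segments that overlap each fixed-size window and flag windows whose count greatly exceeds a baseline. The sketch below is an assumption-laden illustration rather than the procedure of any cited method: the window size, the use of the median window count as the baseline, the `fold` cutoff, and the function name `flag_inflated_windows` are all choices made here for the example.

```python
import math
from statistics import median
from typing import Iterable, List, Tuple


def flag_inflated_windows(
    segments: Iterable[Tuple[float, float]],
    region_start: float,
    region_end: float,
    window: float,
    fold: float = 10.0,
) -> List[Tuple[float, int]]:
    """Return (window_start, count) for windows whose segment count
    exceeds `fold` times the median window count."""
    n = math.ceil((region_end - region_start) / window)
    counts = [0] * n
    for start, end in segments:
        # Indices of the first and last windows touched by this segment,
        # clipped to the region.
        first = max(0, int((start - region_start) // window))
        last = min(n - 1, int((end - region_start) // window))
        for i in range(first, last + 1):
            counts[i] += 1
    baseline = median(counts)
    if baseline <= 0:
        return []  # no usable baseline; nothing is flagged
    return [
        (region_start + i * window, counts[i])
        for i in range(n)
        if counts[i] > fold * baseline
    ]
```

Segments overlapping the flagged windows could then be trimmed or dropped before downstream analysis; the appropriate cutoff depends on the data and should be chosen by inspecting the per-window counts.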
We also assessed accuracy by using simulated sequence data. There are several important differences between the UK Biobank analysis and the simulated sequence data analysis. First, the approach to assessing accuracy differs. In the UK Biobank data, we determined the true phase of trio offspring, and we used that to determine identity by state at the haplotype level, which we used as a proxy for true IBD. The genotype error rate was extremely low in these data (with a duplicate discordance rate of ),49 but genotype errors can disrupt both the true IBD and the estimated IBD in the UK Biobank analysis. In contrast, in the simulated data, the true IBD status was obtained directly from the simulation (defined as no change in common ancestor across a segment except in tracts of gene conversion), and mis-called alleles may have disrupted the estimated IBD but did not affect the ascertainment of true IBD. Second, the marker density was much higher for the simulated sequence data. Although we removed markers with minor allele frequency <10% (see Material and Methods), the marker density was still five times greater than that of the UK Biobank data (97,890 markers with minor allele frequency ≥10% in the simulated 60 Mb region, compared with 18,424 total UK Biobank markers on chromosome 20). Third, the genotype error rate in the simulated sequence data was much higher than for the UK Biobank data. With current technology, error rates tend to be higher for sequence data than for SNP array data, even with high sequence coverage and careful processing. We added genotype error to the simulated sequence data at a rate that generates the level of duplicate discordance observed in the TOPMed data, which is for SNVs passing quality control.50 This level of duplicate discordance is six times higher than for the UK Biobank SNP data. 
There are also important similarities between the two analyses, which include the length of the region (approximately 60 Mb for the simulated analysis and for the UK Biobank chromosome 20 analysis), large sample size (up to 50,000 for the simulated data and up to 485,346 for the UK Biobank data), and demographic history (UK-like simulation versus actual UK population).
In the simulated sequence data, we used settings which the authors of the GERMLINE, RaPID, and TRUFFLE methods have used in published analyses of sequence data, while for iLASH, we used the same settings as for SNP array data (see Material and Methods for details). To compare accuracy, we produced ROC curves for detection of 2 cM segments. We considered sample sizes of 5,000 (Figure 7) and 50,000 individuals (Figure S8). We found that hap-IBD and GERMLINE have a very similar accuracy profile (for the 5,000 individuals only, because GERMLINE could not analyze the 50,000 individuals with the available computer memory). iLASH had reduced power to detect 2 cM IBD segments, while TRUFFLE had very low power to detect these segments, and RaPID had a high false-positive rate. Overall, these results are similar to those seen in the UK Biobank analysis, except that the relative accuracy of GERMLINE is improved in these simulated sequence data. The parameters that we used for GERMLINE in the simulated sequence analysis may be a better match for these data than were the parameters that we used for the UK Biobank data, although we used published parameter settings in both instances.
Discussion
We have presented an IBD segment detection method for large-scale genotype data that is substantially faster and more accurate than four state-of-the-art competing methods (GERMLINE, iLASH, RaPID, and TRUFFLE). We applied hap-IBD to 485,346 samples from the UK Biobank49 and detected 231.5 billion autosomal IBD segments having length ≥2 cM in 24.4 h of wall-clock time on a compute server with 12 CPU cores.
An attractive feature of hap-IBD is its simplicity. All seed IBS segments that exceed a specified length are identified and then extended if possible. The extension process allows for sporadic non-IBS alleles due to mutation, genotype error, or gene conversion. The hap-IBD parameters define the minimum length of IBS seed and extension segments and the maximum length of non-IBS gaps. These parameters have a simple and direct relationship to the IBD segments that are reported, and this enables the correctness of the output results to be confirmed. In contrast, some methods utilize a large number of tuning parameters which have only an indirect relationship to output IBD segments, such as iLASH’s seven parameters for controlling locality-sensitive hashing: perm_count, shingle_size, shingle_overlap, bucket_count, match_threshold, interest_threshold, and minhash_threshold.44
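The seed-and-extend logic described above can be illustrated for a single pair of haplotypes. This is a simplified sketch and not the hap-IBD implementation: it scans one pair directly instead of using the PBWT, it measures every parameter (including the gap) in cM, and the function names, argument names, and default values are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def ibs_runs(h1: Sequence[int], h2: Sequence[int]) -> List[Tuple[int, int]]:
    """Maximal runs of consecutive markers (inclusive indices) at which the
    two haplotypes carry the same allele. Assumes equal-length inputs."""
    runs, start = [], None
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a == b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(h1) - 1))
    return runs


def seed_and_extend(
    h1: Sequence[int],
    h2: Sequence[int],
    cm: Sequence[float],
    min_seed: float = 2.0,
    min_extend: float = 1.0,
    max_gap: float = 0.05,
    min_output: float = 2.0,
) -> List[Tuple[int, int]]:
    """Return reported segments as (first_marker, last_marker) index pairs."""
    runs = ibs_runs(h1, h2)

    def length(run: Tuple[int, int]) -> float:
        return cm[run[1]] - cm[run[0]]

    segments = set()
    for k, run in enumerate(runs):
        if length(run) < min_seed:
            continue  # too short to act as a seed
        lo = hi = k
        # Extend left across short non-IBS gaps into sufficiently long IBS runs.
        while (lo > 0 and length(runs[lo - 1]) >= min_extend
               and cm[runs[lo][0]] - cm[runs[lo - 1][1]] <= max_gap):
            lo -= 1
        # Extend right in the same way.
        while (hi < len(runs) - 1 and length(runs[hi + 1]) >= min_extend
               and cm[runs[hi + 1][0]] - cm[runs[hi][1]] <= max_gap):
            hi += 1
        start, end = runs[lo][0], runs[hi][1]
        if cm[end] - cm[start] >= min_output:
            segments.add((start, end))
    return sorted(segments)
```

Because every reported segment is fully determined by the seed, extension, gap, and output thresholds, a user can check any output segment by inspecting the two haplotypes directly, which is the property the text highlights.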
The hap-IBD method shares some similarities with the GERMLINE method: both methods search for long IBS segments via a seed-and-extend algorithm, and both methods allow for the presence of some discordant alleles in a reported IBD segment.40 However, hap-IBD achieves much greater computational efficiency and greater accuracy than GERMLINE does, because hap-IBD employs the PBWT instead of a hash table, and hap-IBD identifies seeds that exceed a specified genetic length rather than a specified number of markers.
In our tests, hap-IBD consistently required less CPU time than competing methods did. One factor contributing to hap-IBD’s fast computation time is that it only considers shared haplotype segments that exceed a relatively long minimum seed length, and these segments are efficiently detected using the PBWT. The hap-IBD method also includes internal parallelization that can yield wall-clock compute times that are a fraction of the total CPU time on multi-core processors.
The hap-IBD method requires phased genotype data. In practice, nearly all large genotype datasets are phased because phased data are required to obtain the highest accuracy for many downstream analyses, including IBD segment detection, relationship inference,13 local ancestry inference,68 population demography inference,21, 22, 23 and detection of selection.10 Phased data are also required for computationally efficient and accurate genotype imputation.53,69 With state-of-the-art methods, the effort and computational cost required to phase large datasets is modest when using a small compute cluster. We phased the UK Biobank genome-wide data with Beagle 5.1 in less than two days using 16 compute servers, each with 20 CPU cores.
Our results confirm that IBD segment detection methods for phased genetic data can detect much shorter IBD segments than can methods for unphased genetic data. In our tests, the method for unphased data (TRUFFLE) could not accurately detect segments with length <10 cM, but most methods for phased data could accurately detect IBD segments with length ≥2 cM (Figure 5). Furthermore, haplotype-based methods identify the shared allele sequence, whereas genotype-based methods cannot identify the shared allele when the individuals both carry an identical heterozygous genotype. However, genotype-based IBD detection methods, such as TRUFFLE, have some advantages. Genotype-based methods can detect first- and second-degree relationships with high accuracy before haplotypes are estimated,13,42 and this can be useful during initial data quality control. In addition, genotype-based methods can be used when genotype data cannot be accurately phased due to non-uniform marker coverage or high rates of genotype error, as can be the case for exome and low-coverage sequence data. If the sample size is less than 50,000, if one is interested only in accurately detecting long (≥15 cM) IBD segments, and if haplotypes have not been estimated, then TRUFFLE can detect long IBD segments in less CPU time than is required for phasing and haplotype-based IBD segment detection. The TRUFFLE analysis of the chromosome 20 UK Biobank data for 50,000 individuals required 4.4 times less CPU time than the combined Beagle 5.1 phasing and hap-IBD analysis (461 min versus 2,014 min). However, pairwise IBD detection scales quadratically with sample size, while phasing with Beagle 5.1 scales approximately linearly. Thus, an extrapolated TRUFFLE CPU time for all 485,346 UK Biobank individuals is more than twice as long as the total CPU time for Beagle 5.1 phasing and hap-IBD analysis (43,447 min versus 20,815 min).
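The scaling argument can be checked with a short calculation. The sketch below assumes pure quadratic scaling in the number of individuals, which is a simplification; the function name `extrapolate_quadratic` is illustrative. Under this assumption the extrapolated TRUFFLE CPU time comes out within a fraction of a percent of the quoted 43,447 min.

```python
def extrapolate_quadratic(cpu_min: float, n_from: int, n_to: int) -> float:
    """Scale a CPU time measured on n_from samples to n_to samples,
    assuming cost grows with the square of the sample size."""
    return cpu_min * (n_to / n_from) ** 2


if __name__ == "__main__":
    # Ratio at 50,000 individuals: combined phasing + hap-IBD versus TRUFFLE.
    print(round(2014 / 461, 1))
    # Quadratic extrapolation of the TRUFFLE CPU time to 485,346 individuals.
    print(round(extrapolate_quadratic(461, 50_000, 485_346)))
    # Ratio at 485,346 individuals using the figures quoted in the text.
    print(round(43_447 / 20_815, 2))
```

The crossover arises because a roughly 9.7-fold increase in sample size multiplies a quadratic cost by about 94, whereas the combined phasing and hap-IBD CPU time quoted above grows only about 10-fold (2,014 min to 20,815 min).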
The hap-IBD method performs well across a range of haplotype switch error rates. In the UK Biobank data, the switch error rate for 5,000 samples is more than an order of magnitude higher than the switch error rate for 485,346 samples.60 However, even for the 5,000-sample subset of the UK Biobank data, the IBD-detection accuracy is very high and is sufficient to identify close relatives in the data. Furthermore, one can increase the accuracy of phase estimates in small samples by phasing the samples together with a reference panel of sequenced individuals.70
Segments of IBD are genomic regions over which the pair of haplotypes share a recent ancestor. The expected length of the segment is related to the number of generations to the common ancestor, so the genetic length of the segment acts as a proxy for how recent the ancestry is. When using a genetic length threshold to determine which segments to report, as is done in this paper, the choice of threshold determines the degree of recentness. If the length threshold is increased, only more recent ancestry will be included, while if the threshold is decreased, slightly less recent ancestry will be included. In this paper, we used a 2 cM threshold. Segments of IBD resulting from shared ancestry 25 generations ago have an expected length of 2 cM, while the typical 2 cM segment has a common ancestor around 50–100 generations ago,16 because there are many more segments with less recent ancestry, and some of these are by chance relatively long for their age and pass the length threshold. Thus “recent” in the context of this paper essentially means “within the past 100 generations.” One problem with the length-based approach is that informativeness of the genotype data varies considerably across the genome, primarily due to gaps in marker coverage around repetitive regions, leading to increased false-positive detection in some regions. We found that all of the methods investigated in this paper are affected by this issue. Reported IBD segments in these regions typically do not result from recent shared ancestry. Previous attempts to address this issue of variable informativeness across the genome include methods that are based on haplotype frequency modeling, such as fastIBD11 and RefinedIBD;46 however, these methods do not scale to analysis of large datasets.
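The relationship between segment age and length can be made concrete with a small calculation. The sketch below uses the standard approximation that two haplotypes with a common ancestor g generations ago are separated by 2g meioses, so that segment length is roughly exponential with mean 100/(2g) cM; this exponential model and the function names are assumptions made for illustration and are not taken from the analyses in this paper.

```python
import math


def expected_length_cm(generations: float) -> float:
    """Expected IBD segment length in cM for a common ancestor
    `generations` ago (2 * generations meioses separate the haplotypes)."""
    return 100.0 / (2.0 * generations)


def prob_longer_than(threshold_cm: float, generations: float) -> float:
    """Probability that a segment of this age exceeds threshold_cm,
    under the assumed exponential length distribution."""
    return math.exp(-threshold_cm / expected_length_cm(generations))


if __name__ == "__main__":
    print(expected_length_cm(25))          # 2.0 cM, matching the text
    print(expected_length_cm(100))         # 0.5 cM
    print(prob_longer_than(2.0, 100))      # exp(-4), a small but nonzero tail
```

Under this model, segments from 100 generations ago average only 0.5 cM, yet a small fraction exceed 2 cM; because such older segments are far more numerous, they can make up much of what passes a 2 cM threshold, consistent with the reasoning in the paragraph above.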
A general limitation of IBD segment detection methods that rely on IBS is that there is some degree of error in determination of segment endpoints. The IBS interval can extend beyond the endpoints of a contained IBD segment. Consequently, IBD detection methods that report the full IBS interval will often overextend the IBD segment ends. Such methods can also miss some regions at the end of IBD segments when genotype error, mutation, or gene conversion near the end of the IBD segment causes the IBS segment to end before the actual end of the IBD segment. If the genetic distance between the truncated end of the IBS segment and the true end of the IBD region is short, it is not possible to determine with confidence whether or not the IBD segment extends past the end of the IBS segment. Development of IBD segment detection methods that are robust to genotype error, recent mutation, and gene conversion that occur near the ends of IBD segments is an area for future research.
Declaration of Interests
The authors declare no competing interests.
Acknowledgments
Research reported in this publication was supported by the National Human Genome Research Institute and National Institute of General Medical Sciences of the National Institutes of Health under award numbers HG008359, HG005701, and GM075091. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research has been conducted using the UK Biobank Resource under Application Number 19934.
Published: March 12, 2020
Footnotes
Supplemental Data can be found online at https://doi.org/10.1016/j.ajhg.2020.02.010.
Web Resources
The European Genome-Phenome Archive, https://www.ebi.ac.uk/ega/home
References
- 1. Houwen R.H., Baharloo S., Blankenship K., Raeymaekers P., Juyn J., Sandkuijl L.A., Freimer N.B. Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis. Nat. Genet. 1994;8:380–386. doi: 10.1038/ng1294-380.
- 2. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
- 3. Kenny E.E., Gusev A., Riegel K., Lütjohann D., Lowe J.K., Salit J., Maller J.B., Stoffel M., Daly M.J., Altshuler D.M. Systematic haplotype analysis resolves a complex plasma plant sterol locus on the Micronesian Island of Kosrae. Proc. Natl. Acad. Sci. USA. 2009;106:13886–13891. doi: 10.1073/pnas.0907336106.
- 4. Moltke I., Albrechtsen A., Hansen T.V., Nielsen F.C., Nielsen R. A method for detecting IBD regions simultaneously in multiple individuals--with applications to disease genetics. Genome Res. 2011;21:1168–1180. doi: 10.1101/gr.115360.110.
- 5. Gusev A., Kenny E.E., Lowe J.K., Salit J., Saxena R., Kathiresan S., Altshuler D.M., Friedman J.M., Breslow J.L., Pe’er I. DASH: a method for identical-by-descent haplotype mapping uncovers association with recent variation. Am. J. Hum. Genet. 2011;88:706–717. doi: 10.1016/j.ajhg.2011.04.023.
- 6. Browning S.R., Thompson E.A. Detecting rare variant associations by identity-by-descent mapping in case-control studies. Genetics. 2012;190:1521–1531. doi: 10.1534/genetics.111.136937.
- 7. Price A.L., Helgason A., Thorleifsson G., McCarroll S.A., Kong A., Stefansson K. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 2011;7:e1001317. doi: 10.1371/journal.pgen.1001317.
- 8. Zuk O., Hechter E., Sunyaev S.R., Lander E.S. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. USA. 2012;109:1193–1198. doi: 10.1073/pnas.1119675109.
- 9. Browning S.R., Browning B.L. Identity-by-descent-based heritability analysis in the Northern Finland Birth Cohort. Hum. Genet. 2013;132:129–138. doi: 10.1007/s00439-012-1230-y.
- 10. Palamara P.F., Terhorst J., Song Y.S., Price A.L. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat. Genet. 2018;50:1311–1317. doi: 10.1038/s41588-018-0177-x.
- 11. Browning B.L., Browning S.R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 2011;88:173–182. doi: 10.1016/j.ajhg.2011.01.010.
- 12. Huff C.D., Witherspoon D.J., Simonson T.S., Xing J., Watkins W.S., Zhang Y., Tuohy T.M., Neklason D.W., Burt R.W., Guthery S.L. Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res. 2011;21:768–774. doi: 10.1101/gr.115972.110.
- 13. Ramstetter M.D., Dyer T.D., Lehman D.M., Curran J.E., Duggirala R., Blangero J., Mezey J.G., Williams A.L. Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. Genetics. 2017;207:75–82. doi: 10.1534/genetics.117.1122.
- 14. Qiao Y., Sannerud J., Basu-Roy S., Hayward C., Williams A.L. Distinguishing pedigree relationships using multi-way identical by descent sharing and sex-specific genetic maps. bioRxiv. 2019 doi: 10.1016/j.ajhg.2020.12.004.
- 15. Gusev A., Palamara P.F., Aponte G., Zhuang Z., Darvasi A., Gregersen P., Pe’er I. The architecture of long-range haplotypes shared within and across populations. Mol. Biol. Evol. 2012;29:473–486. doi: 10.1093/molbev/msr133.
- 16. Ralph P., Coop G. The geography of recent genetic ancestry across Europe. PLoS Biol. 2013;11:e1001555. doi: 10.1371/journal.pbio.1001555.
- 17. Fu W., Browning S.R., Browning B.L., Akey J.M. Robust inference of identity by descent from exome-sequencing data. Am. J. Hum. Genet. 2016;99:1106–1116. doi: 10.1016/j.ajhg.2016.09.011.
- 18. Han E., Carbonetto P., Curtis R.E., Wang Y., Granka J.M., Byrnes J., Noto K., Kermany A.R., Myres N.M., Barber M.J. Clustering of 770,000 genomes reveals post-colonial population structure of North America. Nat. Commun. 2017;8:14238. doi: 10.1038/ncomms14238.
- 19. Taylor A.R., Schaffner S.F., Cerqueira G.C., Nkhoma S.C., Anderson T.J.C., Sriprawat K., Pyae Phyo A., Nosten F., Neafsey D.E., Buckee C.O. Quantifying connectivity between local Plasmodium falciparum malaria parasite populations using identity by descent. PLoS Genet. 2017;13:e1007065. doi: 10.1371/journal.pgen.1007065.
- 20. Henden L., Lee S., Mueller I., Barry A., Bahlo M. Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. PLoS Genet. 2018;14:e1007279. doi: 10.1371/journal.pgen.1007279.
- 21. Palamara P.F., Lencz T., Darvasi A., Pe’er I. Length distributions of identity by descent reveal fine-scale demographic history. Am. J. Hum. Genet. 2012;91:809–822. doi: 10.1016/j.ajhg.2012.08.030.
- 22. Palamara P.F., Pe’er I. Inference of historical migration rates via haplotype sharing. Bioinformatics. 2013;29:i180–i188. doi: 10.1093/bioinformatics/btt239.
- 23. Browning S.R., Browning B.L. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am. J. Hum. Genet. 2015;97:404–418. doi: 10.1016/j.ajhg.2015.07.012.
- 24. Browning S.R., Browning B.L., Daviglus M.L., Durazo-Arvizu R.A., Schneiderman N., Kaplan R.C., Laurie C.C. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 2018;14:e1007385. doi: 10.1371/journal.pgen.1007385.
- 25. Narasimhan V.M., Rahbari R., Scally A., Wuster A., Mason D., Xue Y., Wright J., Trembath R.C., Maher E.R., van Heel D.A. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat. Commun. 2017;8:303. doi: 10.1038/s41467-017-00323-y.
- 26. Campbell C.D., Chong J.X., Malig M., Ko A., Dumont B.L., Han L., Vives L., O’Roak B.J., Sudmant P.H., Shendure J. Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 2012;44:1277–1281. doi: 10.1038/ng.2418.
- 27. Palamara P.F., Francioli L.C., Wilton P.R., Genovese G., Gusev A., Finucane H.K., Sankararaman S., Sunyaev S.R., de Bakker P.I., Wakeley J., Genome of the Netherlands Consortium. Leveraging distant relatedness to quantify human mutation and gene-conversion rates. Am. J. Hum. Genet. 2015;97:775–789. doi: 10.1016/j.ajhg.2015.10.006.
- 28. Tian X., Browning B.L., Browning S.R. Estimating the Genome-wide Mutation Rate with Three-Way Identity by Descent. Am. J. Hum. Genet. 2019;105:883–893. doi: 10.1016/j.ajhg.2019.09.012.
- 29. Zhou Y., Browning B.L., Browning S. Population-specific recombination maps from segments of identity by descent. bioRxiv. 2019 doi: 10.1016/j.ajhg.2020.05.016.
- 30. Albrechtsen A., Moltke I., Nielsen R. Natural selection and the distribution of identity-by-descent in the human genome. Genetics. 2010;186:295–308. doi: 10.1534/genetics.110.113977.
- 31. Cai Z., Camp N.J., Cannon-Albright L., Thomas A. Identification of regions of positive selection using Shared Genomic Segment analysis. Eur. J. Hum. Genet. 2011;19:667–671. doi: 10.1038/ejhg.2010.257.
- 32. Han L., Abney M. Using identity by descent estimation with dense genotype data to detect positive selection. Eur. J. Hum. Genet. 2013;21:205–211. doi: 10.1038/ejhg.2012.148.
- 33. Browning S.R., Browning B.L. High-resolution detection of identity by descent in unrelated individuals. Am. J. Hum. Genet. 2010;86:526–539. doi: 10.1016/j.ajhg.2010.02.021.
- 34. Leutenegger A.-L., Prum B., Génin E., Verny C., Lemainque A., Clerget-Darpoux F., Thompson E.A. Estimation of the inbreeding coefficient through use of genomic data. Am. J. Hum. Genet. 2003;73:516–523. doi: 10.1086/378207.
- 35. Browning S.R. Estimation of pairwise identity by descent from dense genetic marker data in a population sample of haplotypes. Genetics. 2008;178:2123–2132. doi: 10.1534/genetics.107.084624.
- 36. Albrechtsen A., Sand Korneliussen T., Moltke I., van Overseem Hansen T., Nielsen F.C., Nielsen R. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet. Epidemiol. 2009;33:266–274. doi: 10.1002/gepi.20378.
- 37. Han L., Abney M. Identity by descent estimation with dense genome-wide genotype data. Genet. Epidemiol. 2011;35:557–567. doi: 10.1002/gepi.20606.
- 38. Brown M.D., Glazner C.G., Zheng C., Thompson E.A. Inferring coancestry in population samples in the presence of linkage disequilibrium. Genetics. 2012;190:1447–1460. doi: 10.1534/genetics.111.137570.
- 39. Thompson E.A. The IBD process along four chromosomes. Theor. Popul. Biol. 2008;73:369–373. doi: 10.1016/j.tpb.2007.11.011.
- 40. Gusev A., Lowe J.K., Stoffel M., Daly M.J., Altshuler D., Breslow J.L., Friedman J.M., Pe’er I. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19:318–326. doi: 10.1101/gr.081398.108.
- 41. Kong A., Masson G., Frigge M.L., Gylfason A., Zusmanovich P., Thorleifsson G., Olason P.I., Ingason A., Steinberg S., Rafnar T. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 2008;40:1068–1075. doi: 10.1038/ng.216.
- 42. Dimitromanolakis A., Paterson A.D., Sun L. Fast and Accurate Shared Segment Detection and Relatedness Estimation in Un-phased Genetic Data via TRUFFLE. Am. J. Hum. Genet. 2019;105:78–88. doi: 10.1016/j.ajhg.2019.05.007.
- 43. Naseri A., Liu X., Tang K., Zhang S., Zhi D. RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. Genome Biol. 2019;20:143. doi: 10.1186/s13059-019-1754-8.
- 44. Shemirani R., Belbin G.M., Avery C.L., Kenny E.E., Gignoux C.R., Ambite J.L. Rapid detection of identity-by-descent tracts for mega-scale datasets. bioRxiv. 2019 doi: 10.1038/s41467-021-22910-w.
- 45. Chiang C.W., Ralph P., Novembre J. Conflation of short identity-by-descent segments bias their inferred length distribution. G3 (Bethesda). 2016;6:1287–1296. doi: 10.1534/g3.116.027581.
- 46. Browning B.L., Browning S.R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. 2013;194:459–471. doi: 10.1534/genetics.113.150029.
- 47. Browning B.L., Browning S.R. Detecting identity by descent and estimating genotype error rates in sequence data. Am. J. Hum. Genet. 2013;93:840–851. doi: 10.1016/j.ajhg.2013.09.014.
- 48. Rodriguez J.M., Bercovici S., Huang L., Frostig R., Batzoglou S. Parente2: a fast and accurate method for detecting identity by descent. Genome Res. 2015;25:280–289. doi: 10.1101/gr.173641.114.
- 49. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 50. Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv. 2019 doi: 10.1038/s41586-021-03205-y.
- 51. Weedon M.N., Jackson L., Harrison J.W., Ruth K.S., Tyrrell J., Hattersley A.T., Wright C.F. Very rare pathogenic genetic variants detected by SNP-chips are usually false positives: implications for direct-to-consumer genetic testing. bioRxiv. 2019.
- 52. Durbin R. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics. 2014;30:1266–1272. doi: 10.1093/bioinformatics/btu014.
- 53. Browning B.L., Zhou Y., Browning S.R. A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet. 2018;103:338–348. doi: 10.1016/j.ajhg.2018.07.015.
- 54. Jeffreys A.J., May C.A. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 2004;36:151–156. doi: 10.1038/ng1287.
- 55. Danecek P., Auton A., Abecasis G., Albers C.A., Banks E., DePristo M.A., Handsaker R.E., Lunter G., Marth G.T., Sherry S.T., 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
- 56. Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 2014;46:818–825. doi: 10.1038/ng.3021.
- 57. Lappalainen I., Almeida-King J., Kumanduri V., Senf A., Spalding J.D., Ur-Rehman S., Saunders G., Kandasamy J., Caccamo M., Leinonen R. The European Genome-phenome Archive of human data consented for biomedical research. Nat. Genet. 2015;47:692–695. doi: 10.1038/ng.3312.
- 58. Manichaikul A., Mychaleckyj J.C., Rich S.S., Daly K., Sale M., Chen W.-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559.
- 59. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
- 60. Delaneau O., Zagury J.F., Robinson M.R., Marchini J.L., Dermitzakis E.T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 2019;10:5436. doi: 10.1038/s41467-019-13225-y.
- 61. Haller B.C., Messer P.W. SLiM 2: Flexible, interactive forward genetic simulations. Mol. Biol. Evol. 2017;34:230–240. doi: 10.1093/molbev/msw211.
- 62. Haller B.C., Messer P.W. SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Mol. Biol. Evol. 2019;36:632–637. doi: 10.1093/molbev/msy228.
- 63. Williams A.L., Genovese G., Dyer T., Altemose N., Truax K., Jun G., Patterson N., Myers S.R., Curran J.E., Duggirala R., T2D-GENES Consortium. Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. eLife. 2015;4:e04637. doi: 10.7554/eLife.04637.
- 64. Kelleher J., Etheridge A.M., McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol. 2016;12:e1004842. doi: 10.1371/journal.pcbi.1004842.
- 65. Haller B.C., Galloway J., Kelleher J., Messer P.W., Ralph P.L. Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol. Ecol. Resour. 2019;19:552–566. doi: 10.1111/1755-0998.12968.
- 66. Frazer K.A., Ballinger D.G., Cox D.R., Hinds D.A., Stuve L.L., Gibbs R.A., Belmont J.W., Boudreau A., Hardenbol P., Leal S.M., International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258.
- 67. Li H., Glusman G., Hu H., Shankaracharya, Caballero J., Hubley R., Witherspoon D., Guthery S.L., Mauldin D.E., Jorde L.B. Relationship estimation from whole-genome sequence data. PLoS Genet. 2014;10:e1004144. doi: 10.1371/journal.pgen.1004144.
- 68. Maples B.K., Gravel S., Kenny E.E., Bustamante C.D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 2013;93:278–288. doi: 10.1016/j.ajhg.2013.06.020.
- 69. Howie B., Fuchsberger C., Stephens M., Marchini J., Abecasis G.R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 2012;44:955–959. doi: 10.1038/ng.2354.
- 70. Loh P.-R., Danecek P., Palamara P.F., Fuchsberger C., A Reshef Y., K Finucane H., Schoenherr S., Forer L., McCarthy S., Abecasis G.R. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016;48:1443–1448. doi: 10.1038/ng.3679.