HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data

Sepp Hochreiter

doi:10.1093/nar/gkt1013

Nucleic Acids Res. 2013 Oct 29;41(22):e202. doi: 10.1093/nar/gkt1013


PMCID: PMC3905877 PMID: 24174545

Abstract

Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority—152 000 IBD segments—are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively. 
IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at http://www.bioinf.jku.at/research/short-IBD.

INTRODUCTION

A DNA segment is ‘identical by state (IBS)’ in two or more individuals if they have identical nucleotide sequences in this segment. An IBS segment is ‘identical by descent (IBD)’ in two or more individuals if they have inherited it from a common ancestor, that is, the segment has the same ancestral origin in these individuals. Rare variants can be used for distinguishing IBD from IBS without IBD because independent origins are highly unlikely for such variants. In other words, IBS generally implies IBD for rare variants, which is not true for common variants [(1), Ch. 15.3, p. 441].

Current IBD methods reliably detect long IBD segments because many minor alleles in the segment are concordant between the two haplotypes under consideration. However, many cohort studies contain unrelated individuals, which share only short IBD segments. Short IBD segments contain too few minor alleles to distinguish IBD from random allele sharing by recurrent mutations, which corresponds to IBS, but not IBD. New sequencing techniques provide rare variants, which facilitate the detection of short IBD segments. Rare variants convey more information on IBD than common variants because random minor allele sharing is less likely for rare variants than for common variants (2).

Short IBD segments resolve genetic structures on a fine scale and, therefore, provide important additional information for various applications in genetics. For example, the imputation of missing single nucleotide variants (SNVs) in genotype data (3,4) and haplotype phasing could be improved (5). Short IBD segments that are characterized by rare variants can increase genotyping accuracy obtained from low-coverage sequencing (6–10). The low power of association tests between diseases and rare variants (11,12) can be increased by using short IBD segments. They can serve to group SNVs to reduce the number of hypotheses or be directly used to test for an association (13–19). Moreover, short IBD segments can be assumed to be old compared with long IBD segments, which provides valuable insights in the field of population genetics. Sharing of short IBD segments between populations and the distribution of their lengths make it possible to investigate the evolutionary and demographic history of humans (20,21).

Most IBD detection methods are based on hidden Markov models (HMMs) in which, at each DNA locus, a hidden state indicates presence or absence of IBD. HMM-based IBD methods are implemented in software tools like PLINK (22), RELATE (23) and BEAGLE (24). The phasing method fastPHASE (25) internally constructs IBD segments by using HMMs. The fastIBD method of the BEAGLE software package (26) uses HMMs for scoring matching alleles between two haplotypes. FastIBD first detects hot spots of matching DNA regions and then extends them, which is the basic idea of the previously published very fast IBD detection method GERMLINE (27). For related individuals, IBD detection methods can be enhanced by using pedigree information, where IBD segment sharing can be found in more than two individuals (28–30). Most IBD methods are not robust against genotyping errors and are computationally expensive for larger cohorts, as they must test all pairs of individuals for IBD. However, the main problem with current IBD detection methods is that they reliably detect long IBD segments (longer than 1 cM), but fail to distinguish IBD from identity by state (IBS) without IBD at short segments.

To detect short IBD segments, both the information supplied by rare variants and the information from IBD segments that are shared by more than two individuals should be used (2). The probability of a segment being IBD is typically computed via the probabilities of randomly sharing single alleles within the segment, where linkage disequilibrium (LD) may be taken into account (for an investigation of the extent to which LD helps to identify short IBD segments, see the Supplementary Information, Section S7). The probability of randomly sharing a segment depends (a) on the allele frequencies within the segment, where lower frequency means lower probability of random sharing, and (b) on the number of individuals that share the allele, where more individuals result in lower probability of random segment sharing. The shorter the IBD segments, the higher the likelihood that they are shared by more individuals (see Supplementary Information, Section S4). Therefore, we focus on short IBD segments. There exists a trade-off between low minor allele frequency (MAF) and many individuals having a segment (see Supplementary Information, Section S5). Consequently, a segment that contains rare variants and is shared by more individuals has higher probability of representing IBD (31,32). These two characteristics are our basis for detecting short IBD segments.

IBD detection using multiple individuals has been proposed for genotyping data with pedigree information (13,31,33). For IBD detection without pedigrees, the extensions of standard HMM algorithms that consider multiple individuals are computationally intractable due to the large state spaces (34). DASH (35) integrates multiple individuals into IBD clusters, which are found by clustering IBD detection results from GERMLINE. However, the performance of DASH depends mainly on the preceding IBD detection, which fails for short IBD segments. Only Moltke et al.’s (34) Markov Chain Monte Carlo-based method (MCMC) considers multiple individuals simultaneously during IBD detection. Moltke et al. (34) showed that multiple individuals improve IBD detection, as the MCMC method determined IBD segment break points more precisely and found shorter IBD segments with higher accuracy than other IBD methods. However, the MCMC method is based on a sampling technique and is therefore computationally expensive.

We propose biclustering (36) to detect very short IBD segments that are shared among multiple individuals. Biclustering simultaneously clusters rows and columns of a matrix. In particular, it clusters row elements that are similar to each other on a subset of column elements. A genotype matrix has individuals (unphased) or chromosomes (phased) as row elements and SNVs as column elements. Entries in the genotype matrix usually count how often the minor allele of a particular SNV is present in a particular individual. Alternatively, minor allele likelihoods or dosages may be used (see Supplementary Information, Section S6). Individuals that share an IBD segment are similar to each other because they also share minor alleles of SNVs (tagSNVs) within the IBD segment (see Figure 1). Individuals that share an IBD segment represent a bicluster. Identifying a bicluster means identifying tagSNVs (column bicluster elements) that tag an IBD segment and, simultaneously, identifying individuals (row bicluster elements) that possess the IBD segment.

Figure 1. — Biclustering of a genotyping matrix. Left: original genotyping data matrix, where rows give the individuals and columns consecutive SNVs. If at least one minor allele is present, then this is indicated by a violet bar for each individual–SNV pair, otherwise the bar is yellow. Right: after reordering the rows, a bicluster can be seen at the top three individuals. They contain the same IBD segment (in gold) and, therefore, are similar to each other by sharing minor alleles of SNVs within the segment (the tagSNVs).

In contrast to standard IBD detection methods for unrelated individuals (except the MCMC method), biclustering considers multiple individuals. Biclustering performs well even if few individuals are similar to each other, e.g. for few occurrences of the minor allele of tagSNVs or, equivalently, for rare variants. Analogously, biclustering works well for few tagSNVs, i.e. for very short IBD segments. In contrast to standard clustering, biclustering allows for SNVs or individuals that do not belong to any bicluster or that belong to more than one bicluster. Multi-cluster membership is advantageous for IBD detection because diploid individuals can have two IBD segments at one locus and an SNV may tag more than one IBD segment. An SNV that belongs to a bicluster tags the corresponding IBD segment and an individual that belongs to a bicluster possesses this IBD segment. In summary, biclustering is well suited for detecting very short IBD segments in multiple individuals that are tagged by rare variants.

We have developed HapFABIA for identifying very short IBD segments. HapFABIA first applies Factor Analysis for Bicluster Acquisition (FABIA) biclustering to genotype data to detect identity by state (IBS) and then extracts IBD segments from FABIA results by distinguishing IBD from IBS without IBD. In contrast to other biclustering models, FABIA models are able to represent homozygous regions (same IBD segment in both chromosomes) and overlapping IBD segments (a different IBD segment in each chromosome at a locus). We compared HapFABIA with other IBD detection methods using artificial and simulated genotype data with implanted IBD segments and applied HapFABIA to sequencing data from the 1000 Genomes Project.

MATERIALS AND METHODS

We present the HapFABIA method, which extracts short IBD segments that are tagged by rare variants from large sequencing data. The following two subsections describe the two stages of the HapFABIA method. In the first stage, HapFABIA applies FABIA biclustering to phased or unphased genotype data. Biclustering extracts individuals that share minor alleles (are similar to each other), that is, it detects identity by state (IBS). In the second stage, HapFABIA extracts IBD segments from FABIA models by distinguishing IBD from IBS without IBD. Finally, HapFABIA prunes spuriously correlated SNVs from the extracted IBD segments and joins segments.

FABIA for genotype data

We propose identifying similarities between individuals by biclustering. Biclustering simultaneously clusters rows and columns of a matrix. More specifically, it clusters row elements that are similar to each other on a subset of column elements. An IBD segment corresponds to a bicluster because individuals that possess the IBD segment are similar to each other at this segment. The similarity is given by identical minor alleles of tagSNVs. Figure 1 depicts a bicluster identified in genotype data.

We use the ‘FABIA’ biclustering model (36). In contrast to other biclustering methods such as BIMAX (37) and QUBIC (38), FABIA can represent homozygous regions where the same IBD segment may be present in one diploid individual two times. As described below and depicted in Figure 2, the FABIA model has a variable that describes zygosity, i.e. how many chromosomes of an individual contain a particular IBD segment. FABIA can be applied to discrete phased or unphased genotype data, but also to real values that correspond to minor allele likelihoods or to minor allele dosages (see comparisons of results based on genotype, haplotype, likelihood and dosage in the Supplementary Information, Section S6). We use FABIA not only because it is well suited for genotyping data, but also because it outperformed other biclustering methods in extensive comparisons on different data sets (36).

Figure 2. — The outer product $z \lambda^T$ of vectors $\lambda$ and $z$. $\lambda$ indicates IBD segment tagSNVs and $z$ indicates how many chromosomes of an individual contain the IBD segment. The row containing 2s indicates a homozygous region represented by $z_j = 2$ (two times the same IBD segment in individual j).

FABIA describes genotype data by IBD segments

FABIA describes an IBD segment in genotype data $X$ by an outer product $z \lambda^T$ of two vectors $\lambda$ and $z$, where the vector $\lambda$ indicates tagSNVs by nonzero values and the vector $z$ indicates individuals that possess the IBD segment. FABIA can represent a homozygous region of an IBD segment by $z_j = 2$, that is, two occurrences of an IBD segment in one diploid individual $j$. Figure 2 visualizes this description of a genotype matrix by one IBD segment as an outer product.

A diploid individual may also possess two different IBD segments at a particular locus where genotyping sums up the occurrences of minor alleles. This fact is reflected by the FABIA model, which sums up bicluster contributions. If we assume genotyping errors that are accounted for by a noise term $\Upsilon$, the FABIA model for genotype data is

$$X = \sum_{i=1}^{p} z_i \lambda_i^T + \Upsilon = Z \Lambda + \Upsilon \qquad (1)$$

where $X \in \mathbb{R}^{l \times n}$ is the genotyping data; $Z \in \mathbb{R}^{l \times p}$ is the matrix that indicates which individuals possess an IBD segment; $\Lambda \in \mathbb{R}^{p \times n}$ indicates IBD segment tagSNVs; $\Upsilon \in \mathbb{R}^{l \times n}$ is an additive noise term; n is the number of SNVs; l is the number of individuals (or chromosomes for phased genotypes); p is the number of IBD segments; $\lambda_i \in \mathbb{R}^{n}$ is the tagSNV indicator vector for the i-th IBD segment (the i-th row of $\Lambda$); and $z_i \in \mathbb{R}^{l}$ corresponds to the number of times each of the l individuals contains the i-th IBD segment (the i-th column of $Z$). The additive noise not only covers genotyping errors but also genotypes that cannot be explained by IBD segments. Such unexplained genotypes may arise from recently acquired SNVs, IBD segments observed in only one individual and IBD segments that are missed.

As illustrated in Figure 2, both the vector $\lambda$ and the vector $z$ should be sparse to describe an IBD segment. Sparse $z$ means that only few individuals possess the IBD segment, which implies rare tagSNVs. Sparse $\lambda$ means that only few SNVs are tagSNVs, which implies short IBD segments. See Supplementary Information, Section S2, for the interpretation of $\Lambda$ and $Z$ in the context of identifying short IBD segments in genotype data. In contrast to standard factor analysis, FABIA’s model selection is tailored to sparse factors and sparse parameters (36), which are essential for IBD detection. Sparseness in the FABIA model is obtained by a component-wise independent Laplace distribution both for the prior on the parameters $\lambda_i$ and for the distribution of the counts $z_i$. However, the Laplace distribution of the counts leads to an analytically intractable likelihood and posterior. Therefore, the model selection of FABIA is performed by variational expectation maximization (36,39–43). See Supplementary Information, Section S2, for more details on the FABIA method.

The number p of biclusters need not be determined a priori if p is chosen large enough. The sparseness constraint will remove a spurious bicluster i by setting $\lambda_i$ to the zero vector. In this way, FABIA automatically determines the number of biclusters.
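The outer-product description can be made concrete with a small sketch. It assumes, as in the genotype matrix described earlier, that rows are individuals and columns are SNVs; the helper names `outer` and `add` and all toy values are illustrative and are not part of the HapFABIA software. The noise term is omitted.

```python
def outer(z, lam):
    """Outer product of an individual vector z and a tagSNV vector lam."""
    return [[z_j * lam_k for lam_k in lam] for z_j in z]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Hypothetical toy data: 4 individuals (rows) and 6 SNVs (columns).
lam1 = [0, 1, 1, 1, 0, 0]  # segment 1 is tagged by SNVs 1-3
z1 = [1, 2, 0, 0]          # individual 1 is homozygous for segment 1
lam2 = [0, 0, 0, 1, 1, 0]  # segment 2 is tagged by SNVs 3-4
z2 = [1, 0, 1, 0]          # individuals 0 and 2 carry segment 2 once

# Noise-free genotype matrix: the sum of both bicluster contributions.
genotypes = add(outer(z1, lam1), outer(z2, lam2))
for row in genotypes:
    print(row)
```

In this toy example, individual 0 carries both segments, so its minor allele count at SNV 3 (tagged by both segments) is the sum of the two contributions, while individual 1 shows the row of 2s that marks a homozygous region.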

Adaptation of FABIA for IBD detection

Since an entry in the genotype matrix $X$ reports how often the minor allele is present, FABIA must explain occurrences of minor alleles by IBD segments.

Nonnegativity constraints: The genotype matrix $X$ is nonnegative. The indicator matrix $\Lambda$ of tagSNVs is 1, if the corresponding SNV is a tagSNV, and 0 otherwise. Thus, $\Lambda$ is also nonnegative. The matrix $Z$ counts the number of occurrences of IBD segments in individuals or chromosomes. Consequently, $Z$ is nonnegative too. FABIA biclustering does not regard these nonnegativity constraints. For HapFABIA, we modified FABIA to ensure that the tagSNV indicator matrix $\Lambda$ is nonnegative, which also implies that $Z$ is nonnegative. See Supplementary Information, Section S2, for more details.
Sparse matrix algebra for efficient computations: We exploit the sparsity of the genotype vectors (mostly the major allele is observed) and the sparsity of the indicator matrix $\Lambda$ to speed up computations and to allow IBD segment detection in large sequencing data. We developed a specialized sparse matrix algebra that only stores and computes with nonzero values.
Iterative biclustering for efficient computations: To further speed up the computation, we extended FABIA to an iterative version, where, in each iteration, p biclusters are detected. These p biclusters are removed from the genotype matrix before starting the next iteration. The computational complexity of FABIA is $O(p^3 l n)$, which means it is linear in the number of SNVs n and in the number of chromosomes or individuals l, but cubic in the number of biclusters p. The iterative version can extract $a p$ biclusters in $O(a p^3 l n)$ time instead of the $O(a^3 p^3 l n)$ time of the original version of FABIA. For the 1000 Genomes Project, we used a = 40, which gave a speed up of $a^2 = 1600$.
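The benefit of the iterative scheme follows directly from the cubic dependence on the number of biclusters. The sketch below assumes a cost proportional to p^3 · l · n per FABIA run, as implied by the stated linear and cubic dependencies; `fabia_cost` is a hypothetical helper, and the values of p, l and n are illustrative.

```python
def fabia_cost(p, l, n):
    """Relative cost of one FABIA run, assuming cost proportional to p^3 * l * n."""
    return p ** 3 * l * n


a, p = 40, 5          # a iterations with p biclusters each (p is illustrative)
l, n = 1000, 100000   # illustrative numbers of individuals and SNVs

iterative_cost = a * fabia_cost(p, l, n)    # a runs with p biclusters each
single_run_cost = fabia_cost(a * p, l, n)   # one run with a * p biclusters
speedup = single_run_cost // iterative_cost
print(speedup)
```

Under this cost model the ratio is a^2 regardless of p, l and n, which gives 1600 for a = 40.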

Extraction of IBD segments from FABIA models

FABIA biclustering detects identity by state (IBS) by finding individuals that are similar to each other by sharing minor alleles. In the second stage, HapFABIA distinguishes IBD from IBS without IBD. The idea is to find local accumulations of IBS SNVs, which indicate short IBD segments. IBD SNVs are within short IBD segments and, therefore, have small mutual physical distances. Then IBD segments are disentangled, pruned from spurious SNVs and finally joined if they are part of a long IBD segment.

For the separation of IBD from random IBS, it is important that the SNVs extracted by FABIA (the SNVs that are IBS) are independent of their physical location and their temporal order. Only if this independence assumption holds are statistical methods for identifying local SNV accumulations justified. FABIA biclustering complies with this independence assumption because it does not regard the order of SNVs and random shuffling of SNVs does not change the results. Therefore, randomly correlated SNVs that are found by FABIA (SNVs that are IBS without IBD) would be uniformly distributed along the chromosome. However, SNVs that are IBS because they tag an IBD segment agglomerate locally in this segment. Deviations from the null hypothesis of uniformly distributed SNVs can be detected by a binomial test for the number of expected SNVs within an interval if the MAF of SNVs is known. A low P-value hints at local agglomerations of bicluster SNVs stemming from an IBD segment.

We propose a four-step procedure to extract IBD segments from FABIA models:

1. Identify local accumulations of IBS SNVs (SNVs detected by the FABIA model) by a binomial test, since these accumulations distinguish random IBS from IBS caused by IBD;
2. Disentangle IBD segments and reassign individuals or chromosomes to IBD segments;
3. Prune SNVs with spurious correlations from IBD segments, based on an exponential test for long physical distances;
4. Join similar IBD segments stemming from long IBD segments that were divided in the first step while identifying accumulations.

Step 1: FABIA model selection is independent of the order of the SNVs. Therefore, spuriously correlated SNVs are unlikely to agglomerate at a DNA locus, whereas tagSNVs do, as they are within an IBD segment. To detect agglomerations, we compute histogram counts of FABIA model SNVs within bins that overlap by half of their length. Bins with large counts are assumed to contain IBD segments. For computing the histogram of counts of FABIA model SNVs, we consider those SNVs for which the FABIA model parameter $\lambda$ is largest (threshold ‘Lt’, with 10% being the default value). Large $\lambda$ values ensure IBD segments that are shared by multiple individuals. These segments are, therefore, more reliable. The HapFABIA parameter ‘IBDsegmentLength’ determines the typical length of IBD segments. The histogram bin size in number of SNVs (all SNVs and not only model SNVs) is computed from ‘IBDsegmentLength’ using the average physical distance between adjacent SNVs.
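The histogram with half-overlapping bins can be sketched as follows. This is a minimal illustration, not the HapFABIA implementation: the function names, the rounding of the bin size, and the use of SNV indices as positions are assumptions made for the example.

```python
def bin_size_in_snvs(segment_length, avg_snv_distance):
    """Bin size in number of SNVs (both arguments in the same physical unit)."""
    return max(2, round(segment_length / avg_snv_distance))


def overlapping_bin_counts(model_snv_indices, n_snvs, bin_size):
    """Count model SNVs in bins of `bin_size` SNVs that overlap by half."""
    step = max(1, bin_size // 2)
    counts = []
    for start in range(0, max(1, n_snvs - step), step):
        end = min(start + bin_size, n_snvs)
        count = sum(1 for idx in model_snv_indices if start <= idx < end)
        counts.append((start, end, count))
    return counts


# Hypothetical example: 40 SNVs, six of which were selected by the model.
size = bin_size_in_snvs(segment_length=8000, avg_snv_distance=1000)
bins = overlapping_bin_counts([5, 6, 7, 9, 10, 30], n_snvs=40, bin_size=size)
for start, end, count in bins:
    print(start, end, count)
```

Because consecutive bins overlap by half, an accumulation that straddles a bin border (here SNVs 5 to 10) still falls completely into one bin.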

The histogram bins with more model SNVs than expected by chance are assumed to contain IBD segments. We select bins for which the model SNV count exceeds the expected value by a binomial test for random counts. We need to compute how many model SNVs are expected to be in a bin if they are IBS, but not IBD. Thus, we have to compute the probability of observing k or more bin counts by chance. Let p be the probability of a random minor allele match between t individuals. If n SNVs are in a bin, the probability of observing k model SNVs by chance is given by

$$\Pr(k) = \binom{n}{k} p^{k} (1-p)^{n-k} \qquad (2)$$

If q is the MAF for one SNV, the probability p of observing the minor allele of this SNV in all t individuals is $p = q^t$. We assumed that all SNVs have the same MAF q; in the experiments we used the average MAF. For b bins, the probability of observing k or more counts of model SNVs by chance in at least one bin is

$$b \binom{l}{t} \sum_{i=k}^{n} \binom{n}{i} q^{t i} \left(1 - q^{t}\right)^{n-i} \qquad (3)$$

where l is the number of individuals and $\binom{l}{t}$ is the number of possibilities to choose t individuals from the l individuals. If the probability in Equation (3) is below the threshold ‘thresCount’, the corresponding bin is selected for IBD segment extraction because more FABIA model SNVs are in this bin than expected by chance. If $k_{\min}$ is the minimum k for which Equation (3) is below the threshold ‘thresCount’, then all bins with model SNV counts $k \geq k_{\min}$ are selected. In our experiments, we allow for IBD segments that are observed in only two individuals (standard IBD), and therefore set t = 2.
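The bin selection rule can be sketched as below. The sketch assumes that the chance probability is bounded by the number of bins times the number of ways to choose t of l individuals times the binomial tail with success probability q^t; this is a reading of the description above, not code from the HapFABIA package, and all function names and numbers are illustrative.

```python
from math import comb


def binom_tail(k, n, p):
    """Probability of k or more successes in n Bernoulli(p) trials."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chance_probability(k, n, q, t, l, b):
    """Bound on the probability of k or more model SNVs in at least one of b bins."""
    p = q ** t  # all t individuals carry the minor allele by chance
    return b * comb(l, t) * binom_tail(k, n, p)


def min_count(n, q, t, l, b, thres_count):
    """Smallest k whose chance probability is below thres_count, or None."""
    for k in range(n + 1):
        if chance_probability(k, n, q, t, l, b) < thres_count:
            return k
    return None


# Hypothetical setting: 100 SNVs per bin, MAF 1%, pairs of 100 individuals,
# 1000 bins and a threshold of 1e-5.
k_min = min_count(n=100, q=0.01, t=2, l=100, b=1000, thres_count=1e-5)
print(k_min)
```

Bins whose model SNV count is at least `k_min` would then be passed on to the extraction of IBD segments.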

If a bin is selected, SNVs and individuals must be assigned to it. Bicluster memberships of FABIA biclusters cannot be used directly because they include all bins and therefore different IBD segments. First, model SNVs are assigned to the selected bin if they contributed to its count. Then individuals or chromosomes are assigned to the selected bin if they possess a minor allele at one or more SNVs that have been assigned to the bin. Individuals are only chosen from the top z-values of the FABIA model to ensure that assigned individuals are similar to each other. The parameter ‘Zt’ (default 20%) gives the percentage of top z-scores that are considered.

In this step, we automatically distinguish between identity by state (IBS) and IBD. In particular, IBD can be distinguished from IBS without IBD by sharing of rare alleles because two independent origins are unlikely for them, so IBS generally implies IBD, which is not true for common alleles [(1), Ch. 15.3, p. 441]. The probability of IBS without IBD is given by (a) the probability of randomly observing minor allele sharing plus (b) the probability of observing recombined segments. In case (b), recombinations may be missed if a segment is broken via meiosis in one generation and then put together in later generations. Recombinations may also be missed if mother and father both have the same DNA segment. In both variants of case (b), IBS sharing in a segment is observed after intervening recombination and, therefore, this segment is not considered as a single IBD segment (44). For case (b), the lengths of IBS segments do not reflect their age because they are not IBD and, therefore, would misguide subsequent analyses. However, case (b) has low probability if rare variants are considered. If the tagSNVs have low MAF, then the tagged segments cannot be observed frequently. The probability of observing a recombined segment is proportional to the MAF squared, which is 0.0025 for 5% MAF and 0.0001 for 1% MAF. This false-positive rate due to undetected recombinations is tolerable. Therefore, we only consider case (a) of random allele sharing. The probability of randomly observing k or more tagSNVs in t individuals simultaneously (IBS without IBD) is given by Equation (3) without the factor b. Therefore, we distinguish IBS without IBD from IBD in this step.

When minimizing Equation (3), we observe a trade-off between small q and large t. This trade-off is discussed in the Supplementary Information, Section S5. For rare variants, more individuals make random minor allele sharing (IBS without IBD) less likely.

Step 2: In this step, IBD segments in a selected bin are disentangled, where only SNVs and individuals are considered that have been assigned to the bin. An IBD segment is initialized by two core individuals that are identical at m or more minor alleles. The number m is computed as $m = \mathrm{mintagSNVsFactor} \cdot k_{\min}$, where $k_{\min}$, the minimum model SNV count of a selected bin, is computed in Step 1 and mintagSNVsFactor is a parameter with default value 3/4. All individuals that are identical in at least m minor alleles to one of the two IBD core individuals are classified as possessing the IBD segment. The tagSNVs of this IBD segment are model SNVs that have their minor allele in at least two individuals that possess the IBD segment.

Step 2 is repeated after removing the current IBD segment by deleting the segment’s tagSNVs until no more core individuals are found.
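The disentangling loop of Step 2 can be sketched as follows. The text does not specify how the two core individuals are chosen; the sketch assumes that the pair sharing the most minor alleles is used. The function name `disentangle` and the input format (a mapping from each individual to the set of bin SNVs at which it carries the minor allele) are illustrative.

```python
from itertools import combinations


def disentangle(carriers, m):
    """Return a list of (individuals, tagSNVs) pairs extracted from one bin."""
    if m < 1:
        raise ValueError("m must be at least 1")
    carriers = {ind: set(snvs) for ind, snvs in carriers.items()}
    segments = []
    while True:
        # Assumption: the core pair is the pair with the most shared minor alleles.
        core = max(combinations(sorted(carriers), 2),
                   key=lambda pair: len(carriers[pair[0]] & carriers[pair[1]]),
                   default=None)
        if core is None or len(carriers[core[0]] & carriers[core[1]]) < m:
            break
        # Individuals matching at least m minor alleles of one core individual.
        members = {ind for ind, snvs in carriers.items()
                   if any(len(snvs & carriers[c]) >= m for c in core)}
        # tagSNVs carry their minor allele in at least two members.
        counts = {}
        for ind in members:
            for snv in carriers[ind]:
                counts[snv] = counts.get(snv, 0) + 1
        tag_snvs = {snv for snv, count in counts.items() if count >= 2}
        segments.append((sorted(members), sorted(tag_snvs)))
        # Remove the segment by deleting its tagSNVs, then search again.
        for snvs in carriers.values():
            snvs -= tag_snvs
    return segments


# Hypothetical bin with five individuals and two IBD segments.
toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 4, 9}, "C": {2, 3, 4},
       "D": {6, 7, 8}, "E": {6, 7, 8, 9}}
segments = disentangle(toy, m=3)
print(segments)
```

SNV 9 is shared by B and E but never reaches m matches, so it is not assigned to either segment in this toy example.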

Step 3: This step prunes IBD segment borders of SNVs that have spurious correlations to the IBD segment. Spurious correlations may still be present in a bin leading to an overestimation of the segment length. Such SNVs can be identified by deviations of their MAFs from those of other tagSNVs. However, this criterion is not reliable for rare SNVs. Therefore, we identify SNVs with spurious correlations to an IBD segment on the basis of unusually large distances to other tagSNVs. The deviation from an expected distance is quantified by means of an exponential distribution with the median distance between tagSNVs as parameter. SNVs with distances leading to P-values below 1e-3 are removed. The two furthest upstream and the two furthest downstream tagSNVs are tested for their distances to other tagSNVs. If the second-furthest up- or downstream tagSNV is removed, then the furthest up- or downstream tagSNV is removed, too.
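The border pruning can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: the rate of the exponential distribution is chosen so that its median equals the median gap between adjacent tagSNVs, and a border tagSNV is tested via its gap to the next tagSNV toward the interior of the segment. The function name `prune_borders` is hypothetical.

```python
from math import exp, log
from statistics import median


def prune_borders(positions, p_threshold=1e-3):
    """Drop up to two border tagSNVs per side that lie unusually far away."""
    pos = sorted(positions)
    if len(pos) < 5:
        return pos
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    rate = log(2) / median(gaps)  # exponential with the observed median gap

    def p_value(gap):
        # Probability of a gap at least this large under the exponential model.
        return exp(-rate * gap)

    lo, hi = 0, len(pos)
    if p_value(gaps[1]) < p_threshold:      # second-furthest upstream is far away,
        lo = 2                              # so the furthest one is removed as well
    elif p_value(gaps[0]) < p_threshold:
        lo = 1
    if p_value(gaps[-2]) < p_threshold:     # same logic on the downstream side
        hi = len(pos) - 2
    elif p_value(gaps[-1]) < p_threshold:
        hi = len(pos) - 1
    return pos[lo:hi]


# Hypothetical tagSNV positions (bp) with one outlier at each border.
pruned = prune_borders([100, 50000, 51000, 52000, 53500, 54000, 120000])
print(pruned)
```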

Step 4: IBD segments that are similar to each other are merged. In this way, long IBD segments that were divided by the bins into smaller parts are reconstructed, so that IBD segments longer than ‘IBDsegmentLength’ can also be detected. To compute similarities, we assess how many tagSNVs and individuals of the smaller IBD segment are explained by the larger IBD segment. This criterion is expressed by the ‘overlap coefficient’

$$\mathrm{ov}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \qquad (4)$$

Using the overlap coefficient for both tagSNVs and individuals, we define a distance-like measure D between IBD segments $i$ and $j$ by

(5)

where $S_i$ and $I_i$ are the tagSNVs that tag IBD segment $i$ and the individuals possessing IBD segment $i$, respectively. Using the measure D, IBD segments are clustered by hierarchical clustering using complete linkage. IBD segments are merged if their segments are clustered together below a cutting height of 0.8.
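As a small illustration of the similarity used in Step 4, the sketch below computes the overlap coefficient in its usual set-based form, the size of the intersection divided by the size of the smaller set, which matches the description of how much of the smaller IBD segment is explained by the larger one. The function name and the toy sets are illustrative; the combined measure D and the clustering are not reproduced here.

```python
def overlap_coefficient(a, b):
    """Fraction of the smaller set that is contained in the other set."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical tagSNV sets and individual sets of two IBD segments.
snv_overlap = overlap_coefficient({1, 2, 3, 4, 5, 6}, {4, 5, 6})
individual_overlap = overlap_coefficient({"A", "B", "C"}, {"B", "C", "D", "E"})
print(snv_overlap, individual_overlap)
```

The first value is 1 because all tagSNVs of the smaller segment also tag the larger one, which is the situation expected when a long IBD segment was split across bins.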

RESULTS AND DISCUSSION

We first compare IBD detection methods on artificial and simulated sequencing genotype data sets where IBD segments are tagged by rare variants. The first data set contains artificial data. The second data set is based on genotype data from the 1000 Genomes Project into which real DNA segments are implanted to construct IBD segments. The third data set is based on genotype data obtained via a forward-time simulation into which IBD segments are implanted. Finally, we test IBD segment detection of HapFABIA on genotype data from the 1000 Genomes Project.

For all experiments and all compared methods the detailed command line calls, parameter settings, result filters and additional results can be found in the Supplementary Information, Section S8.

Artificial and simulated genotype data

To compare IBD detection methods on artificial and simulated data, we first choose evaluation criteria that are described in the next subsection. Each of the following three subsections is devoted to comparisons on an artificial or simulated genotype data set. In each subsection, we first describe the data generation process and then report the results.

Evaluation criteria

The primary measures used to evaluate IBD segment detection methods are power (sensitivity, true-positive rate, recall), false discovery rate (FDR; 1 − precision) and computational complexity (2). Power can be raised by reporting more detections, but only at the cost of a higher FDR, and vice versa. Therefore, neither power nor FDR should be considered in isolation. A measure that combines both is the F1 score, a standard performance measure in information retrieval for assessing search performance, e.g. for finding documents. IBD segment detection is analogous to a document search in which true IBD segments correspond to relevant documents and false IBD segments to nonrelevant documents. The F1 score is the harmonic mean of precision (1 − FDR) and recall (power). Its maximal value of 1 is achieved for optimal detection, while its minimal value of 0 means that precision or recall was 0.

We assess power, FDR and F1 score at the SNV (marker) level to take into account whether IBD segment lengths are under- or overestimated (2). Consequently, for each individual, SNVs that belong to an IBD segment are positives and all other SNVs are negatives. Analogously, SNVs that belong to a predicted IBD segment are predicted positives and all other SNVs are predicted negatives. Figure 3 shows true positives (TPs), false positives (FPs), true negatives (TNs) and false negatives (FNs) for a chromosome with a true IBD segment and a detected IBD segment. A perfect IBD detection method would detect all true IBD segments with correct break points and would not detect false IBD segments, thereby yielding only TPs and TNs (100% power, 0% FDR and an F1 score of 1).

All IBD detection methods described in the introduction, except DASH, detect an IBD segment in a pair of chromosomes. For these methods, an IBD segment counts as detected in a chromosome if it is detected at least once, i.e. for at least one pair of chromosomes. Therefore, pairwise IBD detection methods are not penalized if they do not detect all IBD segments in all pairs of chromosomes.
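The rule for pairwise methods can be illustrated with a minimal Python sketch. It assumes that each pairwise detection is given as a hypothetical tuple `(chrom_a, chrom_b, start, end)` of chromosome indices and an SNV index range with exclusive end; this layout and the function name are illustrative assumptions, not the output format of any of the compared tools.

```python
from collections import defaultdict


def per_chromosome_detections(pairwise_segments):
    """Collect, for each chromosome, the SNVs covered by any detected segment.

    pairwise_segments: iterable of (chrom_a, chrom_b, start, end) tuples,
    where [start, end) is an SNV index range (assumed layout).
    An SNV counts as detected in a chromosome as soon as one pair
    involving that chromosome reports a segment covering it.
    """
    detected = defaultdict(set)
    for chrom_a, chrom_b, start, end in pairwise_segments:
        snvs = range(start, end)
        detected[chrom_a].update(snvs)
        detected[chrom_b].update(snvs)
    return dict(detected)


# Three pairwise detections among chromosomes 0, 1 and 2.
pairs = [(0, 1, 10, 20), (0, 2, 15, 25), (1, 2, 15, 20)]
detected = per_chromosome_detections(pairs)
```

Taking the union over pairs means that a method is credited for a chromosome even if it misses the same segment in other pairs, which is the leniency described above.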

Figure 3. — Evaluation of IBD detection methods. Each column is an SNV. The upper row shows a true IBD segment and the lower row a detected IBD segment. The middle row indicates TP, FP, TN and FN.
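The SNV-level counts of Figure 3 and the resulting scores can be sketched in Python as follows. The function name, the set-based representation of segments and the convention of returning 0 when a denominator is empty are assumptions made for this illustration.

```python
def snv_level_scores(true_snvs, predicted_snvs, n_snvs):
    """Compute SNV-level TP/FP/FN/TN, power, FDR and F1 for one chromosome.

    true_snvs, predicted_snvs: sets of SNV indices inside true and
    predicted IBD segments; n_snvs: total number of SNVs.
    """
    tp = len(true_snvs & predicted_snvs)
    fp = len(predicted_snvs - true_snvs)
    fn = len(true_snvs - predicted_snvs)
    tn = n_snvs - tp - fp - fn

    # Assumed convention: a ratio with an empty denominator is set to 0.
    power = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    # F1 is the harmonic mean of precision and recall (power).
    f1 = 2 * precision * power / (precision + power) if precision + power else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "power": power, "fdr": fdr, "f1": f1}


# True segment covers SNVs 20..59, detected segment covers SNVs 30..69,
# so the detection is shifted: TP = 30, FP = 10, FN = 10, TN = 50.
scores = snv_level_scores(set(range(20, 60)), set(range(30, 70)), n_snvs=100)
```

Because the counts are taken per SNV, a detection with wrong break points is penalized through both FPs and FNs even though the segment itself was found.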

We compare IBD detection methods on genotype data with known true IBD segments to evaluate the methods based on the ground truth. To assess both the FDR and the power of IBD detection methods, it is essential to know the positives, that is, the true IBD segments (2).

Power, FDR and F1 score are given as the median over 100 experiments together with the P-value of a Wilcoxon rank sum test with the null hypothesis that HapFABIA and another method yield the same value. We report the median and use the Wilcoxon rank sum test because the results are in general not normally distributed (Shapiro-Wilk tests for normality). Deviations of the results from their mean values are larger than a normal distribution would suggest because IBD segments are missed or falsely detected. The means (instead of the medians) of power, FDR and F1 score are reported in the Supplementary Information, Section S14.2.
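The reporting scheme can be sketched in Python with the standard library only. The test below is a two-sided Wilcoxon rank sum test based on the normal approximation with tie correction and without continuity correction; this is an assumption for illustration and may differ from the exact procedure used for the tables. The function name and the example values are hypothetical.

```python
import math
from statistics import median


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (z, p). Ties receive average ranks and the variance is
    corrected for ties; no continuity correction is applied.
    """
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    n = len(pooled)
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i                             # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n + 1) / 2.0
    var_w = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return 0.0, 1.0
    z = (rank_sum_x - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail probability
    return z, p


# Hypothetical F1 scores of two methods over five experiments each.
f1_method_a = [0.80, 0.90, 0.85, 0.95, 0.70]
f1_method_b = [0.10, 0.20, 0.15, 0.30, 0.25]
median_a, median_b = median(f1_method_a), median(f1_method_b)
z, p = rank_sum_test(f1_method_a, f1_method_b)
```

With 100 experiments per method, as in this study, the normal approximation is usually considered adequate; for very small samples an exact test would be preferable.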

Artificial genotype data with IBD segments

First, we tried to simulate genotyping data by coalescent and forward population genetic modeling. However, current software packages (45–48) were not able to generate short IBD segments that are tagged by rare variants, even though we explored a wide range of parameters including migration, population split, population join and different growth assumptions. Such short IBD segments do exist in real data: we detected them in data of the 1000 Genomes Project (49) as well as in data of the Korean Personal Genome Project. Since standard genotype simulation models did not yield short IBD segments that are tagged by rare variants, we implanted IBD segments into genotype data.

For the first data set, we generated phased genotype data with rare variants (MAF < 5%). Chromosomes were generated artificially with alleles in linkage equilibrium. To consider IBD detection in the presence of LD, chromosomes in later experiments were generated by forward simulation (see Subsection ‘Forward Simulation Genotype Data with Implanted IBD Segments’).

For the randomly generated chromosomes, the statistical characteristics (minor allele frequencies and distances between SNVs) were chosen to match the genotyping data from the 1000 Genomes Project. Minor alleles were chosen randomly according to the MAF. We implanted short IBD segments that are tagged by rare variants. The artificial genotype data consist of either 100 or 1000 diploid individuals (200 or 2000 chromosomes) and 10 000 SNVs. The IBD segments were chosen to be very short, containing 100–200 SNVs on average, which corresponds to a length of 10–20 kb. This choice was motivated by the lengths of haplotype blocks (50,51). For example, Gabriel et al. (52) found that common haplotype blocks have an average length of 9 kb in Africans (AFR) and 18 kb in Europeans (EUR) and East Asians (ASN). Each IBD segment possesses a particular number of tagSNVs and is implanted in a certain number of chromosomes. More details on how the data are constructed can be found in the Supplementary Information, Section S3.1.
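The general idea of drawing chromosomes in linkage equilibrium and implanting an IBD segment can be sketched in Python as shown below. This is a simplified illustration and not the construction of Supplementary Section S3.1: the MAFs are drawn uniformly instead of being matched to the 1000 Genomes Project, segment break points and mismatches are omitted, and all function names are hypothetical.

```python
import random


def random_chromosomes(n_chrom, mafs, rng):
    """Draw every allele independently (linkage equilibrium); 1 = minor allele."""
    return [[int(rng.random() < maf) for maf in mafs] for _ in range(n_chrom)]


def implant_segment(chroms, carriers, start, length, n_tags, rng):
    """Overwrite [start, start + length) in all carriers with one shared segment.

    The shared segment is copied from the first carrier and receives the
    minor allele at n_tags randomly chosen positions, which play the role
    of tagSNVs. Returns the sorted tagSNV positions.
    """
    end = start + length
    founder = chroms[carriers[0]][start:end]      # slicing makes a copy
    tags = sorted(rng.sample(range(start, end), n_tags))
    for pos in tags:
        founder[pos - start] = 1
    for c in carriers:
        chroms[c][start:end] = founder
    return tags


rng = random.Random(0)
# Assumption: rare variants with MAFs drawn uniformly below 5%.
mafs = [rng.uniform(0.001, 0.05) for _ in range(10_000)]
chroms = random_chromosomes(200, mafs, rng)       # 100 diploid individuals
carriers = rng.sample(range(200), 6)              # chromosomes sharing the segment
tags = implant_segment(chroms, carriers, start=4000, length=200, n_tags=50, rng=rng)
```

The parameter values are loosely modeled on a small data set with 100 individuals, a segment of 200 SNVs, 50 tagSNVs and 6 carrier chromosomes.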

Table 1 provides the following information for each artificial genotype data set: the number of implanted IBD segments (#I, last column), the number of tagSNVs per IBD segment (#S), the number of chromosomes possessing an IBD segment (F), the minimal overlap of the implanted IBD segments between chromosomes, since segments are broken at their beginning and end (O), and the number of mismatches that simulate genotyping errors (#M). In addition, the table lists the number of individuals (first column) and the IBD segment length in SNVs (L).

Table 1. Overview of artificial data sets of phased genotype data

Data set	Individuals	L	#S	F	O	#M	#I
artA100	100	200	50	6	50	0	1
artA	1000	200	50	6	50	0	1
artAMis	100	200	50	6	50	6	1
artB100	100	200	20	10	100	0	1
artB	1000	200	20	10	100	0	1
artBMis	100	200	20	10	100	6	1
artC100	100	200	25	10	100	0	5
artC	1000	200	25	10	100	0	5
artCMis	100	200	25	10	100	6	5
artD100	100	100	20	10	50	0	20
artD	1000	100	20	10	50	0	20
artDMis	100	100	20	10	50	6	20

Number of individuals	100	100	1000	1000
Number of SNVs	10 000	100 000	10 000	200 000
Method
HapFABIA	31 s	5 min 43 s	6 min 12 s	3 min 2 s
fastIBD	52 s	8 min 2 s	43 min 57 s	10 h 29 min
PLINK	1 min 47 s	18 min 12 s	2 h 59 min	67 h 14 min
GERMLINE	5 s	52 s	8 min 17 s	36 s
DASH	22 s	44 min 17 s	52 min 32 s	5 min 17 s
fastPHASE1	46 min 23 s	5 h 43 min	7 h 45 min	na
fastPHASE2	98 h 50 min	>490 h	>490 h	na
RELATE	53 min 2 s	10 h 43 min	89 h 12 min	na
MCMC	>564 h	na	na	na

Method	artA100		artB100		artC100		artD100
	Median	P	Median	P	Median	P	Median	P
HapFABIA	0.87		0.72		0.79		0.56
fastIBD-1	1.00	6e-16	0.17	6e-12	0.27	5e-18	0.20	4e-18
fastIBD-2	1.00	4e-13	0.00	3e-15	0.10	4e-18	0.05	4e-18
PLINK	0.98	4e-18	0.99	5e-18	0.99	4e-18	0.97	4e-18
GERMLINE-1	0.28	4e-18	0.84	9e-06	0.77	1e-01	0.32	4e-18
GERMLINE-2	0.13	4e-18	0.55	5e-02	0.49	7e-18	0.17	4e-18
DASH	0.20	4e-18	0.75	4e-18	0.71	4e-18	0.27	4e-18

	L	F	#I		L	F	#I
simA	20	10	1	simAlong	1000	6	1
simB	10	10	1	simBlong	1000	2	1
simC	20	6	1	simClong	2000	6	1
simD	10	10	20	simDlong	2000	2	1
simE	20	10	5	simElong	500	6	1
				simFlong	500	2	1

Method	simA		simB		simC		simD		simE
	m	P	m	P	m	P	m	P	m	P
HapFABIA	0.81		0.83		0.86		0.56		0.72
fastIBD-1	0.10	5e-16	0.10	4e-12	0.50	2e-07	0.15	4e-18	0.15	4e-18
fastIBD-2	0.00	4e-16	0.00	4e-13	0.17	5e-14	0.06	4e-18	0.06	4e-18
PLINK	0.36	4e-08	0.04	3e-12	0.28	9e-11	0.12	4e-18	0.37	1e-17
GERMLINE-1	0.95	2e-11	0.92	2e-04	0.96	2e-10	0.78	6e-18	0.94	8e-18
GERMLINE-2	0.00	2e-15	0.00	2e-12	0.00	3e-15	0.00	4e-18	0.00	4e-18
DASH	0.94	9e-10	0.92	2e-04	0.93	4e-07	0.76	6e-18	0.91	2e-16

Method	simAlong		simBlong		simClong		simDlong		simElong		simFlong
	m	P	m	P	m	P	m	P	m	P	m	P
HapFABIA	0.000		0.00		0.000		0.000		0.00		0.000
fastIBD-1	0.057	7e-02	0.02	2e-11	0.032	5e-02	0.003	7e-10	0.30	5e-18	0.577	2e-17
fastIBD-2	0.035	3e-01	0.00	1e-00	0.019	8e-02	0.000	1e-00	0.09	5e-14	0.005	4e-08
PLINK	0.987	4e-18	0.99	4e-18	0.975	4e-18	0.992	4e-18	1.00	4e-18	0.999	4e-18
GERMLINE-1	0.001	6e-02	0.00	1e-00	0.002	3e-01	0.000	1e-00	0.76	4e-18	0.910	4e-18
GERMLINE-2	0.001	6e-02	0.00	1e-00	0.002	4e-01	0.000	1e-00	0.96	4e-18	0.986	4e-18
DASH	0.000	3e-02			0.000	2e-02			0.00	1e-00

Single population					All populations
AFR	AMR	ASN	EUR		AFR/AMR/ASN/EUR
93 197	981	2522	1191		4132
Pairs of populations				Triplets of populations
AFR/AMR	AFR/ASN	AFR/EUR		AFR/AMR/ASN	AFR/AMR/EUR
42 631	615	1720		1196	8322
AMR/ASN	AMR/EUR	ASN/EUR		AFR/ASN/EUR	AMR/ASN/EUR
384	1901	556		307	933

	L	F	#I		L	F	#I
impA	20	10	1	impD	10	10	20
impB	10	10	1	impE	20	10	5
impC	20	6	1

Method	impA		impB		impC		impD		impE
	Median	P	Median	P	Median	P	Median	P	Median	P
HapFABIA	0.8210		0.5062		0.7228		0.4112		0.6083
fastIBD-1	0.1000	3e-11	0.0341	6e-08	1.0000	3e-14	0.0787	4e-18	0.1197	5e-18
fastIBD-2	0.0000	3e-14	0.0000	4e-10	1.0000	2e-09	0.0313	4e-18	0.0221	4e-18
PLINK	1.0000	2e-17	1.0000	2e-17	1.0000	4e-18	0.9893	4e-18	1.0000	4e-18
GERMLINE-1	0.7135	8e-01	0.4000	5e-01	0.6667	4e-02	0.4170	4e-01	0.6623	1e-02
GERMLINE-2	0.2399	6e-11	0.1000	8e-08	0.1077	8e-09	0.1544	5e-18	0.2698	2e-17
DASH	0.6621	3e-03	0.3743	3e-02	0.6233	6e-03	0.3686	4e-02	0.6222	4e-01

Method	impA		impB		impC		impD		impE
	Median	P	Median	P	Median	P	Median	P	Median	P
HapFABIA	0.8758		0.96451		0.92735		0.4577		0.6093
fastIBD-1	0.9997	3e-14	0.99994	3e-08	0.99805	7e-10	0.9971	4e-18	0.9980	4e-18
fastIBD-2	1.0000	2e-14	1.00000	4e-10	0.99642	2e-10	0.9978	4e-18	0.9993	4e-18
PLINK	0.9996	8e-14	0.99984	8e-06	0.99981	9e-10	0.9967	4e-18	0.9984	4e-18
GERMLINE-1	0.9996	8e-14	0.99988	6e-06	0.99977	7e-10	0.9976	4e-18	0.9982	4e-18
GERMLINE-2	0.9997	6e-14	0.99993	4e-08	0.99993	2e-11	0.9980	4e-18	0.9984	4e-18
DASH	0.9999	2e-14	1.00000	5e-10	1.00000	2e-12	0.9985	4e-18	0.9991	4e-18

Method	impA		impB		impC		impD		impE
	Median	P	Median	P	Median	P	Median	P	Median	P
HapFABIA	0.2124		0.0663		0.1337		0.4687		0.4707
fastIBD-1	0.0006	6e-14	0.0001	3e-08	0.0039	7e-10	0.0055	4e-18	0.0039	4e-18
fastIBD-2	0.0000	2e-14	0.0000	3e-10	0.0071	2e-10	0.0041	4e-18	0.0014	4e-18
PLINK	0.0006	2e-13	0.0003	8e-06	0.0004	9e-10	0.0066	4e-18	0.0031	4e-18
GERMLINE-1	0.0008	2e-13	0.0002	6e-06	0.0005	7e-10	0.0048	4e-18	0.0036	4e-18
GERMLINE-2	0.0006	8e-14	0.0001	4e-08	0.0002	2e-11	0.0040	4e-18	0.0032	4e-18
DASH	0.0001	2e-14	0.0000	5e-10	0.0000	2e-12	0.0030	4e-18	0.0018	4e-18
