Bioinformatics. 2016 Dec 29;33(7):1021–1030. doi: 10.1093/bioinformatics/btw735

RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination

Sajad Mirzaei, Yufeng Wu
Editor: Oliver Stegle
PMCID: PMC5860023  PMID: 28065901

Abstract

Motivation

Haplotypes from one or multiple related populations share a common genealogical history. If this shared genealogy can be inferred from haplotypes, it can be very useful for many population genetics problems. However, with the presence of recombination, the genealogical history of haplotypes is complex and cannot be represented by a single genealogical tree. Therefore, inference of genealogical history with recombination is much more challenging than the case of no recombination.

Results

In this paper, we present a new approach called RENT+ for the inference of local genealogical trees from haplotypes with the presence of recombination. RENT+ builds on a previous genealogy inference approach called RENT, which infers a set of related genealogical trees at different genomic positions. RENT+ represents a significant improvement over RENT in the sense that it is more effective in extracting information contained in the haplotype data about the underlying genealogy than RENT. The key components of RENT+ are several greatly enhanced genealogy inference rules. Through simulation, we show that RENT+ is more efficient and accurate than several existing genealogy inference methods. As an application, we apply RENT+ in the inference of population demographic history from haplotypes, which outperforms several existing methods.

Availability and Implementation

RENT+ is implemented in Java, and is freely available for download from: https://github.com/SajadMirzaei/RentPlus.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Large-scale population genetic studies such as the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) have produced a large amount of population genetic data for many populations. These population genetic data provide states (i.e. alleles) at multiple linked genetic variation sites such as single nucleotide polymorphisms (SNPs). Alleles for multiple linked SNPs from a single chromosome can be represented by a haplotype. The alleles on a haplotype are correlated and offer hints on the evolutionary history of genes. Thus, haplotypes are the main data type for the new method we develop in this paper. We refer to Section 2.1 for more details about haplotypes.

Now we consider a set of haplotypes at the same genomic region (locus) collected from multiple individuals of one or multiple related populations. These haplotypes share a common but usually unknown genealogical history (called gene genealogy or just genealogy). Gene genealogy of haplotypes, if known, can be very useful for the study of population evolution, such as natural selection and population demography. However, inference of genealogy from haplotypes is a challenging computational problem. This is to a great extent due to meiotic recombination (or just recombination). With recombination, there is still a genealogical tree at a single genomic position but the genealogical trees at different genomic positions (e.g. SNP sites) may be different. So with recombination, we need to infer a genealogical tree at each SNP site and these genealogical trees may be different. Thus, inference of genealogy with recombination is much more difficult than the no-recombination case.

Despite the challenges in genealogical inference with recombination, there exist a number of computational approaches for genealogy inference with recombination, since recombination is an important subject in population genetics. Many genealogy inference approaches are based on parsimony (Gusfield, 2005; Mele et al., 2010; Minichiello and Durbin, 2006; Song and Hein, 2005; Wu, 2008). Most existing parsimony-based approaches (see e.g. Gusfield, 2014) aim to minimize a quantity (usually the total number of recombination events in the genealogy). One potential issue with parsimony-based approaches is that the underlying genealogy may contain more recombinations than the most parsimonious genealogy. This can happen in ‘recombination hotspots’ in the genome where there are significantly more recombinations than in other regions (see, e.g. Myers et al., 2005). There are also probabilistic approaches, such as ARGweaver (Rasmussen et al., 2014). Probabilistic approaches are usually based on some population genetic models and can be more accurate than parsimony approaches (Rasmussen et al., 2014). However, a major drawback of probabilistic approaches such as ARGweaver is that they are usually much slower than parsimony-based approaches and there are also issues such as convergence and parameter settings. In Wu (2011), we proposed an approach called RENT for genealogy inference. Instead of minimizing the total number of recombinations, RENT infers the likely genealogical history from population haplotypes in a heuristic way. It is shown in Wu (2011) that RENT performs reasonably well (but sometimes less well) when compared with another genealogy inference heuristic called MARGARITA (Minichiello and Durbin, 2006). When compared with the more recently developed ARGweaver approach, however, both RENT and MARGARITA appear to perform less well in terms of topological accuracy of inference.

In this paper, we present a new method, called RENT+, for genealogy inference with recombination. RENT+ improves upon RENT in several aspects. First and most importantly, RENT+ significantly improves the inference accuracy. The key is that RENT+ utilizes more information (e.g. the so-called singletons, where there is only a single minor allele among all haplotypes at a SNP site) contained in the data than RENT does. Different from the original RENT, RENT+ constructs guide trees from haplotypes and uses the guide trees to infer local genealogies. Through simulation, we show that RENT+ consistently outperforms the original RENT and MARGARITA, and is more accurate than ARGweaver on many simulated datasets in terms of topological accuracy of the inferred genealogy. Second, RENT+ implements new features (e.g. estimating branch lengths and local tree heights) for genealogy inference which are not in the original RENT. Finally, RENT+ is implemented in a new software tool that is efficient and can be applied to whole-genome data. Simulations show that RENT+ is much faster than ARGweaver (Rasmussen et al., 2014). To demonstrate the usefulness of RENT+, we apply RENT+ in the inference of population divergence history (called the population tree) with haplotypes from multiple populations. Haplotype-based population tree inference is developed in Wu (2015). However, the approach in Wu (2015) only allows haplotypes with no or very few recombinations. By applying RENT+, we can now infer the population tree using whole-genome data.

2 Background

2.1 Recombination and local genealogical trees

Recombination takes two equal-length homologous haplotypes and produces an offspring haplotype of the same length by taking a prefix from one sequence and then merging with a suffix from another sequence. With recombination, there is no single genealogical tree that can represent the genealogical history of a set of recombining haplotypes. This is because genealogy may follow different ancestry at different genomic positions (sites). Instead, we need to use a directed acyclic graph called ‘Ancestral Recombination Graph (ARG)’ (Griffiths and Marjoram (1996), also in Hein et al. (2005)), or a ‘Phylogenetic Network’ (Gusfield, 2005, 2014). A formal definition of an ARG (i.e. genealogical network) is given in Gusfield (2005, 2014). Let N be an ARG. The following observation of ARG is important to many methods (e.g. Hein, 1993; Minichiello and Durbin, 2006; Song and Hein, 2005; Wu, 2008, 2011): the full evolutionary history of the haplotypes M of site x is completely represented by a subtree Tx of N. The tree Tx is called a local genealogical tree (or just a local tree) at position x of the ARG N. One may view an ARG as essentially a collection of local genealogical trees, one for each SNP position. A useful property (see e.g. Gusfield, 2014) on local trees is: if there is no recombination in N with a breakpoint between sites x and y, then the local trees Tx and Ty are the same. However, if there is recombination between x and y, then Tx and Ty may be topologically different.

A variation at a single nucleotide site of the genome in a given population is called a single nucleotide polymorphism (SNP). If a site has only two out of four possible nucleotides in individuals of the population, it is called a biallelic SNP. We assume biallelic SNPs throughout this paper. Therefore, we use 0/1 to represent a nucleotide at a SNP site. A binary vector of SNPs of length m is called a haplotype (or SNP sequence). A mutation at a haplotype derives a new haplotype with a different nucleotide at the mutated site. n haplotypes at m SNP sites from a population form an n×m binary matrix M. Throughout this paper, we assume the infinite site model in population genetics (Kimura, 1969), which implies that there is at most one mutation at any SNP site in the haplotypes. Each SNP in M induces a split, which is a bi-partition of rows (leaves in the genealogy) in M. We use one side of the bi-partition to represent a split. For example, suppose at a SNP site, the haplotype rows from 1 to 3 have allele 0, and the rest of rows have allele 1. Then this SNP leads to a split {1, 2, 3}, which means that rows 1, 2 and 3 are on the same side of the split while all other rows are on the other side. We call a split trivial if there is a single row on one side of the split. That is, if there is a single 0 or 1 at a SNP site, the split at this SNP is trivial. A SNP site is a singleton if its induced split is trivial. Given a haplotype matrix M, two sites (columns) p and q in M are said to be incompatible if and only if there are four rows in M where columns p and q contain all four ordered pairs (0,1), (1,0), (1,1) and (0,0). Otherwise, we say p and q are compatible. The test for the existence of all four pairs is called the ‘four-gamete test’ in population genetics. We can extend the incompatibility (and compatibility) definition to splits and trees. Two splits are incompatible if the two columns implied by the two splits are incompatible. 
A branch of a tree induces a split, where the leaves below the branch are on one side of the split and the rest are on the other side. Thus, a tree implies multiple splits, one for each branch. A split s is incompatible with a tree T if s is incompatible with one of the splits in T. Two trees T and T′ are said to be incompatible if at least one of the splits in T is incompatible with T′. We say a SNP site p is compatible with a local tree if the split induced by p is compatible with the local tree. Note that a local genealogical tree at p must be compatible with the split implied by p. See, for example, Gusfield (2014) for more details.
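The four-gamete test and the singleton definition above can be illustrated with a minimal Python sketch. This is not taken from the RENT+ implementation; the function names (column, incompatible, is_singleton) and the small matrix M are hypothetical and serve only to make the definitions concrete for biallelic 0/1 data.

```python
def column(M, j):
    """Return column j of a haplotype matrix M (a list of 0/1 rows)."""
    return [row[j] for row in M]

def incompatible(col_p, col_q):
    """Four-gamete test: True if all four pairs (0,0), (0,1), (1,0), (1,1) occur."""
    return len(set(zip(col_p, col_q))) == 4

def is_singleton(col):
    """A site is a singleton if one of its two alleles occurs in exactly one row."""
    ones = sum(col)
    return min(ones, len(col) - ones) == 1

# Hypothetical 4 x 3 haplotype matrix (rows are haplotypes, columns are SNP sites).
M = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
print(incompatible(column(M, 0), column(M, 1)))
print(incompatible(column(M, 0), column(M, 2)))
print(is_singleton(column(M, 2)))
```

In this toy matrix, sites 0 and 1 contain all four gametes and are therefore incompatible, whereas site 2 is a singleton and is compatible with site 0.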

2.2 The RENT method

The RENT approach (Wu, 2011) is a heuristic for the following problem:

Local tree inference from haplotypes with recombination: given a set of haplotypes M, infer the local genealogical trees, one tree for each SNP site, that properly explain M.

The main idea of RENT is to jointly refine a set of local trees at the SNP sites by several justifiable rules. Here, we say a tree is refined when the tree is modified so that some new splits are present in the tree after the modification while all original splits are still present. Refinement occurs at internal non-binary nodes of a tree that have more than two children. Recall that we assume that there is one single mutation at each SNP site. So there is one known split at each site and the local tree for that site should contain this split. RENT starts with simple local trees that contain the single split at each site. It then refines these local trees following a set of rules. Each refinement step may potentially add a false positive split, and such an error may increase the chance of further errors. The most reliable refinements of RENT are inferred from two important rules: Fully Compatible Region and Split Propagation. Since these two rules are important to our new method, we explain them here again.

Fully Compatible Region Rule. A region of consecutive SNP sites is fully compatible if all pairs of SNP sites inside that region are compatible. For each site s the maximal fully compatible region centered at s is called Rc(s). The fully compatible rule applies to the local tree T at site s such that for each site s′ ∈ Rc(s), the split from site s′ is used to refine T. An illustration is given in Figure 1, where sites 1, 2, and 3 are in a compatible region. So the tree T2 derives splits {1, 2} and {4, 5, 6} (circled in the figure) from neighboring trees T1 and T3 respectively. Trees T1 and T3 derive splits {4, 5, 6} and {1, 2} from each other. Tree T is the refined tree which replaces all three local trees T1, T2, and T3. Since neighboring trees are often highly correlated and sometimes identical, this rule usually leads to very accurate refined trees. However, when recombination occurs, this rule often can only be applied to short genomic intervals.

Fig. 1. An illustration of fully compatible rule. Leaf labels: rows in M. T1, T2 and T3 are local trees at three nearby SNP sites. The binary vector below each Ti: the SNP alleles at the corresponding ordered sites (one allele for each row). T: the refined common local trees for all three sites

Split Propagation Rule. This rule generalizes the fully compatible region rule such that each split is propagated to (i.e. used to refine) the neighboring trees and continues until it reaches an incompatible tree. This rule applies to all given and inferred splits of each local tree until there is no more refinement. This rule may potentially identify shared subtrees in two otherwise incompatible nearby local trees. We refer to Figure 2 for an illustration. Here, sites 2 and 3 are incompatible and so the fully compatible region rule does not apply. But, split {5, 6} from local tree T1 is a split compatible with both trees T2 and T3. The split propagation rule allows this split to be propagated and be derived by all neighboring trees (here T2 and T3) until it reaches an incompatible tree. As shown in Figure 2, T3 is refined by this rule.
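The original split propagation rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the RENT or RENT+ code: each local tree is represented only by its set of splits, each split is stored as a frozenset holding one side of the bipartition (the same side for the same bipartition), and the names compatible and propagate as well as the three toy trees are hypothetical.

```python
def compatible(a, b, leaves):
    """Two bipartitions are compatible iff one of the four side intersections is empty."""
    a_c, b_c = leaves - a, leaves - b
    return not (a & b) or not (a & b_c) or not (a_c & b) or not (a_c & b_c)

def propagate(trees, leaves):
    """Propagate every split to neighboring trees until an incompatible tree is reached.

    trees: list of sets of splits, ordered by SNP site; modified in place.
    """
    changed = True
    while changed:  # repeat until no tree is refined any further
        changed = False
        for i in range(len(trees)):
            for s in list(trees[i]):
                for step in (-1, 1):  # walk to the left, then to the right
                    j = i + step
                    while 0 <= j < len(trees) and all(
                        compatible(s, t, leaves) for t in trees[j]
                    ):
                        if s not in trees[j]:
                            trees[j].add(s)
                            changed = True
                        j += step
    return trees

# Toy example: sites 2 and 3 carry incompatible splits {2,3} and {3,4}.
leaves = frozenset(range(1, 7))
trees = [
    {frozenset({5, 6})},
    {frozenset({2, 3})},
    {frozenset({3, 4})},
]
propagate(trees, leaves)
print([sorted(sorted(s) for s in t) for t in trees])
```

In this toy input the split {5, 6} from the first tree reaches the third tree, while {2, 3} and {3, 4} stop at each other because they are incompatible.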

Fig. 2. The propagation rule example. T3: the refined tree at the third site. The split {5,6} of T1 is propagated to T3

More details about these rules and their implementations in RENT+ are given in Section 3.5.

3 Method

3.1 The high-level idea

The key idea of RENT+ is extracting more information from data for genealogy inference than RENT does. There are several major improvements made by RENT+. We use singleton SNPs to illustrate one key idea. In both simulated and real population haplotype data, a large portion of SNPs are singletons. That is, there is only a single minor allele among all haplotypes at a SNP site. The original RENT essentially ignores all singleton SNPs because splits implied by singleton SNPs are compatible with any genealogical trees and so do not seem to add topological information to genealogy inference.

However, ignoring all singleton SNP sites means that the original RENT (and most parsimony-based methods) essentially discards a significant portion of data. A key observation made by RENT+ is that although singleton sites do not directly lead to new non-trivial splits, they provide important hints to the underlying genealogy on the timing of genealogical events. To fix ideas, consider three haplotypes H1, H2 and H3 within a short region, where there are five singleton SNPs (with 1 being the minor allele). Further suppose the haplotypes are: H1 = 11111, H2 = H3 = 00000. That is, H1 contains all five minor alleles. The original RENT would not be able to determine which two haplotypes are closer to each other because all the SNPs are singletons.

RENT+, on the other hand, can extract more topological information from haplotypes. For example, for the three haplotypes above, RENT+ infers that H2 and H3 are more similar and thus creates a new split separating H2 and H3 from H1.

To summarize, RENT+ aims to use more of the information contained in the data for genealogy inference. This allows RENT+ to infer more accurate genealogical topologies than RENT. RENT+ first infers the genealogical tree topologies at the SNP sites. Then the branch lengths are estimated for these trees.

3.2 The general procedure of RENT+

RENT+ has the same high-level procedure as the original RENT. It starts with the split implied by the haplotypes at each SNP site, and then refines the local trees at given sites.

Different from RENT, RENT+ starts with calculating a distance between each pair of sequences at each site. Then it constructs a guide tree using these distances for each site and uses information in the guide trees to improve the rules of RENT. RENT+ also introduces new rules.

3.3 Calculating distances between pairs of haplotypes

We estimate the distance between each pair of haplotypes x and y for each site. The distance is computed within a specific region for each SNP site. We use the simple normalized Hamming distance as the distance between x and y for that region. That is, the distance is equal to the number of SNP sites where x and y have different alleles divided by the length of the region. It is important to determine the region for each site for each pair of haplotypes. Based on the given SNPs, we define a region such that it is unlikely for recombination to occur between x and y within that region. We ignore the singleton sites for defining the regions, but we count them in the distance calculation. More precisely, at a SNP site p, the region for two haplotypes x and y is the largest window [wl, wr] of sites such that for each site q ∈ [wl, wr], one of the following holds: (i) x[q] = y[q], (ii) site q is a singleton, (iii) q = wl or (iv) q = wr.

For example, in Figure 3(a), for site 6, the region for sequences 3 and 4 extends from site 1 to site 11 because haplotypes 3 and 4 have different alleles at non-trivial sites 1 and 11. Singleton SNPs at sites 2, 7, 9 and 10 are ignored when defining the region. The Hamming distance between sequences 3 and 4 within this region is 6. So the distance between sequences 3 and 4 at site 6 is 6/(102 − 1 + 1) ≈ 0.058.

Fig. 3. (a) A set of five SNP sequences for 13 sites with genomic positions. (b) The distance matrix for the site 6. Dark braces show the regions of calculating the distance between pairs (1, 2) and (3, 4). (c) The guide tree for site 6

If the region for two sequences for site a is too short, we extend the region to a window of w sites from both sides of a. This is because calculation of the distance in a short region may not be accurate enough. For example in Figure 3(a), for the site 6 the original region for sequences 1 and 2 is [4,8]. If we assume w = 3, this region should be extended to [3,9]. So the distance between sequences 1 and 2 is 3/(95 − 25 + 1) ≈ 0.042. Our experiments show that w = 5 usually performs well in practice.
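The region definition and the normalized Hamming distance described above can be sketched as follows. This is a hedged illustration rather than the RENT+ implementation. It assumes that sites are 0-based column indices, that pos holds the genomic position of each site, that the region length is pos[wr] − pos[wl] + 1, and that "too short" means fewer than w sites on a side of the focal site. The sketch scans outward from the neighbors of p and does not special-case p itself being a non-singleton site where x and y differ. All names and the toy data are hypothetical.

```python
def is_singleton(M, q):
    """True if one allele at site q occurs in exactly one haplotype."""
    ones = sum(row[q] for row in M)
    return min(ones, len(M) - ones) == 1

def region(M, x, y, p):
    """Largest window [wl, wr] around site p bounded by non-singleton differing sites."""
    m = len(M[0])

    def blocks(q):
        # The window ends at a site where x and y differ and that is not a singleton.
        return M[x][q] != M[y][q] and not is_singleton(M, q)

    wl = p
    while wl > 0:
        wl -= 1
        if blocks(wl):
            break
    wr = p
    while wr < m - 1:
        wr += 1
        if blocks(wr):
            break
    return wl, wr

def distance(M, pos, x, y, p, w=5):
    """Normalized Hamming distance between haplotypes x and y at site p."""
    m = len(M[0])
    wl, wr = region(M, x, y, p)
    # Assumed extension rule: cover at least w sites on each side of p.
    wl = max(0, min(wl, p - w))
    wr = min(m - 1, max(wr, p + w))
    diff = sum(M[x][q] != M[y][q] for q in range(wl, wr + 1))
    return diff / (pos[wr] - pos[wl] + 1)

# Hypothetical data: 4 haplotypes, 7 sites, with genomic positions.
M = [
    [1, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1],
]
pos = [1, 10, 20, 30, 40, 50, 60]
print(region(M, 0, 1, 3))
print(distance(M, pos, 0, 1, 3, w=1))
```

For haplotypes 0 and 1 at site 3, the window stops at sites 0 and 5, the singleton sites 1 and 4 are skipped when delimiting the window but still counted as differences, and the distance with w = 1 is 4/50.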

3.4 Construction of guide trees

We now construct a guide tree based on the computed pairwise distances between haplotypes at each SNP site. The guide tree is constructed with a modified version of the well-known UPGMA algorithm. There is another method (Zhou et al., 2015) that uses the distances from the PSMC approach in Li and Durbin (2011). Our experiments show that our computed distances perform better in the guide trees. Note that in principle other distance-based tree inference methods can also be used. Empirical results suggest that the UPGMA algorithm appears to work well (even better than Neighbor Joining method) with the computed distances. We modify the UPGMA tree algorithm as follows. As in the original UPGMA algorithm, we find two clusters of leaves that are close in the average distance to be siblings. We impose an additional constraint: the guide tree remains compatible with input data and original non-trivial splits for each site. For instance in Figure 3(b), the smallest distance among all pairs of sequences is between sequences [2,5] (0.028), then between sequences [1,2],[1,3], and [1,4] (0.042). But, none of them are compatible with the site 6, except for [1,2]. So the rest will be ignored and 1 and 2 will be chosen as siblings.

Moreover, instead of choosing the minimum distance between nodes in each step, we keep a set of other distances that are close to the minimum distance (up to 20% more) and choose the pair that is more compatible with neighboring SNP data. Between pairs that are compatible with all neighboring sites, we choose the one with the smallest distance. For example in Figure 3(b), the distance between sequences [4,5] (also [3,5]) is slightly less than that between sequences [3,4], but unlike split {4, 5}, split {3, 4} is compatible with neighboring sites 3 and 8, so nodes 3 and 4 will be chosen as siblings first. This is shown in the guide tree at the site 6 in Figure 3(c).

3.5 Local tree refinement

3.5.1 Initial tree and rooting

Similar to RENT, we start with a simple local tree for each site including a single split implied by this SNP site. We have two choices for rooting the tree at this point. That is, we must determine which one of the two alleles is the mutant allele at this site. After rooting is determined, we add an internal node to the tree with the rows carrying the mutant allele as its children, and the remaining rows become children of the root. This is an important choice because it will affect the accuracy of the inferred tree and its neighboring local trees, since this clade (split) may be shared with them.

RENT uses the majority allele rooting approach for rooting the local tree. That is, for each site the major allele (i.e. the allele with higher frequency) will be considered to be the ancestral allele (which is to be used as the root). Experiments show that this method of rooting works reasonably, but it does lead to wrong rooting in some cases, especially when the frequencies of two alleles (0 or 1) are close. Different from RENT, RENT+ uses the guide tree to determine the root for the tree.

RENT+ first finds the lowest common ancestor (LCA) of the rows from each partition of the original SNP site in the guide tree. It then chooses the partition whose LCA has the smaller height, and this partition becomes a subtree under the root. In other words, the partition with the smaller height is likely to be the true subtree below the root. For example, in Figure 4, at site a the majority rule implies that {3, 4} is the mutant allele, but the true tree for this site (Ta) indicates that {1, 2, 5} is the mutant allele, which also agrees with the guide tree Ua. So the valid choice should be {1, 2, 5}. Our experiments show that this rooting is much closer to the actual root, because it has more information about the time of coalescence between haplotypes.

Fig. 4. Application of local tree refinement rules. Top row shows the haplotypes for the site a to j generated from true trees in row 2. Row 3 shows the guide trees inferred from haplotypes. Row 4 shows initial local trees inferred from haplotypes. At some sites (e.g. b,c), haplotypes contain singletons, so the initial tree is a tree without any non-trivial split (e.g. Lb, Lc). Row 5 shows refined local trees after applying Enhanced Fully Compatible Region and Enhanced Split Propagation rules. Row 6 shows final refined local trees after applying all refinement rules of RENT+

RENT+ then applies several refinement rules to update local trees in the following order.

3.5.2 Enhanced fully compatible region rule

The original Fully Compatible Region Rule in RENT adds all splits within a maximal fully compatible region to all the trees within this region. Some of these splits may be false positives.

Different from RENT, RENT+ uses the guide tree information to avoid false positive splits. We note that changes in the order of adding splits to each local tree can lead to different results. For each site, RENT+ first adds splits that also appear in the guide tree of that site before adding other splits. Our experiments show that this change alone removes many false positive splits and generates more accurate local trees. For example, in Figure 4 the fully compatible region for the site e is shown with braces in top section, and this region only includes split {1, 2} from the site d. But it does not agree with the guide tree for the site e (Ue), and will also be a wrong inference according to the related true tree Te. The new enhanced rule will not allow this inference at the first step of this rule. On the other hand, the site f will derive the split {2, 3} from local tree in the site g (Lg) because it agrees with Uf. Note that the maximal compatible region rule for the site f is different from site e (lower braces). In the second step of this new rule, the original fully compatible region will apply, and the site e will derive all the splits from neighboring trees even if they do not agree with Ue.

Since the refined local tree in the site f is now closer to the site e, and it has derived split {2, 3}, it will also be derived by Le which is the correct inference according to Te.

3.5.3 Enhanced split propagation rule

One deficiency of the original split propagation rule is that it does not consider the possible occurrence of recombination when propagating a split to a neighboring tree. In some cases a split may be compatible with a neighboring tree, but the overall topology of a local tree may have been changed because of recombination. Our experiments show that this may add false splits to a neighboring tree which may cause further false inferences in later stages. For example, in Figure 4, in the original RENT, the split {1, 2, 5} will be propagated to neighboring sites from site a + 1 to e – 1 since it is compatible with all local trees in between. But this is a false split because starting from the site c the true trees (also the guide trees) do not agree with this split.

To prevent this, we enhance this rule by adding three more levels of propagation before the main process.

First level Since the original split propagation rule starts from the left most sites, it may have biased results in favor of splits on the left.

To prevent this, we consider all regions between two sites with two incompatible splits, which have no incompatible local trees in between. For example, in Figure 4, consider sites a and h with incompatible splits {1, 2, 5} and {4, 5}. All local trees inside this region should derive one of these two incompatible splits. Here we use the guide trees to decide which site should derive which split. In this example, sites from a to b will derive {1, 2, 5} but sites between c and h will derive split {4, 5}, which is correct according to the true trees.

Second level We apply the same process of propagation with checking with guide trees, but without limiting regions. That is, all the splits will be propagated as long as they are in agreement with guide trees, and are compatible with the local trees.

Third level At some sites guide trees do not agree with any of the neighboring splits, and so the first two steps will not add any splits to those sites. In Figure 4, the site i is an example since Ui is not compatible with either the split {4, 5} from the site h or the split {3, 4} from the site j. Guide trees contain the estimated distances between two nodes which provide some hints on the height of the local tree. If a recombination happens between two neighboring sites which contain a part of a specific split, it may change the height of that split in the guide tree. During the propagation of a split to a neighboring site, if we observe a big change in the height of the related split in the guide tree at this site (i.e. an increase or decrease for more than 50% of the current height), we will stop the propagation. In the example above, the split {4, 5} will stop propagating at site i because there is a big change in height of split containing {4, 5} in Ui. On the other hand, the split {3, 4} will not stop propagating in the site i because the change in the height is not significant. So the split {3, 4} will correctly be added to Li.
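The stopping criterion of the third level can be written as a one-line check. The sketch below is only an illustration of the 50% threshold stated above; the function name, its arguments and the assumption that the change is measured relative to the current height are hypothetical.

```python
def stop_propagation(current_height, neighbor_height, max_change=0.5):
    """Stop if the split's guide-tree height changes by more than max_change (relative)."""
    return abs(neighbor_height - current_height) > max_change * current_height

print(stop_propagation(1.0, 1.6))  # height increases by 60%
print(stop_propagation(1.0, 1.3))  # height increases by 30%
```

With the default threshold, a 60% increase in height stops the propagation, whereas a 30% increase does not.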

Final level We now apply the original Split Propagation Rule to refine local trees more. For each site s in Figure 4, Ls is the refined local tree after applying Enhanced Fully Compatible Region and Enhanced Split Propagation rules.

3.5.4 Adding guide tree splits

Our experiments show that guide trees contain many true splits that previous rules fail to infer. The reason is that some splits never appear in haplotypes but they exist in true trees. Here we refine the remaining non-binary nodes of local trees by using the distance matrices for each site such that closer leaves will define a new split. We continue with this rule till there are no more non-binary nodes hence no more refinement for local trees. In Figure 4, for each site s, L̄s is the final inferred local tree. This rule is another important addition in our approach which is also the simplest one. It makes the local trees more refined and improves the accuracy of local trees. For example in Figure 4, Ti has split {2, 3, 4} which does not exist in SNP data, but it exists in Ui. After applying this rule, as in L̄i, this split will be added to Li.

3.6 Inferring branch lengths and tree heights

After inferring local tree topologies, we use the guide tree branch lengths to infer the branch lengths and the height of the trees (i.e. the time to the most recent common ancestors or TMRCAs) for local trees. The branch lengths and TMRCA are in the standard coalescent units. We convert the pairwise haplotype distance (and thus also the branch lengths of the guide trees) to coalescent time as follows. We first estimate the per-nucleotide mutation parameter θ0 (i.e. θ0 = 4Nμ where N is the diploid effective population size and μ is the mutation rate per generation per nucleotide) using the Watterson estimator (Watterson, 1975). Then the calculated pairwise distance between two haplotypes is d/L = (θ0/2) t, where d is the Hamming distance for a region of L nucleotides and t is the distance between the two haplotypes in the standard coalescent unit. This allows us to convert the pairwise distance to coalescent time. Note that TMRCAs at different sites may be different. We first find the consecutive local trees where the trees have the same topology. These trees are assumed to have the same heights, and we say these trees form a region. Their branch lengths are jointly estimated from the distance matrices within the region. For each subtree containing taxa set S in a local tree, the height of this subtree is estimated to be the average of all related LCA(S) (lowest common ancestor) in all guide trees within the same region. The same calculation applies for inferring TMRCA for each local tree. In this case, S contains all the taxa in the local tree.
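The conversion from pairwise distance to coalescent time can be sketched as follows. This is an illustration under explicit assumptions, not the RENT+ code. The Watterson estimator is the standard one, θ = S / a_n with a_n = 1 + 1/2 + ... + 1/(n − 1), divided by the region length to obtain a per-nucleotide value. The conversion assumes the relation is read as d/L = (θ0/2) t, so that t = 2d / (L θ0). The function names and the numbers in the example are hypothetical.

```python
def watterson_theta_per_site(num_segregating_sites, n, region_length):
    """Watterson estimate of theta per nucleotide for n sampled haplotypes."""
    a_n = sum(1.0 / i for i in range(1, n))  # harmonic number H_{n-1}
    return num_segregating_sites / a_n / region_length

def coalescent_distance(d, L, theta0):
    """Solve d / L = (theta0 / 2) * t for t (assumed reading of the relation)."""
    return 2.0 * d / (L * theta0)

# Hypothetical numbers: 3 haplotypes, 3 segregating sites in 1000 nucleotides.
theta0 = watterson_theta_per_site(3, 3, 1000)
print(theta0)
print(coalescent_distance(1, 1000, theta0))
```

In this example a_n = 1.5, so θ0 is about 0.002 per nucleotide, and a pair of haplotypes differing at one site in 1000 nucleotides is about one coalescent unit apart under the assumed relation.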

4 Results

We now provide empirical results of using RENT+ for local tree topology and TMRCA inference, and one related application.

4.1 Performance of local tree topology inference

We use the program MS (Hudson, 2002) to generate simulated data. For comparison, we let MS output the true local trees. These parameters may affect the inference accuracy: sequence length l, the number of sequences n, the scaled mutation parameter t, and the scaled recombination parameter r. Note that t and r values for the MS program are relative to the region length. In the following, we fix the effective population size N = 10,000 and a mutation rate of μ = 1.8×10⁻⁸ mutations per site per generation. So, t = 4Nμl = 7.2×10⁻⁴ l. The recombination parameter r is then determined by t/r (the ratio of t and r). Therefore, in the following, we give the values of n, l and t/r for each simulation.
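As a small worked example of these simulation parameters, the following Python sketch computes t and r for the sequence lengths and ratios used in Section 4.1.1. It assumes the mutation rate is read as 1.8×10⁻⁸ per site per generation, which is consistent with the stated factor 7.2×10⁻⁴; the function name scaled_parameters is hypothetical.

```python
N = 10_000   # effective population size, as stated in the text
mu = 1.8e-8  # assumed reading of the per-site, per-generation mutation rate

def scaled_parameters(l, ratio):
    """Return (t, r) for region length l and mutation-to-recombination ratio t/r."""
    t = 4 * N * mu * l
    r = t / ratio
    return t, r

for l in (100_000, 1_000_000):
    for ratio in (5, 2, 1, 0.5, 0.2):
        t, r = scaled_parameters(l, ratio)
        print(f"l={l:>9,d}  t/r={ratio:<4}  t={t:.1f}  r={r:.1f}")
```

For l = 100K this gives t = 72, and a ratio t/r = 5 then corresponds to r = 14.4.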

In each simulation we run each program with 50 randomly generated datasets by MS and report the accuracy as the average over all 50 datasets. To compare the inferred trees with true trees, we define the topological accuracy as the ratio of the number of correctly inferred non-trivial splits to the number of total non-trivial splits in the true tree.
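The accuracy measure can be illustrated with a short Python sketch. It assumes that trees are given as nested tuples of leaf labels and that a split is the bipartition of the taxa induced by an internal branch, with non-trivial meaning both sides contain at least two taxa; these representation choices are made for the example and are not stated in the text.

```python
def leaf_sets(tree):
    """Return (leaves below this node, leaf sets of all internal nodes)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def nontrivial_splits(tree):
    """Set of non-trivial bipartitions, each stored as one canonical side."""
    taxa, clades = leaf_sets(tree)
    anchor = min(taxa)  # the stored side never contains this taxon
    splits = set()
    for clade in clades:
        other = taxa - clade
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(other if anchor in clade else clade)
    return splits


def topological_accuracy(true_tree, inferred_tree):
    """Fraction of the true tree's non-trivial splits found in the inferred tree."""
    true_splits = nontrivial_splits(true_tree)
    if not true_splits:
        return 1.0
    return len(true_splits & nontrivial_splits(inferred_tree)) / len(true_splits)


true_tree = ((1, 2), (3, (4, 5)))
inferred_tree = ((1, 2), (4, (3, 5)))
accuracy = topological_accuracy(true_tree, inferred_tree)
```

Here the true tree has two non-trivial splits, {1,2}|{3,4,5} and {4,5}|{1,2,3}, and only the first appears in the inferred tree, so the accuracy is 0.5.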

4.1.1 Comparing four methods

We let l={100K,1M},n={15,30,100}, and mutation versus recombination ratio t/r={5,2,1,0.5,0.2} to generate simulated haplotypes. We compare RENT+ with programs RENT (Wu, 2011), MARGARITA (Minichiello and Durbin, 2006), and ARGweaver (Rasmussen et al., 2014). We run ARGweaver in its default settings which runs for 1,000 sampling iterations and then we use the script provided by ARGweaver to output a set of consensus local trees for different regions of the haplotypes.

Figure 5 shows the results for this simulation. The original RENT is not implemented to scale to long sequences, so we do not include it in the second simulation (l=1M). Results show that RENT+ and ARGweaver perform better than the other two methods in all cases. RENT+ works better than ARGweaver as the mutation-to-recombination ratio (t/r) decreases. It also generates more accurate results than ARGweaver when we increase the number of sequences (i.e. n={30,100}), although the overall accuracy of both programs decreases.

Fig. 5.


Topological accuracy of local tree inference for all four programs, RENT+, ARGweaver, MARGARITA and RENT. The numbers of sequences range from 15 to 100, and mutation over recombination ratios range from 5 to 0.2. We use genomic regions with two different lengths: 100K and 1M (Color version of this figure is available at Bioinformatics online.)

In another simulation as shown in Figure 6, we fix the ratio t/r=1, as it is closer to reality in human populations, and vary the number of sequences. l is fixed to 100K. RENT+ reports more accurate results than ARGweaver for n > 15, and the gap in accuracy between the two programs increases as the number of sequences increases.

Fig. 6.


Topological accuracy comparison of ARGweaver and RENT+ by increasing the number of sequences where t/r=1 and l=100K (Color version of this figure is available at Bioinformatics online.)

4.1.2 Changing ARGweaver’s settings

It is important to mention that ARGweaver is sensitive to its initial settings (e.g. the values of mutation and recombination rates and the number of sampling steps). We experiment with one of the settings (n=30,l=100K) by setting initial mutation and recombination rates to the true values used in simulations when running ARGweaver. As shown in Figure 7(a), ARGweaver performs better for higher t/r ratios, but performs worse for lower t/r ratios than the default settings. Note that RENT+ does not need the user to choose mutation and recombination rates.

Fig. 7.


(a) Topological accuracy comparison of RENT+ with ARGweaver in its default mode and when setting its initial mutation and recombination rates to different values. Here, n = 30, l=100K. t/r values vary. (b) Effect of the number of ARGweaver’s sampling iterations on accuracy comparison with RENT+. Left: Comparing the average topological accuracy. Right: Comparing the running time (Color version of this figure is available at Bioinformatics online.)

Furthermore, ARGweaver needs more sampling iterations to obtain more accurate results when the number of haplotypes becomes larger. That is, ARGweaver runs much slower for larger data. We let l=100K, n = 30, t/r=1, and run ARGweaver with more sampling iterations (up to 5000 iterations). ARGweaver becomes more accurate but also becomes much slower when more iterations are used. As shown in Figure 7(b), it takes ARGweaver more than two hours to reach the same accuracy as RENT+, which only takes around ten seconds.

4.1.3 Comparing the running times

We compare the running time of RENT+ and ARGweaver in the inference of local trees. MARGARITA is much faster than both RENT+ and ARGweaver, but since its accuracy is always lower than that of these two programs, it is not included in the running time comparison. Generally, RENT+ is much faster than ARGweaver, especially when we increase the number and/or the length of sequences. We note that ARGweaver performs more tasks than RENT+. For example, ARGweaver produces multiple samples of genealogies while RENT+ only produces one inference result. Still, the running time comparison shows the gap in efficiency between the two programs. Figure 8 shows the average running times with parameters t/r=1, l={100K,1M}, and varying values of n. Note that the running time for ARGweaver depends on the number of sampling iterations. The running time reported here includes only the sampling time of ARGweaver. Obtaining consensus local trees from ARGweaver’s output takes even more time. In this simulation, we use the default settings of ARGweaver, which generate 1,000 samples.

Fig. 8.


Comparing running times of RENT+ and ARGweaver by increasing the numbers of sequences and the sequence lengths with fixed ratio of t/r=1. Running time reported in this figure is the average running time over all replicates for each setting (Color version of this figure is available at Bioinformatics online.)

4.2 Performance of TMRCA inferences

In addition to inferring local trees, RENT+ reports the time to most recent common ancestor (TMRCA) for each site. To measure the accuracy of TMRCA, we report the Pearson’s correlation coefficients for RENT+’s estimated TMRCAs vs. true TMRCAs. As shown in Figure 9, the Pearson’s correlation coefficients by RENT+ are mostly over 50%. These are close to the ARGweaver results in default settings and closer when t/r decreases to 0.2. Other simulations show that ARGweaver’s TMRCA values are also sensitive to the initial settings such that its accuracy can be reduced to less than 10% if the chosen initial mutation rate and the actual mutation rate are not close.
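For concreteness, the accuracy measure used here can be computed as in the following Python sketch. It assumes the estimated and true TMRCAs are available as two equal-length lists with one value per site; the function name is chosen for this example.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy per-site values (made up for illustration, not from the simulations).
true_tmrca = [1.0, 1.0, 2.5, 2.5, 0.8]
estimated_tmrca = [1.2, 1.1, 2.0, 2.2, 1.0]
r = pearson(estimated_tmrca, true_tmrca)
```

A coefficient of 0.5 from this function corresponds to the 50% level quoted in the text.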

Fig. 9.


Comparison of TMRCA accuracy of RENT+ with ARGweaver in its default setting and with setting initial mutation and recombination rates. n = 30, and l=100K. The values of t/r vary (Color version of this figure is available at Bioinformatics online.)

In Figure 10, we plot the TMRCA values compared to the true values for both programs RENT+ and ARGweaver for one random dataset with n = 30, l=100K and t/r=1. We assume initial default population size (10 000 generations) for ARGweaver. Using the inferred population sizes from ARGweaver decreases the TMRCA inference accuracy. In this example, the Pearson’s correlations to the true values are 64.79%, and 66.48% for RENT+ and ARGweaver respectively.

Fig. 10.


The inferred TMRCA values of RENT+ (top) and ARGweaver (bottom) in default settings and comparing with true TMRCA values for a random dataset with n = 30, l=100K and t/r=1 (Color version of this figure is available at Bioinformatics online.)

4.3 Application to the inference of population tree

We use RENT+ to infer population trees from haplotypes originating from multiple populations. A population tree represents the divergence history of multiple related populations. Population divergence refers to the demographic event where one ancestral population splits into two populations. In Wu (2015), we develop a method to infer population trees from haplotypes. However, the approach in Wu (2015) needs haplotypes from regions that have no or a low level of recombination. RENT+ allows the inference of population trees from haplotypes in regions with recombination. In principle, we can infer population trees using haplotypes from the whole genome with RENT+. Our population tree inference is a two-stage approach. First, we use RENT+ to infer local genealogical trees from haplotypes. Then we use the program STELLS (Wu, 2012, 2016) to infer the underlying population tree from the inferred local genealogical trees. The key advantage of our method is that the local genealogical trees capture the underlying relations among nearby SNPs (i.e. the so-called linkage disequilibrium). Simulation results show that our approach outperforms approaches (e.g. Pickrell and Pritchard, 2012) that do not use linkage disequilibrium information.

4.3.1 Inference of population tree from simulated haplotypes

We use MS to generate simulated data. For this simulation, we let l={100K}, n={15}, and t=r={100,500,1000}. We simulate three populations A, B and C with 5 haplotypes each. A and B diverge at time T1, and the height of the population tree is T2. Here, T2={1.0,0.5,0.1,0.05}. For each value of T2, we choose {T1=1/2 T2} or {T1=1/10 T2}. We generate 20 simulated replicates for each setting. STELLS is used to infer the population tree using the local trees inferred by RENT+. Our results show that the correct population trees are inferred for all 20 datasets under almost all settings. The only exception happens when t=r={100} and T2={0.05}. Here, our method finds the correct population trees for 18 replicates when T1={0.005} and for 15 replicates when T1={0.025}.
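A three-population setting of this kind could be expressed as an ms command along the following lines. This Python sketch is an assumption about the invocation, not the command used for the paper: it relies on the `-I` (population structure) and `-ej` (population join) options of ms, numbers the populations A, B, C as 1, 2, 3, and passes T1 and T2 through unchanged, so any conversion between the time units of the text and those expected by ms is left to the reader.

```python
def three_population_ms_command(t1, t2, theta=100, rho=100,
                                seq_length=100_000, haps_per_pop=5):
    """Build an ms command for populations A, B, C (hypothetical helper).

    Going backwards in time, B (2) joins A (1) at t1 and C (3) joins
    the ancestor of A and B at t2.
    """
    if not 0 < t1 < t2:
        raise ValueError("expected 0 < T1 < T2")
    n = 3 * haps_per_pop
    return (
        f"ms {n} 1 -t {theta:g} -r {rho:g} {seq_length} "
        f"-I 3 {haps_per_pop} {haps_per_pop} {haps_per_pop} "
        f"-ej {t1:g} 2 1 -ej {t2:g} 3 1 -T"
    )


# Enumerate the (T1, T2) grid described in the text for one value of t = r.
settings = [
    (t2 * fraction, t2)
    for t2 in (1.0, 0.5, 0.1, 0.05)
    for fraction in (0.5, 0.1)
]
commands = [three_population_ms_command(t1, t2) for t1, t2 in settings]
```

The grid has eight (T1, T2) combinations per value of t = r, each of which would be simulated 20 times.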

4.3.2 Inference of population tree with phased haplotypes and comparison with TreeMix and SVDQuartets

RENT+ assumes haplotypes are given as input. While haplotypes for human populations are becoming more available, this is not the case for many other populations. Thus, we also investigate the performance of RENT+ on the inferred haplotypes.

We simulate ten haplotypes for each population. We randomly group pairs of haplotypes into genotypes and then use the program PHASE (Stephens et al., 2001) to infer the haplotypes from the genotypes. For comparison, we also experiment with the ten simulated haplotypes in the simulation. In addition to the population tree simulated in Section 4.3.1, we experiment with an additional population tree with four populations. In this simulation, populations A and B and also C and D diverge at time {T1=1/2 T2}. The two ancestral populations (prior to the divergence) then find the common ancestral population (of all four populations) at time T2. We let T2={0.1,0.05} since these are the harder cases for population tree inference. This is because time is shorter and there are fewer mutations in these cases. We let l={100K}, t=r={100}. For each setting we simulate 20 replicates. For comparison, we run the programs TreeMix (Pickrell and Pritchard, 2012) and SVDQuartets (Chifman and Kubatko, 2015) on the same data. SVDQuartets is implemented in PAUP* version 4.0a150 (Swofford, 2002). SVDQuartets only works with four (or more) populations and the inferred population tree is not rooted. Table 1 shows the results for this simulation. For three populations, the true values indicate the number of correctly inferred population trees among 20 datasets. For four populations, there are two population clusters, (A,B) and (C,D), to infer. In this case, the numbers in Table 1 show how many of these two clusters are inferred (two, one, or none). SVDQuartets only infers unrooted trees (i.e. it cannot distinguish the two and one cases). As shown in Table 1, RENT+ is more accurate than TreeMix on both true and phased haplotypes. Although phasing reduces the accuracy of the local trees inferred by RENT+, the population tree inference is still accurate in most cases. Furthermore, SVDQuartets has a few wrongly inferred population trees for each setting. Overall, the performance of RENT+ is better than that of SVDQuartets.

Table 1.

Simulation results for inferring population trees from both true and phased haplotypes of three and four different populations, and comparison with programs TreeMix and SVDQuartets

                              3 Pops (30 haps)            4 Pops (40 haps)
                              T2=0.1        T2=0.05       T2=0.1               T2=0.05
Program      Haplotypes       true   false  true   false  two    one    none   two    one    none
RENT+        True haplotypes  20     0      20     0      19     1      0      19     1      0
RENT+        Phased           19     1      20     0      20     0      0      14     4      2
TreeMix      True haplotypes  10     10     9      11     4      16     0      10     10     0
TreeMix      Phased           10     10     5      15     5      15     0      11     9      0
SVDQuartets  True haplotypes  -      -      -      -      16*    -      4      19*    -      1
SVDQuartets  Phased           -      -      -      -      17*    -      3      19*    -      1

* SVDQuartets cannot distinguish the two and one cases; the starred count covers both.

T2: height of population trees. true/false: number of correct (wrong) inference for three populations; two/one/none: the number of correctly inferred clusters in population trees is 2/1/0 for four populations.
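The two/one/none scoring for four populations can be made concrete with a small Python sketch. It assumes an inferred rooted population tree is available as nested tuples of population labels; the helper names are chosen for this example and are not part of STELLS or RENT+.

```python
def clades(tree):
    """Return (leaves below this node, set of leaf sets of all internal nodes)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def count_recovered_clusters(pop_tree, expected=(("A", "B"), ("C", "D"))):
    """Number of expected population clusters present as clades (0, 1 or 2)."""
    _, found = clades(pop_tree)
    return sum(frozenset(cluster) in found for cluster in expected)


two = count_recovered_clusters((("A", "B"), ("C", "D")))
one = count_recovered_clusters(((("A", "B"), "C"), "D"))
none = count_recovered_clusters((("A", "C"), ("B", "D")))
```

The three example trees fall into the two, one and none columns respectively. On an unrooted tree the first two cases induce the same split AB|CD, which is why they cannot be told apart by a method that infers unrooted trees.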

We also apply our approach to infer the population tree from the recently released haplotypes from the 1000 Genomes Project. Due to the lack of space, we provide these results in the Supplementary Material.

5 Discussion and conclusions

In this paper, we develop a new approach, RENT+, for the inference of local genealogical trees from recombining haplotypes. We demonstrate that our approach is more efficient than ARGweaver, currently the best recombination genealogy inference approach. Despite the heuristic nature of our method, we show that RENT+ is competitive in terms of accuracy of the inferred genealogy in many datasets when compared with ARGweaver, which is based on a complicated probabilistic model. The lesson learned is that when properly designed, heuristics can give good results for difficult problems such as genealogy inference. We also demonstrate that RENT+ allows the development of a new approach in inference of population demographic history. The key benefit of using RENT+ is that it allows the inference to utilize the underlying joint information contained in multiple nearby SNPs (i.e. the so-called linkage disequilibrium) in such inference. We focus on the inference of the topologies of the local trees. We note that there are approaches (e.g. the PSMC approach in Li and Durbin (2011)) for the inference of coalescent time from population genetic data. The PSMC approach can infer the coalescent time of two haplotypes. In principle, RENT+ can use the estimated pairwise coalescent time inferred by PSMC to guide the local tree inference. Our initial experiment suggests that this can lead to reasonably accurate estimates of coalescent times in the local trees inferred by RENT+. In order to scale to large data, RENT+ uses simpler and faster approaches for estimating coalescent times. Empirical results show that our approach performs reasonably well. Genealogy inference with recombination is a challenging computational problem. For future work, we plan to investigate other ways of exploiting the information contained in large genetic data for the purpose of genealogy inference.

Funding

This work is partly supported by U.S. National Science Foundation grants IIS-0953563 and IIS-1447711. Parts of simulations are performed on a computer cluster that is supported under a grant S10-RR027140 from National Institutes of Health.

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

References

  1. Chifman J., Kubatko L. (2015) Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J. Theor. Biol., 374, 35–47.
  2. Griffiths R.C., Marjoram P. (1996) Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol., 3, 479–502.
  3. Gusfield D. (2005) Optimal, efficient reconstruction of Root-Unknown phylogenetic networks with constrained and structured recombination. JCSS, 70, 381–398.
  4. Gusfield D. (2014) ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, Cambridge, MA.
  5. Hein J. (1993) A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol., 36, 396–405.
  6. Hein J. et al. (2005) Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, UK.
  7. Hudson R. (2002) Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338.
  8. Kimura M. (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61, 893–903.
  9. Li H., Durbin R. (2011) Inference of human population history from individual whole-genome sequences. Nature, 475, 493–496.
  10. Mele M. et al. (2010) A new method to reconstruct recombination events at a genomic scale. PLoS Comput. Biol., 6, e1001010.
  11. Minichiello M., Durbin R. (2006) Mapping trait loci using inferred ancestral recombination graphs. Am. J. Hum. Genet., 79, 910–922.
  12. Myers S. et al. (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science, 310, 321–324.
  13. Pickrell J.K., Pritchard J.K. (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet., 8, e1002967.
  14. Rasmussen M.D. et al. (2014) Genome-wide inference of ancestral recombination graphs. PLoS Genet., 10, e1004342.
  15. Song Y.S., Hein J. (2005) Constructing minimal ancestral recombination graphs. J. Comput. Biol., 12, 159–178.
  16. Stephens M. et al. (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68, 978–989.
  17. Swofford D.L. (2002) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
  18. The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature, 526, 64–74.
  19. Watterson G.A. (1975) On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol., 7, 256–276.
  20. Wu Y. (2008) Association mapping of complex diseases with ancestral recombination graphs: models and efficient algorithms. J. Comput. Biol., 15, 667–684.
  21. Wu Y. (2011) New methods for inference of local tree topologies with recombinant SNP sequences in populations. IEEE/ACM Trans. Comput. Biol. Bioinform., 8, 182–193.
  22. Wu Y. (2012) Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution, 66, 763–775.
  23. Wu Y. (2015) A coalescent-based method for population tree inference with haplotypes. Bioinformatics, 31, 691–698.
  24. Wu Y. (2016) An algorithm for computing the gene tree probability under the multispecies coalescent and its application in the inference of population tree. Bioinformatics (A Supplemental Issue of ISMB 2016), 32, i225–i233.
  25. Zhou H. et al. (2015) A chronological atlas of natural selection in the human genome during the past half-million years. bioRxiv, doi:10.1101/01892.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
