Biostatistics (Oxford, England). 2008 Jun 18;10(1):106–120. doi: 10.1093/biostatistics/kxn019

StepBrothers: inferring partially shared ancestries among recombinant viral sequences

Erik W Bloomquist 1,2,3, Karin S Dorman 1,2,3, Marc A Suchard 1,2,3,*
PMCID: PMC2639346  PMID: 18562348

Abstract

Phylogeneticists have developed several statistical methods to infer recombination among molecular sequences that are evolutionarily related. Of these methods, Markov change-point models currently provide the most coherent framework. Yet, the Markov assumption is faulty in that the inferred relatedness of homologous sequences across regions divided by recombinant events is not independent, particularly for nonrecombinant sequences as they share the same history. To correct this limitation, we introduce a novel random tips (RT) model. The model springs from the idea that a recombinant sequence inherits its characters from an unknown number of ancestral full-length sequences, of which one only observes the incomplete portions. The RT model decomposes recombinant sequences into their ancestral portions and then augments each portion onto the data set as unique partially observed sequences. This data augmentation generates a random number of sequences related to each other through a single inferable tree with the same random number of tips. While intuitively pleasing, this single tree corrects the independence assumptions plaguing previous methods while permitting the detection of recombination. The single tree also allows for inference of the relative times of recombination events and generalizes to incorporate multiple recombinant sequences. This generalization answers important questions with which previous models struggle. For example, we demonstrate that a group of human immunodeficiency type 1 recombinant viruses from Argentina, previously thought to have the same recombinant history, actually consist of 2 groups: one, a clonal expansion of a reference sequence and another that predates the formation of the reference sequence. In another example, we demonstrate that 2 hepatitis B virus recombinant strains share similar splicing locations, suggesting a common descent of the 2 viruses. 
We implement and run both examples in a software package called StepBrothers, freely available to interested parties.

Keywords: Bayesian, Hepatitis B virus, Human Immunodeficiency Virus, Phylogeny, Recombination

1. INTRODUCTION

The exchange of genetic information through homologous recombination substantially contributes to the diversity of life (Posada and others, 2002). Only recently recognized outside of sexually reproducing organisms (Temin, 1991), recombination is now expounded as a critical process in natural viral reproduction and pathogenesis (Rambaut and others, 2004). Two human viruses where recombination has clinical relevance are the human immunodeficiency virus 1 (HIV) and the hepatitis B virus (HBV). In several regions of the world, including Southeast Asia (Zhang and others, 2006) and Eastern Europe (Adojaan and others, 2006), recombinant strains of HIV dominate the acquired immunodeficiency syndrome epidemic. Similarly, recombinant strains of HBV have been found in Asia (Hannoun and others, 2000) and Africa (Owiredu and others, 2001).

To better appreciate the history, diversity, and ancestry of recombinant genomes, researchers have developed several approaches to infer recombination from molecular sequences of putative recombinants and representative parentals. In a recent simulation study comparing popular methods for recombination detection, Chan and others (2006) show that Bayesian phylogenetic-based methods have the best accuracy at detecting recombination events as well as recovering the point of recombination on simulated data. In addition, such likelihood-based methods allow for formal statistical inference on the recombination process. Other approaches that do not rely on formal statistical models are able to illuminate simple facts about the existence and properties of recombinants, but they provide results increasingly difficult to interpret, especially as hypotheses about recombination become more complex.

Bayesian phylogenetic-based methods fall into 2 separate forms. The first uses a hidden Markov model (HMM), where the hidden states are discrete phylogenetic topologies (Husmeier & McGuire, 2003); the second form uses Markov change-point (MCP) processes to model the spatial changes in the evolutionary history along the genome (Suchard and others, 2003), (Minin and others, 2005). Overall, the advantages of the MCP models outweigh the advantages of the HMM models. Chan and others (2006) show that the MCP models recover recombination events and the location of recombination break points better than the HMM models. Furthermore, the MCP models uniquely account for other forms of evolutionary process heterogeneity along the data, most importantly rate variation. Failure to account for such rate variation, particularly common in viruses, can lead to a high false-positive rate of recombination detection in real data (Dorman and others, 2002).

In spite of these advantages, MCP models assume that the topologies summarizing the phylogenetic relationship of sequences at any 2 alignment sites separated by a recombination break point are independent. Although this assumption may hold for recombinant sequences, it clearly does not for nonrecombinant sequences. To correct this poor assumption, we propose a more flexible framework for defining sequence relatedness across recombinant break points. This framework, deemed the random tips (RT) model, permits the inference of evolutionary histories where the number of tips on the phylogenetic tree is not fixed but is determined by the number of recombinants and break points in the data.

An illustration helps demonstrate how a fixed number of sequences can generate a random number of tips (Figure 1). The figure depicts the history of 3 contemporaneously sampled sequences, the recombinant 𝒢 and 2 nonrecombinant sequences, as they descend from a single common ancestor, sequence 𝒜. The left-hand portion of 𝒢 derives from sequence 𝒟, and the right-hand portion derives from a second ancestral sequence. At the time of recombination, both ancestral sequences were full length, but 𝒢, the only contemporaneous record of these parental forms, retains information only from the leftmost portion of 𝒟 and the rightmost portion of the second ancestral sequence. Assuming for the moment that we know 𝒢 is a recombinant, we split it into 2 separate sequences 𝒢1 and 𝒢2, such that 𝒢1 contains the leftmost portion of 𝒟, 𝒢2 the rightmost portion of the second ancestral sequence, and both have missing information everywhere else. Sequences 𝒢1 and 𝒢2 represent our best current inference about the 2 ancient sequences. Once 𝒢 is split at the recombinant break point, we recover a bifurcating topology (right-hand side of Figure 1), where evidence of recombination is indicated by the fact that 𝒢1 is most closely related to one of the sampled nonrecombinant sequences, but 𝒢2 is most closely related to the other.

This strategy of data splitting or augmentation has several advantages. The strategy avoids the assumption of branch length and topological independence for sequences not sharing the selected break point. The use of a single phylogenetic tree maintains a time ordering on the evolutionary histories, allowing for the possibility of timing or at least bounding the time of recombination events. The ability to date recombinants is necessary to establish their role in epidemics and to tease apart the events giving rise to complex recombinants. The RT model naturally extends to the case of multiple recombinant sequences, permitting researchers to test if 2 or more recombinant sequences share a recombination event in time and sequence space.

Fig. 1.

Hypothetical evolutionary history relating recombinant sequence 𝒢 to its 2 sampled parental sequences. Sequence 𝒜 is the ancestral sequence, and 𝒢 and the 2 parental sequences are the sampled sequences. Since the first half of sequence 𝒢 derives from sequence 𝒟, while the second half derives from a second ancestral sequence, we split sequence 𝒢 into 2 separate sequences, 𝒢1 and 𝒢2. The blank portions of sequences 𝒢1 and 𝒢2 are missing data, coded as wild cards. Once the data set becomes augmented, we infer a phylogenetic tree on the 4 sequences.

A major difficulty with the RT model is the greatly expanded tree space. The space of all trees with a fixed number of tips is already extremely large, but the RT space contains all rooted bifurcating trees with a variable number of tips. Such a large space challenges prior specification. We overcome this issue by assuming a Yule-like branching process on the tree (Edwards, 1972). The other difficulty with the RT model springs from the sampling scheme required to make inference. Since the model space contains trees with a variable number of tips, we employ a reversible jump Markov chain Monte Carlo (MCMC) sampling scheme (Green, 1995).

We organize this paper as follows. Section 2 introduces the RT model, provides an example for clarity, and gives an outline of the MCMC scheme. In Section 3, we apply the RT model to both HIV and HBV data sets. Section 4 discusses the model's limitations and future directions for research. Extensive derivations are found in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org.).

2. MODEL

The RT model uses a Bayesian phylogenetic framework. We develop the model likelihood and prior in the next 2 subsections and then turn to methods to draw inference under this model. For more background information on traditional phylogenetic modeling, we refer the reader to Felsenstein (2003).

2.1. Likelihood

Observed data $Y^{\mathrm{obs}}$ includes $N$ aligned DNA or RNA sequences of length $S$. Each column $Y_s^{\mathrm{obs}}$, $s = 1, \ldots, S$, of $Y^{\mathrm{obs}}$ is a vector $(Y_{s1}^{\mathrm{obs}}, Y_{s2}^{\mathrm{obs}}, \ldots, Y_{sN}^{\mathrm{obs}})$ such that each element $Y_{sn}^{\mathrm{obs}}$ corresponds to the molecular character at site $s$ of sequence $n$, where $n = 1, \ldots, N$. Allowable molecular characters include the standard 4 nucleotides {A, G, C, T/U}, wildcard characters, and gaps introduced into the alignment beforehand. We denote a wildcard by a *; it signifies the presence of a completely unobserved character. We order the sequences so that the first $P$ sequences, $\mathcal{P}_p$, $p = 1, \ldots, P$, are the known nonrecombinant or possible parental sequences and the remaining $R$ sequences, $\mathcal{R}_r$, $r = 1, \ldots, R$, are the putative recombinant sequences, with $P + R = N$ and $P \ge 2$.

The decision whether to consider a specific sequence as a recombinant or parental ultimately relies upon the data. In our experience, investigators know or have defined specific strains as parental sequences. This occurs with both the HIV and the HBV data sets we examine. Of course, we do not believe that all data sets reflect such a nice distinction between parental and recombinant. For some bacterial data sets, nearly all sequences may demonstrate recombinant characteristics, an observation that limits the reach of the RT model. Still, these types of data reiterate that a statistician needs to tailor the model to the data, not the other way around.

We follow standard likelihood-based phylogenetic modeling and assume that sites are independent given the site-specific model parameters $\theta_s$ that characterize the evolutionary process (Felsenstein, 2003). We parameterize $\theta_s$ through a rooted bifurcating tree $\tau_s$, a vector of bifurcation times $T_s$ along $\tau_s$, and a continuous-time Markov chain rate matrix $Q_s$. The matrix $Q_s = \{q_{uv}\}_s$, $u, v \in$ {A, G, C, T/U}, provides the instantaneous transition rates for a continuous-time Markov process representing nucleotide substitution; $Q_s$ follows the parameterization of Hasegawa and others (1985) that includes a site-specific rate multiplier $\mu_s$, the ratio of the transition rate to the transversion rate $\kappa_s$, and the stationary distribution of the process $\pi_s = (\pi_{As}, \pi_{Gs}, \pi_{Cs}, \pi_{Ts})$. The finite-time transition matrix derives as $P_s(x) = e^{xQ_s} = \{p(u,v) \mid x\}_s$, such that $\{p(u,v) \mid x\}_s$ is the probability that nucleotide $u$ mutates into nucleotide $v$ along a branch of length $x$. For identifiability between $x$ and $Q_s$, we normalize $Q_s$ so that $\mu_s$ is the expected number of substitutions along a branch of unit length. We set the stationary distribution $\pi_s$, at every site $s$, equal to the empirical nucleotide frequencies of the whole alignment (Li and others, 2000).
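As a minimal sketch of the substitution model just described, the following pure-Python code builds a rate matrix of the Hasegawa and others (1985) form, normalizes it so that $\mu$ is the expected number of substitutions per unit branch length, and approximates $P(x) = e^{xQ}$. The function names, the nucleotide ordering, the example parameter values, and the Taylor-series-with-squaring approximation of the matrix exponential are illustrative assumptions, not details taken from StepBrothers.

```python
NUCS = "AGCT"            # assumed ordering, matching (pi_A, pi_G, pi_C, pi_T)
PURINES = {"A", "G"}

def hky_rate_matrix(kappa, mu, pi):
    """Return a 4x4 HKY85-style rate matrix scaled to mu substitutions per unit length."""
    n = len(NUCS)
    q = [[0.0] * n for _ in range(n)]
    for i, u in enumerate(NUCS):
        for j, v in enumerate(NUCS):
            if i == j:
                continue
            # A<->G and C<->T are transitions; everything else is a transversion.
            is_transition = (u in PURINES) == (v in PURINES)
            q[i][j] = (kappa if is_transition else 1.0) * pi[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so this is minus the off-diagonal sum
    # Expected substitution rate at stationarity is -sum_u pi_u * q_uu.
    rate = -sum(pi[i] * q[i][i] for i in range(n))
    return [[mu * q[i][j] / rate for j in range(n)] for i in range(n)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transition_matrix(q, x, squarings=10, terms=12):
    """Approximate exp(x * Q) with a truncated Taylor series followed by repeated squaring."""
    n = len(q)
    scale = x / (2 ** squarings)
    a = [[scale * q[i][j] for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]  # a^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Illustrative values only; they are not estimates from the paper.
pi = [0.3, 0.2, 0.2, 0.3]
q = hky_rate_matrix(kappa=2.0, mu=1.0, pi=pi)
p = transition_matrix(q, x=0.1)
```

Each row of `p` should sum to 1, and `pi` should remain stationary under `p`, which gives a quick sanity check on the normalization.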

To avoid severe overfitting of the data, we employ a parsimonious Bayesian MCP approach (Suchard and others, 2003), (Minin and others, 2005). Rather than allowing each individual site s to have a distinct set of parameters θs, we assume 1 + R underlying MCP processes. The first MCP process describes the joint variation in the substitution process across all sequences. The remaining R processes partition the alignment into recombinant segments.

For the first MCP, we assume an unknown number $J + 1$ of nonoverlapping intervals such that $\kappa_s$ and $\mu_s$ are the same for every site in a given interval. The intervals are partitioned at substitution change points $\rho = (\rho_0, \rho_1, \ldots, \rho_{J+1})$ such that $0 = \rho_0 < \rho_1 < \cdots < \rho_{J+1} = S + 1$. Then for all sites $s \in [\rho_{j-1}, \rho_j)$, $\kappa_s = \kappa_j$ and $\mu_s = \mu_j$. We collect these parameters into vectors $\kappa = (\kappa_1, \kappa_2, \ldots, \kappa_{J+1})$ and $\mu = (\mu_1, \mu_2, \ldots, \mu_{J+1})$.

The specification of the other $R$ MCP processes that determine $\tau_s$ and $T_s$ is considerably trickier. We assume that for every recombinant sequence $\mathcal{R}_r$, there are an unknown number $M_r + 1$ of nonoverlapping intervals, and $\xi_r$ corresponds to the vector of recombinant break points $(\xi_{r0}, \xi_{r1}, \ldots, \xi_{r,M_r+1})$, $0 = \xi_{r0} < \xi_{r1} < \cdots < \xi_{r,M_r+1} = S + 1$. Following the illustration in Section 1, each of the intervals $[\xi_{r,m-1}, \xi_{rm})$, $m = 1, \ldots, M_r + 1$, corresponds to a fragment of a partially observed sequence $\mathcal{R}_{rm}$. For sites $s \in [\xi_{r,m-1}, \xi_{rm})$, the nucleotide character of $\mathcal{R}_{rm}$ at site $s$ is the nucleotide character of $\mathcal{R}_r$ at $s$. If $s \notin [\xi_{r,m-1}, \xi_{rm})$, the character of $\mathcal{R}_{rm}$ at site $s$ is a *. We collect all the $\xi_r$ into $\xi = (\xi_1, \xi_2, \ldots, \xi_R)$ and all the $M_r$ into $M = (M_1, M_2, \ldots, M_R)$.

We can now construct $\tau$ and $T$. In this construction, we drop the subscript $s$ since every site $s$ has the same $\tau$ and $T$. Recombinant $\mathcal{R}_r$ consists of $M_r + 1$ fragments; so across all recombinants there are a total of $\sum_{r=1}^{R} M_r + R = M$ partially observed sequences. Using these sequences, we create an augmented data set $Y^{\mathrm{aug}}$ with $P + M$ total sequences. The first $P$ sequences of $Y^{\mathrm{aug}}$ are $\mathcal{P}_p$ ordered according to $p$, and the next $M$ are the partially observed sequences $\mathcal{R}_{rm}$ ordered first by $r$ and then by $m$.

From the variable-sized $Y^{\mathrm{aug}}$, we can naturally define $\tau$ as a rooted bifurcating tree with $P + M$ external tips, such that each tip $l_a$ corresponds to the $a$th sequence in $Y^{\mathrm{aug}}$, $a = 1, \ldots, P + M$. When $a \le P$, we refer to $l_a$ as a parental tip and otherwise a recombinant tip. We label the root of $\tau$ as $b_1$ and its time of bifurcation $t_1$ as 1. The remaining internal bifurcation nodes $b_i$, $i = 2, \ldots, P + M - 1$, are time ordered so that the times $t_i$ of each $b_i$ follow $1 = t_1 \ge t_2 \ge \cdots \ge t_{P+M-1}$. We define the parental tree to be that obtained from $\tau$ by removing all recombinant tips and corresponding $b_i$. We place all $t_i$ in the vector $T$ and require the distance from every $l_a$ to $b_1$ to be 1. At a given site $s$, $\tau$ satisfies a molecular clock. Finally, due to identifiability problems with the likelihood, we restrict $\tau$ to lie in a subset of the space of all bifurcating trees with $P + M$ tips, a point we return to in Section 2.4.

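Using the definitions above, we set $\theta_s = (\tau, T, \kappa_s, \mu_s)$ and $\theta = (\theta_1, \ldots, \theta_S)$. With this parameterization and the assumption of site independence, we write the likelihood as

$$f(Y^{\mathrm{obs}} \mid \theta) = \prod_{s=1}^{S} f(Y_s^{\mathrm{aug}} \mid \tau, T, \kappa_s, \mu_s), \tag{2.1}$$

where $Y_s^{\mathrm{aug}}$ is the $s$th column of $Y^{\mathrm{aug}}$ and $f(Y_s^{\mathrm{aug}} \mid \tau, T, \kappa_s, \mu_s)$ is the likelihood at site $s$, easily computed with the techniques in Felsenstein (1981).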
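To make the site likelihood concrete, the sketch below applies a Felsenstein-style pruning recursion to a single alignment column and treats a wildcard tip as compatible with every nucleotide. For brevity it swaps in the Jukes-Cantor substitution model with a uniform root distribution rather than the Hasegawa and others (1985) parameterization used in the text; the tree encoding and function names are assumptions made for this illustration.

```python
import math

NUCS = "AGCT"

def jc_prob(u, v, x):
    """Jukes-Cantor probability that u is replaced by v along a branch of length x."""
    e = math.exp(-4.0 * x / 3.0)
    return 0.25 + 0.75 * e if u == v else 0.25 - 0.25 * e

def partial(node):
    """Conditional likelihood vector at a node.

    A tip is a one-character string ('*' for a wildcard); an internal node is
    a tuple ((left_child, left_length), (right_child, right_length)).
    """
    if isinstance(node, str):
        # A wildcard matches every state, so its vector is all ones.
        return [1.0 if node in ("*", v) else 0.0 for v in NUCS]
    out = [1.0] * len(NUCS)
    for child, length in node:
        below = partial(child)
        for i, u in enumerate(NUCS):
            out[i] *= sum(jc_prob(u, v, length) * below[j] for j, v in enumerate(NUCS))
    return out

def site_likelihood(tree):
    """Likelihood of one column, averaging over a uniform root distribution."""
    return sum(0.25 * p for p in partial(tree))

# A wildcard tip attached next to an observed 'A', versus the same tree without it.
with_wildcard = (((("A", 0.1), ("*", 0.1)), 0.3), ("C", 0.4))
without_wildcard = (("A", 0.4), ("C", 0.4))
```

Under these assumptions `site_likelihood(with_wildcard)` equals `site_likelihood(without_wildcard)`: a wildcard tip adds no information at that site, which is why each augmented fragment informs the tree only over the sites where it is observed.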

We appreciate the complexity of the RT model notation; for clarity, Figure 2 summarizes the important quantities for a specific model realization. The figure assumes 2 parental sequences $\mathcal{P}_1$ and $\mathcal{P}_2$ and 2 recombinant sequences $\mathcal{R}_1$ and $\mathcal{R}_2$, all of length 300 in $Y^{\mathrm{obs}}$. Sequence $\mathcal{R}_1$ has recombinant breaks at positions 100 and 200, and $\mathcal{R}_2$ has one recombinant break at position 200. A substitution change point is located at position 150; hence, $M_1 = 2$, $M_2 = 1$, $J = 1$, $\rho = (0, 150, 301)$, $\xi_1 = (0, 100, 200, 301)$, and $\xi_2 = (0, 200, 301)$.

Fig. 2.

Relationship between the observed data $Y^{\mathrm{obs}}$, the augmented data $Y^{\mathrm{aug}}$, and a phylogenetic tree $\tau$ with a random number of tips. Sequences $\mathcal{P}_1$ and $\mathcal{P}_2$ in $Y^{\mathrm{obs}}$ represent the parental sequences that do not split in the model. Sequences $\mathcal{R}_1$ and $\mathcal{R}_2$ represent the putative recombinant sequences, which can split. For the recombinant sequences in $Y^{\mathrm{aug}}$, the black regions represent data from the corresponding full-length recombinant sequences in $Y^{\mathrm{obs}}$, while the white regions represent wild cards (missing data). Each tip $l_a$ on $\tau$ represents a sequence in $Y^{\mathrm{aug}}$, and the $b_i$ represent bifurcation points that occur at times $t_i$.

Corresponding to the substitution change point at site 150, there exists a κ1, κ2, μ1, and μ2, such that sites s∈[1,150) evolve according to κ1 and μ1 and the remaining sites evolve under κ2 and μ2.
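The lookup from a site to its substitution interval can be sketched in a few lines. The change-point vector below is the one from the Figure 2 example; the numerical values of κ and μ and the function names are purely illustrative assumptions.

```python
from bisect import bisect_right

def interval_index(rho, s):
    """Return j such that site s lies in [rho[j-1], rho[j]), for sites 1..S."""
    if not rho[0] < s < rho[-1]:
        raise ValueError("site outside the alignment")
    return bisect_right(rho, s)

rho = (0, 150, 301)   # change points from the Figure 2 example (S = 300)
kappa = (2.0, 4.0)    # illustrative values, not estimates from the paper
mu = (1.0, 0.5)       # illustrative values, not estimates from the paper

def site_parameters(s):
    """Return the (kappa, mu) pair governing site s."""
    j = interval_index(rho, s)
    return kappa[j - 1], mu[j - 1]
```

With these inputs, site 149 falls in the first interval and site 150 in the second, matching the half-open intervals used in the text.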

Using Figure 2 as a guide, we see that transforming $Y^{\mathrm{obs}}$ into $Y^{\mathrm{aug}}$ generates 7 sequences: $\mathcal{P}_1$, $\mathcal{P}_2$, $\mathcal{R}_{11}$, $\mathcal{R}_{12}$, $\mathcal{R}_{13}$, $\mathcal{R}_{21}$, and $\mathcal{R}_{22}$. We now elaborate on the contents of a specific column, say 250, of $Y^{\mathrm{aug}}$. For this column, $Y_{250}^{\mathrm{aug}}$ represents a vector of length 7, such that the characters in the first 2 positions come from the parental sequences, position 5 comes from $\mathcal{R}_1$, position 7 comes from $\mathcal{R}_2$, and the remaining positions are wildcards. All other sites in Figure 2 have a similar characterization.
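The augmentation step can be sketched directly from the description above. The code below splits each recombinant at its break points and pads every fragment with wildcards; the function name and the toy sequences (each sequence is a run of a single letter so the columns are easy to read) are assumptions for illustration, while the break-point vectors are those of the Figure 2 example.

```python
def augment(parentals, recombinants, breaks):
    """Build the augmented alignment from parentals and split recombinants.

    parentals, recombinants: lists of equal-length strings covering sites 1..S.
    breaks[r]: break-point vector (0, xi_1, ..., S + 1) for recombinant r.
    """
    augmented = list(parentals)
    for seq, xi in zip(recombinants, breaks):
        for lo, hi in zip(xi[:-1], xi[1:]):
            # The fragment keeps sites lo <= s < hi (1-based); all others become '*'.
            fragment = "".join(
                c if lo <= s < hi else "*" for s, c in enumerate(seq, start=1)
            )
            augmented.append(fragment)
    return augmented

S = 300
parentals = ["A" * S, "C" * S]        # toy stand-ins for the 2 parental sequences
recombinants = ["G" * S, "T" * S]     # toy stand-ins for the 2 recombinants
breaks = [(0, 100, 200, 301), (0, 200, 301)]

y_aug = augment(parentals, recombinants, breaks)
column_250 = [seq[250 - 1] for seq in y_aug]
```

Here `y_aug` holds 7 sequences, and `column_250` has observed characters in positions 1, 2, 5, and 7 with wildcards elsewhere, as in the worked example.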

2.2. Prior

We complete our model formulation by specifying a prior distribution over the evolutionary substitution change-point process $(J, \rho, \kappa, \mu)$ and the recombination break point processes $(M, \xi)$ with corresponding $(\tau, T)$. The substitution change-point process finds a direct analog in Minin and others (2005). In brief, we assume that $J$ follows a truncated Poisson distribution with prior expectation $\delta$. Given $J$, the substitution change-point locations $\rho$ are uniform over all possible unordered selections from $S - 1$ choices. The vectors $(\kappa_j, \mu_j)$ follow a standard hierarchical prior over $\mathbb{R}^2$ after suitable transformation with estimable location and scale parameters $\varphi$. The distributions for $(\kappa_j, \mu_j)$ are independent of each other given the hyperparameters $\varphi$ (Minin and others, 2005), (Gelman, 2006).

The specification of (τ, T) and (M, ξ) differs substantially from that found in Minin and others (2005) and requires further discussion. Traditional Bayesian phylogenetic reconstruction grows from the premises of a fixed tree size. However, the RT model lives in a parameter space that grows with a variable number of tips. The space is also extremely large. Therefore, a noninformative prior over all trees places unreasonable mass on trees with unrealistically large numbers of recombination events.

As an alternative, we return to the roots of Bayesian phylogenetic inference (Rannala & Yang, 1996) and consider each tree as sprouting from a Yule process. The Yule process defines a pure birth branching process that is time homogeneous and begins with the first bifurcation event of a single particle into 2. At any time in the process, extant particles divide independently with infinitesimal-time probability $\lambda$. If one identifies the bifurcation points with the internal nodes $b_i$ of $\tau$, then the differences between bifurcation times in $T$ along a branch, e.g. between $t_5$ and $t_4$ or between $t_6$ and $t_2$, become exponential waiting times until division. Following Edwards (1972), for a given $\lambda$, the Yule process gives the joint probability density of a labeled, rooted bifurcating tree $\tau$ and branch length vector $T$, given $V$ tips and $t_1 = 1$, as

2.2. (2.2)

We use this density as the joint prior of τ and T, where V = M + P. In this density, the parameter λ determines the tendency of branches to sprout closer to the sampling time, such that larger values of λ increase the tendency. Since we usually want branches to sprout uniformly over the tree a priori, we set λ = 0.05. However, in most standard phylogenetic problems, the posterior remains highly robust against larger values of λ since the likelihood dominates the prior.
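For intuition about the branching process behind this prior, the following sketch simulates bifurcation times from a pure-birth (Yule) process run forward from a root split at time 1 before sampling. It is an unconditioned simulation, whereas the prior in the text conditions on the number of tips; the function name and the rate used in the example are illustrative assumptions.

```python
import random

def simulate_yule_times(lam, horizon=1.0, rng=None):
    """Simulate bifurcation times, measured back from the sampling time.

    The root split happens `horizon` time units before sampling. With k
    lineages alive, the wait until the next split is exponential with
    rate k * lam, because each lineage divides independently at rate lam.
    """
    rng = rng or random.Random()
    times = [horizon]          # t_1: the root bifurcation
    lineages, now = 2, horizon
    while True:
        now -= rng.expovariate(lineages * lam)
        if now <= 0.0:
            break              # the sampling time was reached before another split
        times.append(now)
        lineages += 1
    return times, lineages

# A larger rate than the paper's 0.05 is used here only so that splits are visible.
times, tips = simulate_yule_times(lam=2.0, rng=random.Random(1))
```

The returned times are non-increasing from the root, and the number of tips is one more than the number of bifurcations. Because the total split rate grows with the number of lineages, later splits tend to arrive faster, which is the tendency that λ modulates.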

The final pieces of the puzzle are the priors for M and ξ. We assume that Mr follows a truncated Poisson distribution with expectation ηr, η = (η1,…,ηR). Since M jointly affects the probability of τ and T, the induced hierarchical prior on (M1,M2,…,MR) generates a positive correlation on the Mr. For typical values of η, this correlation is usually small with values below 10%.

Given $M$, we assume a joint prior over $\xi$. While we could simply assume that each $\xi_r$ is independently and identically distributed uniformly over $M_r$ choices from $S - 1$ locations, this does not allow us to model the prior probability that 2 or more $\mathcal{R}_r$ share the same break point. Also, with this independence prior, the prior probability that a recombinant break point is shared by 2 or more sequences is unrealistically small, usually less than 1%.
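A short calculation shows how small the shared-break-point probability is under the independence prior. The sketch below assumes 2 recombinants whose break points are drawn uniformly and independently from the $S - 1$ locations; the function name is hypothetical, and the example values ($S = 3224$ with 3 and 4 break points) are chosen to resemble the setting the paper describes for its data.

```python
from math import comb

def prob_shared_break(S, m1, m2):
    """P(at least one common break point) for two independent uniform selections."""
    locations = S - 1
    # The second recombinant must avoid all m1 locations used by the first.
    none_shared = comb(locations - m1, m2) / comb(locations, m2)
    return 1.0 - none_shared

p_shared = prob_shared_break(S=3224, m1=3, m2=4)
```

Under these assumptions `p_shared` is roughly 0.4%, well below the 1% mentioned in the text, which motivates a prior that places more mass on shared break points.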

To correct these shortcomings, we develop an urn model that we first describe when R = 2. Assume that we have 2 urns with S − 1 labeled balls each of weight 1. The balls in the first urn represent locations in the sequence space of the first recombinant, while balls in the second urn represent the second recombinant. We select M1 balls from the first urn and M2 balls from the second urn and note the number of matching pairs O selected. We model the probability of the sample as Inline graphic, where w ≥ 0 and W is a normalizing constant. Physically, each of the M1 + M2 balls contributes 1 unit of weight to the sample; but, depending on w, the model adds extra “weight” to samples based on the number of shared pairs, more shared pairs imply a larger aggregate weight and a larger probability of occurrence. When w = 0, the model reduces to a product of independent priors, but as w grows large, the probability of at least one shared break point soon dominates the probability of none. For R recombinants, we model ξ as

2.2. (2.3)

where

2.2. (2.4)

We provide a derivation of W in the supplementary material available at Biostatistics online. The specification of O takes into account the size of an overlap break point: a break point that contains 3 separate sequences contributes 2 overlaps to O, 4 separate sequences contribute 3 overlaps to O, and so on. We take w as a known constant. In a setting similar to the one in which our data live, with w = 1000, S = 3224, R = 2, and η = (3.7, 5.2) so that E(M) = (3, 4), the prior probability that at least 2 break points overlap is roughly 28%.

2.3. Posterior

The posterior distribution of the RT model falls out naturally using Bayes theorem. Specifically, if we let Θ define the collection of all model parameters, the posterior distribution of Θ given Yobs is

$$p(\Theta \mid Y^{\mathrm{obs}}) \propto f(Y^{\mathrm{obs}} \mid \Theta)\, p(\Theta). \tag{2.5}$$

To make inference on Θ, we employ an MCMC sampler that we describe in Section 2.5.

2.4. Identifiability issues

We restrict the tree τ to lie in a subspace of 𝒯, the set of all rooted bifurcating trees τ, where the number of tips of trees in 𝒯 can vary over a bounded set. The restrictions are necessary for 2 reasons. The space 𝒯 is potentially too large for efficient exploration because of the data augmentation procedure, and 𝒯 is not completely identifiable with respect to the data likelihood.

To manage the first difficulty, we assume that the topological relationships of the P parental sequences are known. Fixing the parental history remains reasonable for recombination inference involving distantly related parentals, such as those found in inter-subtype HIV and HBV evolution.

To appreciate the second restriction and its implications for data identifiability, we need to define a recombinant neighborhood $\mathcal{N}(l_a)$ for a tip $l_a$. Intuitively, $\mathcal{N}(l_a)$ is the collection of all recombinant tips on $\tau$ that share a common bifurcation point $b_i$ with $l_a$. To be more precise, we establish a family of working sets related to $l_a$. We define the first working set to be $\emptyset$, the empty set, and the next working set to be $\{l_a\}$. Then, starting at $l_a$, we move upwards along $\tau$ toward $b_1$ and stop at the first bifurcation node $b_i$. The next working set includes all the descendant tips of this $b_i$; note that it contains $l_a$. We continue this process of moving toward the root until we reach the root, at which point we define our last working set and group all the working sets into a single family. Now, $\mathcal{N}(l_a)$ is the set with the largest cardinality in this family that does not include any parental tips. For any $b_i$, we also define $\mathcal{N}(b_i)$ in the same way, except that the working set following $\emptyset$ now includes all descendant tips of $b_i$.

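As an example, we illustrate the recombinant neighborhood definition on Figure 2. For tip $l_3$, the family of working sets is $\{\emptyset, \{l_3\}, \{l_3, l_5\}, \{l_3, l_5, l_7, l_2\}, \{l_1, l_2, \ldots, l_7\}\}$, implying $\mathcal{N}(l_3) = \{l_3, l_5\}$. Other examples include $\mathcal{N}(l_5) = \{l_3, l_5\}$, $\mathcal{N}(b_3) = \{l_3, l_5\}$, and $\mathcal{N}(l_1) = \emptyset$.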
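The neighborhood definition translates into a short walk toward the root. The sketch below encodes a tree consistent with the working sets listed for tip l3; only the clades {l3, l5} and {l3, l5, l7, l2} come from the example, while the placement of l1, l4, and l6 and the internal node names other than b3 are assumptions made to complete the tree.

```python
# Child -> parent map. Node names bw, bx, by, bz are hypothetical.
PARENT = {
    "l3": "b3", "l5": "b3",      # b3 joins l3 and l5 (from the example)
    "l7": "bx", "l2": "bx",      # node joining l7 and l2
    "b3": "by", "bx": "by",      # clade {l3, l5, l7, l2} (from the example)
    "l4": "bz", "l6": "bz",      # assumed placement of l4 and l6
    "l1": "bw", "bz": "bw",      # assumed placement of l1
    "by": "root", "bw": "root",
}
PARENTAL_TIPS = {"l1", "l2"}

def descendant_tips(node):
    """Return the set of tips below a node (a tip is its own descendant)."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    return set().union(*(descendant_tips(c) for c in children))

def neighborhood(node):
    """Largest working set on the path to the root that holds no parental tip."""
    best = set()
    current = node
    while current is not None:
        tips = descendant_tips(current)
        if tips & PARENTAL_TIPS:
            break  # working sets are nested, so all larger ones contain it too
        best = tips
        current = PARENT.get(current)
    return best
```

On this tree, `neighborhood("l3")`, `neighborhood("l5")`, and `neighborhood("b3")` all return `{"l3", "l5"}`, and `neighborhood("l1")` returns the empty set, in agreement with the examples in the text.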

Using the recombinant neighborhood definition, we can be precise about the identifiability issue. Let $l_a$ and $l_{a+1}$ on $\tau$ correspond to one $\mathcal{R}_r$, and assume $\mathcal{N}(l_a) = \mathcal{N}(l_{a+1})$. Since $Y^{\mathrm{aug}}$ is ordered by $m$, $l_a$ corresponds to some $\mathcal{R}_{rm}$ and $l_{a+1}$ corresponds to $\mathcal{R}_{r,m+1}$. By the construction of $\xi_r$, $\mathcal{R}_{rm}$ and $\mathcal{R}_{r,m+1}$ share a boundary $\xi_{rm}$. Now, from basic phylogenetic principles, if $\mathcal{R}_{rm}$ and $\mathcal{R}_{r,m+1}$ are combined into a single merged sequence, and $l_a$ (or $l_{a+1}$) is removed from $\tau$ so that the merged sequence corresponds to $l_{a+1}$ (or $l_a$) on $\tau$, the likelihood does not change.

In other words, the model can only identify the break $\xi_{rm}$ if the 2 corresponding sequences on either side of $\xi_{rm}$ do not have the same recombinant neighborhood. Or in more formal terms, if $\theta'$ represents the likelihood parameters without $\xi_{rm}$, we have the situation where $f(Y^{\mathrm{obs}} \mid \theta) = f(Y^{\mathrm{obs}} \mid \theta')$ and $\theta \ne \theta'$. Hence, a necessary condition that we impose for likelihood identifiability is that for all $\tau$ and all tips $l_a$ and $l_{a+1}$ pertaining to the same $\mathcal{R}_r$, $\mathcal{N}(l_a) \cap \mathcal{N}(l_{a+1}) = \emptyset$. This problem does not occur for tips such as $l_a$ and $l_{a+2}$ because they are not adjacent in the sequence space.

The topology within each recombinant neighborhood is allowed to vary for 2 reasons. First, if tips from 2 or more recombinants are present in the same recombinant neighborhood, the topological structure within the neighborhood will affect the likelihood. Second, a change in the topology of the neighborhood can affect the prior through the sum of the branch lengths.

In addition to topological identifiability, changes in branch lengths within a recombinant neighborhood do not affect the data likelihood. However, the Yule-like tree prior influences these parameters, so the model posterior is identifiable, and importantly, we do not need to draw inferences about these heights. As an example, consider node b3 and time t3 in Figure 2. Due to the data augmentation, adjusting t3 does not affect the data likelihood, but adjusting t2 does. In later sections, we call nodes such as b3 recombinant nodes and all others likelihood nodes to highlight about which node heights the data inform us.

2.5. Sampling

We employ a Metropolis-within-Gibbs MCMC sampler to gain samples from p(Θ|Yobs). Because this sampler needs to move in the complicated parameter space of Θ, we implement a variety of transition kernels; details of which are found in the supplementary material available at Biostatistics online. Importantly, since the number of parameters in the model is unknown, we employ the reversible jump MCMC sampler of Green (1995). For more background information regarding MCMC for phylogenetic modeling, we refer the reader to Larget (2005).

We implement a working version of the sampler in Java called StepBrothers (available at http://www.biomath.ucla.edu/msuchard/StepBrothers/index.html). Before using this sampler to run examples, we have extensively checked our program by simulating from the prior distribution using the code and running numerous simulated examples. StepBrothers passed all tests. On the performance side, StepBrothers works well on moderately sized data sets but can bog down on data sets with large numbers of taxa. This performance issue, however, is not surprising. Via profiling, StepBrothers spends over 95% of its time calculating the likelihood: the bottleneck of all likelihood-based phylogeny inference. Therefore, on large data sets, users can face the same sorts of problems that plague likelihood-based phylogeny inference in general. To assess convergence of the MCMC sampler, we run multiple chains and assess the estimated posterior distributions. In the examples below, any differences between each chain can easily be attributed to Monte Carlo error.

3. DATA EXAMPLES

We demonstrate the benefits of the RT model on data from 2 sets of viral recombinants, one from HIV and the other from HBV. We show how the RT model permits analysis and comparison of multiple recombinants in order (1) to deduce whether they descend from the same recombinant ancestor or multiple distinct recombination events and (2) to date or at least bound the recombination events in time.

HIV circulating recombinant form 12 (CRF_12) affects patients in Argentina, Bolivia, Brazil, and Uruguay (Carr and others, 2001), (Quarleri and others, 2004). To better characterize the diversity and prevalence of CRF_12 in Argentina, Quarleri and others (2004) sample 284 recombinant HIV strains from patients seeking second-line antiretroviral therapy and sequence protease and the first 400 codons of reverse transcriptase from the pol region of HIV. The authors classify the 284 strains, along with the CRF_12 reference strain, and note the presence of 2 distinct subclades within the clade containing CRF_12. The authors are interested in whether the 2 subclades contain sequences with distinct recombination structure, but a dearth of appropriate modeling tools hampers their efforts.

Upon conducting a nationwide survey of HBV in China to better characterize the prevalence and diversity of the disease, Zeng and others (2005) identify viral genetic material from 8 patients that appears to descend from the D strain of HBV. After more careful analysis, including full-genome sequencing, the authors discover that the 8 strains are recombinant HBV strains between HBV type C and type D (Wang and others, 2005). Our goal is to determine whether these strains share at least one recombination event in their histories.

3.1. Human immunodeficiency virus

We use the RT model to explore the relationship of recombinants in each subclade to the CRF_12 reference sequence (GenBank #AF385936). We select 2 patient sequences, 112567 (GenBank #AY365861) and 113314 (GenBank #AY365871), from the subclade containing CRF_12 and 1 sequence, 103520 (GenBank #AY365682), from the other. We use the same parental sequences as Quarleri and others (2004): A (GenBank #AF069671), B (GenBank #M17449), C (GenBank #AF110978), and F (GenBank #AF005494). We set the parental tree to be ((A, F), (B, C)). We treat the bifurcation times on τ as parameters that we infer; only the underlying parental tree topology remains fixed. We align the sequences first using an automated alignment program and then manually correct for obvious errors. Since the pol region of HIV is relatively conserved, we feel that the assumption of a fixed and known alignment has little impact on the results. For all analyses, we set w = 0 and η so that 0 break points occur with 50% prior probability.

We start off with separate, independent analyses to classify each recombinant sequence. As Figure 3(a) shows, sequences 112567, 113314, and CRF_12 have almost identical (F, B, F, B, F) recombinant structure. Sequence 103520, however, differs in 2 ways from CRF_12; the third break of 103520 occurs further right than the corresponding break in CRF_12, and sequence 103520 does not possess the fourth break present in CRF_12. A formal Bayes factor (BF) verifies this second observation: a recombinant break point is present in the interval (1400, 1497) for sequences 112567, 113314, and CRF_12 (BF  > 105 for all 3), whereas there is little evidence to suggest that a break point exists in the same interval of sequence 103520 (BF ≈0.5).

We now compare each patient sequence to CRF_12. If a patient sequence and CRF_12 both descend from the same recombinant parent and have not experienced further recombination, then the 2 sequences should have the same recombinant neighborhoods throughout the genome. To summarize this information, we obtain the posterior probability that 2 sites, one from each of the recombinants, land on recombinant tips in the same neighborhood. If this occurs, we say these sites overlap on τ. We plot these posterior probabilities for all sites s in CRF_12 and sites s′ in the patient sequences in Figures 3(b) and (c). We move the results for 112567 to Supplementary Figure SF-1, available at Biostatistics online, as they do not differ much from those for 113314. In these figures, the shade of the pixel (s, s′) denotes the posterior probability of an overlap; the darker the shade, the more likely it is that the 2 sites overlap and share their recombinant histories.
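The overlap summary described above can be sketched as a simple average over posterior draws. The sketch below is illustrative only: it assumes that each MCMC draw has already been reduced to one neighborhood label per site for each sequence, and the function name `overlap_probabilities`, the label strings, and the toy draws are all hypothetical rather than part of the StepBrothers software.

```python
def overlap_probabilities(draws):
    """Estimate the posterior overlap probability for every pair of sites.

    draws: list of (labels_a, labels_b) pairs, one pair per posterior draw.
           labels_a[s] is a (hypothetical) neighborhood label for site s of
           the first sequence; labels_b[s2] is the same for the second.
    Returns a matrix P where P[s][s2] is the fraction of draws in which
    site s of the first sequence and site s2 of the second sequence carry
    the same neighborhood label.
    """
    if not draws:
        raise ValueError("need at least one posterior draw")
    n_a = len(draws[0][0])
    n_b = len(draws[0][1])
    counts = [[0] * n_b for _ in range(n_a)]
    for labels_a, labels_b in draws:
        for s, label_a in enumerate(labels_a):
            row = counts[s]
            for s2, label_b in enumerate(labels_b):
                if label_a == label_b:
                    row[s2] += 1
    n_draws = len(draws)
    return [[c / n_draws for c in row] for row in counts]


# Toy input: 2 posterior draws over 3 sites (made-up labels, not real data).
toy_draws = [
    (["F", "B", "B"], ["F", "F", "B"]),
    (["F", "F", "B"], ["F", "B", "B"]),
]
overlap = overlap_probabilities(toy_draws)
```

Each entry of the resulting matrix plays the role of one pixel in the overlap plots: values near 1 correspond to dark pixels and values near 0 to light ones.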

Fig. 3.

HIV recombinant analysis: (a) Posterior probabilities that each site in the alignment is a nearest neighbor to one of the parental nodes. The dashed line indicates the probability that a site phylogenetically arises from an F parental strain. The solid line indicates a B strain. (b) The probability that each site of the HIV patient sequence 113314 overlaps with each site of CRF_12. The shade of each pixel (s, s′) represents the posterior probability that the 2 sites fall into the same recombinant neighborhood on τ. This panel shows 2 regions of high similarity between the patient sequence and CRF_12. (c) In contrast, the corresponding plot for HIV patient sequence 103520 shows only one large area of similarity with CRF_12.

From Figure 3(b), we see that sequence 113314 shares 2 large areas of similarity with CRF_12. An analogous similarity exists between 112567 and CRF_12 in Supplementary Figure SF-1, available at Biostatistics online. In both cases, the dark regions present extremely strong evidence that the patient sequence derives from CRF_12. Through simulation, we find a 12% prior probability that any 2 sites overlap on τ. Thus, a 90% posterior probability in Figure 3(b) and Supplementary Figure SF-1, available at Biostatistics online, translates into a BF of 66 (posterior odds of 0.90/0.10 divided by prior odds of 0.12/0.88). In Figure 3(c), we see that patient 103520 differs from the other 2; only one large area of similarity is shared with CRF_12.
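The conversion from a prior and a posterior probability to a BF follows from the standard definition of the Bayes factor as the ratio of posterior odds to prior odds. A minimal sketch of that arithmetic, using the 12% prior and 90% posterior values quoted above (the function name `bayes_factor` is ours, not part of any published software):

```python
def bayes_factor(prior_prob, posterior_prob):
    """Bayes factor for a binary event: posterior odds divided by prior odds."""
    if not (0.0 < prior_prob < 1.0 and 0.0 < posterior_prob < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    return posterior_odds / prior_odds


# 12% prior probability of overlap, 90% posterior probability of overlap.
bf_overlap = bayes_factor(prior_prob=0.12, posterior_prob=0.90)
print(round(bf_overlap))  # 66
```

The same function applies to any pixel in the overlap plots: a posterior probability equal to the 12% prior gives a BF of 1, that is, no evidence either way.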

Our final analysis in this section explores the timing of recombination. Since the RT model utilizes a single phylogenetic tree τ, a time ordering is maintained on τ, from which an upper bound on the time of recombination can be inferred for each break. Consider the first break in 1 in Figure 2. Here, the left portion of 1 corresponds to tip l3 and the right portion corresponds to l4. Noting this relationship, we see that tip l3 diverges from τp at time t2, while l4 diverges at t5. We focus on t2 rather than t3 since t3 corresponds to a recombinant node, as defined in Section 2.4. With these 2 times, t2 and t5, we can infer an upper bound on the time of recombination by noting that the first recombination break point in 1 occurred before the min(t2,t5) = t5.

Using the procedure just outlined, we consider such an upper bound as a proportion of the total tree height on each sequence using univariate analyses. To conduct this analysis, we use a different B parental (GenBank #AF156836) than Quarleri and others (2004) since their B parental was sampled approximately 14 years earlier than the other sequences in the data set. Because the topological space of our model can vary, we condition on the topological structure (F, B, F, B, F) for CRF_12, 112567, and 113314, and (F, B, F, B) for 103520. In Table 1, we present the upper bounds for all 4 breaks. We believe that the estimates from each analysis are comparable because the bifurcation time of the parental sequences A and F and the bifurcation time of the parental sequences B and C remain relatively constant in all 4 analyses (B–C difference < 0.04; A–F difference  < 0.01).

Table 1.

Expected upper bounds on the time of each recombinant break in the HIV example, expressed as a proportion of the total tree height. The 4 breaks correspond to those at approximate positions 250, 350, 750, and 1450, as shown in Figure 3(a). Since no evidence exists for the fourth break in sequence 103520, we do not give a bound.

Sequence    Break 1    Break 2    Break 3    Break 4
CRF_12      0.24       0.25       0.32       0.28
113314      0.20       0.21       0.29       0.27
112567      0.24       0.28       0.28       0.28
103520      0.38       0.38       0.39       NA

As shown in the table, the bounds for CRF_12, 113314, and 112567 generally group together. Sequence 103520, however, does not fit the pattern of the other 3 sequences; its bounds are consistently higher. To better interpret the results in Table 1, we roughly translate these proportions into calendar dates. To do this, we follow Korber and others (2000) and use the year 1930 as the date of divergence for group M in HIV-1. As the sequences in our data set were sampled in approximately 1998, we can translate the proportions into approximate dates using linear interpolation. The upper bound times for sequences CRF_12, 113314, and 112567 all fall around 1980 (1976–1984), while the upper bounds for sequence 103520 fall in the early 1970s (1971–1972).
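The linear interpolation can be reproduced in a few lines. The sketch below assumes that each proportion in Table 1 is measured backward from the sampling year toward the root of the tree, and that the full tree height spans 1930 to 1998. This is our reading of the text, and the variable and function names are illustrative.

```python
ROOT_YEAR = 1930      # assumed divergence date of HIV-1 group M (Korber and others, 2000)
SAMPLING_YEAR = 1998  # approximate sampling year of the sequences


def proportion_to_year(p, root_year=ROOT_YEAR, sampling_year=SAMPLING_YEAR):
    """Map a proportion of tree height (0 = sampling time, 1 = root) to a year."""
    return sampling_year - p * (sampling_year - root_year)


# Upper bounds from Table 1 (the fourth break of 103520 is not available).
bounds = {
    "CRF_12": [0.24, 0.25, 0.32, 0.28],
    "113314": [0.20, 0.21, 0.29, 0.27],
    "112567": [0.24, 0.28, 0.28, 0.28],
    "103520": [0.38, 0.38, 0.39],
}

years = {
    seq: [round(proportion_to_year(p)) for p in props]
    for seq, props in bounds.items()
}

for seq, ys in years.items():
    print(seq, min(ys), max(ys))
```

Under these assumptions, the rounded years for CRF_12, 113314, and 112567 span 1976 to 1984, and those for 103520 span 1971 to 1972, in agreement with the ranges quoted in the text.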

The 95% posterior credible intervals corresponding to the estimates in Table 1 are all quite wide and overlap in all 4 sequences at all 4 breaks. For example, the RT model provides (0.03,0.33) as a credible interval for the first break in 113314. Therefore, we should not credit these differences as statistically significant. Still, the results do suggest another difference between the recombination histories of the 3 patient sequences.

As a whole, these results favor the hypothesis of differences between the recombination histories of the 3 patient sequences. This implies that a discordance exists between the 2 subclades from Quarleri and others (2004). While not presented here, we conduct further analysis on additional sequences in each subclade. Even though it is difficult to generalize all the results, we find that sequences in the subclade containing CRF_12 tend to have 2 large areas of overlap with CRF_12 and 4 recombinant break points; the results are similar to those from sequences 112567 and 113314. Sequences in the other subclade, however, have only 3 recombinant break points and 1 large area of overlap with CRF_12; the results are similar to those from 103520. Therefore, from our analyses, the 2 subclades found in Quarleri and others (2004) represent distinct recombination histories: one subclade shares its break points with CRF_12 and most likely represents an expansion of this variant across the population, while the second subclade exhibits a unique pattern that apparently predates the formation of CRF_12.

As with any statistical analysis, some issues remain with these results. In particular, the results in Figures 3(b) and (c) and Supplementary Figure SF-1, available at Biostatistics online, demonstrate 1 or 2 large areas of similarity with CRF_12, but the ambiguous results in the lower 2 areas beg for further explanation. Biologically, the results could simply imply that further recombination events occurred at the first 2 breaks. Other reasons include a lack of informative sites and violations of the strict molecular clock assumption across τ. These same issues can also affect the dating analysis, but we leave them for future research.

3.2. HBV recombinant

We now explore the relationship of 2 HBV sequences (GenBank #AF460143 and #AY817511), the latter published in Wang and others (2005). For our analysis, we use 8 parental sequences: 3 D strains (GenBank #M32138, X02496, X72702), 3 C strains (GenBank #X02496, D12980, AB014381), 1 B strain (GenBank #D23677), and 1 A strain (GenBank #AF297623). We set η ≈ (3.7, 5.2) so that E(M) = (3, 4) and align the sequences in the same manner as in the HIV example. All sequences are of length 3224.

While several features of the 2 sequences are comparable, we focus our attention on the location of the recombination break points within the 2 recombinant sequences. In particular, we conduct a BF test of whether the 2 sequences share at least one recombinant break point. This hypothesis translates into whether the break near position 1400 occurs at the same position in both sequences (Supplementary Figure SF-2, available at Biostatistics online). To carry out the test, we follow Suchard and others (2005) and estimate the BF under several values of the overlap weight w. Estimating the BF across a wide spectrum of weights w allows us to obtain a more accurate estimate of the BF, gain insight into the joint prior over ξ, and expose possible programming errors. After performing MCMC simulations at 8 values of w ranging from 20 to 10 000, we estimate the BF = 5.3 (Supplementary Figure SF-3, available at Biostatistics online). While not providing overwhelming support for our hypothesis, this BF does suggest the sharing of at least one break point.

4. DISCUSSION

The RT model presented in this paper offers clear advantages over previous methodologies. With its more appropriate assumptions regarding the recombination history, the RT model allows researchers to bound times of recombination and infer relationships between multiple recombinant sequences. Without these 2 new forms of data analysis, we would not have discovered the discordant recombination histories in Section 3.1.

A few aspects of the RT model deserve more attention. The RT model relies upon a strict molecular clock assumption: the substitution rate μ along τ remains constant. In some data situations, strong evidence exists that μ changes along τ as time progresses. In these situations, assumption of a constant rate can lead to incorrect topology estimation (Ho & Jermiin, 2004). For these reasons, we plan to explore alternatives to a strict molecular clock for the RT model. Such alternatives include the relaxed clock model of Drummond and others (2006), but issues arise with model identifiability.

Beyond the molecular clock assumption, the RT model also requires at least 2 parental sequences to be present in the data. For intrahost sequences from rapidly evolving viruses, this requirement may not hold because a substantial minority of the sequences present in the data may be recombinants. To handle this more extreme case, a new technique for phylogenetic-based recombination detection exploiting ancestral recombination graphs (ARGs) may afford a future solution. First developed by Hudson (1983, 1990) and later by Griffiths & Marjoram (1996), an ARG is a graph much like a phylogenetic tree but allows for the complete characterization of the recombination history for all sequences present within the data. The ARG's benefits include the removal of the a priori distinction between parental sequences and recombinant sequences. Also, ARGs naturally generalize and extend the RT framework presented in this paper. By simply constraining certain sequences from recombining in an ARG, we can find a surjective function from ARG space to RT tree space. We further elaborate on this function in Supplementary Figure SF-4, available at Biostatistics online. Even so, for the ARG, the sequence data contain no more information about the recombination times than the bounds provided by the RT model. So, until sampling methods for ARGs advance to match the field's successes with trees, we endorse the use of random-tipped trees.

FUNDING

National Institutes of Health (AI07370 to E.W.B., GM068955 to K.S.D. and M.A.S.).

Supplementary Material

[Supplementary Material]
kxn019_index.html (710B, html)

Acknowledgments

We thank Vladimir Minin for helpful discussion and providing the source code to DualBrothers upon which StepBrothers is based. We also wish to thank 2 anonymous referees for their helpful comments. M. A. Suchard is an Alfred P. Sloan Research Fellow. Conflict of Interest: None declared.

References

  1. Adojaan M, Kivisild T, Männik A, Onu Krispin T, Ustina V, Zilmer K, Liebert E, Jaroslav N, Priimägi L, Tefanova V, and others. Predominance of a rare type of HIV-1 in Estonia. Journal of Acquired Immune Deficiency Syndromes. 2006;39:598–605.
  2. Carr J, Avila M, Carrillo MG, Salomon H, Hierholzer J, Watanaveeradej V, Pando M, Negrete M, Russell K, Sanchez J, and others. Diverse BF recombinants have spread widely since the introduction of HIV-1 into South America. AIDS. 2001;15:F41–F47. doi: 10.1097/00002030-200110190-00002.
  3. Chan C, Beiko R, Ragan M. Detecting recombination in evolving nucleotide sequences. BMC Bioinformatics. 2006;7:412–426. doi: 10.1186/1471-2105-7-412.
  4. Dorman K, Kaplan A, Sinsheimer J. Bootstrap confidence intervals for HIV-1 recombinants. Journal of Molecular Evolution. 2002;54:200–209. doi: 10.1007/s00239-001-0002-4.
  5. Drummond A, Ho S, Phillips M, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biology. 2006;4:699–710. doi: 10.1371/journal.pbio.0040088.
  6. Edwards A. Estimation of the branch points of a branching diffusion process. Journal of the Royal Statistical Society, Series B (Methodological). 1972;32:155–174.
  7. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981;17:368–376. doi: 10.1007/BF01734359.
  8. Felsenstein J. Inferring Phylogenies. Sunderland, MA: Sinauer Associates; 2003.
  9. Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1:515–534.
  10. Green P. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
  11. Griffiths R, Marjoram P. Ancestral inference from samples of DNA sequences with recombination. Journal of Computational Biology. 1996;3:479–502. doi: 10.1089/cmb.1996.3.479.
  12. Hannoun C, Norder H, Lindh M. An aberrant genotype revealed in recombinant hepatitis B virus strains from Vietnam. Journal of General Virology. 2000;81:2267–2272. doi: 10.1099/0022-1317-81-9-2267.
  13. Hasegawa M, Kishino H, Yano T. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution. 1985;22:160–174. doi: 10.1007/BF02101694.
  14. Ho S, Jermiin L. Tracing the decay of the historical signal in biological sequence data. Systematic Biology. 2004;53:628–637. doi: 10.1080/10635150490503035.
  15. Hudson R. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology. 1983;23:183–201. doi: 10.1016/0040-5809(83)90013-8.
  16. Hudson R. Gene genealogies and the coalescent process. In: Futuyma D, Antonovics J, editors. Oxford Surveys in Evolutionary Biology. Volume 7. New York: Oxford University Press; 1990. pp. 1–44.
  17. Husmeier D, McGuire G. Detecting recombination in 4-taxa DNA sequence alignments with Bayesian hidden Markov models and Markov chain Monte Carlo. Molecular Biology and Evolution. 2003;20:315–337. doi: 10.1093/molbev/msg039.
  18. Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn B, Wolinsky S, Bhattacharya T. Timing the ancestor of the HIV-1 pandemic strains. Science. 2000;288:1789–1796. doi: 10.1126/science.288.5472.1789.
  19. Larget B. Introduction to Markov chain Monte Carlo methods in molecular evolution. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 45–62.
  20. Li S, Pearl D, Doss H. Phylogenetic tree construction using Markov chain Monte Carlo. Journal of the American Statistical Association. 2000;95:493–508.
  21. Minin V, Dorman K, Fang F, Suchard M. Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics. 2005;21:3034–3042. doi: 10.1093/bioinformatics/bti459.
  22. Owiredu W, Kramvis A, Kew M. Molecular analysis of hepatitis B virus genomes isolated from black African patients with fulminant hepatitis B. Journal of Medical Virology. 2001;65:485–492.
  23. Posada D, Crandall K, Holmes E. Recombination in evolutionary genomics. Annual Reviews of Genetics. 2002;36:75–97. doi: 10.1146/annurev.genet.36.040202.111115.
  24. Quarleri J, Rubio A, Carobene M, Turk G, Vignoles M, Harrigan R, Montaner J, Salomon H, Carrillo MG. HIV Type 1 BF recombinant strains exhibit different pol gene mosaic patterns: descriptive analysis from 284 patients under treatment failure. AIDS Research and Human Retroviruses. 2004;20:1100–1107. doi: 10.1089/aid.2004.20.1100.
  25. Rambaut A, Posada D, Crandall K, Holmes E. The causes and consequences of HIV evolution. Nature Reviews Genetics. 2004;5:52–61. doi: 10.1038/nrg1246.
  26. Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of Molecular Evolution. 1996;43:304–311. doi: 10.1007/BF02338839.
  27. Suchard M, Weiss R, Dorman K, Sinsheimer J. Inferring spatial phylogenetic variation along nucleotide sequences: a multiple changepoint model. Journal of the American Statistical Association. 2003;98:427–437.
  28. Suchard M, Weiss R, Sinsheimer J. Models for estimating Bayes factors with applications to phylogeny and tests of monophyly. Biometrics. 2005;61:665–673. doi: 10.1111/j.1541-0420.2005.00352.x.
  29. Temin H. Sex and recombination in retroviruses. Trends in Genetics. 1991;7:71–74. doi: 10.1016/0168-9525(91)90272-R.
  30. Wang Z, Liu Z, Zeng G, Wen S, Qi Y, Ma S, Naoumov N, Hou J. A new intertype recombinant between genotypes C and D of hepatitis B virus identified in China. Journal of General Virology. 2005;86:985–990. doi: 10.1099/vir.0.80771-0.
  31. Zeng G, Wang Z, Wen S, Jiang J, Wang L, Cheng J, Tan D, Xiao F, Ma S, Li W, and others. Geographic distribution, virologic and clinical characteristics of hepatitis B virus genotypes in China. Journal of Viral Hepatitis. 2005;12:609–617. doi: 10.1111/j.1365-2893.2005.00657.x.
  32. Zhang Y, Lu L, Ba L, Yang L, Jia M, Wang H, Fang Q, Shi Y, Yang W, Chang G, and others. Dominance of HIV-1 subtype CRF_AE in sexually acquired cases leads to a new epidemic in Yunnan Province of China. PLoS Medicine. 2006;3:2065–2076. doi: 10.1371/journal.pmed.0030443.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
