Postprocessing of Genealogical Trees

Loukia Meligkotsidou; Paul Fearnhead

doi:10.1534/genetics.107.071910

Genetics. 2007 Sep;177(1):347–358. doi: 10.1534/genetics.107.071910

PMCID: PMC2013683 PMID: 17565950

Abstract

We consider inference for demographic models and parameters based upon postprocessing the output of an MCMC method that generates samples of genealogical trees (from the posterior distribution for a specific prior distribution of the genealogy). This approach has the advantage of taking account of the uncertainty in the inference for the tree when making inferences about the demographic model and can be computationally efficient in terms of reanalyzing data under a wide variety of models. We consider a (simulation-consistent) estimate of the likelihood for variable population size models, which uses importance sampling, and propose two new approximate likelihoods, one for migration models and one for continuous spatial models.

There are two common approaches to analyzing population genetic data. The first approach involves (i) inferring a genealogical or phylogenetic tree for the data and (ii) making inferences about demographic or other parameters conditional on this tree. Examples of this include inference of the demography (Underhill et al. 2001), nested clade analysis (Templeton et al. 1987), and phylogeographic and spatial analysis (Emerson and Hewitt 2005; French et al. 2005). Often this approach is applied informally, with the qualitative features of the inferred tree being used to suggest plausible demographic histories for the sample (e.g., Shen et al. 2000).

The second approach involves joint inference of the genealogical tree and the parameters. In many cases the genealogical tree is a nuisance parameter, and calculation of the likelihood for the parameters involves integrating out the unknown tree, for example, in inference about various demographic models under a coalescent prior, including variable population sizes (Griffiths and Tavaré 1994a; Kuhner et al. 1998; Drummond et al. 2005) and population structure (Bahlo and Griffiths 1998; Beerli and Felsenstein 1999), inference for selection (Coop and Griffiths 2004), dispersal of a population (Brooks et al. 2007), and inference for recombination rates (Griffiths and Marjoram 1996; Kuhner et al. 2000; Fearnhead and Donnelly 2002). (In the latter case the genealogical information is contained in a graph and not in a tree.)

The advantage of the second approach is that, assuming the model for the genealogical tree is reasonable, the uncertainty in this genealogy is correctly incorporated into the inference about the parameters of interest. This is particularly important for data where there is considerable uncertainty in the genealogy (which is common for many data sets). The first approach of conditioning on a single estimate of the genealogy can sometimes lead to biases in estimates and, more generally, to underestimates of the uncertainty in the parameters. As a result, analysis conditional on the tree is often used primarily to test hypotheses (Templeton et al. 1987; French et al. 2005), rather than to estimate parameters of appropriate models.

However, implementing the second approach is considerably more challenging and generally requires the use of modern computationally intensive statistical methods (Stephens and Donnelly 2000). In particular, this often requires the development of customized programs to analyze the data under the specific model or models of interest, and the application of this approach can be limited by the availability of suitable software.

In this article we consider a new approach, which lies between these two approaches. The basic idea is (i) to perform inference for the genealogical/phylogenetic tree using a suitable Bayesian approach, obtaining a sample of trees from the posterior and (ii) to perform inference on the parameters of interest using this sample of trees. The idea is that by using a sample of trees in an appropriate way we can still take account of the uncertainty within the inference for the tree, but that this approach will be less computationally intensive and more widely applicable than the second approach above.

We consider inference under three different demographic models: (a) variable population size, (b) migration between discrete subpopulations, and (c) continuous spatial structure. For model a we present a simple importance-sampling approach that can reweight a sample of trees so that the resulting weighted sample approximates the posterior distribution of the genealogy under any variable population size model. For models b and c we propose approximate-likelihood functions based on specifying a probability model for the population or on spatial information of the sample given the genealogy.

Our aim is to evaluate the potential for this approach of postprocessing a sample of genealogical trees. As such we focus on the specific case of inference for a nonrecombining DNA region with infinite-sites data and known topology. The advantage of focusing on this special case is that there exists an algorithm for simulating directly from the posterior distribution of the coalescence times of the tree, under a specific prior (see methods). Thus we can focus on the computational and statistical efficiency of the postprocessing methods, without any need to take into account the possible effects of any inaccuracies in the method for generating the sample of trees. However, the ideas of postprocessing can be applied to the output of any MCMC or other approach for generating samples of trees from a known posterior distribution and thus are not restricted to the assumptions of infinite-sites data or known topology.

METHODS

Infinite-sites data and phylogenetic prior:

We focus on analyzing data from m chromosomes sampled from a population. We assume we have infinite-sites data from a nonrecombining region of the genome and that the topology of the genealogy is known. The infinite-sites data mean that we will know the number of mutations that have occurred on each branch of the genealogy. Our mutation model is that (for our chosen scaling of time) these mutations occur at a constant rate θ/2 along each branch of the genealogy.

We assume some labeling of the nodes in the genealogy and denote by t = (t₁,…, t_m₋₁) the coalescent times for these nodes. We take the usual convention of the current time being time 0 and time being measured backward into the past. We also introduce the notation t′ = (t′₁,…, t′_m₋₁) to denote the ordered coalescent times (so t′₁ < t′₂ < … < t′_m₋₁). In the genealogy there are 2(m − 1) branches. The branch lengths are denoted by b = (b₁,…, b_2(m−1)), and sequence data can be summarized by the number of mutations on each branch: n = (n₁,…, n_2(m−1)). The branch lengths, b, are a linear function of the coalescent times, t; and to emphasize their interdependence we write b(t) and b_i(t). The likelihood of the data, n, can be written as

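$$p(n \mid t, \theta) = \prod_{i=1}^{2(m-1)} \frac{\{\theta b_i(t)/2\}^{n_i}}{n_i!} \exp\{-\theta b_i(t)/2\}. \tag{1}$$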
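As an illustration of the mutation model just described (mutations arising independently at rate θ/2 along each branch, so that the count on a branch of length b_i is Poisson with mean θb_i/2), the following minimal sketch evaluates the log-likelihood for a small genealogy. The parent-array encoding of the tree, the function names, and the numerical values are assumptions made for this example only.

```python
import math

def branch_lengths(times, parent):
    """Length of the branch above each non-root node: parent time minus node time.

    times[i]  : time of node i (0 for sampled chromosomes, coalescence time otherwise)
    parent[i] : index of the parent of node i, or None for the root
    """
    return [times[parent[i]] - times[i]
            for i in range(len(times)) if parent[i] is not None]

def log_likelihood(n, b, theta):
    """Poisson log-likelihood: n[i] mutations on a branch of length b[i] > 0."""
    total = 0.0
    for n_i, b_i in zip(n, b):
        rate = theta * b_i / 2.0
        total += n_i * math.log(rate) - rate - math.lgamma(n_i + 1)
    return total

# Hypothetical genealogy of m = 3 chromosomes: nodes 0-2 are tips, 3 and 4 internal.
parent = [3, 3, 4, 4, None]
times = [0.0, 0.0, 0.0, 0.5, 1.2]
b = branch_lengths(times, parent)   # 2(m - 1) = 4 branches
n = [1, 0, 2, 1]                    # made-up mutation counts, one per branch
print(b, log_likelihood(n, b, theta=2.0))
```

Because the branch lengths are linear in the coalescent times, the same function can be evaluated for any sampled vector t by recomputing `b`.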

Now we use the pure birth process prior of Rannala and Yang (1996) for the coalescent times, which assumes that the length of each branch has an exponential distribution with rate φ,

$$\pi_1(t \mid \phi) \propto \phi^{m-1} \exp\Big\{-\phi \sum_{i=1}^{2(m-1)} b_i(t)\Big\}. \tag{2}$$

Under this prior the posterior distribution for t (given φ and θ) is

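$$p(t \mid n, \theta, \phi) \propto \prod_{i=1}^{2(m-1)} b_i(t)^{n_i} \exp\Big\{-(\phi + \theta/2) \sum_{i=1}^{2(m-1)} b_i(t)\Big\}. \tag{3}$$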

Note that setting φ = 0 produces a posterior that is proportional to the likelihood function.

By introducing new variables s = (s₁,…, s_m₋₁), which satisfy s_i = (φ + θ/2)t_i, we obtain

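$$p(s \mid n) \propto \prod_{i=1}^{2(m-1)} b_i(s)^{n_i} \exp\Big\{-\sum_{i=1}^{2(m-1)} b_i(s)\Big\}, \tag{4}$$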

where by the linear relationship between branch lengths and coalescent times b_i(s) = (φ + θ/2)b_i(t). Fearnhead and Meligkotsidou (2004) show how to draw independent and identically distributed (i.i.d.) samples from this density and hence (through rescaling) from the posterior (3). Furthermore this gives that the likelihood for φ is proportional to

$$\frac{\phi^{m-1}}{(\phi + \theta/2)^{n+m-1}}, \tag{5}$$

where n is the total number of mutations.

Variable population size:

Consider a panmictic population of current effective population size N chromosomes, with time measured in units of N generations, and let the effective population size at time t in the past be N/λ(t). The distribution for the coalescence times for a random sample of m chromosomes from such a population (Griffiths and Tavaré 1994a) is

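$$\pi_2(t \mid \lambda(\cdot)) = \prod_{j=1}^{m-1} \binom{m-j+1}{2} \lambda(t'_j) \exp\Big\{-\binom{m-j+1}{2} \big[\Lambda(t'_j) - \Lambda(t'_{j-1})\big]\Big\}, \tag{6}$$

where $\Lambda(t) = \int_0^t \lambda(u)\,du$ and $t'_0 = 0$; recall that the $t'_j$ are defined as the ordered coalescent times.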
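The density of the ordered coalescence times can be evaluated directly once λ(t) and its integral are available. The sketch below assumes the standard form of the variable-population-size coalescent, in which pairs coalesce at total rate C(j, 2)λ(t) while j lineages remain; the function names and the exponential-growth example, λ(t) = exp(βt), are illustrative choices rather than anything prescribed by the text.

```python
import math

def log_coalescent_density(ordered_times, lam, Lam):
    """Log density of ordered coalescence times t'_1 < ... < t'_{m-1}.

    lam(t): relative coalescence rate lambda(t)
    Lam(t): integral of lambda from 0 to t
    """
    m = len(ordered_times) + 1
    logp, prev = 0.0, 0.0
    for k, t in enumerate(ordered_times):
        lineages = m - k                          # lineages present just before t
        pairs = lineages * (lineages - 1) / 2.0
        logp += math.log(pairs * lam(t)) - pairs * (Lam(t) - Lam(prev))
        prev = t
    return logp

def exponential_growth(beta):
    """lambda(t) = exp(beta * t): the population was smaller further in the past."""
    return (lambda t: math.exp(beta * t),
            lambda t: math.expm1(beta * t) / beta)

lam, Lam = exponential_growth(0.7)
print(log_coalescent_density([0.5, 1.5], lam, Lam))
```

Passing `lam = lambda t: 1.0` and `Lam = lambda t: t` recovers the constant-size coalescent, which is a convenient check on any new λ(t).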

Interest lies in generating samples from the posterior distribution of the coalescent times p(t | λ(·), θ, n) and in calculating the marginal likelihood p(n | λ(·), θ). The former allows us to perform inference for a given demographic model, and the latter is required for choosing between different demographic models.

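Both of these can be achieved through an algorithm that generates samples of the coalescent times from (3) and then reweights these samples. For example,

$$p(n \mid \lambda(\cdot), \theta) = \int p(n \mid t, \theta)\, \pi_2(t \mid \lambda(\cdot))\, dt = \int \frac{\pi_2(t \mid \lambda(\cdot))}{\pi_1(t \mid \phi)}\, p(n \mid t, \theta)\, \pi_1(t \mid \phi)\, dt \propto E\left[\frac{\pi_2(t \mid \lambda(\cdot))}{\pi_1(t \mid \phi)}\right],$$

where the expectation is with respect to p(t | n, θ, φ), and the constant of proportionality is $p(n \mid \theta, \phi) = \int p(n \mid t, \theta)\, \pi_1(t \mid \phi)\, dt$. The last step of working above uses $p(t \mid n, \theta, \phi) = p(n \mid t, \theta)\, \pi_1(t \mid \phi) / p(n \mid \theta, \phi)$. A natural estimate of this expectation is based on the sample mean of π₂(t | λ(·))/π₁(t | φ) for an i.i.d. sample from p(t | n, θ, φ). In addition, the weighted sample will approximate p(t | λ(·), θ, n). This is a standard importance-sampling approach, and for more general details of this method see Srinivasan (2002).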

Specifically the algorithm is as follows:

A. Generate an i.i.d. sample of size K from (3) using the method of Fearnhead and Meligkotsidou (2004). Denote the sample as t⁽¹⁾, …, t^(K).
B. For k = 1, …, K assign t^(k) a weight w_k = π₂(t^(k) | λ(·))/π₁(t^(k) | φ). Let $C = \sum_{k=1}^{K} w_k$.
C. The weighted sample, t⁽¹⁾, …, t^(K) with corresponding weights w₁/C, …, w_K/C, approximates the posterior p(t | λ(·), θ, n). Furthermore, an estimate of the marginal likelihood p(n | λ(·), θ) (up to a common constant of proportionality) is given by C/K.
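The reweighting steps can be sketched in a few lines. Since the sampler of Fearnhead and Meligkotsidou (2004) is not reproduced here, the example below replaces the sampling step with a toy case that is an assumption of this sketch: for m = 2 chromosomes both branches have length t, so the posterior under the pure-birth prior is a gamma density with shape n + 1 and rate 2φ + θ, and the target is the constant-size coalescent. The effective sample size is an added diagnostic, not part of the algorithm as stated.

```python
import math
import random

def reweight(samples, log_pi2, log_pi1):
    """Weights w_k = pi2 / pi1, normalised by their sum C, plus summaries."""
    w = [math.exp(log_pi2(t) - log_pi1(t)) for t in samples]
    C = sum(w)
    normalised = [w_k / C for w_k in w]
    ess = 1.0 / sum(v * v for v in normalised)   # effective sample size
    return normalised, C / len(samples), ess

# Toy stand-in for the sampling step with m = 2 (not the sampler of the paper).
random.seed(1)
theta, phi, n_mut, K = 2.0, 1.0, 3, 20000
samples = [random.gammavariate(n_mut + 1, 1.0 / (2 * phi + theta))
           for _ in range(K)]

def log_pi1(t):
    return math.log(2 * phi) - 2 * phi * t   # pure-birth prior, m = 2

def log_pi2(t):
    return -t                                # constant-size coalescent, m = 2

weights, estimate, ess = reweight(samples, log_pi2, log_pi1)
# Closed form of E[pi2 / pi1] under the gamma proposal, used only as a check.
exact = ((2 * phi + theta) / (theta + 1.0)) ** (n_mut + 1) / (2 * phi)
print(estimate, exact, ess)
```

Only `log_pi2` changes when a different population-size model is considered, so the same sample can be reused across models; a small effective sample size relative to K signals that the proposal is a poor match for that model.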

The advantage of this approach is that the costly, in terms of CPU time, step of generating the sample of coalescent times in A is required only once. Calculating the importance-sampling weights in B has negligible CPU cost and thus can be repeated easily for a wide range of possible models for how the population size has varied through time. For informative data, the hope is that (3), which is closely related to the likelihood, will be a good proposal density for a wide range of λ(t)'s. However, the efficiency of this method is likely to depend crucially on the sample size m, which affects the dimension of t.

Migration models:

We now consider inference for a structured population model. We consider a model with D demes, with constant population sizes N₁, …, N_D, respectively, and D × D backward migration matrix M = {M_ij}. Under this model, backward in time a chromosome currently in deme i will migrate to deme j with rate M_ij/2. The diagonal elements are defined so that rows of the matrix sum to zero, $M_{ii} = -\sum_{j \neq i} M_{ij}$. We assume the population is at stationarity, so that the expected number of migrants leaving a deme is equal to the expected number entering, which corresponds to $\sum_{j=1}^{D} N_j M_{ji} = 0$ for i = 1, …, D, and thus the model is parameterized by the migration matrix M, and the total population size $N = \sum_{i=1}^{D} N_i$. Note that knowledge of the migration matrix and the total population size will define the population sizes of the individual demes.

The data now include the deme in which each of the chromosomes was sampled. We propose an approximate-likelihood approach to estimating the migration rates. We first introduce an approximate likelihood function for the observed demes of the sample conditional on t. We denote this by $\tilde{p}(x_1, \ldots, x_m \mid t, M)$, where $x_1, \ldots, x_m$ are the demes of the sampled chromosomes. The approximation that we use treats the deme that a chromosome belongs to in an equivalent way to an allele. This is an approximation as migration models assume strong density regulation, so that the population size of each deme is constant over time and a fixed proportion of chromosomes move from one deme to another in a single generation. By comparison our approximation is (by direct analogy to neutral Wright–Fisher models) equivalent to allowing the population sizes of the demes to fluctuate through time. Each chromosome in a given deme chooses independently whether to migrate from its deme to another (with the probability of migrating and the deme to which it migrates being determined by the migration rates). For real-life populations, the truth is likely to lie in between these two extremes: with some degree of variation in population size of demes over time, but with density regulation restricting this variability.

To define our approximate likelihood we first define γ_i = N_i/N for i = 1, …, D and introduce a forward migration matrix F whose entries satisfy F_ij = N_jM_ji/N_i, for i, j = 1, …, D. So the probability of a specific descendant of a chromosome in deme y being in deme x at a time t in the future is

$$[\exp\{Ft/2\}]_{yx},$$

the (y, x) entry of the matrix exponential of Ft/2.

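We introduce a vector x = (x₁, …, x_2m−1), where (x₁, …, x_m) denotes the deme of the m chromosomes in the sample, and (x_m₊₁, …, x_2m−1) are the demes of the internal nodes of the genealogy. We assume x_2m−1 is the deme of the most recent common ancestor. Finally, for i = 1, …, 2m − 2, we let b_i be the branch length connecting node i to its parent and y_i be the deme of the parent of node i. Then we define a joint density

$$p(x \mid t, M) = \gamma_{x_{2m-1}} \prod_{i=1}^{2m-2} [\exp\{F b_i/2\}]_{y_i x_i},$$

where $[\exp\{F b_i/2\}]_{y_i x_i}$ is the probability of moving from deme $y_i$ to deme $x_i$ along branch i and the $\gamma_{x_{2m-1}}$ term comes from the stationary distribution of the migration process. Finally, the likelihood conditional on t is

$$\tilde{p}(x_1, \ldots, x_m \mid t, M) = \sum_{x_{m+1}=1}^{D} \cdots \sum_{x_{2m-1}=1}^{D} p(x \mid t, M). \tag{7}$$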

Note that this likelihood is uninformative about the total population size N. Calculating (7) is possible using the peeling algorithm of Felsenstein (1981).
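To make the peeling computation concrete, the following sketch sums over the unobserved demes of the internal nodes by passing partial likelihoods from the tips toward the root. The tree encoding (child lists, with children indexed before their parents and the root last), the function names, and the symmetric two-deme transition function are assumptions for illustration; for a general forward matrix F the `trans` argument would return the matrix exponential of Fb/2.

```python
import math

def peel(children, branch, tip_deme, gamma, trans):
    """Peeling for deme labels on a fixed genealogy.

    children[v]: child nodes of v (empty for tips); the root is the last node
    branch[v]  : length of the branch above node v (unused for the root)
    tip_deme[v]: observed deme of tip v
    gamma      : stationary deme proportions, applied at the root
    trans(b)   : D x D matrix, [y][x] = P(descendant in x | ancestor in y, time b)
    """
    D = len(gamma)
    partial = {}
    for v in range(len(children)):       # assumes children precede their parents
        if not children[v]:
            partial[v] = [1.0 if d == tip_deme[v] else 0.0 for d in range(D)]
            continue
        vec = [1.0] * D
        for c in children[v]:
            P = trans(branch[c])
            for y in range(D):
                vec[y] *= sum(P[y][x] * partial[c][x] for x in range(D))
        partial[v] = vec
    root = len(children) - 1
    return sum(gamma[d] * partial[root][d] for d in range(D))

def symmetric_two_deme(a):
    """Transition matrices for F = [[-a, a], [a, -a]] with moves at rate F_ij / 2."""
    def trans(b):
        same = 0.5 + 0.5 * math.exp(-a * b)
        return [[same, 1.0 - same], [1.0 - same, same]]
    return trans

# Hypothetical genealogy of three chromosomes; tips 0 and 1 in deme 0, tip 2 in deme 1.
children = [[], [], [], [0, 1], [2, 3]]
branch = [0.5, 0.5, 1.2, 0.7, None]
print(peel(children, branch, [0, 0, 1], [0.5, 0.5], symmetric_two_deme(1.0)))
```

The cost is linear in the number of nodes and quadratic in D per branch, so evaluating the function for each sampled genealogy remains cheap.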

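Our approximate likelihood is then obtained by averaging the conditional likelihood (7) over samples of t from (3). So given a sample t⁽¹⁾, …, t^(K) from (3), we get

$$\frac{1}{K} \sum_{k=1}^{K} \tilde{p}(x_1, \ldots, x_m \mid t^{(k)}, M),$$

where $\tilde{p}(x_1, \ldots, x_m \mid t^{(k)}, M)$ denotes (7) evaluated at $t^{(k)}$.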

Note that a direct importance-sampling approach (similar to that used for the variable population size scenario) is not computationally feasible here. To calculate importance-sampling weights we need to know not only t but also the specific details of all migration events in the history of our sample. We have considered an importance-sampling approach that imputes the migration events, but the resulting method was highly inefficient because of the large space of possible migration events for any given data set.

Continuous spatial models:

Finally we consider inference for samples obtained across a continuous spatial habitat. We assume that the data now include a spatial location for each sampled chromosome. We focus on inference under an isolation-by-distance model.

For simplicity we first describe the model assuming a one-dimensional location. We assume that the displacement of the location of a chromosome from the location of its ancestor at time t in the past has a univariate Gaussian distribution, with zero mean and variance σ²t. First, condition on the genealogy of the sample. Furthermore, let μ be the location of the most recent common ancestor (MRCA), T be the time to the MRCA, and t_ij be the time back to the first common ancestor of chromosomes i and j. Then, conditional on this, the spatial data X = (X₁, …, X_m) have a multivariate normal distribution with

$$E(X_i) = \mu, \qquad \mathrm{Cov}(X_i, X_j) = \sigma^2 (T - t_{ij}),$$

for all i, j = 1, …, m (with $t_{ii} = 0$, so that $\mathrm{Var}(X_i) = \sigma^2 T$). The intuition here is that as dispersion is unbiased, the expected location of each sampled chromosome will be the location of the MRCA; whereas the covariance between the locations of two chromosomes is proportional to the amount of shared ancestry they have back to the most recent common ancestor. This model trivially extends to the case of two-dimensional locations where the dispersion in each direction is independent and identically distributed.
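The covariance structure follows directly from the genealogy: two chromosomes share the path from the MRCA down to their first common ancestor, which has duration T − t_ij, so their covariance is σ²(T − t_ij). A minimal sketch of building this matrix is given below; the parent-array tree encoding, the function names, and the numbers are assumptions for illustration.

```python
def mrca_times(times, parent, m):
    """t_ij for all pairs of the m sampled chromosomes (tips are nodes 0..m-1)."""
    def ancestors(v):
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path

    t = [[0.0] * m for _ in range(m)]
    for i in range(m):
        anc_i = set(ancestors(i))
        for j in range(i + 1, m):
            # First node on the path from j to the root that is also above i.
            common = next(v for v in ancestors(j) if v in anc_i)
            t[i][j] = t[j][i] = times[common]
    return t

def location_covariance(times, parent, m, sigma):
    """Cov(X_i, X_j) = sigma^2 * (T - t_ij), with t_ii = 0."""
    T = max(times)                       # time to the MRCA of the whole sample
    t = mrca_times(times, parent, m)
    return [[sigma ** 2 * (T - t[i][j]) for j in range(m)] for i in range(m)]

# Hypothetical genealogy: tips 0-2, internal nodes 3 (time 0.5) and 4 (time 1.2).
parent = [3, 3, 4, 4, None]
times = [0.0, 0.0, 0.0, 0.5, 1.2]
for row in location_covariance(times, parent, 3, sigma=2.0):
    print(row)
```

In this example chromosomes 0 and 1 share 0.7 time units of ancestry, while chromosome 2 shares none with either, so its covariances with them are zero.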

To perform inference we then introduce a prior distribution on the genealogy of the sample and a prior distribution on μ. We use (2) as the prior on the genealogy and we choose an improper uniform prior on μ. For this choice of prior on μ it is possible to analytically integrate out μ conditional on the genealogy (Rue and Held 2005). We write p(x | t, σ) to be the resulting conditional probability of the data, given just the genealogy and σ, and p(μ | x, t, σ) to be the corresponding conditional distribution for μ.
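One way to carry out the integration over μ, sketched here under the assumption that the locations are multivariate normal with common mean μ and a known covariance matrix Σ (built from the genealogy and σ), is to complete the square in μ. With a = 1ᵀΣ⁻¹1 and b = 1ᵀΣ⁻¹x, this gives a marginal density proportional to |Σ|^(−1/2) a^(−1/2) exp{−(xᵀΣ⁻¹x − b²/a)/2} and a normal conditional distribution for μ with mean b/a and variance 1/a. This is a standard Gaussian identity rather than a formula taken from the article, and the helper names are hypothetical.

```python
import math

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, v):
    """Solve L z = v by forward substitution."""
    z = []
    for i in range(len(v)):
        z.append((v[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def log_marginal(x, Sigma):
    """log p(x | Sigma) with mu integrated out under a flat prior.

    Returns (log density, conditional mean of mu, conditional variance of mu).
    """
    m = len(x)
    L = cholesky(Sigma)
    zx = solve_lower(L, x)
    z1 = solve_lower(L, [1.0] * m)
    q = sum(v * v for v in zx)                 # x' Sigma^-1 x
    a = sum(v * v for v in z1)                 # 1' Sigma^-1 1
    b = sum(u * v for u, v in zip(z1, zx))     # 1' Sigma^-1 x
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(m))
    logp = (-0.5 * (m - 1) * math.log(2 * math.pi) - 0.5 * log_det
            - 0.5 * math.log(a) - 0.5 * (q - b * b / a))
    return logp, b / a, 1.0 / a

S = [[4.8, 2.8, 0.0], [2.8, 4.8, 0.0], [0.0, 0.0, 4.8]]   # made-up covariance
print(log_marginal([1.0, 1.5, -2.0], S))
```

Because the prior on μ is flat, the result is unchanged when every location is shifted by the same amount, which offers a simple sanity check.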

For many spatial genetic studies, samples are generated by first choosing the locations and then sampling chromosomes at those locations. Thus it makes sense to perform inference for σ under a conditional likelihood, where we condition on the spatial location. More generally, use of the conditional likelihood for σ means that inferences should depend less on the choice of prior on the genealogy (since in the limit as the mutation rate tends to 0, the conditional likelihood will become constant). If as before we denote the genetic data by n and the spatial data by x, then the conditional likelihood can be written as

$$CL(\sigma) = p(n \mid x, \sigma) = \frac{\int p(n \mid t, \theta)\, p(x \mid t, \sigma)\, \pi(t)\, dt}{\int p(x \mid t, \sigma)\, \pi(t)\, dt},$$

where π(t) denotes the prior on the genealogy.

If we use the prior (2), but rather than specifying a value of φ use the uninformative hyperprior π(φ) ∝ 1/φ, then the denominator is constant as a function of σ (see the appendix), which greatly simplifies the calculation of this conditional likelihood.

We calculate CL(σ) by simulation as follows:

A. We simulate K i.i.d. samples of times, by repeatedly (i) simulating φ from its posterior and (ii) simulating t from (3) conditional on that φ. Denote the sample as t⁽¹⁾, …, t^(K).
B. For k = 1, …, K assign t^(k) a weight w_k = p(x | t^(k), σ). Let $C = \sum_{k=1}^{K} w_k$.
C. An estimate of CL(σ) is C/K, and the posterior distribution for μ is approximated by the mixture

$$\sum_{k=1}^{K} \frac{w_k}{C}\, p(\mu \mid x, t^{(k)}, \sigma).$$

Simulation in part i of A is straightforward, as the posterior for φ is proportional to

$$\frac{\phi^{m-2}}{(\phi + \theta/2)^{n+m-1}},$$

and can be related to a beta distribution through the transformation γ = φ/(φ + θ/2).
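The beta relationship gives a direct way to simulate φ. The sketch below assumes that the posterior for φ has the form φ^(m−2)/(φ + θ/2)^(n+m−1), which is what the hyperprior π(φ) ∝ 1/φ yields if the likelihood for φ is proportional to φ^(m−1)/(φ + θ/2)^(n+m−1); under that assumption a change of variables shows γ = φ/(φ + θ/2) to be Beta(m − 1, n). Both the assumed form and the function name are part of this illustration and should be checked against the original equations.

```python
import random

def sample_phi(m, n, theta, rng=random):
    """Draw phi from a density proportional to phi^(m-2) / (phi + theta/2)^(n+m-1).

    m: number of sampled chromosomes (m >= 2)
    n: total number of mutations (n >= 1)
    """
    g = rng.betavariate(m - 1, n)        # g = phi / (phi + theta/2)
    return (theta / 2.0) * g / (1.0 - g)  # invert the transformation

random.seed(0)
draws = [sample_phi(m=10, n=20, theta=3.0) for _ in range(5)]
print(draws)
```

Each draw of φ would then be followed by a draw of t from (3) conditional on that value.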

Simulation of continuous spatial data:

Simulating data under an appropriate continuous spatial model is difficult. There appear to be two approaches: first, those based on the isolation-by-distance model of Wright (1943), which ignores any regulation of population density and thus produces populations with infinite density (Felsenstein 1975), and, second, models that assume a constant population density (Wilkins and Wakeley 2002; Wilkins 2004) and require the population to live on some closed finite region.

As our inference model ignores any restriction on the location of chromosomes as required for these latter models, we simulated data under a version of the isolation-by-distance model of Wright (1943). In particular, we simulated the genealogical tree for our data under a coalescent model with exponential population growth and then conditional on this simulated the spread of the chromosomes from the model described above. The idea is to model a situation where the effect of population density regulation is less: that of a population growing in size to fill a new habitat. Note that we are simulating the data under a different model from that under which we are analyzing the data, as the distributions on the genealogy differ.
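The dispersal step of such a simulation can be sketched as follows: given a genealogy, the MRCA is placed at μ and each node is displaced from its parent by a Gaussian increment with variance σ² times the branch length. The sketch covers only this step; simulating the genealogy itself under exponential growth is not shown, and the tree encoding, function name, and parameter values are assumptions for illustration.

```python
import math
import random

def simulate_locations(times, parent, mu, sigma, rng):
    """One-dimensional locations for every node of a genealogy.

    times[v] : time of node v (0 for tips), with strictly positive branch lengths
    parent[v]: parent of node v, or None for the root (the MRCA)
    """
    loc = [None] * len(times)
    # Visit nodes from oldest to youngest so each parent is placed first.
    for v in sorted(range(len(times)), key=lambda u: -times[u]):
        if parent[v] is None:
            loc[v] = mu
        else:
            b = times[parent[v]] - times[v]
            loc[v] = loc[parent[v]] + rng.gauss(0.0, sigma * math.sqrt(b))
    return loc

# Hypothetical genealogy: tips 0-2, internal nodes 3 (time 0.5) and 4 (time 1.2).
parent = [3, 3, 4, 4, None]
times = [0.0, 0.0, 0.0, 0.5, 1.2]
rng = random.Random(42)
print(simulate_locations(times, parent, mu=0.0, sigma=1.0, rng=rng)[:3])
```

For two-dimensional locations the same function can be called once per coordinate, since dispersion in each direction is taken to be independent.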