An agglomerative hierarchical clustering approach to visualisation in Bayesian clustering problems

Kevin J Dawson; Khalid Belkhir

doi:10.1038/hdy.2009.29

. Author manuscript; available in PMC: 2010 Jan 1.

Published in final edited form as: Heredity (Edinb). 2009 Apr 1;103(1):32–45. doi: 10.1038/hdy.2009.29

An agglomerative hierarchical clustering approach to visualisation in Bayesian clustering problems

Kevin J Dawson ^†, Khalid Belkhir ^*

PMCID: PMC2705916 EMSID: UKMS3290 PMID: 19337306

Abstract

Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals, - the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. Since the number of possible partitions grows very rapidly with the sample size, we can not visualise this probability distribution in its entirety, unless the sample is very small. As a solution to this visualisation problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package Partition View.

The exact linkage algorithm takes the posterior co-assignment probabilities as input, and yields as output a rooted binary tree, - or more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities.

Keywords: Bayesian clustering, hierarchical clustering, label switching, MCMC, population genetics, visualisation

1 Introduction

In a clustering problem we have a collection of individuals (the sample) which must be classified into mutually exclusive categories. In general the number of categories represented in the sample, and the characteristics of these categories, are uncertain. Problems of this type have recently received much attention in population genetics, where the data available from the sampled individuals include their genotypes at a number of marker loci. (Other types of data which may be available include the spatial positions of individuals, and observations of their behaviour.) Examples include the following. (i) Clustering individuals from an outcrossing species into putative populations (represented approximately as panmictic populations). These putative populations may correspond to subspecies, or even individual spawning grounds or breeding sites (for example, in phylopatric species of fish, amphibians or marine turtles). For Bayesian approaches to these problems, see Pritchard et al. (2000); Dawson and Belkhir (2001); Falush et al. (2003); Corander et al. (2003, 2004, 2006); Corander and Marttinen (2006); Pella and Masuda (2006); Huelsenbeck et al. (2007). Methods are also available which make use of spatial position data in addition to genotype data from the sampled individuals (Guillot et al., 2005; François et al., 2006; Chen et al., 2007). (In the earlier work of Wasser et al. (2004), individuals were not classified into discrete populations.) (ii) The more general problem of clustering individuals from an outcrossing species into hybrid generations, including F1, F2, and B1 crosses, as well as parental populations has been addressed by Anderson and Thompson (2002). Falush et al. (2003) also allow for the possibility that the sample contains individuals of hybrid origin. (iii) Clustering individuals into (full-sib) families (Painter, 1997; Almudevar and Field, 1999; Thomas and Hill, 2000, 2002; Emery et al., 2001; Wang, 2004; Hadfield et al., 2006). Painter (1997); Thomas and Hill (2000); Smith et al. (2001) assumed that the separate full-sib families do not share any parents in common. Thomas and Hill (2002) and Wang (2004) imposed the weaker constraint that parents of only one sex are permitted to have multiple mates. Emery et al. (2001) imposed no constraints on the matings among parents. Hadfield et al. (2006) incorporated spatial information on the territories occupied by individuals. (iv) Clustering individuals from a partially selfing population into selfing lines (Wilson and Dawson, 2007), or individuals from a partially asexual population into clonal lineages.

In a clustering problem, the parameter of interest is a partition of the label set S of the sample. We refer to this parameter as the sample partition. A partition of a set S is a set of non-empty disjoint subsets of S, the union of which is the set S itself.

From a Bayesian point of view, inferences about the sample partition should be based on the (marginal) posterior distribution of the sample partition. An MCMC sampler can be used to generate a large sample of observations from the posterior distribution of the sample partition, - or at least a distribution which is an adequate approximation to this posterior distribution. This is the approach taken in many of the papers cited above. Here we are concerned with the treatment of the output form the Markov chain sampler.

2 The visualisation problem

Bayesian clustering problems of this type have certain features which have proved to be challenging. The number of ways of partitioning a set of n distinct elements into k non-empty disjoint (and unlabelled) subsets is equal to the Stirling number of the second kind, denoted by ${\begin{matrix} n \\ k \end{matrix}}$ . These numbers grow very rapidly with n. The total number of ways of partitioning a set of n distinct elements into non-empty disjoint (and unlabelled) subsets, is given by the sum

B (n) = \sum_{k = 1}^{n} {\begin{matrix} n \\ k \end{matrix}} .

This is the nth Bell number. (For further information see for example, Aigner (1979).) Since these numbers grow very rapidly with n, it is only in cases where n is small, that a probability distribution over the space of partitions of the set can be visualised in its entirety. This is what we mean by a visualisation problem. There need not be a problem if there are only a small number of partitions which have an appreciable share of the posterior probability, - as these partitions can easily be identified and listed along with their individual posterior probabilities. However, if many of the individuals in the sample are difficult to classify, then there will be many plausible partitions, each one having an individually low posterior probability. Under these circumstance those regions where the probability is elevated will be difficult to identify.

In addition to this visualisation problem, there is also a computational problem. The Monte Carlo method will not provide accurate estimate of very low probabilities (because these will necessarily be based on a small number of observations). Therefore, in situations where there are a large number of plausible partitions, there is a need to collapse this marginal posterior distribution further, in order to obtain marginal posterior probabilities of more probable events.

Some authors, using Bayesian methodology, have avoided the visualisation problem by considering only point estimation of the sample partition (Corander et al., 2003; Huelsenbeck et al., 2007; Coulon et al., 2008). Corander et al. (2003) used a MAP (maximum a posteriori probability) estimate. However, MAP estimates are vulnerable to the computational problem which arises when there are many plausible partitions, each one having an individually low posterior probability.

The Bayesian framework provides us with a general strategy for overcoming visualisation problems. We compute marginal distributions (possibly many different marginals) which can be visualised individually, each of which provides some information about the parameter of interest. We can also estimate the posterior probability that the sample partition satisfies a chosen condition. This is equivalent to the (marginal) posterior distribution of an indicator function corresponding to the chosen condition. We can compute these posterior probabilities for as many different conditions as we find helpful.

Dawson and Belkhir (2001) made use of the posterior co-assignment probabilities of sets of individuals. Let S denote the label set of the sample. The co-assignment probability of a set U Inline graphic S is the probability that the set of individuals, U, all belong to the same category (or element of the partition). A co-assignment probability can also be interpreted as the probability that the sample partition satisfies a condition. In this case, the condition is that the chosen set U is a subset of (or is equal to) an element of the sample partition. Here we denote the posterior co-assignment probability of the set U by II (U).

3 The label switching problem

In the past, Bayesian clustering methods have been beset by the so-called label switching problem described by Richardson and Green (1997). These are problems which can arise in Bayesian inference about any model having a set partition (such as a sample partition) as a parameter, when we try to compute marginal distributions for parameters which depend upon an arbitrary labelling of the elements of the partition. The labelling of the elements is arbitrary whenever the prior and the likelihood function are both invariant under permutation of the component labels, - in which case the posterior distribution is also invariant under permutation of the component labels. For example, one might be tempted to compute what we refer to here as the posterior assignment probability of individual i to a particular category α (conditional on there being K categories, say). This is the posterior probability that individual i belongs to category α, conditional on there being K categories. However, α is an arbitrary label, which does not refer to any fixed entity. A category is characterised by the individuals assigned to it, and by the value of some category-specific parameter, all of which are uncertain. In fact, Pritchard et al. (2000) recommended assigning individuals to discrete clusters using these posterior assignment probabilities, which they refer to as membership coefficients, and which are an output of their widely-used software package (STRUCTURE).

In the case of STRUCTURE, this label switching problem is obscured by the poor mixing of the Markov chain sampler. (According to Pritchard et al. (2000), their Markov chain samples in the vicinity of a single mode of the posterior distribution.) It is only as a consequence of poor mixing that a set of individuals which have a high posterior co-assignment probability, also tend to remain associated with the same label over many iterations of the Markov chain sampler. In the model which underlies STRUCTURE, the parameter K is the number of potential populations. For each run of the Markov chain sampler, K is held fixed. The potential populations are labelled α = 1, . . . , K. One of the outputs from a run of the Markov chain sampler is an n × K matrix of posterior assignment probabilities. This output is referred to as a Q-matrix. If the Markov chain sampler mixed well, then each of these assignment probabilities would be equal to 1/K, and so would be completely uninformative.

Recently (Jakobsson and Rosenberg, 2007) the software package CLUMPP has been developed to process the output from STRUCTURE. This program takes as its input the Q-matrices computed from r runs of the Markov chain sampler. It then searches for the sequence of r - 1 permutations of the set {1, . . . , K} (and hence permutations of the columns of the Q-matrix) which maximises a measure of similarity for the sequence of r Q-matrices. Once the required sequences of permutations has been found, it is applied to the sequence of Q-matrices. The resulting sequence of (column permuted) Q-matrices can then be passed as input to the visualisation software package DESTRUCT (Rosenberg, 2004). It seems to us that CLUMPP does not address the label switching problem. If the Markov chain sampler mixed well, then each of the Q-matrices (which CLUMPP takes as its input), would have every entry equal to 1/K, in which case the output from CLUMPP would be as uninformative as the input. If on the other hand the Markov chain sampler mixes poorly, then we can not rely on it to make accurate estimates of posterior probabilities.

Label switching problems can also arise in Bayesian inference about mixture models, whenever we try to compute marginal distributions for parameters which depend upon an arbitrary labelling of the component populations of the mixture. Richardson and Green (1997) drew attention to the label switching problem in the context of their attempt to compute the marginal posterior distribution of component-specific parameters. For example, in the case of a population genetic model, where the components correspond to distinct panmictic populations, we might want to compute the marginal posterior distribution of the frequency of a particular allele type in each population. The solution to the label switching problem offered by Richardson and Green (1997) was to rank the components of the partition with respect to the value of one of the component-specific parameters. It is then possible to compute the marginal posterior distribution for any component-specific parameter in the first component, the second component, and so on. Such an ordering of the components has been referred to as an identifiability constraint (Celeux et al., 2000; Hurn et al., 2003; Jasra et al., 2005). These are certainly well defined marginals of the posterior distribution. However, there is no guarantee that any of these marginals will correspond to a unique component of the true mixture. We may also have to choose from a bewildering array of possible identifiability constraints. For example, in the case of a population genetic model, where the components correspond to distinct panmictic populations, we could choose to order the components with respect to the frequency of any allele type, at any locus, which is represented in the genotype data. An identifiability constraint of this type was used by Guillot et al. (2005).

When the number of potential components in the mixture is a constant K, the posterior distribution can be represented as a symmetric mixture of K! copies of a single joint distribution, each with a different permutation of the component labels. This observation is the starting point for an alternative approach proposed by Stephens (1997, 2000), where the aim is to assign each observation from the posterior distribution to one of the K! permutation. This is achieved by applying a k means type clustering algorithm to the sample of observation from the posterior distribution, to assign each observation to one of K! clusters. Jasra et al. (2005) referred to this method as a relabelling algorithm. This may prove to be a very difficult clustering problem when the K! components in the symmetric mixture overlap too much.

Jasra et al. (2005) favoured a third approach to the label-switching problem, originally proposed by Celeux et al. (2000) and Hurn et al. (2003). The authors advocated solving the label-switching problem within the framework of decision theory, by introducing a label invariant loss function. The loss function is a function of parameter values of the model (the sample partition, and the component-specific parameters), and an action. (For an introduction to decision theory, see Berger (1985).) In Celeux et al. (2000) and Hurn et al. (2003), the action is simply a point estimate, of either the sample partition, or a sequence of component-specific parameters. (In principle, the action could also be the choice of a credible set, rather than the choice of a point estimate. However, in the case of the sample partition, this would take us back to the visualisation problem.) Having introduced a loss function, we are in a position to compute the posterior expected loss for each possible action. In principle, we can search for the action that minimises the posterior expected loss. However, this minimisation procedure is computationally costly. Note that the choice of a loss function introduces further subjectivity, in addition to the choice of prior.

4 A solution

In contrast to Stephens (2000) and Jasra et al. (2005), we strongly recommend avoiding the label-switching problem by restricting attention to marginal distributions of only those parameters which are invariant under permutations of the labels of the categories. This point of view was expressed by Gilks (1997) in a contribution to the discussion of Richardson and Green (1997), from which we now quote.

“I am not convinced by the authors’ desire to produce a unique labelling of the groups. It is unnecessary for valid Bayesian inference concerning identifiable quantities; . . . it is only the group members which lend meaning to the individual clusters.”

Unlike assignment probabilities, co-assignment probabilities are invariant under permutations of the labels of the categories. This was one reason why Dawson and Belkhir (2001, 2002) proposed using the posterior co-assignment probabilities as a basis for inferences about the sample partition. Furthermore, in situations where the posterior probabilities of individual partitions are too small (relative to Monte Carlo error) to be estimated reliably, the posterior co-assignment probabilities of many subsets of the label set can be estimated accurately, using a large sample of observations from the posterior distribution.

The availability of posterior co-assignment probabilities for sets of individuals suggests a solution to the visualisation problem. If we have a rooted binary tree with n terminal nodes (external nodes, tips or leaves), each of which is labelled with a distinct element of the label set S of the sample, then each of the n - 1 internal nodes of this tree specifies a subset (containing 2 or more elements) of the label set S. Included among these subsets is the set S itself, which is the set specified by the root node. By making the height of each node equal to the posterior co-assignment probability of the set of individuals defined by that node (the individuals whose labels occupy the terminal nodes that can be reached by ascending the tree from that node), we can present the posterior co-assignment probabilities of n - 1 sets in a format which is easy to view and interpret.

Before proceeding, we need to clarify what is meant by a rooted binary tree, a terminal node, and an internal node. A tree consists of nodes and edges (branches). (A tree is a graph.) It is possible to walk on a tree (stepping from one node to the next only when they are joined by an edge) from any node to every other node. (A tree is a connected graph.) When walking on a tree, it is impossible to return to the original node without travelling back along at least one of the same edges. (A tree is an acyclic connected graph. This can be taken as the definition of a tree.) Every node of a tree is connected to at least one edge. A node which is connected to only one edge is called a terminal node (external node, tip or leaf). A node which is connected to more than one edge is called an internal node. In a rooted tree, one of the internal nodes is designated the root node. A rooted binary (bifurcating or fully resolved) tree is one in which the root node is connected to exactly 2 edges, and every other internal node is connected to exactly 3 edges.

We are now faced with the problem of choosing a rooted binary tree, and hence choosing which subsets of S are to be specified, together with their posterior co-assignment probabilities. Any hierarchical clustering algorithm will automatically pick out one particular hierarchy of sets, which can be represented in the form of a (rooted) binary tree. The computational time required for a divisive hierarchical clustering algorithm (where we begin by assigning all individuals to a single category) grows exponentially with the sample size n, because there are 2^n-1 - 1 distinct bipartitions of a set of n elements (Edwards and Cavalli-Sforza, 1965). In contrast, the computational time required for an agglomerative hierarchical clustering algorithm (where we begin by assigning each individual to a separate category) is only of order n². For these computational reasons an agglomerative hierarchical clustering algorithm is preferable. But which agglomerative hierarchical clustering algorithm should we use? Dawson and Belkhir (2001) introduced the maximin agglomerative algorithm for this purpose. More recently, Corander et al. (2004) applied the complete linkage algorithm to the output form their MCMC sampler (the software package BAPS). The complete linkage algorithm is a special case of the maximin agglomerative algorithm.

Here we recommend an agglomerative hierarchical clustering algorithm (described below) which we refer to as the exact linkage algorithm. The exact linkage algorithm is also a special case of the maximin agglomerative algorithm of Dawson and Belkhir (2001). In the next section, we describe the exact linkage algorithm.

We end this section by noting that, in general, the output from the exact linkage algorithm may be a rooted binary tree in which some internal nodes have height zero, - represented co-assignment probabilities of zero. It is more natural to interpret a rooted binary tree having k nodes at height zero (1 ≤ k < n) as a rooted binary forest made up of k + 1 separate trees. A forest is a graph in which every connected subgraph is a tree. In other words, a forest is composed of one or more trees.

5 The exact linkage algorithm

As with any agglomerative hierarchical clustering algorithm, the exact linkage algorithm begins with n terminal nodes, each of which is labelled with a distinct element of the label set S of the sample. (So the set of terminal nodes can be equated with the label set S.) At each subsequent iteration (t = 1, 2, ...), a new internal node is created, having two existing nodes (terminal or internal) as descendants. In this way we build up a forest of rooted binary trees. (This rooted binary forest may include trivial trees consisting of a single terminal node.) This process of adding internal nodes can continue for a maximum of n - 1 iterations (in which case we have, at the end of iteration n - 1, a single rooted binary tree having n terminal nodes and n - 1 internal nodes). See Figure 1, where we illustrate the exact linkage algorithm with an example for a sample of 5 individuals.

Here we illustrate the exact linkage algorithm with an example for a sample of 5 individuals. At the initialisation step (iteration t = 0) we create 5 terminal nodes, labelled respectively 1, 2, 3, 4 and 5. This is followed by iterations t = 1, 2, 3, 4. At each iteration t (≥ 1) we begin with an *update* step. This is followed by an *acceptance* step at which we create a new internal node, labelled -t (where t is the iteration at which it is created). At the update step we update the set of proposed nodes. This updated set of proposed nodes is formed from all possible pairs of the root nodes available at the start of the present iteration. In Figure 1, every node (The height of a node is equal to the co-assignment probability of the set of individuals defined by the node) is positioned at a height equal to the co-assignment probability of the set defined by that node. The proposed nodes are arranged on the vertical axis to the right of the forest. (The height of a node is equal to the co-assignment probability of the set of individuals defined by the node.) Notice that it is always the highest proposed node which is accepted. At iteration t = 1 the proposed node (4, 5) is accepted and added to the tree, where it is labelled -1. At iteration t = 2 the proposed node (1, 2) is accepted and added to the tree, where it is labelled -2. At iteration t = 3 the proposed node (-1, 3) is accepted and added to the tree, where it is labelled -3. At iteration t = 4 the proposed node (-3, -2) is accepted and added to the tree, where it is labelled -4.

Each internal node is labelled with an integer t (= 1, 2, ..., n - 1), according to the iteration t at which it was created. Each node of this forest defines a set of individuals from the sample. A terminal node defines a set containing only one individual. Every internal node t, defines a set S (t) Inline graphic S. This is the set of terminal nodes that can be reached by ascending the tree, starting from node t.

At any iteration t, a node is said to be a root node if it has not yet become the descendent of another node. Let R (t) denote the set of root nodes at the end of iteration t. At the end of iteration t, every root node in R (t) belongs to a distinct tree in the associated forest, and every tree in the forest has a unique root node which belongs to R (t). In the case of a trivial tree, consisting of a single terminal node, this terminal node is also the unique root node of the tree.

We now state the exact linkage algorithm.

5.1 The exact linkage algorithm

We begin with an initialisation step.

(0) Set the iteration number to be t = 0. Set the height of each terminal node to 1. Set R (0) = S (the set of terminal nodes).

To construct the internal nodes, at each subsequent iteration we repeat the following three steps.

(1) Increase the iteration number by one (from t - 1 to t), then construct a set Q (t) of proposed nodes. This is the set of the nodes which can be constructed by taking a pair (i, j) of (distinct) existing root nodes (i, j Inline graphic R (t - 1), i ≠ j), and joining them together to create a new root node having these two nodes as its descendants.

(2) Calculate the node height p (i, j) associated with each proposed node (i, j) Inline graphic Q (t).

(3) Find the proposed node which is highest, and add this node to the forest. Label this new node t. So we have

p (t) = max_{\underset{i \neq j}{(i, j) : i, j \in R (t - 1)}} {p (i, j)} .

(1)

Repeat steps 1-3 until either p (t) = 0, or (at the end of iteration t = n - 1) only one root node remains.

In the event of ties (multiple proposed nodes having the same height) occurring at step 3, we simply choose one of these highest nodes at random and accept it.

Notice that at each iteration the number of root nodes (and hence trees in the forest) is reduced by one. So, at the end of iteration t we have |R (t)| = n - t root nodes. At the start of iteration t, the number of proposed nodes is

∣ Q (t) ∣ = (\begin{matrix} n - t + 1 \\ 2 \end{matrix}) .

The node height p (i, j) of a proposed node (i, j) is a Monte Carlo estimate of the co-assignment probability II (S (i, j)) of the set S (i, j) = S (i) Inline graphic S (j) defined by this proposed node. The Monte Carlo estimate is obtained from the sample of N observations of the sample partition. The sample is stored as an n × N array, where each line is an observation of the sample partition, represented as a vector of n integer values. Each position i in the vector of n integer values represents an individual (i Inline graphic S), and the integer value at position i is an arbitrary label α, indicating the element of the partition (a subset of S) to which that individual is assigned. To estimate the co-assignment probability II (S (i, j)), we count the number of observations where the set S (i, j) is a subset of an element of the sample partition, and divide this count by the total number of observations N. We indicate this approximate relationship by writing

p (i, j) \approx II (S (i, j)), for i, j \in R (t - 1), i \neq j,

(2)

and

p (t) \approx II (S (t)), for t = 1, 2, \dots, n - 1,

(3)

where S (t) is the set defined by the accepted node t. These posterior co-assignment probabilities are estimated from the output of the Monte Carlo sampler.

Notice that we can construct the set of proposed nodes Q (t) at step 1 of iteration t (≥ 2), by updating the set of proposed nodes Q (t - 1) from the previous iteration (t - 1), as follows. Suppose that the most recent accepted node (labelled t - 1) has descendants d₁ and d₂. First, delete the proposed node (d₁, d₂), along with all proposed nodes (d₁, j), (j, d₂), where j Inline graphic R (t - 2), and j ≠ d₁, d₂. (The total number of nodes to be deleted is 2 (n - t) + 1). Second, insert the nodes (t - 1, j), where j R (t - 2), and j ≠ t - 1. (The total number of nodes to be inserted is n - t). As a consequence, at step 2 of interaction t, we only need to compute the node heights of the n - t new proposed nodes created at step 1.

In Figure 1, steps 1 and 2 taken together, are referred to as the update step, because the proposed nodes are updated as described above. Step 3 is referred to as the acceptance step in Figure 1, because a proposed node is accepted and added to the forest.

It is possible to improve on this version of the exact linkage algorithm by ruling out certain proposed nodes right from the start. The version of the exact linkage algorithm which has now been implemented in the PartitionView software package, has been optimised along these lines.

5.2 The maximin agglomerative algorithm

As already mentioned, the exact linkage algorithm is a special case of the maximin agglomerative algorithm (Dawson and Belkhir, 2001). The maximin agglomerative algorithm is also described in Appendix A (see Supplementary Material).

In the maximin agglomerative algorithm, clusters are constructed so as to maximise the minimum co-assignment probability of subsets of size d within clusters. This is why we have called this algorithm the maximin agglomerative algorithm. In fact, it is a generalisation of the classical complete linkage (or furthest neighbour) algorithm (Sørensen, 1948; McQuitty, 1960; Defays, 1977), - to which it reduces in the case d = 2.

The exact linkage algorithm includes a maximisation step, but no minimisation. So, the name maximin agglomerative algorithm is no longer appropriate. In the complete linkage algorithm the heights of nodes are upper bounds on the co-assignment probabilities for the sets they define, when these contain more than 2 individuals. This is in contrast to the exact linkage algorithm where the heights of nodes are exact (up to Monte Carlo sampling error) co-assignment probabilities for the sets they define (regardless of how many individuals they contain), - hence its name. An alternative more descriptive name for this algorithm is the most probable cluster agglomerative algorithm. This name reflects the fact that at each iteration of the algorithm, we accept that proposed node which defines the cluster having the highest co-assignment probability, out of all the proposed nodes at the iteration in question.

As recognised in Dawson and Belkhir (2001), the tree or forest generated by the exact linkage algorithm captures more of the information from the posterior distribution than any other version of the maximin agglomerative algorithm. However, in our earlier work we implemented the maximin agglomerative algorithm, with dimension d = 2, 3 or 4, for computational reasons. See Appendix B (Supplementary Material) for details.

The exact linkage algorithm is implemented in PartitionView. The maximin agglomerative algorithm (with dimension d = 2, 3 or 4) was implemented in the earlier companion program, called Analyse.exe, which PartitionView.exe replaces.¹

6 Interpretation of the forests generated by the exact linkage algorithm

As we descend the forest (from the terminal nodes towards the root node), passing successively through the nodes t = 1, 2, ..., n - 1, the set S (t) defined by each node has a lower posterior co-assignment probability than the set defined by the preceding nodes. This brings us to the question of how to decide when a node t is sufficiently low that we can confidently reject the claim that all individuals in the set S (t) defined by this node belong to the same category. Perhaps the most obvious method is for the user to choose a threshold value p for the probability of co-assignment. So, for every node t defining a set of individuals S (t) with co-assignment probability p (t) ≥ p , we accept the co-assignment of all the individuals in the set S (t) defined by that node; while for nodes t having p(t) < p, we do not accept the corresponding co-assignments. Ideally, each user would choose a value for the threshold which corresponds to how averse they are to the risk of making false co-assignments. For example, a particularly cautious user might want to choose a threshold of p = 0·9, or p = 0·95, while a less cautious user might be content with a threshold of p = 0·5. If we apply a rule of the above type, with any threshold p, to the sets defined by the nodes of a forest (with terminal nodes corresponding to the elements of the set S), then we can only accept the assignment of an individual i to a sequence of sets A, B, C, ... (|A| < |B| < |C| < . . . < n) if these sets form a nested sequence (i Inline graphic A B C . . . S).

7 Examples

In this section we illustrate the performance of the exact linkage algorithm for classifying individuals into categories, by applying it to the output from Markov chain samplers applied to simulated data sets - where the correct classification is known to us. The software package PartitionView (version 0 and now version 1.0) was originally developed as a companion program to the MCMC sampler program Partition (Dawson and Belkhir, 2001). However, PartitionView can be used to process the output from a Monte Carlo sampler for any probability distribution which has, as a marginal, a probability distribution on the space of partitions of a set (with the obvious proviso that observations of the set partition can be recovered from the output of the Monte carlo sampler). To emphasise this fact, our first example is the classification of 100 individuals into selfing lines, on the basis of output from the Markov chain sampler program IMPPS (Wilson and Dawson, 2007). In this example, the true sample partition has a large number of small elements (many sets of size 1 or 2).

In our second example the simulated data set is a sample of 80 individuals drawn from two source populations (40 individuals from each population). The Markov chain sampler program HWLER Pella and Masuda (2006) was applied to this data set. We use this relatively simple example to illustrate how the rooted binary forests generated by the exact linkage algorithm may differ from those generated by the maximin agglomerative algorithm (with d = 2, 3, and 4).

The rooted binary forest, or tree, generated by PartitionView is encoded in the “Newick 8:45” format adopted by PHYLIP Felsenstein (2004). In the current version of PartitionView, a rooted binary forest consisting of multiple separate trees, is represented as a single rooted tree, where the separate root nodes of the forest are joined together by a sequence of internal nodes which all have hight zero. The root node of the resulting tree is also at hight zero.

7.1 Example 1

Recently, Wilson and Dawson (2007) used the exact linkage algorithm (as implemented in PartitionView version 1.0) to process the output from the Markov chain sampler program IMPPS (Inference of Matrilineal Pedigrees under Partial Selfing). This is a Markov chain sampler for the joint posterior distribution of the pedigree of a sample from a partially selfing population, and the parameters of the population genetic model (including the selfing rate), given the genotypes of the sampled individuals at unlinked marker loci. The population genetic model is appropriate for a hermaphrodite annual species, allowing both selfing and outcrossing. Generations are discrete and non-overlapping, as is appropriate for an annual species. The local population or deme was founded in the recent past (T generations in the past), and the founders were drawn from an infinitely large unstructured source population. Throughout its existence, this deme has remained at a constant size of N diploid individuals. The N founders (the seed immigrants) are sampled independently from the source population. In the current version of IMPPS it is assumed that no seed immigration into the local deme has occurred since the founder generation. The efficient Markov chain Monte carlo strategy employed in the IMPPS program relies on the rather artificial assumption that, for every outcrossing event which occurs within the local deme, the pollen grain is drawn directly from the large source population.

The aspect of the pedigree that we are interested in here is the classification of individuals into selfing lines. Each selfing line is founded by an outcrossed ancestor: the most recent outcrossed ancestor (MROA) of the individuals belonging to that selfing line. The MROAs of a sample are the ancestors of the sample that we encounter if we trace the line of descent of each individual from the sample back up the pedigree until we encounter the first ancestor which is a product of outcrossing. If an individual in the sample is itself a product of outcrossing then that individual is an MROA, and we have no need to trace its ancestry back any further. The classification of individuals into selfing lines induces a partition of the label set of the sample. We refer to this sample partition as the selfing line partition. It is this partition which we want to infer from the genotype data.

The output from the IMPPS Markov chain sampler includes observations of the selfing line partition. From a large sample of such observations we can estimate the co-assignment probability of any set of individuals. Here, the relevant co-assignment probability is the posterior probability that all individuals in the set belong to the same selfing line (in other words, they share the same MROA).

A simulated data set was generated by a genealogical (reverse time) simulation of the model outlined above. The selfing rate was s = 0·8, the local population was of size N = 100, and age T = 5 generations. The local population was exhaustively sampled (sample size n = 100), and the individuals were genotyped at 6 polymorphic marker loci. This simulated data set was one of those analysed in Wilson and Dawson (2007), using the IMPPS Markov chain sampler. The true matrilineal pedigree of this exhaustively sampled of n = 100 individuals is represented in Figure 2. Outcrossing events are indicated by diamonds.

The true matrilineal pedigree of the exhaustive sample of all n = N = 100 individuals from a single local population. Outcrossing events are indicated by diamonds. Those selfing lines which are represented in the sample by more than 2 individuals are colour-coded.

Every partition of a set S induces an integer partition of |S|. The true selfing line partition in our example induces an integer partition of the sample size 100. This integer partition can be written as 1⁴⁰2¹⁴3²4¹5²6², - meaning that there are 40 sets of size 1, 134 sets of size 2, and so on, in the original set partition. This notation for an integer partition is referred to as the frequency representation of the integer partition.

We have colour-coded those true selfing lines which are of size greater than 2, and coloured each node of the forest according to the true selfing line of the set of individuals defined by the node. (All the branches in the clade defined by a node are similarly coloured according to the true selfing line of this set of individuals.) The true selfing lines are colour-coded as follows. The two distinct selfing lines of size 6 are designated respectively the Red line and the Orange line; the two distinct lines of size 5 are designated respectively the Green and Blue lines; the selfing line of size 4 is the Violet line; and the two distinct lines of size 3 are the Purple line and the Sienna line respectively. The 14 selfing lines of size 2, and the 40 selfing lines of size 1, are assigned the colour black. Black is also used as the default colour for any internal nodes that define sets of individuals from multiple selfing lines.

A sample of 60 000 observations of the selfing line partition was generated by pooling three independent runs of the Markov chain sampler. Each run provided 20 000 observations, from the 20 000 iterations (without thinning) following a burn-in of 1000 iterations.

Figure 3 shows the rooted binary forest generated by applying the exact linkage algorithm to the pooled sample of 60 000 observations. In order to illustrate how successfully the true selfing lines have been identified, we have applied the same colour coding to this rooted binary forest as we applied to the true pedigree (in Figure 2).

The rooted binary forest generated by applying the exact linkage algorithm to the output from the IMPPS Markov chain sampler (a pooled sample of 60 000 observations).

If we choose a threshold of p = 0·75 (so that we accept the co-assignment of all the individuals in the set defined by any node t having p(t) ≥ 0·75), then we obtain a selfing line partition which induces an integer partition (of the sample size 100) having the frequency representation 1³⁸2¹⁴3²4¹5¹6²7¹.

The small number of errors in the inferred partition can be summarised as follows. The Blue line (of size 5) is erroneously fused with a set of size 2. In addition, one individual from a set of size 2 has been removed and placed erroneously with an individual form a set of size 1, and two sets of size 1 have been erroneously fused to form a set of size 2. The Red line (of size 6), the Orange line (of size 6), the Green line (of size 5), the Violet line (of size 4), the Purple line (of size 3), and the Sienna line (of size 3), are all present in the inferred partition, as are 12 out of the 14 selfing lines of size 2, and 37 out of the 40 selfing lines of size 1. Note that if we choose a lower threshold, of p = 0·5 for example, only a few additional erroneous agglomerations of selfing lines are accepted.

7.2 Example 2

Our second example illustrates how the rooted binary forests generated by the exact linkage algorithm may differ from those generated by the maximin agglomerative algorithm (with d = 2, 3, and 4). For this purpose we have chosen a simpler example. We reanalyse one data set (data set 4) out of the five simulated data sets reported in Dawson and Belkhir (2001). In our earlier publication, these five data sets where analysed by first using a Markov chain sampler (Partition) to generate a sample of observations from the posterior distribution of the sample partition, and then applying the maximin agglomerative algorithm (with d = 2 and 3) to the output form this Markov chain sampler. These five data sets were reanalysed recently by Pella and Masuda (2006), who applied their Markov chain sampler (HWLER), followed by our maximin agglomerative algorithm (with d = 2), and obtained rooted binary forests having more clearly defined clusters. Recall that the maximin agglomerative algorithm with d = 2 is identical to the complete linkage algorithm (a classical distance-based method).

Data set 4 consists of the genotypes (at 10 unlinked marker loci) of 80 diploid individuals, sampled form a pair of populations (40 individuals from each population) which have diverged under random drift during a period of isolation following the fission of a common ancestral population. (For the details of the weak divergence scenario, see Dawson and Belkhir (2001).) Here we designate the two populations Red and Blue respectively. Pella and Masuda (2006) applied their HWLER program to data set 4, and generated a sample of 20160 observations (with a period of 5 iterations) from the the posterior distribution of the sample partition. They discarded the first 10080 observations (half of the original sample) as burn in. Here we have discarded the first 10000 observations as burn in, leaving a sample of 10160 observations.

We applied the maximin agglomerative algorithm, with d = 2, 3, and 4 (Figures 4, 5 and 6, respectively) to this sample, and used it to generate rooted binary forests. Using the PartitionView software, we also applied the exact linkage algorithm to this same sample, to generate another rooted binary forest (Figure 7).

The rooted binary forest generated by applying the maximin agglomerative algorithm, with d = 2, to the output from the HWLER Markov chain sampler.

The rooted binary forest generated by applying the maximin agglomerative algorithm, with d = 3, to the output from the HWLER Markov chain sampler.

The rooted binary forest generated by applying the maximin agglomerative algorithm, with d = 4, to the output from the HWLER Markov chain sampler.

The rooted binary forest generated by applying the exact linkage algorithm to the output from the HWLER Markov chain sampler.

In the analysis reported by Pella and Masuda (2006), using the maximin agglomerative algorithm with d = 2, the rooted binary forest consisted of two separate trees, defining two well supported clusters, corresponding to the two populations (Red and Blue respectively), with only one individual (a Red individual, labelled 27 in the original analysis of Dawson and Belkhir (2001)) belonging to the wrong cluster. We obtained similar results using the maximin agglomerative algorithm, with d = 2, 3, and 4, and with the exact linkage algorithm (and burn in = 10000).

The forests represented in Figures 5, 6, and 7, each consist of two separate trees, while the forest represented by Figure 4 consists of a single tree, but one whose root node is very close to zero. All these trees contain a node which pairs the same Red individual (labelled 27) with the same Blue individual (labelled and 74 in the original analysis). As d is increased from 2 to 3, and then to 4, the height of the node which defines the cluster corresponding to the Red population (excluding individual 23) falls, as does the height of the node which defines the cluster corresponding to the Blue population (but also including individual 23). In the case of the forests generated by the maximin agglomerative algorithm, with d = 4, and by the exact linkage algorithm, these two nodes are lower still, and there is a particularly sharp drop in node height when we reach the node where the pair of individuals 23 (Red) and 74 (Blue) are joined with the rest of the individuals from the second population. At least in the case of individual 23, the increased skepticism represented by this fall in node height, is a more accurate representation of the true situation. At the same time this forest indicates a greater reluctance to assign two Red individuals (labelled 11 and 24) with the rest of the Red population, and this is not so easily justified.

When we apply the maximin agglomerative algorithm, with different values for the dimension d, to the same sample of observations from a posterior distribution of partitions, we often obtain rooted binary forests having similar topologies (sharing many of the same internal nodes) as here. In such cases we expect the heights of the internal nodes tend to fall as the dimension d is increased, as we can observe in Figures 4, 5, 6 and 7. In particular, the forests generated by the exact linkage algorithm (Figure 7) tend to have the lowest nodes, - thus favouring more cautious co-assignment.

8 Discussion

We were not the first to use (posterior) co-assignment probabilities in the context of Bayesian clustering (Dawson and Belkhir, 2001). In a contribution to the discussion of Richardson and Green (1997) O’Hagan (1997) suggested computing pairwise posterior co-assignment probabilities. Painter (1997) recommended using posterior co-assignment probabilities in an early paper on Bayesian discovery of full-sib families. Emery et al. (2001) also used posterior co-assignment probabilities in Bayesian discovery of full-sib and half-sib families. However, Painter (1997) and Emery et al. (2001) only made use of pairwise co-assignment probabilities, and the visualisation problem which can arise with large samples of individuals was not addressed. In Dawson and Belkhir (2001) we offered a solution to the visualisation problem. We also indicated how to make use of the co-assignment probabilities for sets of any size. However, in practice we only made use of co-assignment probabilities for sets of sizes 2, 3 and 4.

The exact linkage algorithm recommended here, has two obvious advantages over classical hierarchical clustering algorithms. Firstly the node heights in the resulting rooted binary forests have a direct Bayesian interpretation. This is not the case for the node heights which would result from applying various classical hierarchical clustering algorithms, for example, the average linkage algorithm (Sokal and Michener (1958) and Sokal and Sneath (1963), pages 182-185) to the (symmetric) matrix of pairwise co-assignment probabilities (as the similarity matrix). Secondly, the exact linkage algorithm can make use of the additional information which the posterior distribution (of the sample partition) provides about the support for the co-assignment of sets of more than two individuals.

The PartitionView software package can now compute the posterior co-assignment probability of any set of individuals, regardless of weather or not it corresponds to a node of the rooted binary forest, so that the rooted binary forest can be used as a starting point for a more detailed exploration of the posterior distribution of the sample partition. We have also provided an R script for viewing, labelling and colouring the rooted binary forests in the output files (in “Newick 8:45” format) generated by PartionView.

The pairwise posterior co-assignment probabilities, for all pairs of individuals, are also provided in an output file. So any user who wants to use these probabilities as measures of similarity, and apply a classical hierarchical clustering algorithm (such as single linkage Florek et al. (1951); Sibson (1973), complete linkage Sørensen (1948); McQuitty (1960); Defays (1977), or average linkage Sokal and Michener (1958) and Sokal and Sneath (1963), pages 182-185) is free to do so.

Partition View was originally designed to be a companion program to the Markov chain sampler program - Partition. (For more information about the Markov chain sampler implemented in the Partition software, see Dawson and Belkhir (2001).) Since the output file from Partition contains observations of the sample partition, users are free to design their own software for the visual representation of any chosen marginals of the posterior distribution of the sample partition. We would also like to make a plea to others who have developed (or may be planning to develop) their own software for sampling form the posterior distribution of the sample partition, to generate output files which contain complete observations of this type. This would mean that our visualisation software (PartitionView) could be modified to process the output file. More generally, we recommend, as a general design principal for Bayesian inference software, that Monte Carlo samplers should provide output files which contain complete observations from the posterior distribution, and that the visualisation software should be a separate module which can take such a file as input.

Partition View 1.0 is freely available from the the Partition web page: http://www.genetix.univ-montp2.fr/partition/partition.htm.

Supplementary Material

NIHMS3290-supplement-1.pdf^{(93.4KB, pdf)}

Acknowledgements

This work was supported by a BBSRC grant (reference: 206/D16977) to Kevin Dawson, and by the CNRS BIOINFO initiative. Rothamsted Research receives grant-aided support form the BBSRC (the Biotechnology and Biological Sciences Research Council of the United Kingdom). We are grateful to Ian Wilson for help with the IMPPS software package. We are also grateful to Jerry Pella and Michele Masuda for providing us with a number of output files from their program HWLER, one of which was reanalysed here. Kevin Dawson thanks Toby Johnson for useful discussion of some computational issues. We also thank two anonymous referees for their comments on an earlier version of the manuscript.

Footnotes

Analyse.exe was an unfortunate choice of name (as it has already been used in population genetics, and no doubt elsewhere). It might be best to rename this program posthumously as PartitionView0.exe.

References

Aigner M. Combinatorial theory. Springer-Verlag; New York: 1979. [Google Scholar]
Almudevar A, Field C. Inference of single generation sibling relationships based on DNA markers. Journal of Agricultural Biology and Environmental statistics. 1999;4:136–165. [Google Scholar]
Anderson EC, Thompson EA. A model-based method for identifying species hybrids using multilocus genetic data. Genetics. 2002;106:1217–1229. doi: 10.1093/genetics/160.3.1217. [DOI] [PMC free article] [PubMed] [Google Scholar]
Berger JO. Statistical Decision Theory and Bayesian Analysis. 2nd edition Springer-Verlag; New York: 1985. [Google Scholar]
Celeux G, Hurn M, Robert CP. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association. 2000;95:957–970. [Google Scholar]
Chen C, Durand E, Forbes F, François O. Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study. Molecular Ecology Notes. 2007;7:747–756. [Google Scholar]
Corander J, Marttinen P. Bayesian identification of admixture events using multilocus molecular markers. Molecular Ecology. 2006 doi: 10.1111/j.1365-294X.2006.02994.x. [DOI] [PubMed] [Google Scholar]
Corander J, Marttinen P, Mäntyniemi S. A Bayesian method for identification of stock mixtures from molecular marker data. Fishery Bulletin. 2006;104:550–558. [Google Scholar]
Corander J, Waldmann P, Marttinen P, Sillanpää MJ. BAPS2: Enhanced possibilities for the analysis of genetic population structure. Bioinformatics. 2004;20(15):2363–2369. doi: 10.1093/bioinformatics/bth250. [DOI] [PubMed] [Google Scholar]
Corander JC, Waldmann P, Sillanpää MJ. Bayesian anlysis of genetic differentiation between populations. Genetics. 2003;163:367–374. doi: 10.1093/genetics/163.1.367. [DOI] [PMC free article] [PubMed] [Google Scholar]
Coulon A, Fitzpatrick JW, Bowman R, Stith BM, Makarewich CA, Stenzler LM, Lovette IJ. Congruent population structure inferred from dispersal behaviour and intensive genetic surveys of the threatened Florida scrub-jay (Aphelocoma coerulescens) Molecular Ecology. 2008;17:1685–1701. doi: 10.1111/j.1365-294X.2008.03705.x. [DOI] [PubMed] [Google Scholar]
Dawson KJ, Belkhir K. A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genetical Research. 2001;78:59–77. doi: 10.1017/s001667230100502x. [DOI] [PubMed] [Google Scholar]
Dawson KJ, Belkhir K. A Bayesian approach to assignment problems in population genetics: Partition and related software packages; Proceedings of the Seventh World Congress of Genetics Applied to Livestock Production; 2002.pp. 745–746. [Google Scholar]
Defays D. An efficient algorithm for a complete link method. Computer Journal. 1977;20:364–366. [Google Scholar]
Edwards AWF, Cavalli-Sforza LL. A method for cluster analysis. Biometrics. 1965;21:362–375. [PubMed] [Google Scholar]
Emery AM, Wilson IJ, Craig S, Boyle PR, Noble LR. Assignment of paternity groups without access to parental genotypes: Multiple mating and developmental plasticity in squid. Molecular Ecology. 2001;10:1265–1278. doi: 10.1046/j.1365-294x.2001.01258.x. [DOI] [PubMed] [Google Scholar]
Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. doi: 10.1093/genetics/164.4.1567. [DOI] [PMC free article] [PubMed] [Google Scholar]
Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. 2004. Distributed by the author via http://evolution.gs.washington.edu/phylip.html.
Florek K, Lukaszewics J, Perkal J, Steinhaus H, Zubrzycki S. Sur la liaison et la division des points d’un ensemble fini. Colloquium Mathematicum. 1951;2:282–285. [Google Scholar]
François O, Ancelet S, Guillot G. Bayesian clustering using hidden markov random fields in spatial population genetics. Genetics. 2006;174:805–816. doi: 10.1534/genetics.106.059923. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gilks WR. Contribution to discussion of “on Bayesian analysis of mixtures with an unknown number of components”, by: Richardson, s., and green, p. j. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:770–771. [Google Scholar]
Guillot G, Estoup A, Morter F, Cosson J-F. A spatial-statistical model for landscape genetics. Genetics. 2005;170:1261–1280. doi: 10.1534/genetics.104.033803. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hadfield JD, Richardson DS, Burke T. Towards unbiased parentage assignment: combining genetic, behavioural and spatial data in Bayesian framework. Molecular Ecology. 2006;15:3715–3730. doi: 10.1111/j.1365-294X.2006.03050.x. [DOI] [PubMed] [Google Scholar]
Huelsenbeck JP, Andolfatto P. Inference of population structure under a Dirichlet process model genetics. Genetics. 2007;175:1787–1802. doi: 10.1534/genetics.106.061317. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hurn M, Justel A, Robert CP. Estimating mixtures of regressions. Journal of Computational and Graphical Statistics. 2003;12:55–79. [Google Scholar]
Jakobsson M, Rosenberg NA. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis population structure. Bioinformatics. 2007;23(14):1801–1806. doi: 10.1093/bioinformatics/btm233. [DOI] [PubMed] [Google Scholar]
Jasra A, Holmes CC, Stephens DA. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modelling. Statistical Science. 2005;20:50–67. [Google Scholar]
McQuitty LL. Hierarchical linkage analysis for the isolation of types. Educational and Psychological Measurement. 1960;20:55–67. [Google Scholar]
O’Hagan A. Contribution to discussion of “on Bayesian analysis of mixtures with an unknown number of components”, by: Richardson, s., and green, p. j. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:772. [Google Scholar]
Painter I. Sibship reconstruction without parental information. Journal of Agricultural Biology and Environmental statistics. 1997;2:212–229. [Google Scholar]
Pella J, Masuda M. The gibbs and split-merge sampler for population mixture analysis from genetic data with incomplete baselines. Canadian Journal of Fisheries and Aquatic Sciences. 2006;63:576–596. [Google Scholar]
Pritchard JK, Stephens M, Donnelly PJ. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945. [DOI] [PMC free article] [PubMed] [Google Scholar]
Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:731–758. [Google Scholar]
Rosenberg NA. DESTRUCT: a program for the graphical display of population structure. Molecular Ecology Notes. 2004;4:137–138. [Google Scholar]
Sibson R. SLINK: An optimally efficient algorithm for the single-link cluster method. Computer Journal. 1973;16:30–34. [Google Scholar]
Smith BR, Herbinger CM, Merry HR. Accurate partition of individuals into full-sib families from genetic data without parental information. Genetics. 2001;158:1329–1338. doi: 10.1093/genetics/158.3.1329. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 1958;38:1409–1438. [Google Scholar]
Sokal RR, Sneath PHA. Principles of Numerical Taxonomy. W. H. Freeman and Company; San Francisco: 1963. [Google Scholar]
Sørensen TJ. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter / Kongelige Danske Videnskabernes Selskab. 1948;5(4):1–34. [Google Scholar]
Stephens M. Contribution to discussion of “on Bayesian analysis of mixtures with an unknown number of components”, by: Richardson, s., and green, p. j. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:768–769. [Google Scholar]
Stephens M. Dealing with label-switching in mixture models. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 2000;62:795–809. [Google Scholar]
Thomas SC, Hill WG. Estimating quantitative genetic parameters using sibships reconstructed from marker data. Genetics. 2000;155:1961–1972. doi: 10.1093/genetics/155.4.1961. [DOI] [PMC free article] [PubMed] [Google Scholar]
Thomas SC, Hill WG. Sibships reconstruction in hierarchical population structures using markov chain monte carlo techniques. Genetical Research. 2002;79:227–234. doi: 10.1017/s0016672302005669. [DOI] [PubMed] [Google Scholar]
Wang J. Sibships reconstruction from genetic data with typing errors. 2004;166:1963–1979. doi: 10.1534/genetics.166.4.1963. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wasser SK, Shedlock AM, Comstock K, Ostrander EA, Mutayoba B, Stephens M. Assigning african elephant DNA to geographic region of origin: Applications to the ivory trade. Proceedings of the National Academy of Sciences, USA. 2004;101(41):14847–14852. doi: 10.1073/pnas.0403170101. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wilson IJ, Dawson KJ. A Markov chain Monte Carlo strategy for sampling from the joint posterior distribution of pedigrees and population parameters under a Fisher-Wright model with partial selfing. Theoretical Population Biology. 2007;72:436–458. doi: 10.1016/j.tpb.2007.03.002. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material

NIHMS3290-supplement-1.pdf^{(93.4KB, pdf)}

[R1] Aigner M. Combinatorial theory. Springer-Verlag; New York: 1979. [Google Scholar]

[R2] Almudevar A, Field C. Inference of single generation sibling relationships based on DNA markers. Journal of Agricultural Biology and Environmental statistics. 1999;4:136–165. [Google Scholar]

[R3] Anderson EC, Thompson EA. A model-based method for identifying species hybrids using multilocus genetic data. Genetics. 2002;106:1217–1229. doi: 10.1093/genetics/160.3.1217. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Berger JO. Statistical Decision Theory and Bayesian Analysis. 2nd edition Springer-Verlag; New York: 1985. [Google Scholar]

[R5] Celeux G, Hurn M, Robert CP. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association. 2000;95:957–970. [Google Scholar]

[R6] Chen C, Durand E, Forbes F, François O. Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study. Molecular Ecology Notes. 2007;7:747–756. [Google Scholar]

[R7] Corander J, Marttinen P. Bayesian identification of admixture events using multilocus molecular markers. Molecular Ecology. 2006 doi: 10.1111/j.1365-294X.2006.02994.x. [DOI] [PubMed] [Google Scholar]

[R8] Corander J, Marttinen P, Mäntyniemi S. A Bayesian method for identification of stock mixtures from molecular marker data. Fishery Bulletin. 2006;104:550–558. [Google Scholar]

[R9] Corander J, Waldmann P, Marttinen P, Sillanpää MJ. BAPS2: Enhanced possibilities for the analysis of genetic population structure. Bioinformatics. 2004;20(15):2363–2369. doi: 10.1093/bioinformatics/bth250. [DOI] [PubMed] [Google Scholar]

[R10] Corander JC, Waldmann P, Sillanpää MJ. Bayesian anlysis of genetic differentiation between populations. Genetics. 2003;163:367–374. doi: 10.1093/genetics/163.1.367. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Coulon A, Fitzpatrick JW, Bowman R, Stith BM, Makarewich CA, Stenzler LM, Lovette IJ. Congruent population structure inferred from dispersal behaviour and intensive genetic surveys of the threatened Florida scrub-jay (Aphelocoma coerulescens) Molecular Ecology. 2008;17:1685–1701. doi: 10.1111/j.1365-294X.2008.03705.x. [DOI] [PubMed] [Google Scholar]

[R12] Dawson KJ, Belkhir K. A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genetical Research. 2001;78:59–77. doi: 10.1017/s001667230100502x. [DOI] [PubMed] [Google Scholar]

[R13] Dawson KJ, Belkhir K. A Bayesian approach to assignment problems in population genetics: Partition and related software packages; Proceedings of the Seventh World Congress of Genetics Applied to Livestock Production; 2002.pp. 745–746. [Google Scholar]

[R14] Defays D. An efficient algorithm for a complete link method. Computer Journal. 1977;20:364–366. [Google Scholar]

[R15] Edwards AWF, Cavalli-Sforza LL. A method for cluster analysis. Biometrics. 1965;21:362–375. [PubMed] [Google Scholar]

[R16] Emery AM, Wilson IJ, Craig S, Boyle PR, Noble LR. Assignment of paternity groups without access to parental genotypes: Multiple mating and developmental plasticity in squid. Molecular Ecology. 2001;10:1265–1278. doi: 10.1046/j.1365-294x.2001.01258.x. [DOI] [PubMed] [Google Scholar]

[R17] Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. doi: 10.1093/genetics/164.4.1567. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. 2004. Distributed by the author via http://evolution.gs.washington.edu/phylip.html.

[R19] Florek K, Lukaszewics J, Perkal J, Steinhaus H, Zubrzycki S. Sur la liaison et la division des points d’un ensemble fini. Colloquium Mathematicum. 1951;2:282–285. [Google Scholar]

[R20] François O, Ancelet S, Guillot G. Bayesian clustering using hidden markov random fields in spatial population genetics. Genetics. 2006;174:805–816. doi: 10.1534/genetics.106.059923. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Gilks WR. Contribution to discussion of “on Bayesian analysis of mixtures with an unknown number of components”, by: Richardson, s., and green, p. j. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:770–771. [Google Scholar]

[R22] Guillot G, Estoup A, Morter F, Cosson J-F. A spatial-statistical model for landscape genetics. Genetics. 2005;170:1261–1280. doi: 10.1534/genetics.104.033803. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Hadfield JD, Richardson DS, Burke T. Towards unbiased parentage assignment: combining genetic, behavioural and spatial data in Bayesian framework. Molecular Ecology. 2006;15:3715–3730. doi: 10.1111/j.1365-294X.2006.03050.x. [DOI] [PubMed] [Google Scholar]

[R24] Huelsenbeck JP, Andolfatto P. Inference of population structure under a Dirichlet process model genetics. Genetics. 2007;175:1787–1802. doi: 10.1534/genetics.106.061317. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] Hurn M, Justel A, Robert CP. Estimating mixtures of regressions. Journal of Computational and Graphical Statistics. 2003;12:55–79. [Google Scholar]

[R26] Jakobsson M, Rosenberg NA. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis population structure. Bioinformatics. 2007;23(14):1801–1806. doi: 10.1093/bioinformatics/btm233. [DOI] [PubMed] [Google Scholar]

[R27] Jasra A, Holmes CC, Stephens DA. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modelling. Statistical Science. 2005;20:50–67. [Google Scholar]

[R28] McQuitty LL. Hierarchical linkage analysis for the isolation of types. Educational and Psychological Measurement. 1960;20:55–67. [Google Scholar]

[R29] O’Hagan A. Contribution to discussion of “on Bayesian analysis of mixtures with an unknown number of components”, by: Richardson, s., and green, p. j. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:772. [Google Scholar]

[R30] Painter I. Sibship reconstruction without parental information. Journal of Agricultural Biology and Environmental statistics. 1997;2:212–229. [Google Scholar]

[R31] Pella J, Masuda M. The gibbs and split-merge sampler for population mixture analysis from genetic data with incomplete baselines. Canadian Journal of Fisheries and Aquatic Sciences. 2006;63:576–596. [Google Scholar]

[R32] Pritchard JK, Stephens M, Donnelly PJ. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:731–758. [Google Scholar]

[R34] Rosenberg NA. DESTRUCT: a program for the graphical display of population structure. Molecular Ecology Notes. 2004;4:137–138. [Google Scholar]

[R35] Sibson R. SLINK: An optimally efficient algorithm for the single-link cluster method. Computer Journal. 1973;16:30–34. [Google Scholar]

[R36] Smith BR, Herbinger CM, Merry HR. Accurate partition of individuals into full-sib families from genetic data without parental information. Genetics. 2001;158:1329–1338. doi: 10.1093/genetics/158.3.1329. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 1958;38:1409–1438. [Google Scholar]

[R38] Sokal RR, Sneath PHA. Principles of Numerical Taxonomy. W. H. Freeman and Company; San Francisco: 1963. [Google Scholar]

[R39] Sørensen TJ. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter / Kongelige Danske Videnskabernes Selskab. 1948;5(4):1–34. [Google Scholar]

[R40] Stephens M. Contribution to discussion of “on Bayesian analysis of mixtures with an unknown number of components”, by: Richardson, s., and green, p. j. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1997;59:768–769. [Google Scholar]

[R41] Stephens M. Dealing with label-switching in mixture models. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 2000;62:795–809. [Google Scholar]

[R42] Thomas SC, Hill WG. Estimating quantitative genetic parameters using sibships reconstructed from marker data. Genetics. 2000;155:1961–1972. doi: 10.1093/genetics/155.4.1961. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R43] Thomas SC, Hill WG. Sibships reconstruction in hierarchical population structures using markov chain monte carlo techniques. Genetical Research. 2002;79:227–234. doi: 10.1017/s0016672302005669. [DOI] [PubMed] [Google Scholar]

[R44] Wang J. Sibships reconstruction from genetic data with typing errors. 2004;166:1963–1979. doi: 10.1534/genetics.166.4.1963. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] Wasser SK, Shedlock AM, Comstock K, Ostrander EA, Mutayoba B, Stephens M. Assigning african elephant DNA to geographic region of origin: Applications to the ivory trade. Proceedings of the National Academy of Sciences, USA. 2004;101(41):14847–14852. doi: 10.1073/pnas.0403170101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R46] Wilson IJ, Dawson KJ. A Markov chain Monte Carlo strategy for sampling from the joint posterior distribution of pedigrees and population parameters under a Fisher-Wright model with partial selfing. Theoretical Population Biology. 2007;72:436–458. doi: 10.1016/j.tpb.2007.03.002. [DOI] [PubMed] [Google Scholar]

PERMALINK

An agglomerative hierarchical clustering approach to visualisation in Bayesian clustering problems

Kevin J Dawson

Khalid Belkhir

Abstract

1 Introduction

2 The visualisation problem

3 The label switching problem

4 A solution

5 The exact linkage algorithm