Published in final edited form as: J Am Stat Assoc. 2023 Sep 29;119(547):2140–2153. doi: 10.1080/01621459.2023.2250098

Spectral Clustering, Bayesian Spanning Forest, and Forest Process

Leo L Duan a,*, Arkaprava Roy b; Alzheimer’s Disease Neuroimaging Initiative

Abstract

Spectral clustering views the similarity matrix as a weighted graph, and partitions the data by minimizing a graph-cut loss. Since it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster. As a result, one reduces the chance of model misspecification, which is often a risk in mixture model-based clustering. Nevertheless, compared to the latter, spectral clustering has no direct ways of quantifying the clustering uncertainty (such as the assignment probability), or allowing easy model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors, as the ones used by normalized spectral clustering. To induce a distribution for the forest, we develop a “forest process” as a graph extension to the urn process, while we carefully characterize the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation, and demonstrate superior performance compared to existing algorithms. We illustrate several model-based extensions useful for data applications, including high-dimensional and multi-view clustering for images.

Keywords: Graphical Model Clustering, Model-based Clustering, Normalized Graph-cut, Partition Probability Function

1. Introduction

Clustering aims to partition data $y_1,\dots,y_n$ into disjoint groups. There is a large literature ranging from various algorithms such as K-means and DBSCAN (MacQueen, 1967; Ester et al., 1996; Frey and Dueck, 2007) to mixture model-based approaches [reviewed by Fraley and Raftery (2002)]. In the Bayesian community, model-based approaches are especially popular. To roughly summarize the idea, we view each $y_i$ as generated from a distribution $\mathcal{K}(\theta_i)$, where $(\theta_1,\dots,\theta_n)$ are drawn from a discrete distribution $\sum_{k=1}^K w_k \delta_{\theta_k^*}(\cdot)$, with $w_k$ as the probability weight, and $\delta_{\theta_k^*}$ as a point mass at $\theta_k^*$. With prior distributions, we could estimate all the unknown parameters (the $\theta_k^*$'s, the $w_k$'s, and $K$) from the posterior.

Model-based clustering has two important advantages. First, it allows important uncertainty quantification such as the cluster-assignment probability $\Pr(c_i = k \mid y_i)$, a probabilistic estimate that $y_i$ comes from the $k$th cluster (that is, $c_i = k$ when $\theta_i = \theta_k^*$). Different from commonly seen asymptotic results in statistical estimation, the clustering uncertainty does not always vanish even as $n \to \infty$. For example, in a two-component Gaussian mixture model with equal covariance, for a point $y_i$ at nearly equal distances to the two cluster centers, we would have both $\Pr(c_i = 1 \mid y_i)$ and $\Pr(c_i = 2 \mid y_i)$ close to 50% even as $n \to \infty$. For a recent discussion on this topic as well as how to quantify the partition uncertainty, see Wade and Ghahramani (2018) and the references within. Second, model-based clustering can be easily extended to handle more complicated modeling tasks. Specifically, since there is a probabilistic process associated with the clustering, it is straightforward to modify it to include useful dependency structures. We list a few examples from a rich literature: Ng et al. (2006) used a mixture model with random effects to cluster correlated gene-expression data; Müller and Quintana (2010), Park and Dunson (2010), and Ren et al. (2011) allowed the partition to vary according to some covariates; Guha and Baladandayuthapani (2016) simultaneously clustered the predictors and used them in high-dimensional regression.

On the other hand, model-based clustering has its limitations. Primarily, one needs to carefully specify the density/mass function $\mathcal{K}$; otherwise the results can be misleading and difficult to interpret. For example, Coretto and Hennig (2016) demonstrated the sensitivity of the Gaussian mixture model to non-Gaussian contaminants, and Miller and Dunson (2018) and Cai et al. (2021) showed that when the distribution family of $\mathcal{K}$ is misspecified, the number of clusters can be severely overestimated. It is natural to think of using a more flexible parameterization for $\mathcal{K}$, in order to mitigate the risk of model misspecification. This has motivated many interesting works, such as modeling $\mathcal{K}$ via skewed distributions (Frühwirth-Schnatter and Pyne, 2010; Lee and McLachlan, 2016), unimodal distributions (Rodríguez and Walker, 2014), copulas (Kosmidis and Karlis, 2016), and mixtures of mixtures (Malsiner-Walli et al., 2017), among others. Nevertheless, as the flexibility of $\mathcal{K}$ increases, the modeling and computational burdens also increase dramatically.

In parallel to the above advancements in model-based clustering, spectral clustering has become very popular in machine learning and statistics. Von Luxburg (2007) provided a useful tutorial on the algorithms and a review of recent works. On clustering point estimation, spectral clustering has shown good empirical performance for separating non-Gaussian and/or manifold data, without the need to directly specify the distribution for each cluster. Instead, one calculates a matrix of similarity scores between each pair of data, then uses a simple algorithm to find a partition that approximately minimizes the total loss of similarity scores across clusters (adjusted with respect to cluster sizes). This point estimate is found to be not very sensitive to the choice of similarity score, and empirical solutions have been proposed for tuning the similarity and choosing the number of clusters (Zelnik-Manor and Perona, 2005; Shi et al., 2009). There is a rapidly growing literature of frequentist methods on further improving the point estimate [ Chi et al. (2007); Rohe et al. (2011); Kumar et al. (2011); Lei and Rinaldo (2015); Han et al. (2021); Lei and Lin (2022); among others], although, in this article, we focus on the Bayesian perspective and aim to characterize the probability distribution.

Due to the algorithmic nature, spectral clustering cannot be directly used in model-based extension, or produce uncertainty quantification. This has motivated a large Bayesian literature. There have been several works trying to quantify the uncertainty around the spectral clustering point estimate. For example, since the spectral clustering algorithm can be used to estimate the community memberships in a stochastic block model, one could transform the data into a similarity matrix, then treat it as if generated from a Bayesian stochastic block model (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001; McDaid et al., 2013; Geng et al., 2019). Similarly, one could take the Laplacian matrix (a transform of the similarity used in spectral clustering) or its spectral decomposition, and model it in a probabilistic framework (Socher et al., 2011; Duan et al., 2023).

Broadly speaking, we can view these works as following the recent trend of robust Bayesian methodology, in conditioning the parameter of interest (clustering) on an insufficient statistic (pairwise summary statistics) of the data. See Lewis et al. (2021) for recent discussions. Pertaining to Bayesian robust clustering, one gains model robustness by avoiding putting any parametric assumption on within-cluster distribution 𝒦(θk); instead, one models the pairwise information that often has an arguably simple distribution. Recent works include the distance-based Pólya urn process (Blei and Frazier, 2011; Socher et al., 2011), Dirichlet process mixture model on Laplacian eigenmaps (Banerjee et al., 2015), Bayesian distance clustering (Duan and Dunson, 2021a), generalized Bayes extension of product partition model (Rigon et al., 2023).

This article follows this trend. Instead of modeling the $y_i$'s as conditionally independent (or jointly dependent) given a certain within-cluster distribution $\mathcal{K}(\theta_k^*)$, we choose to model $y_i$ as dependent on another point $y_j$ that is close by, provided $y_i$ and $y_j$ are from the same cluster. This leads to a Markov graphical model based on a spanning forest, a graph consisting of multiple disjoint spanning trees (each tree being a connected subgraph without cycles). The spanning forest itself is not new to statistics. There has been a large literature on using spanning trees and forests for graph estimation, such as Meila and Jordan (2000); Meilă and Jaakkola (2006); Edwards et al. (2010); Byrne and Dawid (2015); Duan and Dunson (2021b); Luo et al. (2021). Nevertheless, a key difference between graph estimation and graph-based clustering is that the former aims to recover both the node partition and the edges characterizing dependencies, while the latter focuses only on estimating the node partition (equivalent to clustering). Therefore, a distinction of our study is that we treat the edges as a nuisance parameter/latent variable, while characterizing the node partition in the marginal distribution.

Importantly, we formally show that by marginalizing out the randomness of the edges, the point estimate of the node partition is provably close to the one from the normalized spectral clustering algorithm. As a result, the spanning forest model can serve as the probabilistic model for the spectral clustering algorithm — this relationship is analogous to the one between the Gaussian mixture model and the K-means algorithm (MacQueen, 1967). Further, we show that treating the spanning forest as random, as opposed to a fixed but unknown parameter, leads to much less sensitivity in clustering performance, compared to cutting the minimum spanning tree (Gower and Ross, 1969). For the distribution specification of the nodes and edges, we take a Bayesian nonparametric approach by considering the forest model as realized from a "forest process" — each cluster is initiated with a point from a root distribution, then gradually grown with new points from a leaf distribution. We characterize the key differences in the partition distribution between the forest and classic Pólya urn processes. This difference also reveals that extra care should be exerted during model specification when using graphical models for clustering. Lastly, by establishing the probabilistic model counterpart for spectral clustering, we show how such models can be easily extended to incorporate other dependency structures. We demonstrate several extensions, including a multi-subject clustering of brain networks and a high-dimensional clustering of photo images.

2. Method

2.1. Background on Spectral Clustering Algorithms

We first provide a brief review of spectral clustering algorithms. For data $y_1,\dots,y_n$, let $A_{i,j} \ge 0$ be a similarity score between $y_i$ and $y_j$, and denote the degree by $D_{i,i} = \sum_{j \ne i} A_{i,j}$. To partition the data index set $(1,\dots,n)$ into $K$ sets, $\mathcal{V} = (V_1,\dots,V_K)$, we want to solve the following problem:

\min_{\mathcal{V}} \sum_{k=1}^{K} \frac{\sum_{i \in V_k,\, j \notin V_k} A_{i,j}}{\sum_{i \in V_k} D_{i,i}}.    (1)

This is known as the minimum normalized cut loss. The numerator above represents the across-cluster similarity lost by cutting $V_k$ off from the others, and the denominator prevents trivial solutions that form tiny clusters with small $\sum_{i \in V_k} D_{i,i}$.

This optimization problem is combinatorial, which has motivated approximate solutions such as spectral clustering. To start, using the Laplacian matrix $L = D - A$, with $D$ the diagonal matrix of the $D_{i,i}$'s, and the normalized Laplacian $N = D^{-1/2} L D^{-1/2}$, we can equivalently solve the above problem via:

\min_{\mathcal{V}} \ \mathrm{tr}(Z_{\mathcal{V}}^{\top} N Z_{\mathcal{V}}),

where $Z_{\mathcal{V}:i,k} = 1(i \in V_k)\sqrt{D_{i,i} / \sum_{j \in V_k} D_{j,j}}$. It is not hard to verify that $Z_{\mathcal{V}}^{\top} Z_{\mathcal{V}} = I_K$. We can obtain a relaxed minimizer over $\{Z : Z^{\top} Z = I_K\}$ by simply taking $\hat{Z}$ as the bottom $K$ eigenvectors of $N$ (with the minimum loss equal to the sum of the smallest $K$ eigenvalues). Afterward, we cluster the rows of $\hat{Z}$ into $K$ groups (using algorithms such as K-means), hence producing an approximate solution to (1).
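To make the procedure concrete, below is a minimal sketch of normalized spectral clustering in Python, assuming a Gaussian similarity with a user-chosen bandwidth; this similarity is an illustrative placeholder rather than the specification used later in this article.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_spectral_clustering(y, K, bandwidth=1.0):
    # Pairwise similarity A_{i,j} = exp(-||y_i - y_j||^2 / (2 * bandwidth^2)).
    sq_dists = ((y[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(A, 0.0)

    D = A.sum(axis=1)                          # degrees D_{i,i}
    L = np.diag(D) - A                         # unnormalized Laplacian L = D - A
    D_inv_sqrt = np.diag(1.0 / np.sqrt(D))
    N = D_inv_sqrt @ L @ D_inv_sqrt            # normalized Laplacian N = D^{-1/2} L D^{-1/2}

    # Bottom K eigenvectors of N (eigh returns eigenvalues in ascending order).
    _, eigvecs = eigh(N)
    Z_hat = eigvecs[:, :K]

    # Cluster the rows of Z_hat to obtain an approximate normalized-cut solution.
    return KMeans(n_clusters=K, n_init=10).fit_predict(Z_hat)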

To clarify, there is more than one version of the spectral clustering algorithm. An alternative to (1) is the "minimum ratio cut", which replaces the denominator $\sum_{i \in V_k} D_{i,i}$ by the cluster size $|V_k|$. Similarly, a continuous relaxation can be obtained by following the same procedure as above, except that one clusters the eigenvectors of the unnormalized $L$. Details comparing these two versions can be found in Von Luxburg (2007). In this article, we focus on the version based on (1) and the normalized Laplacian matrix $N$, which is commonly referred to as "normalized spectral clustering".

2.2. Probabilistic Model via Bayesian Spanning Forest

The next question is whether there is a partition-based generative model for $y$ whose maximum likelihood estimate (or posterior mode, in the Bayesian framework) is almost the same as the point estimate from normalized spectral clustering.

We find an almost-equivalent counterpart in the spanning forest model. A spanning forest model is a special Bayesian network that describes the conditional dependencies among $y_1,\dots,y_n$. Given a partition $\mathcal{V} = (V_1,\dots,V_K)$ of the data index set $(1,\dots,n)$, consider a forest graph $F_{\mathcal{V}} = (T_1,\dots,T_K)$, with each $T_k = (V_k, E_k)$ a component tree (a connected subgraph without cycles), $V_k$ the set of nodes, and $E_k$ the set of edges among $V_k$. Using $F_{\mathcal{V}}$ and a set of root nodes $R_{\mathcal{V}} = (1^*,\dots,K^*)$ with $k^* \in V_k$, we can form a graphical model with a conditional likelihood given the forest:

\mathcal{L}(y; \mathcal{V}, F_{\mathcal{V}}, R_{\mathcal{V}}, \theta) = \prod_{k=1}^{K} \Big[ r(y_{k^*}; \theta) \prod_{(i,j) \in T_k} f(y_i \mid y_j; \theta) \Big],    (2)

where we refer to $r(\cdot; \theta)$ as a "root" distribution and $f(\cdot \mid y_j; \theta)$ as a "leaf" distribution; we use $\theta$ to denote the other parameters, and the simplified notation $(i,j) \in G$ to mean that $(i,j)$ is an edge of the graph $G$. Figure 1 illustrates the high flexibility of a spanning forest in representing clusters. It shows sampled spanning forests based on three clustering benchmark datasets. Note that some clusters are not elliptical or convex in shape. Rather, each cluster can be imagined as formed by connecting each point to another nearby point. In the Supplementary Materials S4.8, we show two different realizations of the spanning forest.

Fig. 1. Three examples of clusters that can be represented by a spanning forest.

Remark 1. To clarify, point estimation of a spanning forest (as some fixed and unknown graph) has been studied (Gower and Ross, 1969). However, a distinction here is that we consider $\mathcal{V}$ as the parameter of interest, but the edges and roots $(F_{\mathcal{V}}, R_{\mathcal{V}})$ as latent variables. The performance differences are shown in the Supplementary Materials S4.6.

The stochastic view of $(F_{\mathcal{V}}, R_{\mathcal{V}})$ is important, as it allows us to incorporate the uncertainty of the edges and avoids the sensitivity issue of a point graph estimate. Equivalently, our clustering model is based on the marginal likelihood that varies with the node partition $\mathcal{V}$:

\mathcal{L}(y; \mathcal{V}, \theta) = \sum_{F_{\mathcal{V}}, R_{\mathcal{V}}} \mathcal{L}(y; \mathcal{V}, F_{\mathcal{V}}, R_{\mathcal{V}}, \theta)\, \Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V}),    (3)

where $\Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V})$ is the latent variable distribution that we will specify in the next section. We can quantify the marginal connecting probability for each potential edge $(i,j)$:

M_{i,j} \equiv \Pr[F_{\mathcal{V}} \ni (i,j)] \propto \sum_{\mathcal{V}} \sum_{F_{\mathcal{V}}, R_{\mathcal{V}}} 1[(i,j) \in F_{\mathcal{V}}]\, \mathcal{L}(y; \mathcal{V}, F_{\mathcal{V}}, R_{\mathcal{V}}, \theta)\, \Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V}).    (4)

Similar to the normalized graph cut, there is no closed-form solution for directly maximizing (3). However, a closed form does exist for (4) (see Section 4). Therefore, an approximate maximizer of (3), $\hat{\mathcal{V}}$, can be obtained by computing the matrix $M$ and searching for $K$ diagonal blocks (after row and column index permutation) that contain the highest total values of the $M_{i,j}$'s. Specifically, we can extract the top leading eigenvectors of $M$ and cluster the rows into $K$ groups.

This approximate marginal likelihood maximizer produces almost the same estimate as normalized spectral clustering does, because the two sets of eigenvectors are almost the same. Further, it is important to clarify that such closeness does not depend on how the data are really generated. Therefore, to provide some numerical evidence, for simplicity, we generate $y_i$ from a simple three-component Gaussian mixture in $\mathbb{R}^2$ with means at $(0,0)$, $(2,2)$, $(4,4)$ and covariance $I_2$ for each component. Figure 2 compares the eigenvectors of the matrix $M$ and the normalized Laplacian $N$ (which uses $f$ and $r$ to specify $A$, with details provided in Section 4). Clearly, these two are almost identical in values. Due to this connection, the clustering estimate from spectral clustering can be viewed as an approximate estimate for $\mathcal{V}$ in (3).

Fig. 2. Comparing the eigenvectors of the marginal connecting probability matrix $M$ and those of the normalized Laplacian $N$.

We now fully specify the Bayesian forest model. For simplicity, we focus on continuous $y_i \in \mathbb{R}^p$. For ease of computation, we recommend choosing $f$ as a symmetric function, $f(y_i \mid y_j; \theta) = f(y_j \mid y_i; \theta)$, so that the likelihood is invariant to the direction of each edge; and choosing $r$ as a diffuse density, so that the likelihood is less sensitive to the choice of a node as root. In this article, we choose a Gaussian density for $f$ and a Cauchy density for $r$:

f(y_i \mid y_j; \theta) = (2\pi\sigma_{i,j})^{-p/2} \exp\Big\{ -\frac{\|y_i - y_j\|_2^2}{2\sigma_{i,j}} \Big\}, \qquad r(y_i; \theta) = \frac{\Gamma[(1+p)/2]}{\gamma^p \pi^{(1+p)/2}} \frac{1}{(1 + \|y_i - \mu\|_2^2/\gamma^2)^{(1+p)/2}},    (5)

where $\sigma_{i,j} > 0$ and $\gamma > 0$ are scale parameters. As the magnitudes of distances between neighboring points may differ significantly from cluster to cluster, we use a local parameterization $\sigma_{i,j} = \tilde\sigma_i \tilde\sigma_j$, and will regularize $(\tilde\sigma_1,\dots,\tilde\sigma_n)$ via a hyper-prior.
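To make this specification concrete, here is a minimal sketch (in Python) of the log leaf and root densities in (5) and the conditional log-likelihood (2) given a forest; the edge-list and root-list representation of the forest is our own illustrative convention, not the paper's data structure.

import numpy as np
from scipy.special import gammaln

def log_leaf(y_i, y_j, sigma_ij):
    # Gaussian leaf density f(y_i | y_j; sigma_{i,j}) from (5), on the log scale.
    p = y_i.shape[0]
    return -0.5 * p * np.log(2 * np.pi * sigma_ij) - np.sum((y_i - y_j) ** 2) / (2 * sigma_ij)

def log_root(y_i, mu, gamma):
    # Multivariate Cauchy root density r(y_i; theta) from (5), on the log scale.
    p = y_i.shape[0]
    return (gammaln((1 + p) / 2) - p * np.log(gamma) - (1 + p) / 2 * np.log(np.pi)
            - (1 + p) / 2 * np.log1p(np.sum((y_i - mu) ** 2) / gamma ** 2))

def forest_loglik(y, roots, edges, sigma_tilde, mu, gamma):
    # Conditional log-likelihood (2): one root term per tree, one leaf term per edge.
    ll = sum(log_root(y[k], mu, gamma) for k in roots)
    ll += sum(log_leaf(y[i], y[j], sigma_tilde[i] * sigma_tilde[j]) for (i, j) in edges)
    return ll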

Remark 2. In (5), we effectively use the Euclidean distance $\|y_i - y_j\|_2$. We focus on the Euclidean distance in the main text, for simplicity of presentation and to allow a complete specification of priors. One can replace the Euclidean distance with others, such as the Mahalanobis distance or a geodesic distance. We present a case of high-dimensional clustering based on the geodesic distance on the unit sphere in the Supplementary Materials S1.1.

2.3. Forest Process and Product Partition Prior

To simplify notation and to facilitate computation, we now introduce an auxiliary node 0 that connects to all roots $(1^*,\dots,K^*)$. As a result, the model can be equivalently represented by a spanning tree rooted at 0:

\mathcal{T} = (V_{\mathcal{T}}, E_{\mathcal{T}}), \quad V_{\mathcal{T}} = \{0\} \cup V_1 \cup \cdots \cup V_K, \quad E_{\mathcal{T}} = \{(0, 1^*),\dots,(0, K^*)\} \cup E_1 \cup \cdots \cup E_K.

In this section, we focus on the distribution specification for $\mathcal{T}$. The distribution, denoted by $\Pi(\mathcal{T})$, can be factorized according to the following hierarchy: picking the number of partition sets $K$, partitioning the nodes into $(V_1,\dots,V_K)$, then forming the edges $E_k$ and picking one root $k^*$ for each $V_k$. To be clear on the nomenclature, we call $\Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V})$ the "latent variable distribution" and $\Pi_0(\mathcal{V})$ the "partition prior".

\Pi(\mathcal{T}) = \underbrace{\{\Pi_0(K)\, \Pi_0(V_1,\dots,V_K \mid K)\}}_{\Pi_0(\mathcal{V})} \; \prod_{k=1}^{K} \underbrace{\{\Pi(E_k \mid V_k)\, \Pi(k^* \mid E_k, V_k)\}}_{\Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V})}.    (6)

Remark 3. In the Bayesian nonparametric literature, $\Pi_0(K)\, \Pi_0(V_1,\dots,V_K \mid K)$ is known as the partition probability function, which plays the key role in controlling cluster sizes and the number of clusters in model-based clustering. However, when it comes to graphical model-based clustering (such as our forest model), it is important to note the difference: for each partition set $V_k$, there is an additional probability $\Pi(E_k, k^* \mid V_k)$ due to the multiplicity of all possible subgraphs formed among the nodes in $V_k$.

For simplicity, we will use a discrete uniform distribution for $\Pi(E_k, k^* \mid V_k)$. Since there are $n_k^{(n_k - 2)_+}$ possible spanning trees on $n_k$ nodes [where $(x)_+ = x$ if $x > 0$, and $0$ otherwise] and $n_k$ possible choices of root, we have $\Pi(E_k, k^* \mid V_k) = n_k^{-(n_k - 1)}$.

We now discuss two different ways to complete the distribution specification. We first take a "ground-up" approach by viewing $\mathcal{T}$ as arising from a stochastic process in which the number of nodes $n$ could grow indefinitely. Starting from the first edge $e_1 = (0, 1)$, we sequentially draw new edges and add them to $\mathcal{T}$, according to

e_i \mid e_1,\dots,e_{i-1} \sim \sum_{j=1}^{i-1} \pi_j^{[i]} \delta_{(j,i)}(\cdot) + \pi_i^{[i]} \delta_{(0,i)}(\cdot), \qquad y_i \mid (j, i) \sim 1(j \ge 1)\, f(\cdot \mid y_j) + 1(j = 0)\, r(\cdot),    (7)

with some probability vector $(\pi_1^{[i]},\dots,\pi_i^{[i]})$ that adds up to one. We refer to (7) as a forest process. The forest process is a generalization of the Pólya urn process (Blackwell and MacQueen, 1973). For the latter, $e_i = (j, i)$ would make node $i$ take the same value as node $j$, $y_i = y_j$ [although in model-based clustering, one would use the notation $\theta_i = \theta_j$, and $y_i \sim \mathcal{K}(\theta_i)$]; $e_i = (0, i)$ would make node $i$ draw a new value for $y_i$ from the base distribution. Due to this relationship, we can borrow popular parameterizations for $\pi_j^{[i]}$ from the urn process literature. For example, we can use the Chinese restaurant process parameterization $\pi_j^{[i]} = 1/(i - 1 + \alpha)$ for $j = 1,\dots,(i-1)$, and $\pi_i^{[i]} = \alpha/(i - 1 + \alpha)$ with some chosen $\alpha > 0$. After marginalizing over the order of $i$ and the partition index [see Miller (2019) for a simplified proof of the partition function], we obtain:

\Pi(\mathcal{T}) = \frac{\alpha^K \Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{k=1}^{K} \Gamma(n_k)\, n_k^{-(n_k - 1)}.    (8)

Compared to the partition probability prior in the Chinese restaurant process, we have an additional $n_k^{-(n_k - 1)}$ term that corresponds to the conditional prior weight for each possible $(k^*, E_k)$ given a partition set $V_k$.

To help understand the effect of this additional term on the posterior, we can imagine two extreme possibilities for the conditional likelihood given a $V_k$. If the conditional likelihood $\mathcal{L}(y_i: i \in V_k \mid k^*, E_k)$ is skewed toward one particular choice of tree $(\hat{k}^*, \hat{E}_k)$ [that is, it is large when $(k^*, E_k) = (\hat{k}^*, \hat{E}_k)$, but close to zero for other values of $(k^*, E_k)$], then $n_k^{-(n_k - 1)}$ acts as a penalty for a lack of diversity in trees. On the other hand, if $\mathcal{L}(y_i: i \in V_k \mid k^*, E_k)$ is equal for all possible $(k^*, E_k)$'s, then we can simply marginalize over $(k^*, E_k)$ and not be subject to this penalty [since $\sum_{(k^*, E_k)} n_k^{-(n_k - 1)} = 1$].

Therefore, we can form an intuition by interpolating those two extremes: if a set of data points (of size nk) are “well-knit” such that they can be connected via many possible spanning trees (each with a high conditional likelihood), then it would have a higher posterior probability of being clustered together, compared to some other points (of the same size nk) that have only a few trees with high conditional likelihood.
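As a small illustration of the "ground-up" construction, the following Python sketch draws the edges and cluster labels of an augmented tree from the forest process (7) with the Chinese-restaurant-process weights; no data are generated here, and the function name is ours.

import numpy as np

def simulate_forest_process(n, alpha=1.0, seed=None):
    rng = np.random.default_rng(seed)
    edges = [(0, 1)]            # the first edge e_1 = (0, 1) starts the first tree
    labels = {1: 0}             # node 1 opens cluster 0
    n_clusters = 1
    for i in range(2, n + 1):
        # pi_j^{[i]} = 1/(i-1+alpha) for existing nodes j; pi_i^{[i]} = alpha/(i-1+alpha) for node 0.
        probs = np.full(i, 1.0 / (i - 1 + alpha))
        probs[-1] = alpha / (i - 1 + alpha)
        j = int(rng.choice(np.arange(1, i + 1), p=probs))   # drawing j = i encodes attaching to node 0
        if j == i:
            edges.append((0, i))                            # start a new tree (new cluster)
            labels[i] = n_clusters
            n_clusters += 1
        else:
            edges.append((j, i))                            # attach to an existing node
            labels[i] = labels[j]
    return edges, labels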

While the "ground-up" construction is useful for understanding the difference from the classic urn process, the distribution (8) itself is not very convenient for posterior computation. Therefore, we also explore the alternative of a "top-down" approach. This is based on directly assigning a product partition probability (Hartigan, 1990; Barry and Hartigan, 1993; Crowley, 1997; Quintana and Iglesias, 2003) as

\Pi_0(V_1,\dots,V_K \mid K) = \frac{\prod_{k=1}^{K} n_k^{(n_k - 1)}}{\sum_{\text{all } (V_1,\dots,V_K)} \prod_{k=1}^{K} |V_k|^{(|V_k| - 1)}},    (9)

where the cohesion function $n_k^{(n_k - 1)}$ effectively cancels out the probability for each $(k^*, E_k)$. To assign a prior for $K$, we use the probability

\Pi_0(K) \propto \lambda^K \sum_{\text{all } (V_1^*,\dots,V_K^*)} \prod_{k=1}^{K} |V_k^*|^{(|V_k^*| - 1)},

supported on $K \in \{1,\dots,n\}$ with $\lambda > 0$. Together with $\Pi(E_k, k^* \mid V_k) = n_k^{-(n_k - 1)}$, multiplying the terms according to (6) leads to

\Pi(\mathcal{T}) \propto \lambda^K,    (10)

which is similar to a truncated geometric distribution and easy to handle in posterior computation; we will use this specification from now on. In this article, we set $\lambda = 0.5$.

Remark 4. We now discuss the exchangeability of the sequence of random variables generated from the above forest process. Exchangeability is defined as the invariance of the distribution, $\Pi(X_1 = x_1,\dots,X_n = x_n) = \Pi(X_1 = x_{\tilde\pi_1},\dots,X_n = x_{\tilde\pi_n})$, under any permutation $(\tilde\pi_1,\dots,\tilde\pi_n)$ (Diaconis, 1977). For simplicity, we focus on the joint distribution with $\theta$ given, and hence omit $\theta$ here. There are three categories of random variables associated with each node $i$: the first drawn edge $(j,i)$ that points to a new node $i$ [whose sequence forms $\mathcal{T} = (\mathcal{V}, \{E_k, k^*\}_{k=1}^K)$], the cluster assignment $c_i$ of a node (whose sequence forms $\mathcal{V}$), and the data point $y_i$. It is not hard to see that, since each component tree encodes an order among $\{i : c_i = k\}$, the joint distribution of the data and the forest, $\Pi(y_1,\dots,y_n, \mathcal{T})$, is not exchangeable. Nevertheless, as we marginalize out each $(E_k, k^*)$ to form the clustering likelihood $\mathcal{L}(y; \mathcal{V})$ in (3), and all priors $\Pi_0(\mathcal{V})$ presented in this section only depend on the number and sizes of clusters, the joint distribution of the data and cluster labels, $\Pi\{(y_1, c_1),\dots,(y_n, c_n)\} = \mathcal{L}(y; \mathcal{V})\, \Pi_0(\mathcal{V})$, is exchangeable, with its form provided soon in (14). Lastly, we see that $\Pi(y_1,\dots,y_n)$ is exchangeable after marginalizing over $\mathcal{V}$.

2.4. Hyper-priors for the Other Parameters

We now specify the hyper-priors for the parameters in the root and leaf densities. To avoid model sensitivity to scaling and shifting of the data, we assume that the data have been appropriately scaled and centered (for example, via standardization), so that marginally $\mathbb{E}\, y \approx 0$ and $\mathbb{E}\, \|y_{\cdot,j} - \mathbb{E}\, y_{\cdot,j}\|_2^2 \approx 1$ for $j = 1,\dots,p$. To make the root density $r(\cdot)$ close to a small constant on the support of the data, we set $\mu = 0$ and $\gamma^2 \sim \text{Inverse-Gamma}(2, 1)$.

For $\sigma_{i,j}$ in the leaf density $f(y_i \mid y_j; \sigma_{i,j})$, in order to likely pick an edge $(i,j)$ with $j$ a close neighbor of $i$ [that is, $(i,j)$ with small $\|y_i - y_j\|_2$], we want most of the $\sigma_{i,j} = \tilde\sigma_i \tilde\sigma_j$ to be small. We use the following hierarchical inverse-gamma prior, which shrinks each $\tilde\sigma_i$ while using a common scale hyper-parameter $\beta_\sigma$ to borrow strength among the $\tilde\sigma_i$'s:

\beta_\sigma \sim \text{Exponential}(\eta_\sigma), \qquad \eta_\sigma \sim \text{Inverse-Gamma}(a_\sigma, \xi_\sigma), \qquad \tilde\sigma_i \overset{iid}{\sim} \text{Inverse-Gamma}(b_\sigma, \beta_\sigma) \ \text{ for } i = 1,\dots,n,

where $\eta_\sigma$ is the scale parameter of the exponential. To induce a shrinkage effect a priori, we use $a_\sigma = 100$ and $\xi_\sigma = 1$ for a likely small $\eta_\sigma$, hence a small $\beta_\sigma$. Further, we note that the coefficient of variation $\sqrt{\mathrm{Var}(\tilde\sigma_i \mid \beta_\sigma)}/\mathbb{E}(\tilde\sigma_i \mid \beta_\sigma) = 1/\sqrt{b_\sigma - 2}$; therefore, we set $b_\sigma = 10$ to have most of the $\tilde\sigma_i$ near $\mathbb{E}(\tilde\sigma_i \mid \beta_\sigma) = \beta_\sigma/(b_\sigma - 1)$ in the prior. We use these hyper-prior settings in all the examples presented in this article.
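For intuition about the shrinkage these settings induce, the following sketch simply simulates from the hyper-prior with the stated values ($a_\sigma = 100$, $\xi_\sigma = 1$, $b_\sigma = 10$); it is a prior check only, not part of the inference algorithm.

import numpy as np

rng = np.random.default_rng(0)
a_sigma, xi_sigma, b_sigma, n = 100, 1.0, 10, 300

# eta_sigma ~ Inverse-Gamma(a_sigma, xi_sigma), drawn as the reciprocal of a Gamma variable.
eta_sigma = 1.0 / rng.gamma(shape=a_sigma, scale=1.0 / xi_sigma)
# beta_sigma ~ Exponential with scale parameter eta_sigma.
beta_sigma = rng.exponential(scale=eta_sigma)
# sigma_tilde_i ~ iid Inverse-Gamma(b_sigma, beta_sigma).
sigma_tilde = 1.0 / rng.gamma(shape=b_sigma, scale=1.0 / beta_sigma, size=n)

# The coefficient of variation is close to 1/sqrt(b_sigma - 2) ≈ 0.35, and the
# draws concentrate near beta_sigma / (b_sigma - 1).
print(beta_sigma, sigma_tilde.mean(), sigma_tilde.std() / sigma_tilde.mean())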

In addition, Zelnik-Manor and Perona (2005) demonstrate good empirical performance in spectral clustering via setting σ˜i to a low order statistic of the distances to yi. We show a model-based formalization with similar effects in the Supplementary Materials S5.

2.5. Model-based Extensions

Compared to algorithms, a major advantage of probabilistic models is the ease of building useful model-based extensions. We demonstrate three directions for extending the Bayesian forest model. Due to the page constraint, we defer the details and numerical results of these extensions to the Supplementary Materials S1.1, S1.2 and S1.3.

Latent Forest Model:

First, one could use the realization of the forest process as latent variables in another model for data $(y_1,\dots,y_n)$,

z_1,\dots,z_n \sim \text{ForestModel}(\mathcal{T}; \theta_z), \qquad y_1,\dots,y_n \sim \mathcal{L}(z_1,\dots,z_n; \theta_y),

where $\theta_z$ and $\theta_y$ denote the other needed parameters. For example, for clustering high-dimensional data such as images, it is often necessary to represent each high-dimensional observation $y_i$ by a low-dimensional coordinate $z_i$ (Wu et al., 2014; Chandra et al., 2023). In the Supplementary Materials, we present a high-dimensional clustering model, using an autoregressive matrix Gaussian for the model of $y$ given $z$ and a sparse von Mises-Fisher distribution for the forest model.

Informative Prior-Latent Variable Distribution:

Second, in applications it is sometimes desirable to have the clustering dependent on some external information x, such as covariates (Müller et al., 2011) or an existing partition (Paganin et al., 2021). From a Bayesian view, this can be achieved via taking an x-informative distribution:

\mathcal{T} \sim \Pi(\cdot \mid x), \qquad y_1,\dots,y_n \sim \text{ForestModel}(\mathcal{T}; \theta).

In the Supplementary Materials, we illustrate an extension that incorporates a covariate-dependent product partition model [PPMx; Müller et al. (2011)] into the distribution of $\mathcal{T}$.

Hierarchical Multi-view Clustering:

Third, for multi-subject data $(y_1^{(s)},\dots,y_n^{(s)})$ with $s = 1,\dots,S$, we want to find a clustering for every $s$. At the same time, we can borrow strength among subjects by letting subjects share a similar partition structure on a subset of nodes (while differing on the other nodes). This is known as multi-view clustering. A challenge, however, is that a forest is a discrete object subject to combinatorial constraints, hence it would be difficult to partition the nodes freely while accommodating the tree structure. To circumvent this issue, we propose a latent coordinate-based distribution that gives a continuous representation for $\mathcal{T}^{(s)}$.

Considering a latent $z_i^{(s)} \in \mathbb{R}^d$ for each node $i = 1,\dots,n$, we assign a joint prior–latent variable distribution for $z^{(s)}$ and $\mathcal{T}^{(s)}$:

\Pi[z^{(s)}, \mathcal{T}^{(s)}] \propto \lambda^{K[\mathcal{T}^{(s)}]} \Big[\prod_{(i,j) \in \mathcal{T}^{(s)}: i \ge 1, j \ge 1} \exp\Big(-\frac{\|z_i^{(s)} - z_j^{(s)}\|_2^2}{2\rho}\Big)\Big] \Big[\prod_{i=1}^{n} \Big\{\sum_{k=1}^{\tilde\kappa} v_{i,k} \exp\Big(-\frac{\|z_i^{(s)} - \eta_k^*\|_2^2}{2\sigma_z^2}\Big)\Big\}\Big],
(v_{i,1},\dots,v_{i,\tilde\kappa}) \sim \text{Dir}(1/\tilde\kappa,\dots,1/\tilde\kappa) \ \text{ for } i = 1,\dots,n,
\{y_1^{(s)},\dots,y_n^{(s)}\} \sim \text{Forest Model}(\mathcal{T}^{(s)}) \ \text{ for } s = 1,\dots,S,    (11)

where $v_{i,1},\dots,v_{i,\tilde\kappa}$ are weights that vary with $i$ and satisfy $\sum_{k=1}^{\tilde\kappa} v_{i,k} = 1$, $\rho > 0$, and $z^{(s)} \in \mathbb{R}^{n \times d}$ is the matrix form. Equivalently, the above assigns each node a location parameter $\eta_i^{(s)}$, drawn from a hierarchical Dirichlet distribution with shared atoms $\{\eta_1^*,\dots,\eta_{\tilde\kappa}^*\}$ and probabilities $(v_{\cdot,1},\dots,v_{\cdot,\tilde\kappa})$ (Teh et al., 2006). Further, one could let $\eta_k^*$ vary over nodes according to some functional form using a hybrid Dirichlet distribution (Petrone et al., 2009).

Using a Gaussian mixture kernel on $z_i^{(s)}$, we can now separate the $z_i^{(s)}$'s into several groups that are far apart. To make the parameters identifiable and to obtain large separations between groups, we fix the $\eta_k^*$'s on the $d$-dimensional integer lattice $\{0, 1, 2\}^d$ with $d = 2$ (hence $\tilde\kappa = 9$); and we use $\sigma_z^2 = 0.01$ and $\rho = 0.001$ in this article.

Remark 5. To clarify, our goal is to induce between-subject similarity in the node partition, not in the tree structure. For example, for two subjects $s$ and $s'$, when $z_i^{(s)}$ and $z_i^{(s')}$ are both near $\eta_k^*$ for all $i \in C$, then both spanning forests $\mathcal{T}^{(s)}$ and $\mathcal{T}^{(s')}$ will likely cluster the nodes in $C$ together, even though the trees $T_k^{(s)}$ and $T_k^{(s')}$ associated with $V_k \supseteq C$ may be different.

3. Posterior Computation

3.1. Gibbs Sampling Algorithm

We now describe the Markov chain Monte Carlo (MCMC) algorithm. For ease of notation, we use an $(n+1) \times (n+1)$ matrix $S$, with $S_{i,j} = \log f(y_i \mid y_j; \theta)$, $S_{0,i} = S_{i,0} = \log r(y_i; \theta) + \log\lambda$ (for convenience, we use 0 to index the last row/column), and $S_{i,i} = 0$; and we use $A_{\mathcal{T}}$ to represent the adjacency matrix of $\mathcal{T}$. We have the posterior distribution

\Pi(\mathcal{T}, \theta \mid y) \propto \exp\{\mathrm{tr}[S(\theta) A_{\mathcal{T}}]/2\}\, \Pi_0(\theta).    (12)

Note that the above form conveniently includes the prior term for the number of clusters, $\lambda^K$, via the number of edges adjacent to node 0.

Our MCMC algorithm alternates between updating $\mathcal{T}$ and $\theta$, hence it is a Gibbs sampling algorithm. To sample $\mathcal{T}$ given $\theta$, we use the random-walk covering algorithm for weighted spanning trees (Mosbah and Saheb, 1999), an extension of the Aldous–Broder algorithm for sampling a uniform spanning tree (Broder, 1989; Aldous, 1990). To keep this article self-contained, we describe the algorithm below (Algorithm 1). It produces a random sample $\mathcal{T}$ following the full conditional $\Pi(\mathcal{T} \mid \theta, y)$ proportional to (12), with an expected completion time of $O(n \log n)$. Although some faster algorithms have been developed (Schild, 2018), we choose to present the random-walk covering algorithm for its simplicity.

Algorithm 1.

Random-walk covering algorithm for sampling the augmented tree 𝒯

Start with $V_{\mathcal{T}} = \{0\}$ and $E_{\mathcal{T}} = \emptyset$, and set $i \leftarrow 0$.
while $|V_{\mathcal{T}}| < n + 1$ do
  Take a random-walk step from $i$ to $j$ with probability $\Pr(j \mid i) = \exp[S_{i,j}(\theta)] / \sum_{j': j' \ne i} \exp[S_{i,j'}(\theta)]$.
  if $j \notin V_{\mathcal{T}}$ then
    Add $j$ to $V_{\mathcal{T}}$ and add $(i, j)$ to $E_{\mathcal{T}}$.
  end if
  Set $i \leftarrow j$.
end while
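For concreteness, below is a minimal Python sketch of Algorithm 1. The input S is assumed to be the $(n+1) \times (n+1)$ log-weight matrix defined above (index 0 for the auxiliary node); the walk always moves to the proposed node, and only first-entry edges join the tree, following the random-walk covering construction.

import numpy as np

def sample_augmented_tree(S, seed=None):
    rng = np.random.default_rng(seed)
    m = S.shape[0]               # m = n + 1 nodes, including the auxiliary node 0
    visited = {0}                # start the walk at the auxiliary node 0
    edges = []
    i = 0
    while len(visited) < m:
        # Transition probabilities Pr(j | i) proportional to exp(S[i, j]), no self-loop.
        logits = S[i].astype(float)
        logits[i] = -np.inf
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        j = int(rng.choice(m, p=probs))
        if j not in visited:     # first entry to j: record the tree edge (i, j)
            visited.add(j)
            edges.append((i, j))
        i = j                    # always move to j, whether or not it was new
    return edges                 # removing node 0 from these edges yields K disjoint trees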

We sample $\tilde\sigma_i$ and the hyper-parameters $(\eta_\sigma, \beta_\sigma)$ using the following steps:

(\eta_\sigma \mid \cdot) \sim \text{Inverse-Gamma}(1 + a_\sigma,\ \beta_\sigma + \xi_\sigma),
(\beta_\sigma \mid \cdot) \sim \text{Gamma}\Big\{1 + n b_\sigma,\ \Big(\sum_{i=1}^{n} \frac{1}{\tilde\sigma_i} + \frac{1}{\eta_\sigma}\Big)^{-1}\Big\},
(\tilde\sigma_i \mid \cdot) \sim \text{Inverse-Gamma}\Big[\frac{p \sum_j 1\{(i,j) \in \mathcal{T}\}}{2} + b_\sigma,\ \sum_{j: (i,j) \in \mathcal{T}} \frac{\|y_i - y_j\|_2^2}{2\tilde\sigma_j} + \beta_\sigma\Big].
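A minimal Python sketch of these three conditional draws is given below (the $\gamma$ update that follows is analogous); the edge list is assumed to contain only the leaf edges of the current tree, i.e., those not involving node 0, and the variable names are illustrative.

import numpy as np

def update_scale_parameters(y, edges, sigma_tilde, beta_sigma,
                            a_sigma=100.0, xi_sigma=1.0, b_sigma=10.0, seed=None):
    rng = np.random.default_rng(seed)
    n, p = y.shape

    # (eta_sigma | .) ~ Inverse-Gamma(1 + a_sigma, beta_sigma + xi_sigma)
    eta_sigma = 1.0 / rng.gamma(1.0 + a_sigma, 1.0 / (beta_sigma + xi_sigma))

    # (beta_sigma | .) ~ Gamma{1 + n*b_sigma, (sum_i 1/sigma_tilde_i + 1/eta_sigma)^(-1)}
    rate = np.sum(1.0 / sigma_tilde) + 1.0 / eta_sigma
    beta_sigma = rng.gamma(1.0 + n * b_sigma, 1.0 / rate)

    # (sigma_tilde_i | .) ~ Inverse-Gamma with shape p*deg(i)/2 + b_sigma and scale
    # sum over incident edges of ||y_i - y_j||^2 / (2 sigma_tilde_j) plus beta_sigma.
    neighbors = {i: [] for i in range(n)}
    for (u, v) in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    for i in range(n):
        shape = 0.5 * p * len(neighbors[i]) + b_sigma
        scale = beta_sigma + sum(np.sum((y[i] - y[j]) ** 2) / (2.0 * sigma_tilde[j])
                                 for j in neighbors[i])
        sigma_tilde[i] = 1.0 / rng.gamma(shape, 1.0 / scale)
    return sigma_tilde, eta_sigma, beta_sigma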

To update $\gamma$, we use the representation of the multivariate Cauchy as a scale mixture of $N(\mu, \gamma^2 u_{\gamma,i} I_p)$ over $u_{\gamma,i} \sim \text{Inverse-Gamma}(1/2, 1/2)$. We can update via

u_{\gamma,i} \sim \text{Inverse-Gamma}\Big(\frac{1 + p}{2},\ \frac{1}{2} + \frac{\|y_i - \mu\|_2^2}{2\gamma^2}\Big), \qquad \gamma^2 \sim \text{Inverse-Gamma}\Big(2 + \frac{Kp}{2},\ \hat\sigma_y^2 + \sum_{i: (0,i) \in \mathcal{T}} \frac{\|y_i - \mu\|_2^2}{2 u_{\gamma,i}}\Big).

We run the MCMC algorithm for many iterations and discard the first half as burn-in.

Remark 6. We want to emphasize that the Aldous–Broder random-walk covering algorithm (Broder, 1989; Aldous, 1990; Mosbah and Saheb, 1999) is an exact algorithm for sampling a spanning tree $\mathcal{T}$. That is, if $\theta$ were fixed, each run of this algorithm would produce an independent Monte Carlo sample $\mathcal{T} \sim \Pi(\mathcal{T} \mid \theta, y)$. Removing the auxiliary node 0 from $\mathcal{T}$ produces $K$ disjoint spanning trees. This augmented-graph technique is inspired by Boykov et al. (2001).

In our algorithm, since the scale parameters in $\theta$ are unknown, we use a Markov chain Monte Carlo scheme that updates two sets of parameters, (i) $(\theta^{[t+1]} \mid \mathcal{T}^{[t]})$ and (ii) $(\mathcal{T}^{[t+1]} \mid \theta^{[t+1]})$, from iteration $[t]$ to $[t+1]$. Therefore, rigorously speaking, there is a Markov chain dependency between $\mathcal{T}^{[t]}$ and $\mathcal{T}^{[t+1]}$ induced by $\theta^{[t+1]}$. Nevertheless, since we draw $\mathcal{T}$ in a block via the random-walk covering algorithm, we empirically find that $\mathcal{T}^{[t+1]}$ and $\mathcal{T}^{[t]}$ are substantially different. In the Supplementary Materials S4.4, we quantify the iteration-to-iteration graph changes, and provide diagnostics with multiple starting points of $(\mathcal{T}^{[0]}, \theta^{[0]})$.

3.2. Posterior Point Estimate on Clustering

In the field of Bayesian clustering, a long-time practice for producing a point estimate of the partition was to simply track $\Pr(c_i = k \mid y)$, then take the element-wise posterior mode over $k$ as the point estimate $\hat{c}_i$. Nevertheless, this was shown to be sub-optimal because (i) the label-switching issue causes unreliable estimates of $\Pr(c_i = k \mid y)$, and (ii) the element-wise mode can be unrepresentative of the center of the distribution of $(c_1,\dots,c_n)$ (Wade and Ghahramani, 2018). These weaknesses have motivated new methods for obtaining a point estimate of the clustering, which transform an $n \times n$ pairwise co-assignment matrix $\{\Pr(c_i = c_j \mid y)\}_{\text{all } (i,j)}$ into an $n \times K$ assignment matrix (Medvedovic and Sivaganesan, 2002; Rasmussen et al., 2008; Molitor et al., 2010; Wade and Ghahramani, 2018). More broadly speaking, minimizing a loss function based on the posterior sample (via some estimator or algorithm) is common for producing a point estimate under some decision-theoretic criterion. For example, the posterior mean is the minimizer of the squared error loss; in Bayesian factor modeling, an orthogonal Procrustes-based loss function is used for producing the posterior summary of the loading matrix from the generated MCMC samples (Aßmann et al., 2016).

We follow this strategy. There are many algorithms that one could use; for a recent survey, see Dahl et al. (2022). In this article, we use a simple solution of first finding the posterior mode $\hat{K}$ of $K$ from the posterior sample, then performing a rank-$\hat{K}$ symmetric matrix factorization on $\{\Pr(c_i = c_j \mid y)\}_{\text{all } (i,j)}$ and clustering into $\hat{K}$ groups, as provided by the RcppML package (DeBruine et al., 2021).
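As an illustration, the sketch below assembles the co-assignment matrix from posterior label draws and produces a point estimate; in place of RcppML's symmetric factorization, it clusters the leading eigenvectors of the co-assignment matrix with K-means, which is a substitute we adopt for this sketch rather than the exact implementation used in the paper.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def clustering_point_estimate(label_draws):
    # label_draws: (n_mcmc, n) array of cluster labels from the posterior sample.
    label_draws = np.asarray(label_draws)
    n_mcmc, n = label_draws.shape

    # Posterior mode of the number of clusters K.
    K_draws = np.array([len(np.unique(row)) for row in label_draws])
    K_hat = int(np.bincount(K_draws).argmax())

    # Co-assignment matrix: empirical estimate of Pr(c_i = c_j | y).
    coassign = np.zeros((n, n))
    for row in label_draws:
        coassign += (row[:, None] == row[None, :])
    coassign /= n_mcmc

    # Rank-K_hat spectral summary of the co-assignment matrix, then K-means on the rows.
    _, eigvecs = eigh(coassign)
    top = eigvecs[:, -K_hat:]
    return KMeans(n_clusters=K_hat, n_init=10).fit_predict(top)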

4. Theoretical Properties

4.1. Convergence of Eigenvectors

We now formalize the closeness of the eigenvectors of matrices N and M (shown in Section 2.2), by establishing the convergence of the two sets of eigenvectors as n increases.

To be specific, we focus on the normalized spectral clustering algorithm using the similarity $A_{i,j} = \exp(S_{i,j})$, with $S_{i,j} = \log f(y_i \mid y_j; \theta)$ and $S_{0,i} = S_{i,0} = \log r(y_i; \theta) + \log\lambda$. Regarding the specific form, $f(y_i \mid y_j)$ can be any density satisfying $f(y_i \mid y_j, \theta) = f(y_j \mid y_i, \theta)$, and $r(y_i; \theta)$ can be any density satisfying $r(y_i; \theta) > 0$. For the associated normalized Laplacian $N$, we denote the bottom $K$ eigenvectors by $\phi_1,\dots,\phi_K$, which correspond to the smallest $K$ eigenvalues.

Let $M$ be the matrix with $M_{i,j} = \Pr[\mathcal{T} \ni (i,j) \mid y, \theta]$ for $i \ne j$ and $M_{i,i} = 0$. Kirchhoff's tree theorem (Chaiken and Kleitman, 1978) gives an enumeration over all spanning trees $\mathcal{T} \in \mathbb{T}$:

\sum_{\mathcal{T} \in \mathbb{T}} \prod_{(i,j) \in \mathcal{T}} \exp(S_{i,j}) = (n+1)^{-1} \prod_{h=2}^{n+1} \lambda^{(h)}(L),    (13)

where $L$ is the Laplacian matrix transform of the similarity matrix $A$, and $\lambda^{(h)}$ denotes the $h$th smallest eigenvalue. Differentiating its logarithmic transform with respect to $S_{i,j}$,

M_{i,j} = \Pr[\mathcal{T} \ni (i,j) \mid y] = \frac{\sum_{\mathcal{T} \in \mathbb{T}: (i,j) \in \mathcal{T}} \prod_{(i',j') \in \mathcal{T}} \exp(S_{i',j'})}{\sum_{\mathcal{T} \in \mathbb{T}} \prod_{(i',j') \in \mathcal{T}} \exp(S_{i',j'})} = \frac{\partial \sum_{h=2}^{n+1} \log \lambda^{(h)}(L)}{\partial S_{i,j}}.
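Numerically, this derivative has a well-known closed form: the probability that edge $(i,j)$ appears in the weighted random spanning tree equals its weight $\exp(S_{i,j})$ times the effective resistance between $i$ and $j$ in the weighted graph, a standard consequence of the matrix-tree theorem. The sketch below computes $M$ this way from the pseudoinverse of the Laplacian; the Gaussian similarity used in the sanity check is an illustrative stand-in for $\exp(S_{i,j})$.

import numpy as np

def edge_inclusion_probabilities(W):
    # W: symmetric nonnegative weight matrix (zero diagonal), W[i, j] = exp(S[i, j]).
    L = np.diag(W.sum(axis=1)) - W                      # weighted graph Laplacian
    L_pinv = np.linalg.pinv(L)                          # Moore-Penrose pseudoinverse
    d = np.diag(L_pinv)
    eff_res = d[:, None] + d[None, :] - 2.0 * L_pinv    # effective resistances
    M = W * eff_res                                     # Pr[(i, j) is in the random spanning tree]
    np.fill_diagonal(M, 0.0)
    return M

# Sanity check: a spanning tree on m nodes has m - 1 edges, so the entries of M
# (each edge counted twice) should sum to 2 * (m - 1).
rng = np.random.default_rng(1)
y = rng.normal(size=(30, 2))
W = np.exp(-((y[:, None, :] - y[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(W, 0.0)
print(edge_inclusion_probabilities(W).sum())            # close to 58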

Let $\Psi_1,\dots,\Psi_K$ be the top $K$ eigenvectors of $M$, associated with eigenvalues $\xi_1 \ge \xi_2 \ge \cdots \ge \xi_K$, and $\xi_K > \xi_{K+1} \ge \xi_{K+2} \ge \cdots \ge \xi_{n+1}$. We compare them with the $K$ bottom eigenvectors of $N$, $\phi_1,\dots,\phi_K$. Using $\Psi_{1:K}$ and $\phi_{1:K}$ to denote the two $(n+1) \times K$ matrices, we now show that they are close to each other.

Theorem 1. There exist an orthonormal matrix $R \in \mathbb{R}^{K \times K}$ and a finite constant $\epsilon > 0$ such that

\|\Psi_{1:K} - \phi_{1:K} R\|_F \le \frac{40 K (n+1)}{\xi_K - \xi_{K+1}} \max_{i,j}\big\{(1+\epsilon)\,\big(D_{i,i}^{-1/2} D_{j,j}^{-1/2}\big)^2 A_{i,j}\big\},

with probability at least $1 - \exp(-n)$.

Remark 7. To make the right-hand side go to zero, a sufficient condition is to have all $A_{i,j}/D_{i,i} = O(n^{-\kappa})$ with $\kappa > 1/2$. We provide a detailed definition of the bound constant $\epsilon$ in the Supplementary Materials S2.

To explain the intuition behind this theorem, our starting point is the close relationship between the Laplacian and spanning tree models: multiplying both sides of Equation (13) by $(n+1)^{-(n-1)}$ shows that the product of the non-zero eigenvalues of the graph Laplacian $L$ is proportional to the marginal probability of the $n$ data points under a spanning forest-mixture model. Starting from this equality, we can write the marginal inclusion probability matrix of $\mathcal{T}$ as a mildly perturbed form of the normalized Laplacian matrix. Intuitively, when two matrices are close, their eigenvectors will be close as well (Yu et al., 2015).

Therefore, under mild conditions, as $n \to \infty$, the two sets of leading eigenvectors converge. In the Supplementary Materials S4.7, we show that the convergence is very fast, with the two sets of leading eigenvectors becoming almost indistinguishable starting around $n \approx 50$.

Besides the eigenvector convergence, we can examine the marginal posterior $\Pi(\mathcal{V} \mid \theta, y)$, which is proportional to

\mathcal{L}(y; \mathcal{V}, \theta)\, \Pi_0(\mathcal{V}) = \Pi_0(K, V_1,\dots,V_K) \Big\{\prod_{k=1}^{K} \Big[\prod_{i \in V_k} r(y_i)\Big]\Big\} \prod_{k=1}^{K} \Big\{n_k^{-1} \prod_{h=2}^{n_k} \lambda^{(h)}(L_k)\Big\},    (14)

where $L_k$ is the unnormalized Laplacian matrix associated with the matrix $\{A_{i,j}\}_{i \in V_k, j \in V_k}$. If we put all indices into one partition set $V_1 = (1,\dots,n)$, then $\Pi(\mathcal{V} \mid \theta, y)$ would be very small due to those close-to-zero eigenvalues. Applying this deduction recursively on subsets of the data, it is not hard to see that a high-valued $\Pi(\mathcal{V} \mid \theta, y)$ corresponds to a partition wherein each $V_k$ has $\lambda^{(h)}(L_k)$ away from 0 for every $h \ge 2$. Further, since $n_k^{-1} \prod_{h=2}^{n_k} \lambda^{(h)}(L_k) = |L_k + J/n_k^2|$ (with $J$ the all-ones matrix), a permutation of $(1,\dots,n)$ corresponds to simultaneous permutations of the rows and columns of each $L_k$, which does not change each determinant. Therefore, the joint distribution $\Pi\{(y_1, c_1),\dots,(y_n, c_n)\}$ is exchangeable.
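As a sketch of how (14) can be evaluated for a candidate partition, the code below computes the logarithm of the partition-dependent factor using the determinant identity just mentioned; the similarity matrix A and the per-point log root densities are assumed to be precomputed, and the prior factor $\Pi_0$ is omitted.

import numpy as np

def log_partition_factor(A, labels, log_r):
    # A: n x n similarity matrix with A[i, j] = exp(S[i, j]); labels: candidate cluster
    # assignment; log_r: length-n vector of log r(y_i). Returns the log of the
    # partition-dependent part of (14).
    total = 0.0
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        n_k = len(idx)
        total += log_r[idx].sum()                          # product of r(y_i) over the cluster
        A_k = A[np.ix_(idx, idx)]
        L_k = np.diag(A_k.sum(axis=1)) - A_k               # within-cluster Laplacian
        J = np.ones((n_k, n_k))
        _, logdet = np.linalg.slogdet(L_k + J / n_k ** 2)  # log of n_k^{-1} prod_{h>=2} lambda^{(h)}(L_k)
        total += logdet
    return total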

4.2. Consistent Clustering of Separable Sets

We show that clustering consistency is possible under some separability assumptions when the data-generating distribution follows a forest process. Specifically, we establish posterior ratio consistency: the ratio of the maximum posterior probability assigned to any other clustering assignment to the posterior probability assigned to the true clustering assignment converges to zero almost surely under the true model (Cao et al., 2019).

To formalize the above, we denote the true cluster label for generating $y_i$ by $c_i^0$ (subject to label permutation among clusters), and we define the enclosing region for all possible $\{y_i : c_i^0 = k\}$ as $R_k^0$ for $k = 1,\dots,K_0$, for some true finite $K_0$. We refer to $R^0 = (R_1^0,\dots,R_{K_0}^0)$ as the "null partition". By separability, we mean the scenario that $(R_1^0,\dots,R_{K_0}^0)$ are disjoint and there is a lower-bounded distance between each pair of sets. As alternatives, regions $R = (R_1,\dots,R_K)$ could be induced by $\{c_1,\dots,c_n\}$ from the posterior estimate of $\mathcal{T}$. For simplicity, we assume the scale parameters in $f$ are known and all equal, $\sigma_{i,j} = \sigma_{0,n}$.

Number of clusters is known. We first start with the simple case of fixed $K = K_0$. For regularity, we consider data supported in a compact region $\mathcal{X}$ and satisfying the following assumptions:

  • (A1, diminishing scale) $\sigma_{0,n} = C(1/\log n)^{1+t}$ for some $t > 0$ and $C > 0$.

  • (A2, minimum separation) $\inf_{x \in R_k^0,\, y \in R_{k'}^0} \|x - y\|_2 > M_n$ for all $k \ne k'$, with some positive constant $M_n > 0$ such that $M_n^2/\sigma_{0,n} = 8\tilde{m}_0 \log(n)$, where $M_n$ is known, for some constant $\tilde{m}_0 > p/2 + 2$.

  • (A3, near-flatness of root density) For any $n$, $\epsilon_1 < r(y) < \epsilon_2$ for all $y \in \mathcal{X}$.

Under the null partition, $\Pi(\mathcal{T} \mid y)$ is maximized at $\mathcal{T} = \mathcal{T}_{\mathrm{MST}, R^0}$, which contains $K_0$ trees with each $T_k$ being the minimum spanning tree (denoted by the subscript "MST") within region $R_k^0$. Similarly, for any alternative $R$, $\Pi(\mathcal{T} \mid y)$ is maximized at $\mathcal{T} = \mathcal{T}_{\mathrm{MST}, R}$.

Theorem 2. Under (A1, A2, A3), we have $\Pi(\mathcal{T}_{\mathrm{MST}, R} \mid y)/\Pi(\mathcal{T}_{\mathrm{MST}, R^0} \mid y) \to 0$ almost surely, unless $R_i^0 \subseteq R_{\xi(i)}$ for some permutation map $\xi(\cdot)$.

Number of clusters is unknown: Next, we relax the condition by allowing $K$ to differ from $K_0$. We show the consistency in two parts, for 1) $K < K_0$ and 2) $K > K_0$, separately. In order to show posterior ratio consistency in the second part, we need some finer control on $r(y)$:

  • (A3') The root density satisfies $\tilde{m}_1 e^{-M_n/(2\sigma_{0,n})} \le r(y) \le \tilde{m}_2 e^{-M_n/(2\sigma_{0,n})}$ for some $\tilde{m}_1 < \tilde{m}_2$.

In this assumption, we essentially assume the root distribution to be flatter with a larger n. Then we have the following results.

Theorem 3. 1) If $K < K_0$, under the assumptions (A1, A2, A3), we have $\Pi(\mathcal{T}_{\mathrm{MST}, R} \mid y)/\Pi(\mathcal{T}_{\mathrm{MST}, R^0} \mid y) \to 0$ almost surely.

2) If $K > K_0$, under the assumptions (A1, A2, A3'), we have $\Pi(\mathcal{T}_{\mathrm{MST}, R} \mid y)/\Pi(\mathcal{T}_{\mathrm{MST}, R^0} \mid y) \to 0$ almost surely.

The above results show posterior ratio consistency. Furthermore, when the true number of clusters is known, the ratio consistency result can be further extended to show clustering consistency, which is proved in the Supplementary Materials S3.

5. Numerical Experiments

To illustrate the capability of uncertainty quantification, we carry out clustering tasks on near-manifold data commonly used for benchmarking clustering algorithms. In the first simulation, we start with 300 points drawn from three rings of radii 0.2, 1 and 2, with 100 points from each ring. We then add Gaussian noise to each point to create a coordinate near a ring manifold. We present two experiments, one with noise from $N(0, 0.05^2 I_2)$, and one with noise from $N(0, 0.1^2 I_2)$. As shown in Figure 3, when these data are well separated (Panel a, showing the posterior point estimate), there is very little uncertainty in the clustering (Panel b), with the posterior co-assignment probability $\Pr(c_i = c_j \mid y)$ close to zero for any two data points near different rings. As the noise increases, these data become more difficult to separate. There is a considerable amount of uncertainty for the red and blue points: these two sets of points are assigned to one cluster with a probability close to 40% (Panel d). We conduct another simulation based on an arc manifold and two point clouds (Panels e-h), and find similar results. Additional experiments are described in the Supplementary Materials S4.2.
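The benchmark data for this experiment can be generated as follows (a minimal sketch; drawing the angles uniformly on each ring is an assumption of the sketch).

import numpy as np

def three_rings(noise_sd=0.05, n_per_ring=100, seed=None):
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k, radius in enumerate([0.2, 1.0, 2.0]):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_per_ring)
        ring = radius * np.column_stack([np.cos(theta), np.sin(theta)])
        points.append(ring + rng.normal(scale=noise_sd, size=(n_per_ring, 2)))
        labels += [k] * n_per_ring
    return np.vstack(points), np.array(labels)

y_easy, c_easy = three_rings(noise_sd=0.05)   # well-separated setting (Panels a, b)
y_hard, c_hard = three_rings(noise_sd=0.1)    # noisier setting (Panels c, d)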

Fig. 3. Uncertainty quantification in clustering data generated near three manifolds. When data are close to the manifolds (Panels a,e), there is very little uncertainty in the clustering, reflected by the low $\Pr(c_i = c_j \mid y)$ between points from different clusters (Panels b,f). As data deviate more from the manifolds (Panels c,g), the uncertainty increases (Panels d,h). In Panel g, the point estimate shows a two-cluster partitioning, while there is about a 20% probability of a three-cluster partitioning.

In the Supplementary Materials S4.1 and S4.3, we present some uncertainty quantification results for clustering data that are generated from mixture models. We compare the estimates with those from Gaussian mixture models, which may correspond to correctly or erroneously specified component distributions. Empirically, we find that the uncertainty estimates of $\Pr(c_i = c_j \mid y)$ and $\Pr(K \mid y)$ from the forest model are close to the ones based on the true data-generating distribution, whereas the Gaussian mixture models suffer from sensitivity in model specification, especially when $K$ is not known.

6. Application: Clustering in Multi-subject Functional Magnetic Resonance Imaging Data

In this application, we conduct a neuroscience study to find connected brain regions under a varying degree of impact from Alzheimer's disease. The source dataset is resting-state functional magnetic resonance imaging (rs-fMRI) scan data, collected from $S = 166$ subjects at different stages of Alzheimer's disease. Each subject has scans over $n = 116$ regions of interest using the Automated Anatomical Labeling (AAL) atlas (Rolls et al., 2020; Shi et al., 2021) and over $p = 120$ time points. We denote the observation for the $s$th subject in the $i$th region by $y_i^{(s)} \in \mathbb{R}^p$.

The rs-fMRI data are known for their high variability, often characterized by a low intraclass correlation coefficient (ICC), $(1 - \hat\sigma^2_{\text{within-group}}/\hat\sigma^2_{\text{total}})$, which estimates the proportion of the total variance that can be attributed to variability between groups (Noble et al., 2021). Therefore, our goal is to use multi-view clustering to divide the regions of interest for each subject, while improving our understanding of the source of the high variability.

We fit the multi-view clustering model to the data by running MCMC for 5,000 iterations and discarding the first 2,500 as burn-in. As shown in Figure 4, the hierarchical Dirichlet distribution on the latent coordinates induces similarity between the clusterings of brain regions among subjects on a subset of nodes, while showing subtle differences on the other nodes. On the other hand, some major differences can be seen in the clusterings between the healthy and diseased subjects. Using the latent coordinates (at the posterior mean), we quantify the distances between $z^{(s)}$ and $z^{(s')}$ for each pair of subjects $s \ne s'$. As shown in Figure 5(a), there is a clear two-group structure in the pairwise distance matrix formed by $\|z^{(s)} - z^{(s')}\|_F$, and the separation corresponds to the first 64 subjects being healthy (denoted by $s \in g_1$) and the latter 102 being diseased (denoted by $s \in g_2$).

Fig. 4. Results of brain region clustering (lateral view) for four subjects taken from the healthy and diseased groups. The multi-view clustering model allows subjects to have similar partition structures on a subset of nodes, while having subtle differences on the others (Panels a and b, Panels c and d). At the same time, the healthy subjects show a lesser degree of variability in the brain clustering than the diseased subjects.

Fig. 5. Using the latent coordinates to characterize the heterogeneity within the subjects.

Next, we compute the within-group variances for these two groups, using $\sum_{s \in g_l} \|z_i^{(s)} - (\sum_{s' \in g_l} z_i^{(s')}/|g_l|)\|_F^2 / |g_l|$ for $l = 1$ and $2$, and plot the variance for each region of interest $i$ on the spatial coordinates of the atlas. Figure 5(b) and (c) show that, although both groups show some degree of variability, the diseased group shows clearly higher variances in some regions of the brain. Specifically, the paracentral lobule (PCL), superior parietal gyrus (SPG), dorsolateral superior frontal gyrus (SFGdor), and supplementary motor area (SMA) in the frontal lobe show the highest amount of variability. Indeed, those regions are also associated with very low ICC scores [Figure 5(e)], calculated based on the variance of $z_i^{(s)}$, with pooled estimates $\hat\sigma^2_{\text{total},i} = \sum_s \|z_i^{(s)} - (\sum_{s'} z_i^{(s')}/S)\|_F^2 / S$ and $\hat\sigma^2_{\text{within-group},i} = \sum_{l=1}^{2} \sum_{s \in g_l} \|z_i^{(s)} - (\sum_{s' \in g_l} z_i^{(s')}/|g_l|)\|_F^2 / S$. On the other hand, some regions such as the hippocampus (HIP), parahippocampal gyrus (PHG), and superior occipital gyrus (SOG) show relatively lower variances within each group, hence higher ICC scores.
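For reference, a minimal sketch of the per-region variance and ICC computation described above is given below; the array shapes and variable names are illustrative assumptions.

import numpy as np

def regional_icc(z, group):
    # z: array of shape (S, n, d) with latent coordinates for S subjects, n regions, d dimensions;
    # group: length-S array of group labels (e.g., healthy vs diseased).
    S, n, d = z.shape
    icc = np.zeros(n)
    for i in range(n):
        z_i = z[:, i, :]
        total_var = np.sum((z_i - z_i.mean(axis=0)) ** 2) / S        # pooled total variance
        within = 0.0
        for g in np.unique(group):
            z_g = z_i[group == g]
            within += np.sum((z_g - z_g.mean(axis=0)) ** 2)          # within-group sum of squares
        within_var = within / S
        icc[i] = 1.0 - within_var / total_var
    return icc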

To show more details on the heterogeneity, we plot the latent coordinates associated with those ROIs using boxplots. Since each $z_i^{(s)}$ is in a two-dimensional space, we plot the linear transform $\tilde{z}_i^{(s)} = z_{i,1}^{(s)} + z_{i,2}^{(s)}$. Interestingly, those 8 ROIs with high variability still seem quite informative for distinguishing the two groups [Figure 5(f)]. To verify, we concatenate those latent coordinates to form an $S \times 16$ matrix and fit a logistic regression model for classifying the healthy versus diseased states. The Area Under the Curve (AUC) of the Receiver Operating Characteristic is 86.6%. On the other hand, when we fit the 6 ROIs with low variability in the logistic regression, the AUC increases to 96.1%.

An explanation for the above results is that Alzheimer's disease causes different degrees of damage to the frontal and parietal lobes [see the two distinct clusterings in Figure 4 (c) and (d)], and the severity of the damage can vary from person to person. On the other hand, the hippocampus region (HIP and PHG), important for memory consolidation, is known to be commonly affected by Alzheimer's disease (Braak and Braak, 1991; Klimova et al., 2015), which explains the low heterogeneity within the diseased group. Further, to the best of our knowledge, the high discriminability of the superior occipital gyrus (SOG) is a new quantitative finding that could be meaningful for a further clinical study.

For validation, without using any group information, we concatenate the $z_i^{(s)}$'s over all $i = 1,\dots,116$ to form an $S \times 232$ matrix and use lasso logistic regression to classify the two groups. When 12 predictors are selected (as a similar-size model to the one above using 6 ROIs), the AUC is 96.4%. Since the $z_i^{(s)}$'s are obtained in an unsupervised way, this validation result shows that the multi-view clustering model produces a meaningful representation for the nodes in this Alzheimer's disease data. We provide further details on the clusterings, including the number of clusters and the posterior co-assignment probability matrices, in the Supplementary Materials S4.5.

7. Discussion

In this article, we present our discovery of a probabilistic model for popular spectral clustering algorithms. This enables straightforward uncertainty quantification and model-based extensions through the Bayesian framework. There are several directions worth exploring. First, our consistency theory is developed under the condition of separable sets, similar to Ascolani et al. (2022). For general cases with non-separable sets, clustering consistency (especially on estimating $K$) is challenging to achieve; to the best of our knowledge, existing consistency theory only applies to data generated independently from a mixture model (Miller and Harrison, 2018; Zeng et al., 2023). For data generated dependently via a graph, this is still an unsolved problem. Second, in all of our forest models, we have been careful in choosing densities with tractable normalizing constants. One could relax this constraint by using densities $f(y_i \mid y_j, \theta) = \alpha_f g_f(y_i \mid y_j; \theta)$ and $r(y_i; \theta) = \alpha_r g_r(y_i; \theta)$, with $g_f$ and $g_r$ some similarity functions, and $(\alpha_f, \alpha_r)$ potentially intractable. In these cases, the forest posterior becomes $\Pi(\mathcal{T} \mid \cdot) \propto (\lambda\alpha_r/\alpha_f)^K \prod_{(0,i) \in \mathcal{T}} g_r(y_i; \theta) \prod_{(i,j) \in \mathcal{T}} g_f(y_i \mid y_j; \theta)$. Therefore, one could choose an appropriate $\tilde\lambda = \lambda\alpha_r/\alpha_f$ (equivalent to choosing some value of $\lambda$) without knowing the value of $\alpha_f$ or $\alpha_r$; nevertheless, how to calibrate $\tilde\lambda$ still requires further study. Third, a related idea is the Dirichlet Diffusion Tree (Neal, 2003), which considers a particle starting at the origin, following the path of previous particles, and diverging at a random time. The data are collected as the locations of the particles at the end of a time period. Compared to the forest process, the diffusion tree process has a conditional likelihood given the tree that is invariant to the ordering of the data index, which is a stronger property than the marginal exchangeability of the data points. Therefore, it is interesting to further explore the relationship between these two processes.

Supplementary Material

Supp 1

Acknowledgment:

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada.

Footnotes

Conflict of interest statement: The authors report that there are no competing interests to declare.

References

  1. Aldous DJ (1990). The Random Walk Construction of Uniform Spanning Trees and Uniform Labelled Trees. SIAM Journal on Discrete Mathematics 3 (4), 450–465. [Google Scholar]
  2. Ascolani F, Lijoi A, Rebaudo G, and Zanella G (2022). Clustering Consistency With Dirichlet Process Mixtures. arXiv preprint arXiv:2205.12924. [Google Scholar]
  3. Aßmann C, Boysen-Hogrefe J, and Pape M (2016). Bayesian Analysis of Static and Dynamic Factor Models: An Ex-Post Approach Towards the Rotation Problem. Journal of Econometrics 192 (1), 190–206. [Google Scholar]
  4. Banerjee S, Akbani R, and Baladandayuthapani V (2015). Bayesian Nonparametric Graph Clustering. arXiv preprint arXiv:1509.07535. [Google Scholar]
  5. Barry D and Hartigan JA (1993). A Bayesian Analysis for Change Point Problems. Journal of the American Statistical Association 88 (421), 309–319. [Google Scholar]
  6. Blackwell D and MacQueen JB (1973). Ferguson Distributions via Pólya Urn Schemes. The Annals of Statistics 1 (2), 353–355. [Google Scholar]
  7. Blei DM and Frazier PI (2011). Distance Dependent Chinese Restaurant Processes. Journal of Machine Learning Research 12 (8). [Google Scholar]
  8. Boykov Y, Veksler O, and Zabih R (2001). Fast Approximate Energy Minimization via Graph Cuts. IEEE Transactions on pattern analysis and machine intelligence 23 (11), 1222–1239. [Google Scholar]
9. Braak H and Braak E (1991). Neuropathological Stageing of Alzheimer-Related Changes. Acta Neuropathologica 82 (4), 239–259.
10. Broder AZ (1989). Generating Random Spanning Trees. In Annual Symposium on Foundations of Computer Science, Volume 89, pp. 442–447.
11. Byrne S and Dawid AP (2015). Structural Markov Graph Laws for Bayesian Model Uncertainty. The Annals of Statistics 43 (4), 1647–1681.
12. Cai D, Campbell T, and Broderick T (2021). Finite Mixture Models Do Not Reliably Learn the Number of Components. In International Conference on Machine Learning, pp. 1158–1169. PMLR.
13. Cao X, Khare K, and Ghosh M (2019). Posterior Graph Selection and Estimation Consistency for High-Dimensional Bayesian DAG Models. The Annals of Statistics 47 (1), 319–348.
14. Chaiken S and Kleitman DJ (1978). Matrix Tree Theorems. Journal of Combinatorial Theory, Series A 24 (3), 377–381.
15. Chandra NK, Canale A, and Dunson DB (2023). Escaping the Curse of Dimensionality in Bayesian Model Based Clustering. Journal of Machine Learning Research 24, 1–42.
16. Chi Y, Song X, Zhou D, Hino K, and Tseng BL (2007). Evolutionary Spectral Clustering by Incorporating Temporal Smoothness. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 153–162.
17. Coretto P and Hennig C (2016). Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering. Journal of the American Statistical Association 111 (516), 1648–1659.
18. Crowley EM (1997). Product Partition Models for Normal Means. Journal of the American Statistical Association 92 (437), 192–198.
19. Dahl DB, Johnson DJ, and Müller P (2022). Search Algorithms and Loss Functions for Bayesian Clustering. Journal of Computational and Graphical Statistics 31 (4), 1189–1201.
20. DeBruine ZJ, Melcher K, and Triche TJ Jr (2021). Fast and Robust Non-Negative Matrix Factorization for Single-Cell Experiments. bioRxiv, 2021–09.
21. Diaconis P (1977). Finite Forms of de Finetti’s Theorem on Exchangeability. Synthese 36, 271–281.
22. Duan LL and Dunson DB (2021a). Bayesian Distance Clustering. Journal of Machine Learning Research 22, 1–27.
23. Duan LL and Dunson DB (2021b). Bayesian Spanning Tree: Estimating the Backbone of the Dependence Graph. arXiv preprint arXiv:2106.16120.
24. Duan LL, Michailidis G, and Ding M (2023). Bayesian Spiked Laplacian Graphs. Journal of Machine Learning Research 24 (3), 1–35.
25. Edwards D, De Abreu GC, and Labouriau R (2010). Selecting High-Dimensional Mixed Graphical Models Using Minimal AIC or BIC Forests. BMC Bioinformatics 11 (1), 1–13.
26. Ester M, Kriegel H-P, Sander J, and Xu X (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231. AAAI Press.
27. Fraley C and Raftery AE (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association 97 (458), 611–631.
28. Frey BJ and Dueck D (2007). Clustering by Passing Messages Between Data Points. Science 315 (5814), 972–976.
29. Frühwirth-Schnatter S and Pyne S (2010). Bayesian Inference for Finite Mixtures of Univariate and Multivariate Skew-Normal and Skew-t Distributions. Biostatistics 11 (2), 317–336.
30. Geng J, Bhattacharya A, and Pati D (2019). Probabilistic Community Detection With Unknown Number of Communities. Journal of the American Statistical Association 114 (526), 893–905.
31. Gower JC and Ross GJ (1969). Minimum Spanning Trees and Single Linkage Cluster Analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics) 18 (1), 54–64.
32. Guha S and Baladandayuthapani V (2016). A Nonparametric Bayesian Technique for High-Dimensional Regression. Electronic Journal of Statistics 10 (2), 3374–3424.
33. Han X, Tong X, and Fan Y (2021). Eigen Selection in Spectral Clustering: A Theory-Guided Practice. Journal of the American Statistical Association, 1–13.
34. Hartigan JA (1990). Partition Models. Communications in Statistics-Theory and Methods 19 (8), 2745–2756.
35. Klimova B, Maresova P, Valis M, Hort J, and Kuca K (2015). Alzheimer’s Disease and Language Impairments: Social Intervention and Medical Treatment. Clinical Interventions in Aging, 1401–1408.
36. Kosmidis I and Karlis D (2016). Model-Based Clustering Using Copulas With Applications. Statistics and Computing 26 (5), 1079–1099.
37. Kumar A, Rai P, and Daume H (2011). Co-regularized Multi-view Spectral Clustering. In Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, and Weinberger K (Eds.), Advances in Neural Information Processing Systems, Volume 24. Curran Associates, Inc.
38. Lee SX and McLachlan GJ (2016). Finite Mixtures of Canonical Fundamental Skew t-Distributions. Statistics and Computing 26 (3), 573–589.
39. Lei J and Lin KZ (2022). Bias-Adjusted Spectral Clustering in Multi-Layer Stochastic Block Models. Journal of the American Statistical Association, 1–13.
40. Lei J and Rinaldo A (2015). Consistency of Spectral Clustering in Stochastic Block Models. The Annals of Statistics 43 (1), 215–237.
41. Lewis JR, MacEachern SN, and Lee Y (2021). Bayesian Restricted Likelihood Methods: Conditioning on Insufficient Statistics in Bayesian Regression. Bayesian Analysis 16 (4), 1393–1462.
42. Luo Z, Sang H, and Mallick B (2021). A Bayesian Contiguous Partitioning Method for Learning Clustered Latent Variables. Journal of Machine Learning Research 22.
43. MacQueen J (1967). Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symp. Math. Statist. Probability, pp. 281–297.
44. Malsiner-Walli G, Frühwirth-Schnatter S, and Grün B (2017). Identifying Mixtures of Mixtures Using Bayesian Estimation. Journal of Computational and Graphical Statistics 26 (2), 285–295.
45. McDaid AF, Murphy TB, Friel N, and Hurley NJ (2013). Improved Bayesian Inference for the Stochastic Block Model With Application to Large Networks. Computational Statistics & Data Analysis 60, 12–31.
46. Medvedovic M and Sivaganesan S (2002). Bayesian Infinite Mixture Model Based Clustering of Gene Expression Profiles. Bioinformatics 18 (9), 1194–1206.
47. Meilă M and Jaakkola T (2006). Tractable Bayesian Learning of Tree Belief Networks. Statistics and Computing 16 (1), 77–92.
48. Meilă M and Jordan MI (2000). Learning With Mixtures of Trees. Journal of Machine Learning Research 1 (Oct), 1–48.
49. Miller JW (2019). An Elementary Derivation of the Chinese Restaurant Process From Sethuraman’s Stick-Breaking Process. Statistics & Probability Letters 146, 112–117.
50. Miller JW and Dunson DB (2018). Robust Bayesian Inference via Coarsening. Journal of the American Statistical Association 114 (527), 1113–1125.
51. Miller JW and Harrison MT (2018). Mixture Models With a Prior on the Number of Components. Journal of the American Statistical Association 113 (521), 340–356.
52. Molitor J, Papathomas M, Jerrett M, and Richardson S (2010). Bayesian Profile Regression With an Application to the National Survey of Children’s Health. Biostatistics 11 (3), 484–498.
53. Mosbah M and Saheb N (1999). Non-Uniform Random Spanning Trees on Weighted Graphs. Theoretical Computer Science 218 (2), 263–271.
54. Müller P and Quintana F (2010). Random Partition Models With Regression on Covariates. Journal of Statistical Planning and Inference 140 (10), 2801–2808.
55. Müller P, Quintana F, and Rosner GL (2011). A Product Partition Model With Regression on Covariates. Journal of Computational and Graphical Statistics 20 (1), 260–278.
56. Neal RM (2003). Density Modeling and Clustering Using Dirichlet Diffusion Trees. Bayesian Statistics 7, 619–629.
57. Ng S-K, McLachlan GJ, Wang K, Ben-Tovim Jones L, and Ng S-W (2006). A Mixture Model With Random-Effects Components for Clustering Correlated Gene-Expression Profiles. Bioinformatics 22 (14), 1745–1752.
58. Noble S, Scheinost D, and Constable RT (2021). A Guide to the Measurement and Interpretation of fMRI Test-Retest Reliability. Current Opinion in Behavioral Sciences 40, 27–32.
59. Nowicki K and Snijders TAB (2001). Estimation and Prediction for Stochastic Blockstructures. Journal of the American Statistical Association 96 (455), 1077–1087.
60. Paganin S, Herring AH, Olshan AF, and Dunson DB (2021). Centered Partition Processes: Informative Priors for Clustering. Bayesian Analysis 16 (1), 301–370.
61. Park J-H and Dunson DB (2010). Bayesian Generalized Product Partition Model. Statistica Sinica, 1203–1226.
62. Petrone S, Guindani M, and Gelfand AE (2009). Hybrid Dirichlet Mixture Models for Functional Data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (4), 755–782.
63. Quintana FA and Iglesias PL (2003). Bayesian Clustering and Product Partition Models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (2), 557–574.
64. Rasmussen C, Bernard J, Ghahramani Z, and Wild DL (2008). Modeling and Visualizing Uncertainty in Gene Expression Clusters Using Dirichlet Process Mixtures. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6 (4), 615–628.
65. Ren L, Du L, Carin L, and Dunson DB (2011). Logistic Stick-Breaking Process. Journal of Machine Learning Research 12 (1).
66. Rigon T, Herring AH, and Dunson DB (2023). A Generalized Bayes Framework for Probabilistic Clustering. Biometrika, 1–14.
67. Rodríguez CE and Walker SG (2014). Univariate Bayesian Nonparametric Mixture Modeling With Unimodal Kernels. Statistics and Computing 24 (1), 35–49.
68. Rohe K, Chatterjee S, and Yu B (2011). Spectral Clustering and the High-Dimensional Stochastic Blockmodel. The Annals of Statistics 39 (4), 1878–1915.
69. Rolls ET, Huang C-C, Lin C-P, Feng J, and Joliot M (2020). Automated Anatomical Labelling Atlas 3. Neuroimage 206, 116189.
70. Schild A (2018). An Almost-Linear Time Algorithm for Uniform Random Spanning Tree Generation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 214–227.
71. Shi D, Zhang H, Wang S, Wang G, and Ren K (2021). Application of Functional Magnetic Resonance Imaging in the Diagnosis of Parkinson’s Disease: A Histogram Analysis. Frontiers in Aging Neuroscience 13, 624731.
72. Shi T, Belkin M, and Yu B (2009). Data Spectroscopy: Eigenspaces of Convolution Operators and Clustering. The Annals of Statistics, 3960–3984.
73. Snijders TA and Nowicki K (1997). Estimation and Prediction for Stochastic Blockmodels for Graphs With Latent Block Structure. Journal of Classification 14 (1), 75–100.
74. Socher R, Maas A, and Manning C (2011). Spectral Chinese Restaurant Processes: Nonparametric Clustering Based on Similarities. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 698–706. JMLR Workshop and Conference Proceedings.
75. Teh YW, Jordan MI, Beal MJ, and Blei DM (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101 (476), 1566–1581.
76. Von Luxburg U (2007). A Tutorial on Spectral Clustering. Statistics and Computing 17 (4), 395–416.
77. Wade S and Ghahramani Z (2018). Bayesian Cluster Analysis: Point Estimation and Credible Balls. Bayesian Analysis 13 (2), 559–626.
78. Wu S, Feng X, and Zhou W (2014). Spectral Clustering of High-Dimensional Data Exploiting Sparse Representation Vectors. Neurocomputing 135, 229–239.
79. Yu Y, Wang T, and Samworth RJ (2015). A Useful Variant of the Davis–Kahan Theorem for Statisticians. Biometrika 102 (2), 315–323.
80. Zelnik-Manor L and Perona P (2005). Self-Tuning Spectral Clustering. In Advances in Neural Information Processing Systems, Volume 17.
81. Zeng C, Miller JW, and Duan LL (2023). Consistent Model-Based Clustering Using the Quasi-Bernoulli Stick-Breaking Process. Journal of Machine Learning Research 24, 1–32.

Supplementary Materials

Supp 1