Published in final edited form as: J Am Stat Assoc. 2023 Sep 29;119(547):2140–2153. doi: 10.1080/01621459.2023.2250098

Spectral Clustering, Bayesian Spanning Forest, and Forest Process

Leo L Duan a,*, Arkaprava Roy b; Alzheimer’s Disease Neuroimaging Initiative

Abstract

Spectral clustering views the similarity matrix as a weighted graph, and partitions the data by minimizing a graph-cut loss. Since it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster. As a result, one reduces the chance of model misspecification, which is often a risk in mixture model-based clustering. Nevertheless, compared to the latter, spectral clustering has no direct ways of quantifying the clustering uncertainty (such as the assignment probability), or allowing easy model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors, as the ones used by normalized spectral clustering. To induce a distribution for the forest, we develop a “forest process” as a graph extension to the urn process, while we carefully characterize the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation, and demonstrate superior performance compared to existing algorithms. We illustrate several model-based extensions useful for data applications, including high-dimensional and multi-view clustering for images.

Keywords: Graphical Model Clustering, Model-based Clustering, Normalized Graph-cut, Partition Probability Function

1. Introduction

Clustering aims to partition data $y_1,\dots,y_n$ into disjoint groups. There is a large literature ranging from various algorithms such as K-means and DBSCAN (MacQueen, 1967; Ester et al., 1996; Frey and Dueck, 2007) to mixture model-based approaches [reviewed by Fraley and Raftery (2002)]. In the Bayesian community, model-based approaches are especially popular. To roughly summarize the idea, we view each $y_i$ as generated from a distribution $\mathcal{K}(\theta_i)$, where $(\theta_1,\dots,\theta_n)$ are drawn from a discrete distribution $\sum_{k=1}^K w_k \delta_{\theta_k^*}(\cdot)$, with $w_k$ as the probability weight, and $\delta_{\theta_k^*}$ as a point mass at $\theta_k^*$. With prior distributions, we could estimate all the unknown parameters (the $\theta_k^*$'s, the $w_k$'s, and $K$) from the posterior.

Model-based clustering has two important advantages. First, it allows important uncertainty quantification such as the cluster-assignment probability $\Pr(c_i = k \mid y_i)$, a probabilistic estimate that $y_i$ comes from the $k$th cluster (that is, $c_i = k$ when $\theta_i = \theta_k^*$). Different from commonly seen asymptotic results in statistical estimation, the clustering uncertainty does not always vanish even as $n \to \infty$. For example, in a two-component Gaussian mixture model with equal covariance, for a point $y_i$ at nearly equal distances to the two cluster centers, we would have both $\Pr(c_i = 1 \mid y_i)$ and $\Pr(c_i = 2 \mid y_i)$ close to 50% even as $n \to \infty$. For a recent discussion on this topic as well as how to quantify the partition uncertainty, see Wade and Ghahramani (2018) and the references within. Second, model-based clustering can be easily extended to handle more complicated modeling tasks. Specifically, since there is a probabilistic process associated with the clustering, it is straightforward to modify it to include useful dependency structures. We list a few examples from a rich literature: Ng et al. (2006) used a mixture model with random effects to cluster correlated gene-expression data; Müller and Quintana (2010), Park and Dunson (2010), and Ren et al. (2011) allowed the partition to vary according to some covariates; Guha and Baladandayuthapani (2016) simultaneously clustered the predictors and used them in high-dimensional regression.

On the other hand, model-based clustering has its limitations. Primarily, one needs to carefully specify the density/mass function $\mathcal{K}$; otherwise the results can be misleading and difficult to interpret. For example, Coretto and Hennig (2016) demonstrated the sensitivity of the Gaussian mixture model to non-Gaussian contaminants, and Miller and Dunson (2018) and Cai et al. (2021) showed that when the distribution family of $\mathcal{K}$ is misspecified, the number of clusters can be severely overestimated. It is natural to think of using a more flexible parameterization for $\mathcal{K}$, in order to mitigate the risk of model misspecification. This has motivated many interesting works, such as modeling $\mathcal{K}$ via skewed distributions (Frühwirth-Schnatter and Pyne, 2010; Lee and McLachlan, 2016), unimodal distributions (Rodríguez and Walker, 2014), copulas (Kosmidis and Karlis, 2016), and mixtures of mixtures (Malsiner-Walli et al., 2017), among others. Nevertheless, as the flexibility of $\mathcal{K}$ increases, the modeling and computational burdens also increase dramatically.

In parallel to the above advancements in model-based clustering, spectral clustering has become very popular in machine learning and statistics. Von Luxburg (2007) provided a useful tutorial on the algorithms and a review of recent works. On clustering point estimation, spectral clustering has shown good empirical performance for separating non-Gaussian and/or manifold data, without the need to directly specify the distribution for each cluster. Instead, one calculates a matrix of similarity scores between each pair of data, then uses a simple algorithm to find a partition that approximately minimizes the total loss of similarity scores across clusters (adjusted with respect to cluster sizes). This point estimate is found to be not very sensitive to the choice of similarity score, and empirical solutions have been proposed for tuning the similarity and choosing the number of clusters (Zelnik-Manor and Perona, 2005; Shi et al., 2009). There is a rapidly growing literature of frequentist methods on further improving the point estimate [ Chi et al. (2007); Rohe et al. (2011); Kumar et al. (2011); Lei and Rinaldo (2015); Han et al. (2021); Lei and Lin (2022); among others], although, in this article, we focus on the Bayesian perspective and aim to characterize the probability distribution.

Due to the algorithmic nature, spectral clustering cannot be directly used in model-based extension, or produce uncertainty quantification. This has motivated a large Bayesian literature. There have been several works trying to quantify the uncertainty around the spectral clustering point estimate. For example, since the spectral clustering algorithm can be used to estimate the community memberships in a stochastic block model, one could transform the data into a similarity matrix, then treat it as if generated from a Bayesian stochastic block model (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001; McDaid et al., 2013; Geng et al., 2019). Similarly, one could take the Laplacian matrix (a transform of the similarity used in spectral clustering) or its spectral decomposition, and model it in a probabilistic framework (Socher et al., 2011; Duan et al., 2023).

Broadly speaking, we can view these works as following the recent trend of robust Bayesian methodology, in conditioning the parameter of interest (clustering) on an insufficient statistic (pairwise summary statistics) of the data. See Lewis et al. (2021) for recent discussions. Pertaining to Bayesian robust clustering, one gains model robustness by avoiding putting any parametric assumption on within-cluster distribution 𝒦(θk); instead, one models the pairwise information that often has an arguably simple distribution. Recent works include the distance-based Pólya urn process (Blei and Frazier, 2011; Socher et al., 2011), Dirichlet process mixture model on Laplacian eigenmaps (Banerjee et al., 2015), Bayesian distance clustering (Duan and Dunson, 2021a), generalized Bayes extension of product partition model (Rigon et al., 2023).

This article follows this trend. Instead of modeling the $y_i$'s as conditionally independent (or jointly dependent) given a certain within-cluster distribution $\mathcal{K}(\theta_k^*)$, we choose to model $y_i$ as dependent on another point $y_j$ that is close by, provided $y_i$ and $y_j$ are from the same cluster. This leads to a Markov graphical model based on a spanning forest, a graph consisting of multiple disjoint spanning trees (each tree being a connected subgraph without cycles). The spanning forest itself is not new to statistics. There has been a large literature on using spanning trees and forests for graph estimation, such as Meila and Jordan (2000); Meilă and Jaakkola (2006); Edwards et al. (2010); Byrne and Dawid (2015); Duan and Dunson (2021b); Luo et al. (2021). Nevertheless, a key difference between graph estimation and graph-based clustering is that the former aims to recover both the node partition and the edges characterizing dependencies, while the latter focuses only on estimating the node partition (equivalent to clustering). Therefore, a distinction of our study is that we treat the edges as a nuisance parameter/latent variable, while characterizing the node partition in the marginal distribution.

Importantly, we formally show that by marginalizing out the randomness of the edges, the point estimate of the node partition is provably close to the one from the normalized spectral clustering algorithm. As a result, the spanning forest model can serve as the probabilistic model for the spectral clustering algorithm — this relationship is analogous to the one between the Gaussian mixture model and the K-means algorithm (MacQueen, 1967). Further, we show that treating the spanning forest as random, as opposed to a fixed but unknown parameter, leads to much less sensitivity in clustering performance, compared to cutting the minimum spanning tree (Gower and Ross, 1969). For the distribution specification of the nodes and edges, we take a Bayesian nonparametric approach by considering the forest model as realized from a "forest process" — each cluster is initiated with a point from a root distribution, then gradually grown with new points from a leaf distribution. We characterize the key differences in the partition distribution between the forest and classic Pólya urn processes. This difference also reveals that extra care should be exerted during model specification when using graphical models for clustering. Lastly, by establishing the probabilistic model counterpart for spectral clustering, we show how such models can be easily extended to incorporate other dependency structures. We demonstrate several extensions, including a multi-subject clustering of brain networks and a high-dimensional clustering of photo images.

2. Method

2.1. Background on Spectral Clustering Algorithms

We first provide a brief review of spectral clustering algorithms. For data $y_1,\dots,y_n$, let $A_{i,j} \ge 0$ be a similarity score between $y_i$ and $y_j$, and denote the degree by $D_{i,i} = \sum_{j \ne i} A_{i,j}$. To partition the data index set $(1,\dots,n)$ into $K$ sets, $\mathcal{V} = (V_1,\dots,V_K)$, we want to solve the following problem:

\min_{\mathcal{V}} \sum_{k=1}^{K} \frac{\sum_{i \in V_k,\, j \notin V_k} A_{i,j}}{\sum_{i \in V_k} D_{i,i}}.    (1)

This is known as the minimum normalized cut loss. The numerator above represents the across-cluster similarity lost by cutting $V_k$ off from the others, and the denominator prevents trivial solutions that form tiny clusters with small $\sum_{i \in V_k} D_{i,i}$.

This optimization problem is combinatorial, which has motivated approximate solutions such as spectral clustering. To start, using the Laplacian matrix $L = D - A$, with $D$ the diagonal matrix of the $D_{i,i}$'s, and the normalized Laplacian $N = D^{-1/2} L D^{-1/2}$, we can equivalently solve the above problem via:

\min_{\mathcal{V}} \ \mathrm{tr}(Z_{\mathcal{V}}^{\top} N Z_{\mathcal{V}}),

where $Z_{\mathcal{V}:i,k} = 1(i \in V_k)\sqrt{D_{i,i} / \sum_{j \in V_k} D_{j,j}}$. It is not hard to verify that $Z_{\mathcal{V}}^{\top} Z_{\mathcal{V}} = I_K$. We can obtain a relaxed minimizer over $\{Z : Z^{\top} Z = I_K\}$ by simply taking $\hat{Z}$ as the bottom $K$ eigenvectors of $N$ (with the minimum loss equal to the sum of the smallest $K$ eigenvalues). Afterward, we cluster the rows of $\hat{Z}$ into $K$ groups (using algorithms such as K-means), hence producing an approximate solution to (1).
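To make the procedure concrete, below is a minimal sketch of normalized spectral clustering in Python, assuming a Gaussian similarity with a user-chosen bandwidth; this similarity is an illustrative placeholder rather than the specification used later in this article.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_spectral_clustering(y, K, bandwidth=1.0):
    # Pairwise similarity A_{i,j} = exp(-||y_i - y_j||^2 / (2 * bandwidth^2)).
    sq_dists = ((y[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(A, 0.0)

    D = A.sum(axis=1)                          # degrees D_{i,i}
    L = np.diag(D) - A                         # unnormalized Laplacian L = D - A
    D_inv_sqrt = np.diag(1.0 / np.sqrt(D))
    N = D_inv_sqrt @ L @ D_inv_sqrt            # normalized Laplacian N = D^{-1/2} L D^{-1/2}

    # Bottom K eigenvectors of N (eigh returns eigenvalues in ascending order).
    _, eigvecs = eigh(N)
    Z_hat = eigvecs[:, :K]

    # Cluster the rows of Z_hat to obtain an approximate normalized-cut solution.
    return KMeans(n_clusters=K, n_init=10).fit_predict(Z_hat)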

To clarify, there is more than one version of the spectral clustering algorithm. An alternative to (1) is the "minimum ratio cut", which replaces the denominator $\sum_{i \in V_k} D_{i,i}$ by the cluster size $|V_k|$. Similarly, a continuous relaxation can be obtained by following the same procedure as above, except that one clusters the eigenvectors of the unnormalized $L$. Details comparing these two versions can be found in Von Luxburg (2007). In this article, we focus on the version based on (1) and the normalized Laplacian matrix $N$, which is commonly referred to as "normalized spectral clustering".

2.2. Probabilistic Model via Bayesian Spanning Forest

The next question is whether there is a partition-based generative model for $y$ whose maximum likelihood estimate (or posterior mode, in the Bayesian framework) is almost the same as the point estimate from normalized spectral clustering.

We find an almost-equivalent counterpart in the spanning forest model. A spanning forest model is a special Bayesian network that describes the conditional dependencies among $y_1,\dots,y_n$. Given a partition $\mathcal{V} = (V_1,\dots,V_K)$ of the data index set $(1,\dots,n)$, consider a forest graph $F_{\mathcal{V}} = (T_1,\dots,T_K)$, with each $T_k = (V_k, E_k)$ a component tree (a connected subgraph without cycles), $V_k$ the set of nodes, and $E_k$ the set of edges among $V_k$. Using $F_{\mathcal{V}}$ and a set of root nodes $R_{\mathcal{V}} = (1^*,\dots,K^*)$ with $k^* \in V_k$, we can form a graphical model with a conditional likelihood given the forest:

\mathcal{L}(y; \mathcal{V}, F_{\mathcal{V}}, R_{\mathcal{V}}, \theta) = \prod_{k=1}^{K} \Big[ r(y_{k^*}; \theta) \prod_{(i,j) \in T_k} f(y_i \mid y_j; \theta) \Big],    (2)

where we refer to $r(\cdot; \theta)$ as a "root" distribution and $f(\cdot \mid y_j; \theta)$ as a "leaf" distribution; we use $\theta$ to denote the other parameters, and the simplified notation $(i,j) \in G$ to mean that $(i,j)$ is an edge of the graph $G$. Figure 1 illustrates the high flexibility of a spanning forest in representing clusters. It shows sampled spanning forests based on three clustering benchmark datasets. Note that some clusters are not elliptical or convex in shape. Rather, each cluster can be imagined as formed by connecting each point to another nearby point. In the Supplementary Materials S4.8, we show two different realizations of the spanning forest.

Fig. 1. Three examples of clusters that can be represented by a spanning forest.

Remark 1. To clarify, point estimation of a spanning forest (as some fixed and unknown graph) has been studied (Gower and Ross, 1969). However, a distinction here is that we consider $\mathcal{V}$ as the parameter of interest, but the edges and roots $(F_{\mathcal{V}}, R_{\mathcal{V}})$ as latent variables. The performance differences are shown in the Supplementary Materials S4.6.

The stochastic view of $(F_{\mathcal{V}}, R_{\mathcal{V}})$ is important, as it allows us to incorporate the uncertainty of the edges and avoids the sensitivity issue of a point graph estimate. Equivalently, our clustering model is based on the marginal likelihood that varies with the node partition $\mathcal{V}$:

\mathcal{L}(y; \mathcal{V}, \theta) = \sum_{F_{\mathcal{V}}, R_{\mathcal{V}}} \mathcal{L}(y; \mathcal{V}, F_{\mathcal{V}}, R_{\mathcal{V}}, \theta)\, \Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V}),    (3)

where $\Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V})$ is the latent variable distribution that we will specify in the next section. We can quantify the marginal connecting probability for each potential edge $(i,j)$:

M_{i,j} \equiv \Pr[F_{\mathcal{V}} \ni (i,j)] \propto \sum_{\mathcal{V}} \sum_{F_{\mathcal{V}}, R_{\mathcal{V}}} 1[(i,j) \in F_{\mathcal{V}}]\, \mathcal{L}(y; \mathcal{V}, F_{\mathcal{V}}, R_{\mathcal{V}}, \theta)\, \Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V}).    (4)

Similar to the normalized graph cut, there is no closed-form solution for directly maximizing (3). However, a closed form does exist for (4) (see Section 4). Therefore, an approximate maximizer of (3), $\hat{\mathcal{V}}$, can be obtained by computing the matrix $M$ and searching for $K$ diagonal blocks (after row and column index permutation) that contain the highest total values of the $M_{i,j}$'s. Specifically, we can extract the top leading eigenvectors of $M$ and cluster the rows into $K$ groups.

This approximate marginal likelihood maximizer produces almost the same estimate as normalized spectral clustering does, because the two sets of eigenvectors are almost the same. Further, it is important to clarify that such closeness does not depend on how the data are really generated. Therefore, to provide some numerical evidence, for simplicity, we generate $y_i$ from a simple three-component Gaussian mixture in $\mathbb{R}^2$ with means at $(0,0)$, $(2,2)$, $(4,4)$ and covariance $I_2$ for each component. Figure 2 compares the eigenvectors of the matrix $M$ and the normalized Laplacian $N$ (which uses $f$ and $r$ to specify $A$, with details provided in Section 4). Clearly, these two are almost identical in values. Due to this connection, the clustering estimate from spectral clustering can be viewed as an approximate estimate for $\mathcal{V}$ in (3).

Fig. 2. Comparing the eigenvectors of the marginal connecting probability matrix $M$ and those of the normalized Laplacian $N$.

We now fully specify the Bayesian forest model. For simplicity, we focus on continuous $y_i \in \mathbb{R}^p$. For ease of computation, we recommend choosing $f$ as a symmetric function, $f(y_i \mid y_j; \theta) = f(y_j \mid y_i; \theta)$, so that the likelihood is invariant to the direction of each edge; and choosing $r$ as a diffuse density, so that the likelihood is less sensitive to the choice of a node as root. In this article, we choose a Gaussian density for $f$ and a Cauchy density for $r$:

f(y_i \mid y_j; \theta) = (2\pi\sigma_{i,j})^{-p/2} \exp\Big\{ -\frac{\|y_i - y_j\|_2^2}{2\sigma_{i,j}} \Big\}, \qquad r(y_i; \theta) = \frac{\Gamma[(1+p)/2]}{\gamma^p \pi^{(1+p)/2}} \frac{1}{(1 + \|y_i - \mu\|_2^2/\gamma^2)^{(1+p)/2}},    (5)

where $\sigma_{i,j} > 0$ and $\gamma > 0$ are scale parameters. As the magnitudes of distances between neighboring points may differ significantly from cluster to cluster, we use a local parameterization $\sigma_{i,j} = \tilde\sigma_i \tilde\sigma_j$, and will regularize $(\tilde\sigma_1,\dots,\tilde\sigma_n)$ via a hyper-prior.
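To make this specification concrete, here is a minimal sketch (in Python) of the log leaf and root densities in (5) and the conditional log-likelihood (2) given a forest; the edge-list and root-list representation of the forest is our own illustrative convention, not the paper's data structure.

import numpy as np
from scipy.special import gammaln

def log_leaf(y_i, y_j, sigma_ij):
    # Gaussian leaf density f(y_i | y_j; sigma_{i,j}) from (5), on the log scale.
    p = y_i.shape[0]
    return -0.5 * p * np.log(2 * np.pi * sigma_ij) - np.sum((y_i - y_j) ** 2) / (2 * sigma_ij)

def log_root(y_i, mu, gamma):
    # Multivariate Cauchy root density r(y_i; theta) from (5), on the log scale.
    p = y_i.shape[0]
    return (gammaln((1 + p) / 2) - p * np.log(gamma) - (1 + p) / 2 * np.log(np.pi)
            - (1 + p) / 2 * np.log1p(np.sum((y_i - mu) ** 2) / gamma ** 2))

def forest_loglik(y, roots, edges, sigma_tilde, mu, gamma):
    # Conditional log-likelihood (2): one root term per tree, one leaf term per edge.
    ll = sum(log_root(y[k], mu, gamma) for k in roots)
    ll += sum(log_leaf(y[i], y[j], sigma_tilde[i] * sigma_tilde[j]) for (i, j) in edges)
    return ll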

Remark 2. In (5), we effectively use the Euclidean distance $\|y_i - y_j\|_2$. We focus on the Euclidean distance in the main text, for simplicity of presentation and to allow a complete specification of priors. One can replace the Euclidean distance with others, such as the Mahalanobis distance or a geodesic distance. We present a case of high-dimensional clustering based on the geodesic distance on the unit sphere in the Supplementary Materials S1.1.

2.3. Forest Process and Product Partition Prior

To simplify notation and to facilitate computation, we now introduce an auxiliary node 0 that connects to all roots $(1^*,\dots,K^*)$. As a result, the model can be equivalently represented by a spanning tree rooted at 0:

\mathcal{T} = (V_{\mathcal{T}}, E_{\mathcal{T}}), \quad V_{\mathcal{T}} = \{0\} \cup V_1 \cup \cdots \cup V_K, \quad E_{\mathcal{T}} = \{(0, 1^*),\dots,(0, K^*)\} \cup E_1 \cup \cdots \cup E_K.

In this section, we focus on the distribution specification for $\mathcal{T}$. The distribution, denoted by $\Pi(\mathcal{T})$, can be factorized according to the following hierarchy: picking the number of partition sets $K$, partitioning the nodes into $(V_1,\dots,V_K)$, then forming the edges $E_k$ and picking one root $k^*$ for each $V_k$. To be clear on the nomenclature, we call $\Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V})$ the "latent variable distribution" and $\Pi_0(\mathcal{V})$ the "partition prior".

\Pi(\mathcal{T}) = \underbrace{\{\Pi_0(K)\, \Pi_0(V_1,\dots,V_K \mid K)\}}_{\Pi_0(\mathcal{V})} \; \prod_{k=1}^{K} \underbrace{\{\Pi(E_k \mid V_k)\, \Pi(k^* \mid E_k, V_k)\}}_{\Pi(F_{\mathcal{V}}, R_{\mathcal{V}} \mid \mathcal{V})}.    (6)

Remark 3. In the Bayesian nonparametric literature, $\Pi_0(K)\, \Pi_0(V_1,\dots,V_K \mid K)$ is known as the partition probability function, which plays the key role in controlling cluster sizes and the number of clusters in model-based clustering. However, when it comes to graphical model-based clustering (such as our forest model), it is important to note the difference: for each partition set $V_k$, there is an additional probability $\Pi(E_k, k^* \mid V_k)$ due to the multiplicity of all possible subgraphs formed among the nodes in $V_k$.

For simplicity, we will use a discrete uniform distribution for $\Pi(E_k, k^* \mid V_k)$. Since there are $n_k^{(n_k - 2)_+}$ possible spanning trees on $n_k$ nodes [where $(x)_+ = x$ if $x > 0$, and $0$ otherwise] and $n_k$ possible choices of root, we have $\Pi(E_k, k^* \mid V_k) = n_k^{-(n_k - 1)}$.

We now discuss two different ways to complete the distribution specification. We first take a "ground-up" approach by viewing $\mathcal{T}$ as arising from a stochastic process in which the number of nodes $n$ could grow indefinitely. Starting from the first edge $e_1 = (0, 1)$, we sequentially draw new edges and add them to $\mathcal{T}$, according to

e_i \mid e_1,\dots,e_{i-1} \sim \sum_{j=1}^{i-1} \pi_j^{[i]} \delta_{(j,i)}(\cdot) + \pi_i^{[i]} \delta_{(0,i)}(\cdot), \qquad y_i \mid (j, i) \sim 1(j \ge 1)\, f(\cdot \mid y_j) + 1(j = 0)\, r(\cdot),    (7)

with some probability vector $(\pi_1^{[i]},\dots,\pi_i^{[i]})$ that adds up to one. We refer to (7) as a forest process. The forest process is a generalization of the Pólya urn process (Blackwell and MacQueen, 1973). For the latter, $e_i = (j, i)$ would make node $i$ take the same value as node $j$, $y_i = y_j$ [although in model-based clustering, one would use the notation $\theta_i = \theta_j$, and $y_i \sim \mathcal{K}(\theta_i)$]; $e_i = (0, i)$ would make node $i$ draw a new value for $y_i$ from the base distribution. Due to this relationship, we can borrow popular parameterizations for $\pi_j^{[i]}$ from the urn process literature. For example, we can use the Chinese restaurant process parameterization $\pi_j^{[i]} = 1/(i - 1 + \alpha)$ for $j = 1,\dots,(i-1)$, and $\pi_i^{[i]} = \alpha/(i - 1 + \alpha)$ with some chosen $\alpha > 0$. After marginalizing over the order of $i$ and the partition index [see Miller (2019) for a simplified proof of the partition function], we obtain:

\Pi(\mathcal{T}) = \frac{\alpha^K \Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{k=1}^{K} \Gamma(n_k)\, n_k^{-(n_k - 1)}.    (8)

Compared to the partition probability prior in the Chinese restaurant process, we have an additional $n_k^{-(n_k - 1)}$ term that corresponds to the conditional prior weight for each possible $(k^*, E_k)$ given a partition set $V_k$.

To help understand the effect of this additional term on the posterior, we can imagine two extreme possibilities for the conditional likelihood given a $V_k$. If the conditional likelihood $\mathcal{L}(y_i: i \in V_k \mid k^*, E_k)$ is skewed toward one particular choice of tree $(\hat{k}^*, \hat{E}_k)$ [that is, it is large when $(k^*, E_k) = (\hat{k}^*, \hat{E}_k)$, but close to zero for other values of $(k^*, E_k)$], then $n_k^{-(n_k - 1)}$ acts as a penalty for a lack of diversity in trees. On the other hand, if $\mathcal{L}(y_i: i \in V_k \mid k^*, E_k)$ is equal for all possible $(k^*, E_k)$'s, then we can simply marginalize over $(k^*, E_k)$ and not be subject to this penalty [since $\sum_{(k^*, E_k)} n_k^{-(n_k - 1)} = 1$].

Therefore, we can form an intuition by interpolating those two extremes: if a set of data points (of size nk) are “well-knit” such that they can be connected via many possible spanning trees (each with a high conditional likelihood), then it would have a higher posterior probability of being clustered together, compared to some other points (of the same size nk) that have only a few trees with high conditional likelihood.
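As a small illustration of the "ground-up" construction, the following Python sketch draws the edges and cluster labels of an augmented tree from the forest process (7) with the Chinese-restaurant-process weights; no data are generated here, and the function name is ours.

import numpy as np

def simulate_forest_process(n, alpha=1.0, seed=None):
    rng = np.random.default_rng(seed)
    edges = [(0, 1)]            # the first edge e_1 = (0, 1) starts the first tree
    labels = {1: 0}             # node 1 opens cluster 0
    n_clusters = 1
    for i in range(2, n + 1):
        # pi_j^{[i]} = 1/(i-1+alpha) for existing nodes j; pi_i^{[i]} = alpha/(i-1+alpha) for node 0.
        probs = np.full(i, 1.0 / (i - 1 + alpha))
        probs[-1] = alpha / (i - 1 + alpha)
        j = int(rng.choice(np.arange(1, i + 1), p=probs))   # drawing j = i encodes attaching to node 0
        if j == i:
            edges.append((0, i))                            # start a new tree (new cluster)
            labels[i] = n_clusters
            n_clusters += 1
        else:
            edges.append((j, i))                            # attach to an existing node
            labels[i] = labels[j]
    return edges, labels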

While the "ground-up" construction is useful for understanding the difference from the classic urn process, the distribution (8) itself is not very convenient for posterior computation. Therefore, we also explore the alternative of a "top-down" approach. This is based on directly assigning a product partition probability (Hartigan, 1990; Barry and Hartigan, 1993; Crowley, 1997; Quintana and Iglesias, 2003) as

\Pi_0(V_1,\dots,V_K \mid K) = \frac{\prod_{k=1}^{K} n_k^{(n_k - 1)}}{\sum_{\text{all } (V_1,\dots,V_K)} \prod_{k=1}^{K} |V_k|^{(|V_k| - 1)}},    (9)

where the cohesion function $n_k^{(n_k - 1)}$ effectively cancels out the probability for each $(k^*, E_k)$. To assign a prior for $K$, we use the probability

\Pi_0(K) \propto \lambda^K \sum_{\text{all } (V_1^*,\dots,V_K^*)} \prod_{k=1}^{K} |V_k^*|^{(|V_k^*| - 1)},

supported on $K \in \{1,\dots,n\}$ with $\lambda > 0$. Together with $\Pi(E_k, k^* \mid V_k) = n_k^{-(n_k - 1)}$, multiplying the terms according to (6) leads to

\Pi(\mathcal{T}) \propto \lambda^K,    (10)

which is similar to a truncated geometric distribution and easy to handle in posterior computation; we will use this specification from now on. In this article, we set $\lambda = 0.5$.

Remark 4. We now discuss the exchangeability of the sequence of random variables generated from the above forest process. Exchangeability is defined as the invariance of the distribution, $\Pi(X_1 = x_1,\dots,X_n = x_n) = \Pi(X_1 = x_{\tilde\pi_1},\dots,X_n = x_{\tilde\pi_n})$, under any permutation $(\tilde\pi_1,\dots,\tilde\pi_n)$ (Diaconis, 1977). For simplicity, we focus on the joint distribution with $\theta$ given, and hence omit $\theta$ here. There are three categories of random variables associated with each node $i$: the first drawn edge $(j,i)$ that points to a new node $i$ [whose sequence forms $\mathcal{T} = (\mathcal{V}, \{E_k, k^*\}_{k=1}^K)$], the cluster assignment $c_i$ of a node (whose sequence forms $\mathcal{V}$), and the data point $y_i$. It is not hard to see that, since each component tree encodes an order among $\{i : c_i = k\}$, the joint distribution of the data and the forest, $\Pi(y_1,\dots,y_n, \mathcal{T})$, is not exchangeable. Nevertheless, as we marginalize out each $(E_k, k^*)$ to form the clustering likelihood $\mathcal{L}(y; \mathcal{V})$ in (3), and all priors $\Pi_0(\mathcal{V})$ presented in this section only depend on the number and sizes of clusters, the joint distribution of the data and cluster labels, $\Pi\{(y_1, c_1),\dots,(y_n, c_n)\} = \mathcal{L}(y; \mathcal{V})\, \Pi_0(\mathcal{V})$, is exchangeable, with its form provided soon in (14). Lastly, we see that $\Pi(y_1,\dots,y_n)$ is exchangeable after marginalizing over $\mathcal{V}$.

2.4. Hyper-priors for the Other Parameters

We now specify the hyper-priors for the parameters in the root and leaf densities. To avoid model sensitivity to scaling and shifting of the data, we assume that the data have been appropriately scaled and centered (for example, via standardization), so that marginally $\mathbb{E}\, y \approx 0$ and $\mathbb{E}\, \|y_{\cdot,j} - \mathbb{E}\, y_{\cdot,j}\|_2^2 \approx 1$ for $j = 1,\dots,p$. To make the root density $r(\cdot)$ close to a small constant on the support of the data, we set $\mu = 0$ and $\gamma^2 \sim \text{Inverse-Gamma}(2, 1)$.

For $\sigma_{i,j}$ in the leaf density $f(y_i \mid y_j; \sigma_{i,j})$, in order to likely pick an edge $(i,j)$ with $j$ a close neighbor of $i$ [that is, $(i,j)$ with small $\|y_i - y_j\|_2$], we want most of the $\sigma_{i,j} = \tilde\sigma_i \tilde\sigma_j$ to be small. We use the following hierarchical inverse-gamma prior, which shrinks each $\tilde\sigma_i$ while using a common scale hyper-parameter $\beta_\sigma$ to borrow strength among the $\tilde\sigma_i$'s:

\beta_\sigma \sim \text{Exponential}(\eta_\sigma), \qquad \eta_\sigma \sim \text{Inverse-Gamma}(a_\sigma, \xi_\sigma), \qquad \tilde\sigma_i \overset{iid}{\sim} \text{Inverse-Gamma}(b_\sigma, \beta_\sigma) \ \text{ for } i = 1,\dots,n,

where $\eta_\sigma$ is the scale parameter of the exponential. To induce a shrinkage effect a priori, we use $a_\sigma = 100$ and $\xi_\sigma = 1$ for a likely small $\eta_\sigma$, hence a small $\beta_\sigma$. Further, we note that the coefficient of variation $\sqrt{\mathrm{Var}(\tilde\sigma_i \mid \beta_\sigma)}/\mathbb{E}(\tilde\sigma_i \mid \beta_\sigma) = 1/\sqrt{b_\sigma - 2}$; therefore, we set $b_\sigma = 10$ to have most of the $\tilde\sigma_i$ near $\mathbb{E}(\tilde\sigma_i \mid \beta_\sigma) = \beta_\sigma/(b_\sigma - 1)$ in the prior. We use these hyper-prior settings in all the examples presented in this article.
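For intuition about the shrinkage these settings induce, the following sketch simply simulates from the hyper-prior with the stated values ($a_\sigma = 100$, $\xi_\sigma = 1$, $b_\sigma = 10$); it is a prior check only, not part of the inference algorithm.

import numpy as np

rng = np.random.default_rng(0)
a_sigma, xi_sigma, b_sigma, n = 100, 1.0, 10, 300

# eta_sigma ~ Inverse-Gamma(a_sigma, xi_sigma), drawn as the reciprocal of a Gamma variable.
eta_sigma = 1.0 / rng.gamma(shape=a_sigma, scale=1.0 / xi_sigma)
# beta_sigma ~ Exponential with scale parameter eta_sigma.
beta_sigma = rng.exponential(scale=eta_sigma)
# sigma_tilde_i ~ iid Inverse-Gamma(b_sigma, beta_sigma).
sigma_tilde = 1.0 / rng.gamma(shape=b_sigma, scale=1.0 / beta_sigma, size=n)

# The coefficient of variation is close to 1/sqrt(b_sigma - 2) ≈ 0.35, and the
# draws concentrate near beta_sigma / (b_sigma - 1).
print(beta_sigma, sigma_tilde.mean(), sigma_tilde.std() / sigma_tilde.mean())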

In addition, Zelnik-Manor and Perona (2005) demonstrate good empirical performance in spectral clustering via setting σ˜i to a low order statistic of the distances to yi. We show a model-based formalization with similar effects in the Supplementary Materials S5.

2.5. Model-based Extensions

Compared to algorithms, a major advantage of probabilistic models is the ease of building useful model-based extensions. We demonstrate three directions for extending the Bayesian forest model. Due to the page constraint, we defer the details and numerical results of these extensions to the Supplementary Materials S1.1, S1.2 and S1.3.

Latent Forest Model:

First, one could use the realization of the forest process as latent variables in another model for data $(y_1,\dots,y_n)$,

z_1,\dots,z_n \sim \text{ForestModel}(\mathcal{T}; \theta_z), \qquad y_1,\dots,y_n \sim \mathcal{L}(z_1,\dots,z_n; \theta_y),

where $\theta_z$ and $\theta_y$ denote the other needed parameters. For example, for clustering high-dimensional data such as images, it is often necessary to represent each high-dimensional observation $y_i$ by a low-dimensional coordinate $z_i$ (Wu et al., 2014; Chandra et al., 2023). In the Supplementary Materials, we present a high-dimensional clustering model, using an autoregressive matrix Gaussian for the model of $y$ given $z$ and a sparse von Mises-Fisher distribution for the forest model.

Informative Prior-Latent Variable Distribution:

Second, in applications it is sometimes desirable to have the clustering dependent on some external information x, such as covariates (Müller et al., 2011) or an existing partition (Paganin et al., 2021). From a Bayesian view, this can be achieved via taking an x-informative distribution:

\mathcal{T} \sim \Pi(\cdot \mid x), \qquad y_1,\dots,y_n \sim \text{ForestModel}(\mathcal{T}; \theta).

In the Supplementary Materials, we illustrate an extension that incorporates a covariate-dependent product partition model [PPMx; Müller et al. (2011)] into the distribution of $\mathcal{T}$.

Hierarchical Multi-view Clustering:

Third, for multi-subject data $(y_1^{(s)},\dots,y_n^{(s)})$ with $s = 1,\dots,S$, we want to find a clustering for every $s$. At the same time, we can borrow strength among subjects by letting subjects share a similar partition structure on a subset of nodes (while differing on the other nodes). This is known as multi-view clustering. A challenge, however, is that a forest is a discrete object subject to combinatorial constraints, hence it would be difficult to partition the nodes freely while accommodating the tree structure. To circumvent this issue, we propose a latent coordinate-based distribution that gives a continuous representation for $\mathcal{T}^{(s)}$.

Considering a latent $z_i^{(s)} \in \mathbb{R}^d$ for each node $i = 1,\dots,n$, we assign a joint prior–latent variable distribution for $z^{(s)}$ and $\mathcal{T}^{(s)}$:

\Pi[z^{(s)}, \mathcal{T}^{(s)}] \propto \lambda^{K[\mathcal{T}^{(s)}]} \Big[\prod_{(i,j) \in \mathcal{T}^{(s)}: i \ge 1, j \ge 1} \exp\Big(-\frac{\|z_i^{(s)} - z_j^{(s)}\|_2^2}{2\rho}\Big)\Big] \Big[\prod_{i=1}^{n} \Big\{\sum_{k=1}^{\tilde\kappa} v_{i,k} \exp\Big(-\frac{\|z_i^{(s)} - \eta_k^*\|_2^2}{2\sigma_z^2}\Big)\Big\}\Big],
(v_{i,1},\dots,v_{i,\tilde\kappa}) \sim \text{Dir}(1/\tilde\kappa,\dots,1/\tilde\kappa) \ \text{ for } i = 1,\dots,n,
\{y_1^{(s)},\dots,y_n^{(s)}\} \sim \text{Forest Model}(\mathcal{T}^{(s)}) \ \text{ for } s = 1,\dots,S,    (11)

where $v_{i,1},\dots,v_{i,\tilde\kappa}$ are weights that vary with $i$ and satisfy $\sum_{k=1}^{\tilde\kappa} v_{i,k} = 1$, $\rho > 0$, and $z^{(s)} \in \mathbb{R}^{n \times d}$ is the matrix form. Equivalently, the above assigns each node a location parameter $\eta_i^{(s)}$, drawn from a hierarchical Dirichlet distribution with shared atoms $\{\eta_1^*,\dots,\eta_{\tilde\kappa}^*\}$ and probabilities $(v_{\cdot,1},\dots,v_{\cdot,\tilde\kappa})$ (Teh et al., 2006). Further, one could let $\eta_k^*$ vary over nodes according to some functional form using a hybrid Dirichlet distribution (Petrone et al., 2009).

Using a Gaussian mixture kernel on $z_i^{(s)}$, we can now separate the $z_i^{(s)}$'s into several groups that are far apart. To make the parameters identifiable and to obtain large separations between groups, we fix the $\eta_k^*$'s on the $d$-dimensional integer lattice $\{0, 1, 2\}^d$ with $d = 2$ (hence $\tilde\kappa = 9$); and we use $\sigma_z^2 = 0.01$ and $\rho = 0.001$ in this article.

Remark 5. To clarify, our goal is to induce between-subject similarity in the node partition, not in the tree structure. For example, for two subjects $s$ and $s'$, when $z_i^{(s)}$ and $z_i^{(s')}$ are both near $\eta_k^*$ for all $i \in C$, then both spanning forests $\mathcal{T}^{(s)}$ and $\mathcal{T}^{(s')}$ will likely cluster the nodes in $C$ together, even though the trees $T_k^{(s)}$ and $T_k^{(s')}$ associated with $V_k \supseteq C$ may be different.

3. Posterior Computation

3.1. Gibbs Sampling Algorithm

We now describe the Markov chain Monte Carlo (MCMC) algorithm. For ease of notation, we use an $(n+1) \times (n+1)$ matrix $S$, with $S_{i,j} = \log f(y_i \mid y_j; \theta)$, $S_{0,i} = S_{i,0} = \log r(y_i; \theta) + \log\lambda$ (for convenience, we use 0 to index the last row/column), and $S_{i,i} = 0$; and we use $A_{\mathcal{T}}$ to represent the adjacency matrix of $\mathcal{T}$. We have the posterior distribution

\Pi(\mathcal{T}, \theta \mid y) \propto \exp\{\mathrm{tr}[S(\theta) A_{\mathcal{T}}]/2\}\, \Pi_0(\theta).    (12)

Note that the above form conveniently includes the prior term for the number of clusters, $\lambda^K$, via the number of edges adjacent to node 0.

Our MCMC algorithm alternates between updating $\mathcal{T}$ and $\theta$, hence it is a Gibbs sampling algorithm. To sample $\mathcal{T}$ given $\theta$, we use the random-walk covering algorithm for weighted spanning trees (Mosbah and Saheb, 1999), an extension of the Aldous–Broder algorithm for sampling a uniform spanning tree (Broder, 1989; Aldous, 1990). To keep this article self-contained, we describe the algorithm below (Algorithm 1). It produces a random sample $\mathcal{T}$ following the full conditional $\Pi(\mathcal{T} \mid \theta, y)$ proportional to (12), with an expected completion time of $O(n \log n)$. Although some faster algorithms have been developed (Schild, 2018), we choose to present the random-walk covering algorithm for its simplicity.

Algorithm 1.

Random-walk covering algorithm for sampling the augmented tree 𝒯

Start with $V_{\mathcal{T}} = \{0\}$ and $E_{\mathcal{T}} = \emptyset$, and set $i \leftarrow 0$.
while $|V_{\mathcal{T}}| < n + 1$ do
  Take a random-walk step from $i$ to $j$ with probability $\Pr(j \mid i) = \exp[S_{i,j}(\theta)] / \sum_{j': j' \ne i} \exp[S_{i,j'}(\theta)]$.
  if $j \notin V_{\mathcal{T}}$ then
    Add $j$ to $V_{\mathcal{T}}$ and add $(i, j)$ to $E_{\mathcal{T}}$.
  end if
  Set $i \leftarrow j$.
end while
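For concreteness, below is a minimal Python sketch of Algorithm 1. The input S is assumed to be the $(n+1) \times (n+1)$ log-weight matrix defined above (index 0 for the auxiliary node); the walk always moves to the proposed node, and only first-entry edges join the tree, following the random-walk covering construction.

import numpy as np

def sample_augmented_tree(S, seed=None):
    rng = np.random.default_rng(seed)
    m = S.shape[0]               # m = n + 1 nodes, including the auxiliary node 0
    visited = {0}                # start the walk at the auxiliary node 0
    edges = []
    i = 0
    while len(visited) < m:
        # Transition probabilities Pr(j | i) proportional to exp(S[i, j]), no self-loop.
        logits = S[i].astype(float)
        logits[i] = -np.inf
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        j = int(rng.choice(m, p=probs))
        if j not in visited:     # first entry to j: record the tree edge (i, j)
            visited.add(j)
            edges.append((i, j))
        i = j                    # always move to j, whether or not it was new
    return edges                 # removing node 0 from these edges yields K disjoint trees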

We sample $\tilde\sigma_i$ and the hyper-parameters $(\eta_\sigma, \beta_\sigma)$ using the following steps:

(\eta_\sigma \mid \cdot) \sim \text{Inverse-Gamma}(1 + a_\sigma,\ \beta_\sigma + \xi_\sigma),
(\beta_\sigma \mid \cdot) \sim \text{Gamma}\Big\{1 + n b_\sigma,\ \Big(\sum_{i=1}^{n} \frac{1}{\tilde\sigma_i} + \frac{1}{\eta_\sigma}\Big)^{-1}\Big\},
(\tilde\sigma_i \mid \cdot) \sim \text{Inverse-Gamma}\Big[\frac{p \sum_j 1\{(i,j) \in \mathcal{T}\}}{2} + b_\sigma,\ \sum_{j: (i,j) \in \mathcal{T}} \frac{\|y_i - y_j\|_2^2}{2\tilde\sigma_j} + \beta_\sigma\Big].
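A minimal Python sketch of these three conditional draws is given below (the $\gamma$ update that follows is analogous); the edge list is assumed to contain only the leaf edges of the current tree, i.e., those not involving node 0, and the variable names are illustrative.

import numpy as np

def update_scale_parameters(y, edges, sigma_tilde, beta_sigma,
                            a_sigma=100.0, xi_sigma=1.0, b_sigma=10.0, seed=None):
    rng = np.random.default_rng(seed)
    n, p = y.shape

    # (eta_sigma | .) ~ Inverse-Gamma(1 + a_sigma, beta_sigma + xi_sigma)
    eta_sigma = 1.0 / rng.gamma(1.0 + a_sigma, 1.0 / (beta_sigma + xi_sigma))

    # (beta_sigma | .) ~ Gamma{1 + n*b_sigma, (sum_i 1/sigma_tilde_i + 1/eta_sigma)^(-1)}
    rate = np.sum(1.0 / sigma_tilde) + 1.0 / eta_sigma
    beta_sigma = rng.gamma(1.0 + n * b_sigma, 1.0 / rate)

    # (sigma_tilde_i | .) ~ Inverse-Gamma with shape p*deg(i)/2 + b_sigma and scale
    # sum over incident edges of ||y_i - y_j||^2 / (2 sigma_tilde_j) plus beta_sigma.
    neighbors = {i: [] for i in range(n)}
    for (u, v) in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    for i in range(n):
        shape = 0.5 * p * len(neighbors[i]) + b_sigma
        scale = beta_sigma + sum(np.sum((y[i] - y[j]) ** 2) / (2.0 * sigma_tilde[j])
                                 for j in neighbors[i])
        sigma_tilde[i] = 1.0 / rng.gamma(shape, 1.0 / scale)
    return sigma_tilde, eta_sigma, beta_sigma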

To update $\gamma$, we use the representation of the multivariate Cauchy as a scale mixture of $N(\mu, \gamma^2 u_{\gamma,i} I_p)$ over $u_{\gamma,i} \sim \text{Inverse-Gamma}(1/2, 1/2)$. We can update via

u_{\gamma,i} \sim \text{Inverse-Gamma}\Big(\frac{1 + p}{2},\ \frac{1}{2} + \frac{\|y_i - \mu\|_2^2}{2\gamma^2}\Big), \qquad \gamma^2 \sim \text{Inverse-Gamma}\Big(2 + \frac{Kp}{2},\ \hat\sigma_y^2 + \sum_{i: (0,i) \in \mathcal{T}} \frac{\|y_i - \mu\|_2^2}{2 u_{\gamma,i}}\Big).

We run the MCMC algorithm for many iterations and discard the first half as burn-in.

Remark 6. We want to emphasize that the Aldous–Broder random-walk covering algorithm (Broder, 1989; Aldous, 1990; Mosbah and Saheb, 1999) is an exact algorithm for sampling a spanning tree $\mathcal{T}$. That is, if $\theta$ were fixed, each run of this algorithm would produce an independent Monte Carlo sample $\mathcal{T} \sim \Pi(\mathcal{T} \mid \theta, y)$. Removing the auxiliary node 0 from $\mathcal{T}$ produces $K$ disjoint spanning trees. This augmented-graph technique is inspired by Boykov et al. (2001).

In our algorithm, since the scale parameters in $\theta$ are unknown, we use a Markov chain Monte Carlo scheme that updates two sets of parameters, (i) $(\theta^{[t+1]} \mid \mathcal{T}^{[t]})$ and (ii) $(\mathcal{T}^{[t+1]} \mid \theta^{[t+1]})$, from iteration $[t]$ to $[t+1]$. Therefore, rigorously speaking, there is a Markov chain dependency between $\mathcal{T}^{[t]}$ and $\mathcal{T}^{[t+1]}$ induced by $\theta^{[t+1]}$. Nevertheless, since we draw $\mathcal{T}$ in a block via the random-walk covering algorithm, we empirically find that $\mathcal{T}^{[t+1]}$ and $\mathcal{T}^{[t]}$ are substantially different. In the Supplementary Materials S4.4, we quantify the iteration-to-iteration graph changes, and provide diagnostics with multiple starting points of $(\mathcal{T}^{[0]}, \theta^{[0]})$.

3.2. Posterior Point Estimate on Clustering

In the field of Bayesian clustering, a long-time practice for producing a point estimate of the partition was to simply track $\Pr(c_i = k \mid y)$, then take the element-wise posterior mode over $k$ as the point estimate $\hat{c}_i$. Nevertheless, this was shown to be sub-optimal because (i) the label-switching issue causes unreliable estimates of $\Pr(c_i = k \mid y)$, and (ii) the element-wise mode can be unrepresentative of the center of the distribution of $(c_1,\dots,c_n)$ (Wade and Ghahramani, 2018). These weaknesses have motivated new methods for obtaining a point estimate of the clustering, which transform an $n \times n$ pairwise co-assignment matrix $\{\Pr(c_i = c_j \mid y)\}_{\text{all } (i,j)}$ into an $n \times K$ assignment matrix (Medvedovic and Sivaganesan, 2002; Rasmussen et al., 2008; Molitor et al., 2010; Wade and Ghahramani, 2018). More broadly speaking, minimizing a loss function based on the posterior sample (via some estimator or algorithm) is common for producing a point estimate under some decision-theoretic criterion. For example, the posterior mean is the minimizer of the squared error loss; in Bayesian factor modeling, an orthogonal Procrustes-based loss function is used for producing the posterior summary of the loading matrix from the generated MCMC samples (Aßmann et al., 2016).

We follow this strategy. There are many algorithms that one could use; for a recent survey, see Dahl et al. (2022). In this article, we use a simple solution of first finding the posterior mode $\hat{K}$ of $K$ from the posterior sample, then performing a rank-$\hat{K}$ symmetric matrix factorization on $\{\Pr(c_i = c_j \mid y)\}_{\text{all } (i,j)}$ and clustering into $\hat{K}$ groups, as provided by the RcppML package (DeBruine et al., 2021).
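As an illustration, the sketch below assembles the co-assignment matrix from posterior label draws and produces a point estimate; in place of RcppML's symmetric factorization, it clusters the leading eigenvectors of the co-assignment matrix with K-means, which is a substitute we adopt for this sketch rather than the exact implementation used in the paper.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def clustering_point_estimate(label_draws):
    # label_draws: (n_mcmc, n) array of cluster labels from the posterior sample.
    label_draws = np.asarray(label_draws)
    n_mcmc, n = label_draws.shape

    # Posterior mode of the number of clusters K.
    K_draws = np.array([len(np.unique(row)) for row in label_draws])
    K_hat = int(np.bincount(K_draws).argmax())

    # Co-assignment matrix: empirical estimate of Pr(c_i = c_j | y).
    coassign = np.zeros((n, n))
    for row in label_draws:
        coassign += (row[:, None] == row[None, :])
    coassign /= n_mcmc

    # Rank-K_hat spectral summary of the co-assignment matrix, then K-means on the rows.
    _, eigvecs = eigh(coassign)
    top = eigvecs[:, -K_hat:]
    return KMeans(n_clusters=K_hat, n_init=10).fit_predict(top)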

4. Theoretical Properties

4.1. Convergence of Eigenvectors

We now formalize the closeness of the eigenvectors of matrices N and M (shown in Section 2.2), by establishing the convergence of the two sets of eigenvectors as n increases.

To be specific, we focus on the normalized spectral clustering algorithm using the similarity $A_{i,j} = \exp(S_{i,j})$, with $S_{i,j} = \log f(y_i \mid y_j; \theta)$ and $S_{0,i} = S_{i,0} = \log r(y_i; \theta) + \log\lambda$. Regarding the specific form, $f(y_i \mid y_j)$ can be any density satisfying $f(y_i \mid y_j, \theta) = f(y_j \mid y_i, \theta)$, and $r(y_i; \theta)$ can be any density satisfying $r(y_i; \theta) > 0$. For the associated normalized Laplacian $N$, we denote the bottom $K$ eigenvectors by $\phi_1,\dots,\phi_K$, which correspond to the smallest $K$ eigenvalues.

Let $M$ be the matrix with $M_{i,j} = \Pr[\mathcal{T} \ni (i,j) \mid y, \theta]$ for $i \ne j$ and $M_{i,i} = 0$. Kirchhoff's tree theorem (Chaiken and Kleitman, 1978) gives an enumeration over all spanning trees $\mathcal{T} \in \mathbb{T}$:

\sum_{\mathcal{T} \in \mathbb{T}} \prod_{(i,j) \in \mathcal{T}} \exp(S_{i,j}) = (n+1)^{-1} \prod_{h=2}^{n+1} \lambda^{(h)}(L),    (13)

where $L$ is the Laplacian matrix transform of the similarity matrix $A$, and $\lambda^{(h)}$ denotes the $h$th smallest eigenvalue. Differentiating its logarithmic transform with respect to $S_{i,j}$,

M_{i,j} = \Pr[\mathcal{T} \ni (i,j) \mid y] = \frac{\sum_{\mathcal{T} \in \mathbb{T}: (i,j) \in \mathcal{T}} \prod_{(i',j') \in \mathcal{T}} \exp(S_{i',j'})}{\sum_{\mathcal{T} \in \mathbb{T}} \prod_{(i',j') \in \mathcal{T}} \exp(S_{i',j'})} = \frac{\partial \sum_{h=2}^{n+1} \log \lambda^{(h)}(L)}{\partial S_{i,j}}.
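Numerically, this derivative has a well-known closed form: the probability that edge $(i,j)$ appears in the weighted random spanning tree equals its weight $\exp(S_{i,j})$ times the effective resistance between $i$ and $j$ in the weighted graph, a standard consequence of the matrix-tree theorem. The sketch below computes $M$ this way from the pseudoinverse of the Laplacian; the Gaussian similarity used in the sanity check is an illustrative stand-in for $\exp(S_{i,j})$.

import numpy as np

def edge_inclusion_probabilities(W):
    # W: symmetric nonnegative weight matrix (zero diagonal), W[i, j] = exp(S[i, j]).
    L = np.diag(W.sum(axis=1)) - W                      # weighted graph Laplacian
    L_pinv = np.linalg.pinv(L)                          # Moore-Penrose pseudoinverse
    d = np.diag(L_pinv)
    eff_res = d[:, None] + d[None, :] - 2.0 * L_pinv    # effective resistances
    M = W * eff_res                                     # Pr[(i, j) is in the random spanning tree]
    np.fill_diagonal(M, 0.0)
    return M

# Sanity check: a spanning tree on m nodes has m - 1 edges, so the entries of M
# (each edge counted twice) should sum to 2 * (m - 1).
rng = np.random.default_rng(1)
y = rng.normal(size=(30, 2))
W = np.exp(-((y[:, None, :] - y[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(W, 0.0)
print(edge_inclusion_probabilities(W).sum())            # close to 58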

Let $\Psi_1,\dots,\Psi_K$ be the top $K$ eigenvectors of $M$, associated with eigenvalues $\xi_1 \ge \xi_2 \ge \cdots \ge \xi_K$, and $\xi_K > \xi_{K+1} \ge \xi_{K+2} \ge \cdots \ge \xi_{n+1}$. We compare them with the $K$ bottom eigenvectors of $N$, $\phi_1,\dots,\phi_K$. Using $\Psi_{1:K}$ and $\phi_{1:K}$ to denote the two $(n+1) \times K$ matrices, we now show that they are close to each other.

Theorem 1. There exist an orthonormal matrix $R \in \mathbb{R}^{K \times K}$ and a finite constant $\epsilon > 0$ such that

\|\Psi_{1:K} - \phi_{1:K} R\|_F \le \frac{40 K (n+1)}{\xi_K - \xi_{K+1}} \max_{i,j}\big\{(1+\epsilon)\,\big(D_{i,i}^{-1/2} D_{j,j}^{-1/2}\big)^2 A_{i,j}\big\},

with probability at least $1 - \exp(-n)$.

Remark 7. To make the right-hand side go to zero, a sufficient condition is to have all $A_{i,j}/D_{i,i} = O(n^{-\kappa})$ with $\kappa > 1/2$. We provide a detailed definition of the bound constant $\epsilon$ in the Supplementary Materials S2.

To explain the intuition behind this theorem, our starting point is the close relationship between the Laplacian and spanning tree models: multiplying both sides of Equation (13) by $(n+1)^{-(n-1)}$ shows that the product of the non-zero eigenvalues of the graph Laplacian $L$ is proportional to the marginal probability of the $n$ data points under a spanning forest-mixture model. Starting from this equality, we can write the marginal inclusion probability matrix of $\mathcal{T}$ as a mildly perturbed form of the normalized Laplacian matrix. Intuitively, when two matrices are close, their eigenvectors will be close as well (Yu et al., 2015).

Therefore, under mild conditions, as $n \to \infty$, the two sets of leading eigenvectors converge. In the Supplementary Materials S4.7, we show that the convergence is very fast, with the two sets of leading eigenvectors becoming almost indistinguishable starting around $n \approx 50$.

Besides the eigenvector convergence, we can examine the marginal posterior $\Pi(\mathcal{V} \mid \theta, y)$, which is proportional to

\mathcal{L}(y; \mathcal{V}, \theta)\, \Pi_0(\mathcal{V}) = \Pi_0(K, V_1,\dots,V_K) \Big\{\prod_{k=1}^{K} \Big[\prod_{i \in V_k} r(y_i)\Big]\Big\} \prod_{k=1}^{K} \Big\{n_k^{-1} \prod_{h=2}^{n_k} \lambda^{(h)}(L_k)\Big\},    (14)

where $L_k$ is the unnormalized Laplacian matrix associated with the matrix $\{A_{i,j}\}_{i \in V_k, j \in V_k}$. If we put all indices into one partition set $V_1 = (1,\dots,n)$, then $\Pi(\mathcal{V} \mid \theta, y)$ would be very small due to those close-to-zero eigenvalues. Applying this deduction recursively on subsets of the data, it is not hard to see that a high-valued $\Pi(\mathcal{V} \mid \theta, y)$ corresponds to a partition wherein each $V_k$ has $\lambda^{(h)}(L_k)$ away from 0 for every $h \ge 2$. Further, since $n_k^{-1} \prod_{h=2}^{n_k} \lambda^{(h)}(L_k) = |L_k + J/n_k^2|$ (with $J$ the all-ones matrix), a permutation of $(1,\dots,n)$ corresponds to simultaneous permutations of the rows and columns of each $L_k$, which does not change each determinant. Therefore, the joint distribution $\Pi\{(y_1, c_1),\dots,(y_n, c_n)\}$ is exchangeable.
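As a sketch of how (14) can be evaluated for a candidate partition, the code below computes the logarithm of the partition-dependent factor using the determinant identity just mentioned; the similarity matrix A and the per-point log root densities are assumed to be precomputed, and the prior factor $\Pi_0$ is omitted.

import numpy as np

def log_partition_factor(A, labels, log_r):
    # A: n x n similarity matrix with A[i, j] = exp(S[i, j]); labels: candidate cluster
    # assignment; log_r: length-n vector of log r(y_i). Returns the log of the
    # partition-dependent part of (14).
    total = 0.0
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        n_k = len(idx)
        total += log_r[idx].sum()                          # product of r(y_i) over the cluster
        A_k = A[np.ix_(idx, idx)]
        L_k = np.diag(A_k.sum(axis=1)) - A_k               # within-cluster Laplacian
        J = np.ones((n_k, n_k))
        _, logdet = np.linalg.slogdet(L_k + J / n_k ** 2)  # log of n_k^{-1} prod_{h>=2} lambda^{(h)}(L_k)
        total += logdet
    return total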

4.2. Consistent Clustering of Separable Sets

We show that clustering consistency is possible under some separability assumptions when the data-generating distribution follows a forest process. Specifically, we establish posterior ratio consistency: the ratio of the maximum posterior probability assigned to any other clustering assignment to the posterior probability assigned to the true clustering assignment converges to zero almost surely under the true model (Cao et al., 2019).

To formalize the above, we denote the true cluster label for generating $y_i$ by $c_i^0$ (subject to label permutation among clusters), and we define the enclosing region for all possible $\{y_i : c_i^0 = k\}$ as $R_k^0$ for $k = 1,\dots,K_0$, for some true finite $K_0$. We refer to $R^0 = (R_1^0,\dots,R_{K_0}^0)$ as the "null partition". By separability, we mean the scenario that $(R_1^0,\dots,R_{K_0}^0)$ are disjoint and there is a lower-bounded distance between each pair of sets. As alternatives, regions $R = (R_1,\dots,R_K)$ could be induced by $\{c_1,\dots,c_n\}$ from the posterior estimate of $\mathcal{T}$. For simplicity, we assume the scale parameters in $f$ are known and all equal, $\sigma_{i,j} = \sigma_{0,n}$.

Number of clusters is known. We first start with the simple case of fixed $K = K_0$. For regularity, we consider data supported in a compact region $\mathcal{X}$ and satisfying the following assumptions:

  • (A1, diminishing scale) $\sigma_{0,n} = C(1/\log n)^{1+t}$ for some $t > 0$ and $C > 0$.

  • (A2, minimum separation) $\inf_{x \in R_k^0,\, y \in R_{k'}^0} \|x - y\|_2 > M_n$ for all $k \ne k'$, with some positive constant $M_n > 0$ such that $M_n^2/\sigma_{0,n} = 8\tilde{m}_0 \log(n)$, where $M_n$ is known, for some constant $\tilde{m}_0 > p/2 + 2$.

  • (A3, near-flatness of root density) For any $n$, $\epsilon_1 < r(y) < \epsilon_2$ for all $y \in \mathcal{X}$.

Under the null partition, $\Pi(\mathcal{T} \mid y)$ is maximized at $\mathcal{T} = \mathcal{T}_{\mathrm{MST}, R^0}$, which contains $K_0$ trees with each $T_k$ being the minimum spanning tree (denoted by the subscript "MST") within region $R_k^0$. Similarly, for any alternative $R$, $\Pi(\mathcal{T} \mid y)$ is maximized at $\mathcal{T} = \mathcal{T}_{\mathrm{MST}, R}$.

Theorem 2. Under (A1, A2, A3), we have $\Pi(\mathcal{T}_{\mathrm{MST}, R} \mid y)/\Pi(\mathcal{T}_{\mathrm{MST}, R^0} \mid y) \to 0$ almost surely, unless $R_i^0 \subseteq R_{\xi(i)}$ for some permutation map $\xi(\cdot)$.

Number of clusters is unknown: Next, we relax the condition by allowing $K$ to differ from $K_0$. We show the consistency in two parts, for 1) $K < K_0$ and 2) $K > K_0$, separately. In order to show posterior ratio consistency in the second part, we need some finer control on $r(y)$:

  • (A3') The root density satisfies $\tilde{m}_1 e^{-M_n/(2\sigma_{0,n})} \le r(y) \le \tilde{m}_2 e^{-M_n/(2\sigma_{0,n})}$ for some $\tilde{m}_1 < \tilde{m}_2$.

In this assumption, we essentially assume the root distribution to be flatter with a larger n. Then we have the following results.

Theorem 3. 1) If $K < K_0$, under the assumptions (A1, A2, A3), we have $\Pi(\mathcal{T}_{\mathrm{MST}, R} \mid y)/\Pi(\mathcal{T}_{\mathrm{MST}, R^0} \mid y) \to 0$ almost surely.

2) If $K > K_0$, under the assumptions (A1, A2, A3'), we have $\Pi(\mathcal{T}_{\mathrm{MST}, R} \mid y)/\Pi(\mathcal{T}_{\mathrm{MST}, R^0} \mid y) \to 0$ almost surely.

The above results show posterior ratio consistency. Furthermore, when the true number of clusters is known, the ratio consistency result can be further extended to show clustering consistency, which is proved in the Supplementary Materials S3.

5. Numerical Experiments

To illustrate the capability of uncertainty quantification, we carry out clustering tasks on near-manifold data commonly used for benchmarking clustering algorithms. In the first simulation, we start with 300 points drawn from three rings of radii 0.2, 1 and 2, with 100 points from each ring. We then add Gaussian noise to each point to create a coordinate near a ring manifold. We present two experiments, one with noise from $N(0, 0.05^2 I_2)$, and one with noise from $N(0, 0.1^2 I_2)$. As shown in Figure 3, when these data are well separated (Panel a, showing the posterior point estimate), there is very little uncertainty in the clustering (Panel b), with the posterior co-assignment probability $\Pr(c_i = c_j \mid y)$ close to zero for any two data points near different rings. As the noise increases, these data become more difficult to separate. There is a considerable amount of uncertainty for the red and blue points: these two sets of points are assigned to one cluster with a probability close to 40% (Panel d). We conduct another simulation based on an arc manifold and two point clouds (Panels e-h), and find similar results. Additional experiments are described in the Supplementary Materials S4.2.
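The benchmark data for this experiment can be generated as follows (a minimal sketch; drawing the angles uniformly on each ring is an assumption of the sketch).

import numpy as np

def three_rings(noise_sd=0.05, n_per_ring=100, seed=None):
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k, radius in enumerate([0.2, 1.0, 2.0]):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_per_ring)
        ring = radius * np.column_stack([np.cos(theta), np.sin(theta)])
        points.append(ring + rng.normal(scale=noise_sd, size=(n_per_ring, 2)))
        labels += [k] * n_per_ring
    return np.vstack(points), np.array(labels)

y_easy, c_easy = three_rings(noise_sd=0.05)   # well-separated setting (Panels a, b)
y_hard, c_hard = three_rings(noise_sd=0.1)    # noisier setting (Panels c, d)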

Fig. 3. Uncertainty quantification in clustering data generated near three manifolds. When data are close to the manifolds (Panels a,e), there is very little uncertainty in the clustering, reflected by the low $\Pr(c_i = c_j \mid y)$ between points from different clusters (Panels b,f). As data deviate more from the manifolds (Panels c,g), the uncertainty increases (Panels d,h). In Panel g, the point estimate shows a two-cluster partitioning, while there is about a 20% probability of a three-cluster partitioning.

In the Supplementary Materials S4.1 and S4.3, we present some uncertainty quantification results for clustering data that are generated from mixture models. We compare the estimates with those from Gaussian mixture models, which may correspond to correctly or erroneously specified component distributions. Empirically, we find that the uncertainty estimates of $\Pr(c_i = c_j \mid y)$ and $\Pr(K \mid y)$ from the forest model are close to the ones based on the true data-generating distribution, whereas the Gaussian mixture models suffer from sensitivity in model specification, especially when $K$ is not known.

6. Application: Clustering in Multi-subject Functional Magnetic Resonance Imaging Data

In this application, we conduct a neuroscience study to find connected brain regions under a varying degree of impact from Alzheimer's disease. The source dataset is resting-state functional magnetic resonance imaging (rs-fMRI) scan data, collected from $S = 166$ subjects at different stages of Alzheimer's disease. Each subject has scans over $n = 116$ regions of interest using the Automated Anatomical Labeling (AAL) atlas (Rolls et al., 2020; Shi et al., 2021) and over $p = 120$ time points. We denote the observation for the $s$th subject in the $i$th region by $y_i^{(s)} \in \mathbb{R}^p$.

The rs-fMRI data are known for their high variability, often characterized by a low intraclass correlation coefficient (ICC), $(1 - \hat\sigma^2_{\text{within-group}}/\hat\sigma^2_{\text{total}})$, which estimates the proportion of the total variance that can be attributed to variability between groups (Noble et al., 2021). Therefore, our goal is to use multi-view clustering to divide the regions of interest for each subject, while improving our understanding of the source of the high variability.

We fit the multi-view clustering model to the data by running MCMC for 5,000 iterations and discarding the first 2,500 as burn-in. As shown in Figure 4, the hierarchical Dirichlet distribution on the latent coordinates induces similarity between the clusterings of brain regions among subjects on a subset of nodes, while showing subtle differences on the other nodes. On the other hand, some major differences can be seen in the clusterings between the healthy and diseased subjects. Using the latent coordinates (at the posterior mean), we quantify the distances between $z^{(s)}$ and $z^{(s')}$ for each pair of subjects $s \ne s'$. As shown in Figure 5(a), there is a clear two-group structure in the pairwise distance matrix formed by $\|z^{(s)} - z^{(s')}\|_F$, and the separation corresponds to the first 64 subjects being healthy (denoted by $s \in g_1$) and the latter 102 being diseased (denoted by $s \in g_2$).

Fig. 4. Results of brain region clustering (lateral view) for four subjects taken from the healthy and diseased groups. The multi-view clustering model allows subjects to have similar partition structures on a subset of nodes, while having subtle differences on the others (Panels a and b, Panels c and d). At the same time, the healthy subjects show a lesser degree of variability in the brain clustering than the diseased subjects.

Fig. 5. Using the latent coordinates to characterize the heterogeneity within the subjects.

Next, we compute the within-group variances for these two groups, using $\sum_{s \in g_l} \|z_i^{(s)} - (\sum_{s' \in g_l} z_i^{(s')}/|g_l|)\|_F^2 / |g_l|$ for $l = 1$ and $2$, and plot the variance for each region of interest $i$ on the spatial coordinates of the atlas. Figure 5(b) and (c) show that, although both groups show some degree of variability, the diseased group shows clearly higher variances in some regions of the brain. Specifically, the paracentral lobule (PCL), superior parietal gyrus (SPG), dorsolateral superior frontal gyrus (SFGdor), and supplementary motor area (SMA) in the frontal lobe show the highest amount of variability. Indeed, those regions are also associated with very low ICC scores [Figure 5(e)], calculated based on the variance of $z_i^{(s)}$, with pooled estimates $\hat\sigma^2_{\text{total},i} = \sum_s \|z_i^{(s)} - (\sum_{s'} z_i^{(s')}/S)\|_F^2 / S$ and $\hat\sigma^2_{\text{within-group},i} = \sum_{l=1}^{2} \sum_{s \in g_l} \|z_i^{(s)} - (\sum_{s' \in g_l} z_i^{(s')}/|g_l|)\|_F^2 / S$. On the other hand, some regions such as the hippocampus (HIP), parahippocampal gyrus (PHG), and superior occipital gyrus (SOG) show relatively lower variances within each group, hence higher ICC scores.
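For reference, a minimal sketch of the per-region variance and ICC computation described above is given below; the array shapes and variable names are illustrative assumptions.

import numpy as np

def regional_icc(z, group):
    # z: array of shape (S, n, d) with latent coordinates for S subjects, n regions, d dimensions;
    # group: length-S array of group labels (e.g., healthy vs diseased).
    S, n, d = z.shape
    icc = np.zeros(n)
    for i in range(n):
        z_i = z[:, i, :]
        total_var = np.sum((z_i - z_i.mean(axis=0)) ** 2) / S        # pooled total variance
        within = 0.0
        for g in np.unique(group):
            z_g = z_i[group == g]
            within += np.sum((z_g - z_g.mean(axis=0)) ** 2)          # within-group sum of squares
        within_var = within / S
        icc[i] = 1.0 - within_var / total_var
    return icc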

To show more details on the heterogeneity, we plot the latent coordinates associated with those ROIs using boxplots. Since each $z_i^{(s)}$ is in a two-dimensional space, we plot the linear transform $\tilde{z}_i^{(s)} = z_{i,1}^{(s)} + z_{i,2}^{(s)}$. Interestingly, those 8 ROIs with high variability still seem quite informative for distinguishing the two groups [Figure 5(f)]. To verify, we concatenate those latent coordinates to form an $S \times 16$ matrix and fit a logistic regression model for classifying the healthy versus diseased states. The Area Under the Curve (AUC) of the Receiver Operating Characteristic is 86.6%. On the other hand, when we fit the 6 ROIs with low variability in the logistic regression, the AUC increases to 96.1%.

An explanation for the above results is that Alzheimer's disease causes different degrees of damage to the frontal and parietal lobes [see the two distinct clusterings in Figure 4 (c) and (d)], and the severity of the damage can vary from person to person. On the other hand, the hippocampus region (HIP and PHG), important for memory consolidation, is known to be commonly affected by Alzheimer's disease (Braak and Braak, 1991; Klimova et al., 2015), which explains the low heterogeneity within the diseased group. Further, to the best of our knowledge, the high discriminability of the superior occipital gyrus (SOG) is a new quantitative finding that could be meaningful for a further clinical study.

For validation, without using any group information, we concatenate the $z_i^{(s)}$'s over all $i = 1,\dots,116$ to form an $S \times 232$ matrix and use lasso logistic regression to classify the two groups. When 12 predictors are selected (as a similar-size model to the one above using 6 ROIs), the AUC is 96.4%. Since the $z_i^{(s)}$'s are obtained in an unsupervised way, this validation result shows that the multi-view clustering model produces a meaningful representation for the nodes in this Alzheimer's disease data. We provide further details on the clusterings, including the number of clusters and the posterior co-assignment probability matrices, in the Supplementary Materials S4.5.

7. Discussion

In this article, we present our discovery of a probabilistic model for popular spectral clustering algorithms. This enables straightforward uncertainty quantification and model-based extensions through the Bayesian framework. There are several directions worth exploring. First, our consistency theory is developed under the condition of separable sets, similar to Ascolani et al. (2022). For general cases with non-separable sets, clustering consistency (especially on estimating $K$) is challenging to achieve; to the best of our knowledge, existing consistency theory only applies to data generated independently from a mixture model (Miller and Harrison, 2018; Zeng et al., 2023). For data generated dependently via a graph, this is still an unsolved problem. Second, in all of our forest models, we have been careful in choosing densities with tractable normalizing constants. One could relax this constraint by using densities $f(y_i \mid y_j, \theta) = \alpha_f g_f(y_i \mid y_j; \theta)$ and $r(y_i; \theta) = \alpha_r g_r(y_i; \theta)$, with $g_f$ and $g_r$ some similarity functions, and $(\alpha_f, \alpha_r)$ potentially intractable. In these cases, the forest posterior becomes $\Pi(\mathcal{T} \mid \cdot) \propto (\lambda\alpha_r/\alpha_f)^K \prod_{(0,i) \in \mathcal{T}} g_r(y_i; \theta) \prod_{(i,j) \in \mathcal{T}} g_f(y_i \mid y_j; \theta)$. Therefore, one could choose an appropriate $\tilde\lambda = \lambda\alpha_r/\alpha_f$ (equivalent to choosing some value of $\lambda$) without knowing the value of $\alpha_f$ or $\alpha_r$; nevertheless, how to calibrate $\tilde\lambda$ still requires further study. Third, a related idea is the Dirichlet Diffusion Tree (Neal, 2003), which considers a particle starting at the origin, following the path of previous particles, and diverging at a random time. The data are collected as the locations of the particles at the end of a time period. Compared to the forest process, the diffusion tree process has a conditional likelihood given the tree that is invariant to the ordering of the data index, which is a stronger property than the marginal exchangeability of the data points. Therefore, it is interesting to further explore the relationship between these two processes.

Supplementary Material

Supp 1

Acknowledgment:

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada.

Footnotes

Conflict of interest statement: The authors report that there are no competing interests to declare.

References

  1. Aldous DJ (1990). The Random Walk Construction of Uniform Spanning Trees and Uniform Labelled Trees. SIAM Journal on Discrete Mathematics 3 (4), 450–465. [Google Scholar]
  2. Ascolani F, Lijoi A, Rebaudo G, and Zanella G (2022). Clustering Consistency With Dirichlet Process Mixtures. arXiv preprint arXiv:2205.12924. [Google Scholar]
  3. Aßmann C, Boysen-Hogrefe J, and Pape M (2016). Bayesian Analysis of Static and Dynamic Factor Models: An Ex-Post Approach Towards the Rotation Problem. Journal of Econometrics 192 (1), 190–206. [Google Scholar]
  4. Banerjee S, Akbani R, and Baladandayuthapani V (2015). Bayesian Nonparametric Graph Clustering. arXiv preprint arXiv:1509.07535. [Google Scholar]
  5. Barry D and Hartigan JA (1993). A Bayesian Analysis for Change Point Problems. Journal of the American Statistical Association 88 (421), 309–319. [Google Scholar]
  6. Blackwell D and MacQueen JB (1973). Ferguson Distributions via Pólya Urn Schemes. The Annals of Statistics 1 (2), 353–355. [Google Scholar]
  7. Blei DM and Frazier PI (2011). Distance Dependent Chinese Restaurant Processes. Journal of Machine Learning Research 12 (8). [Google Scholar]
  8. Boykov Y, Veksler O, and Zabih R (2001). Fast Approximate Energy Minimization via Graph Cuts. IEEE Transactions on pattern analysis and machine intelligence 23 (11), 1222–1239. [Google Scholar]
9. Braak H and Braak E (1991). Neuropathological Stageing of Alzheimer-Related Changes. Acta Neuropathologica 82 (4), 239–259.
10. Broder AZ (1989). Generating Random Spanning Trees. In Annual Symposium on Foundations of Computer Science, Volume 89, pp. 442–447.
11. Byrne S and Dawid AP (2015). Structural Markov Graph Laws for Bayesian Model Uncertainty. The Annals of Statistics 43 (4), 1647–1681.
12. Cai D, Campbell T, and Broderick T (2021). Finite Mixture Models Do Not Reliably Learn the Number of Components. In International Conference on Machine Learning, pp. 1158–1169. PMLR.
13. Cao X, Khare K, and Ghosh M (2019). Posterior Graph Selection and Estimation Consistency for High-Dimensional Bayesian DAG Models. The Annals of Statistics 47 (1), 319–348.
14. Chaiken S and Kleitman DJ (1978). Matrix Tree Theorems. Journal of Combinatorial Theory, Series A 24 (3), 377–381.
15. Chandra NK, Canale A, and Dunson DB (2023). Escaping the Curse of Dimensionality in Bayesian Model Based Clustering. Journal of Machine Learning Research 24, 1–42.
16. Chi Y, Song X, Zhou D, Hino K, and Tseng BL (2007). Evolutionary Spectral Clustering by Incorporating Temporal Smoothness. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 153–162.
17. Coretto P and Hennig C (2016). Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering. Journal of the American Statistical Association 111 (516), 1648–1659.
18. Crowley EM (1997). Product Partition Models for Normal Means. Journal of the American Statistical Association 92 (437), 192–198.
19. Dahl DB, Johnson DJ, and Müller P (2022). Search Algorithms and Loss Functions for Bayesian Clustering. Journal of Computational and Graphical Statistics 31 (4), 1189–1201.
20. DeBruine ZJ, Melcher K, and Triche TJ Jr (2021). Fast and Robust Non-Negative Matrix Factorization for Single-Cell Experiments. bioRxiv, 2021–09.
21. Diaconis P (1977). Finite Forms of de Finetti’s Theorem on Exchangeability. Synthese 36, 271–281.
22. Duan LL and Dunson DB (2021a). Bayesian Distance Clustering. Journal of Machine Learning Research 22, 1–27.
23. Duan LL and Dunson DB (2021b). Bayesian Spanning Tree: Estimating the Backbone of the Dependence Graph. arXiv preprint arXiv:2106.16120.
24. Duan LL, Michailidis G, and Ding M (2023). Bayesian Spiked Laplacian Graphs. Journal of Machine Learning Research 24 (3), 1–35.
25. Edwards D, De Abreu GC, and Labouriau R (2010). Selecting High-Dimensional Mixed Graphical Models Using Minimal AIC or BIC Forests. BMC Bioinformatics 11 (1), 1–13.
26. Ester M, Kriegel H-P, Sander J, and Xu X (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231. AAAI Press.
27. Fraley C and Raftery AE (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association 97 (458), 611–631.
28. Frey BJ and Dueck D (2007). Clustering by Passing Messages Between Data Points. Science 315 (5814), 972–976.
29. Frühwirth-Schnatter S and Pyne S (2010). Bayesian Inference for Finite Mixtures of Univariate and Multivariate Skew-Normal and Skew-t Distributions. Biostatistics 11 (2), 317–336.
30. Geng J, Bhattacharya A, and Pati D (2019). Probabilistic Community Detection With Unknown Number of Communities. Journal of the American Statistical Association 114 (526), 893–905.
31. Gower JC and Ross GJ (1969). Minimum Spanning Trees and Single Linkage Cluster Analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics) 18 (1), 54–64.
32. Guha S and Baladandayuthapani V (2016). A Nonparametric Bayesian Technique for High-Dimensional Regression. Electronic Journal of Statistics 10 (2), 3374–3424.
33. Han X, Tong X, and Fan Y (2021). Eigen Selection in Spectral Clustering: A Theory-Guided Practice. Journal of the American Statistical Association, 1–13.
34. Hartigan JA (1990). Partition Models. Communications in Statistics-Theory and Methods 19 (8), 2745–2756.
35. Klimova B, Maresova P, Valis M, Hort J, and Kuca K (2015). Alzheimer’s Disease and Language Impairments: Social Intervention and Medical Treatment. Clinical Interventions in Aging, 1401–1408.
36. Kosmidis I and Karlis D (2016). Model-Based Clustering Using Copulas With Applications. Statistics and Computing 26 (5), 1079–1099.
37. Kumar A, Rai P, and Daume H (2011). Co-regularized Multi-view Spectral Clustering. In Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, and Weinberger K (Eds.), Advances in Neural Information Processing Systems, Volume 24. Curran Associates, Inc.
38. Lee SX and McLachlan GJ (2016). Finite Mixtures of Canonical Fundamental Skew t-Distributions. Statistics and Computing 26 (3), 573–589.
39. Lei J and Lin KZ (2022). Bias-Adjusted Spectral Clustering in Multi-Layer Stochastic Block Models. Journal of the American Statistical Association, 1–13.
40. Lei J and Rinaldo A (2015). Consistency of Spectral Clustering in Stochastic Block Models. The Annals of Statistics 43 (1), 215–237.
41. Lewis JR, MacEachern SN, and Lee Y (2021). Bayesian Restricted Likelihood Methods: Conditioning on Insufficient Statistics in Bayesian Regression. Bayesian Analysis 16 (4), 1393–1462.
42. Luo Z, Sang H, and Mallick B (2021). A Bayesian Contiguous Partitioning Method for Learning Clustered Latent Variables. Journal of Machine Learning Research 22.
43. MacQueen J (1967). Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symp. Math. Statist. Probability, pp. 281–297.
44. Malsiner-Walli G, Frühwirth-Schnatter S, and Grün B (2017). Identifying Mixtures of Mixtures Using Bayesian Estimation. Journal of Computational and Graphical Statistics 26 (2), 285–295.
45. McDaid AF, Murphy TB, Friel N, and Hurley NJ (2013). Improved Bayesian Inference for the Stochastic Block Model With Application to Large Networks. Computational Statistics & Data Analysis 60, 12–31.
46. Medvedovic M and Sivaganesan S (2002). Bayesian Infinite Mixture Model Based Clustering of Gene Expression Profiles. Bioinformatics 18 (9), 1194–1206.
47. Meilă M and Jaakkola T (2006). Tractable Bayesian Learning of Tree Belief Networks. Statistics and Computing 16 (1), 77–92.
48. Meilă M and Jordan MI (2000). Learning With Mixtures of Trees. Journal of Machine Learning Research 1 (Oct), 1–48.
49. Miller JW (2019). An Elementary Derivation of the Chinese Restaurant Process From Sethuraman’s Stick-Breaking Process. Statistics & Probability Letters 146, 112–117.
50. Miller JW and Dunson DB (2018). Robust Bayesian Inference via Coarsening. Journal of the American Statistical Association 114 (527), 1113–1125.
51. Miller JW and Harrison MT (2018). Mixture Models With a Prior on the Number of Components. Journal of the American Statistical Association 113 (521), 340–356.
52. Molitor J, Papathomas M, Jerrett M, and Richardson S (2010). Bayesian Profile Regression With an Application to the National Survey of Children’s Health. Biostatistics 11 (3), 484–498.
53. Mosbah M and Saheb N (1999). Non-Uniform Random Spanning Trees on Weighted Graphs. Theoretical Computer Science 218 (2), 263–271.
54. Müller P and Quintana F (2010). Random Partition Models With Regression on Covariates. Journal of Statistical Planning and Inference 140 (10), 2801–2808.
55. Müller P, Quintana F, and Rosner GL (2011). A Product Partition Model With Regression on Covariates. Journal of Computational and Graphical Statistics 20 (1), 260–278.
56. Neal RM (2003). Density Modeling and Clustering Using Dirichlet Diffusion Trees. Bayesian Statistics 7, 619–629.
57. Ng S-K, McLachlan GJ, Wang K, Ben-Tovim Jones L, and Ng S-W (2006). A Mixture Model With Random-Effects Components for Clustering Correlated Gene-Expression Profiles. Bioinformatics 22 (14), 1745–1752.
58. Noble S, Scheinost D, and Constable RT (2021). A Guide to the Measurement and Interpretation of fMRI Test-Retest Reliability. Current Opinion in Behavioral Sciences 40, 27–32.
59. Nowicki K and Snijders TAB (2001). Estimation and Prediction for Stochastic Blockstructures. Journal of the American Statistical Association 96 (455), 1077–1087.
60. Paganin S, Herring AH, Olshan AF, and Dunson DB (2021). Centered Partition Processes: Informative Priors for Clustering. Bayesian Analysis 16 (1), 301–370.
61. Park J-H and Dunson DB (2010). Bayesian Generalized Product Partition Model. Statistica Sinica, 1203–1226.
62. Petrone S, Guindani M, and Gelfand AE (2009). Hybrid Dirichlet Mixture Models for Functional Data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (4), 755–782.
63. Quintana FA and Iglesias PL (2003). Bayesian Clustering and Product Partition Models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (2), 557–574.
64. Rasmussen C, Bernard J, Ghahramani Z, and Wild DL (2008). Modeling and Visualizing Uncertainty in Gene Expression Clusters Using Dirichlet Process Mixtures. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6 (4), 615–628.
65. Ren L, Du L, Carin L, and Dunson DB (2011). Logistic Stick-Breaking Process. Journal of Machine Learning Research 12 (1).
66. Rigon T, Herring AH, and Dunson DB (2023). A Generalized Bayes Framework for Probabilistic Clustering. Biometrika, 1–14.
67. Rodríguez CE and Walker SG (2014). Univariate Bayesian Nonparametric Mixture Modeling With Unimodal Kernels. Statistics and Computing 24 (1), 35–49.
68. Rohe K, Chatterjee S, and Yu B (2011). Spectral Clustering and the High-Dimensional Stochastic Blockmodel. The Annals of Statistics 39 (4), 1878–1915.
69. Rolls ET, Huang C-C, Lin C-P, Feng J, and Joliot M (2020). Automated Anatomical Labelling Atlas 3. Neuroimage 206, 116189.
70. Schild A (2018). An Almost-Linear Time Algorithm for Uniform Random Spanning Tree Generation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 214–227.
71. Shi D, Zhang H, Wang S, Wang G, and Ren K (2021). Application of Functional Magnetic Resonance Imaging in the Diagnosis of Parkinson’s Disease: A Histogram Analysis. Frontiers in Aging Neuroscience 13, 624731.
72. Shi T, Belkin M, and Yu B (2009). Data Spectroscopy: Eigenspaces of Convolution Operators and Clustering. The Annals of Statistics, 3960–3984.
73. Snijders TA and Nowicki K (1997). Estimation and Prediction for Stochastic Blockmodels for Graphs With Latent Block Structure. Journal of Classification 14 (1), 75–100.
74. Socher R, Maas A, and Manning C (2011). Spectral Chinese Restaurant Processes: Nonparametric Clustering Based on Similarities. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 698–706. JMLR Workshop and Conference Proceedings.
75. Teh YW, Jordan MI, Beal MJ, and Blei DM (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101 (476), 1566–1581.
76. Von Luxburg U (2007). A Tutorial on Spectral Clustering. Statistics and Computing 17 (4), 395–416.
77. Wade S and Ghahramani Z (2018). Bayesian Cluster Analysis: Point Estimation and Credible Balls. Bayesian Analysis 13 (2), 559–626.
78. Wu S, Feng X, and Zhou W (2014). Spectral Clustering of High-Dimensional Data Exploiting Sparse Representation Vectors. Neurocomputing 135, 229–239.
79. Yu Y, Wang T, and Samworth RJ (2015). A Useful Variant of the Davis–Kahan Theorem for Statisticians. Biometrika 102 (2), 315–323.
80. Zelnik-Manor L and Perona P (2005). Self-Tuning Spectral Clustering. In Advances in Neural Information Processing Systems, Volume 17.
81. Zeng C, Miller JW, and Duan LL (2023). Consistent Model-Based Clustering Using the Quasi-Bernoulli Stick-Breaking Process. Journal of Machine Learning Research 24, 1–32.

Supplementary Materials

Supp 1