Author manuscript; available in PMC: 2022 Jul 15.
Published in final edited form as: Ann Stat. 2020 Feb 17;48(1):111–137. doi: 10.1214/18-aos1794

MODEL ASSISTED VARIABLE CLUSTERING: MINIMAX-OPTIMAL RECOVERY AND ALGORITHMS

Florentina Bunea 1,*, Christophe Giraud 2,*, Xi Luo 3,*, Martin Royer 2,*, Nicolas Verzelen 4,*
PMCID: PMC9286061  NIHMSID: NIHMS1765231  PMID: 35847529

Abstract

The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X1, … , Xp) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.

MSC2010 subject classifications: Primary 62H30, secondary 62C20

Keywords: Convergence rates, convex optimization, covariance matrices, high-dimensional inference

1. Introduction.

The problem of variable clustering is that of grouping similar components of a p-dimensional vector X = (X1,…, Xp). These groups are referred to as clusters. In this work, we investigate the problem of cluster recovery from a sample of n independent copies of X. Variable clustering has had a long history in a variety of fields, with important examples stemming from gene expression data [19, 23, 41] or protein profile data [8]. The solutions to this problem are typically algorithmic and entirely data based. They include applications of K-means, hierarchical clustering, spectral clustering or versions of them. The statistical properties of these procedures have received a very limited amount of investigation. It is not currently known what probabilistic cluster models on X can be estimated by these popular techniques, or by their modifications. More generally, model-based variable clustering has received a limited amount of attention. One net advantage of model-based clustering is that population-level clusters are clearly defined, offering both interpretability of the clusters and a benchmark against which one can check the quality of a particular clustering algorithm.

In this work, we propose the G-block covariance model as a flexible model for variable clustering and show that the clusters given by this model are uniquely defined. We then motivate and develop two algorithms tailored to the model, COD and PECOK, and analyze their respective performance in terms of exact cluster recovery, for minimally separated clusters, under appropriately defined cluster separation metrics.

1.1. The G-block covariance model.

Our proposed model for variable clustering assumes that the covariance matrix Σ of a centered random vector X ∈ ℝ^p follows a block, or near-block, decomposition, with blocks corresponding to a partition G = {G1, …, GK} of {1, …, p}. This structure of the covariance matrix has been observed to hold, empirically, in a number of very recent studies on the parcellation of the human brain, for instance, [18, 20, 25, 40]. We further support these findings in Section 7, where we apply the clustering methods developed in this paper, tailored to G-block covariance models, to the clustering of brain regions.

To describe our model, we associate, to a partition G, a membership matrix A ∈ ℝ^{p×K} defined by A_{ak} = 1 if a ∈ G_k, and A_{ak} = 0 otherwise.

  1. The exact G-block covariance model. In view of the above discussion, clustering the variables (X1, …, Xp) amounts to finding a minimal (i.e., coarsest) partition G*, such that two variables belong to the same cluster if they have the same covariance with all other variables. This implies that the covariance matrix Σ of X decomposes as
    Σ = A C* Aᵗ + Γ, (1.1)
    where A is relative to G*, C* is a symmetric K × K matrix and Γ a diagonal matrix. When such a decomposition exists with the partition G*, we say that X ∈ ℝ^p follows an (exact) G*-block covariance model.
    1. G-Latent model. Such a structure arises, for instance, when components of X that belong to the same group can be decomposed as the sum of a common latent variable and an uncorrelated random fluctuation. Similarity within a group is therefore given by association with the same unobservable source. Specifically, the exact block-covariance model (1.1) holds, with a diagonal matrix Γ, when
      X_a = Z_{k(a)} + E_a, (1.2)
      with Cov(Z_{k(a)}, E_a) = 0, Cov(Z) = C*, and the individual fluctuations E_a uncorrelated, so that E has diagonal covariance matrix Γ. The index assignment function k : {1, …, p} → {1, …, K} is defined by G_k = {a : k(a) = k}. In practice, this model is used to justify the construction of a single variable that represents a cluster, the average of X_a, a ∈ G_k, viewed as an observable proxy of Z_{k(a)}. For example, a popular analysis approach for fMRI data, called region-of-interest (ROI) analysis [36], requires averaging the observations from multiple voxels (an imaging unit for a small cubic volume of the brain) within each ROI (or cluster of voxels) to produce new variables, each representing a larger and interpretable brain area. These new variables are then used for downstream analyses. From this perspective, model (1.2) can be used in practice (see, e.g., [7]) as a building block in a data analysis based on cluster representatives, which in turn requires accurate cluster estimation. Indeed, data-driven methods for clustering either voxels into regions or regions into functional systems, especially based on the covariance matrix of X, are becoming increasingly important; see, for example, [18, 20, 37, 40]. Accurate data-driven clustering methods also enable studying the cluster differences across subjects [17] or experimental conditions [22].
    2. The Ising block model. The Ising block model has been proposed in [10] for modeling social interactions, for instance, political affinities. Under this model, the joint distribution of X ∈ {−1, 1}p, a p-dimensional vector with binary entries, is given by
      f(x) = κ_{α,β}^{-1} exp[ β/(2p) ∑_{a∼b} x_a x_b + α/(2p) ∑_{a≁b} x_a x_b ], (1.3)
      where the quantity κ_{α,β} is a normalizing constant, and the notation a ∼ b means that the elements are in the same group of the partition. The variables X_a may for instance represent the votes of U.S. senators on a bill [6]. For parameters α > β, the density (1.3) models the fact that senators belonging to the same political group tend to share the same vote. By symmetry of the density f, the covariance matrix Σ of X decomposes as an exact block covariance model Σ = AC*Aᵗ + Γ where Γ is diagonal. When all groups G*_k have identical size, we have C* = (ω_in − ω_out)I_K + ω_out J and Γ = (1 − ω_in)I, where the K × K matrix J has all entries equal to 1, I_K denotes the K × K identity matrix, and the quantities ω_in, ω_out depend on α, β, p.
  2. The approximate G-block model. In many situations, it is more appealing to group variables that nearly share the same covariance with all the other variables. In that situation, the covariance matrix Σ would decompose as
    Σ = ACAᵗ + Γ, where Γ has small off-diagonal entries. (1.4)
    Such a situation can arise, for instance, when X_a = (1 + δ_a)Z_{k(a)} + E_a, with δ_a = o(1) and the individual fluctuations E_a uncorrelated, 1 ≤ a ≤ p; a simulation sketch of both the exact and the approximate constructions is given below.
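To make the latent-variable construction (1.2) and its approximate version (1.4) concrete, here is a minimal simulation sketch in Python. The cluster sizes and parameter values are hypothetical choices of ours, for illustration only; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: K = 3 latent factors, clusters of sizes 4, 3, 3 (p = 10).
sizes = [4, 3, 3]
K, p, n = len(sizes), sum(sizes), 2000
k_of = np.repeat(np.arange(K), sizes)        # index assignment a -> k(a)
A = np.eye(K)[k_of]                          # p x K membership matrix

C_star = np.array([[1.0, 0.3, 0.2],          # Cov(Z); positive semidefinite
                   [0.3, 1.0, 0.1],
                   [0.2, 0.1, 1.0]])
gamma = rng.uniform(0.5, 1.5, size=p)        # diagonal of Gamma

Z = rng.multivariate_normal(np.zeros(K), C_star, size=n)   # latent factors
E = rng.normal(0.0, np.sqrt(gamma), size=(n, p))           # uncorrelated fluctuations
X = Z[:, k_of] + E                           # exact G-latent model (1.2)

Sigma = A @ C_star @ A.T + np.diag(gamma)    # exact G-block covariance (1.1)
Sigma_hat = X.T @ X / n                      # sample covariance of the centered data
print(np.abs(Sigma_hat - Sigma).max())       # small for large n

# Approximate G-block model: X_a = (1 + delta_a) Z_{k(a)} + E_a with small delta_a,
# so that the residual matrix has small off-diagonal entries, as in (1.4).
delta = rng.uniform(-0.05, 0.05, size=p)
X_approx = (1.0 + delta) * Z[:, k_of] + E
```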

1.2. Our contribution.

We assume that the data consist of i.i.d. observations X^{(1)}, …, X^{(n)} of a random vector X with mean 0 and covariance matrix Σ. This work is devoted to the development of computationally feasible methods that yield estimates Ĝ of G*, such that Ĝ = G*, with high probability, when the clusters are minimally separated, and to characterizing the minimal value of the cluster separation from a minimax perspective. The separation between clusters is a key element in quantifying the difficulty of a clustering task as, intuitively, well-separated clusters should be easier to identify. We consider two related, but different, separation metrics, that can be viewed as canonical whenever Σ satisfies (1.4). Although all of our results allow for, and are proved under, small departures from the diagonal structure of Γ in (1.1), our main contribution can be best seen when Γ is a diagonal matrix. We focus on this case below, for clarity of exposition. The case of Γ being a perturbation of a diagonal matrix is treated in Section 6.

When Γ is diagonal, our target partition G* can be easily defined. It is the unique minimal (with respect to partition refinement) partition G* for which there is a decomposition Σ = AC*Aᵗ + Γ, with A associated to G*. We refer to Section 2 for details. We observe, in particular, that max_{c≠a,b} |Σ_{ac} − Σ_{bc}| > 0 if and only if X_a and X_b belong to different clusters in G*.

This last remark motivates our first metric MCOD based on the following COvariance Difference (COD) measure:

COD(a, b) ≔ max_{c≠a,b} |Σ_{ac} − Σ_{bc}|   for any a, b = 1, …, p. (1.5)

We use the notation a ∼_{G*} b whenever a and b belong to the same group G*_k, for some k, in the partition G*, and similarly a ≁_{G*} b means that there does not exist any group G*_k of the partition G* that contains both a and b. We define the MCOD metric as

MCOD(Σ) ≔ min_{a ≁_{G*} b} COD(a, b). (1.6)

The measure COD(a, b) quantifies the similarity of the covariances that Xa and Xb have, respectively, with all other variables. From this perspective, the size of MCOD(Σ) is a natural measure for the difficulty of clustering when analyzing clusters with components that are similar in this sense. Moreover, note that this metric is well defined even if C* of model (1.1) is not semipositive definite.
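For reference, the population quantities (1.5) and (1.6) can be computed directly from Σ and a candidate partition; the short numpy sketch below (function names are ours, for illustration only) does exactly that. Applied to the population Σ of the simulation sketch above with the planted partition, mcod returns a strictly positive value, while any pair within a planted group has cod equal to zero.

```python
import numpy as np

def cod(Sigma, a, b):
    """COD(a, b) = max over c != a, b of |Sigma_ac - Sigma_bc|, as in (1.5)."""
    mask = np.ones(Sigma.shape[0], dtype=bool)
    mask[[a, b]] = False
    return np.max(np.abs(Sigma[a, mask] - Sigma[b, mask]))

def mcod(Sigma, groups):
    """MCOD(Sigma) = min of COD(a, b) over pairs (a, b) in different groups, as in (1.6)."""
    p = Sigma.shape[0]
    label = np.empty(p, dtype=int)
    for k, g in enumerate(groups):
        label[list(g)] = k
    return min(cod(Sigma, a, b)
               for a in range(p) for b in range(a + 1, p)
               if label[a] != label[b])
```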

Another cluster separation metric appears naturally when we view model (1.1) as arising via model (1.2), or via small deviations from it. Then clusters in (1.1) are driven by the latent factors, and intuitively they differ when the latent factors differ. Specifically, we define the “within-between group” covariance gap

Δ(C*) ≔ min_{j<k} ( C*_{kk} + C*_{jj} − 2C*_{jk} ) = min_{j<k} E[(Z_j − Z_k)²], (1.7)

where the second equality holds whenever (1.2) holds. In the latter case, the matrix C*, which is the covariance matrix of the latent factors, is necessarily semipositive definite. Further, we observe that Δ(C*) = 0 implies Zj = Zk a.s. Conversely, we prove in Corollary 2.3 of Section 2 that if the decomposition (1.1) holds with Δ(C*) > 0, then the partition related to A is the partition G* described above. An instance of Δ(C*) > 0 corresponds to having the within group covariances stronger than those between groups. This suggests the usage of this metric Δ(C*) for cluster analysis whenever, in addition to the general model formulation (1.1), we also expect clusters to have this property, which has been observed, empirically, to hold in applications. For instance, it is implicit in the methods developed by [18] for creating a human brain atlas by partitioning appropriate covariance matrices. We also present a neuroscience-based data example in Section 7.

Formally, the two metrics are connected via the following chain of inequalities, proved in Lemma B.1 of Section 2 of the Supplementary Material [14], and valid as soon as the size of the smallest cluster is larger than one, Γ is diagonal and C* is semipositive definite (for the last inequality):

2λ_K(C*) ≤ Δ(C*) ≤ 2 MCOD(Σ) ≤ 2 √(Δ(C*)) max_{k=1,…,K} √(C*_{kk}). (1.8)

The first inequality shows that conditions on Δ(C*) are weaker than conditions on the minimal eigenvalue λK (C*) of C*. In order to preserve the generality of our model, we do not necessarily assume that λK (C*) > 0, as we show that, for model identifiability, it is enough to have the weaker condition Δ(C*) > 0, when the two quantities differ.

The second inequality in (1.8) shows that Δ(C*) and MCOD(Σ) can have the same order of magnitude, whereas the third inequality shows that they can also differ in order, and Δ(C*) can be as small as MCOD²(Σ) for small values of these metrics, which is our main focus. This suggests that different statistical assessments, and possibly different algorithms, should be developed for estimators of clusters defined by (1.1), depending on the cluster separation metric. To substantiate this intuition, we first derive, for each metric, the rate below which no algorithm can recover exactly the clusters defined by (1.1). We call this the minimax optimal threshold for cluster separation, and prove that it is different for the two metrics. We call an algorithm that can be proved to recover exactly clusters with separation above the minimax threshold a minimax optimal algorithm.

Theorem 3.1 in Section 3 shows that, for K ≥ 3 and for some numerical constant c > 0, no algorithm can estimate consistently clusters defined by (1.1) uniformly over covariance matrices fulfilling

MCOD(Σ) ≤ c √(log(p)/n).

Theorem 3.2 in Section 3 shows that optimal separation distances with respect to the metric Δ(C*) are sensitive to the size of the smallest cluster,

m* = min_{1≤k≤K} |G*_k|.

Indeed, there exists a numerical constant c > 0, such that no algorithm can estimate consistently clusters defined by (1.1) uniformly over covariance matrices fulfilling

Δ(C*) ≤ c ( √(log(p)/(n m*)) ∨ log(p)/n ). (1.9)

The first term will be dominant whenever the smallest cluster has size m* < n/ log(p), which will be the case in most situations. The second term in (1.9) becomes dominant whenever m* > n/ log(p), which can also happen when p scales as n, and we have a few balanced clusters.

The PECOK algorithm is tailored to the Δ(C*) metric, and is shown in Theorem 5.3 to be near-minimax optimal. For instance, for balanced clusters, there exists a constant c′ such that exact recovery is guaranteed when Δ(C*) ≥ c′ ( √((K ∨ log p)/(m* n)) + (K ∨ log(p))/n ). This differs by factors in K from the Δ(C*)-minimax threshold, for general K, whereas it is of optimal order when K is a constant, or grows as slowly as log p. A similar discrepancy between minimax lower bounds and the performance of polynomial-time estimators has also been pinpointed in network clustering via the stochastic block model [16] and in sparse PCA [9]. It has been conjectured that, when K increases with n, there exists a gap between the statistical boundary, that is, the minimal cluster separation for which a statistical method achieves perfect clustering with high probability, and the polynomial boundary, that is, the minimal cluster separation for which there exists a polynomial-time algorithm that achieves perfect clustering. Further investigation of this computational trade-off is beyond the scope of this paper and we refer to [16] and [9] for more details.

However, if we consider directly the metric MCOD(Σ), and its corresponding, larger, minimax threshold, we derive the COD algorithm, which is minimax optimal with respect to MCOD(Σ) when K ≥ 3. In view of (1.8), it is also minimax optimal with respect to Δ(C*), whenever there exist small clusters, the size of which does not change with n. The description of the two algorithms and theoretical properties are given in Sections 4 and 5, respectively, for exact block covariance models. Companions of these results, regarding the performance of the algorithms for approximate block covariance models are given in Section 6, in Theorem 6.2 and Theorem 6.5, respectively.

Table 1 gives a snapshot of our results, which for ease of presentation, correspond to the case of balanced clusters, with the same number of variables per cluster. We stress that neither our algorithms, nor our theory, is restricted to this case, but the exposition becomes more transparent in this situation.

Table 1.

Algorithm performance relative to minimax thresholds of each metric

Metric | Minimax threshold | PECOK | COD
d1 ≔ Δ(C*) | √(log p/(mn)) ∨ log(p)/n | Minimax optimal w.r.t. d1 when K = O(log(p)) | Minimax optimal w.r.t. d1 when m is constant
d2 ≔ MCOD(Σ) | √(log p/n) when K ≥ 3 | Minimax optimal w.r.t. d2 when m > n/log(p) and K = O(log p) | Minimax optimal w.r.t. d2 when K ≥ 3

In this table, m denotes the size of the smallest cluster in the partition. The performance of COD under d1 follows from the second inequality in (1.8), whereas the performance of PECOK under d2 follows from the last inequality in (1.8). The overall message transmitted by Table 1 and our analysis is that, irrespective of the separation metric, the COD algorithm will be most powerful whenever we expect to have at least one, possibly more, small clusters, a situation that is typically not handled well in practice by most of the popular clustering algorithms; see [12] for an in-depth review. The PECOK algorithm is expected to work best for larger clusters, in particular when there are no clusters of size one. We defer more comments on the relative numerical performance of the methods to the discussion Section 8.3.

We emphasize that both our algorithms are generally applicable, and our performance analysis is only in terms of the most difficult scenarios, when two different clusters are almost indistinguishable and yet, as our results show, consistently estimable. Our extensive simulation results confirm these theoretical findings.

We summarize below our key contributions.

  1. An identifiable model for variable clustering and metrics for cluster separation. We advocate model-based variable clustering, as a way of proposing objectively defined and interpretable clusters. We propose identifiable G-block covariance models for clustering, and prove cluster identifiability in Proposition 2.2 of Section 2.

  2. Minimax lower bounds on cluster separation metrics for exact partition recovery. Two of our main results are Theorem 3.2 and Theorem 3.1, presented in Section 3, in which we establish, respectively, minimax limits on the size of the Δ(C*)-cluster separation and MCOD(Σ)-cluster separation below which no algorithm can recover clusters defined by (1.1) consistently, from a sample of size n on X. To the best of our knowledge, these are the first results of this type in variable clustering.

  3. Variable clustering procedures with guaranteed exact recovery of minimally separated clusters. The results of (1) and (2) provide a much needed framework for motivating variable clustering algorithm development and for clustering algorithm assessments.

    In particular, they motivate a correction of a convex relaxation of the K-means algorithm, leading to our proposed PECOK procedure, based on semidefinite programming (SDP). Theorem 5.3 shows it to be near-minimax optimal with respect to the Δ(C*) metric. The PECOK–Δ(C*) pairing is natural, as Δ(C*) measures the difference of the “within cluster” signal relative to the “between clusters” signal, which is the idea that underlies K-means type procedures. To the best of our knowledge, this is the first work that explicitly shows what model-based clusters of variables can be estimated via K-means style methods, and assesses theoretically the quality of estimation. Moreover, our work shows that the results obtained in [10], for the block Ising model, can be generalized to arbitrary values of K and unbalanced clusters.

    The COD procedure is a companion of PECOK for clusters given by model (1.1), and is minimax optimal with respect to the MCOD(Σ) cluster separation when K ≥ 3, as established in Theorem 3.1. Another advantage of COD is computational, as SDP-based methods, although convex, can be computationally involved.

  4. Comparison with corrected spectral variable clustering methods. In Section 5.4, we connect PECOK with another popular algorithm, spectral clustering. Spectral clustering is less computationally involved than PECOK, but the theoretical guarantees that we can offer for it are weaker.

1.3. Organization of the paper.

The rest of the paper is organized as follows:

Sections 1.4 and 1.5 contain the notation and distributional assumptions used throughout the paper.

For clarity of exposition, Sections 2–5 contain results established for model (1.1), when Γ is a diagonal matrix. Extensions to the case when Γ has small off-diagonal entries are presented in Section 6.

Section 2 shows that we have a uniquely defined target of estimation, the partition G*.

Section 3 derives the minimax thresholds on the separation metrics Δ(C*) and MCOD(Σ), respectively, for estimating G* consistently.

Section 4 is devoted to the COD algorithm, and its analysis.

Section 5 is devoted to the PECOK algorithm and its analysis.

Section 5.4 analyzes spectral clustering for variable clustering, and compares it with PECOK.

Section 6 contains extensions to approximate G-block covariance models.

Section 7 presents the application of our methods to the clustering of putative brain areas using real fMRI data.

Section 8 contains a discussion of our results and overall recommendations regarding the usage of our methods. Given the space constraints, all proofs and simulation results are included in the Supplementary Material.

The implementation of PECOK can be found at http://github.com/martinroyer/pecok/ and that of COD at http://CRAN.R-project.org/package=cord.

1.4. Notation.

We denote by X the n × p matrix with rows corresponding to the observations X^{(i)} ∈ ℝ^p, for i = 1, …, n. The sample covariance matrix Σ̂ is defined by

Σ̂ = (1/n) XᵗX = (1/n) ∑_{i=1}^n X^{(i)} (X^{(i)})ᵗ.

Given a vector υ and q ≥ 1, |υ|_q stands for the ℓ_q norm. For a generic matrix M, |M|_q denotes its entrywise ℓ_q norm, ‖M‖_op denotes its operator norm and ‖M‖_F refers to its Frobenius norm. We use M_{:a} and M_{b:} to denote the ath column and, respectively, the bth row of a generic matrix M. The bracket 〈·, ·〉 refers to the Frobenius scalar product. Given a matrix M, we denote by supp(M) its support, that is, the set of indices (i, j) such that M_{ij} ≠ 0. I denotes the identity matrix. We define the variation seminorm of a diagonal matrix D as |D|_V ≔ max_a D_{aa} − min_a D_{aa}. We use B ⪰ 0 to denote a symmetric and positive semidefinite matrix.

Throughout this paper we will make use of the notation c1, c2, … to denote positive constants independent of n, p, K, m. The same letter, for instance c1, may be used in different statements and may denote different constants, which are made clear within each statement, when there is no possibility for confusion.

We use [p] to denote the set {1, …, p}. We use the notation a ∼_G b whenever a, b ∈ G_k, for the same k. Also, m = min_k |G_k| stands for the size of the smallest group of the partition G.

The notation ≳ and ≲ is used whenever the inequalities hold up to multiplicative numerical constants.

1.5. Distributional assumptions.

For a p-dimensional random vector Y, its Orlicz norm is defined by ‖Y‖_{ψ2} = sup_{t ∈ ℝ^p : ‖t‖_2 = 1} inf{ s > 0 : E[ exp( 〈Y, t〉² / s² ) ] ≤ 2 }. Throughout the paper, we will assume that X follows a sub-Gaussian distribution. Specifically, we use the following.

Assumption 1 (Sub-Gaussian distributions).

There exists L > 0 such that the random vector Σ^{−1/2}X satisfies ‖Σ^{−1/2}X‖_{ψ2} ≤ L.

Our class of distributions includes, in particular, that of bounded distributions, which may be of independent interest, as example (ii) illustrates. We will therefore also specialize some of our results to this case, in which case we will use directly:

Assumption 1-bis (Bounded distributions).

There exists M > 0 such that maxi=1,…, p |Xi|≤ M almost surely.

Gaussian distributions satisfy Assumption 1 with L = 1. A bounded distribution is also sub-Gaussian, but the corresponding quantity L can be much larger than M, and sharper results can be obtained if Assumption 1-bis holds.

2. Cluster identifiability in G-block covariance models.

To keep the presentation focused, we consider in Sections 2–5 the model (1.1) with Γ diagonal. We treat the case corresponding to a diagonally dominant Γ in Section 6 below. In the sequel, it is assumed that p > 2.

We observe that if the decomposition (1.1) holds for a partition G, it also holds for any subpartition of G. It is natural therefore to seek the smallest (coarsest) of such partitions, that is the partition with the least number of groups for which (1.1) holds. Since the partition ordering is a partial order, the smallest partition is not necessarily unique. However, the following lemma shows that uniqueness is guaranteed for our model class.

Lemma 2.1.

Consider any covariance matrix Σ:

  1. There exists a unique minimal partition G* such that Σ = AC At + Г for some diagonal matrix Г, some membership matrix A associated to G* and some matrix C.

  2. The partition G* is given by the equivalence classes of the relation
    a ≡ b   if and only if   COD(a, b) ≔ max_{c≠a,b} |Σ_{ac} − Σ_{bc}| = 0. (2.1)

Proof.

If the decomposition Σ = ACAᵗ + Γ holds with A related to a partition G, then we have COD(a, b) = 0 for any a, b belonging to the same group of G. Hence, each group G_k of G is included in one of the equivalence classes of ≡. As a consequence, G is a finer partition than the partition G* defined in part 2 of the lemma. Hence, G* is the (unique) minimal partition such that the decomposition Σ = ACAᵗ + Γ holds. □

As a consequence, the partition G* is well defined and is identifiable. Next we discuss the definitions of the MCOD and Δ metrics. For any partition G, we let MCOD(Σ, G) ≔ min_{a ≁_G b} COD(a, b), where we recall that the notation a ≁_G b means that a and b are not in the same group of the partition G. By definition of G*, we notice that MCOD(Σ, G*) > 0 and the next proposition shows that G* is characterized by this property.

Proposition 2.2.

Let G be any partition such that MCOD(Σ, G) > 0 and the decomposition Σ = AC At + Г holds with A associated to G. Then G = G*.

The proofs of this proposition and the following corollary are given in Section B of the Supplementary Material [14]. In what follows, we use the notation MCOD(Σ) for MCOD(Σ, G*).

In general, without further restrictions on the model parameters, the decomposition Σ = ACAᵗ + Γ with A relative to G* is not unique. If, for instance, Σ is the identity matrix I, then G* is the complete partition (with p groups) and the decomposition (1.1) holds for any (C, Γ) = (λI, (1 − λ)I) with λ ∈ ℝ.

Recall that m* ≔ min_k |G*_k| stands for the size of the smallest cluster. If we assume that m* > 1 (no singleton), then Γ is uniquely defined. Besides, the matrix C in (1.1) is only defined up to a permutation of its rows and columns. In the sequel, we denote by C* any of these matrices C. When the partition contains singletons (m* = 1), the matrix decomposition Σ = ACAᵗ + Γ is made unique (up to a permutation of rows and columns of C) by imposing the additional constraint that the entries Γ_{aa} corresponding to singletons are equal to 0. Since the definition of Δ(C) is invariant with respect to permutations of rows and columns, this implies that Δ(C*) is well defined for any covariance matrix Σ.

For arbitrary Σ, Δ(C*) is not necessarily positive. Nevertheless, if Δ(C*) > 0, then G* is characterized by this property.

Corollary 2.3.

Let G be a partition such that m = mink |Gk| ≥ 2, the decomposition Σ = AC At + Г holds with A associated to G and Δ(C) > 0. Then G = G*.

As pointed out in (1.7), in the latent model (1.2), Δ(C*) is equal to the square of the minimal L2-distance between two latent variables. So, in this case, the condition Δ(C*) > 0 simply requires that all latent variables are distinct.

3. Minimax thresholds on cluster separation for perfect recovery.

Before developing variable clustering procedures, we begin by assessing the limits of the size of each of the two cluster separation metrics below which no algorithm can be expected to recover the clusters perfectly. We denote by m* = min_k |G*_k| the size of the smallest cluster of the target partition G* defined above. For 1 ≤ m ≤ p/2 and η > 0, we define M(m, η) as the set of covariance matrices Σ fulfilling MCOD(Σ) > η|Σ| and whose associated partition G* has groups of equal size m* ≥ m. Similarly, for τ > 0, we define D(m, τ) as the set of covariance matrices Σ fulfilling Δ(C*) > τ|Γ| and whose associated partition G* has groups of equal size m* ≥ m. We use the notation ℙ_Σ to refer to the normal distribution with covariance Σ.

Theorem 3.1.

There exists a positive constant c1 such that, for any 1 ≤ mp/3 and any η such that

0 ≤ η < η* ≔ c1 √(log(p)/n), (3.1)

we have inf_Ĝ sup_{Σ ∈ M(m,η)} ℙ_Σ(Ĝ ≠ G*) ≥ 1/7, where the infimum is taken over all possible estimators.

When 2 ≤ m = p/2, the same result holds but with the Condition (3.1) replaced by

0 ≤ η < η* ≔ c1 [ √(log(p)/(np)) ∨ log(p)/n ]. (3.2)

We also have the following.

Theorem 3.2.

There exist positive constants c1–c3 such that the following holds for any 2 ≤ mp/2. For any τ such that

0 ≤ τ < τ* ≔ c1 [ √(log(p)/(n(m−1))) ∨ log(p)/n ], (3.3)

we have inf_Ĝ sup_{Σ ∈ D(m,τ)} ℙ_Σ(Ĝ ≠ G*) ≥ 1/7, where the infimum is taken over all estimators.

Conversely, there exists a procedure Ĝ satisfying sup_{Σ ∈ D(m,τ)} ℙ_Σ(Ĝ ≠ G*) ≤ c3/p for any τ such that

τ ≥ c2 [ √(log(p)/(n(m−1))) ∨ log(p)/n ].

Theorems 3.2 and 3.1 show that if either metric falls below the thresholds in (3.3) and (3.1) or (3.2), respectively, the estimated partition G^, irrespective of the method of estimation, cannot achieve perfect recovery with high probability uniformly over the set M(m,η) or D(m,τ). The proofs are given in Section C of the Supplementary Material [14]. We note that the Δ(C*) minimax threshold takes into account the size m* of the smallest cluster and, therefore, the required cluster separation becomes smaller for large clusters. This is not the case for the second metric, as soon as there are at least 3 groups. The proof of (3.1) provides an example where we have K = 3 clusters, that are very large, of size m* = p/3 each, and where the MCOD(Σ) threshold does not decrease with m*.

Theorem 3.2 also provides a matching upper bound for the minimax threshold. Unfortunately, the procedure achieving this bound has an exponential computational complexity (see Section C.3 in the Supplementary Material [14] for further details and Section 5 for a near-minimax optimal algorithm with polynomial computational complexity).

4. COD for variable clustering.

4.1. COD procedure.

We begin with a procedure that can be viewed as natural for model (1.1). It is based on the following intuition. Two indices a and b belong to the same cluster of G*, if and only if COD(a, b) = 0, with COD defined in (2.1). Equivalently, a and b belong to the same cluster when

sCOD(a, b) ≔ max_{c≠a,b} |cov(X_a − X_b, X_c)| / √( var(X_a − X_b) var(X_c) ) = max_{c≠a,b} |cor(X_a − X_b, X_c)| = 0,

where sCOD stands for scaled COvariance Differences. In the following, we work with this quantity, as it is scale invariant. It is natural to place a and b in the same cluster when the estimator sCOD^(a,b) is below a certain threshold, where

ŝCOD(a, b) ≔ max_{c≠a,b} |ĉor(X_a − X_b, X_c)| = max_{c≠a,b} |Σ̂_{ac} − Σ̂_{bc}| / √( (Σ̂_{aa} + Σ̂_{bb} − 2Σ̂_{ab}) Σ̂_{cc} ). (4.1)

We estimate the partition Ĝ according to the simple COD algorithm explained below. The algorithm does not require as input the specification of the number K of groups, which is automatically estimated by our procedure. Step 3(c) of the algorithm is called the “or” rule, and can be replaced with the “and” rule below, without changing the theoretical properties of our algorithm,

Ĝ_l = { j ∈ S : ŝCOD(a_l, j) ∨ ŝCOD(b_l, j) ≤ α }.

The numerical performances of these two rules are also very close in simulation studies, similar to what we reported for a related COD procedure based on correlations [13]. Given these small differences, we focus on the “or” rule for the sake of space. The algorithmic complexity for computing Σ̂ is O(p²n) and the complexity of COD is O(p³), so the overall complexity of our estimation procedure is O(p²(p ∨ n)). The procedure is also valid when Γ has very small off-diagonal entries, and the results are presented in Section 6.

The COD Algorithm.

  • Input: Σ^ and α > 0

  • Initialization: S = {1, …, p} and l = 0

  • Repeat: while S ≠ ∅
    1. l ← l + 1
    2. If |S| = 1 Then Ĝ_l = S
    3. If |S| > 1 Then
      a. (a_l, b_l) = argmin_{a,b ∈ S, a≠b} ŝCOD(a, b)
      b. If ŝCOD(a_l, b_l) > α Then Ĝ_l = {a_l}
      c. Else Ĝ_l = { j ∈ S : ŝCOD(a_l, j) ∧ ŝCOD(b_l, j) ≤ α }
    4. S ← S \ Ĝ_l
  • Output: the partition Ĝ = (Ĝ_l)_{l=1,…,k}
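A direct Python transcription of the COD algorithm follows. It is a minimal sketch of ours (it is not the released cord package, and the function names are hypothetical), taking the sample covariance Σ̂ and the threshold α as inputs, computing ŝCOD as in (4.1) and applying the “or” rule of Step 3(c). With α of the order of √(log p/n), as prescribed by Theorem 4.1, the number of groups is determined automatically.

```python
import numpy as np

def scod_hat(Sigma_hat, a, b):
    """Estimator (4.1): max over c != a, b of |S_ac - S_bc| / sqrt((S_aa + S_bb - 2 S_ab) S_cc)."""
    mask = np.ones(Sigma_hat.shape[0], dtype=bool)
    mask[[a, b]] = False
    denom = Sigma_hat[a, a] + Sigma_hat[b, b] - 2.0 * Sigma_hat[a, b]
    num = np.abs(Sigma_hat[a, mask] - Sigma_hat[b, mask])
    return np.max(num / np.sqrt(denom * Sigma_hat.diagonal()[mask]))

def cod_cluster(Sigma_hat, alpha):
    """COD algorithm with the 'or' rule; returns the estimated partition as a list of index lists."""
    S = list(range(Sigma_hat.shape[0]))
    partition = []
    while S:
        if len(S) == 1:                                    # Step 2
            partition.append(S)
            break
        pairs = [(scod_hat(Sigma_hat, a, b), a, b)         # Step 3(a): closest pair in S
                 for i, a in enumerate(S) for b in S[i + 1:]]
        val, a_l, b_l = min(pairs)
        if val > alpha:                                    # Step 3(b): a_l forms a singleton
            group = [a_l]
        else:                                              # Step 3(c): the "or" rule
            group = [j for j in S if j in (a_l, b_l)
                     or min(scod_hat(Sigma_hat, a_l, j),
                            scod_hat(Sigma_hat, b_l, j)) <= alpha]
        partition.append(group)
        S = [j for j in S if j not in group]               # Step 4
    return partition
```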

4.2. Perfect cluster recovery with COD for minimax optimal MCOD(Σ) cluster separation.

Theorem 4.1 shows that the partition G^ produced by the COD algorithm has the property that G^=G*, with high probability, as soon as the separation MCOD(Σ) between clusters exceeds a constant times the threshold (3.1) of Theorem 3.1 of the previous section.

Theorem 4.1.

Under the distributional Assumption 1, there exist numerical constants c1, c2 > 0 such that, if

α ≥ c1 L² √(log(p)/n)

and MCOD(Σ) > 3α|Σ|, then we have exact cluster recovery with probability 1 − c2/p.

We recall that for Gaussian data the constant L = 1. The proof is given in Section D.1 of the Supplementary Material [14].

We observe that while the COD algorithm succeeds in recovering G* at the minimax separation rate (3.1) when K ≥ 3, it does not offer guarantees at the minimax separation rate (3.2) when K = 2. In this last case (K = 2), we observe that

(1/2) Δ(C*) ≤ MCOD(Σ) ≤ Δ(C*),

so the metric MCOD(Σ) is equivalent to Δ(C*), and we refer to Section 5 for an optimal algorithm.

4.3. A data-driven calibration procedure for COD.

The performance of the COD algorithm depends on the value of the threshold parameter α. Whereas Theorem 4.1 ensures that a good value for α is of the order of √(log p/n), its optimal value depends on the actual distribution (at least through the sub-Gaussian norm) and is unknown to the statistician. We propose below a new, fully data-dependent, criterion for selecting α, and the corresponding partition Ĝ, from a set of candidate partitions. This criterion is based on data splitting. Let us consider two independent subsamples of the original sample, D1 and D2, each of size n/2.

We denote by G^1 a collection of partitions computed from D1, for instance via the COD algorithm with a varying threshold α. For any a < b, we use Di, i = 1, 2, to calculate, respectively,

Δ̂^(i)_{ab} ≔ [ Ĉor^{(i)}(X_a − X_b, X_c) ]_{c≠a,b},   i = 1, 2.

Since Δ_{ab} ≔ [Cor(X_a − X_b, X_c)]_{c≠a,b} equals zero if and only if a ∼_{G*} b, we want to select a partition G such that Δ̂^(2)_{ab} 1_{a ≁_G b} is a good predictor of Δ_{ab}. To implement this principle, it remains to evaluate Δ_{ab} independently of Δ̂^(2)_{ab}. For this evaluation, we propose to reuse the sample D1, which has already been used to build the family of partitions Ĝ^(1). More precisely, we select Ĝ ∈ Ĝ^(1) by minimizing the data-splitting criterion H:

Ĝ ∈ argmin_{G ∈ Ĝ^(1)} H(G)   with   H(G) ≔ ∑_{a<b} |Δ̂^(2)_{ab} 1_{a ≁_G b} − Δ̂^(1)_{ab}|_2^2.

The following proposition assesses the performance of G^. We need the following additional assumption.

(P1) If Cor(X_a − X_b, X_c) = 0, then E[Ĉor(X_a − X_b, X_c)] = 0.

In general, the sample correlation is not an unbiased estimator of the population-level correlation. Still, (P1) is satisfied when the data are normally distributed or in a latent model (1.2) when the noise variables E_a have a symmetric distribution. The next proposition provides guarantees for the criterion H, averaged over D2, and denoted by E^(2)[H(G)]. The proof is given in Section D.2 of the Supplementary Material [14].

Proposition 4.2.

Assume that the distributional Assumption 1 and (P1) hold. Then there exists a constant c1 > 0 such that, when MCOD(Σ) > c1 |Σ| L² √(log(p)/n), we have

E^(2)[H(G*)] ≤ min_{G ∈ Ĝ^(1)} E^(2)[H(G)], (4.2)

both with probability larger than 1 − 4/p and in expectation with respect to D1.

Under the condition MCOD(Σ) > c1 |Σ| L² √(log(p)/n), Theorem 4.1 ensures that G* belongs to Ĝ^(1) with high probability, whereas (4.2) suggests that the criterion is minimized at G*.

If we consider a data-splitting algorithm based on COD^(a,b) instead of sCOD^(a,b), then we can obtain a counterpart of Proposition 4.2 without requiring the additional assumption (P1). Still, we favor the procedure based on sCOD^(a,b) mainly for its scale-invariance property.
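The data-splitting selection can be prototyped as follows. This is our own minimal sketch with hypothetical names: it takes a family of candidate partitions (e.g., COD run on the first half of the sample over a grid of thresholds α) and returns the one minimizing the criterion H of this subsection.

```python
import numpy as np

def delta_hat(X, a, b):
    """Vector [cor(X_a - X_b, X_c)]_{c != a, b} computed from the sample X (n x p)."""
    mask = np.ones(X.shape[1], dtype=bool)
    mask[[a, b]] = False
    C = np.corrcoef(np.column_stack([X[:, a] - X[:, b], X[:, mask]]), rowvar=False)
    return C[0, 1:]

def select_partition(X, candidates):
    """Pick the candidate partition minimizing the data-splitting criterion H.
    `candidates` should be built from the first half of X (e.g., COD over a grid of alpha)."""
    n, p = X.shape
    X1, X2 = X[: n // 2], X[n // 2:]                  # subsamples D1 and D2
    pairs = [(a, b) for a in range(p) for b in range(a + 1, p)]
    d1 = {ab: delta_hat(X1, *ab) for ab in pairs}     # Delta-hat^(1)
    d2 = {ab: delta_hat(X2, *ab) for ab in pairs}     # Delta-hat^(2)

    def H(G):
        label = np.empty(p, dtype=int)
        for k, g in enumerate(G):
            label[list(g)] = k
        return sum(np.sum(((d2[ab] if label[ab[0]] != label[ab[1]] else 0.0) - d1[ab]) ** 2)
                   for ab in pairs)

    return min(candidates, key=H)
```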

5. Penalized convex K-means: PECOK.

5.1. PECOK algorithm.

Motivated by the fact that the COD algorithm is minimax optimal with respect to the MCOD(Σ) metric for K ≥ 3, but not necessarily with respect to the Δ(C*) metric (unless the size of the smallest cluster is constant), we propose below an alternative procedure, that adapts to this metric. Our second method is a natural extension of one of the most popular clustering strategies. When we view the G-block covariance model as arising via the latent factor representation in (i) in the Introduction, the canonical clustering approach would be via the K-means algorithm [30], which is NP-hard [5]. Following Peng and Wei [34], we consider a convex relaxation of it, which is computationally feasible in polynomial time. We argue below that, for estimating clusters given by (1.1), one needs to further tailor it to our model. The statistical analysis of the modified procedure is the first to establish consistency of variable clustering via K-means type procedures, to the best of our knowledge.

The estimator offered by the standard K-means algorithm, with the number K of groups of G* known, is

Ĝ ∈ argmin_G crit(X, G)   with   crit(X, G) = ∑_{a=1}^p min_{k=1,…,K} |X_{:a} − X̄_{G_k}|_2^2, (5.1)

and X̄_{G_k} = |G_k|^{-1} ∑_{a ∈ G_k} X_{:a}.

For a partition G, let us introduce the corresponding partnership matrix B by

B_{ab} = 1/|G_k|  if a and b are in the same group G_k,   and   B_{ab} = 0  if a and b are in different groups. (5.2)

We observe that Bab > 0 if and only if a~Gb. In particular, there is a one-to-one correspondence between partitions G and their corresponding partnership matrices. It is shown in Peng and Wei [34] that the collection of such matrices B is described by the collection O of orthogonal projectors fulfilling tr(B) = K, B1 = 1 and Bab ≥ 0 for all a, b.

Theorem 2.2 in Peng and Wei [34] shows that solving the K-means problem is equivalent to finding the global maximum

B̄ = argmax_{B ∈ O} ⟨Σ̂, B⟩, (5.3)

and then recovering G^ from B¯.

The set of orthogonal projectors is not convex, so, following Peng and Wei [34], we consider a convex relaxation C of O obtained by relaxing the condition “B orthogonal projector” to “B positive semidefinite,” leading to

C ≔ { B ∈ ℝ^{p×p} : B ⪰ 0 (symmetric and positive semidefinite), ∑_a B_{ab} = 1 for all b, B_{ab} ≥ 0 for all a, b, tr(B) = K }. (5.4)

Thus, the (uncorrected) convex relaxation of K-means is equivalent with finding

B̃ = argmax_{B ∈ C} ⟨Σ̂, B⟩. (5.5)

To assess the relevance of this estimator, we first study its behavior at the population level, when Σ^ is replaced by Σ in (5.5). Indeed, if the minimizer of our criterion does not recover the true partition at the population level, we cannot expect it to be consistent, even in a large sample asymptotic context (fixed p, n goes to infinity). We recall that |Г|V ≔ maxa Гaa – mina Гaa.

Proposition 5.1.

Assume that Δ(C*) > 2|Γ|_V/m*. Then B* = argmax_{B ∈ O} ⟨Σ, B⟩. If Δ(C*) > 7|Γ|_V/m*, then B* = argmax_{B ∈ C} ⟨Σ, B⟩.

For Δ(C*) large enough, the population version of convexified K-means recovers B*. The next proposition illustrates that the condition Δ(C*) > 2|Г|V /m* for population K-means is in fact necessary.

Proposition 5.2.

Consider the model (1.1) with

C* = [ α  0  0 ; 0  β  β−τ ; 0  β−τ  β ],   Γ = [ γ₊  0  0 ; 0  γ  0 ; 0  0  γ ]   and   |G*_1| = |G*_2| = |G*_3| = m*.

The population maximizer B_Σ = argmax_{B ∈ O} ⟨Σ, B⟩ is not equal to B* as soon as 2τ = Δ(C*) < (2/m*)|Γ|_V.

The two propositions above are proved in Section A.1 in the Supplementary Material [14]. As a consequence, when Γ is not proportional to the identity matrix, the population minimizers based on K-means and convexified K-means do not necessarily recover the true partition even when the “within-between group” covariance gap is strictly positive. This undesirable behavior of K-means is not completely unexpected as K-means is a quantization algorithm which aims to find clusters of similar width, instead of “homogeneous” clusters. Hence, we need to modify it for our purpose.

This leads us to suggest a population-level correction. Indeed, as a direct corollary of Proposition 5.1, we have

B* = argmax_{B ∈ C} ⟨Σ − Γ, B⟩,

as long as Δ(C*) > 0. This suggests the following PEnalized COnvex K-means (PECOK) algorithm, in three steps. The main step, Step 2, produces an estimator B̂ of B*, from which we derive the estimated partition Ĝ. We summarize this below.

The PECOK Algorithm.

  • Step 1. Estimate Г by Γ^.

  • Step 2. Estimate B* by B̂ = argmax_{B ∈ C} ( ⟨Σ̂, B⟩ − ⟨Γ̂, B⟩ ).

  • Step 3. Estimate G* by applying a clustering algorithm to the columns of B^.

The required inputs for Step 2 of our algorithm are: (i) Σ̂, the sample covariance matrix; (ii) Γ̂, the estimator produced at Step 1; and (iii) K, the number of groups. Our only requirement on the clustering algorithm applied in Step 3 is that it succeeds in recovering the partition G* when applied to the true partnership matrix B*. The standard K-means algorithm [30] seeded with K distinct centroids, kmeans++ [4] or any approximate K-means as defined in (5.13) in Section 5.4, fulfill this property.

We view the term ⟨Γ̂, B⟩ as a penalty term on B, with data-dependent weights Γ̂. Therefore, the construction of an accurate estimator Γ̂ of Γ is a crucial step for guaranteeing the statistical optimality of the PECOK estimator.
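Step 2 is a standard semidefinite program and can be prototyped with a generic SDP solver. The sketch below is our own illustration (it is not the authors' released pecok package): it uses cvxpy for Step 2 and K-means on the columns of B̂ for Step 3. Here Gamma_hat is the Step 1 estimate, e.g., the one constructed in Section 5.2; setting it to zero gives the uncorrected relaxation (5.5).

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import KMeans

def pecok_steps_2_3(Sigma_hat, Gamma_hat, K):
    """Solve the penalized convex relaxation over the set C of (5.4), then cluster columns of B-hat."""
    p = Sigma_hat.shape[0]
    B = cp.Variable((p, p), symmetric=True)
    constraints = [B >> 0,                  # B positive semidefinite
                   cp.sum(B, axis=1) == 1,  # row sums equal to one (B1 = 1)
                   B >= 0,                  # nonnegative entries
                   cp.trace(B) == K]        # trace equal to the number of groups
    objective = cp.Maximize(cp.trace((Sigma_hat - Gamma_hat) @ B))  # <Sigma-hat, B> - <Gamma-hat, B>
    cp.Problem(objective, constraints).solve()
    B_hat = B.value
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(B_hat.T)   # Step 3 on the columns of B-hat
    return labels, B_hat
```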

5.2. Construction of Γ^.

Estimating Г before estimating the partition itself is a nontrivial task, and needs to be done with care. We explain our estimation below and analyze it in Proposition A.10 in Section A.5. We show that this estimator of Г is appropriate whenever Г is a diagonal matrix (or diagonally dominant, with small off-diagonal entries). For any a, b ∈ [p], define

V(a, b) ≔ max_{c,d ∈ [p]\{a,b}} |(Σ̂_{ac} − Σ̂_{ad}) − (Σ̂_{bc} − Σ̂_{bd})| / √(Σ̂_{cc} + Σ̂_{dd} − 2Σ̂_{cd}), (5.6)

with the convention 0/0 = 0. Guided by the block structure of Σ, we define

b_1(a) ≔ argmin_{b ∈ [p]\{a}} V(a, b)   and   b_2(a) ≔ argmin_{b ∈ [p]\{a, b_1(a)}} V(a, b),

to be two elements “close” to a, that is, two indices b_1 = b_1(a) and b_2 = b_2(a) such that the empirical covariance differences Σ̂_{b_i c} − Σ̂_{b_i d}, i = 1, 2, are most similar to Σ̂_{ac} − Σ̂_{ad}, for all variables c and d not equal to a or b_i, i = 1, 2. It is expected that b_1(a) and b_2(a) either belong to the same group as a, or belong to some “close” groups. Then our estimator Γ̂ is a diagonal matrix, defined by

Γ̂_{aa} = Σ̂_{aa} + Σ̂_{b_1(a) b_2(a)} − Σ̂_{a b_1(a)} − Σ̂_{a b_2(a)}   for a = 1, …, p. (5.7)

Intuitively, Γ̂_{aa} should be close to Σ_{aa} + Σ_{b_1(a) b_2(a)} − Σ_{a b_1(a)} − Σ_{a b_2(a)}, which is equal to Γ_{aa} in the favorable event where both b_1(a) and b_2(a) belong to the same group as a.

In general, b_1(a) and b_2(a) cannot be guaranteed to belong to the same group as a. Nevertheless, these two surrogates b_1(a) and b_2(a) are close enough to a so that |Γ̂_{aa} − Γ_{aa}| is at most of the order of |Γ|_∞ √(log(p)/n), uniformly in a (i.e., in ℓ_∞-norm), as shown in Proposition A.10 in Section A.5 of the Supplementary Material [14]. A slightly simpler estimator of Γ was proposed in Appendix A of a previous version of this work [15], but a bound on |Γ̂_{aa} − Γ_{aa}| for that estimator contains a factor proportional to |Σ|, which is not desirable, and can be avoided by (5.7). In the next subsection, we show that our proposed Γ̂ leads to perfect recovery of G*, via PECOK, under minimal separation conditions.
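A naive O(p⁴) numpy transcription of (5.6)–(5.7), our own sketch and not an optimized implementation, reads as follows.

```python
import numpy as np

def gamma_hat(Sigma_hat):
    """Diagonal estimator Gamma-hat from (5.6)-(5.7)."""
    p = Sigma_hat.shape[0]
    diag = np.diag(Sigma_hat)

    def V(a, b):
        # V(a, b) = max over c, d not in {a, b} of
        #           |(S_ac - S_ad) - (S_bc - S_bd)| / sqrt(S_cc + S_dd - 2 S_cd), with 0/0 = 0.
        idx = [c for c in range(p) if c not in (a, b)]
        Ma = Sigma_hat[a, idx][:, None] - Sigma_hat[a, idx][None, :]
        Mb = Sigma_hat[b, idx][:, None] - Sigma_hat[b, idx][None, :]
        S = Sigma_hat[np.ix_(idx, idx)]
        den = np.sqrt(np.maximum(diag[idx][:, None] + diag[idx][None, :] - 2.0 * S, 0.0))
        num = np.abs(Ma - Mb)
        ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        return ratio.max()

    out = np.zeros(p)
    for a in range(p):
        # b1(a), b2(a): the two indices "closest" to a according to V
        b1, b2 = sorted((b for b in range(p) if b != a), key=lambda b: V(a, b))[:2]
        out[a] = (Sigma_hat[a, a] + Sigma_hat[b1, b2]
                  - Sigma_hat[a, b1] - Sigma_hat[a, b2])
    return np.diag(out)
```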

Note that PECOK requires the knowledge of the true number K of groups. When the number K of groups itself is unknown, we can modify the PECOK criterion by adding a penalty term as explained in a previous version of our work [15], Section 4. Alternatively, we propose in Section G of Supplementary Material [14] selection via a simple data-splitting procedure.

5.3. Perfect cluster recovery with PECOK for near-minimax Δ-cluster separation.

We show in this section that the PECOK estimator recovers the clusters exactly, with high probability, at a near-minimax separation rate with respect to the Δ(C*) metric.

Theorem 5.3.

There exist three positive constants c1, c2, c3 such that the following holds. Let Γ̂ be any estimator of Γ such that |Γ̂ − Γ|_V ≤ δ_{n,p} with probability 1 − c1/p. Then, under Assumption 1, when L⁴ log(p) ≤ c3 n and

Δ(C*) ≥ c_L |Γ|_∞ [ √(log(p)/(m* n)) + √(p/(n m*²)) + log(p)/n + p/(n m*) ] + δ_{n,p}/m*, (5.8)

then B̂ = B* and Ĝ = G*, with probability higher than 1 − c1/p. Here, c_L is a positive constant that only depends on L in Assumption 1. In particular, if Γ̂ is the estimator (5.7), the same conclusion holds with probability higher than 1 − c2/p when

Δ(C*) ≥ c_L |Γ|_∞ [ √(log(p)/(m* n)) + √(p/(n m*²)) + log(p)/n + p/(n m*) ]. (5.9)

The proof is given in Section A.3 of the Supplementary Material [14].

Remark 1.

We left the term δ_{n,p} explicit in (5.8) in order to make clear how the estimation of Γ affects the cluster separation metric Δ(C*). Without a correction (i.e., taking Γ̂ = 0), the term δ_{n,p}/m* equals |Γ|_V/m*, which is nonzero (and does not decrease in a high-sample asymptotic) unless Γ has equal diagonal entries. This phenomenon is consistent with the population analysis in the previous subsection. Display (5.9) shows that the separation condition can be much weakened with the correction. In particular, for balanced clusters, that is, when m* = p/K, exact recovery is guaranteed when

Δ(C*) ≥ c_L [ √((K ∨ log p)/(m* n)) + (K ∨ log p)/n ], (5.10)

for an appropriate constant cL > 0. In view of Theorem 3.2, when m* ≥ cp/ log(p) the rate is minimax optimal, since in this case K = p/m* = O(log(p)). When m* = o(p/ log(p)), the number K of clusters grows faster than log(p), and we possibly lose a factor K/log(p) relative to the optimal rate.

As discussed in the Introduction, this gap is possibly due to a computational barrier and we refer to [16] for a discussion in the related stochastic block model.

Bounded variables X also follow a sub-Gaussian distribution. Nevertheless, the corresponding sub-Gaussian norm L may be large and Theorem 5.3 can sometimes be improved, as in Theorem 5.4 below, proved in Section A.3 of the Supplementary Material [14].

Theorem 5.4.

There exist three positive constants c1, c2, c3 such that the following holds. Let Γ̂ be any estimator of Γ such that |Γ̂ − Γ|_V ≤ δ_{n,p} with probability 1 − c1/p. Then, under Assumption 1-bis, and when

Δ(C*) ≥ c2 [ M |Γ|_∞^{1/2} √(p log(p)/(n m*²)) + M² p log(p)/(n m*) + δ_{n,p}/m* ], (5.11)

then B̂ = B* and Ĝ = G*, with probability higher than 1 − c1/p.

When we choose Γ̂ as in (5.7), the term δ_{n,p}/m* can be simplified as under Assumption 1; see Proposition A.10 in Section A.5 of the Supplementary Material [14]. For balanced clusters, m* = p/K, Condition (5.11) simplifies to

Δ(C*) ≥ c2 [ M |Γ|_∞^{1/2} √(K log(p)/(n m*)) + M² K log(p)/n + δ_{n,p}/m* ].

In comparison to (5.10), the condition no longer depends on the sub-Gaussian norm L, but the term K ∨ log(p) has been replaced by K log(p).

Remark 2.

For the Ising block model (1.3) with K balanced groups, we have M = 1 and p = m*K, C* = (ω_in − ω_out)I_K + ω_out J and Γ = (1 − ω_in)I. As a consequence, no diagonal correction is needed, that is, we can take Γ̂ = 0, and since |Γ|_V = 0, we have δ_{n,p} = 0. Then, for K balanced groups, Condition (5.11) simplifies to

ω_in − ω_out ≳ K √(log(p)/(np)) + K log(p)/n.

In the specific case K = 2, we recover (up to numerical multiplicative constants) the optimal rate proved in [10]. Our procedure and analysis provide a generalization of these results, as they are valid for general K and Theorem 5.4 also allows for unbalanced clusters.

5.4. A comparison between PECOK and spectral clustering.

In this section, we discuss connections between the PECOK algorithm introduced above and spectral clustering, a method that has become popular in network clustering.

First we recall the premise of spectral clustering, adapted to our context. For G*-block covariance models as in (1.1), we have Σ − Γ = AC*Aᵗ. Let U be the p × K matrix collecting the K leading eigenvectors of Σ − Γ. It has been shown (see, e.g., Lemma 2.1 in Lei and Rinaldo [28]) that a and b belong to the same cluster if and only if U_{a:} = U_{b:}, and if and only if [UUᵗ]_{a:} = [UUᵗ]_{b:}. When used for variable clustering, uncorrected spectral clustering consists in applying a clustering algorithm, such as K-means, to the rows of the p × K matrix obtained by retaining the K leading eigenvectors of Σ̂.

SC Algorithm.

  1. Compute V^, the matrix of the K leading eigenvectors of Σ^

  2. Estimate G* by applying a (rotation invariant) clustering method to the rows of V^.

Arguing as in Peng and Wei [34], we have the following.

Lemma 5.5.

SC algorithm is equivalent to the following algorithm:

  • Step 1. Find B̄ = argmax{ ⟨Σ̂, B⟩ : tr(B) = K, I ⪰ B ⪰ 0 }.

  • Step 2. Estimate G* by applying a (rotation invariant) clustering method to the rows of B¯.

The connection between (unpenalized) PECOK and spectral clustering now becomes clear. The (unpenalized) PECOK estimator B˜ defined in (5.5) involves the calculation of

B̃ = argmax_B { ⟨Σ̂, B⟩ : B1 = 1, B_{ab} ≥ 0, tr(B) = K, B ⪰ 0 }. (5.12)

Since the matrices B involved in (5.12) are doubly stochastic, their eigenvalues are smaller than 1, and hence (5.12) is equivalent to B̃ = argmax_B { ⟨Σ̂, B⟩ : B1 = 1, B_{ab} ≥ 0, tr(B) = K, I ⪰ B ⪰ 0 }. Note then that B̄ can be viewed as a less constrained version of B̃, in which C is replaced by C̄ = { B : tr(B) = K, I ⪰ B ⪰ 0 }, where we have dropped the p(p + 1)/2 constraints given by B1 = 1 and B_{ab} ≥ 0. The proof of Lemma 5.5 shows that B̄ = V̂V̂ᵗ, so, contrary to B̂, the estimator B̄ is (almost surely) never equal to B*. Below, we adapt the arguments of [28] in order to provide some guarantees for a corrected version of spectral clustering.

In view of this connection between spectral clustering and unpenalized PECOK, and of the fact that the population justification of spectral clustering deals with the spectral decomposition of Σ − Γ, we propose the following corrected version of the algorithm, based on Σ̃ ≔ Σ̂ − Γ̂.

CSC Algorithm.

  1. Compute Û, the matrix of the K leading eigenvectors of Σ̃ ≔ Σ̂ − Γ̂

  2. Estimate G* by clustering the rows of U^, via an η-approximation of K-means (5.13).

For η > 1, an η-approximation of K-means is a clustering algorithm producing a partition G^ such that

crit(Ûᵗ, Ĝ) ≤ η min_G crit(Ûᵗ, G), (5.13)

with crit(·, ·) the K-means criterion (5.1). Although solving K-means is NP-Hard [5], there exist polynomial time approximate K-means algorithms; see Kumar et al. [26]. As a consequence of the above discussion, the first step of CSC can be interpreted as a relaxation of the program associated to the PECOK estimator B^.
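The CSC algorithm itself is only a few lines; the sketch below (our naming) uses numpy for the eigendecomposition and scikit-learn's K-means in place of the η-approximation in (5.13).

```python
import numpy as np
from sklearn.cluster import KMeans

def csc(Sigma_hat, Gamma_hat, K):
    """Corrected spectral clustering: K-means on the rows of the K leading eigenvectors
    of Sigma-tilde = Sigma-hat - Gamma-hat."""
    Sigma_tilde = Sigma_hat - Gamma_hat
    _, eigvecs = np.linalg.eigh(Sigma_tilde)     # eigenvalues in ascending order
    U_hat = eigvecs[:, -K:]                      # p x K matrix of leading eigenvectors
    return KMeans(n_clusters=K, n_init=10).fit_predict(U_hat)
```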

To simplify the presentation of the results for the CSC procedure, we assume in the following that all the groups have the same size: |G*_1| = ⋯ = |G*_K| = m* = p/K. We emphasize that this information is not required by either PECOK or CSC, or in the proof of Proposition 5.6 below. We only use it here for simplicity. We denote by S_K the set of permutations of {1, …, K} and we denote by

L̄(Ĝ, G*) = min_{σ ∈ S_K} ∑_{k=1}^K |G*_k \ Ĝ_{σ(k)}| / m*

the sum of the fractions of mis-assigned variables with indices in G*_k. In the previous sections, we studied perfect recovery of G*, which would correspond to L̄(Ĝ, G*) = 0. We give below conditions under which L̄(Ĝ, G*) ≤ ρ, for an appropriate quantity ρ < 1. We begin with a general theorem pertaining to partial partition recovery by CSC, under a “signal-to-noise ratio” condition involving the smallest eigenvalue λ_K(C*) of C*.

Proposition 5.6.

Let Re(Σ) = tr(Σ)/‖Σ‖op denote the effective rank of Σ. There exist cη,L > 0 only depending on η and L and a numerical constant c1 such that the following holds under Assumption 1. For any 0 < ρ < 1, if

λ_K(C*) ≥ c_{η,L} (√K ‖Σ‖_op)/(m* √ρ) √((Re(Σ) ∨ log(p))/n), (5.14)

then L̄(Ĝ, G*) ≤ ρ, with probability larger than 1 − c1/p.

The proof extends the arguments of [28], initially developed for clustering procedures in stochastic block models, to our context. Specifically, we relate the error L̄(Ĝ, G*) to the noise level, quantified in this problem by ‖Σ̃ − AC*Aᵗ‖_op. We then employ the results of [24] to show that this operator norm can be controlled, with high probability, which leads to the conclusion of the theorem.

As n goes to infinity, the right-hand side of Condition (5.14) goes to zero, and CSC is therefore consistent in a large sample asymptotic. In contrast, we emphasize that (uncorrected) SC algorithm is not consistent as can be shown by a population analysis similar to that of Proposition 5.2.

We observe that Δ(C*) ≥ 2λK (C*), so we can compare the lower bound (5.14) on λK (C*) to the lower-bound (5.10) on Δ(C*). To further facilitate the comparison between CSC and PECOK, we discuss both the conditions and the conclusion of this theorem in the simple setting where C* = τ IK and Г = Ip. Then the cluster separation measures coincide up to a factor 2, Δ(C*) = 2λK (C*) = 2τ.

Corollary 5.7 (Illustrative example: C* = τ IK and Г = Ip).

There exist two positive constants c_{η,L}, c′_{η,L}, depending only on η and L, and a numerical constant c3 such that the following holds under Assumption 1. For any 0 < ρ < 1, if

ρ ≥ c_{η,L} [ K²/n + K log(p)/n ]   and   τ ≥ c′_{η,L} [ K²/(ρn) ∨ K/√(ρ n m*) ], (5.15)

then L̄(Ĝ, G*) ≤ ρ, with probability larger than 1 − c3/p.

Recall that Theorem 5.3 above states that when Ĝ is obtained via the PECOK algorithm, and if τ ≳ √((K ∨ log p)/(m* n)) + (log(p) ∨ K)/n, then L̄(Ĝ, G*) = 0, or equivalently, Ĝ = G*, with high probability. We can therefore provide the following comparison (we refer to Section G of the Supplementary Material [14] for a numerical comparison):

  • When τ ≳ √((K ∨ log p)/(m* n)) + (log(p) ∨ K)/n, and under the additional condition that n ≳ (K ∨ log(p))²/K, the CSC algorithm satisfies L̄(Ĝ, G*) ≲ K²/(K ∨ log(p)). So, for K = o(log(p)) and for a large enough sample size n ≳ (log(p))²/K, the fraction of variables misclassified by CSC is vanishing as O(K/ log(p)) for τ ≳ √(log p/(m* n)) + log(p)/n. This guarantee is slightly weaker than for PECOK, which ensures exact recovery in this setting. This discrepancy may be an artifact of the proof technique. Very recent works [1, 31] (released during the reviewing process of this paper) present reconstruction error bounds tighter than those of [28], for (variants of) spectral clustering, when applied to two-parameter SBMs, for network data, not the type of data analyzed in this work.

  • When we move away from the case C* = τI_K, the guarantees for CSC can degenerate. Consider, for instance, Γ = I and C* = τI_K + αJ, with J being the matrix with all entries equal to one, as in the Ising block model discussed on page 126. Notice that in this case we continue to have Δ(C*) = 2λ_K(C*) = 2τ. Then, for a given, fixed, value of ρ and K fixed, condition (5.14) requires a cluster separation at least
    τ ≳ α √(log(p)/(nρ)),
    which is independent of m*, unlike the condition τ ≳ √((K ∨ log p)/(m* n)) + (log(p) ∨ K)/n for PECOK. This unpleasant feature is induced by the inflation of ‖Σ̃ − AC*Aᵗ‖_op with α. Again, this weakness in the guarantees may be an artifact of the proof, which relies on the Davis–Kahan inequality for controlling the alignment between the sample eigenvectors associated with the K largest eigenvalues and their population counterpart.

All the results of this section are proved in Section E of the Supplementary Material [14].

6. Approximate G-block covariance models.

In the previous sections, we have proved that under some separation conditions, COD and PECOK procedures are able to exactly recover the partition G*. However, in practical situations, the separation conditions may not be met. Besides, if the entries of Σ have been modified by an infinitesimal perturbation, then the corresponding partition G* would consist of p singletons.

As a consequence, it may be more realistic, and more appealing from a practical point of view, to look for a partition G[K] with K < |G*| groups such that Σ is close to a matrix of the form ACAᵗ + Γ, where Γ is diagonal and A is associated to G[K]. This is equivalent to considering a decomposition Σ = ACAᵗ + Γ with Γ nondiagonal, where the nondiagonal entries of Γ are small. In the sequel, we write R = Γ − Diag(Γ) for the matrix of the off-diagonal elements of Γ and D = Diag(Γ) for the diagonal matrix given by the diagonal of Γ.

In the next subsection, we discuss under which conditions the partition G[K] is identifiable and then, we prove that COD and PECOK are able to recover these partitions.

6.1. Identifiability of approximate G-block covariance models.

When Γ is allowed to be not exactly equal to a diagonal matrix, we encounter a further identifiability issue, as a generic matrix Σ may admit many decompositions Σ = ACAᵗ + Γ. In fact, such a decomposition holds for any membership matrix A and any matrix C if we define Γ = Σ − ACAᵗ. So we need to specify the kind of decomposition that we are looking for. For fixed K, we would like to consider the partition G with K clusters that maximizes the distance between groups (e.g., MCOD(Σ, G)) while having the smallest possible noise term |R|. Unfortunately, such a partition G does not necessarily exist and is not necessarily unique. Let us illustrate this situation with a simple example.

Example. Assume that Σ is given by

$$\Sigma=\begin{bmatrix}2r&0&0\\0&2r&0\\0&0&2r\end{bmatrix}+I_p,$$

with r > 0, with the convention that each entry corresponds to a block of size 2. Considering partitions with 2 groups and allowing Г to be nondiagonal, we can decompose Σ using different partitions. For instance,

$$\Sigma=\underbrace{\begin{bmatrix}2r&0&0\\0&r&r\\0&r&r\end{bmatrix}}_{=A_1C_1A_1^t}+\underbrace{\begin{bmatrix}0&0&0\\0&r&-r\\0&-r&r\end{bmatrix}+I_p}_{=\Gamma_1}=\underbrace{\begin{bmatrix}r&r&0\\r&r&0\\0&0&2r\end{bmatrix}}_{=A_2C_2A_2^t}+\underbrace{\begin{bmatrix}r&-r&0\\-r&r&0\\0&0&0\end{bmatrix}+I_p}_{=\Gamma_2}.$$

Importantly, the two decompositions correspond to two different partitions G1 and G2, and both have |R_i| = r and MCOD(Σ, G_i) = 2r = 2|R_i|, for i = 1, 2. In addition, no decomposition Σ = ACA^t + D + R with associated partition into 2 groups satisfies MCOD(Σ, G) > 2r or |R| < r. As a consequence, there is no satisfactory way to define a unique partition maximizing MCOD(Σ, G) while having |R| as small as possible. We show below that the cutoff MCOD(Σ, G) > 2|R| is actually sufficient for partition identifiability.
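The example can also be checked numerically. The short script below (with r set to an arbitrary positive value) builds Σ for blocks of size 2, forms the residual matrix R for each of the two decompositions above, and verifies that |R_i| = r in both cases; the helper name block_residual is ours.

```python
import numpy as np

r, p = 1.0, 6                                   # three blocks of size 2, so p = 6
Sigma = np.eye(p)
for k in range(3):
    Sigma[2 * k:2 * k + 2, 2 * k:2 * k + 2] += 2 * r   # each diagonal block equals 2r

def block_residual(groups, C):
    """Return R = Sigma - A C A^t - D for the partition encoded by `groups`."""
    A = np.zeros((p, len(groups)))
    for k, Gk in enumerate(groups):
        A[Gk, k] = 1.0
    Gamma = Sigma - A @ C @ A.T
    return Gamma - np.diag(np.diag(Gamma))      # remove the diagonal part D

R1 = block_residual([[0, 1], [2, 3, 4, 5]], np.diag([2 * r, r]))   # partition G1
R2 = block_residual([[0, 1, 2, 3], [4, 5]], np.diag([r, 2 * r]))   # partition G2
assert np.isclose(np.abs(R1).max(), r) and np.isclose(np.abs(R2).max(), r)
```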

For this, let us define P_j(Σ, K), j ∈ {1, 2}, as the set of quadruplets (A, C, D, R) such that Σ = ACA^t + D + R, with A a membership matrix associated to a partition G with K groups satisfying min_k |G_k| ≥ j, and with D and R defined as above. Hence P_1 corresponds to partitions without restrictions on the minimum group size; for instance, singletons are allowed. In contrast, P_2 only contains partitions without singletons. We define

$$\rho_1(\Sigma,K)=\max\bigl\{\mathrm{MCOD}(\Sigma,G)/|R| : (A,C,D,R)\in\mathcal P_1(\Sigma,K)\text{ and }G\text{ associated to }A\bigr\},$$
$$\rho_2(\Sigma,K)=\max\bigl\{\Delta(C)/|R| : (A,C,D,R)\in\mathcal P_2(\Sigma,K)\bigr\}.$$

We view ρ1 and ρ2 as respective measures of “purity” of the block structure of Σ.

Proposition 6.1.

  1. Assume that ρ1(Σ, K) > 2. Then there exists a unique partition G such that there exists a decomposition Σ = ACA^t + Γ, with A associated to G and MCOD(Σ, G) > 2|R|. We denote this partition by G1[K].

  2. Assume that ρ2(Σ, K) > 8. Then there exists a unique partition G with min_k |G_k| ≥ 2 such that there exists a decomposition Σ = ACA^t + Γ, with A associated to G and Δ(C) > 8|R|. We denote this partition by G2[K].

  3. In addition, if both ρ1(Σ, K) > 2 and ρ2(Σ, K) > 8, then G1[K]= G2[K].

The conditions ρ1(Σ, K) > 2 and ρ2(Σ, K) > 8 are minimal for uniquely defining the partitions G1[K] and G2[K], respectively. For ρ1, this has been illustrated in the example preceding the proposition. For ρ2, we provide a counterexample with ρ2(Σ, K) = 8 in Section B.3 of the Supplementary Material [14]. The proof of Proposition 6.1 is given in Section B.2 of [14].

The conclusion of Proposition 6.1 essentially reduces to that of Proposition 2.2 in Section 2 as soon as |R| is small enough relative to the cluster separation. Denoting by K* the number of groups of G*, we observe that G1[K*] = G* and that G2[K*] = G* if m* ≥ 2. Besides, ρ1(Σ, K) = ρ2(Σ, K) = 0 for K > K*. For K < K*, and when G1[K] (resp., G2[K]) is well defined, the partition G1[K] (resp., G2[K]) is coarser than G*. In other words, G1[K] is derived from G* by merging groups G_k*, thereby increasing MCOD(Σ, G) (resp., Δ(C)) while requiring |R| to be small enough.

We point out that, in general, there is no unique decomposition Σ = ACA^t + Γ with A associated to G2[K], even when min_k |G2[K]_k| ≥ 2. Indeed, it may be possible to change some entries of C and R while keeping C + R, Δ(C) and |R| unchanged.

6.2. The COD algorithm for approximate G-block covariance models.

We show below that the COD algorithm is still applicable when Σ has small departures from a block structure. We write λ_min(Σ) for the smallest eigenvalue of Σ.

Theorem 6.2.

Under the distributional Assumption 1, there exist numerical constants c1, c2 > 0 such that the following holds for all $\alpha\geq c_1L^2\sqrt{\frac{\log p}{n}}$. If, for some partition G and decomposition Σ = ACA^t + R + D, we have

$$|R|\;\le\;\frac{\lambda_{\min}(\Sigma)}{2}\wedge 2\alpha\qquad\text{and}\qquad\mathrm{MCOD}(\Sigma,G)>3\alpha|\Sigma|,\qquad(6.1)$$

then COD recovers G with probability higher than 1 − c2/p.

The proof is given in Section D.1 of the Supplementary Material [14]. If G satisfies the assumptions of Theorem 6.2, then it follows from Proposition 6.1 that G = G1[K] for some K > 0. First consider the situation where the tuning parameter α is chosen to be of the order $\sqrt{\log(p)/n}$. If MCOD(Σ, G*) ≥ 3α|Σ|, then COD selects G* with high probability. If MCOD(Σ, G*) is smaller than this threshold, then no procedure is able to recover G* with high probability (Theorem 3.1). Nevertheless, COD is able to recover a coarser partition G1[K] whose MCOD metric MCOD(Σ, G) is higher than the threshold 3α|Σ| and whose matrix R is small enough. For larger α, COD recovers a coarser partition G (corresponding to G1[K] with a smaller K), whose approximation error |R| is allowed to be larger.

6.3. The PECOK algorithm for approximate G-block covariance models.

In this subsection, we investigate the behavior of PECOK under approximate G-block models. The number K of groups being fixed, we assume that ρ2(Σ, K) > 8, so that G2[K] is well defined. We shall prove that PECOK recovers G2[K] with high probability. Abusing notation, we denote in this subsection by G* the target partition G2[K], by B* the associated partnership matrix, and by (A, C*, D, R) ∈ P_2(Σ, K) any decomposition of Σ maximizing Δ(C)/|R|.

Similar to Proposition 5.1, we first provide sufficient conditions on C* under which a population version of PECOK can recover the true partition.

Proposition 6.3.

If $\Delta(C^*)>7|D|_V+2\|R\|_{op}/\sqrt{m}+3|R|$, then $B^*=\operatorname*{arg\,min}_{B\in\mathcal C}\,-\langle\Sigma,B\rangle$.

Corollary 6.4.

If $\Delta(C^*)>3|R|+2\|R\|_{op}/\sqrt{m}$, then $B^*=\operatorname*{arg\,min}_{B\in\mathcal C}\,-\langle\Sigma-D,B\rangle$.

In contrast to the exact G-block model, the cluster separation Δ(C*) now needs to be larger than a multiple of |R| for the population version to recover the true partition; as discussed in Section 6.1, such a condition on |R| is in fact necessary. In comparison with the necessary conditions discussed in Section 6.1, there is an additional ‖R‖_op/√m term. The proofs are given in Section A.2 of the Supplementary Material [14].

We now examine the behavior of PECOK when we specify the estimator Γ^ to be as in (5.7). Note that in this approximate block covariance setting, the diagonal estimator Γ^ is in fact an estimator of the diagonal matrix D. In order to derive deviation bounds for our estimator Γ^, we need the following diagonal dominance assumption.

Assumption 2 (diagonal dominance of Г).

The matrix Г = D + R fulfills

$$\Gamma_{aa}\;\ge\;3\max_{c:\,c\neq a}|\Gamma_{ac}|\qquad\bigl(\text{or equivalently }\;D_{aa}\ge3\max_{c:\,c\neq a}|R_{ac}|\bigr).\qquad(6.2)$$
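For a candidate matrix Γ, condition (6.2) is straightforward to check numerically; the small helper below is an illustrative sketch (the function name is ours).

```python
import numpy as np

def satisfies_assumption_2(Gamma, factor=3.0):
    """Check Gamma_aa >= factor * max_{c != a} |Gamma_ac| for every row a."""
    off_diag = np.abs(Gamma - np.diag(np.diag(Gamma)))   # off-diagonal entries, i.e. |R_ac|
    return bool(np.all(np.diag(Gamma) >= factor * off_diag.max(axis=1)))
```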

The next theorem states that the PECOK estimator B^ recovers the groups under conditions similar to those of Theorem 5.3, provided R is small enough. The proof is given in Section A.3 of the Supplementary Material [14].

Theorem 6.5.

There exist four positive constants c1, c2, cL, cL′ such that the following holds. Under Assumptions 1 and 2, and when L^4 log(p) ≤ c1 n and

$$|R|+\sqrt{|R|\,|D|}+\frac{\|R\|_{op}}{\sqrt m}\;\le\;c_L\,\|\Gamma\|_{op}\Bigl\{\sqrt{\frac{\log p}{m\,n}}+\sqrt{\frac{p}{n\,m^2}}+\frac{\log(p)}{n}+\frac{p}{n\,m}\Bigr\}\qquad(6.3)$$

we have B^=B* and G^=G*, with probability higher than 1 − c2/p, as soon as

$$\Delta(C^*)\;\ge\;c'_L\,\|\Gamma\|_{op}\Bigl\{\sqrt{\frac{\log p}{m\,n}}+\sqrt{\frac{p}{n\,m^2}}+\frac{\log(p)}{n}+\frac{p}{n\,m}\Bigr\}.\qquad(6.4)$$

So, as long as |R| and ‖R‖op are small enough that (6.3) is satisfied, the PECOK algorithm correctly identifies the target partition G* at the Δ-(near) minimax-optimal level (6.4). A counterpart of Theorem 6.5 under Assumption 1-bis is provided in Section A.3 of the Supplementary Material [14].
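For completeness, the corrected convex relaxation at the heart of PECOK can be written compactly with a generic SDP modeling tool such as cvxpy. The sketch below is ours and only schematic: it assumes the feasible set of Section 5 takes the usual relaxed K-means form (positive semidefinite, entrywise nonnegative, rows summing to one, trace equal to K), it takes the correction Gamma_hat as given, and it omits the final rounding step that turns the solution into a partition.

```python
import cvxpy as cp

def pecok_relaxation(Sigma_hat, Gamma_hat, K):
    """Solve max <Sigma_hat - Gamma_hat, B> over a relaxed set of partnership matrices."""
    p = Sigma_hat.shape[0]
    B = cp.Variable((p, p), PSD=True)           # B symmetric positive semidefinite
    constraints = [B >= 0,                      # entrywise nonnegativity
                   cp.sum(B, axis=1) == 1,      # each row sums to one
                   cp.trace(B) == K]            # trace equals the number of groups
    objective = cp.Maximize(cp.trace((Sigma_hat - Gamma_hat) @ B))
    cp.Problem(objective, constraints).solve()
    return B.value
```

Maximizing the inner product with Sigma_hat - Gamma_hat is equivalent to minimizing its negative, which is the form used in the population statements of Proposition 6.3 and Corollary 6.4.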

7. Data analysis.

Using functional MRI data, [37] found that putative areas of the human brain are organized into clusters, sometimes referred to as networks or functional systems. We use a publicly available fMRI data set to illustrate the clusters recovered by different methods. The data set was originally published in [39] and is publicly available from OpenfMRI (https://openfmri.org/data-sets) under the accession number ds000007. We focus on analyzing two scan sessions from subject 1 under a visual-motor stop/go task (task 1). Before performing the analysis, we follow the preprocessing steps suggested by [39], and we follow [37] to subsample the whole-brain data using p = 264 putative areas; see Section A.3 of the Supplementary Material [14] for details. The subject was scanned in two separate sessions, and each session yielded n = 180 samples for each putative area.

We apply our data-splitting approach described in Section 4.3 to these two-session data. Using the first scan session only, we first estimate G^ using COD and COD-CC on a fine grid of $\alpha=c\sqrt{\log(p)/n}$, where c = 0.5, 0.6, …, 3. For a fair comparison, we set K in PECOK to be the same as the resulting values of K found by COD. We then use the second session data to evaluate the loss H(G) given in Section 4.3. Among our methods (COD, COD-CC and PECOK), COD yields the smallest loss when K = 142. We thus first focus on illustrating the COD clusters here. Table 2 lists the largest cluster of putative areas recovered by COD and their functional classification based on prior knowledge. Most of these areas are classified as related to visual, motor and task functioning, which is consistent with our experimental task, in which the subject performs motor responses based on visual stimuli. Figure 1(a) plots the locations of these coordinates on a standard brain template. It shows that our COD cluster appears to come mostly from approximately symmetric locations in the left and right hemispheres, though we do not enforce this functional symmetry in our algorithm. Note that the original coordinates in [37] are not sampled with exact symmetry from both hemispheres of the brain, and thus we do not expect exactly symmetric locations in the resulting clusters based on these coordinates.

Table 2.

MNI coordinates (X, Y, Z, in mm) of the largest COD group and their functional classification

X Y Z Function X Y Z Function
40 −72 14 visual −7 −21 65 motor
−28 −79 19 visual −7 −33 72 motor
20 −66 2 visual 13 −33 75 motor
29 −77 25 visual 10 −46 73 motor
37 −81 1 visual 36 −9 14 motor
47 10 33 task −53 −10 24 motor
−41 6 33 task −37 −29 −26 uncertain
38 43 15 task 52 −34 −27 uncertain
−41 −75 26 default −58 −26 −15 uncertain
8 48 −15 default −42 −60 −9 attention
22 39 39 default −11 26 25 saliency

Fig. 1.

(a) Plot of the coordinates of the largest COD cluster overlaid on a standard brain template. The coordinates are shown as red balls. (b) Comparison of COD, COD-CC, PECOK, K-means, HC and SC using the Frobenius prediction loss criterion (7.1), where the groups are estimated by each of these methods.

Because there is no gold standard for partitioning the brain, we follow common practice and use a prediction criterion to further compare the clustering performance of the different methods. For a fair comparison, we also estimate G^ using K-means, HC and spectral clustering with the same values of K found by COD. The prediction criterion is as follows. We first compute the covariance matrices S^1 and S^2 from the first and second session data, respectively. For a grouping estimate G^, we use the following loss to evaluate its performance:

$$\bigl\|\hat S_2-\Upsilon(\hat S_1,\hat G)\bigr\|_F,\qquad(7.1)$$

where the block-averaging operator Υ(R, G) produces a G-block structured matrix based on G^. For any a ∈ G_k and b ∈ G_{k′}, the output matrix entry [Υ(R, G)]_{ab} is given by

$$[\Upsilon(R,G)]_{ab}=\begin{cases}\dfrac{1}{|G_k|(|G_k|-1)}\displaystyle\sum_{i,j\in G_k,\,i\neq j}R_{ij}&\text{if }a\neq b\text{ and }k=k',\\[2ex]\dfrac{1}{|G_k|\,|G_{k'}|}\displaystyle\sum_{i\in G_k,\,j\in G_{k'}}R_{ij}&\text{if }a\neq b\text{ and }k\neq k',\\[1ex]1&\text{if }a=b.\end{cases}$$

In essence, this operator smooths the matrix entries over pairs of indices belonging to the same pair of groups. One may expect that such smoothing over variables within a true cluster will reduce the loss (7.1), while smoothing across different clusters will increase it.
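For reference, the operator Υ and the loss (7.1) translate directly into code; in the sketch below the groups are passed as lists of column indices, and the function names are ours.

```python
import numpy as np

def block_average(R, groups):
    """Block-averaging operator Upsilon(R, G) as defined in the display above."""
    p = R.shape[0]
    out = np.empty((p, p))
    for k, Gk in enumerate(groups):
        for l, Gl in enumerate(groups):
            block = R[np.ix_(Gk, Gl)]
            if k == l:
                m = len(Gk)
                # average of the off-diagonal entries of the within-group block
                avg = (block.sum() - np.trace(block)) / (m * (m - 1)) if m > 1 else 0.0
            else:
                # average of all entries of the between-group block
                avg = block.mean()
            out[np.ix_(Gk, Gl)] = avg
    np.fill_diagonal(out, 1.0)                  # diagonal entries are set to 1
    return out

def prediction_loss(S2, S1, groups):
    """Frobenius prediction loss (7.1): || S2 - Upsilon(S1, G) ||_F."""
    return np.linalg.norm(S2 - block_average(S1, groups), ord="fro")
```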

Figure 1(b) compares the prediction loss values of each method across group sizes. It shows that our data-splitting approach for COD selects a value K = 142 that is immediately next to the slightly larger value K = 206, the latter having the smallest prediction loss, near the bottom plateau of the curve. However, the differences are almost negligible. This suggests that our data-splitting criterion, which comes with theoretical guarantees, also provides good prediction performance in this real data example, while selecting a slightly smaller K, as desired, since this makes the resulting clusters easier to describe and interpret.

Figure 1(b) also shows that, across the choices of K and α, COD almost always yields the smallest prediction loss, while PECOK does slightly better when K is between 5 and 10. Though COD-CC has large losses for medium or small K, its performance is very close to that of the best performer, COD, near K = 146. K-means is the closest competing method in this example, while the other two methods (HC and SC) yield larger losses across the choices of K.

8. Discussion.

In this section, we discuss some related models and give an overall recommendation on the usage of our methods.

8.1. Comparison with stochastic block model.

The problem of variable clustering that we consider in this work is fundamentally different from that of clustering from network data. The latter, especially in the context of the stochastic block model (SBM), has received a large amount of attention over the past years; see, for instance, [2, 16, 21, 27–29, 33]. The most important difference stems from the nature of the data: the data analyzed via the SBM is a p × p binary matrix A, called the adjacency matrix, with entries assumed to have been generated as independent Bernoulli random variables, and its expected value is assumed to have a block structure. In contrast, the data matrix generated from a G-block covariance model is an n × p matrix with real entries, whose rows are viewed as i.i.d. copies of a p-dimensional vector X with mean zero and dependent entries. The covariance matrix Σ of X is assumed to have (up to the diagonal) a block structure.

Need for a correction.

Even though the analysis of the methods in our setting differs from that in the SBM setting, one could have applied clustering procedures tailored to SBMs to the empirical covariance matrix Σ^ = X^tX/n, treating it as a weighted adjacency matrix. It turns out that applying verbatim the spectral clustering procedure of Lei and Rinaldo [28], or SDPs such as the ones in [3], would lead to poor results. The main reason is that, in our setting, both the spectral algorithm and the SDP need to be corrected in order to recover the correct clusters (Section 5). Second, the SDPs studied in the SBM context (such as those of [3]) do not properly handle groups with different and unknown sizes, contrary to our SDP. To the best of our knowledge, our SDP (without correction) has only been independently studied by Mixon et al. [32], in the context of Gaussian mixtures.
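To make the preceding point concrete, the spectral step we have in mind can be sketched as follows. This is not the exact CSC algorithm of Section 5, whose specific correction Γ^ and normalization are not reproduced here, but a generic illustration: K-means is applied to the rows of the matrix of leading eigenvectors of the corrected matrix Σ^ − Γ^; running the same code on Σ^ alone corresponds to the uncorrected procedure discussed above.

```python
import numpy as np
from sklearn.cluster import KMeans

def corrected_spectral_clustering(Sigma_hat, Gamma_hat, K, seed=0):
    """K-means on the rows of the top-K eigenvector matrix of Sigma_hat - Gamma_hat."""
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat - Gamma_hat)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # p x K matrix of leading eigenvectors
    labels = KMeans(n_clusters=K, n_init=20, random_state=seed).fit(U).labels_
    return labels                                    # estimated group label for each variable
```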

Analysis of the SDP.

As for the mathematical arguments, our analysis of the SDP in our covariance-type model differs from that in mean-type models, partly because of the presence of nontrivial cross-product terms. Instead of relying on dual certificate arguments as in other work, such as [35], we directly investigate the primal problem and combine different duality-norm bounds. The crucial step is Lemma A.3 in the Supplementary Material [14], which allows us to control the Frobenius inner product by an (unusual) combination of ℓ1 and spectral controls. In our opinion, our approach is more transparent than dual certificate techniques, especially in the presence of a correction Γ^, and allows for the attainment of optimal convergence rates.

8.2. Extension to other models.

The general strategy of correcting a convex relaxation of K-means can be applied to other models. In [38], one of the authors adapted the PECOK algorithm to the problem of clustering mixtures of sub-Gaussian distributions. In particular, in the high-dimensional setting, where the correction plays a key role, [38] obtains sharper dependencies in the separation conditions than state-of-the-art clustering procedures [32]. Extensions to model-based overlapping clustering are beyond the scope of this paper; we refer to [11] for recent results.

8.3. Practical recommendations.

Based on our extensive simulation studies, we conclude this section with general recommendations on the usage of our proposed algorithms.

If p is moderate in size, and if there are reasons to believe that no singletons exist in a particular application, or if they have been removed in a preprocessing step, we recommend the PECOK algorithm, which is numerically superior to existing methods: exact recovery can be reached for relatively small sample sizes. COD is also very competitive, but requires a slightly larger sample size to reach the same performance as PECOK. The constraint on the size of p reflects the existing computational limits of state-of-the-art SDP solvers, not the statistical capabilities of the procedure, whose theoretical analysis is one of the foci of this work.

If p is large, we recommend COD-type algorithms. Since COD is optimization-free, it scales very well with p, and only requires a moderate sample size to reach exact cluster recovery. Moreover, COD adapts very well to data that contains singletons and, more generally, to data that is expected to have many inhomogeneous clusters.


Acknowledgments.

We thank the Editors and anonymous reviewers for their helpful suggestions. We thank Andrea Montanari for pointing to us the reference [32].

This work was supported in part by the CNRS PICS grant HighClust.

The first author was supported in part by NSF Grant DMS-1712709.

The second author was supported in part by the LabEx LMH, ANR-11-LABX-0056-LMH.

The third author was supported in part by NSF Grant DMS-1557467 and NIH Grants R01EB022911, P01AA019072, P20GM103645, P30AI042853 and S10OD016366.

The fourth author was supported by IDEX Paris-Saclay IDI Grant ANR-11-IDEX-0003-02.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Model assisted variable clustering: Minimax-optimal recovery and algorithms” (DOI: 10.1214/18-AOS1794SUPP; .pdf). This supplement contains proofs of the theoretical results, the simulation results and additional supporting information regarding the data analysis.

REFERENCES

  • [1] Abbe E, Fan J, Wang K and Zhong Y (2017). Entrywise eigenvector analysis of random matrices with low expected rank. ArXiv preprint arXiv:1709.09565.
  • [2] Abbe E and Sandon C (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science—FOCS 2015 670–688. IEEE Computer Soc., Los Alamitos, CA. MR3473334
  • [3] Amini AA and Levina E (2018). On semidefinite relaxations for the block model. Ann. Statist. 46 149–179. MR3766949 10.1214/17-AOS1545
  • [4] Arthur D and Vassilvitskii S (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035. ACM, New York. MR2485254
  • [5] Awasthi P, Charikar M, Krishnaswamy R and Sinop AK (2015). The hardness of approximations of Euclidean k-means. In 31st International Symposium on Computational Geometry. LIPIcs. Leibniz Int. Proc. Inform. 34 754–767. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern. MR3392820
  • [6] Banerjee O, El Ghaoui L and d’Aspremont A (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516. MR2417243
  • [7] Bellec P, Perlbarg V, Jbabdi S, Pélégrini-Issac M, Anton J-L, Doyon J and Benali H (2006). Identification of large-scale networks in the brain using fMRI. NeuroImage 29 1231–1243.
  • [8] Bernardes JS, Vieira FR, Costa LM and Zaverucha G (2015). Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinform. 16 1–14.
  • [9] Berthet Q and Rigollet P (2013). Complexity theoretic lower bounds for sparse principal component detection. In Proceedings of the 26th Annual Conference on Learning Theory (Shalev-Shwartz S and Steinwart I, eds.). Proceedings of Machine Learning Research 30 1046–1066. PMLR, Princeton, NJ.
  • [10] Berthet Q, Rigollet P and Srivastava P (2018). Exact recovery in the Ising blockmodel. Ann. Statist. To appear. arXiv:1612.03880.
  • [11] Bing M, Bunea F, Ning Y and Wegkamp M (2018). Adaptive estimation in structured factor models with applications to overlapping clustering. ArXiv e-prints.
  • [12] Bouveyron C and Brunet-Saumard C (2014). Model-based clustering of high-dimensional data: A review. Comput. Statist. Data Anal. 71 52–78. MR3131954 10.1016/j.csda.2012.12.008
  • [13] Bunea F, Giraud C and Luo X (2015). Minimax optimal variable clustering in G-models via cord. ArXiv preprint arXiv:1508.01939.
  • [14] Bunea F, Giraud C, Luo X, Royer M and Verzelen N (2020). Supplement to “Model assisted variable clustering: Minimax-optimal recovery and algorithms.” 10.1214/18-AOS1794SUPP.
  • [15] Bunea F, Giraud C, Royer M and Verzelen N (2016). PECOK: A convex optimization approach to variable clustering. ArXiv preprint arXiv:1606.05100.
  • [16] Chen Y and Xu J (2016). Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J. Mach. Learn. Res. 17 Paper No. 27, 57. MR3491121
  • [17] Chong M, Bhushan C, Joshi AA, Choi S, Haldar JP, Shattuck DW, Spreng RN and Leahy RM (2017). Individual parcellation of resting fMRI with a group functional connectivity prior. NeuroImage 156 87–100. 10.1016/j.neuroimage.2017.04.054
  • [18] Craddock RC, James GA, Holtzheimer PE, Hu XP and Mayberg HS (2012). A whole brain fMRI atlas generated via spatially constrained spectral clustering. Hum. Brain Mapp. 33 1914–1928.
  • [19] Frei N, Garcia AV, Bigeard J, Zaag R, Bueso E, Garmier M, Pateyron S, de Tauzia-Moreau ML, Brunaud V et al. (2014). Functional analysis of Arabidopsis immune-related MAPKs uncovers a role for MPK3 as negative regulator of inducible defences. Genome Biol. 15 1–22.
  • [20] Glasser MF, Coalson TS, Robinson EC, Hacker CD, Harwell J, Yacoub E, Ugurbil K, Andersson J, Beckmann CF et al. (2016). A multi-modal parcellation of human cerebral cortex. Nature 536 171–178.
  • [21] Guédon O and Vershynin R (2016). Community detection in sparse networks via Grothendieck’s inequality. Probab. Theory Related Fields 165 1025–1049. MR3520025 10.1007/s00440-015-0659-z
  • [22] James GA, Hazaroglu O and Bush KA (2016). A human brain atlas derived via n-cut parcellation of resting-state and task-based fMRI data. Magn. Reson. Imaging 34 209–218.
  • [23] Jiang D, Tang C and Zhang A (2004). Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. 16 1370–1386.
  • [24] Koltchinskii V and Lounici K (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110–133. MR3556768 10.3150/15-BEJ730
  • [25] Kong R, Li J, Sun N, Sabuncu M, Liu H, Schaefer A, Zuo X-N, Holmes A, Eickhoff S et al. (2018). Spatial topography of individual-specific cortical networks predicts human cognition, personality and emotion. https://www.biorxiv.org/content/early/2018/01/31/213041.
  • [26] Kumar A, Sabharwal Y and Sen S (2004). A simple linear time (1 + ϵ)-approximation algorithm for k-means clustering in any dimensions. In Foundations of Computer Science, 2004. Proceedings. 45th Annual IEEE Symposium on 454–462.
  • [27] Le CM, Levina E and Vershynin R (2016). Optimization via low-rank approximation for community detection in networks. Ann. Statist. 44 373–400. MR3449772 10.1214/15-AOS1360
  • [28] Lei J and Rinaldo A (2015). Consistency of spectral clustering in stochastic block models. Ann. Statist. 43 215–237. MR3285605 10.1214/14-AOS1274
  • [29] Lei J and Zhu L (2014). A generic sample splitting approach for refined community recovery in stochastic block models. ArXiv preprint arXiv:1411.1469.
  • [30] Lloyd SP (1982). Least squares quantization in PCM. IEEE Trans. Inform. Theory 28 129–137. MR0651807 10.1109/TIT.1982.1056489
  • [31] Lu Y and Zhou HH (2016). Statistical and computational guarantees of Lloyd’s algorithm and its variants. ArXiv preprint arXiv:1612.02099.
  • [32] Mixon DG, Villar S and Ward R (2017). Clustering subgaussian mixtures by semidefinite programming. Inf. Inference 6 389–415. MR3764529 10.1093/imaiai/iax001
  • [33] Mossel E, Neeman J and Sly A (2014). Consistency thresholds for binary symmetric block models. ArXiv preprint arXiv:1407.1591.
  • [34] Peng J and Wei Y (2007). Approximating K-means-type clustering via semidefinite programming. SIAM J. Optim. 18 186–205. MR2299680 10.1137/050641983
  • [35] Perry A and Wein AS (2015). A semidefinite program for unbalanced multisection in the stochastic block model. ArXiv e-prints arXiv:1507.05605.
  • [36] Poldrack RA (2007). Region of interest analysis for fMRI. Soc. Cogn. Affect. Neurosci. 2 67–70. 10.1093/scan/nsm006
  • [37] Power JD, Cohen AL, Nelson SM, Wig GS, Barnes KA, Church JA, Vogel AC, Laumann TO, Miezin FM et al. (2011). Functional network organization of the human brain. Neuron 72 665–678.
  • [38] Royer M (2017). Adaptive clustering through semidefinite programming. In Advances in Neural Information Processing Systems (NIPS).
  • [39] Xue G, Aron AR and Poldrack RA (2008). Common neural substrates for inhibition of spoken and manual responses. Cereb. Cortex 18 1923–1932.
  • [40] Yeo BT, Krienen FM, Sepulcre J, Sabuncu MR, Lashkari D, Hollinshead M, Roffman JL, Smoller JW, Zöllei L et al. (2011). The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J. Neurophysiol. 106 1125–1165.
  • [41] Zaag R, Tamby J, Guichard C, Tariq Z, Rigaill G, Delannoy E, Renou J, Balzergue S, Mary-Huard T et al. (2015). GEM2Net: From gene expression modeling to -omics networks, a new CATdb module to investigate Arabidopsis thaliana genes involved in stress response. Nucleic Acids Res. 43 1010–1017.
