Author manuscript; available in PMC: 2022 Jul 15.
Published in final edited form as: Ann Stat. 2020 Feb 17;48(1):111–137. doi: 10.1214/18-aos1794

MODEL ASSISTED VARIABLE CLUSTERING: MINIMAX-OPTIMAL RECOVERY AND ALGORITHMS

Florentina Bunea 1,*, Christophe Giraud 2,*, Xi Luo 3,*, Martin Royer 2,*, Nicolas Verzelen 4,*
PMCID: PMC9286061  NIHMSID: NIHMS1765231  PMID: 35847529

Abstract

The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X1, … , Xp) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.

MSC2010 subject classifications: Primary 62H30, secondary 62C20

Keywords: Convergence rates, convex optimization, covariance matrices, high-dimensional inference

1. Introduction.

The problem of variable clustering is that of grouping similar components of a p-dimensional vector X = (X1,…, Xp). These groups are referred to as clusters. In this work, we investigate the problem of cluster recovery from a sample of n independent copies of X. Variable clustering has had a long history in a variety of fields, with important examples stemming from gene expression data [19, 23, 41] or protein profile data [8]. The solutions to this problem are typically algorithmic and entirely data based. They include applications of K-means, hierarchical clustering, spectral clustering or versions of them. The statistical properties of these procedures have received a very limited amount of investigation. It is not currently known what probabilistic cluster models on X can be estimated by these popular techniques, or by their modifications. More generally, model-based variable clustering has received a limited amount of attention. One net advantage of model-based clustering is that population-level clusters are clearly defined, offering both interpretability of the clusters and a benchmark against which one can check the quality of a particular clustering algorithm.

In this work, we propose the G-block covariance model as a flexible model for variable clustering and show that the clusters given by this model are uniquely defined. We then motivate and develop two algorithms tailored to the model, COD and PECOK, and analyze their respective performance in terms of exact cluster recovery, for minimally separated clusters, under appropriately defined cluster separation metrics.

1.1. The G-block covariance model.

Our proposed model for variable clustering assumes that the covariance matrix Σ of a centered random vector X ∈ ℝ^p follows a block, or near-block, decomposition, with blocks corresponding to a partition G = {G1, …, GK} of {1, …, p}. This structure of the covariance matrix has been observed to hold, empirically, in a number of very recent studies on the parcellation of the human brain, for instance, [18, 20, 25, 40]. We further support these findings in Section 7, where we apply the clustering methods developed in this paper, tailored to G-block covariance models, to the clustering of brain regions.

To describe our model, we associate, to a partition G, a membership matrix A ∈ ℝ^{p×K} defined by A_{ak} = 1 if a ∈ G_k, and A_{ak} = 0 otherwise.

  1. The exact G-block covariance model. In view of the above discussion, clustering the variables (X1, …, Xp) amounts to finding a minimal (i.e., coarsest) partition G*, such that two variables belong to the same cluster if they have the same covariance with all other variables. This implies that the covariance matrix Σ of X decomposes as
    Σ = A C* Aᵗ + Γ, (1.1)
    where A is relative to G*, C* is a symmetric K × K matrix and Γ a diagonal matrix. When such a decomposition exists with the partition G*, we say that X ∈ ℝ^p follows an (exact) G*-block covariance model.
    1. G-Latent model. Such a structure arises, for instance, when components of X that belong to the same group can be decomposed as the sum of a common latent variable and an uncorrelated random fluctuation. Similarity within a group is therefore given by association with the same unobservable source. Specifically, the exact block-covariance model (1.1) holds, with a diagonal matrix Γ, when
      X_a = Z_{k(a)} + E_a, (1.2)
      with Cov(Z_{k(a)}, E_a) = 0, Cov(Z) = C*, and the individual fluctuations E_a uncorrelated, so that E has diagonal covariance matrix Γ. The index assignment function k : {1, …, p} → {1, …, K} is defined by G_k = {a : k(a) = k}. In practice, this model is used to justify the construction of a single variable that represents a cluster, the average of X_a, a ∈ G_k, viewed as an observable proxy of Z_{k(a)}. For example, a popular analysis approach for fMRI data, called region-of-interest (ROI) analysis [36], requires averaging the observations from multiple voxels (an imaging unit for a small cubic volume of the brain) within each ROI (or cluster of voxels) to produce new variables, each representing a larger and interpretable brain area. These new variables are then used for downstream analyses. From this perspective, model (1.2) can be used in practice (see, e.g., [7]) as a building block in a data analysis based on cluster representatives, which in turn requires accurate cluster estimation. Indeed, data-driven methods for clustering either voxels into regions or regions into functional systems, especially based on the covariance matrix of X, are becoming increasingly important; see, for example, [18, 20, 37, 40]. Accurate data-driven clustering methods also enable studying the cluster differences across subjects [17] or experimental conditions [22].
    2. The Ising block model. The Ising block model has been proposed in [10] for modeling social interactions, for instance, political affinities. Under this model, the joint distribution of X ∈ {−1, 1}p, a p-dimensional vector with binary entries, is given by
      f(x) = κ_{α,β}^{-1} exp[ β/(2p) ∑_{a∼b} x_a x_b + α/(2p) ∑_{a≁b} x_a x_b ], (1.3)
      where the quantity κ_{α,β} is a normalizing constant, and the notation a ∼ b means that the elements are in the same group of the partition. The variables X_a may for instance represent the votes of U.S. senators on a bill [6]. For parameters α > β, the density (1.3) models the fact that senators belonging to the same political group tend to share the same vote. By symmetry of the density f, the covariance matrix Σ of X decomposes as an exact block covariance model Σ = AC*Aᵗ + Γ where Γ is diagonal. When all groups G*_k have identical size, we have C* = (ω_in − ω_out)I_K + ω_out J and Γ = (1 − ω_in)I, where the K × K matrix J has all entries equal to 1, I_K denotes the K × K identity matrix, and the quantities ω_in, ω_out depend on α, β, p.
  2. The approximate G-block model. In many situations, it is more appealing to group variables that nearly share the same covariance with all the other variables. In that situation, the covariance matrix Σ would decompose as
    Σ = ACAᵗ + Γ, where Γ has small off-diagonal entries. (1.4)
    Such a situation can arise, for instance, when X_a = (1 + δ_a)Z_{k(a)} + E_a, with δ_a = o(1) and the individual fluctuations E_a uncorrelated, 1 ≤ a ≤ p; a simulation sketch of both the exact and the approximate constructions is given below.
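To make the latent-variable construction (1.2) and its approximate version (1.4) concrete, here is a minimal simulation sketch in Python. The cluster sizes and parameter values are hypothetical choices of ours, for illustration only; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: K = 3 latent factors, clusters of sizes 4, 3, 3 (p = 10).
sizes = [4, 3, 3]
K, p, n = len(sizes), sum(sizes), 2000
k_of = np.repeat(np.arange(K), sizes)        # index assignment a -> k(a)
A = np.eye(K)[k_of]                          # p x K membership matrix

C_star = np.array([[1.0, 0.3, 0.2],          # Cov(Z); positive semidefinite
                   [0.3, 1.0, 0.1],
                   [0.2, 0.1, 1.0]])
gamma = rng.uniform(0.5, 1.5, size=p)        # diagonal of Gamma

Z = rng.multivariate_normal(np.zeros(K), C_star, size=n)   # latent factors
E = rng.normal(0.0, np.sqrt(gamma), size=(n, p))           # uncorrelated fluctuations
X = Z[:, k_of] + E                           # exact G-latent model (1.2)

Sigma = A @ C_star @ A.T + np.diag(gamma)    # exact G-block covariance (1.1)
Sigma_hat = X.T @ X / n                      # sample covariance of the centered data
print(np.abs(Sigma_hat - Sigma).max())       # small for large n

# Approximate G-block model: X_a = (1 + delta_a) Z_{k(a)} + E_a with small delta_a,
# so that the residual matrix has small off-diagonal entries, as in (1.4).
delta = rng.uniform(-0.05, 0.05, size=p)
X_approx = (1.0 + delta) * Z[:, k_of] + E
```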

1.2. Our contribution.

We assume that the data consist of i.i.d. observations X^{(1)}, …, X^{(n)} of a random vector X with mean 0 and covariance matrix Σ. This work is devoted to the development of computationally feasible methods that yield estimates Ĝ of G*, such that Ĝ = G*, with high probability, when the clusters are minimally separated, and to characterizing the minimal value of the cluster separation from a minimax perspective. The separation between clusters is a key element in quantifying the difficulty of a clustering task as, intuitively, well-separated clusters should be easier to identify. We consider two related, but different, separation metrics, that can be viewed as canonical whenever Σ satisfies (1.4). Although all of our results allow for, and are proved under, small departures from the diagonal structure of Γ in (1.1), our main contribution can be best seen when Γ is a diagonal matrix. We focus on this case below, for clarity of exposition. The case of Γ being a perturbation of a diagonal matrix is treated in Section 6.

When Γ is diagonal, our target partition G* can be easily defined. It is the unique minimal (with respect to partition refinement) partition G* for which there is a decomposition Σ = AC*Aᵗ + Γ, with A associated to G*. We refer to Section 2 for details. We observe, in particular, that max_{c≠a,b} |Σ_{ac} − Σ_{bc}| > 0 if and only if X_a and X_b belong to different clusters in G*.

This last remark motivates our first metric MCOD based on the following COvariance Difference (COD) measure:

COD(a, b) ≔ max_{c≠a,b} |Σ_{ac} − Σ_{bc}|   for any a, b = 1, …, p. (1.5)

We use the notation a ∼_{G*} b whenever a and b belong to the same group G*_k, for some k, in the partition G*, and similarly a ≁_{G*} b means that there does not exist any group G*_k of the partition G* that contains both a and b. We define the MCOD metric as

MCOD(Σ) ≔ min_{a ≁_{G*} b} COD(a, b). (1.6)

The measure COD(a, b) quantifies the similarity of the covariances that Xa and Xb have, respectively, with all other variables. From this perspective, the size of MCOD(Σ) is a natural measure for the difficulty of clustering when analyzing clusters with components that are similar in this sense. Moreover, note that this metric is well defined even if C* of model (1.1) is not semipositive definite.
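For reference, the population quantities (1.5) and (1.6) can be computed directly from Σ and a candidate partition; the short numpy sketch below (function names are ours, for illustration only) does exactly that. Applied to the population Σ of the simulation sketch above with the planted partition, mcod returns a strictly positive value, while any pair within a planted group has cod equal to zero.

```python
import numpy as np

def cod(Sigma, a, b):
    """COD(a, b) = max over c != a, b of |Sigma_ac - Sigma_bc|, as in (1.5)."""
    mask = np.ones(Sigma.shape[0], dtype=bool)
    mask[[a, b]] = False
    return np.max(np.abs(Sigma[a, mask] - Sigma[b, mask]))

def mcod(Sigma, groups):
    """MCOD(Sigma) = min of COD(a, b) over pairs (a, b) in different groups, as in (1.6)."""
    p = Sigma.shape[0]
    label = np.empty(p, dtype=int)
    for k, g in enumerate(groups):
        label[list(g)] = k
    return min(cod(Sigma, a, b)
               for a in range(p) for b in range(a + 1, p)
               if label[a] != label[b])
```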

Another cluster separation metric appears naturally when we view model (1.1) as arising via model (1.2), or via small deviations from it. Then clusters in (1.1) are driven by the latent factors, and intuitively they differ when the latent factors differ. Specifically, we define the “within-between group” covariance gap

Δ(C*) ≔ min_{j<k} ( C*_{kk} + C*_{jj} − 2C*_{jk} ) = min_{j<k} E[(Z_j − Z_k)²], (1.7)

where the second equality holds whenever (1.2) holds. In the latter case, the matrix C*, which is the covariance matrix of the latent factors, is necessarily semipositive definite. Further, we observe that Δ(C*) = 0 implies Zj = Zk a.s. Conversely, we prove in Corollary 2.3 of Section 2 that if the decomposition (1.1) holds with Δ(C*) > 0, then the partition related to A is the partition G* described above. An instance of Δ(C*) > 0 corresponds to having the within group covariances stronger than those between groups. This suggests the usage of this metric Δ(C*) for cluster analysis whenever, in addition to the general model formulation (1.1), we also expect clusters to have this property, which has been observed, empirically, to hold in applications. For instance, it is implicit in the methods developed by [18] for creating a human brain atlas by partitioning appropriate covariance matrices. We also present a neuroscience-based data example in Section 7.

Formally, the two metrics are connected via the following chain of inequalities, proved in Lemma B.1 of Section 2 of the Supplementary Material [14], and valid as soon as the size of the smallest cluster is larger than one, Γ is diagonal and C* is semipositive definite (for the last inequality):

2λ_K(C*) ≤ Δ(C*) ≤ 2 MCOD(Σ) ≤ 2 √(Δ(C*)) max_{k=1,…,K} √(C*_{kk}). (1.8)

The first inequality shows that conditions on Δ(C*) are weaker than conditions on the minimal eigenvalue λK (C*) of C*. In order to preserve the generality of our model, we do not necessarily assume that λK (C*) > 0, as we show that, for model identifiability, it is enough to have the weaker condition Δ(C*) > 0, when the two quantities differ.

The second inequality in (1.8) shows that Δ(C*) and MCOD(Σ) can have the same order of magnitude, whereas the third inequality shows that they can also differ in order, and Δ(C*) can be as small as MCOD²(Σ) for small values of these metrics, which is our main focus. This suggests that different statistical assessments, and possibly different algorithms, should be developed for estimators of clusters defined by (1.1), depending on the cluster separation metric. To substantiate this intuition, we first derive, for each metric, the rate below which no algorithm can recover exactly the clusters defined by (1.1). We call this the minimax optimal threshold for cluster separation, and prove that it is different for the two metrics. We call an algorithm that can be proved to recover exactly clusters with separation above the minimax threshold a minimax optimal algorithm.

Theorem 3.1 in Section 3 shows that, for K ≥ 3 and for some numerical constant c > 0, no algorithm can estimate consistently clusters defined by (1.1) uniformly over covariance matrices fulfilling

MCOD(Σ) ≤ c √(log(p)/n).

Theorem 3.2 in Section 3 shows that optimal separation distances with respect to the metric Δ(C*) are sensitive to the size of the smallest cluster,

m* = min_{1≤k≤K} |G*_k|.

Indeed, there exists a numerical constant c > 0, such that no algorithm can estimate consistently clusters defined by (1.1) uniformly over covariance matrices fulfilling

Δ(C*) ≤ c ( √(log(p)/(n m*)) ∨ log(p)/n ). (1.9)

The first term will be dominant whenever the smallest cluster has size m* < n/ log(p), which will be the case in most situations. The second term in (1.9) becomes dominant whenever m* > n/ log(p), which can also happen when p scales as n, and we have a few balanced clusters.

The PECOK algorithm is tailored to the Δ(C*) metric, and is shown in Theorem 5.3 to be near-minimax optimal. For instance, for balanced clusters, there exists a constant c′ such that exact recovery is guaranteed when Δ(C*) ≥ c′ ( √((K ∨ log p)/(m* n)) + (K ∨ log(p))/n ). This differs by factors in K from the Δ(C*)-minimax threshold, for general K, whereas it is of optimal order when K is a constant, or grows as slowly as log p. A similar discrepancy between minimax lower bounds and the performance of polynomial-time estimators has also been pinpointed in network clustering via the stochastic block model [16] and in sparse PCA [9]. It has been conjectured that, when K increases with n, there exists a gap between the statistical boundary, that is, the minimal cluster separation for which a statistical method achieves perfect clustering with high probability, and the polynomial boundary, that is, the minimal cluster separation for which there exists a polynomial-time algorithm that achieves perfect clustering. Further investigation of this computational trade-off is beyond the scope of this paper and we refer to [16] and [9] for more details.

However, if we consider directly the metric MCOD(Σ), and its corresponding, larger, minimax threshold, we derive the COD algorithm, which is minimax optimal with respect to MCOD(Σ) when K ≥ 3. In view of (1.8), it is also minimax optimal with respect to Δ(C*), whenever there exist small clusters, the size of which does not change with n. The description of the two algorithms and theoretical properties are given in Sections 4 and 5, respectively, for exact block covariance models. Companions of these results, regarding the performance of the algorithms for approximate block covariance models are given in Section 6, in Theorem 6.2 and Theorem 6.5, respectively.

Table 1 gives a snapshot of our results, which for ease of presentation, correspond to the case of balanced clusters, with the same number of variables per cluster. We stress that neither our algorithms, nor our theory, is restricted to this case, but the exposition becomes more transparent in this situation.

Table 1.

Algorithm performance relative to minimax thresholds of each metric

Metric | Minimax threshold | PECOK | COD
d1 ≔ Δ(C*) | √(log p/(mn)) ∨ log(p)/n | Minimax optimal w.r.t. d1 when K = O(log(p)) | Minimax optimal w.r.t. d1 when m is constant
d2 ≔ MCOD(Σ) | √(log p/n) when K ≥ 3 | Minimax optimal w.r.t. d2 when m > n/log(p) and K = O(log p) | Minimax optimal w.r.t. d2 when K ≥ 3

In this table, m denotes the size of the smallest cluster in the partition. The performance of COD under d1 follows from the second inequality in (1.8), whereas the performance of PECOK under d2 follows from the last inequality in (1.8). The overall message transmitted by Table 1 and our analysis is that, irrespective of the separation metric, the COD algorithm will be most powerful whenever we expect to have at least one, possibly more, small clusters, a situation that is typically not handled well in practice by most of the popular clustering algorithms; see [12] for an in-depth review. The PECOK algorithm is expected to work best for larger clusters, in particular when there are no clusters of size one. We defer more comments on the relative numerical performance of the methods to the discussion Section 8.3.

We emphasize that both our algorithms are generally applicable, and our performance analysis is only in terms of the most difficult scenarios, when two different clusters are almost indistinguishable and yet, as our results show, consistently estimable. Our extensive simulation results confirm these theoretical findings.

We summarize below our key contributions.

  1. An identifiable model for variable clustering and metrics for cluster separation. We advocate model-based variable clustering, as a way of proposing objectively defined and interpretable clusters. We propose identifiable G-block covariance models for clustering, and prove cluster identifiability in Proposition 2.2 of Section 2.

  2. Minimax lower bounds on cluster separation metrics for exact partition recovery. Two of our main results are Theorem 3.2 and Theorem 3.1, presented in Section 3, in which we establish, respectively, minimax limits on the size of the Δ(C*)-cluster separation and MCOD(Σ)-cluster separation below which no algorithm can recover clusters defined by (1.1) consistently, from a sample of size n on X. To the best of our knowledge, these are the first results of this type in variable clustering.

  3. Variable clustering procedures with guaranteed exact recovery of minimally separated clusters. The results of (1) and (2) provide a much needed framework for motivating variable clustering algorithm development and for clustering algorithm assessments.

    In particular, they motivate a correction of a convex relaxation of the K-means algorithm, leading to our proposed PECOK procedure, based on semidefinite programming (SDP). Theorem 5.3 shows it to be near-minimax optimal with respect to the Δ(C*) metric. The PECOK–Δ(C*) pairing is natural, as Δ(C*) measures the difference of the “within cluster” signal relative to the “between clusters” signal, which is the idea that underlies K-means type procedures. To the best of our knowledge, this is the first work that explicitly shows what model-based clusters of variables can be estimated via K-means style methods, and assesses theoretically the quality of estimation. Moreover, our work shows that the results obtained in [10], for the block Ising model, can be generalized to arbitrary values of K and unbalanced clusters.

    The COD procedure is a companion of PECOK for clusters given by model (1.1), and is minimax optimal with respect to the MCOD(Σ) cluster separation when K ≥ 3, as established in Theorem 3.1. Another advantage of COD is computational, as SDP-based methods, although convex, can be computationally involved.

  4. Comparison with corrected spectral variable clustering methods. In Section 5.4, we connect PECOK with another popular algorithm, spectral clustering. Spectral clustering is less computationally involved than PECOK, but the theoretical guarantees that we can offer for it are weaker.

1.3. Organization of the paper.

The rest of the paper is organized as follows:

Sections 1.4 and 1.5 contain the notation and distributional assumptions used throughout the paper.

For clarity of exposition, Sections 2–5 contain results established for model (1.1), when Γ is a diagonal matrix. Extensions to the case when Γ has small off-diagonal entries are presented in Section 6.

Section 2 shows that we have a uniquely defined target of estimation, the partition G*.

Section 3 derives the minimax thresholds on the separation metrics Δ(C*) and MCOD(Σ), respectively, for estimating G* consistently.

Section 4 is devoted to the COD algorithm, and its analysis.

Section 5 is devoted to the PECOK algorithm and its analysis.

Section 5.4 analyzes spectral clustering for variable clustering, and compares it with PECOK.

Section 6 contains extensions to approximate G-block covariance models.

Section 7 presents the application of our methods to the clustering of putative brain areas using real fMRI data.

Section 8 contains a discussion of our results and overall recommendations regarding the usage of our methods. Given the space constraints, all proofs and simulation results are included in the Supplementary Material.

The implementation of PECOK can be found at http://github.com/martinroyer/pecok/ and that of COD at http://CRAN.R-project.org/package=cord.

1.4. Notation.

We denote by X the n × p matrix with rows corresponding to the observations X^{(i)} ∈ ℝ^p, for i = 1, …, n. The sample covariance matrix Σ̂ is defined by

Σ̂ = (1/n) XᵗX = (1/n) ∑_{i=1}^n X^{(i)} (X^{(i)})ᵗ.

Given a vector υ and q ≥ 1, |υ|_q stands for the ℓ_q norm. For a generic matrix M, |M|_q denotes its entrywise ℓ_q norm, ‖M‖_op denotes its operator norm and ‖M‖_F refers to its Frobenius norm. We use M_{:a} and M_{b:} to denote the ath column and, respectively, the bth row of a generic matrix M. The bracket 〈·, ·〉 refers to the Frobenius scalar product. Given a matrix M, we denote by supp(M) its support, that is, the set of indices (i, j) such that M_{ij} ≠ 0. I denotes the identity matrix. We define the variation seminorm of a diagonal matrix D as |D|_V ≔ max_a D_{aa} − min_a D_{aa}. We use B ⪰ 0 to denote a symmetric and positive semidefinite matrix.

Throughout this paper we will make use of the notation c1, c2, … to denote positive constants independent of n, p, K, m. The same letter, for instance c1, may be used in different statements and may denote different constants, which are made clear within each statement, when there is no possibility for confusion.

We use [p] to denote the set {1, …, p}. We use the notation a ∼_G b whenever a, b ∈ G_k, for the same k. Also, m = min_k |G_k| stands for the size of the smallest group of the partition G.

The notation ≳ and ≲ is used whenever the inequalities hold up to multiplicative numerical constants.

1.5. Distributional assumptions.

For a p-dimensional random vector Y, its Orlicz norm is defined by ‖Y‖_{ψ2} = sup_{t ∈ ℝ^p : ‖t‖_2 = 1} inf{ s > 0 : E[ exp( 〈Y, t〉² / s² ) ] ≤ 2 }. Throughout the paper, we will assume that X follows a sub-Gaussian distribution. Specifically, we use the following.

Assumption 1 (Sub-Gaussian distributions).

There exists L > 0 such that the random vector Σ^{−1/2}X satisfies ‖Σ^{−1/2}X‖_{ψ2} ≤ L.

Our class of distributions includes, in particular, that of bounded distributions, which may be of independent interest, as example (ii) illustrates. We will therefore also specialize some of our results to this case, in which case we will use directly:

Assumption 1-bis (Bounded distributions).

There exists M > 0 such that maxi=1,…, p |Xi|≤ M almost surely.

Gaussian distributions satisfy Assumption 1 with L = 1. A bounded distribution is also sub-Gaussian, but the corresponding quantity L can be much larger than M, and sharper results can be obtained if Assumption 1-bis holds.

2. Cluster identifiability in G-block covariance models.

To keep the presentation focused, we consider in Sections 2–5 the model (1.1) with Γ diagonal. We treat the case corresponding to a diagonally dominant Γ in Section 6 below. In the sequel, it is assumed that p > 2.

We observe that if the decomposition (1.1) holds for a partition G, it also holds for any subpartition of G. It is natural therefore to seek the smallest (coarsest) of such partitions, that is the partition with the least number of groups for which (1.1) holds. Since the partition ordering is a partial order, the smallest partition is not necessarily unique. However, the following lemma shows that uniqueness is guaranteed for our model class.

Lemma 2.1.

Consider any covariance matrix Σ:

  1. There exists a unique minimal partition G* such that Σ = AC At + Г for some diagonal matrix Г, some membership matrix A associated to G* and some matrix C.

  2. The partition G* is given by the equivalence classes of the relation
    a ≡ b   if and only if   COD(a, b) ≔ max_{c≠a,b} |Σ_{ac} − Σ_{bc}| = 0. (2.1)

Proof.

If the decomposition Σ = ACAᵗ + Γ holds with A related to a partition G, then we have COD(a, b) = 0 for any a, b belonging to the same group of G. Hence, each group G_k of G is included in one of the equivalence classes of ≡. As a consequence, G is a finer partition than the partition G* defined in part 2 of the lemma. Hence, G* is the (unique) minimal partition such that the decomposition Σ = ACAᵗ + Γ holds. □

As a consequence, the partition G* is well defined and is identifiable. Next we discuss the definitions of the MCOD and Δ metrics. For any partition G, we let MCOD(Σ, G) ≔ min_{a ≁_G b} COD(a, b), where we recall that the notation a ≁_G b means that a and b are not in the same group of the partition G. By definition of G*, we notice that MCOD(Σ, G*) > 0 and the next proposition shows that G* is characterized by this property.

Proposition 2.2.

Let G be any partition such that MCOD(Σ, G) > 0 and the decomposition Σ = AC At + Г holds with A associated to G. Then G = G*.

The proofs of this proposition and the following corollary are given in Section B of the Supplementary Material [14]. In what follows, we use the notation MCOD(Σ) for MCOD(Σ, G*).

In general, without further restrictions on the model parameters, the decomposition Σ = ACAᵗ + Γ with A relative to G* is not unique. If, for instance, Σ is the identity matrix I, then G* is the complete partition (with p groups) and the decomposition (1.1) holds for any (C, Γ) = (λI, (1 − λ)I) with λ ∈ ℝ.

Recall that m* ≔ min_k |G*_k| stands for the size of the smallest cluster. If we assume that m* > 1 (no singleton), then Γ is uniquely defined. Besides, the matrix C in (1.1) is only defined up to a permutation of its rows and columns. In the sequel, we denote by C* any of these matrices C. When the partition contains singletons (m* = 1), the matrix decomposition Σ = ACAᵗ + Γ is made unique (up to a permutation of rows and columns of C) by imposing the additional constraint that the entries Γ_{aa} corresponding to singletons are equal to 0. Since the definition of Δ(C) is invariant with respect to permutations of rows and columns, this implies that Δ(C*) is well defined for any covariance matrix Σ.

For arbitrary Σ, Δ(C*) is not necessarily positive. Nevertheless, if Δ(C*) > 0, then G* is characterized by this property.

Corollary 2.3.

Let G be a partition such that m = mink |Gk| ≥ 2, the decomposition Σ = AC At + Г holds with A associated to G and Δ(C) > 0. Then G = G*.

As pointed out in (1.7), in the latent model (1.2), Δ(C*) is equal to the square of the minimal L2-distance between two latent variables. So, in this case, the condition Δ(C*) > 0 simply requires that all latent variables are distinct.

3. Minimax thresholds on cluster separation for perfect recovery.

Before developing variable clustering procedures, we begin by assessing the limits of the size of each of the two cluster separation metrics below which no algorithm can be expected to recover the clusters perfectly. We denote by m* = min_k |G*_k| the size of the smallest cluster of the target partition G* defined above. For 1 ≤ m ≤ p/2 and η > 0, we define M(m, η) as the set of covariance matrices Σ fulfilling MCOD(Σ) > η|Σ| and whose associated partition G* has groups of equal size m* ≥ m. Similarly, for τ > 0, we define D(m, τ) as the set of covariance matrices Σ fulfilling Δ(C*) > τ|Γ| and whose associated partition G* has groups of equal size m* ≥ m. We use the notation ℙ_Σ to refer to the normal distribution with covariance Σ.

Theorem 3.1.

There exists a positive constant c1 such that, for any 1 ≤ mp/3 and any η such that

0 ≤ η < η* ≔ c1 √(log(p)/n), (3.1)

we have inf_Ĝ sup_{Σ ∈ M(m,η)} ℙ_Σ(Ĝ ≠ G*) ≥ 1/7, where the infimum is taken over all possible estimators.

When 2 ≤ m = p/2, the same result holds but with the Condition (3.1) replaced by

0 ≤ η < η* ≔ c1 [ √(log(p)/(np)) ∨ log(p)/n ]. (3.2)

We also have the following.

Theorem 3.2.

There exist positive constants c1–c3 such that the following holds for any 2 ≤ mp/2. For any τ such that

0 ≤ τ < τ* ≔ c1 [ √(log(p)/(n(m−1))) ∨ log(p)/n ], (3.3)

we have inf_Ĝ sup_{Σ ∈ D(m,τ)} ℙ_Σ(Ĝ ≠ G*) ≥ 1/7, where the infimum is taken over all estimators.

Conversely, there exists a procedure Ĝ satisfying sup_{Σ ∈ D(m,τ)} ℙ_Σ(Ĝ ≠ G*) ≤ c3/p for any τ such that

τ ≥ c2 [ √(log(p)/(n(m−1))) ∨ log(p)/n ].

Theorems 3.2 and 3.1 show that if either metric falls below the thresholds in (3.3) and (3.1) or (3.2), respectively, the estimated partition G^, irrespective of the method of estimation, cannot achieve perfect recovery with high probability uniformly over the set M(m,η) or D(m,τ). The proofs are given in Section C of the Supplementary Material [14]. We note that the Δ(C*) minimax threshold takes into account the size m* of the smallest cluster and, therefore, the required cluster separation becomes smaller for large clusters. This is not the case for the second metric, as soon as there are at least 3 groups. The proof of (3.1) provides an example where we have K = 3 clusters, that are very large, of size m* = p/3 each, and where the MCOD(Σ) threshold does not decrease with m*.

Theorem 3.2 also provides a matching upper bound for the minimax threshold. Unfortunately, the procedure achieving this bound has an exponential computational complexity (see Section C.3 in the Supplementary Material [14] for further details and Section 5 for a near-minimax optimal algorithm with polynomial computational complexity).

4. COD for variable clustering.

4.1. COD procedure.

We begin with a procedure that can be viewed as natural for model (1.1). It is based on the following intuition. Two indices a and b belong to the same cluster of G*, if and only if COD(a, b) = 0, with COD defined in (2.1). Equivalently, a and b belong to the same cluster when

sCOD(a, b) ≔ max_{c≠a,b} |cov(X_a − X_b, X_c)| / √( var(X_a − X_b) var(X_c) ) = max_{c≠a,b} |cor(X_a − X_b, X_c)| = 0,

where sCOD stands for scaled COvariance Differences. In the following, we work with this quantity, as it is scale invariant. It is natural to place a and b in the same cluster when the estimator sCOD^(a,b) is below a certain threshold, where

ŝCOD(a, b) ≔ max_{c≠a,b} |ĉor(X_a − X_b, X_c)| = max_{c≠a,b} |Σ̂_{ac} − Σ̂_{bc}| / √( (Σ̂_{aa} + Σ̂_{bb} − 2Σ̂_{ab}) Σ̂_{cc} ). (4.1)

We estimate the partition Ĝ according to the simple COD algorithm explained below. The algorithm does not require as input the specification of the number K of groups, which is automatically estimated by our procedure. Step 3(c) of the algorithm is called the “or” rule, and can be replaced with the “and” rule below, without changing the theoretical properties of our algorithm,

Ĝ_l = { j ∈ S : ŝCOD(a_l, j) ∨ ŝCOD(b_l, j) ≤ α }.

The numerical performances of these two rules are also very close in simulation studies, similar to what we reported for a related COD procedure based on correlations [13]. Given these small differences, we focus on the “or” rule for the sake of space. The algorithmic complexity for computing Σ̂ is O(p²n) and the complexity of COD is O(p³), so the overall complexity of our estimation procedure is O(p²(p ∨ n)). The procedure is also valid when Γ has very small off-diagonal entries, and the results are presented in Section 6.

The COD Algorithm.

  • Input: Σ^ and α > 0

  • Initialization: S = {1, …, p} and l = 0

  • Repeat: while S ≠ ∅
    1. l ← l + 1
    2. If |S| = 1 Then Ĝ_l = S
    3. If |S| > 1 Then
      a. (a_l, b_l) = argmin_{a,b ∈ S, a≠b} ŝCOD(a, b)
      b. If ŝCOD(a_l, b_l) > α Then Ĝ_l = {a_l}
      c. Else Ĝ_l = { j ∈ S : ŝCOD(a_l, j) ∧ ŝCOD(b_l, j) ≤ α }
    4. S ← S \ Ĝ_l
  • Output: the partition Ĝ = (Ĝ_l)_{l=1,…,k}
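A direct Python transcription of the COD algorithm follows. It is a minimal sketch of ours (it is not the released cord package, and the function names are hypothetical), taking the sample covariance Σ̂ and the threshold α as inputs, computing ŝCOD as in (4.1) and applying the “or” rule of Step 3(c). With α of the order of √(log p/n), as prescribed by Theorem 4.1, the number of groups is determined automatically.

```python
import numpy as np

def scod_hat(Sigma_hat, a, b):
    """Estimator (4.1): max over c != a, b of |S_ac - S_bc| / sqrt((S_aa + S_bb - 2 S_ab) S_cc)."""
    mask = np.ones(Sigma_hat.shape[0], dtype=bool)
    mask[[a, b]] = False
    denom = Sigma_hat[a, a] + Sigma_hat[b, b] - 2.0 * Sigma_hat[a, b]
    num = np.abs(Sigma_hat[a, mask] - Sigma_hat[b, mask])
    return np.max(num / np.sqrt(denom * Sigma_hat.diagonal()[mask]))

def cod_cluster(Sigma_hat, alpha):
    """COD algorithm with the 'or' rule; returns the estimated partition as a list of index lists."""
    S = list(range(Sigma_hat.shape[0]))
    partition = []
    while S:
        if len(S) == 1:                                    # Step 2
            partition.append(S)
            break
        pairs = [(scod_hat(Sigma_hat, a, b), a, b)         # Step 3(a): closest pair in S
                 for i, a in enumerate(S) for b in S[i + 1:]]
        val, a_l, b_l = min(pairs)
        if val > alpha:                                    # Step 3(b): a_l forms a singleton
            group = [a_l]
        else:                                              # Step 3(c): the "or" rule
            group = [j for j in S if j in (a_l, b_l)
                     or min(scod_hat(Sigma_hat, a_l, j),
                            scod_hat(Sigma_hat, b_l, j)) <= alpha]
        partition.append(group)
        S = [j for j in S if j not in group]               # Step 4
    return partition
```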

4.2. Perfect cluster recovery with COD for minimax optimal MCOD(Σ) cluster separation.

Theorem 4.1 shows that the partition G^ produced by the COD algorithm has the property that G^=G*, with high probability, as soon as the separation MCOD(Σ) between clusters exceeds a constant times the threshold (3.1) of Theorem 3.1 of the previous section.

Theorem 4.1.

Under the distributional Assumption 1, there exist numerical constants c1, c2 > 0 such that, if

α ≥ c1 L² √(log(p)/n)

and MCOD(Σ) > 3α|Σ|, then we have exact cluster recovery with probability 1 − c2/p.

We recall that for Gaussian data the constant L = 1. The proof is given in Section D.1 of the Supplementary Material [14].

We observe that while the COD algorithm succeeds in recovering G* at the minimax separation rate (3.1) when K ≥ 3, it does not offer guarantees at the minimax separation rate (3.2) when K = 2. In this last case (K = 2), we observe that

(1/2) Δ(C*) ≤ MCOD(Σ) ≤ Δ(C*),

so the metric MCOD(Σ) is equivalent to Δ(C*), and we refer to Section 5 for an optimal algorithm.

4.3. A data-driven calibration procedure for COD.

The performance of the COD algorithm depends on the value of the threshold parameter α. Whereas Theorem 4.1 ensures that a good value for α is of the order of √(log p/n), its optimal value depends on the actual distribution (at least through the sub-Gaussian norm) and is unknown to the statistician. We propose below a new, fully data-dependent, criterion for selecting α, and the corresponding partition Ĝ, from a set of candidate partitions. This criterion is based on data splitting. Let us consider two independent subsamples of the original sample, D1 and D2, each of size n/2.

We denote by G^1 a collection of partitions computed from D1, for instance via the COD algorithm with a varying threshold α. For any a < b, we use Di, i = 1, 2, to calculate, respectively,

Δ̂^(i)_{ab} ≔ [ Ĉor^{(i)}(X_a − X_b, X_c) ]_{c≠a,b},   i = 1, 2.

Since Δ_{ab} ≔ [Cor(X_a − X_b, X_c)]_{c≠a,b} equals zero if and only if a ∼_{G*} b, we want to select a partition G such that Δ̂^(2)_{ab} 1_{a ≁_G b} is a good predictor of Δ_{ab}. To implement this principle, it remains to evaluate Δ_{ab} independently of Δ̂^(2)_{ab}. For this evaluation, we propose to reuse the sample D1, which has already been used to build the family of partitions Ĝ^(1). More precisely, we select Ĝ ∈ Ĝ^(1) by minimizing the data-splitting criterion H:

Ĝ ∈ argmin_{G ∈ Ĝ^(1)} H(G)   with   H(G) ≔ ∑_{a<b} |Δ̂^(2)_{ab} 1_{a ≁_G b} − Δ̂^(1)_{ab}|_2^2.

The following proposition assesses the performance of G^. We need the following additional assumption.

(P1) If Cor(X_a − X_b, X_c) = 0, then E[Ĉor(X_a − X_b, X_c)] = 0.

In general, the sample correlation is not an unbiased estimator of the population-level correlation. Still, (P1) is satisfied when the data are normally distributed or in a latent model (1.2) when the noise variables E_a have a symmetric distribution. The next proposition provides guarantees for the criterion H, averaged over D2, and denoted by E^(2)[H(G)]. The proof is given in Section D.2 of the Supplementary Material [14].

Proposition 4.2.

Assume that the distributional Assumption 1 and (P1) hold. Then there exists a constant c1 > 0 such that, when MCOD(Σ) > c1 |Σ| L² √(log(p)/n), we have

E^(2)[H(G*)] ≤ min_{G ∈ Ĝ^(1)} E^(2)[H(G)], (4.2)

both with probability larger than 1 − 4/p and in expectation with respect to D1.

Under the condition MCOD(Σ) > c1 |Σ| L² √(log(p)/n), Theorem 4.1 ensures that G* belongs to Ĝ^(1) with high probability, whereas (4.2) suggests that the criterion is minimized at G*.

If we consider a data-splitting algorithm based on COD^(a,b) instead of sCOD^(a,b), then we can obtain a counterpart of Proposition 4.2 without requiring the additional assumption (P1). Still, we favor the procedure based on sCOD^(a,b) mainly for its scale-invariance property.
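The data-splitting selection can be prototyped as follows. This is our own minimal sketch with hypothetical names: it takes a family of candidate partitions (e.g., COD run on the first half of the sample over a grid of thresholds α) and returns the one minimizing the criterion H of this subsection.

```python
import numpy as np

def delta_hat(X, a, b):
    """Vector [cor(X_a - X_b, X_c)]_{c != a, b} computed from the sample X (n x p)."""
    mask = np.ones(X.shape[1], dtype=bool)
    mask[[a, b]] = False
    C = np.corrcoef(np.column_stack([X[:, a] - X[:, b], X[:, mask]]), rowvar=False)
    return C[0, 1:]

def select_partition(X, candidates):
    """Pick the candidate partition minimizing the data-splitting criterion H.
    `candidates` should be built from the first half of X (e.g., COD over a grid of alpha)."""
    n, p = X.shape
    X1, X2 = X[: n // 2], X[n // 2:]                  # subsamples D1 and D2
    pairs = [(a, b) for a in range(p) for b in range(a + 1, p)]
    d1 = {ab: delta_hat(X1, *ab) for ab in pairs}     # Delta-hat^(1)
    d2 = {ab: delta_hat(X2, *ab) for ab in pairs}     # Delta-hat^(2)

    def H(G):
        label = np.empty(p, dtype=int)
        for k, g in enumerate(G):
            label[list(g)] = k
        return sum(np.sum(((d2[ab] if label[ab[0]] != label[ab[1]] else 0.0) - d1[ab]) ** 2)
                   for ab in pairs)

    return min(candidates, key=H)
```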

5. Penalized convex K-means: PECOK.

5.1. PECOK algorithm.

Motivated by the fact that the COD algorithm is minimax optimal with respect to the MCOD(Σ) metric for K ≥ 3, but not necessarily with respect to the Δ(C*) metric (unless the size of the smallest cluster is constant), we propose below an alternative procedure, that adapts to this metric. Our second method is a natural extension of one of the most popular clustering strategies. When we view the G-block covariance model as arising via the latent factor representation in (i) in the Introduction, the canonical clustering approach would be via the K-means algorithm [30], which is NP-hard [5]. Following Peng and Wei [34], we consider a convex relaxation of it, which is computationally feasible in polynomial time. We argue below that, for estimating clusters given by (1.1), one needs to further tailor it to our model. The statistical analysis of the modified procedure is the first to establish consistency of variable clustering via K-means type procedures, to the best of our knowledge.

The estimator offered by the standard K-means algorithm, with the number K of groups of G* known, is

Ĝ ∈ argmin_G crit(X, G)   with   crit(X, G) = ∑_{a=1}^p min_{k=1,…,K} |X_{:a} − X̄_{G_k}|_2^2, (5.1)

and X̄_{G_k} = |G_k|^{-1} ∑_{a ∈ G_k} X_{:a}.

For a partition G, let us introduce the corresponding partnership matrix B by

B_{ab} = 1/|G_k|  if a and b are in the same group G_k,   and   B_{ab} = 0  if a and b are in different groups. (5.2)

We observe that Bab > 0 if and only if a~Gb. In particular, there is a one-to-one correspondence between partitions G and their corresponding partnership matrices. It is shown in Peng and Wei [34] that the collection of such matrices B is described by the collection O of orthogonal projectors fulfilling tr(B) = K, B1 = 1 and Bab ≥ 0 for all a, b.

Theorem 2.2 in Peng and Wei [34] shows that solving the K-means problem is equivalent to finding the global maximum

B̄ = argmax_{B ∈ O} ⟨Σ̂, B⟩, (5.3)

and then recovering G^ from B¯.

The set of orthogonal projectors is not convex, so, following Peng and Wei [34], we consider a convex relaxation C of O obtained by relaxing the condition “B orthogonal projector” to “B positive semidefinite,” leading to

C ≔ { B ∈ ℝ^{p×p} : B ⪰ 0 (symmetric and positive semidefinite), ∑_a B_{ab} = 1 for all b, B_{ab} ≥ 0 for all a, b, tr(B) = K }. (5.4)

Thus, the (uncorrected) convex relaxation of K-means is equivalent with finding

B̃ = argmax_{B ∈ C} ⟨Σ̂, B⟩. (5.5)

To assess the relevance of this estimator, we first study its behavior at the population level, when Σ^ is replaced by Σ in (5.5). Indeed, if the minimizer of our criterion does not recover the true partition at the population level, we cannot expect it to be consistent, even in a large sample asymptotic context (fixed p, n goes to infinity). We recall that |Г|V ≔ maxa Гaa – mina Гaa.

Proposition 5.1.

Assume that Δ(C*) > 2|Γ|_V/m*. Then B* = argmax_{B ∈ O} ⟨Σ, B⟩. If Δ(C*) > 7|Γ|_V/m*, then B* = argmax_{B ∈ C} ⟨Σ, B⟩.

For Δ(C*) large enough, the population version of convexified K-means recovers B*. The next proposition illustrates that the condition Δ(C*) > 2|Г|V /m* for population K-means is in fact necessary.

Proposition 5.2.

Consider the model (1.1) with

C* = [ α  0  0 ; 0  β  β−τ ; 0  β−τ  β ],   Γ = [ γ₊  0  0 ; 0  γ  0 ; 0  0  γ ]   and   |G*_1| = |G*_2| = |G*_3| = m*.

The population maximizer B_Σ = argmax_{B ∈ O} ⟨Σ, B⟩ is not equal to B* as soon as 2τ = Δ(C*) < (2/m*)|Γ|_V.

The two propositions above are proved in Section A.1 in the Supplementary Material [14]. As a consequence, when Γ is not proportional to the identity matrix, the population minimizers based on K-means and convexified K-means do not necessarily recover the true partition even when the “within-between group” covariance gap is strictly positive. This undesirable behavior of K-means is not completely unexpected as K-means is a quantization algorithm which aims to find clusters of similar width, instead of “homogeneous” clusters. Hence, we need to modify it for our purpose.

This leads us to suggest a population-level correction. Indeed, as a direct corollary of Proposition 5.1, we have

B* = argmax_{B ∈ C} ⟨Σ − Γ, B⟩,

as long as Δ(C*) > 0. This suggests the following PEnalized COnvex K-means (PECOK) algorithm, in three steps. The main step, Step 2, produces an estimator B̂ of B*, from which we derive the estimated partition Ĝ. We summarize this below.

The PECOK Algorithm.

  • Step 1. Estimate Г by Γ^.

  • Step 2. Estimate B* by B̂ = argmax_{B ∈ C} ( ⟨Σ̂, B⟩ − ⟨Γ̂, B⟩ ).

  • Step 3. Estimate G* by applying a clustering algorithm to the columns of B^.

The required inputs for Step 2 of our algorithm are: (i) Σ̂, the sample covariance matrix; (ii) Γ̂, the estimator produced at Step 1; and (iii) K, the number of groups. Our only requirement on the clustering algorithm applied in Step 3 is that it succeeds in recovering the partition G* when applied to the true partnership matrix B*. The standard K-means algorithm [30] seeded with K distinct centroids, kmeans++ [4] or any approximate K-means as defined in (5.13) in Section 5.4, fulfill this property.

We view the term ⟨Γ̂, B⟩ as a penalty term on B, with data-dependent weights Γ̂. Therefore, the construction of an accurate estimator Γ̂ of Γ is a crucial step for guaranteeing the statistical optimality of the PECOK estimator.
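Step 2 is a standard semidefinite program and can be prototyped with a generic SDP solver. The sketch below is our own illustration (it is not the authors' released pecok package): it uses cvxpy for Step 2 and K-means on the columns of B̂ for Step 3. Here Gamma_hat is the Step 1 estimate, e.g., the one constructed in Section 5.2; setting it to zero gives the uncorrected relaxation (5.5).

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import KMeans

def pecok_steps_2_3(Sigma_hat, Gamma_hat, K):
    """Solve the penalized convex relaxation over the set C of (5.4), then cluster columns of B-hat."""
    p = Sigma_hat.shape[0]
    B = cp.Variable((p, p), symmetric=True)
    constraints = [B >> 0,                  # B positive semidefinite
                   cp.sum(B, axis=1) == 1,  # row sums equal to one (B1 = 1)
                   B >= 0,                  # nonnegative entries
                   cp.trace(B) == K]        # trace equal to the number of groups
    objective = cp.Maximize(cp.trace((Sigma_hat - Gamma_hat) @ B))  # <Sigma-hat, B> - <Gamma-hat, B>
    cp.Problem(objective, constraints).solve()
    B_hat = B.value
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(B_hat.T)   # Step 3 on the columns of B-hat
    return labels, B_hat
```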

5.2. Construction of Γ^.

Estimating Г before estimating the partition itself is a nontrivial task, and needs to be done with care. We explain our estimation below and analyze it in Proposition A.10 in Section A.5. We show that this estimator of Г is appropriate whenever Г is a diagonal matrix (or diagonally dominant, with small off-diagonal entries). For any a, b ∈ [p], define

V(a, b) ≔ max_{c,d ∈ [p]\{a,b}} |(Σ̂_{ac} − Σ̂_{ad}) − (Σ̂_{bc} − Σ̂_{bd})| / √(Σ̂_{cc} + Σ̂_{dd} − 2Σ̂_{cd}), (5.6)

with the convention 0/0 = 0. Guided by the block structure of Σ, we define

b_1(a) ≔ argmin_{b ∈ [p]\{a}} V(a, b)   and   b_2(a) ≔ argmin_{b ∈ [p]\{a, b_1(a)}} V(a, b),

to be two elements “close” to a, that is, two indices b_1 = b_1(a) and b_2 = b_2(a) such that the empirical covariance differences Σ̂_{b_i c} − Σ̂_{b_i d}, i = 1, 2, are most similar to Σ̂_{ac} − Σ̂_{ad}, for all variables c and d not equal to a or b_i, i = 1, 2. It is expected that b_1(a) and b_2(a) either belong to the same group as a, or belong to some “close” groups. Then our estimator Γ̂ is a diagonal matrix, defined by

Γ̂_{aa} = Σ̂_{aa} + Σ̂_{b_1(a) b_2(a)} − Σ̂_{a b_1(a)} − Σ̂_{a b_2(a)}   for a = 1, …, p. (5.7)

Intuitively, Γ̂_{aa} should be close to Σ_{aa} + Σ_{b_1(a) b_2(a)} − Σ_{a b_1(a)} − Σ_{a b_2(a)}, which is equal to Γ_{aa} in the favorable event where both b_1(a) and b_2(a) belong to the same group as a.

In general, b_1(a) and b_2(a) cannot be guaranteed to belong to the same group as a. Nevertheless, these two surrogates b_1(a) and b_2(a) are close enough to a so that |Γ̂_{aa} − Γ_{aa}| is at most of the order of |Γ|_∞ √(log(p)/n), uniformly in a (i.e., in ℓ_∞-norm), as shown in Proposition A.10 in Section A.5 of the Supplementary Material [14]. A slightly simpler estimator of Γ was proposed in Appendix A of a previous version of this work [15], but a bound on |Γ̂_{aa} − Γ_{aa}| for that estimator contains a factor proportional to |Σ|, which is not desirable, and can be avoided by (5.7). In the next subsection, we show that our proposed Γ̂ leads to perfect recovery of G*, via PECOK, under minimal separation conditions.
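A naive O(p⁴) numpy transcription of (5.6)–(5.7), our own sketch and not an optimized implementation, reads as follows.

```python
import numpy as np

def gamma_hat(Sigma_hat):
    """Diagonal estimator Gamma-hat from (5.6)-(5.7)."""
    p = Sigma_hat.shape[0]
    diag = np.diag(Sigma_hat)

    def V(a, b):
        # V(a, b) = max over c, d not in {a, b} of
        #           |(S_ac - S_ad) - (S_bc - S_bd)| / sqrt(S_cc + S_dd - 2 S_cd), with 0/0 = 0.
        idx = [c for c in range(p) if c not in (a, b)]
        Ma = Sigma_hat[a, idx][:, None] - Sigma_hat[a, idx][None, :]
        Mb = Sigma_hat[b, idx][:, None] - Sigma_hat[b, idx][None, :]
        S = Sigma_hat[np.ix_(idx, idx)]
        den = np.sqrt(np.maximum(diag[idx][:, None] + diag[idx][None, :] - 2.0 * S, 0.0))
        num = np.abs(Ma - Mb)
        ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        return ratio.max()

    out = np.zeros(p)
    for a in range(p):
        # b1(a), b2(a): the two indices "closest" to a according to V
        b1, b2 = sorted((b for b in range(p) if b != a), key=lambda b: V(a, b))[:2]
        out[a] = (Sigma_hat[a, a] + Sigma_hat[b1, b2]
                  - Sigma_hat[a, b1] - Sigma_hat[a, b2])
    return np.diag(out)
```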

Note that PECOK requires the knowledge of the true number K of groups. When the number K of groups itself is unknown, we can modify the PECOK criterion by adding a penalty term as explained in a previous version of our work [15], Section 4. Alternatively, we propose in Section G of Supplementary Material [14] selection via a simple data-splitting procedure.

5.3. Perfect cluster recovery with PECOK for near-minimax Δ-cluster separation.

We show in this section that the PECOK estimator recovers the clusters exactly, with high probability, at a near-minimax separation rate with respect to the Δ(C*) metric.

Theorem 5.3.

There exist three positive constants c1, c2, c3 such that the following holds. Let Γ̂ be any estimator of Γ such that |Γ̂ − Γ|_V ≤ δ_{n,p} with probability 1 − c1/p. Then, under Assumption 1, when L⁴ log(p) ≤ c3 n and

Δ(C*) ≥ c_L |Γ|_∞ [ √(log(p)/(m* n)) + √(p/(n m*²)) + log(p)/n + p/(n m*) ] + δ_{n,p}/m*, (5.8)

then B̂ = B* and Ĝ = G*, with probability higher than 1 − c1/p. Here, c_L is a positive constant that only depends on L in Assumption 1. In particular, if Γ̂ is the estimator (5.7), the same conclusion holds with probability higher than 1 − c2/p when

Δ(C*) ≥ c_L |Γ|_∞ [ √(log(p)/(m* n)) + √(p/(n m*²)) + log(p)/n + p/(n m*) ]. (5.9)

The proof is given in Section A.3 of the Supplementary Material [14].

Remark 1.

We left the term δ_{n,p} explicit in (5.8) in order to make clear how the estimation of Γ affects the cluster separation metric Δ(C*). Without a correction (i.e., taking Γ̂ = 0), the term δ_{n,p}/m* equals |Γ|_V/m*, which is nonzero (and does not decrease in a high-sample asymptotic) unless Γ has equal diagonal entries. This phenomenon is consistent with the population analysis in the previous subsection. Display (5.9) shows that the separation condition can be much weakened with the correction. In particular, for balanced clusters, that is, when m* = p/K, exact recovery is guaranteed when

Δ(C*) ≥ c_L [ √((K ∨ log p)/(m* n)) + (K ∨ log p)/n ], (5.10)

for an appropriate constant cL > 0. In view of Theorem 3.2, when m* ≥ cp/ log(p) the rate is minimax optimal, since in this case K = p/m* = O(log(p)). When m* = o(p/ log(p)), the number K of clusters grows faster than log(p), and we possibly lose a factor K/log(p) relative to the optimal rate.

As discussed in the Introduction, this gap is possibly due to a computational barrier and we refer to [16] for a discussion in the related stochastic block model.

Bounded variables X also follow a sub-Gaussian distribution. Nevertheless, the corresponding sub-Gaussian norm L may be large and Theorem 5.3 can sometimes be improved, as in Theorem 5.4 below, proved in Section A.3 of the Supplementary Material [14].

Theorem 5.4.

There exist three positive constants c1, c2, c3 such that the following holds. Let Γ̂ be any estimator of Γ such that |Γ̂ − Γ|_V ≤ δ_{n,p} with probability 1 − c1/p. Then, under Assumption 1-bis, and when

Δ(C*) ≥ c2 [ M |Γ|_∞^{1/2} √(p log(p)/(n m*²)) + M² p log(p)/(n m*) + δ_{n,p}/m* ], (5.11)

then B̂ = B* and Ĝ = G*, with probability higher than 1 − c1/p.

When we choose Γ̂ as in (5.7), the term δ_{n,p}/m* can be simplified as under Assumption 1; see Proposition A.10 in Section A.5 of the Supplementary Material [14]. For balanced clusters, m* = p/K, Condition (5.11) simplifies to

Δ(C*) ≥ c2 [ M |Γ|_∞^{1/2} √(K log(p)/(n m*)) + M² K log(p)/n + δ_{n,p}/m* ].

In comparison to (5.10), the condition no longer depends on the sub-Gaussian norm L, but the term K ∨ log(p) has been replaced by K log(p).

Remark 2.

For the Ising block model (1.3) with K balanced groups, we have M = 1 and p = m*K, C* = (ω_in − ω_out)I_K + ω_out J and Γ = (1 − ω_in)I. As a consequence, no diagonal correction is needed, that is, we can take Γ̂ = 0, and since |Γ|_V = 0, we have δ_{n,p} = 0. Then, for K balanced groups, Condition (5.11) simplifies to

ω_in − ω_out ≳ K √(log(p)/(np)) + K log(p)/n.

In the specific case K = 2, we recover (up to numerical multiplicative constants) the optimal rate proved in [10]. Our procedure and analysis provide a generalization of these results, as they are valid for general K and Theorem 5.4 also allows for unbalanced clusters.

5.4. A comparison between PECOK and spectral clustering.

In this section, we discuss connections between the PECOK algorithm introduced above and spectral clustering, a method that has become popular in network clustering.

First we recall the premise of spectral clustering, adapted to our context. For G*-block covariance models as in (1.1), we have Σ − Γ = AC*Aᵗ. Let U be the p × K matrix collecting the K leading eigenvectors of Σ − Γ. It has been shown (see, e.g., Lemma 2.1 in Lei and Rinaldo [28]) that a and b belong to the same cluster if and only if U_{a:} = U_{b:}, and if and only if [UUᵗ]_{a:} = [UUᵗ]_{b:}. When used for variable clustering, uncorrected spectral clustering consists in applying a clustering algorithm, such as K-means, to the rows of the p × K matrix obtained by retaining the K leading eigenvectors of Σ̂.

SC Algorithm.

  1. Compute V^, the matrix of the K leading eigenvectors of Σ^

  2. Estimate G* by applying a (rotation invariant) clustering method to the rows of V^.

Arguing as in Peng and Wei [34], we have the following.

Lemma 5.5.

SC algorithm is equivalent to the following algorithm:

  • Step 1. Find B̄ = argmax{ ⟨Σ̂, B⟩ : tr(B) = K, I ⪰ B ⪰ 0 }.

  • Step 2. Estimate G* by applying a (rotation invariant) clustering method to the rows of B¯.

The connection between (unpenalized) PECOK and spectral clustering now becomes clear. The (unpenalized) PECOK estimator B˜ defined in (5.5) involves the calculation of

B̃ = argmax_B { ⟨Σ̂, B⟩ : B1 = 1, B_{ab} ≥ 0, tr(B) = K, B ⪰ 0 }. (5.12)

Since the matrices B involved in (5.12) are doubly stochastic, their eigenvalues are smaller than 1, and hence (5.12) is equivalent to B̃ = argmax_B { ⟨Σ̂, B⟩ : B1 = 1, B_{ab} ≥ 0, tr(B) = K, I ⪰ B ⪰ 0 }. Note then that B̄ can be viewed as a less constrained version of B̃, in which C is replaced by C̄ = { B : tr(B) = K, I ⪰ B ⪰ 0 }, where we have dropped the p(p + 1)/2 constraints given by B1 = 1 and B_{ab} ≥ 0. The proof of Lemma 5.5 shows that B̄ = V̂V̂ᵗ, so, contrary to B̂, the estimator B̄ is (almost surely) never equal to B*. Below, we adapt the arguments of [28] in order to provide some guarantees for a corrected version of spectral clustering.

In view of this connection between spectral clustering and unpenalized PECOK, and of the fact that the population justification of spectral clustering deals with the spectral decomposition of Σ − Γ, we propose the following corrected version of the algorithm, based on Σ̃ ≔ Σ̂ − Γ̂.

CSC Algorithm.

  1. Compute Û, the matrix of the K leading eigenvectors of Σ̃ ≔ Σ̂ − Γ̂

  2. Estimate G* by clustering the rows of U^, via an η-approximation of K-means (5.13).

For η > 1, an η-approximation of K-means is a clustering algorithm producing a partition G^ such that

crit(Ûᵗ, Ĝ) ≤ η min_G crit(Ûᵗ, G), (5.13)

with crit(·, ·) the K-means criterion (5.1). Although solving K-means is NP-Hard [5], there exist polynomial time approximate K-means algorithms; see Kumar et al. [26]. As a consequence of the above discussion, the first step of CSC can be interpreted as a relaxation of the program associated to the PECOK estimator B^.
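The CSC algorithm itself is only a few lines; the sketch below (our naming) uses numpy for the eigendecomposition and scikit-learn's K-means in place of the η-approximation in (5.13).

```python
import numpy as np
from sklearn.cluster import KMeans

def csc(Sigma_hat, Gamma_hat, K):
    """Corrected spectral clustering: K-means on the rows of the K leading eigenvectors
    of Sigma-tilde = Sigma-hat - Gamma-hat."""
    Sigma_tilde = Sigma_hat - Gamma_hat
    _, eigvecs = np.linalg.eigh(Sigma_tilde)     # eigenvalues in ascending order
    U_hat = eigvecs[:, -K:]                      # p x K matrix of leading eigenvectors
    return KMeans(n_clusters=K, n_init=10).fit_predict(U_hat)
```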

To simplify the presentation of the results for the CSC procedure, we assume in the following that all the groups have the same size: |G*_1| = ⋯ = |G*_K| = m* = p/K. We emphasize that this information is not required by either PECOK or CSC, or in the proof of Proposition 5.6 below. We only use it here for simplicity. We denote by S_K the set of permutations of {1, …, K} and we denote by

L̄(Ĝ, G*) = min_{σ ∈ S_K} ∑_{k=1}^K |G*_k \ Ĝ_{σ(k)}| / m*

the sum of the fractions of mis-assigned variables with indices in G*_k. In the previous sections, we studied perfect recovery of G*, which would correspond to L̄(Ĝ, G*) = 0. We give below conditions under which L̄(Ĝ, G*) ≤ ρ, for an appropriate quantity ρ < 1. We begin with a general theorem pertaining to partial partition recovery by CSC, under a “signal-to-noise ratio” condition involving the smallest eigenvalue λ_K(C*) of C*.

Proposition 5.6.

Let Re(Σ) = tr(Σ)/‖Σ‖op denote the effective rank of Σ. There exist cη,L > 0 only depending on η and L and a numerical constant c1 such that the following holds under Assumption 1. For any 0 < ρ < 1, if

λ_K(C*) ≥ c_{η,L} (√K ‖Σ‖_op)/(m* √ρ) √((Re(Σ) ∨ log(p))/n), (5.14)

then L̄(Ĝ, G*) ≤ ρ, with probability larger than 1 − c1/p.

The proof extends the arguments of [28], initially developed for clustering procedures in stochastic block models, to our context. Specifically, we relate the error L̄(Ĝ, G*) to the noise level, quantified in this problem by ‖Σ̃ − AC*Aᵗ‖_op. We then employ the results of [24] to show that this operator norm can be controlled, with high probability, which leads to the conclusion of the theorem.

As n goes to infinity, the right-hand side of Condition (5.14) goes to zero, and CSC is therefore consistent in a large sample asymptotic. In contrast, we emphasize that (uncorrected) SC algorithm is not consistent as can be shown by a population analysis similar to that of Proposition 5.2.

We observe that Δ(C*) ≥ 2λK (C*), so we can compare the lower bound (5.14) on λK (C*) to the lower-bound (5.10) on Δ(C*). To further facilitate the comparison between CSC and PECOK, we discuss both the conditions and the conclusion of this theorem in the simple setting where C* = τ IK and Г = Ip. Then the cluster separation measures coincide up to a factor 2, Δ(C*) = 2λK (C*) = 2τ.

Corollary 5.7 (Illustrative example: C* = τ IK and Г = Ip).

There exist two positive constants c_{η,L}, c′_{η,L}, depending only on η and L, and a numerical constant c3 such that the following holds under Assumption 1. For any 0 < ρ < 1, if

ρ ≥ c_{η,L} [ K²/n + K log(p)/n ]   and   τ ≥ c′_{η,L} [ K²/(ρn) ∨ K/√(ρ n m*) ], (5.15)

then L̄(Ĝ, G*) ≤ ρ, with probability larger than 1 − c3/p.

Recall that Theorem 5.3 above states that when Ĝ is obtained via the PECOK algorithm, and if τ ≳ √((K ∨ log p)/(m* n)) + (log(p) ∨ K)/n, then L̄(Ĝ, G*) = 0, or equivalently, Ĝ = G*, with high probability. We can therefore provide the following comparison (we refer to Section G of the Supplementary Material [14] for a numerical comparison):

  • When τ ≳ √((K ∨ log p)/(m* n)) + (log(p) ∨ K)/n, and under the additional condition that n ≳ (K ∨ log(p))²/K, the CSC algorithm satisfies L̄(Ĝ, G*) ≲ K²/(K ∨ log(p)). So, for K = o(log(p)) and for a large enough sample size n ≳ (log(p))²/K, the fraction of variables misclassified by CSC is vanishing as O(K/ log(p)) for τ ≳ √(log p/(m* n)) + log(p)/n. This guarantee is slightly weaker than for PECOK, which ensures exact recovery in this setting. This discrepancy may be an artifact of the proof technique. Very recent works [1, 31] (released during the reviewing process of this paper) present reconstruction error bounds tighter than those of [28], for (variants of) spectral clustering, when applied to two-parameter SBMs, for network data, not the type of data analyzed in this work.

  • When we move away from the case C* = τI_K, the guarantees for CSC can degenerate. Consider, for instance, Γ = I and C* = τI_K + αJ, with J being the matrix with all entries equal to one, as in the Ising block model discussed on page 126. Notice that in this case we continue to have Δ(C*) = 2λ_K(C*) = 2τ. Then, for a given, fixed, value of ρ and K fixed, condition (5.14) requires a cluster separation at least
    τ ≳ α √(log(p)/(nρ)),
    which is independent of m*, unlike the condition τ ≳ √((K ∨ log p)/(m* n)) + (log(p) ∨ K)/n for PECOK. This unpleasant feature is induced by the inflation of ‖Σ̃ − AC*Aᵗ‖_op with α. Again, this weakness in the guarantees may be an artifact of the proof, which relies on the Davis–Kahan inequality for controlling the alignment between the sample eigenvectors associated with the K largest eigenvalues and their population counterpart.

All the results of this section are proved in Section E of the Supplementary Material [14].

6. Approximate G-block covariance models.

In the previous sections, we have proved that under some separation conditions, COD and PECOK procedures are able to exactly recover the partition G*. However, in practical situations, the separation conditions may not be met. Besides, if the entries of Σ have been modified by an infinitesimal perturbation, then the corresponding partition G* would consist of p singletons.

As a consequence, it may be more realistic, and more appealing from a practical point of view, to look for a partition G[K] with K < |G*| groups such that Σ is close to a matrix of the form ACAᵗ + Γ, where Γ is diagonal and A is associated to G[K]. This is equivalent to considering a decomposition Σ = ACAᵗ + Γ with Γ nondiagonal, where the nondiagonal entries of Γ are small. In the sequel, we write R = Γ − Diag(Γ) for the matrix of the off-diagonal elements of Γ and D = Diag(Γ) for the diagonal matrix given by the diagonal of Γ.

In the next subsection, we discuss under which conditions the partition G[K] is identifiable and then, we prove that COD and PECOK are able to recover these partitions.

6.1. Identifiability of approximate G-block covariance models.

When Γ is allowed to be not exactly equal to a diagonal matrix, we encounter a further identifiability issue, as a generic matrix Σ may admit many decompositions Σ = ACAᵗ + Γ. In fact, such a decomposition holds for any membership matrix A and any matrix C if we define Γ = Σ − ACAᵗ. So we need to specify the kind of decomposition that we are looking for. For fixed K, we would like to consider the partition G with K clusters that maximizes the distance between groups (e.g., MCOD(Σ, G)) while having the smallest possible noise term |R|. Unfortunately, such a partition G does not necessarily exist and is not necessarily unique. Let us illustrate this situation with a simple example.

Example. Assume that Σ is given by

$$\Sigma=\begin{bmatrix}2r&0&0\\0&2r&0\\0&0&2r\end{bmatrix}+I_p,$$

with r > 0, with the convention that each entry corresponds to a block of size 2. Considering partitions with 2 groups and allowing Г to be nondiagonal, we can decompose Σ using different partitions. For instance,

$$\Sigma=\underbrace{\begin{bmatrix}2r&0&0\\0&r&r\\0&r&r\end{bmatrix}}_{=A_1C_1A_1^t}+\underbrace{\begin{bmatrix}0&0&0\\0&r&-r\\0&-r&r\end{bmatrix}+I_p}_{=\Gamma_1}=\underbrace{\begin{bmatrix}r&r&0\\r&r&0\\0&0&2r\end{bmatrix}}_{=A_2C_2A_2^t}+\underbrace{\begin{bmatrix}r&-r&0\\-r&r&0\\0&0&0\end{bmatrix}+I_p}_{=\Gamma_2}.$$

Importantly, the two decompositions correspond to two different partitions G1 and G2, and both have |R_i| = r and MCOD(Σ, G_i) = 2r = 2|R_i|, for i = 1, 2. In addition, no decomposition Σ = ACA^t + D + R with associated partition into 2 groups satisfies MCOD(Σ, G) > 2r or |R| < r. As a consequence, there is no satisfactory way to define a unique partition maximizing MCOD(Σ, G) while having |R| as small as possible. We show below that the cutoff MCOD(Σ, G) > 2|R| is actually sufficient for partition identifiability.
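The example can also be checked numerically. The short script below (with r set to an arbitrary positive value) builds Σ for blocks of size 2, forms the residual matrix R for each of the two decompositions above, and verifies that |R_i| = r in both cases; the helper name block_residual is ours.

```python
import numpy as np

r, p = 1.0, 6                                   # three blocks of size 2, so p = 6
Sigma = np.eye(p)
for k in range(3):
    Sigma[2 * k:2 * k + 2, 2 * k:2 * k + 2] += 2 * r   # each diagonal block equals 2r

def block_residual(groups, C):
    """Return R = Sigma - A C A^t - D for the partition encoded by `groups`."""
    A = np.zeros((p, len(groups)))
    for k, Gk in enumerate(groups):
        A[Gk, k] = 1.0
    Gamma = Sigma - A @ C @ A.T
    return Gamma - np.diag(np.diag(Gamma))      # remove the diagonal part D

R1 = block_residual([[0, 1], [2, 3, 4, 5]], np.diag([2 * r, r]))   # partition G1
R2 = block_residual([[0, 1, 2, 3], [4, 5]], np.diag([r, 2 * r]))   # partition G2
assert np.isclose(np.abs(R1).max(), r) and np.isclose(np.abs(R2).max(), r)
```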

For this, let us define P_j(Σ, K), j ∈ {1, 2}, as the set of quadruplets (A, C, D, R) such that Σ = ACA^t + D + R, with A a membership matrix associated to a partition G with K groups satisfying min_k |G_k| ≥ j, and with D and R defined as above. Hence P_1 corresponds to partitions without restrictions on the minimum group size; for instance, singletons are allowed. In contrast, P_2 only contains partitions without singletons. We define

$$\rho_1(\Sigma,K)=\max\bigl\{\mathrm{MCOD}(\Sigma,G)/|R| : (A,C,D,R)\in\mathcal P_1(\Sigma,K)\text{ and }G\text{ associated to }A\bigr\},$$
$$\rho_2(\Sigma,K)=\max\bigl\{\Delta(C)/|R| : (A,C,D,R)\in\mathcal P_2(\Sigma,K)\bigr\}.$$

We view ρ1 and ρ2 as respective measures of “purity” of the block structure of Σ.

Proposition 6.1.

  1. Assume that ρ1(Σ, K) > 2. Then there exists a unique partition G such that there exists a decomposition Σ = ACA^t + Γ, with A associated to G and MCOD(Σ, G) > 2|R|. We denote this partition by G1[K].

  2. Assume that ρ2(Σ, K) > 8. Then there exists a unique partition G with min_k |G_k| ≥ 2 such that there exists a decomposition Σ = ACA^t + Γ, with A associated to G and Δ(C) > 8|R|. We denote this partition by G2[K].

  3. In addition, if both ρ1(Σ, K) > 2 and ρ2(Σ, K) > 8, then G1[K]= G2[K].

The conditions ρ1(Σ, K) > 2 and ρ2(Σ, K) > 8 are minimal for uniquely defining the partitions G1[K] and G2[K], respectively. For ρ1, this has been illustrated in the example preceding the proposition. For ρ2, we provide a counterexample with ρ2(Σ, K) = 8 in Section B.3 of the Supplementary Material [14]. The proof of Proposition 6.1 is given in Section B.2 of [14].

The conclusion of Proposition 6.1 essentially reduces to that of Proposition 2.2 in Section 2 as soon as |R| is small enough relative to the cluster separation. Denoting by K* the number of groups of G*, we observe that G1[K*] = G* and that G2[K*] = G* if m* ≥ 2. Besides, ρ1(Σ, K) = ρ2(Σ, K) = 0 for K > K*. For K < K*, and when G1[K] (resp., G2[K]) is well defined, the partition G1[K] (resp., G2[K]) is coarser than G*. In other words, G1[K] is derived from G* by merging groups G_k*, thereby increasing MCOD(Σ, G) (resp., Δ(C)) while requiring |R| to be small enough.

We point out that, in general, there is no unique decomposition Σ = ACA^t + Γ with A associated to G2[K], even when min_k |G2[K]_k| ≥ 2. Indeed, it may be possible to change some entries of C and R while keeping C + R, Δ(C) and |R| unchanged.

6.2. The COD algorithm for approximate G-block covariance models.

We show below that the COD algorithm is still applicable when Σ has small departures from a block structure. We write λ_min(Σ) for the smallest eigenvalue of Σ.

Theorem 6.2.

Under the distributional Assumption 1, there exist numerical constants c1, c2 > 0 such that the following holds for all $\alpha\geq c_1L^2\sqrt{\frac{\log p}{n}}$. If, for some partition G and decomposition Σ = ACA^t + R + D, we have

$$|R|\;\le\;\frac{\lambda_{\min}(\Sigma)}{2}\wedge 2\alpha\qquad\text{and}\qquad\mathrm{MCOD}(\Sigma,G)>3\alpha|\Sigma|,\qquad(6.1)$$

then COD recovers G with probability higher than 1 − c2/p.

The proof is given in Section D.1 of the Supplementary Material [14]. If G satisfies the assumptions of Theorem 6.2, then it follows from Proposition 6.1 that G = G1[K] for some K > 0. First consider the situation where the tuning parameter α is chosen to be of the order $\sqrt{\log(p)/n}$. If MCOD(Σ, G*) ≥ 3α|Σ|, then COD selects G* with high probability. If MCOD(Σ, G*) is smaller than this threshold, then no procedure is able to recover G* with high probability (Theorem 3.1). Nevertheless, COD is able to recover a coarser partition G1[K] whose MCOD metric MCOD(Σ, G) is higher than the threshold 3α|Σ| and whose matrix R is small enough. For larger α, COD recovers a coarser partition G (corresponding to G1[K] with a smaller K), whose approximation error |R| is allowed to be larger.

6.3. The PECOK algorithm for approximate G-block covariance models.

In this subsection, we investigate the behavior of PECOK under approximate G-block models. The number K of groups being fixed, we assume that ρ2(Σ, K) > 8, so that G2[K] is well defined. We shall prove that PECOK recovers G2[K] with high probability. Abusing notation, we denote in this subsection by G* the target partition G2[K], by B* the associated partnership matrix, and by (A, C*, D, R) ∈ P_2(Σ, K) any decomposition of Σ maximizing Δ(C)/|R|.

Similar to Proposition 5.1, we first provide sufficient conditions on C* under which a population version of PECOK can recover the true partition.

Proposition 6.3.

If $\Delta(C^*)>7|D|_V+2\|R\|_{op}/\sqrt{m}+3|R|$, then $B^*=\operatorname*{arg\,min}_{B\in\mathcal C}\,-\langle\Sigma,B\rangle$.

Corollary 6.4.

If $\Delta(C^*)>3|R|+2\|R\|_{op}/\sqrt{m}$, then $B^*=\operatorname*{arg\,min}_{B\in\mathcal C}\,-\langle\Sigma-D,B\rangle$.

In contrast to the exact G-block model, the cluster separation Δ(C*) now needs to be larger than a multiple of |R| for the population version to recover the true partition; as discussed in Section 6.1, such a condition on |R| is in fact necessary. In comparison with the necessary conditions discussed in Section 6.1, there is an additional ‖R‖_op/√m term. The proofs are given in Section A.2 of the Supplementary Material [14].

We now examine the behavior of PECOK when we specify the estimator Γ^ to be as in (5.7). Note that in this approximate block covariance setting, the diagonal estimator Γ^ is in fact an estimator of the diagonal matrix D. In order to derive deviation bounds for our estimator Γ^, we need the following diagonal dominance assumption.

Assumption 2 (diagonal dominance of Г).

The matrix Г = D + R fulfills

$$\Gamma_{aa}\;\ge\;3\max_{c:\,c\neq a}|\Gamma_{ac}|\qquad\bigl(\text{or equivalently }\;D_{aa}\ge3\max_{c:\,c\neq a}|R_{ac}|\bigr).\qquad(6.2)$$
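For a candidate matrix Γ, condition (6.2) is straightforward to check numerically; the small helper below is an illustrative sketch (the function name is ours).

```python
import numpy as np

def satisfies_assumption_2(Gamma, factor=3.0):
    """Check Gamma_aa >= factor * max_{c != a} |Gamma_ac| for every row a."""
    off_diag = np.abs(Gamma - np.diag(np.diag(Gamma)))   # off-diagonal entries, i.e. |R_ac|
    return bool(np.all(np.diag(Gamma) >= factor * off_diag.max(axis=1)))
```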

The next theorem states that the PECOK estimator B^ recovers the groups under conditions similar to those of Theorem 5.3, provided R is small enough. The proof is given in Section A.3 of the Supplementary Material [14].

Theorem 6.5.

There exist four positive constants c1, c2, cL, cL′ such that the following holds. Under Assumptions 1 and 2, and when L^4 log(p) ≤ c1 n and

$$|R|+\sqrt{|R|\,|D|}+\frac{\|R\|_{op}}{\sqrt m}\;\le\;c_L\,\|\Gamma\|_{op}\Bigl\{\sqrt{\frac{\log p}{m\,n}}+\sqrt{\frac{p}{n\,m^2}}+\frac{\log(p)}{n}+\frac{p}{n\,m}\Bigr\}\qquad(6.3)$$

we have B^=B* and G^=G*, with probability higher than 1 − c2/p, as soon as

$$\Delta(C^*)\;\ge\;c'_L\,\|\Gamma\|_{op}\Bigl\{\sqrt{\frac{\log p}{m\,n}}+\sqrt{\frac{p}{n\,m^2}}+\frac{\log(p)}{n}+\frac{p}{n\,m}\Bigr\}.\qquad(6.4)$$

So, as long as |R| and ‖R‖op are small enough that (6.3) is satisfied, the PECOK algorithm correctly identifies the target partition G* at the Δ-(near) minimax-optimal level (6.4). A counterpart of Theorem 6.5 under Assumption 1-bis is provided in Section A.3 of the Supplementary Material [14].
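For completeness, the corrected convex relaxation at the heart of PECOK can be written compactly with a generic SDP modeling tool such as cvxpy. The sketch below is ours and only schematic: it assumes the feasible set of Section 5 takes the usual relaxed K-means form (positive semidefinite, entrywise nonnegative, rows summing to one, trace equal to K), it takes the correction Gamma_hat as given, and it omits the final rounding step that turns the solution into a partition.

```python
import cvxpy as cp

def pecok_relaxation(Sigma_hat, Gamma_hat, K):
    """Solve max <Sigma_hat - Gamma_hat, B> over a relaxed set of partnership matrices."""
    p = Sigma_hat.shape[0]
    B = cp.Variable((p, p), PSD=True)           # B symmetric positive semidefinite
    constraints = [B >= 0,                      # entrywise nonnegativity
                   cp.sum(B, axis=1) == 1,      # each row sums to one
                   cp.trace(B) == K]            # trace equals the number of groups
    objective = cp.Maximize(cp.trace((Sigma_hat - Gamma_hat) @ B))
    cp.Problem(objective, constraints).solve()
    return B.value
```

Maximizing the inner product with Sigma_hat - Gamma_hat is equivalent to minimizing its negative, which is the form used in the population statements of Proposition 6.3 and Corollary 6.4.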

7. Data analysis.

Using functional MRI data, [37] found that putative areas of the human brain are organized into clusters, sometimes referred to as networks or functional systems. We use a publicly available fMRI data set to illustrate the clusters recovered by different methods. The data set was originally published in [39] and is publicly available from OpenfMRI (https://openfmri.org/data-sets) under the accession number ds000007. We focus on analyzing two scan sessions from subject 1 under a visual-motor stop/go task (task 1). Before performing the analysis, we follow the preprocessing steps suggested by [39], and we follow [37] to subsample the whole-brain data using p = 264 putative areas; see Section A.3 of the Supplementary Material [14] for details. The subject was scanned in two separate sessions, and each session yielded n = 180 samples for each putative area.

We apply our data-splitting approach described in Section 4.3 to these two-session data. Using the first scan session only, we first estimate G^ using COD and COD-CC on a fine grid of $\alpha=c\sqrt{\log(p)/n}$, where c = 0.5, 0.6, …, 3. For a fair comparison, we set K in PECOK to be the same as the resulting values of K found by COD. We then use the second session data to evaluate the loss H(G) given in Section 4.3. Among our methods (COD, COD-CC and PECOK), COD yields the smallest loss when K = 142. We thus first focus on illustrating the COD clusters here. Table 2 lists the largest cluster of putative areas recovered by COD and their functional classification based on prior knowledge. Most of these areas are classified as related to visual, motor and task functioning, which is consistent with our experimental task, in which the subject performs motor responses based on visual stimuli. Figure 1(a) plots the locations of these coordinates on a standard brain template. It shows that our COD cluster appears to come mostly from approximately symmetric locations in the left and right hemispheres, though we do not enforce this functional symmetry in our algorithm. Note that the original coordinates in [37] are not sampled with exact symmetry from both hemispheres of the brain, and thus we do not expect exactly symmetric locations in the resulting clusters based on these coordinates.

Table 2.

MNI coordinates (X, Y, Z, in mm) of the largest COD group and their functional classification

X Y Z Function X Y Z Function
40 −72 14 visual −7 −21 65 motor
−28 −79 19 visual −7 −33 72 motor
20 −66 2 visual 13 −33 75 motor
29 −77 25 visual 10 −46 73 motor
37 −81 1 visual 36 −9 14 motor
47 10 33 task −53 −10 24 motor
−41 6 33 task −37 −29 −26 uncertain
38 43 15 task 52 −34 −27 uncertain
−41 −75 26 default −58 −26 −15 uncertain
8 48 −15 default −42 −60 −9 attention
22 39 39 default −11 26 25 saliency

Fig. 1.

(a) Plot of the coordinates of the largest COD cluster overlaid on a standard brain template. The coordinates are shown as red balls. (b) Comparison of COD, COD-CC, PECOK, K-means, HC and SC using the Frobenius prediction loss criterion (7.1), where the groups are estimated by each of these methods.

Because there is no gold standard for partitioning the brain, we follow common practice and use a prediction criterion to further compare the clustering performance of the different methods. For a fair comparison, we also estimate G^ using K-means, HC and spectral clustering with the same values of K found by COD. The prediction criterion is as follows. We first compute the covariance matrices S^1 and S^2 from the first and second session data, respectively. For a grouping estimate G^, we use the following loss to evaluate its performance:

$$\bigl\|\hat S_2-\Upsilon(\hat S_1,\hat G)\bigr\|_F,\qquad(7.1)$$

where the block-averaging operator Υ(R, G) produces a G-block structured matrix based on G^. For any a ∈ G_k and b ∈ G_{k′}, the output matrix entry [Υ(R, G)]_{ab} is given by

$$[\Upsilon(R,G)]_{ab}=\begin{cases}\dfrac{1}{|G_k|(|G_k|-1)}\displaystyle\sum_{i,j\in G_k,\,i\neq j}R_{ij}&\text{if }a\neq b\text{ and }k=k',\\[2ex]\dfrac{1}{|G_k|\,|G_{k'}|}\displaystyle\sum_{i\in G_k,\,j\in G_{k'}}R_{ij}&\text{if }a\neq b\text{ and }k\neq k',\\[1ex]1&\text{if }a=b.\end{cases}$$

In essence, this operator smooths the matrix entries over pairs of indices belonging to the same pair of groups. One may expect that such smoothing over variables within a true cluster will reduce the loss (7.1), while smoothing across different clusters will increase it.
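For reference, the operator Υ and the loss (7.1) translate directly into code; in the sketch below the groups are passed as lists of column indices, and the function names are ours.

```python
import numpy as np

def block_average(R, groups):
    """Block-averaging operator Upsilon(R, G) as defined in the display above."""
    p = R.shape[0]
    out = np.empty((p, p))
    for k, Gk in enumerate(groups):
        for l, Gl in enumerate(groups):
            block = R[np.ix_(Gk, Gl)]
            if k == l:
                m = len(Gk)
                # average of the off-diagonal entries of the within-group block
                avg = (block.sum() - np.trace(block)) / (m * (m - 1)) if m > 1 else 0.0
            else:
                # average of all entries of the between-group block
                avg = block.mean()
            out[np.ix_(Gk, Gl)] = avg
    np.fill_diagonal(out, 1.0)                  # diagonal entries are set to 1
    return out

def prediction_loss(S2, S1, groups):
    """Frobenius prediction loss (7.1): || S2 - Upsilon(S1, G) ||_F."""
    return np.linalg.norm(S2 - block_average(S1, groups), ord="fro")
```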

Figure 1(b) compares the prediction loss values of each method across group sizes. It shows that our data-splitting approach for COD selects a value K = 142 that is immediately next to the slightly larger value K = 206, the latter having the smallest prediction loss, near the bottom plateau of the curve. However, the differences are almost negligible. This suggests that our data-splitting criterion, which comes with theoretical guarantees, also provides good prediction performance in this real data example, while selecting a slightly smaller K, as desired, since this makes the resulting clusters easier to describe and interpret.

Figure 1(b) also shows that, across the choices of K and α, COD almost always yields the smallest prediction loss, while PECOK does slightly better when K is between 5 and 10. Though COD-CC has large losses for medium or small K, its performance is very close to that of the best performer, COD, near K = 146. K-means is the closest competing method in this example, while the other two methods (HC and SC) yield larger losses across the choices of K.

8. Discussion.

In this section, we discuss some related models and give an overall recommendation on the usage of our methods.

8.1. Comparison with stochastic block model.

The problem of variable clustering that we consider in this work is fundamentally different from that of clustering from network data. The latter, especially in the context of the stochastic block model (SBM), has received a large amount of attention over the past years; see, for instance, [2, 16, 21, 27–29, 33]. The most important difference stems from the nature of the data: the data analyzed via the SBM is a p × p binary matrix A, called the adjacency matrix, with entries assumed to have been generated as independent Bernoulli random variables, and its expected value is assumed to have a block structure. In contrast, the data matrix generated from a G-block covariance model is an n × p matrix with real entries, whose rows are viewed as i.i.d. copies of a p-dimensional vector X with mean zero and dependent entries. The covariance matrix Σ of X is assumed to have (up to the diagonal) a block structure.

Need for a correction.

Even though the analysis of the methods in our setting differs from that in the SBM setting, one could have applied clustering procedures tailored to SBMs to the empirical covariance matrix Σ^ = X^tX/n, treating it as a weighted adjacency matrix. It turns out that applying verbatim the spectral clustering procedure of Lei and Rinaldo [28], or SDPs such as the ones in [3], would lead to poor results. The main reason is that, in our setting, both the spectral algorithm and the SDP need to be corrected in order to recover the correct clusters (Section 5). Second, the SDPs studied in the SBM context (such as those of [3]) do not properly handle groups with different and unknown sizes, contrary to our SDP. To the best of our knowledge, our SDP (without correction) has only been independently studied by Mixon et al. [32], in the context of Gaussian mixtures.
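To make the preceding point concrete, the spectral step we have in mind can be sketched as follows. This is not the exact CSC algorithm of Section 5, whose specific correction Γ^ and normalization are not reproduced here, but a generic illustration: K-means is applied to the rows of the matrix of leading eigenvectors of the corrected matrix Σ^ − Γ^; running the same code on Σ^ alone corresponds to the uncorrected procedure discussed above.

```python
import numpy as np
from sklearn.cluster import KMeans

def corrected_spectral_clustering(Sigma_hat, Gamma_hat, K, seed=0):
    """K-means on the rows of the top-K eigenvector matrix of Sigma_hat - Gamma_hat."""
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat - Gamma_hat)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # p x K matrix of leading eigenvectors
    labels = KMeans(n_clusters=K, n_init=20, random_state=seed).fit(U).labels_
    return labels                                    # estimated group label for each variable
```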

Analysis of the SDP.

As for the mathematical arguments, our analysis of the SDP in our covariance-type model differs from that in mean-type models, partly because of the presence of nontrivial cross-product terms. Instead of relying on dual certificate arguments as in other work, such as [35], we directly investigate the primal problem and combine different duality-norm bounds. The crucial step is Lemma A.3 in the Supplementary Material [14], which allows us to control the Frobenius inner product by an (unusual) combination of ℓ1 and spectral controls. In our opinion, our approach is more transparent than dual certificate techniques, especially in the presence of a correction Γ^, and allows for the attainment of optimal convergence rates.

8.2. Extension to other models.

The general strategy of correcting a convex relaxation of K-means can be applied to other models. In [38], one of the authors adapted the PECOK algorithm to the problem of clustering mixtures of sub-Gaussian distributions. In particular, in the high-dimensional setting, where the correction plays a key role, [38] obtains sharper dependencies in the separation conditions than state-of-the-art clustering procedures [32]. Extensions to model-based overlapping clustering are beyond the scope of this paper; we refer to [11] for recent results.

8.3. Practical recommendations.

Based on our extensive simulation studies, we conclude this section with general recommendations on the usage of our proposed algorithms.

If p is moderate in size, and if there are reasons to believe that no singletons exist in a particular application, or if they have been removed in a preprocessing step, we recommend the PECOK algorithm, which is numerically superior to existing methods: exact recovery can be reached for relatively small sample sizes. COD is also very competitive, but requires a slightly larger sample size to reach the same performance as PECOK. The constraint on the size of p reflects the existing computational limits of state-of-the-art SDP solvers, not the statistical capabilities of the procedure, whose theoretical analysis is one of the foci of this work.

If p is large, we recommend COD-type algorithms. Since COD is optimization-free, it scales very well with p, and only requires a moderate sample size to reach exact cluster recovery. Moreover, COD adapts very well to data that contains singletons and, more generally, to data that is expected to have many inhomogeneous clusters.


Acknowledgments.

We thank the Editors and anonymous reviewers for their helpful suggestions. We thank Andrea Montanari for pointing to us the reference [32].

This work was supported in part by the CNRS PICS grant HighClust.

The first author was supported in part by NSF Grant DMS-1712709.

The second author was supported in part by the LabEx LMH, ANR-11-LABX-0056-LMH.

The third author was supported in part by NSF Grant DMS-1557467 and NIH Grants R01EB022911, P01AA019072, P20GM103645, P30AI042853 and S10OD016366.

The fourth author was supported by IDEX Paris-Saclay IDI Grant ANR-11-IDEX-0003-02.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Model assisted variable clustering: Minimax-optimal recovery and algorithms” (DOI: 10.1214/18-AOS1794SUPP; .pdf). This supplement contains proofs of the theoretical results, the simulation results and additional supporting information regarding the data analysis.

REFERENCES

  • [1] Abbe E, Fan J, Wang K and Zhong Y (2017). Entrywise eigenvector analysis of random matrices with low expected rank. ArXiv preprint arXiv:1709.09565.
  • [2] Abbe E and Sandon C (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science—FOCS 2015 670–688. IEEE Computer Soc., Los Alamitos, CA. MR3473334
  • [3] Amini AA and Levina E (2018). On semidefinite relaxations for the block model. Ann. Statist. 46 149–179. MR3766949 10.1214/17-AOS1545
  • [4] Arthur D and Vassilvitskii S (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035. ACM, New York. MR2485254
  • [5] Awasthi P, Charikar M, Krishnaswamy R and Sinop AK (2015). The hardness of approximations of Euclidean k-means. In 31st International Symposium on Computational Geometry. LIPIcs. Leibniz Int. Proc. Inform. 34 754–767. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern. MR3392820
  • [6] Banerjee O, El Ghaoui L and d’Aspremont A (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516. MR2417243
  • [7] Bellec P, Perlbarg V, Jbabdi S, Pélégrini-Issac M, Anton J-L, Doyon J and Benali H (2006). Identification of large-scale networks in the brain using fMRI. NeuroImage 29 1231–1243.
  • [8] Bernardes JS, Vieira FR, Costa LM and Zaverucha G (2015). Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinform. 16 1–14.
  • [9] Berthet Q and Rigollet P (2013). Complexity theoretic lower bounds for sparse principal component detection. In Proceedings of the 26th Annual Conference on Learning Theory (Shalev-Shwartz S and Steinwart I, eds.). Proceedings of Machine Learning Research 30 1046–1066. PMLR, Princeton, NJ.
  • [10] Berthet Q, Rigollet P and Srivastava P (2018). Exact recovery in the Ising blockmodel. Ann. Statist. To appear. arXiv:1612.03880.
  • [11] Bing M, Bunea F, Ning Y and Wegkamp M (2018). Adaptive estimation in structured factor models with applications to overlapping clustering. ArXiv e-prints.
  • [12] Bouveyron C and Brunet-Saumard C (2014). Model-based clustering of high-dimensional data: A review. Comput. Statist. Data Anal. 71 52–78. MR3131954 10.1016/j.csda.2012.12.008
  • [13] Bunea F, Giraud C and Luo X (2015). Minimax optimal variable clustering in G-models via cord. ArXiv preprint arXiv:1508.01939.
  • [14] Bunea F, Giraud C, Luo X, Royer M and Verzelen N (2020). Supplement to “Model assisted variable clustering: Minimax-optimal recovery and algorithms.” 10.1214/18-AOS1794SUPP.
  • [15] Bunea F, Giraud C, Royer M and Verzelen N (2016). PECOK: A convex optimization approach to variable clustering. ArXiv preprint arXiv:1606.05100.
  • [16] Chen Y and Xu J (2016). Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J. Mach. Learn. Res. 17 Paper No. 27, 57. MR3491121
  • [17] Chong M, Bhushan C, Joshi AA, Choi S, Haldar JP, Shattuck DW, Spreng RN and Leahy RM (2017). Individual parcellation of resting fMRI with a group functional connectivity prior. NeuroImage 156 87–100. 10.1016/j.neuroimage.2017.04.054
  • [18] Craddock RC, James GA, Holtzheimer PE, Hu XP and Mayberg HS (2012). A whole brain fMRI atlas generated via spatially constrained spectral clustering. Hum. Brain Mapp. 33 1914–1928.
  • [19] Frei N, Garcia AV, Bigeard J, Zaag R, Bueso E, Garmier M, Pateyron S, de Tauzia-Moreau ML, Brunaud V et al. (2014). Functional analysis of Arabidopsis immune-related MAPKs uncovers a role for MPK3 as negative regulator of inducible defences. Genome Biol. 15 1–22.
  • [20] Glasser MF, Coalson TS, Robinson EC, Hacker CD, Harwell J, Yacoub E, Ugurbil K, Andersson J, Beckmann CF et al. (2016). A multi-modal parcellation of human cerebral cortex. Nature 536 171–178.
  • [21] Guédon O and Vershynin R (2016). Community detection in sparse networks via Grothendieck’s inequality. Probab. Theory Related Fields 165 1025–1049. MR3520025 10.1007/s00440-015-0659-z
  • [22] James GA, Hazaroglu O and Bush KA (2016). A human brain atlas derived via n-cut parcellation of resting-state and task-based fMRI data. Magn. Reson. Imaging 34 209–218.
  • [23] Jiang D, Tang C and Zhang A (2004). Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. 16 1370–1386.
  • [24] Koltchinskii V and Lounici K (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110–133. MR3556768 10.3150/15-BEJ730
  • [25] Kong R, Li J, Sun N, Sabuncu M, Liu H, Schaefer A, Zuo X-N, Holmes A, Eickhoff S et al. (2018). Spatial topography of individual-specific cortical networks predicts human cognition, personality and emotion. https://www.biorxiv.org/content/early/2018/01/31/213041.
  • [26] Kumar A, Sabharwal Y and Sen S (2004). A simple linear time (1 + ϵ)-approximation algorithm for k-means clustering in any dimensions. In Foundations of Computer Science, 2004. Proceedings. 45th Annual IEEE Symposium on 454–462.
  • [27] Le CM, Levina E and Vershynin R (2016). Optimization via low-rank approximation for community detection in networks. Ann. Statist. 44 373–400. MR3449772 10.1214/15-AOS1360
  • [28] Lei J and Rinaldo A (2015). Consistency of spectral clustering in stochastic block models. Ann. Statist. 43 215–237. MR3285605 10.1214/14-AOS1274
  • [29] Lei J and Zhu L (2014). A generic sample splitting approach for refined community recovery in stochastic block models. ArXiv preprint arXiv:1411.1469.
  • [30] Lloyd SP (1982). Least squares quantization in PCM. IEEE Trans. Inform. Theory 28 129–137. MR0651807 10.1109/TIT.1982.1056489
  • [31] Lu Y and Zhou HH (2016). Statistical and computational guarantees of Lloyd’s algorithm and its variants. ArXiv preprint arXiv:1612.02099.
  • [32] Mixon DG, Villar S and Ward R (2017). Clustering subgaussian mixtures by semidefinite programming. Inf. Inference 6 389–415. MR3764529 10.1093/imaiai/iax001
  • [33] Mossel E, Neeman J and Sly A (2014). Consistency thresholds for binary symmetric block models. ArXiv preprint arXiv:1407.1591.
  • [34] Peng J and Wei Y (2007). Approximating K-means-type clustering via semidefinite programming. SIAM J. Optim. 18 186–205. MR2299680 10.1137/050641983
  • [35] Perry A and Wein AS (2015). A semidefinite program for unbalanced multisection in the stochastic block model. ArXiv e-prints arXiv:1507.05605.
  • [36] Poldrack RA (2007). Region of interest analysis for fMRI. Soc. Cogn. Affect. Neurosci. 2 67–70. 10.1093/scan/nsm006
  • [37] Power JD, Cohen AL, Nelson SM, Wig GS, Barnes KA, Church JA, Vogel AC, Laumann TO, Miezin FM et al. (2011). Functional network organization of the human brain. Neuron 72 665–678.
  • [38] Royer M (2017). Adaptive clustering through semidefinite programming. In Advances in Neural Information Processing Systems (NIPS).
  • [39] Xue G, Aron AR and Poldrack RA (2008). Common neural substrates for inhibition of spoken and manual responses. Cereb. Cortex 18 1923–1932.
  • [40] Yeo BT, Krienen FM, Sepulcre J, Sabuncu MR, Lashkari D, Hollinshead M, Roffman JL, Smoller JW, Zöllei L et al. (2011). The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J. Neurophysiol. 106 1125–1165.
  • [41] Zaag R, Tamby J, Guichard C, Tariq Z, Rigaill G, Delannoy E, Renou J, Balzergue S, Mary-Huard T et al. (2015). GEM2Net: From gene expression modeling to -omics networks, a new CATdb module to investigate Arabidopsis thaliana genes involved in stress response. Nucleic Acids Res. 43 1010–1017.
