Author manuscript; available in PMC: 2015 Oct 20.
Published in final edited form as: J Comput Graph Stat. 2014 Oct 20;23(4):985–1008. doi: 10.1080/10618600.2013.852554

Sparse Biclustering of Transposable Data

Kean Ming Tan, Daniela M. Witten
PMCID: PMC4212513  NIHMSID: NIHMS532803  PMID: 25364221

Abstract

We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log likelihood. We apply an ℓ1 penalty to the means of the biclusters in order to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for biclustering based on the matrix-variate normal distribution. The performances of our proposals are demonstrated in a simulation study and on a gene expression data set. This article has supplementary material online.

Keywords: Clustering, Gene expression, ℓ1 penalty, Matrix-variate normal distribution, Unsupervised learning

1 Introduction

In recent years, much interest has centered around the unsupervised analysis of gene expression data and other types of high-dimensional biological data. Many proposals involve clustering the n observations on the basis of the p features, or clustering the p features on the basis of the n observations. We will refer to such proposals as one-way clustering in this paper, since either the rows or columns of a data matrix are clustered, but not both. An overview of some popular one-way clustering procedures can be found in Hastie et al. (2009).

In certain cases, we may be faced with transposable data, characterized by the fact that both the rows and columns are of scientific interest and may contain clusters or other structure (Lazzeroni & Owen 2002). One such example is gene expression data, in which the rows represent tissue samples and the columns represent genes for which expression measurements were obtained. In this case, there may be subgroups among the rows (corresponding to distinct sets of patients, perhaps with different subtypes of a disease) or subgroups among the columns (corresponding to groups of genes with shared expression patterns, potentially revealing important biological pathways) (Eisen et al. 1998). In this setting, one-way clustering seems inappropriate since it does not reflect the fact that both the rows and the columns are of scientific interest. To address this shortcoming, a number of proposals have been made for biclustering, which involves simultaneously clustering the rows and columns of a data matrix (among others, Cheng & Church 2000, Lazzeroni & Owen 2002, Getz et al. 2000, Tang et al. 2001, Madeira & Oliveira 2004, Cho et al. 2004, Cho & Dhillon 2008, Lee et al. 2010, Hochreiter et al. 2010). We define a bicluster to be a subset of the data matrix, corresponding to a set of observations and a set of features, such that all elements within the subset are similar to each other; some authors refer to this as a co-cluster. The concept of similarity must be defined based on the data set and the scientific question.

In the literature, various authors have used the term bicluster in different ways. Three distinct types of biclusters are displayed in Table 1. The simplest type of bicluster is a constant bicluster (Table 1(a)), in which all elements take on approximately a constant value. Within an additive coherent bicluster (Table 1(b)), an additive model holds for each element; this is related to a two-way ANOVA model. Finally, a multiplicative coherent bicluster (Table 1(c)) stems from a multiplicative model. Biclustering proposals have taken a number of forms, and have been aimed at detecting all three types of biclusters.

Table 1.

Biclusters with (a): constant values; (b): additive coherent values; and (c): multiplicative coherent values. Table adapted from Madeira & Oliveira (2004).

(a)
2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0
(b)
4.0 5.0 7.0 3.0
5.0 6.0 8.0 4.0
3.0 4.0 6.0 2.0
1.0 2.0 4.0 0.0
(c)
0.5 1.0 2.0 1.5
2.0 4.0 8.0 6.0
1.5 3.0 6.0 4.5
1.0 2.0 4.0 3.0
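The structure of the three bicluster types in Table 1 can be checked numerically: the additive block decomposes exactly into a grand mean plus row and column effects (no interaction), and the multiplicative block is a rank-1 outer product. A small NumPy sketch:

```python
import numpy as np

# Table 1(b): additive coherent bicluster -- each entry is the grand mean
# plus a row effect plus a column effect (two-way ANOVA, no interaction).
B = np.array([[4., 5., 7., 3.],
              [5., 6., 8., 4.],
              [3., 4., 6., 2.],
              [1., 2., 4., 0.]])
row_eff = B.mean(axis=1, keepdims=True) - B.mean()
col_eff = B.mean(axis=0, keepdims=True) - B.mean()
assert np.allclose(B, B.mean() + row_eff + col_eff)  # additive model holds exactly

# Table 1(c): multiplicative coherent bicluster -- a rank-1 outer product.
C = np.array([[0.5, 1.0, 2.0, 1.5],
              [2.0, 4.0, 8.0, 6.0],
              [1.5, 3.0, 6.0, 4.5],
              [1.0, 2.0, 4.0, 3.0]])
assert np.linalg.matrix_rank(C) == 1  # every row is a multiple of [1, 2, 4, 3]
```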

Gene expression data is high-dimensional, in the sense that p ≫ n. In this setting, it might be reasonable to assume that most genes do not contribute much to or differ between the biological conditions being studied, and so in a sense can be considered to be noise. A number of authors have recently suggested performing sparse one-way clustering of the observations in gene expression data, so that just a subset of the genes is used to cluster the observations (Pan & Shen 2007, Wang & Zhu 2008, Xie et al. 2008, Witten & Tibshirani 2010). This can yield more accurate clusters, and also allows biologists to focus their research efforts on those selected genes.

In this paper, we extend sparse one-way clustering to the biclustering problem. Assume that each element of the data matrix follows a normal distribution with a bicluster-specific mean value and a common variance. We can estimate the biclusters by maximizing the corresponding log likelihood. To achieve sparse biclustering, we maximize the ℓ1-penalized log likelihood. The proposed approach is illustrated on a toy example in Figure 1, in which it is shown that biclustering can result in more accurate cluster discovery than independent one-way clustering of the rows and columns of a data matrix. Our approach identifies constant and contiguous biclusters, as in Table 1(a).

Figure 1.

Figure 1

(a): A heatmap of a simulated 100 × 200 data set, with five row clusters and five column clusters. (b): True underlying mean signal within each cluster. (c): Mean signal estimated by independent 5-means clustering of the rows and 5-means clustering of the columns. (d): Mean signal estimated by biclustering, as described in Algorithm 1, with K=5, R=5, and λ=0. Biclustering results in more accurate clustering of both the rows and the columns than does independent 5-means clustering.

The rest of this paper is organized as follows. In Section 2, we review the biclustering literature. Section 3 contains our proposal for sparse biclustering, and in Section 4, we motivate our biclustering proposal further by exploring its connection with the singular value decomposition. In Section 5 we present an approach for selecting the tuning parameters associated with this proposal. In Section 6 we present the results of simulation studies, and Section 7 contains an application to a gene expression data set. We propose a more general formulation for biclustering using the matrix-variate normal distribution in Section 8. The Discussion is in Section 9.

2 Past work on biclustering

In the literature, biclustering proposals have taken a number of forms, and date back to at least Hartigan (1972). For instance, some authors have independently clustered the rows and the columns of the data matrix, and others have suggested performing matrix factorization and examining the resulting singular vectors in order to identify biclusters. In addition, some biclustering proposals allow overlapping biclusters while others identify biclusters as contiguous block matrices. A detailed review of past proposals is outside of the scope of this paper, but can be found in Madeira & Oliveira (2004) and Prelic et al. (2006). Here, we briefly review three proposals for biclustering that form the basis for comparisons in the later sections of this paper. These three methods are included in comparisons because, like the proposal in this paper, they assume that most elements of the data matrix take on a common mean value. If the data matrix is centered appropriately, then this leads to a sparse estimate of the mean matrix.

Lazzeroni & Owen (2002) introduced the plaid model for transposable data, in which $X_{ij} = \sum_{k=1}^{K} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk}$, where ρik and κjk are binary values that equal one if the ith observation and jth variable belong to the kth bicluster. The plaid model identifies constant biclusters when θijk = μk, and additive coherent biclusters result when θijk = μk + αik + βjk. The parameters are estimated by minimizing the quantity $\sum_{i=1}^{n}\sum_{j=1}^{p}\bigl(X_{ij} - \sum_{k=1}^{K}\theta_{ijk}\rho_{ik}\kappa_{jk}\bigr)^2$. Turner et al. (2005) developed the improved plaid (IP) approach, an improved algorithm for this task, which is challenging due to the constraint that ρik and κjk are binary.

More recently, Shabalin et al. (2009) proposed an algorithm for finding constant biclusters, termed large average submatrices (LAS), using the model $X_{ij} = \sum_{k=1}^{K} \mu_k\, 1_{\{(i,j) \in B_k\}} + \varepsilon_{ij}$, where $1_{\{(i,j) \in B_k\}}$ is an indicator function for whether the ith row and jth column belong to the kth bicluster, μk is a mean term, and εij is a noise term. The algorithm seeks to find a bicluster that maximizes a significance score on the residual matrix obtained by subtracting out the biclusters identified in previous iterations.

An entirely different approach based on the singular value decomposition (SVD) is taken by Lee et al. (2010) and Hochreiter et al. (2010). They proposed to identify multiplicative biclusters using a low-rank approximation: $X \approx \sum_{k=1}^{K} s_k u_k v_k^T$, where sk is a scalar and uk and vk are vectors of lengths n and p. Lee et al. (2010) estimated the parameters subject to sparsity-inducing penalties on uk and vk; we will refer to this as the sparse SVD (SSVD) approach. Hochreiter et al. (2010) imposed sparsity on the vectors uk and vk using a Bayesian approach. Both sets of authors declared the matrix elements corresponding to non-zero elements of uk and vk to make up the kth bicluster.

In this paper, we propose sparse biclustering under the assumptions that (1) each matrix element is normally distributed with a bicluster-specific mean, and (2) the biclusters partition the rows and columns of the matrix. Our proposal can be thought of as a generalization of k-means clustering to biclustering, and also a sparse and constrained version of the SVD.

3 Sparse biclustering

In what follows, X is an n × p matrix with n observations and p features. We assume that the n observations belong to K unknown and non-overlapping classes, C1, …, CK, and the p features belong to R unknown and non-overlapping classes, D1, …, DR.

3.1 An approach for biclustering

Assume that all matrix elements are independent, and that Xij ~ N (μkr, σ2) for iCk, jDr. We wish to estimate Ck, Dr, and μkr for k = 1, …, K and r = 1, …, R. Maximizing the log likelihood of the data under this model is equivalent to

$$\underset{C_1,\ldots,C_K,\; D_1,\ldots,D_R,\; \mu \in \mathbb{R}^{K \times R}}{\text{minimize}} \left\{ \sum_{k=1}^{K} \sum_{r=1}^{R} \sum_{i \in C_k} \sum_{j \in D_r} (X_{ij} - \mu_{kr})^2 \right\}, \quad (1)$$

which is easily seen to reduce to k-means clustering of the observations into K clusters if R = p, and k-means clustering of the features into R clusters if K = n. Note that solving (1) results in the discovery of KR biclusters, each of which consists of |Ck||Dr| elements – namely, the observations in Ck and the features in Dr.
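As a concrete illustration, the objective in (1) can be evaluated directly, and the reduction to k-means clustering of the rows when R = p can be checked numerically. This is a sketch, and the function name `biclust_objective` is ours:

```python
import numpy as np

def biclust_objective(X, row_labels, col_labels, mu):
    """Within-bicluster sum of squares from (1):
    sum_{k,r} sum_{i in C_k, j in D_r} (X_ij - mu_kr)^2."""
    K, R = mu.shape
    total = 0.0
    for k in range(K):
        for r in range(R):
            block = X[np.ix_(row_labels == k, col_labels == r)]
            total += ((block - mu[k, r]) ** 2).sum()
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
rows = np.array([0, 0, 1, 1, 1, 0])   # row-cluster labels (K = 2)

# Special case R = p: each feature is its own column cluster, and the
# optimal mu is the matrix of row-cluster centroids, so (1) equals the
# k-means within-cluster sum of squares for the rows.
cols_id = np.arange(X.shape[1])
centroids = np.vstack([X[rows == k].mean(axis=0) for k in range(2)])
wss = sum(((X[rows == k] - centroids[k]) ** 2).sum() for k in range(2))
assert np.isclose(biclust_objective(X, rows, cols_id, centroids), wss)
```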

3.2 Sparse biclustering

A shortcoming of (1) is that every row cluster Ck and column cluster Dr is assigned its own mean term μkr, where μkr ≠ 0 in general. If the data matrix X is centered so that its overall mean is zero, then we may suspect that some or many biclusters have a mean term that is approximately zero. In this setting, it may be worth incurring a little bit of additional bias by estimating these mean terms to be exactly zero, in the interest of improved interpretability and reduced variance in the resulting biclusters. It is straightforward to induce sparsity on the mean elements by penalizing (1) using an ℓ1 or lasso penalty (Tibshirani 1996). We arrive at

$$\underset{C_1,\ldots,C_K,\; D_1,\ldots,D_R,\; \mu \in \mathbb{R}^{K \times R}}{\text{minimize}} \left\{ \frac{1}{2} \sum_{k=1}^{K} \sum_{r=1}^{R} \sum_{i \in C_k} \sum_{j \in D_r} (X_{ij} - \mu_{kr})^2 + \lambda \sum_{k=1}^{K} \sum_{r=1}^{R} |\mu_{kr}| \right\}, \quad (2)$$

where λ is a nonnegative tuning parameter. As λ increases, (on average) an increasing number of μkr’s will be estimated to equal zero. If μ̂kr = 0, then this indicates a bicluster (Ck, Dr) for which the overall mean is not substantially different from zero. We note that (2) can be viewed as an extension of some recent sparse one-way clustering proposals (Pan & Shen 2007, Xie et al. 2008, Wang & Zhu 2008) to the biclustering setting, in the sense that if R = p then we are performing sparse k-means clustering of the rows of the data matrix.

Algorithm 1 is a simple iterative approach for finding a local optimum of (2). It is a descent algorithm, and when λ = 0, it amounts to finding a local optimum of (1). We ran Algorithm 1 5,000 times on the same data matrix X, generated as in Section 6.2, using random initializations of the row and column clusters. Across the 5,000 replications, the values of the objective function (2) were always within ±0.5% of their mean.

Algorithm 1.

Sparse biclustering

  1. Initialize D1, …, DR and C1, …, CK by performing one-way k-means clustering on the columns and on the rows of the mean-centered data matrix X.

  2. Iterate until convergence:

    (a) Holding C1, …, CK and D1, …, DR fixed, solve (2) with respect to μ. That is,
      $$\mu_{kr} = \frac{S\!\left(\sum_{i \in C_k} \sum_{j \in D_r} X_{ij},\; \lambda\right)}{|C_k|\,|D_r|}, \quad (3)$$

      where S is the soft-thresholding operator S(a, b) = sign(a)(|a| − b)+, |Ck| is the cardinality of Ck, and |Dr| is the cardinality of Dr.

    (b) Holding D1, …, DR and μ fixed, solve (2) with respect to C1, …, CK, by assigning the ith observation to the row cluster k for which $\sum_{r=1}^{R} \sum_{j \in D_r} (X_{ij} - \mu_{kr})^2$ is smallest.

    (c) Repeat Step 2(a).

    (d) Holding C1, …, CK and μ fixed, solve (2) with respect to D1, …, DR, by assigning the jth feature to the column cluster r for which $\sum_{k=1}^{K} \sum_{i \in C_k} (X_{ij} - \mu_{kr})^2$ is smallest.
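The core updates of Algorithm 1 can be sketched as follows; this is a minimal NumPy illustration of Steps 2(a) and 2(b) only (function names are ours, and the initialization, the column-cluster update, and the convergence check are omitted):

```python
import numpy as np

def soft_threshold(a, b):
    """Soft-thresholding operator S(a, b) = sign(a) * (|a| - b)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def update_means(X, rows, cols, K, R, lam):
    """Step 2(a): mu_kr = S(sum of bicluster entries, lam) / (|C_k||D_r|)."""
    mu = np.zeros((K, R))
    for k in range(K):
        for r in range(R):
            block = X[np.ix_(rows == k, cols == r)]
            if block.size:
                mu[k, r] = soft_threshold(block.sum(), lam) / block.size
    return mu

def update_row_clusters(X, cols, mu):
    """Step 2(b): assign row i to the cluster k minimizing
    sum_r sum_{j in D_r} (X_ij - mu_kr)^2."""
    K, R = mu.shape
    cost = np.zeros((X.shape[0], K))
    for k in range(K):
        for r in range(R):
            cost[:, k] += ((X[:, cols == r] - mu[k, r]) ** 2).sum(axis=1)
    return cost.argmin(axis=1)

# One sweep on toy data (no claim about the converged solution):
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))
rows = rng.integers(0, 2, size=10)
cols = rng.integers(0, 2, size=6)
mu = update_means(X, rows, cols, 2, 2, lam=1.0)
rows = update_row_clusters(X, cols, mu)
```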

We note that in the optimization problem (2), there is a complex interplay between the parameters K, R, and λ. For instance, when λ is extremely large, then μkr = 0 for all k = 1, …, K and r = 1, …, R, and so the values of C1, …, CK and D1, …, DR that minimize (2) are not unique. This problem can also manifest itself for more moderate values of λ. For instance, consider Step 2(a) of Algorithm 1, and suppose that μkr = μk′r = 0 for some k ≠ k′ and for all r = 1, …, R. Then in Step 2(b), $\sum_{r=1}^{R}\sum_{j \in D_r}(X_{ij}-\mu_{kr})^2 = \sum_{r=1}^{R}\sum_{j \in D_r}(X_{ij}-\mu_{k'r})^2$, and so Ck and Ck′ cannot be uniquely assigned. In our implementation of Algorithm 1, we address this problem when it occurs by simply merging the kth and k′th clusters, thereby reducing the total number of row clusters from K to K − 1. We take this approach in the interest of simplicity, though alternative procedures are possible and could lead to lower values of the objective (2).

4 A spectral interpretation for biclustering

Zha et al. (2001) established that a relaxation of k-means clustering yields principal components analysis (PCA), or equivalently, that k-means can be interpreted as a constrained version of PCA in which the kth principal component must take on values in $\{0, 1/\sqrt{n_k}\}$. We will now show that with K = R (i.e. the same number of row and column clusters), the biclustering optimization problem (1) can be relaxed in order to yield the SVD. We first present a lemma that provides an alternative characterization for the SVD.

Lemma 1

Consider the optimization problem

$$\underset{A^T A = I_K,\; B^T B = I_K}{\text{maximize}} \; \|A^T X B\|_F^2, \quad (4)$$

where A and B are n × K and p × K orthogonal matrices and K ≤ min(n, p). The solution is given by A = U1:KQ1 and B = V1:KQ2, where U1:K and V1:K are n × K and p × K matrices whose columns are the first K left and right singular vectors of X respectively, and Q1 and Q2 are any K × K orthogonal matrices.
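Lemma 1 can be spot-checked numerically: with A and B built from the top K singular vectors, the criterion in (4) equals the sum of the top K squared singular values, and random orthogonal matrices do no better. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 8, 5, 3
X = rng.normal(size=(n, p))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# A = U_{1:K}, B = V_{1:K} attains the maximum of ||A^T X B||_F^2,
# which equals the sum of the top K squared singular values of X.
A, B = U[:, :K], Vt[:K].T
val = np.linalg.norm(A.T @ X @ B, 'fro') ** 2
assert np.isclose(val, (s[:K] ** 2).sum())

# Spot-check: a random pair of matrices with orthonormal columns
# does not exceed this value.
A2, _ = np.linalg.qr(rng.normal(size=(n, K)))
B2, _ = np.linalg.qr(rng.normal(size=(p, K)))
assert np.linalg.norm(A2.T @ X @ B2, 'fro') ** 2 <= val + 1e-8
```

Right-multiplying A and B by any K × K orthogonal matrices Q1 and Q2 leaves the criterion unchanged, which is why the solution in Lemma 1 is only unique up to such rotations.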

Finally, we present our theorem.

Theorem 4.1

Consider the problem (4) with two additional constraints:

  1. The elements of the kth column of A are 0 or $1/\sqrt{n_k}$, with $n_k \in \mathbb{Z}^+$ and $\sum_{k=1}^{K} n_k = n$.

  2. The elements of the rth column of B are 0 or $1/\sqrt{p_r}$, with $p_r \in \mathbb{Z}^+$ and $\sum_{r=1}^{K} p_r = p$.

This constrained version of (4) is equivalent to the biclustering optimization problem (1) with K = R. Equivalently, a relaxed version of (1) yields the SVD.

Theorem 4.1 elucidates the difference between performing independent k-means clustering on the rows and columns of a data matrix, and performing biclustering. For the relaxed problem, the two approaches are identical: performing PCA on the rows of a data matrix and PCA on the columns of a data matrix is equivalent to simply computing the SVD of the data matrix. However, for the constrained problem, the two approaches differ, in the sense that k-means clustering and biclustering yield different solutions. Biclustering constitutes a more symmetric and systematic approach. A result closely related to Theorem 4.1 can be found in Cho et al. (2004).

5 Tuning parameter selection

The sparse biclustering proposal (2) involves three tuning parameters: the number of row clusters K, the number of column clusters R, and the sparsity parameter λ. Here we consider the problem of selecting these tuning parameters in an automated fashion.

5.1 Selection of K and R

In order to select K and R, we recast biclustering as a supervised learning problem, as follows. We leave out a random subset of elements from the data matrix X, impute those left-out elements using the overall mean for the data matrix, and bicluster the resulting data matrix. We then assess the extent to which the estimated bicluster mean for the left-out elements differs from the true value of the left-out elements, using squared error loss. A related proposal appears in Witten et al. (2009). This approach, which assumes that λ is fixed, is described in greater detail in Algorithm 2.

Algorithm 2.

Selecting number of row clusters K and column clusters R

  1. Repeat the following procedure T times:

    (a) Let ℳ denote a set containing np/T elements of the form (i, j), where each (i, j) is drawn uniformly at random from {(1, 1), (1, 2), …, (n, p)}.

    (b) Construct a new n × p matrix, X*, for which the elements in ℳ are “missing” and are imputed using the mean of the non-missing values:
      $$X^*_{ij} = \begin{cases} X_{ij} & \text{if } (i,j) \in \mathcal{M}^c \\ \sum_{(i',j') \in \mathcal{M}^c} X_{i'j'} \big/ |\mathcal{M}^c| & \text{if } (i,j) \in \mathcal{M}. \end{cases} \quad (5)$$

    (c) For each pair of values (K, R) of interest:

      (i) Perform sparse biclustering of X* with K row and R column clusters.

      (ii) Construct an n × p matrix A whose (i, j)th element equals the estimated value of μkr, where i ∈ Ck and j ∈ Dr.

      (iii) Calculate the mean squared error that results from estimating the “missing” elements using the corresponding bicluster means,
        $$\sum_{(i,j) \in \mathcal{M}} (X_{ij} - A_{ij})^2 \big/ |\mathcal{M}|. \quad (6)$$
  2. For each pair of values (K, R) that was considered in Step 1(c), compute mK,R, the mean of the quantity (6) across all T iterations, as well as sK,R, its standard error.

  3. Identify the pairs (K, R) for which $m_{K,R} \leq m_{K+1,R+1} + s_{K+1,R+1}$.

  4. Select the (K, R) from Step 3 for which K + R is smallest.
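The hold-out loop of Algorithm 2 can be sketched as follows for a single (K, R) pair. This is a sketch only: `holdout_mse` is our name, `fit_biclusters` stands in for any implementation of Algorithm 1, and we sample the hidden entries without replacement for simplicity.

```python
import numpy as np

def holdout_mse(X, fit_biclusters, K, R, T=5, rng=None):
    """Hide np/T entries, impute them with the grand mean of the rest,
    bicluster, then score the hidden entries against their estimated
    bicluster means. `fit_biclusters(X, K, R)` is assumed to return
    (row_labels, col_labels, mu) with mu of shape (K, R)."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    errs = []
    for _ in range(T):
        mask = np.zeros(n * p, dtype=bool)
        mask[rng.choice(n * p, size=(n * p) // T, replace=False)] = True
        mask = mask.reshape(n, p)
        Xstar = X.copy()
        Xstar[mask] = X[~mask].mean()          # impute the "missing" entries
        rows, cols, mu = fit_biclusters(Xstar, K, R)
        A = mu[rows][:, cols]                  # A_ij = mu_{k(i), r(j)}
        errs.append(((X[mask] - A[mask]) ** 2).mean())
    return np.mean(errs), np.std(errs) / np.sqrt(T)

# Toy check with a trivial one-bicluster "fitter" (illustration only):
def one_bicluster(Xs, K, R):
    return (np.zeros(Xs.shape[0], int), np.zeros(Xs.shape[1], int),
            np.array([[Xs.mean()]]))

m, se = holdout_mse(np.arange(20.).reshape(4, 5), one_bicluster, 1, 1,
                    T=4, rng=np.random.default_rng(0))
```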

In order to explore the performance of this approach for selecting K and R, we conducted a small simulation study with various values of n, p, K, and R. First, each row was randomly assigned to one of the row clusters with uniform probability, and each column was randomly assigned to one of the column clusters with uniform probability. Then, the elements of the matrix X were generated independently, with Xij ~ N(μkr, 2²) for i ∈ Ck, j ∈ Dr, where μkr ~ Unif(−3, 3). We quantified the extent to which Algorithm 2 correctly identified the values of K and R. Occasionally, Algorithm 2 may return multiple results – for instance, two results will be returned if both (K = 3, R = 4) and (K = 4, R = 3) satisfy the criterion in Step 3, and no pair of (K, R) for which K + R < 7 satisfies the criterion. In this case, we gave the algorithm “partial credit” according to the fraction of returned (K, R) pairs that are correct. Results are in Table 2.

Table 2.

Simulation study to evaluate the performance of Algorithm 2 for tuning parameter selection. Results are reported over 50 simulated data sets. We report the overall accuracy, i.e. the proportion of the data sets for which the correct values of both K and R were identified. We also report the mean (and standard errors) of the K and R values obtained.

True value of (K, R)   n    p    Overall Accuracy   Selected K      Selected R
(K = 2, R = 4)         100  100  56%                2 (0)           3.48 (0.0914)
(K = 2, R = 4)         100  500  66%                2 (0)           3.60 (0.0857)
(K = 2, R = 4)         500  100  70%                2 (0)           3.68 (0.0725)
(K = 2, R = 4)         500  500  94%                2 (0)           3.94 (0.0339)
(K = 6, R = 3)         100  100  44%                5.26 (0.1100)   3 (0.0286)
(K = 6, R = 3)         100  500  74%                5.7 (0.0769)    3 (0)
(K = 6, R = 3)         500  100  68%                5.68 (0.0666)   3 (0)
(K = 6, R = 3)         500  500  94%                5.92 (0.0481)   3 (0)

5.2 Selection of λ

We now assume that K and R are known, or else were already selected using Algorithm 2 with λ = 0. We select λ using an approach motivated by BIC. For a given value of λ, we perform sparse biclustering, and create a (np) × (q + 1) design matrix, where q is equal to the number of non-zero μ̂kr’s in the sparse biclustering output. The first column is a vector of 1’s corresponding to an intercept, and the remaining columns contain 1’s and 0’s, indicating whether a given element of the matrix is part of the corresponding non-zero-mean bicluster in the sparse biclustering output. We fit a least squares regression model that uses this design matrix to predict the matrix elements, and compute BIC using the formula

$$\mathrm{BIC} = np \times \log(\mathrm{RSS}) + \log(np) \times q,$$

where RSS is the usual residual sum of squares. We then select the value of λ that leads to the smallest value of BIC.
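Assuming the BIC formula stated above, the score for one fitted model can be computed as follows. This is a sketch (`bic_for_lambda` is our name); the design matrix is built as described, with an intercept column plus one 0/1 indicator column per non-zero-mean bicluster:

```python
import numpy as np

def bic_for_lambda(X, rows, cols, mu):
    """BIC-style score of Section 5.2 for one sparse-biclustering fit:
    regress the np matrix entries on an intercept plus one indicator per
    non-zero-mean bicluster, then BIC = np*log(RSS) + log(np)*q."""
    n, p = X.shape
    y = X.ravel()
    nz = np.argwhere(mu != 0)            # (k, r) pairs with nonzero mean
    q = len(nz)
    D = np.ones((n * p, q + 1))          # first column: intercept
    for c, (k, r) in enumerate(nz, start=1):
        # indicator: is entry (i, j) in bicluster (C_k, D_r)?
        D[:, c] = np.outer(rows == k, cols == r).ravel()
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    RSS = ((y - D @ beta) ** 2).sum()
    return n * p * np.log(RSS) + np.log(n * p) * q
```

The value of λ giving the smallest such score would then be selected.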

6 A simulation study

We compared the performance of our biclustering proposal to independent one-way k-means clustering of the rows and columns in a simulation setting with constant and contiguous non-zero biclusters (Simulation 1). In addition, we compared our biclustering proposal to a number of competitors under three simulation settings: in Simulation 2 there are constant and contiguous biclusters with some of the bicluster means exactly equal to zero, in Simulation 3 there are multiplicative biclusters, and in Simulation 4 there are overlapping biclusters.

6.1 Biclustering methods used in our comparisons

We compared the following biclustering methods, which were discussed in Sections 2 and 3.

  1. Independent one-way k-means clustering of the rows and of the columns.

  2. Sparse biclustering using Algorithm 1, with several values of λ.

  3. IP (Turner et al. 2005), which is a variant of the plaid model (Lazzeroni & Owen 2002), using the R package biclust available on CRAN (Kaiser et al. 2011).

  4. SSVD (Lee et al. 2010), using the R package s4vd, available on CRAN (Sill & Kaiser 2011).

  5. LAS (Shabalin et al. 2009), using Matlab code available at https://genome.unc.edu/las/.

6.2 Simulation 1: No bicluster means exactly equal zero

We created K = 4 row clusters and R = 5 column clusters by randomly assigning each row to a row cluster and each column to a column cluster with uniform probability. We generated an n × p data matrix X according to Xij ~ N(μkr, 4²) for i ∈ Ck, j ∈ Dr, where μkr ~ Unif(−2, 2), with all elements drawn independently. Then, we mean-centered the matrix X. We performed independent one-way k-means clustering on the rows and on the columns of the matrix, as well as sparse biclustering with various values of λ, and with λ selected automatically as described in Section 5.2.

The clustering error rate (CER; see e.g. Chipman & Tibshirani 2005, Witten & Tibshirani 2010) measures the disagreement between the true and estimated cluster labels. It is one minus the Rand index (Rand 1971). A high value of CER indicates disagreement between the true and estimated clusters, and a value of zero indicates perfect agreement. We used the CER to compare the estimated row and column clusters to the true row and column clusters. We defined the sparsity rate to be the fraction of the μ̂kr’s that exactly equal zero, and we defined the sparsity error rate to be the proportion of μ̂kr’s that were incorrectly set to zero or incorrectly set to be non-zero.
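Since the Rand index is the fraction of object pairs on which two partitions agree, the CER can be computed by direct pair counting. A minimal sketch (`cer` is our name):

```python
from itertools import combinations

def cer(true_labels, est_labels):
    """Clustering error rate: one minus the Rand index, i.e. the fraction
    of object pairs on which the two partitions disagree about whether
    the pair is co-clustered."""
    pairs = list(combinations(range(len(true_labels)), 2))
    disagree = sum(
        (true_labels[i] == true_labels[j]) != (est_labels[i] == est_labels[j])
        for i, j in pairs)
    return disagree / len(pairs)

# CER is invariant to label names: identical partitions give CER = 0.
assert cer([0, 0, 1, 1], [1, 1, 0, 0]) == 0.0
```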

Results are reported in Table 3. We see that biclustering with λ = 0 leads to consistently better results than independent clustering of the rows and columns.

Table 3.

Results from one-way k-means clustering and sparse biclustering for Simulation 1 with n = 200, over 50 simulated data sets. We report the mean (and standard error) of the CER of the rows and columns, and the mean (and standard error) of the sparsity rate. Note that λ̄ is the mean of λ selected across 50 simulations using the approach of Section 5.2. The correct values of K and R were used, since CER is not comparable across different numbers of clusters.

p Method Row CER Column CER Sparsity Rate
200 k-means 0.0873 (0.0079) 0.1055 (0.0078) -
Bicluster λ=0 0.0547 (0.0066) 0.0559 (0.0056) -
Bicluster λ=200 0.0520 (0.0053) 0.0575 (0.0057) 0.0779 (0.0071)
Bicluster λ=400 0.0589 (0.0063) 0.0699 (0.0065) 0.1665 (0.0111)
Bicluster λ=800 0.0865 (0.0091) 0.0971 (0.0078) 0.2588 (0.0127)
Bicluster λ̄= 320 0.0534 (0.0057) 0.0644 (0.0063) 0.1338 (0.0110)

500 k-means 0.0254 (0.0048) 0.0755 (0.0061) -
Bicluster λ=0 0.0108 (0.0034) 0.0474 (0.0043) -
Bicluster λ=200 0.0109 (0.0032) 0.0475 (0.0044) 0.0237 (0.0052)
Bicluster λ=400 0.0095 (0.0031) 0.0478 (0.0042) 0.0560 (0.0061)
Bicluster λ=800 0.0122 (0.0034) 0.0557 (0.0051) 0.1158 (0.0089)
Bicluster λ̄ = 442 0.0100 (0.0032) 0.0480 (0.0043) 0.0891 (0.009)

6.3 Simulation 2: Some bicluster means exactly equal zero

We modified Simulation 1 so that μkr ~ Unif[(−2.5, −1.5) ∪ (1.5, 2.5)] or μkr = 0 with equal probability. We compared sparse biclustering with several competitors as described in Section 6.1:

  • For IP, we used the R package biclust to identify constant biclusters, with a background layer, and with row and column release parameters set to 0.5 as in Turner et al. (2005).

  • For LAS, we used the default settings in the Matlab code. We discarded biclusters with a significance-based score below one, as those tend to contain the entire matrix.

  • For SSVD, we obtained a rank-1 through rank-4 approximation using the R package s4vd; note that in our simulation set-up, the rank of the true underlying mean matrix is four. Sparsity parameters were selected using BIC. The adaptive weight parameters were set to two as in Lee et al. (2010). Only the best results obtained are reported.

We quantify the success of the approaches via the proportion of zero elements in the underlying mean matrix that are correctly identified (correct zeros), and the proportion of non-zero elements in the underlying mean matrix that are correctly identified (correct non-zeros). We also report sparsity rate and sparsity error rate as defined in Section 6.2. Finally, for one-way k-means clustering and for our sparse biclustering proposal, we report row and column CER; we do not report this for the other competitors, since they do not provide a partition of the rows and columns, and instead simply identify (possibly overlapping) hotspots in the matrix.

The results are presented in Table 4. We see that a substantial benefit is obtained by performing sparse biclustering rather than one-way k-means clustering, in terms of CER. Now, we discuss the performance of various biclustering methods in terms of proportion of correctly identified zeros and non-zeros, and also the sparsity error rate. We see from Table 4 that IP fails to identify any biclusters in this simulation set-up. This is due to the fact that the signal-to-noise ratio in this setting is too low; in related simulation set-ups with a higher signal-to-noise ratio, IP’s performance is improved. SSVD and LAS perform comparably in this setting, and by far the best overall performance is achieved by our sparse biclustering proposal with a large value of λ. For instance, when λ = 1000 and p = 200, the sparsity error rate is only 14.2%.

Table 4.

Results of various competitors in Simulation 2 with n = 200. We report the mean (and standard error) over 50 simulated data sets of the CER of the rows and columns, proportion of correctly identified zeros and non-zeros, sparsity rate, and sparsity error rate. Note that λ̄ is the mean of λ selected across 50 simulations using the approach of Section 5.2.

p Method Row CER Column CER C. Zeros C. Non-zeros Sparsity Rate Sparsity Error Rate
200 k-means 0.0460 (0.009) 0.0725 (0.008) - - - -
Bicluster λ=0 0.0306 (0.008) 0.0434 (0.007) - - - -
Bicluster λ=200 0.0289 (0.007) 0.0425 (0.007) 0.264 (0.035) 0.994 (0.002) 0.135 (0.018) 0.372 (0.021)
Bicluster λ=500 0.0313 (0.008) 0.0482 (0.007) 0.574 (0.053) 0.985 (0.004) 0.295 (0.028) 0.217 (0.025)
Bicluster λ=1000 0.0552 (0.010) 0.0723 (0.009) 0.749 (0.042) 0.962 (0.007) 0.392 (0.238) 0.142 (0.022)
Bicluster λ̄ =475 0.0292 (0.007) 0.0456 (0.007) 0.684 (0.053) 0.987 (0.002) 0.345 (0.028) 0.166 (0.026)
IP - - 1.000 (0.000) 0.000 (0.000) 1.000 (0.000) 0.498 (0.020)
SSVD rank-2 - - 0.683 (0.047) 0.489 (0.052) 0.609 (0.048) 0.388 (0.017)
LAS - - 0.366 (0.008) 0.932 (0.004) 0.217 (0.007) 0.353 (0.012)

500 k-means 0.0168 (0.005) 0.0494 (0.007) - - - -
Bicluster λ=0 0.0100 (0.004) 0.0375 (0.006) - - - -
Bicluster λ=200 0.0097 (0.004) 0.0374 (0.006) 0.127 (0.028) 0.998 (0.001) 0.063 (0.013) 0.440 (0.021)
Bicluster λ=500 0.0103 (0.004) 0.0379 (0.006) 0.287 (0.045) 0.995 (0.001) 0.151 (0.025) 0.354 (0.024)
Bicluster λ=1000 0.0112 (0.004) 0.0401 (0.007) 0.511 (0.058) 0.994 (0.001) 0.261 (0.032) 0.244 (0.028)
Bicluster λ̄ =663 0.0098 (0.004) 0.0383 (0.006) 0.530 (0.059) 0.994 (0.0013) 0.264 (0.029) 0.242 (0.029)
IP - - 1.000 (0.000) 0.000 (0.000) 1.000 (0.000) 0.498 (0.020)
SSVD rank-2 - - 0.594 (0.045) 0.623 (0.043) 0.503 (0.044) 0.373 (0.016)
LAS - - 0.443 (0.011) 0.953 (0.004) 0.244 (0.008) 0.305 (0.013)

6.4 Simulation 3: Multiplicative biclusters

This simulation study is adapted from Lee et al. (2010). Let $M = d\,u_1 v_1^T$ be a 100 × 50 matrix with d = 50, ũ1 = [10, 9, 8, 7, 6, 5, 4, 3, r(2, 17), r(0, 75)]T, ṽ1 = [10, −10, 8, −8, 5, −5, r(3, 5), r(−3, 5), r(0, 34)]T, u1 = ũ1/‖ũ1‖2, and v1 = ṽ1/‖ṽ1‖2, where r(a, b) denotes a vector of length b with all entries equal to a. Then, let X = M + ε, where the εij are i.i.d. N(0, 1). Figures 2(a)–(b) display the data matrix X and the underlying mean matrix M. As mentioned in Lee et al. (2010), this is a challenging biclustering problem since some non-zero entries in M are small relative to the noise. In particular, this setting is challenging for our sparse biclustering proposal, due to the presence of multiplicative biclusters, as opposed to the contiguous constant bicluster setting for which our proposal is intended.
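The construction of M for this simulation can be written out directly; a minimal NumPy version (the random seed is arbitrary):

```python
import numpy as np

def r(a, b):
    """r(a, b): a length-b vector with all entries equal to a."""
    return np.full(b, float(a))

# Build the rank-1 mean matrix M = d * u1 v1^T of Simulation 3.
u_t = np.concatenate([[10, 9, 8, 7, 6, 5, 4, 3], r(2, 17), r(0, 75)])
v_t = np.concatenate([[10, -10, 8, -8, 5, -5], r(3, 5), r(-3, 5), r(0, 34)])
u1 = u_t / np.linalg.norm(u_t)
v1 = v_t / np.linalg.norm(v_t)
M = 50 * np.outer(u1, v1)                 # 100 x 50 mean matrix
assert M.shape == (100, 50)

rng = np.random.default_rng(0)
X = M + rng.normal(size=M.shape)          # add i.i.d. N(0, 1) noise
```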

Figure 2.

Figure 2

Heatmaps of (a): data matrix, generated according to Simulation 3. (b) Underlying means used to generate data. (c) Mean matrix estimated by sparse biclustering, with K and R automatically chosen (K = 3, R = 5) and λ = 10; 84% of the elements are estimated to equal zero.

We performed sparse biclustering with K, R automatically selected using Algorithm 2, and with various values of λ. For IP, LAS, and SSVD, the tuning parameters used are as in Section 6.3 unless specified otherwise. For IP, we set the R package biclust to identify the most flexible model discussed in Lazzeroni & Owen (2002), and ran the algorithm without the background layer. For SSVD, we set the parameters in the R package s4vd such that one bicluster is identified.

The results (averaged over 100 simulations) are summarized in Table 5. It is not surprising that SSVD has the best results in this simulation set-up, as in this set-up there are multiplicative biclusters. Though they have low sparsity error rates, both IP and LAS fail to correctly identify most of the non-zero elements in the underlying mean matrix. It is not surprising that LAS performs poorly in this simulation set-up, as LAS was developed to identify constant biclusters.

Table 5.

Results for Simulation 3, averaged over 100 simulated data sets. For sparse biclustering, K and R were automatically chosen using Algorithm 2. Note that λ̄ is the mean of λ selected across 100 simulations using the approach of Section 5.2. Standard errors are in parentheses.

Method Sparsity Rate C. Zeros C. Non-zeros Sparsity Error Rate
Bicluster λ=0 0.000 (0.000) 0.000 (0.000) 1.000 (0.000) 0.920 (0.000)
Bicluster λ=80 0.829 (0.012) 0.895 (0.013) 0.940 (0.005) 0.101 (0.012)
Bicluster λ=90 0.872 (0.009) 0.944 (0.010) 0.951 (0.005) 0.056 (0.009)
Bicluster λ=100 0.878 (0.014) 0.950 (0.015) 0.955 (0.005) 0.050 (0.013)
Bicluster λ=110 0.804 (0.024) 0.871 (0.025) 0.963 (0.004) 0.122 (0.023)
Bicluster λ̄ = 11.6 0.310 (0.029) 0.336 (0.032) 0.986 (0.004) 0.612 (0.029)
SSVD 0.886 (0.002) 0.963 (0.002) 0.997 (0.001) 0.034 (0.002)
IP 0.972 (0.001) 0.997 (0.001) 0.307 (0.008) 0.059 (0.001)
LAS 0.920 (0.002) 0.963 (0.002) 0.575 (0.009) 0.068 (0.002)

How does sparse biclustering perform in this setting, which clearly violates the constant and contiguous bicluster model? Sparse biclustering with λ = 0 has a sparsity error rate of 0.92, due to the fact that when λ = 0, all elements in the estimated mean matrix are non-zero. However, for a moderate value of λ, sparse biclustering performs well, even though it is designed to identify contiguous constant biclusters. This is because the multiplicative bicluster in Figure 2(b) can be approximated as the union of a number of constant biclusters. Therefore, sparse biclustering leads to Figure 2(c), which is a very accurate approximation of Figure 2(b). In particular, Figure 2(c) resulted from our sparse biclustering proposal with K = 3 and R = 5; note that these values were selected automatically by Algorithm 2.

6.5 Simulation 4: Overlapping multiplicative biclusters

In this section, we investigate an example with overlapping multiplicative biclusters. Let M = d u1v1^T + d u2v2^T be a 100 × 50 matrix with d = 50, u1, v1 as defined in Section 6.4, ũ2 = [r(0, 13), 10, 9, 8, 7, 6, 5, 4, 3, r(2, 17), r(0, 62)]^T, ṽ2 = [r(0, 9), 10, −9, 8, −7, 6, −5, r(4, 5), r(−3, 5), r(0, 25)]^T, u2 = ũ2/||ũ2||2, and v2 = ṽ2/||ṽ2||2. Then, let X = M + ε, where εij ~ i.i.d. N(0, 1). Heatmaps of X and M are shown in Figures 3(a)–(b).

Figure 3. Heatmaps of (a) the data matrix, generated according to Simulation 4; (b) the underlying means used to generate the data; (c) the mean matrix estimated by sparse biclustering, with K and R chosen automatically (K = 3, R = 6) and λ = 70; 88% of the elements are exactly equal to zero.

We applied the biclustering methods described in the previous section, with the SSVD parameters set to identify two biclusters. We expect SSVD to perform well in this set-up, since there are multiplicative overlapping biclusters. In contrast, sparse biclustering's assumption of constant and non-overlapping biclusters is clearly violated. Nonetheless, sparse biclustering performs competitively (Table 6), since the multiplicative and overlapping biclusters can be approximated very accurately by sparse biclustering with sufficiently large values of K and R (Figure 3(c)). A similar fact was noted in Gu & Liu (2008).

Table 6.

Results for Simulation 4. Details are as in Table 5.

Method Sparsity Rate C. Zeros C. Non-zeros Sparsity Error Rate
Bicluster λ=40 0.648 (0.020) 0.718 (0.023) 0.775 (0.007) 0.274 (0.019)
Bicluster λ=60 0.770 (0.018) 0.849 (0.021) 0.706 (0.007) 0.171 (0.017)
Bicluster λ=80 0.813 (0.016) 0.895 (0.017) 0.679 (0.007) 0.136 (0.015)
Bicluster λ=100 0.859 (0.012) 0.950 (0.014) 0.687 (0.004) 0.088 (0.011)
Bicluster λ=120 0.823 (0.009) 0.915 (0.010) 0.727 (0.006) 0.112 (0.009)
Bicluster λ̄ = 12.2 0.262 (0.021) 0.294 (0.024) 0.928 (0.006) 0.616 (0.020)
SSVD 0.792 (0.008) 0.897 (0.006) 0.834 (0.028) 0.112 (0.004)
IP 0.944 (0.012) 0.995 (0.001) 0.358 (0.007) 0.097 (0.001)
LAS 0.877 (0.002) 0.963 (0.002) 0.634 (0.005) 0.084 (0.002)

7 Application to a gene expression data set

In this section, we consider a lung cancer gene expression data set previously analyzed by Lee et al. (2010) and Liu et al. (2008), consisting of measurements for 56 samples and 12,625 genes. Seventeen samples correspond to normal subjects, 20 correspond to subjects with pulmonary carcinoid tumors, 13 correspond to colon metastases, and six correspond to small cell carcinomas. We selected the 5,000 genes with largest variance, and we mean-centered the 56 × 5000 data matrix. The goal is to discover sets of genes whose expression differs from the baseline in a subset of the patients.

We performed sparse biclustering using K = 4 (which we know to be the true number of row clusters), R = 10, and λ = 1500. A heatmap of the resulting estimated mean matrix is shown in Figure 4. For visualization purposes, we reordered the genes based on the estimated clusters to which they belong. From Figure 4, we see that one subject with small cell carcinoma is assigned by sparse biclustering to a cluster of pulmonary carcinoid tumors. Imposing sparsity in estimating the bicluster means provides substantial benefits in interpretation of the image plot, as μ̂kr = 0 for many values of k and r. Furthermore, we see from Figure 4 that there is substantial variation among the estimated bicluster means. For instance, the genes in the second column cluster have a very large mean value in normal patients and a very small mean value in carcinoid patients.

Figure 4. Heatmap of the estimated mean matrix from sparse biclustering using K = 4, R = 10, and λ = 1500 on a subset of the lung cancer data set consisting of the 5,000 genes with highest variance. The rows are ordered by true cancer subtype, and the genes are reordered based on the estimated clusters for visualization purposes. The column labels indicate the gene clusters. Note that all elements in column clusters 6–10 are estimated to equal zero.

The estimated mean matrix shown in Figure 4 is similar to the three image plots obtained using SSVD in Lee et al. (2010). This is not surprising, since our biclustering proposal can be interpreted as a constrained version of the SVD (see Section 4). However, SSVD has a major interpretational disadvantage relative to our proposal: whereas sparse biclustering explicitly returns cluster labels for both the rows and columns of the data matrix, the SSVD instead returns a series of sparse singular vectors. The analyst must then take a post hoc approach to interpret these singular vectors in order to determine the row and column clusters. In other words, SSVD does not directly output a single interpretable figure as in Figure 4.

We note that Algorithm 2 led to selection of K = 5 and R = 25 on this example. One of these row clusters contains just a single subject, and the others correspond perfectly to the subjects’ cancer types. Here we reported results using R = 10 instead of R = 25 for simplicity; however, using R = 25, a figure that is qualitatively very similar to Figure 4 emerges.

8 Matrix-variate normal biclustering

Recently, proposals have emerged to use the matrix-variate normal distribution to model high-dimensional transposable data (Gupta & Nagar 1999, Allen & Tibshirani 2010). To indicate that an n × p data matrix X has a matrix-variate normal distribution, we write

X ~ MVN(A, Σ, Δ),  (7)

where A is an n × p matrix containing the mean of each element of X, Σ is an n × n covariance matrix for the rows of X, and Δ is a p × p covariance matrix for the columns of X. A consequence of the matrix-variate normal model (7) is that the rows and columns of X are marginally multivariate normal. For instance, letting Xi and Ai be the ith rows of X and A, respectively,

Xi ~ N(Ai, Σii Δ).  (8)

We note that in the case Σ = Δ = I, this model reduces to independent entries, Xij ~ N(Aij, 1).
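A draw from (7) can be simulated via the standard Cholesky construction: if Z has i.i.d. N(0, 1) entries, Σ = LL^T, and Δ = RR^T, then A + LZR^T ~ MVN(A, Σ, Δ). The sketch below is illustrative and not part of the original article; the function name is ours.

```python
import numpy as np

def rmatnorm(A, Sigma, Delta, rng):
    """Draw X ~ MVN(A, Sigma, Delta): if Z has i.i.d. N(0, 1) entries and
    Sigma = L L^T, Delta = R R^T, then A + L Z R^T has the desired law."""
    L = np.linalg.cholesky(Sigma)
    R = np.linalg.cholesky(Delta)
    Z = rng.standard_normal(A.shape)
    return A + L @ Z @ R.T
```

Row i of such a draw is marginally N(Ai, Σii Δ), consistent with (8); for instance, with Σ = 4I and Δ = I, each entry of a zero-mean draw has variance 4.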

8.1 General formulation of matrix-variate normal biclustering

Now assume that the n × p data matrix X is drawn from a matrix-variate normal distribution of the form (7), and that A has constant biclusters: that is, Aij = μkr for all i ∈ Ck and j ∈ Dr. Without loss of generality, suppose that the rows and columns are ordered such that k < k′, i ∈ Ck, and i′ ∈ Ck′ imply that i < i′, and similarly that r < r′, j ∈ Dr, and j′ ∈ Dr′ imply that j < j′. In other words, we use the model

X ~ MVN( [ (μ11) ⋯ (μ1R); ⋮ ⋱ ⋮; (μK1) ⋯ (μKR) ], Σ, Δ ),  (9)

where (μkr) is a |Ck|×|Dr| matrix, all of whose elements equal μkr. This is a natural formulation for biclustering since it easily accommodates constant biclusters as well as arbitrary row and column covariances. Fitting the model (9) requires estimating the n × n matrix Σ and the p × p matrix Δ using the n × p matrix X; a proposal to do this using ℓ1 or ℓ2 penalties is presented in Allen & Tibshirani (2010).

A further simplification to the model (9) is natural. Though we might expect correlation between observations within a row cluster, or between features within a column cluster, correlations between observations in two different row clusters or between features in two different column clusters are less easily interpreted. This leads to the model

X ~ MVN( [ (μ11) ⋯ (μ1R); ⋮ ⋱ ⋮; (μK1) ⋯ (μKR) ], diag(Σ1, …, ΣK), diag(Δ1, …, ΔR) ),  (10)

where Σ and Δ are now block diagonal with blocks of dimension |C1| × |C1|,…, |CK| × |CK| and |D1| × |D1|, …, |DR| × |DR|, respectively. The formulation (10) is attractive not only because it provides a natural model for biclustering, but also because it has as special cases some well-known formulations for one-way clustering. In particular, consider (10) with R = p and Σk = I for k = 1,…, K. Then (10) amounts to a simple and well-studied model in which all observations come from a multivariate normal distribution with a common diagonal covariance matrix and a cluster-specific mean vector (Fraley & Raftery 2002). If furthermore Δ = σ²I, then this amounts to the usual formulation for one-way k-means clustering. By symmetry of the matrix-variate normal distribution, (10) also reduces to model-based clustering or k-means clustering of the columns. Note that if we assume that Σ = σ²I and Δ = I, then this corresponds to our proposal in Section 3.

8.2 Sparse matrix-variate normal biclustering

The log likelihood corresponding to (10) takes the form

l(μ, Σ, Δ) = (p/2) ∑_{k=1}^K log|Σk^{-1}| + (n/2) ∑_{r=1}^R log|Δr^{-1}| − (1/2) ∑_{k=1}^K ∑_{r=1}^R tr( Σk^{-1} (Xk,r − μkr) Δr^{-1} (Xk,r − μkr)^T ),  (11)

where Xk,r is the |Ck| × |Dr| submatrix of X consisting of the elements Xij with i ∈ Ck and j ∈ Dr. We would like to fit the model (10) by maximizing (11). However, two problems arise. First, the maximum likelihood estimates of Σk and Δr may be singular. Second, we may want to encourage sparsity in the μkr. To address these two points, we propose to maximize the penalized log likelihood

lp(μ, Σ, Δ) = (p/2) ∑_{k=1}^K log|Σk^{-1}| + (n/2) ∑_{r=1}^R log|Δr^{-1}| − (1/2) ∑_{k=1}^K ∑_{r=1}^R tr( Σk^{-1} (Xk,r − μkr) Δr^{-1} (Xk,r − μkr)^T ) − λ ∑_{k=1}^K ∑_{r=1}^R |μkr| − α ∑_{k=1}^K ||Σk^{-1}||_d − β ∑_{r=1}^R ||Δr^{-1}||_d.  (12)

Here, α, β, and λ are nonnegative tuning parameters that determine the extent of penalization, and we take d = 1 or d = 2. In the last two terms of (12), ||W||_d denotes ∑_{i,j} |Wij|^d.

To maximize (12), we take an iterative approach in which we update the parameters μ, Σ, Δ, C1,…, CK, D1,…, DR sequentially, holding all other parameters fixed as we update the current set of parameters. We begin with two simple lemmas.

Lemma 2

With Σ1^{-1}, …, ΣK^{-1}, Δ1^{-1}, …, ΔR^{-1}, C1, …, CK, and D1, …, DR held fixed, maximizing (12) with respect to μ results in the update

μkr = S( tr(Σk^{-1} 1 Δr^{-1} Xk,r^T) / tr(Σk^{-1} 1 Δr^{-1} 1^T),  λ / tr(Σk^{-1} 1 Δr^{-1} 1^T) ),  (13)

where 1 is a |Ck| × |Dr| matrix comprised solely of 1's, and S is the soft-thresholding operator, S(a, c) = sign(a)(|a| − c)+.
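Update (13) is a penalized generalized-least-squares estimate of a single bicluster mean, and transcribes directly into code. The sketch below is illustrative; the function names are ours.

```python
import numpy as np

def soft_threshold(a, c):
    """S(a, c) = sign(a) * max(|a| - c, 0)."""
    return np.sign(a) * max(abs(a) - c, 0.0)

def update_mu(X_kr, Sigma_k_inv, Delta_r_inv, lam):
    """Update (13): X_kr is the |C_k| x |D_r| submatrix for bicluster (k, r)."""
    ones = np.ones_like(X_kr)
    denom = np.trace(Sigma_k_inv @ ones @ Delta_r_inv @ ones.T)
    num = np.trace(Sigma_k_inv @ ones @ Delta_r_inv @ X_kr.T)
    return soft_threshold(num / denom, lam / denom)
```

As a sanity check, with Σk = Δr = I and λ = 0 the update reduces to the sample mean of the submatrix, matching the unpenalized identity-covariance case.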

Lemma 3

With μ, Δ1^{-1}, …, ΔR^{-1}, C1, …, CK, and D1, …, DR held fixed, maximizing (12) with respect to Σk^{-1} reduces to

maximize_{Σk^{-1}} { log|Σk^{-1}| − tr(Σk^{-1} Sk) − (2α/p) ||Σk^{-1}||_d },  (14)

where Sk = (1/p) ∑_{r=1}^R (Xk,r − μkr) Δr^{-1} (Xk,r − μkr)^T.

Note that if d = 1, the graphical lasso algorithm (Friedman et al. 2007) can be used to solve (14), and the estimate of Σk^{-1} will be sparse if the tuning parameter α is sufficiently large. When d = 2, a simple analytical solution in terms of the eigenvectors and eigenvalues of Sk is available (Witten & Tibshirani 2009). A similar approach can be used to maximize (12) with respect to Δr^{-1}, with μ and Σ1^{-1}, …, ΣK^{-1} held fixed.
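For the d = 2 case, the analytical solution can be sketched as follows: the optimum shares eigenvectors with Sk, and each eigenvalue solves a scalar quadratic obtained from the first-order condition. This is our reading of the eigenvalue-based solution in Witten & Tibshirani (2009), with penalty weight c = 2α/p; the code is a sketch under that assumption.

```python
import numpy as np

def update_precision_l2(S, c):
    """Maximize log|Theta| - tr(Theta S) - c * ||Theta||_F^2.
    Eigendecompose S = V diag(s) V^T; each eigenvalue theta of the optimum
    solves 1/theta - s - 2*c*theta = 0, i.e. 2c*theta^2 + s*theta - 1 = 0."""
    s, V = np.linalg.eigh(S)
    theta = (-s + np.sqrt(s ** 2 + 8.0 * c)) / (4.0 * c)
    return V @ np.diag(theta) @ V.T
```

The solution can be verified by checking the first-order condition Theta^{-1} − S − 2c·Theta = 0 numerically.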

In order to update C1,…, CK with D1,…, DR, Δ^{-1}, Σ^{-1}, and μ held fixed, we note that by (8), the ith row of X has a multivariate normal distribution given by

Xi ~ N(μk, Σii Δ)  (15)

if that observation belongs to the kth row cluster. In (15), μk is a p-vector whose jth element equals μkr if j ∈ Dr. So we update the row cluster of the ith observation by assigning that observation to the class for which the log likelihood resulting from (15) is largest. We note that this approach for updating the row clusters is not completely rigorous, since we are assigning each observation to a new row cluster without regard to the covariance structure among the rows. In particular, this approach is not guaranteed to increase the log likelihood, but it performs well empirically. A similar approach is taken to update the column clusters.
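The row-reassignment rule can be sketched as follows. Since the terms of the log likelihood in (15) involving Σii and the normalizing constant do not depend on the candidate cluster k, maximizing over k is equivalent to minimizing a Mahalanobis-type distance. In this illustrative code (not from the original article), mu is a K × p array whose kth row is the expanded mean vector μk.

```python
import numpy as np

def assign_rows(X, mu, Delta_inv):
    """Assign each row of X to the cluster k whose expanded mean mu[k]
    maximizes the Gaussian log likelihood under (15); k-independent terms
    drop out, leaving a Mahalanobis-type distance to minimize."""
    labels = np.empty(X.shape[0], dtype=int)
    for i, x in enumerate(X):
        d = [(x - m) @ Delta_inv @ (x - m) for m in mu]
        labels[i] = int(np.argmin(d))
    return labels
```

With Δ = I this reduces to nearest-centroid assignment, as in the k-means special case of Section 8.1.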

The steps just described for maximizing (12) are summarized in Algorithm 3. Although Steps 2(b) and 2(d) in Algorithm 3 could potentially lead to a decrease in (12), in our experience the algorithm tends to converge within 3–5 iterations in the simulation set-up of Section 8.3.

Algorithm 3.

Matrix-variate normal biclustering

  1. Initialize C1, …, CK, D1, …, DR, Σ1^{-1}, …, ΣK^{-1}, Δ1^{-1}, …, ΔR^{-1}, and μ.

  2. Iterate until convergence or until a fixed number of iterations is reached:

    (a) Holding C1, …, CK and D1, …, DR fixed, perform the following updates:

      i. Holding Σ^{-1} and Δ^{-1} fixed, update μ using (13).

      ii. Holding μ and Δ^{-1} fixed, update Σk^{-1} as in Lemma 3 for k = 1,…, K.

      iii. Holding μ and Σ^{-1} fixed, update Δr^{-1} as in Lemma 3 for r = 1,…, R.

    (b) Holding Σ1^{-1}, …, ΣK^{-1}, Δ1^{-1}, …, ΔR^{-1}, μ, and D1,…, DR fixed, update the row clustering: iterate through the rows, assigning each row to the row cluster for which the log likelihood resulting from (15) is largest.

    (c) Repeat Step 2(a).

    (d) Holding Σ1^{-1}, …, ΣK^{-1}, Δ1^{-1}, …, ΔR^{-1}, μ, and C1, …, CK fixed, update the column clustering, as in Step 2(b), with the roles of the rows and columns reversed.

8.3 A simulation study

We created K = 4 row clusters and R = 5 column clusters by randomly assigning each row to a row cluster with uniform probability, and each column to a column cluster with uniform probability. We generated an n × p mean matrix A as follows: for each i ∈ Ck and j ∈ Dr, Aij = μkr, where μkr ~ Unif[(−2.5, −1.5) ∪ (1.5, 2.5)] or μkr = 0 with equal probability. Then, the n × p matrix X was generated according to X ~ MVN(A, Σ, Δ), where Σ and Δ are block diagonal covariance matrices with blocks corresponding to the row and column cluster memberships, respectively.

We performed one-way k-means clustering on the rows and on the columns, sparse biclustering, and matrix-variate normal biclustering with d = 1, considering the cases in which Σ^{-1} and Δ^{-1} are known and unknown. We set the tuning parameters α and β in (12) equal to 0.05. In addition, we considered IP, LAS, and SSVD, with tuning parameters chosen as described in Section 6.3; the evaluation criteria of Section 6.3 were used to assess the performance of each biclustering method. Results are reported in Table 7.

Table 7.

Results for simulation study with n = p = 200 as described in Section 8.3. Sparse biclustering and MVN biclustering were performed, with various values of λ, and with λ chosen automatically (λ̄). MVN biclustering was performed with Σ−1 and Δ−1 known (MVN bicluster known) and unknown (MVN bicluster).

Method Row CER Column CER C. Zeros C. Non-zeros Sparsity Rate Sparsity Error Rate
k-means 0.124 (0.013) 0.145 (0.008) - - - -
Bicluster λ = 0 0.075 (0.013) 0.081 (0.010) - - - -
Bicluster λ = 200 0.068 (0.012) 0.078 (0.009) 0.556 (0.031) 0.978 (0.003) 0.272 (0.014) 0.248 (0.023)
Bicluster λ = 400 0.065 (0.012) 0.079 (0.009) 0.782 (0.029) 0.960 (0.006) 0.394 (0.015) 0.139 (0.020)
Bicluster λ̄ = 430 0.066 (0.012) 0.078 (0.009) 0.791 (0.033) 0.962 (0.007) 0.398 (0.019) 0.137 (0.023)
MVN bicluster λ = 0 0.071 (0.013) 0.081 (0.010) - - - -
MVN bicluster λ = 15 0.060 (0.012) 0.073 (0.009) 0.649 (0.028) 0.975 (0.005) 0.323 (0.013) 0.199 (0.020)
MVN bicluster λ = 30 0.087 (0.014) 0.095 (0.011) 0.809 (0.025) 0.922 (0.013) 0.432 (0.015) 0.141 (0.018)
MVN bicluster λ̄ = 18.8 0.060 (0.012) 0.073 (0.010) 0.716 (0.039) 0.969 (0.009) 0.354 (0.019) 0.169 (0.025)
MVN bicluster known, λ = 0 0.027 (0.008) 0.044 (0.007) - - - -
MVN bicluster known, λ = 100 0.025 (0.008) 0.041 (0.007) 0.475 (0.027) 0.997 (0.001) 0.245 (0.018) 0.258 (0.016)
MVN bicluster known, λ = 250 0.034 (0.008) 0.053 (0.009) 0.693 (0.027) 0.987 (0.006) 0.358 (0.020) 0.155 (0.014)
MVN bicluster known, λ̄ = 257.5 0.057 (0.017) 0.048 (0.009) 0.712 (0.039) 0.993 (0.002) 0.344 (0.020) 0.163 (0.026)
IP - - 1.000 (0.000) 0.000 (0.000) 1.000 (0.000) 0.500 (0.020)
SSVD rank-2 - - 0.716 (0.040) 0.449 (0.051) 0.640 (0.044) 0.387 (0.014)
LAS - - 0.334 (0.006) 0.917 (0.004) 0.208 (0.005) 0.376 (0.012)

We see that matrix-variate normal biclustering leads to consistently better results than sparse biclustering and one-way clustering of the rows and columns via k-means. When both Σ−1 and Δ−1 are known, matrix-variate normal biclustering results in the lowest CER.

8.4 Application to real data

We again consider the lung cancer data set described in Section 7. Once again, we selected 5,000 genes with largest variance, and mean-centered the data matrix. We performed MVN biclustering with K = 4, R = 10, λ = 1500, α = 0.35, β = 0.35, and d = 1, where α, β, and d are given in (12). A heatmap of the estimated mean matrix resulting from MVN biclustering is shown in Figure 5.

Figure 5. Heatmap of the estimated mean matrix from MVN biclustering using K = 4, R = 10, λ = 1500, α = 0.35, and β = 0.35 on a subset of the lung cancer data set consisting of the 5,000 genes with highest variance. Details are as in Figure 4.

We see from Figure 5 that MVN biclustering perfectly identifies the four types of subjects. On this data set, since α is large and n is small, the estimate for Σ−1 obtained is diagonal – in other words, here our MVN biclustering does not model conditional dependencies among the samples. In contrast, the estimate obtained for Δ−1 has many non-zero elements within each of the blocks. In particular, 13.45% of the partial correlations in cluster 1, 73% of the partial correlations in cluster 2, 58.23% of the partial correlations in cluster 3, 40.96% of the partial correlations in cluster 4, 73.22% of the partial correlations in cluster 5, and 0.057% of the partial correlations in clusters 6–10 are non-zero. By inspection of Figure 5, we see that the gene clusters with expression levels that differ substantially among cancer subtypes tend to contain genes that are conditionally dependent. This is scientifically plausible, since we believe that genes that participate in the same pathways tend to be conditionally dependent, and may have similar expression levels in each biological condition.

9 Discussion

In this paper, we have proposed a novel approach for biclustering. Sparsity in the bicluster means is achieved using an ℓ1 penalty, and our biclustering proposal is extended to a more general setting using the matrix-variate normal distribution. We have shown that k-means clustering can be seen as a special case of our biclustering proposal. Just as a relaxation of k-means clustering yields PCA, a relaxation of our biclustering approach yields the SVD.

A possible drawback of our sparse biclustering proposal is that it does not allow for overlapping biclusters — that is, it assigns each element of the data matrix to exactly one bicluster. While allowing for overlapping biclusters can be beneficial in certain contexts (Madeira & Oliveira 2004), we argue that it results in too much complexity as well as challenges in interpretation. Furthermore, we demonstrate in Sections 6.4 and 6.5 that even though our sparse biclustering proposal assumes constant and contiguous biclusters, it performs competitively when there are multiplicative biclusters and overlapping biclusters.

The R package sparseBC, available on CRAN, implements the methods proposed.

Supplementary Material


Acknowledgments

We thank the editor, an associate editor, and two reviewers for helpful comments that improved the quality of this manuscript. The authors were supported by NIH Grant DP5OD009145 and NSF CAREER Award DMS-1252624.

Appendix: Proofs

Proof of Lemma 1

Proof

Let X = UDV^T denote the SVD of X, where U and V are orthogonal n × n and p × p matrices and D is an n × p matrix with decreasing nonnegative diagonal elements. Note that any orthogonal n × K matrix A can be written as A = Uα for some orthogonal n × K matrix α, and any orthogonal p × K matrix B can be written as B = Vβ for some orthogonal p × K matrix β. Thus, instead of solving (4), we can solve

maximize_{α^T α = I_K, β^T β = I_K} ||α^T D β||_F^2.  (16)

By inspection, (16) is solved by α = I_{n×K} Q1 and β = I_{p×K} Q2, where Q1 and Q2 are any K × K orthogonal matrices, and where I_{n×K} and I_{p×K} denote the first K columns of the n × n and p × p identity matrices. Therefore, the solution to (4) takes the form A = U I_{n×K} Q1 = U_{1:K} Q1 and B = V I_{p×K} Q2 = V_{1:K} Q2.
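The conclusion can be checked numerically: with A = U_{1:K} and B = V_{1:K}, the objective of (4) attains the sum of the top K squared singular values, and the value is unchanged by an orthogonal rotation Q1. This is a numerical verification, not part of the original proof.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
K = 2

U, d, Vt = np.linalg.svd(X)
A = U[:, :K]                 # A = U_{1:K} (taking Q1 = I)
B = Vt[:K, :].T              # B = V_{1:K} (taking Q2 = I)

# Attained objective equals the sum of the top K squared singular values.
attained = np.linalg.norm(A.T @ X @ B, "fro") ** 2
assert np.isclose(attained, np.sum(d[:K] ** 2))

# Invariance under right-multiplication by an orthogonal Q1.
Q1, _ = np.linalg.qr(rng.standard_normal((K, K)))
rotated = np.linalg.norm((A @ Q1).T @ X @ B, "fro") ** 2
assert np.isclose(attained, rotated)
```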

Proof of Lemma 2

Proof

We must minimize the quantity

tr( Σk^{-1} (Xk,r − μkr) Δr^{-1} (Xk,r − μkr)^T ) + 2λ|μkr|

with respect to μkr. This amounts to minimizing

μkr^2 tr(Σk^{-1} 1 Δr^{-1} 1^T) − 2 μkr tr(Σk^{-1} 1 Δr^{-1} Xk,r^T) + 2λ|μkr|,

where 1 is a |Ck| × |Dr| matrix with all entries equal to 1. Completing the square, we see that this is equivalent to minimizing

tr(Σk^{-1} 1 Δr^{-1} 1^T) ( μkr − tr(Σk^{-1} 1 Δr^{-1} Xk,r^T) / tr(Σk^{-1} 1 Δr^{-1} 1^T) )^2 + 2λ|μkr|

with respect to μkr. The result follows directly.

Proof of Theorem 4.1

Before we prove Theorem 4.1, we present a simple lemma.

Lemma 4

Let X̄ denote the mean of the elements in X. Then,

∑_{i=1}^n ∑_{j=1}^p (Xij − X̄)^2 = ∑_{i=1}^n ∑_{j=1}^p Xij^2 − np X̄^2 = (1/(2np)) ∑_{i=1}^n ∑_{j=1}^p ∑_{i′=1}^n ∑_{j′=1}^p (Xij − Xi′j′)^2.  (17)
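Lemma 4 is the classical identity relating the total sum of squares about the mean to the average of all pairwise squared differences; it can be checked numerically on a random matrix (an illustrative verification, not part of the original proof).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
n, p = X.shape
xbar = X.mean()

lhs = np.sum((X - xbar) ** 2)            # sum of squares about the mean
mid = np.sum(X ** 2) - n * p * xbar ** 2
flat = X.ravel()
# Average of all np^2... pairwise squared differences, scaled by 1/(2np).
rhs = np.sum((flat[:, None] - flat[None, :]) ** 2) / (2 * n * p)

assert np.isclose(lhs, mid) and np.isclose(lhs, rhs)
```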

Now we proceed with a proof of Theorem 4.1.

Proof

Problem (4) is equivalent to the problem

minimize_{A^T A = I_K, B^T B = I_K} { ||X||_F^2 − ||A^T X B||_F^2 },  (18)

which is equivalent to

minimize_{A^T A = I_K, B^T B = I_K} { ∑_{i=1}^n ∑_{j=1}^p Xij^2 − ∑_{k=1}^K ∑_{r=1}^K ( ∑_{i=1}^n ∑_{j=1}^p Aik Xij Bjr )^2 }.  (19)

Since (4) constrains A to be orthogonal, the two additional constraints in the theorem statement imply that the kth column of A contains exactly nk elements equal to 1/√nk, and n − nk elements equal to zero. Moreover, the non-zero elements of distinct columns of A do not overlap. A similar claim holds for B. Let Ck denote the indices of the non-zero elements in the kth column of A, and similarly let Dr denote the indices of the non-zero elements in the rth column of B. Then (19) leads to

minimize_{C1, …, CK, D1, …, DK} { ∑_{k=1}^K ∑_{r=1}^K ( ∑_{i∈Ck} ∑_{j∈Dr} Xij^2 − nk pr ( (1/(nk pr)) ∑_{i∈Ck} ∑_{j∈Dr} Xij )^2 ) },  (20)

where pr = |Dr|.

Finally, applying Lemma 4 reveals that this is equivalent to

minimize_{C1, …, CK, D1, …, DK} { ∑_{k=1}^K ∑_{r=1}^K ∑_{i∈Ck} ∑_{j∈Dr} (Xij − X̄kr)^2 }.  (21)

Now one can easily show that this is equivalent to the biclustering optimization problem in equation 1 in the case that K = R.

Footnotes

Supplementary materials

R scripts for Figures 1–5: R scripts to reproduce Figures 1–5. (Figures-code.zip)

R scripts for Tables 2–7: R scripts to reproduce Tables 2–7. (Tables-code.zip)

Contributor Information

Kean Ming Tan, Email: keanming@uw.edu, Department of Biostatistics, University of Washington, Seattle, WA 98115.

Dr. Daniela M. Witten, Email: dwitten@u.washington.edu, Department of Biostatistics, University of Washington, 1705 NE Pacific Street, Box 357232, F-649 Health Sciences Building, Seattle, WA 98195-7232.

References

  1. Allen G, Tibshirani R. Transposable regularized covariance models with an application to missing data imputation. Annals of Applied Statistics. 2010;4(2):764–790. doi: 10.1214/09-AOAS314.
  2. Cheng Y, Church G. Biclustering of gene expression data. Proc Int Conf Intell Syst Mol Biol. 2000;8:93–103.
  3. Chipman H, Tibshirani R. Hybrid hierarchical clustering with applications to microarray data. Biostatistics. 2005;7:286–301. doi: 10.1093/biostatistics/kxj007.
  4. Cho H, Dhillon IS. Coclustering of human cancer microarrays using minimum sum-squared residue coclustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2008;5(3):385–400. doi: 10.1109/TCBB.2007.70268.
  5. Cho H, Dhillon IS, Guan Y, Sra S. Minimum sum-squared residue co-clustering of gene expression data. Proceedings of the Fourth SIAM International Conference on Data Mining; 2004. pp. 114–125.
  6. Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
  7. Fraley C, Raftery A. Model-based clustering, discriminant analysis, and density estimation. J Amer Statist Assoc. 2002;97:611–631.
  8. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007;9:432–441. doi: 10.1093/biostatistics/kxm045.
  9. Getz G, Levine E, Domany E. Coupled two-way clustering of gene microarray data. PNAS. 2000;97:12079–12084. doi: 10.1073/pnas.210134797.
  10. Gu J, Liu J. Bayesian biclustering of gene expression data. BMC Genomics. 2008;9:S4. doi: 10.1186/1471-2164-9-S1-S4.
  11. Gupta A, Nagar D. Matrix Variate Distributions. CRC Press; Boca Raton, FL: 1999.
  12. Hartigan JA. Direct clustering of a data matrix. J Amer Statist Assoc. 1972;67:123–129.
  13. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag; New York: 2009.
  14. Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Sanden S, Lin D, Talloen W, Bijnens L, Gohlmann H, Shkedy Z, Clevert D. FABIA: factor analysis for bicluster acquisition. Bioinformatics. 2010;26(12):1520–1527. doi: 10.1093/bioinformatics/btq227.
  15. Kaiser S, Santamaria R, Khamiakova T, Sill M, Theron R, Quintales L, Leisch F. biclust: BiCluster algorithms. R package version 1.0.1. 2011. URL: cran.r-project.org/package=biclust.
  16. Lazzeroni L, Owen A. Plaid models for gene expression data. Statistica Sinica. 2002;12:61–86.
  17. Lee M, Shen H, Huang J, Marron J. Biclustering via sparse singular value decomposition. Biometrics. 2010;66(4):1087–1095. doi: 10.1111/j.1541-0420.2010.01392.x.
  18. Liu Y, Hayes D, Nobel A, Marron J. Statistical significance of clustering for high-dimension, low-sample size data. Journal of the American Statistical Association. 2008;103(483):1281–1293.
  19. Madeira S, Oliveira A. Biclustering algorithms for biological data analysis: A survey. IEEE Transactions on Computational Biology and Bioinformatics. 2004;1(1):24–45. doi: 10.1109/TCBB.2004.2.
  20. Pan W, Shen X. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research. 2007;8:1145–1164.
  21. Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006;22(9):1122–1129. doi: 10.1093/bioinformatics/btl060.
  22. Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66:846–850.
  23. Shabalin A, Weigman V, Perou C, Nobel A. Finding large average submatrices in high dimensional data. Annals of Applied Statistics. 2009;3(3):985–1012.
  24. Sill M, Kaiser S. s4vd: Biclustering via sparse singular value decomposition incorporating stability selection. R package version 1.0. 2011. doi: 10.1093/bioinformatics/btr322. URL: cran.r-project.org/web/packages/s4vd.
  25. Tang C, Zhang L, Zhang A, Ramanathan M. Interrelated two-way clustering: An unsupervised approach for gene expression data analysis. Proc. of 2nd IEEE International Symposium on Bioinformatics and Bioengineering; Bethesda. 2001.
  26. Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Statist Soc B. 1996;58:267–288.
  27. Turner H, Bailey T, Krzanowski W. Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics and Data Analysis. 2005;48:235–254.
  28. Wang S, Zhu J. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics. 2008;64:440–448. doi: 10.1111/j.1541-0420.2007.00922.x.
  29. Witten D, Tibshirani R. Covariance-regularized regression and classification for high-dimensional problems. J Royal Statist Soc B. 2009;71(3):615–636. doi: 10.1111/j.1467-9868.2009.00699.x.
  30. Witten D, Tibshirani R. A framework for feature selection in clustering. Journal of the American Statistical Association. 2010;105(490):713–726. doi: 10.1198/jasa.2010.tm09415.
  31. Witten D, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–534. doi: 10.1093/biostatistics/kxp008.
  32. Xie B, Pan W, Shen X. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics. 2008;2:168–212. doi: 10.1214/08-EJS194.
  33. Zha H, He X, Ding G, Simon H, Gu M. Spectral relaxation for k-means clustering. Neural Information Processing Systems. 2001;14:1057–1064.
