Accelerating high-dimensional clustering with lossless data reduction

Bahjat F Qaqish; Jonathon J O’Brien; Jonathan C Hibbard; Katie J Clowers

Bioinformatics. 2017 May 18;33(18):2867–2872. doi: 10.1093/bioinformatics/btx328

Editor: Ziv Bar-Joseph

PMCID: PMC5870568 PMID: 28520900

Abstract

Motivation

For cluster analysis, high-dimensional data are associated with instability, decreased classification accuracy and high computational burden. The latter challenge can be eliminated as a serious concern. For applications where dimension reduction techniques are not implemented, we propose a temporary transformation which accelerates computations with no loss of information. The algorithm can be applied to any statistical procedure that depends only on Euclidean distances, and it can be implemented sequentially to enable analyses of data that would otherwise exceed memory limitations.

Results

The method is easily implemented in common statistical software as a standard pre-processing step. The benefit of our algorithm grows with the dimensionality of the problem and the complexity of the analysis. Consequently, our simple algorithm not only decreases the computation time for routine analyses, it opens the door to performing calculations that may have otherwise been too burdensome to attempt.

Availability and implementation

R, Matlab and SAS/IML code for implementing lossless data reduction is freely available in the Appendix.

1 Introduction

Quantitative data from the omics fields often arrive in the form of a p × n matrix X, in which the columns represent n independent observations x₁,…,x_n. Each observation is a vector in p dimensions representing the expression levels of p genes. High-dimensional data refers to situations where p > n, with p in the thousands to tens of thousands, and n in the tens to hundreds. Problems with data of this nature are extensive and well documented. Beyer et al. (1999) argue that under certain conditions, as dimensionality increases, clustering results become unstable. For the specific problem of classifying cancers with gene expression data, Lu and Han (2003) argue that most genes will not be relevant to the disease, and consequently, using all genes for classification will result in both less accurate classification and unnecessarily large computational burden. For these reasons and others, a substantial literature on methods for dimension reduction has been established.

Jain et al. (2000) review methods for both “feature selection”, which seeks a subset of features minimizing classification error, and “feature extraction”, which aims to transform the data into an appropriate subspace. In gene expression studies it is common for feature selection to proceed by ranking the genes according to variability and discarding a predetermined number of the least variable genes (Koboldt et al. 2012). Commonly used feature extraction methods in gene expression research include multidimensional scaling (MDS) (Tzeng et al. 2008) and principal components analysis (PCA) (Ringnér 2008). These methods are not exclusive to cluster analysis, but a related literature specifically for subspace clustering methods has also been developed (Vidal 2011).

Though methods of dimension reduction will undoubtedly be helpful in a variety of settings for classification and visualization, many researchers still find themselves in situations that require high-dimensional clustering. In some cases, clustering is used to visualize variation between technical replicates and treated samples (Weekes et al. 2014). In other circumstances, researchers would like to do feature selection but cannot justify removing enough features to no longer be in a high-dimensional setting (Koboldt et al. 2012). Others might be turned away from dimension reduction by the complexity and risks inherent in reducing information contained in the data. An empirical study by Yeung and Ruzzo (2001) found that doing PCA prior to clustering often degraded cluster quality. Jain et al. (2000) argue that feature selection based on univariate statistics is unlikely to lead to an optimal subset. Questions of which dimension reduction algorithm to use, and more importantly, how far to reduce the dimensionality, do not have simple answers. Exhaustive searches are often impossible in large datasets, and it has been shown that non-exhaustive feature selection procedures cannot guarantee the selection of an optimal subset (Cover and Campenhout 1977). Presumably, the reasons for high-dimensional clustering go beyond these possibilities, as finding examples in the literature is trivial.

Thalamuthu et al. (2006) clustered four real datasets of dimensions 1663 × 77, 1744 × 45, 2570 × 114 and 1920 × 203; Dudoit and Fridlyand (2002) used datasets of dimensions 4682 × 81, 3571 × 72, 5244 × 61 and 3613 × 31. Many of the methods commonly used with such data are invariant or equivariant with respect to translation, rigid rotation and reflection. That is, applying such transformations does not change the results. The prototype example is clustering algorithms based on Euclidean distances between points and between points and centroids of groups of points (Jain 2010). Other invariant methods include various types of linear discriminant analysis (Dudoit et al. 2002), principal components, linear regression, logistic regression, density estimation and mixture modeling (McLachlan et al. 2003).

When these methods are employed in the high-dimensional setting, we can re-purpose the concept of dimension reduction to effectively eliminate the computational burden imposed by high dimensionality. The data reduction described here is an effective pre-processing step for the invariant or equivariant procedures described above. This is especially true when the procedures are combined with some form of resampling as found in papers by Dudoit and Fridlyand (2002), Monti et al. (2003), Tseng and Wong (2005), McLachlan and Khan (2004) and Volkovich et al. (2011). Even without resampling, it is usually recommended that the k-means algorithm be rerun with 25–50 random starts to avoid local minima. The pre-processing step can greatly reduce the processing time for an analysis and in some cases will allow for analyses that could not otherwise be performed without upgrading computational power.

The remainder of this article is structured as follows. We will present our lossless data reduction (LDR) algorithm, and the sequential out of memory version (LDRseq). We then demonstrate the speedups achieved in various settings while using k-means clustering on simulated data. In the following section, we implement consensus clustering on four real datasets of increasing size and demonstrate the gains achieved with LDR and LDRseq. We conclude with a discussion of the possible applications of LDR.

2 Materials and methods

In this section, we describe the LDR and LDRseq algorithms which can be used to reduce the size of a dataset while entirely preserving the original distance relationships.

It helps to view the n observations as a point-cloud in p dimensions. To avoid trivialities, we assume that there is no linear dependence among the n columns of X. For p ≥ n, the n points in R^p will fall in a hyperplane of dimension n – 1. Rigid motion of that plane does not change the relationship of the points to each other. The familiar example is that three points in three dimensions, not all on a line, define an ordinary plane. The proposed method consists of two main steps. First, we apply a translation to place one of the observations at the origin; we use x_n for convenience. That is, we replace x_i by $x_{i} - x_{n}$ for $i = 1, \dots, n$. Second, we apply a rigid rotation to transform p – n + 1 of the coordinates to zero. The remaining n – 1 non-zero coordinates, along with the point that was placed at the origin, contain all the information about the original point cloud, as far as invariant procedures are concerned. After rotation, the hyperplane will be parallel to n – 1 axes and orthogonal to the remaining p – n + 1 axes.

We use the QR decomposition to compute the rotated coordinates. For a p × m matrix X of rank m, the QR decomposition is an orthogonal transformation

$$Q^{\top} X = \begin{pmatrix} R \\ 0 \end{pmatrix},$$

where $Q_{p \times p}$ is orthogonal and $R_{m \times m}$ is upper-triangular (Stewart 1998, ch. 4). The matrix Q embodies a rigid rotation applied to the m points. In our application, m = n – 1, and Q need not be computed explicitly. Most software for QR decomposition will take $X_{p \times n}$ with zeros in the last column and return zeros in the last column of $R_{n \times n}$, in which case the last row in R can be deleted. The resulting (n – 1) × n matrix contains the rotated coordinates of the n points and is to be used in further processing and analyses instead of the original X. For p > n, this produces significant reduction in computational cost.
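The reason the rotated coordinates carry the same distance information can be written out in one line. The following is a short derivation sketch using only the orthogonality of Q; the symbol $r_i$, denoting the i-th column of R, is notation introduced here for illustration.

$$\lVert Q^{\top} x_i - Q^{\top} x_j \rVert^{2} = (x_i - x_j)^{\top} Q Q^{\top} (x_i - x_j) = \lVert x_i - x_j \rVert^{2}.$$

Because the i-th column of $Q^{\top} X$ is $r_i$ followed by zeros, the left-hand side equals $\lVert r_i - r_j \rVert^{2}$. Equivalently, $R^{\top} R = X^{\top} X$, so every quantity that depends on the columns of X only through inner products or Euclidean distances can be computed from R.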

The LDR algorithm is summarized as follows:

Input: X, a p × n matrix, p ≥ n.

Step 1: Replace each x_i by $x_{i} - x_{n}$, $i = 1, \dots, n$.
Step 2: Compute R in the QR decomposition of X.
Step 3: Delete the last row, a row of zeros, from R.

Output: R, an (n – 1) × n matrix.

We provide computer code in the Appendix.
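For readers who want to trace the three steps without a linear algebra library, the following is a minimal Python sketch. It is not the authors' implementation: it assumes a hand-written Householder QR in place of a library routine, and the names `ldr`, `sqdist` and the simulated test matrix are hypothetical choices made here for illustration.

```python
import math
import random


def ldr(x):
    """Reduce a p x n matrix (list of p rows) to (n - 1) x n, keeping column distances."""
    p, n = len(x), len(x[0])
    # Step 1: translate so that the last observation sits at the origin.
    a = [[row[j] - row[n - 1] for j in range(n)] for row in x]
    # Step 2: Householder QR, keeping only R (Q is never formed).
    for k in range(n - 1):
        col = [a[i][k] for i in range(k, p)]
        norm = math.sqrt(sum(t * t for t in col))
        if norm == 0.0:
            continue  # column is already zero on and below the diagonal
        v = col[:]
        v[0] += math.copysign(norm, col[0])  # sign choice avoids cancellation
        vv = sum(t * t for t in v)
        for j in range(k, n):
            f = 2.0 * sum(v[i] * a[k + i][j] for i in range(p - k)) / vv
            for i in range(p - k):
                a[k + i][j] -= f * v[i]
    # Step 3: only the first n - 1 rows can be non-zero; keep the upper triangle.
    return [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n - 1)]


def sqdist(m, i, j):
    """Squared Euclidean distance between columns i and j of m."""
    return sum((row[i] - row[j]) ** 2 for row in m)


# Simulated standard normal data, p = 200 variables on n = 6 samples.
rng = random.Random(0)
p, n = 200, 6
x = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
r = ldr(x)
worst = max(abs(sqdist(x, i, j) - sqdist(r, i, j))
            for i in range(n) for j in range(i + 1, n))
print(len(r), len(r[0]), worst)
```

The printed discrepancy should be at the level of floating-point rounding. Signs of individual entries of R can differ between QR implementations, which does not affect the distances.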

Table 1 presents a short example that illustrates what the algorithm achieves. The data matrix representing p = 5 variables or features on n = 4 samples or observations is followed by the 3 × 4 reduced matrix computed by LDR. It can be verified that all six pairwise Euclidean distances are preserved. For example, the squared Euclidean distance between samples 1 and 2, based on the original data, is ${(5 - 1)}^{2} + {(1 - 3)}^{2} + {(5 - 3)}^{2} + {(4 - 3)}^{2} + {(5 - 0)}^{2} = 50$. The same distance is obtained from the reduced matrix computed by LDR: ${(- 6.164414 + 1.297771)}^{2} + {(0 - 5.129892)}^{2} + {(0 - 0)}^{2} = 50$. Distances between cluster centroids are also preserved. As an example, suppose that we group samples 1 and 2 as one cluster and samples 3 and 4 as another. The cluster centroids in terms of the original data are (3, 2, 4, 3.5, 2.5) and (4.5, 3, 3.5, 1, 3), and the squared distance between them is 10. After applying LDR, the centroids are (–3.731093, 2.564946, 0) and (–1.9466571, 0.2872739, 1.275931), and the same squared distance is obtained.

Table 1.

An example illustrating LDR

Variable	Sample 1	Sample 2	Sample 3	Sample 4
Original data matrix
1	5	1	4	5
2	1	3	1	5
3	5	3	4	3
4	4	3	1	1
5	5	0	4	2
Reduced data matrix
1	−6.164414	−1.297771	−3.8933141	0
2	0	5.129892	0.5745479	0
3	0	0	2.5518621	0

Note: The data matrix represents p = 5 variables measured on n = 4 samples. The reduced matrix produced by LDR contains three variables.
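The distance claims made for Table 1 can be checked directly from the printed values. The short Python sketch below does this; the helper names `column`, `sqdist` and `centroid` are hypothetical, and the tolerance of 1e-4 is an assumption chosen to allow for the rounding of the published reduced matrix.

```python
# Values copied from Table 1: rows are variables, columns are samples.
original = [
    [5, 1, 4, 5],
    [1, 3, 1, 5],
    [5, 3, 4, 3],
    [4, 3, 1, 1],
    [5, 0, 4, 2],
]
reduced = [
    [-6.164414, -1.297771, -3.8933141, 0.0],
    [0.0, 5.129892, 0.5745479, 0.0],
    [0.0, 0.0, 2.5518621, 0.0],
]


def column(m, j):
    """Column j of m as a list."""
    return [row[j] for row in m]


def sqdist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(m, cols):
    """Mean of the listed columns of m."""
    return [sum(row[j] for j in cols) / len(cols) for row in m]


# All six pairwise squared distances, before and after reduction.
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
gaps = [abs(sqdist(column(original, i), column(original, j))
            - sqdist(column(reduced, i), column(reduced, j)))
        for i, j in pairs]

# Centroid distance for clusters {1, 2} and {3, 4} (0-based columns).
d_orig = sqdist(centroid(original, [0, 1]), centroid(original, [2, 3]))
d_red = sqdist(centroid(reduced, [0, 1]), centroid(reduced, [2, 3]))

print(max(gaps) < 1e-4, d_orig, abs(d_red - d_orig) < 1e-4)
```

Centroids are averages of columns, so they are rotated and translated together with the points, which is why the between-centroid distance survives the reduction as well.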

The LDR transformation should be applied as the very last step just prior to calling the analysis algorithm proper. Any preprocessing such as denoising, filtering, imputation and normalization should be performed before applying LDR.

It should be noted that LDR is not unique in its ability to preserve distances. The same objective can be achieved with classical MDS, as presented by Torgerson (1952). The aim of MDS is to reduce dimensionality while minimizing alterations in the distance structure of the original data. It has long been known that when reducing from a p × n matrix with p > n to a dimension greater than or equal to n – 1, the distances can be preserved exactly. Compared with LDR, the resulting reduced matrix will not be identical, but the distances will be the same. So in theory both could be used to accelerate other analyses. However, this special case has been largely ignored, since reducing to n – 1 dimensions does not achieve the usual objectives of MDS.

According to Borg and Groenen (2005, Ch. 1), the four purposes of MDS are the visual inspection of dissimilarity in a low dimensional space, testing structural hypotheses, discovery of the dimensions that underlie similarity, and the modeling of similarity judgments. Noticeably absent from these objectives is the idea of using MDS to accelerate other analyses, which is exactly our purpose. Borg further argues that the term MDS should only be used when reducing to a dimension low enough for visual representation (Borg et al. 2012, Ch. 7.1). Thus, it should be no surprise that the MDS algorithm is not optimal for the purpose described in this paper. The algorithm we present requires fewer steps and demonstrates greater benefits relative to MDS as the dimension of the data increases. Furthermore, our algorithm can be adapted for sequential execution, which makes it possible to quickly analyze data even when it would otherwise exceed the limitations imposed by a computer’s memory.
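For comparison, the textbook form of classical MDS can be written out as follows. This is a sketch in notation introduced here (the symbols $D^{(2)}$, $J$, $B$, $V$, $\Lambda$ and $Y$ do not appear in the original text), and a particular implementation may organize the work differently.

$$J = I_n - \tfrac{1}{n} \mathbf{1}\mathbf{1}^{\top}, \qquad B = -\tfrac{1}{2} J D^{(2)} J, \qquad B = V \Lambda V^{\top}, \qquad Y = \Lambda_{+}^{1/2} V_{+}^{\top}.$$

Here $D^{(2)}$ is the n × n matrix of squared Euclidean distances between the columns of X, $\Lambda_{+}$ holds the positive eigenvalues of $B$ (at most n – 1 of them) and $V_{+}$ the corresponding eigenvectors. When all positive eigenvalues are retained, the columns of $Y$ reproduce every pairwise distance exactly. Under this formulation, MDS first forms all pairwise distances in p dimensions and then solves an n × n eigenproblem, whereas LDR needs only a translation and a single QR factorization.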

3 LDR out of memory

LDR works because the R matrix from a QR decomposition preserves the distances between the columns of a matrix. In situations where the matrix is too large to be read into system memory, the Q matrix will also be too large. However, the R matrix, the only part needed for LDR, can always be written as an upper triangular matrix with dimensions determined by the number of columns. Furthermore, it is possible to add a row to the X matrix and update the R matrix without needing X or Q. This enables us to create a sequential version of LDR (LDRseq), which can be used even when analyzing the data would otherwise result in memory errors. The updating strategy in LDRseq is a modified version of the algorithm presented by Miller (1992), originally used as a step in the updating of linear regression coefficients.

The LDRseq algorithm is summarized as follows:

Input: filename, a path to a text file containing X, a p × n matrix, p ≥ n.

Step 1: Read in matrix X_q from the first q rows of X, such that q is large enough to guarantee that X_q is full rank.
Step 2: Compute R from the QR decomposition of X_q.
Step 3: Read in x_{q+1}, the (q + 1)th row of X, and append it to the bottom of R.
Step 4: Perform n sequential Givens rotations, one per column, to make the augmented R upper triangular.
Step 5: Delete the last row, a row of zeros, from the updated R matrix.
Step 6: Repeat steps 3–5 until the end of the data.

Output: R, an n × n matrix.

The final output is the R matrix from a QR decomposition on the full dataset. Note that the corresponding Q matrix is never computed and having the full data in memory is unnecessary. We present an R version of the algorithm utilizing C++ code through Rcpp for fast computation on very large datasets. The algorithm, if needed, is easily parallelizable and requires only a single pass through X. The complexity of this algorithm is O(n²p), making it ideal for high dimensional analysis.

Rcpp code for LDRseq can be found in the Appendix.
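The row-by-row update at the heart of the sequential algorithm can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' Rcpp code: the function names `givens_update` and `ldr_seq` are hypothetical, and for brevity the sketch starts from R = 0 instead of an initial QR of the first q rows, which yields the same Gram matrix $R^{\top} R = X^{\top} X$.

```python
import math
import random


def givens_update(r, row):
    """Fold one new data row into the n x n upper-triangular factor r, in place."""
    n = len(r)
    w = list(row)
    for j in range(n):
        h = math.hypot(r[j][j], w[j])
        if h == 0.0:
            continue  # nothing to eliminate in this column
        c, s = r[j][j] / h, w[j] / h
        for k in range(j, n):
            rjk, wk = r[j][k], w[k]
            r[j][k] = c * rjk + s * wk   # updated row j of R
            w[k] = -s * rjk + c * wk     # entry j of the new row is rotated to zero
    return r


def ldr_seq(rows, n):
    """Stream the rows of X (any iterable of length-n sequences) and return R."""
    r = [[0.0] * n for _ in range(n)]  # assumption: start from R = 0
    for row in rows:
        givens_update(r, row)
    return r


def sqdist(m, i, j):
    """Squared Euclidean distance between columns i and j of m."""
    return sum((row[i] - row[j]) ** 2 for row in m)


# Simulated data: only one row at a time is handed to the update.
rng = random.Random(1)
p, n = 500, 5
x = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
r = ldr_seq(iter(x), n)
worst = max(abs(sqdist(x, i, j) - sqdist(r, i, j))
            for i in range(n) for j in range(i + 1, n))
print(worst)
```

Each incoming row costs n rotations of at most n entries, so a full pass costs on the order of n²p operations while holding only the n × n factor in memory. In practice `rows` could be any iterator over lines of a file, which is the situation LDRseq is designed for.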

4 Performance

Here, we provide data on the speedup obtained with K-means in scenarios that mimic those encountered in practice. There are two ways to define run time with LDR. The first is to include LDR in each run of K-means. The second definition of run time is to run LDR only once, and then repeatedly call K-means. We timed K-means, clustering the data into five clusters, with ‘nstart’ number of starting points using R 3.2 on the Windows operating system. All data were generated as iid from the standard normal distribution. The speedup factor was computed as (time without LDR)/(time with LDR).

The timing data in Table 2 demonstrate that even if K-means is run once with nstart = 1, LDR provides some speedup. But with larger values of nstart, as is often recommended, speedups range from about 5 for p = 500 to 37 for p = 10 000. The second definition of run time is the one that is relevant to iterative and resampling procedures such as consensus clustering (Monti et al. 2003) and ensemble clustering (Strehl and Ghosh 2002). The corresponding speedups range from about 6 for p = 500 to 197 for p = 10 000. With real datasets containing millions of variables much larger speedup factors will be obtained, as we will see in the following section.

Table 2.

Speed up factors for K-means clustering with ‘nstart’ starting positions, p variables and n = 100 samples

p	nstart	LDR each time	LDR once
500	1	1.5	6.1
500	5	5.1	6.7
500	10	5.2	6.7
500	25	5.5	6.7
1000	1	1.9	13
1000	5	8.8	15
1000	10	9.2	14
1000	25	11	15
10 000	1	2.4	168
10 000	5	22	179
10 000	10	26	177
10 000	25	37	197

Note: Speedup factors are defined as (time without LDR)/(time with LDR). Factors were computed for two scenarios; with LDR used in each iteration and with only a single LDR pre-processing step.

5 Real data

To demonstrate the advantages of LDR in realistic scenarios we will analyze four datasets of increasing size. These datasets will be used to compare the relative computation times for distance preserving transformation with LDR and MDS. The data will also be used to demonstrate the absolute gains in computing time realized by attempting a computationally intensive clustering algorithm on high dimensional data.

Dataset 1 comes from a yeast proteomics experiment containing relative protein estimates from 5068 proteins measured in 9 different samples (Paulo et al. 2016). The estimates were derived from a much larger dataset containing peptide level intensities from 167 253 peptides measured across 10 samples. The peptide level data matrix will be used as dataset 2. Both the 5068 × 9 dataset 1 and the 167 253 × 10 dataset 2 can be obtained through the online version of the publication (Paulo et al. 2016).

DNA methylation studies provide us with examples that contain substantially larger numbers of features. Bisulfite sequence data from 25 human tissue samples were downloaded from the Human Epigenome Atlas https://www.genboree.org/epigenomeatlas/multiGridViewerPublic.rhtml. From these files we created two datasets. The first, dataset 3, contains only genome markers common to all samples, from chromosome 1. The resulting object is a 1 375 944 × 25 matrix. Dataset 4 was generated from the same files but contains all common positions, across tissues, from all 23 chromosomes. Dataset 4 is a 15 056 536 × 25 matrix, which was too large to be processed in R on our desktop PCs. Accordingly, we only analyze dataset 4 by using LDRseq.

5.1 Comparison of LDR and MDS

Both LDR and MDS can be used to reduce the size of a matrix for the objective of accelerating future analyses. We will now compare the relative performance of the two algorithms. This comparison is admittedly somewhat forced, since, to the best of our knowledge, nobody has ever used MDS to accelerate clustering algorithms and it contains steps that are unnecessary for our objectives. Nonetheless, the comparison is necessary for us to justify using LDR in place of MDS. We do not make any comparisons of MDS to LDRseq, since we only recommend using LDRseq in applications where MDS would fail entirely.

LDR and MDS were compared on datasets 1–3 with the microbenchmark package in R 3.2.3. LDR was computed with the R code presented in the online supplement and MDS was performed with the cmdscale function from the stats package. The microbenchmark function was used to time 1000 calls to both LDR and MDS on each of the datasets. The median processing times for datasets 1–3 were, respectively, 7.89, 118.21 and 2841.31 ms using LDR and 5.09, 297.60 and 43 231.15 ms using MDS. The relative speedups obtained with LDR, shown in Figure 1, suggest that the computational advantage of LDR over MDS increases in an approximately linear relationship with the number of variables in the dataset. The observed advantage of LDR is unsurprising since fewer operations are required. However, it should be made clear that the speedup is only for the distance preserving transformation. As we will show in the next section, the transformation time is often negligible relative to the time saved performing a subsequent analysis with a substantially smaller matrix.

Fig. 1. Speedups are calculated from the median computation times of 1000 calls to LDR and MDS on matrices of size $5068 \times 9$, $167253 \times 10$ and $1375944 \times 25$. The speedup factor is the median time for MDS divided by the median time for LDR.
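Since the figure itself is not reproduced here, the speedup factors it plots can be recomputed from the median times quoted in the text. The following Python lines do only that arithmetic; the variable names are hypothetical.

```python
# Median times in ms quoted in the text for datasets 1-3.
ldr_ms = [7.89, 118.21, 2841.31]
mds_ms = [5.09, 297.60, 43231.15]

# Speedup factor as defined in the caption: median MDS time / median LDR time.
speedup = [mds / ldr for mds, ldr in zip(mds_ms, ldr_ms)]
for k, s in enumerate(speedup, start=1):
    print(f"dataset {k}: {s:.2f}")
```

The factor for dataset 1 falls below 1, meaning that MDS had the shorter median time on the smallest matrix, while the factor for dataset 3 is roughly 15.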

5.2 Analysis of absolute time saved

We expect that the greatest gains to be achieved with LDR will occur when performing computationally intensive, resampling based, clustering techniques on high dimensional data. To demonstrate the magnitude of these gains we performed Consensus Clustering on datasets 1–3. All code was executed in R 3.2.3 using the ConsensusClusterPlus package (Wilkerson and Hayes 2010) with 1000 reps and k-means clustering as the base method.

A direct application of ConsensusClusterPlus to dataset 1 required 103.81 s. Implementing both the LDR transformation and the ConsensusCluster algorithm reduced the computational time to 1.56 s. Although this is a nice improvement to processing time, it is minimal compared with the advantage when clustering the larger datasets. For dataset 2, executing both LDR and Consensus Clustering took 1.75 s. Without LDR, the processing time increased to 3582.11 s (59.70 min). For dataset 3, times with and without LDR were 4.79 and 30 336.68 s (8.43 h), respectively.

Dataset 4 was sufficiently large to cause difficulties reading the file into R. We managed to do this in a reasonable amount of time using the fread() function from the data.table package. However, attempting to analyze these data quickly resulted in a memory error. This size limitation motivated the creation of the LDRseq algorithm, which we coded in Rcpp to provide a fast transformation that can be used in the platform most likely to struggle with memory limitations. Passing the 15 056 536 × 25 matrix to LDRseq returns a 25 × 25 matrix object. The combination of LDRseq and Consensus Clustering took 321.18 s. This transformation should be noticeably slower than LDR, but the basic LDR algorithm could not be applied to this dataset without upgrading our computational power.

Complete performance figures, summarized in Table 3, show that while computation times can be drastically impacted by the dimensionality of a dataset, the increase realized when using the LDR transformation is negligible. Consequently, as the number of variables grows so does the amount of time saved by LDR. With a small number of variables the benefits are minor, while in the more extreme cases LDR, especially LDRseq, allows analyses that could not have otherwise been performed.

Table 3.

CPU time, in seconds, required for analyses with and without LDR

	Dataset 1	Dataset 2	Dataset 3	Dataset 4
Without LDR	103.81	3582.11	30 336.68	NA
With LDR/LDRseq	1.56	1.75	4.79	321.18

Note: For datasets 1–3, the “With LDR/LDRseq” row gives the combined time required to run both LDR and Consensus Clustering. For dataset 4, this row shows the time required to run LDRseq and Consensus Clustering.
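The entries of Table 3 translate into speedup factors and absolute savings with a few lines of arithmetic. The Python sketch below uses only the tabulated CPU times for datasets 1–3 (dataset 4 has no time without LDR); the variable names are hypothetical.

```python
# CPU seconds from Table 3, datasets 1-3.
without_ldr = [103.81, 3582.11, 30336.68]
with_ldr = [1.56, 1.75, 4.79]

speedups = [slow / fast for slow, fast in zip(without_ldr, with_ldr)]
saved_hours = [(slow - fast) / 3600.0 for slow, fast in zip(without_ldr, with_ldr)]

for k, (s, h) in enumerate(zip(speedups, saved_hours), start=1):
    print(f"dataset {k}: speedup {s:.0f}x, time saved {h:.2f} h")
```

The ratios grow from roughly 67 for dataset 1 to more than 6000 for dataset 3, in line with the observation that the time spent with LDR barely changes while the time without it grows with the number of variables.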

6 Discussion

With the emergence of high-throughput omics technologies, biologists have been confronted with many issues regarding handling, processing and analyzing very large datasets. Clustering is often an early step in the analysis of these data, as it can transform a large matrix of numerical values into a visual representation of relationships and trends.

For researchers working with high-dimensional data, LDR can save large amounts of processing time. These benefits grow with the dimensionality of the data. In the extreme case, with very big data, the LDRseq algorithm enables analyses that otherwise exceed the capabilities of a computational system. Increasing computational power is not always an available option. Furthermore, gaining access to and queuing jobs on large computer clusters can be difficult and time consuming in itself. We hope that our approach will make life easier for researchers working in these conditions. However, an argument can be made that a better use for LDR is to provide modest gains in time to a very large number of researchers.

To this end, the creators of clustering software are our target audience, and our message is the simplicity of the LDR code. By adding an “if” statement and five lines of code, almost every software package implementing clustering algorithms can be made more efficient without altering the results in any way.

As an example, McLachlan et al. (2003) analyzed a dataset consisting of n = 62 observations and p = 2000 variables, originally selected from a pool of 6500 variables. They employed an initial screening procedure to select 446 variables that were then used to cluster the 62 observations. The method we present would be beneficial after any of the initial reductions. LDR would speed up the analysis of any of the sets of 6500, 2000 or 446 variables by reducing it to 61 variables with no loss of information. The 61 × 62 reduced data matrix can be entered into further analyses, including lossy dimension reduction. We note that after clustering based on $R_{61 \times 62}$, we can go back to the original p variables, the rows of X, to study how they differ among the identified clusters. The reduction does not force the analyst to forget about the original variables.

We have been careful not to describe our algorithm in terms of dimension reduction. Our view is that for p ≥ n, the actual dimension of the problem of clustering of the n points or samples is n – 1. LDR forces the number of non-zero variables to equal the dimension of the problem. Feature selection, selecting a subset of p* variables ($p > p^{*} \geq n$) out of the original p, does not change the dimension of the problem. As Jain points out, feature selection often occurs not because it is desirable but because it is computationally necessary (Jain et al. 2000). This is precisely the problem solved by LDR; computational burden need not necessitate feature selection.

It should also be emphasized that though our focus has been on the acceleration of clustering algorithms, others may find the concept useful for a wide range of analyses. Our LDR procedure gives considerable speedup factors by computing a concise representation of the data that preserves all the relevant distance information. Classification methods such as linear discriminant analysis and logistic regression, principal components, and essentially all linear procedures fall in the realm of methods for which the data reduction described above is lossless. Kernel density estimation is an example of an equivariant procedure. Translation, rigid rotation and reflection of the n points only transform the density contours in an equivalent manner, but otherwise the density estimate is the same. Clearly, not all multivariate methods are invariant to rigid rotation and reflection; examples include regression trees (Breiman et al. 1984, Ch. 2 & 8) and clustering based on the L₁ norm (Jajuga 1987; Sabo 2014).

The greatest benefits of LDR are achieved when implementing resampling based techniques on high-dimensional datasets. For the analyses considered in this article, LDR saved substantial amounts of processing time. The benefits of LDRseq went further still by enabling data analysis on computing systems that would otherwise not be possible. The trick is simple, the results are unaltered, and the gains are potentially substantial. For these reasons we believe LDR should be incorporated as a standard step in all software implementing clustering algorithms.

Appendix A

A1. CODE

We first present the simple code for LDR in R, Matlab and SAS. Then we provide the somewhat more complicated LDRseq code, which is written in R and makes use of Rcpp.

The LDR functions below take a p × n matrix X, p ≥ n, representing n observations in p dimensions, and return an (n – 1) × n matrix containing the coordinates of the n points in a hyperplane in n – 1 dimensions.

R code:

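ldr = function(x) {
  n = ncol(x)
  # translate so that the last observation sits at the origin
  s = qr(x - x[,n])
  # qr() may pivot columns (x[, pivot] = Q R); order() restores the original order
  r = qr.R(s)[, order(s$pivot)]
  # drop the last row, a row of zeros
  return(r[-n,])
}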

MATLAB code:

function r = ldr(x)
  n = size(x, 2);
  x = x - repmat(x(:,n), 1, n);
  [q, r] = qr(x, 0);
  r = r(1:(n-1),:);

SAS/IML code:

start ldr (x);
  n = ncol(x);
  call qr(q, r, pivot, lindep, x - x[,n]);
  r = r[,pivot];
  return(r[1:(n-1),]);
finish;

R code for LDRseq: This function is designed to perform LDR when X is very large relative to system memory. The function operates on a data.frame object in R in order to be compatible with the fread() function from the data.table package. fread is a very fast function for reading tables into R. The C++ code performs the sequential algorithm described in the article, while the R function directly below provides a wrapper which processes blocks of the matrix one at a time. Users can avoid memory problems by specifying smaller blocks of data to pass to the sequential algorithm.

#requires Rtools, Rcpp.
#input: data (z x a array, z >> a), R from a QR of the
#first xx obs of data, a <= xx << z (a x a array), a, z.
#output: R wrt QR of data (a x a array).
library(Rcpp)
cppFunction('NumericMatrix xp(const DataFrame
&MMG, NumericMatrix &IC, int &a, int &z, int &xx){
  NumericMatrix GIJ = IC;
  double u, v, x, ghi, ghj;
  NumericVector JC(a), JCT(z);
  // loop over the rows of the block, starting at row xx
  for(int i = xx; i < z; ++i){
    // copy row i of the data frame into JC
    for(int k = 0; k < a; ++k){
      JCT = MMG[k];
      JC[k]=JCT[i];
    }
    // one Givens rotation per column zeroes JC against row j of R
    for(int j = 0; j < a; ++j){
      u = GIJ(j,j);
      v = JC[j];
      x = sqrt(u*u + v*v);
      u/=x;
      v/=x;
      for(int k = j; k < a; ++k){
        ghi = JC[k];
        ghj = GIJ(j,k);
        JC[k]=-v*ghj + u*ghi;
        GIJ(j,k)=u*ghj + v*ghi;
      }
    }
  }
  return(GIJ);
}')

LDRseq <- function(dat, blockSize = 100000){
  # assumes nrow(dat) > blockSize, i.e. at least two blocks
  n_c <- ncol(dat)
  n_r <- nrow(dat)
  n_blocks <- ceiling(n_r/blockSize)
  # number of rows in the final, possibly partial, block
  n_last <- n_r %% blockSize
  if(n_last == 0){n_last <- blockSize}
  #create indices for each block
  ep <- (1:(n_blocks-1))*blockSize
  ep <- c(ep, ep[n_blocks-1]+n_last)
  sp <- (0:(n_blocks - 1)) * blockSize + 1
  #initialize R
  R_ <- qr.R(qr(as.matrix(dat[sp[1]:ep[1],])))
  for (i in 2:n_blocks){
    block <- dat[sp[i]:ep[i],]
    nRB <- nrow(block)
    R_ <- xp(block, R_, n_c, nRB, 0)
  }
  R_
}

Funding

This study was supported by the National Cancer Institute through the grant “Biostatistics for Research in Genomics and Cancer”, NCI Grant 5T32CA106209-07 (T32).

Conflict of Interest: none declared.

References

Beyer K. et al. (1999) When is nearest neighbor meaningful? Database Theory ICDT 99, pp. 217–235.
Borg I., Groenen P. (2005) Modern Multidimensional Scaling—Theory and Applications Springer Series in Statistics. Springer, New York. [Google Scholar]
Borg I. et al. (2012) Typical mistakes in MDS In Applied Multidimensional Scaling, Springer Briefs in Statistics, pp. 59–80. Springer, Berlin. [Google Scholar]
Breiman L. et al. (1984) Classification and Regression Trees, Vol 19 CRC Press, Boca Raton. [Google Scholar]
Cover T.M., Campenhout J.M.V. (1977) On the possible orderings in the measurement selection problem. IEEE Trans. Syst. Man Cybern., 7, 657–661.
Dudoit R. et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.
Dudoit S., Fridlyand J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol., 3, research0036.1.
Jain A. et al. (2000) Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell., 22, 4–37.
Jain A.K. (2010) Data clustering: 50 years beyond K-means. Pattern Recogn. Lett., 31, 651–666.
Jajuga K. (1987) A clustering method based on the L1-norm. Comput. Stat. Data Anal., 5, 357–371.
Koboldt D.C. et al. (2012) Comprehensive molecular portraits of human breast tumours. Nature, 490, 61–70.
Lu Y., Han J. (2003) Cancer classification using gene expression data. Inf. Syst., 28, 243–268.
McLachlan G., Khan N. (2004) On a resampling approach for tests on the number of clusters with mixture model-based clustering of tissue samples. J. Multivariate Anal., 90, 90–105.
McLachlan G.J. et al. (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal., 41, 379–388.
Miller A.J. (1992) Algorithm AS 274: least squares routines to supplement those of Gentleman. Appl. Stat., 41, 458.
Monti S. et al. (2003) Consensus clustering: a resampling based method for class discovery and visualization of gene expression microarray data. Mach. Learn., 52, 91–118.
Paulo J.A. et al. (2016) Quantitative mass spectrometry-based multiplexing compares the abundance of 5000 S. cerevisiae proteins across 10 carbon sources. J. Proteom., 148, 85–93.
Ringnér M. (2008) What is principal component analysis? Nat. Biotechnol., 26, 303–304.
Sabo K. (2014) Center based l1 clustering method. Int. J. Appl. Math. Comput. Sci., 24, 151–163.
Stewart G.W. (1998) Matrix Algorithms: Volume 1: Basic Decompositions. SIAM.
Strehl A., Ghosh J. (2002) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res., 3, 583–617.
Thalamuthu A. et al. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, 22, 2405–2412.
Torgerson W.S. (1952) Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401–419.
Tseng G.C., Wong W.H. (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics, 61, 10–16.
Tzeng J. et al. (2008) Multidimensional scaling for large genomic data sets. BMC Bioinformatics, 9, 179.
Vidal R. (2011) Subspace clustering. IEEE Signal Process. Mag., 28, 52–68.
Volkovich Z. et al. (2011) Resampling approach for cluster model selection. Mach. Learn., 85, 209–248.
Weekes M.P. et al. (2014) Quantitative temporal viromics: an approach to investigate host-pathogen interaction. Cell, 157, 1460–1472.
Wilkerson M.D., Hayes D.N. (2010) ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics, 26, 1572–1573.
Yeung K.Y., Ruzzo W.L. (2001) Principal component analysis for clustering gene expression data. Bioinformatics, 17, 763–774.

