Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2021 Apr 8;18(2):633–643. doi: 10.1109/TCBB.2019.2921577

Object weighting: a new clustering approach to deal with outliers and cluster overlap in computational biology

Alexandre Gondeau 1, Mohamed Hijri 2, Pedro Peres-Neto 3, Vladimir Makarenkov 4
PMCID: PMC8158064  NIHMSID: NIHMS1692932  PMID: 31180868

Abstract

Considerable efforts have been made over the last decades to improve the robustness of clustering algorithms against noise features and outliers, known to be important sources of error in clustering. Outliers dominate the sum-of-the-squares calculations and generate cluster overlap, thus leading to unreliable clustering results. They can be particularly detrimental in computational biology, e.g., when determining the number of clusters in gene expression data related to cancer or when inferring phylogenetic trees and networks. While the issue of feature weighting has been studied in detail, clustering methods that compute object weights automatically have received much less attention. Here we describe a new data partitioning method that includes an object-weighting step to assign higher weights to outliers and objects that cause cluster overlap. Different object weighting schemes, based on the Silhouette cluster validity index, the median and two intercluster distances, are defined. We compare our novel technique to a number of popular and efficient clustering algorithms, such as K-means, X-means, DAPC and Prediction Strength. In the presence of outliers and cluster overlap, our method largely outperforms X-means, DAPC and Prediction Strength as well as the K-means algorithm based on feature weighting.

Keywords: Clustering, K-means, object weighting, feature weighting, cancer gene expression data, phylogenetic data

1. Introduction

Clustering algorithms aiming to determine the class structure of datasets find numerous applications in computational biology [1], [2], [3]. The two main clustering approaches are hierarchical clustering and data partitioning. Hierarchical, or connectivity-based, clustering connects objects in a tree-like manner based on a given distance. The data partitioning approach includes centroid-based, distribution-based and density-based clustering methods. The main principles of clustering are homogeneity, meaning that objects of the same cluster are maximally close to each other, and separation, meaning that objects in separate clusters are maximally far apart from each other [4], [5]. In biology and bioinformatics, the true number of clusters present in the data is often unknown. It is a function of both the samples/objects and the features/genes being used.

Several traditional clustering methods have been adapted or directly applied to biomedical and bioinformatics data. Moreover, new clustering techniques specifically designed for the analysis of gene expression (or microarray) data have recently been proposed [6], [7]. For example, clustering of gene expression data involves either sample clustering, which groups together patients or tissues that are similarly affected by a disease, or gene clustering, which builds groups of genes with related expression patterns, i.e., those that react similarly to a particular experimental condition, those that are similarly affected by a disease or those that are functionally related [8]. Furthermore, in phylogenetics, hierarchical clustering is used to represent evolutionary histories of species and to predict the number of operational taxonomic units, while in ecology it allows for spatial and temporal comparisons of communities composed of multiple species [9]. In sequence analysis, clustering is used to group homologous sequences into gene families [10]. In human genetics, clustering helps infer population structures from similarity data. Clustering can also be useful for the analysis of complex biological networks consisting of thousands of nodes. It allows one to identify subnets, bottlenecks and network flows, and thus uncover functional modules and obtain hints about cellular organization [11], [12].

The main drawbacks of the existing clustering algorithms include their limited ability to deal with outliers, noise dimensions (i.e., noise features) and cluster overlap [13], [14], [15]. In many cases, high-throughput genomic and proteomic data encompass all of these potential sources of error, making it difficult to achieve reliable discovery of plausible clustering patterns [16].

Several researchers have studied the problem of outlier detection in computational biology. MacDonald and Ghosh [17] proposed the Cancer Outlier Profile Analysis (COPA) method based on robust centering and scaling of the data to help detect outlier samples. This method can be used to identify genes involved in recurrent chromosomal translocations, which are common in cancer and may cause the progression of the disease. Gunawardana et al. [18] presented two approaches to detect post-translationally regulated proteins as outliers in a regression problem. The authors showed that in many settings outliers are reliable candidates for post-translational regulation.

While several clustering algorithms that automatically assign weights to features to minimize the effect of noise dimensions have been proposed in the literature ([15], [19], [20]), less attention has been paid to object weighting. In this context, we have to mention the work of Tseng [21], who described a new penalized and weighted version of K-means [4] to partition “scattered objects” (outliers) into a new group in which the number of objects considered as extreme is controlled by the penalty parameter λ. The author showed how this method could be applied to analyze high-throughput genomic data. The weights are used to take into account prior knowledge of prescribed or proscribed patterns of cluster selection. This prior knowledge concerns multiple groups of objects that are likely to be clustered together. For example, in clustering of gene expression profiles, some sets of genes can be known from previous experiments to be co-regulated in some biological pathways. In the objective function to be minimized, a weight function can be associated with any particular gene clustering by giving low weights to the clusters of genes with similar expression patterns (e.g. genes with co-regulated biological functions) and high weights to the clusters of genes not showing these patterns. However, the method proposed by Tseng [21] does not calculate gene weights automatically; they need to be predefined and are the input parameters of the method. Moreover, all weights are fixed throughout the clustering process. The weight assigned to a given gene remains the same regardless of the cluster to which this gene is assigned. Furthermore, Kerdprasop et al. [22] proposed a variation of K-means [4] using a density biased sampling technique based on the reservoir method. The method described in [22] applies weights to individual objects depending on their density within clusters. Gebru et al. [23] introduced a new mixture model that assigns a weight to each observed object. Two expectation-maximization (EM) algorithms were described in [23]: the first one associates a fixed weight to every given object, while the second considers each weight as a random variable following a gamma distribution. Shen et al. [24] presented a weighted clustering method that allows for the presence of scattered noise genes and makes use of functional annotation data. Similarly to the method of Tseng [21], the objective function used in [24] is a summation of the weighted dispersions of clustered genes. However, unlike the method of Tseng that uses gene-specific weights, the method proposed by Shen et al. applies cluster-specific weights such that all the genes in the same cluster have the same weight.

In this paper, we present a new general clustering algorithm allowing one to assign weights to objects (as opposed to features) and to use them explicitly in a weighted version of the K-means objective function. The performance of the new algorithm, which works particularly well on datasets containing outliers and cluster overlap, is evaluated by comparing it to some efficient clustering techniques, such as X-means [25], Prediction strength [26], Discriminant analysis of principal components [27] and Feature Weighting (FW) K-means [15], as well as by analyzing well-known cancer gene expression [28] and phylogenetic [29] data.

2. Methods

The traditional K-means algorithm is arguably the most popular clustering method nowadays. Its most important variations have been described by MacQueen [4], Hartigan and Wong [30] and Lloyd [31]. K-means allows one to partition a given set X of n objects, x1,…, xn, into K disjoint clusters so that each cluster includes similar objects. The algorithm starts with a random initial partition of objects. At each iteration, K-means assigns a given object x to the cluster Sk whose center (i.e., centroid) ck is the nearest to it. The cluster centers are updated at the end of each step of the algorithm. In doing so, the following least-squares objective function is minimized:

L(S, C) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \sum_{v=1}^{V} (x_{iv} - c_{kv})^2, (1)

where v = 1, …, V are the features (variables or dimensions) characterizing the objects in X and ck ∈ C is the center (centroid) of the cluster Sk ∈ S with k = 1, …, K. The sets S and C are, respectively, the sets of clusters and cluster centers. The algorithm stops when the object assignments no longer change.

In spite of its popularity, the traditional K-means algorithm has a few limitations, the most important being the following: (i) the number of clusters K should be known in advance; (ii) the algorithm usually returns a local minimum of (1); (iii) the obtained clustering largely depends on the initial random partition, and the algorithm should be run hundreds (and sometimes thousands) of times with different initial random partitions in order to provide a plausible solution; (iv) all features and all objects contribute equally to the K-means objective function regardless of their degree of relevance. In this work, we address the last of these limitations.

Specifically, we propose to consider a non-negative weight wi, belonging to the set of object weights w, which will be assigned to the object xi (i = 1, …, n) in the K-means objective function described by Equation (1). Thus, our weight-dependent objective function is as follows:

L(S, C, w) = \sum_{k=1}^{K} \sum_{x_i \in S_k} w_i \left( \sum_{v=1}^{V} (x_{iv} - c_{kv})^2 \right). (2)

The object weights in Equation (2) can be used to account for the clustering quality of the objects. Specifically, the weights will be used to penalize the objects with a low contribution to clustering (e.g., outliers or objects generating cluster overlap) and give an advantage to the objects with a high contribution to clustering. Thus, well-clustered objects will get lower weights, whereas outliers will get higher weights.
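
As an illustration, the sketch below (in R, the language of the accompanying package, and not taken from the authors' implementation) evaluates the object-weighted criterion of Equation (2) for a given partition; the function name ow_objective and its arguments are hypothetical.

```r
# Minimal sketch of the weighted objective of Equation (2).
# X: n x V numeric matrix; cluster: integer vector of cluster labels in 1..K;
# w: vector of non-negative object weights. All names are illustrative.
ow_objective <- function(X, cluster, w) {
  total <- 0
  for (k in seq_len(max(cluster))) {
    members <- which(cluster == k)
    if (length(members) == 0) next                       # skip empty clusters
    ck <- colMeans(X[members, , drop = FALSE])           # centroid of cluster k
    d2 <- rowSums((X[members, , drop = FALSE] -
                   matrix(ck, length(members), ncol(X), byrow = TRUE))^2)
    total <- total + sum(w[members] * d2)                 # weighted within-cluster SS
  }
  total
}
```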

In addition, we will also consider an objective function including both object and feature weights. This function is as follows:

L(S, C, w, f) = \sum_{k=1}^{K} \sum_{x_i \in S_k} w_i \left( \sum_{v=1}^{V} f_{kv}^{\beta} (x_{iv} - c_{kv})^{\beta} \right)^{1/\beta}, (3)

where fkv is the weight of the feature v in the cluster Sk and β is the user-defined exponent (β is set to 2 in our study as we work with the Euclidean distance only). These feature weights are cluster-dependent. The feature weights can be calculated using the method proposed by de Amorim and Mirkin [15] that is based on the following formula:

f_{kv} = \frac{1}{\sum_{u=1}^{V} (D_{kv}/D_{ku})^{1/(\beta-1)}}, (4)

where $D_{kv} = \sum_{x_i \in S_k} (x_{iv} - c_{kv})^{\beta}$ is the dispersion of the feature v in the cluster Sk. The dispersion Dku is defined in a similar way. According to Formula (4), features with low dispersion will receive higher weights than features with high dispersion.
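
A minimal R sketch of this feature-weighting step is given below; it illustrates Formula (4) under the assumption β = 2 used in this study and is not the implementation of de Amorim and Mirkin [15]; the function name feature_weights is hypothetical.

```r
# Cluster-dependent feature weights of Formula (4); assumes all dispersions Dkv > 0.
feature_weights <- function(X, cluster, beta = 2) {
  K <- max(cluster); V <- ncol(X)
  f <- matrix(0, K, V)
  for (k in seq_len(K)) {
    members <- which(cluster == k)
    ck <- colMeans(X[members, , drop = FALSE])
    # Dkv: dispersion of feature v in cluster Sk
    D <- colSums((X[members, , drop = FALSE] -
                  matrix(ck, length(members), V, byrow = TRUE))^beta)
    for (v in seq_len(V)) {
      f[k, v] <- 1 / sum((D[v] / D)^(1 / (beta - 1)))
    }
  }
  f  # K x V matrix; within each cluster the feature weights sum to 1
}
```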

2.1. Weighting schemes

In this section, we describe and test seven object weighting schemes which can be used in the objective function of our data partitioning method, referred to here as OW (Object Weighting) K-means. We will consider both within- and between-cluster distances when assigning weights to objects. The first intuitive constraint (Constraint A), applied in all of our object weighting schemes, is that the object weights must be non-negative, i.e., wi ≥ 0, for all i = 1,…, n.

Weighting Schemes (WS) Silhouette, Median, Nearest centroid and Sum of distances are respectively based on: (1) the Silhouette width of a given object [5], (2) the Euclidean distance between the object and the median of its cluster, (3) the squared Euclidean distance between the object and the nearest centroid of a cluster to which this object does not belong, and (4) the sum of the squared Euclidean distances between the centroid of the object’s cluster and all other cluster centroids.

Weighting Schemes Silhouette(nk), Median(nk) and Nearest centroid(nk) are also based on: (5) the Silhouette width, (6) the distance to the cluster median, and (7) the distance between the object and the nearest centroid of a different cluster, but the weight calculation is carried out subject to an additional constraint (Constraint B) that requires that the sum of the object weights in each cluster be equal to the number of objects in this cluster. Constraint B is inspired by the objective function of traditional K-means, in which the weight of each object equals 1 and the sum of the object weights in a cluster equals the number of objects belonging to this cluster. Thus, in the weighting schemes based on Constraint B, the weight of the object xi, belonging to the cluster Sk and denoted here as wi(nk), is calculated as follows:

w_i(n_k) = \frac{w_i}{\sum_{j \in S_k} w_j} \times n_k, for all i = 1, …, n, (5)

where nk is the number of objects in Sk, wi is the raw weight of xi (i.e., the weight computed using the corresponding WS (1), (2) or (3)) and $\sum_{j \in S_k} w_j$ is the sum of the raw weights of all objects belonging to the cluster Sk.
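
The rescaling of Equation (5) amounts to normalizing the raw weights cluster by cluster; a short R sketch (with the hypothetical function name rescale_weights) is shown below.

```r
# Constraint B (Equation 5): within each cluster the raw weights are rescaled
# so that they sum to the cluster size nk.
rescale_weights <- function(w_raw, cluster) {
  w <- w_raw
  for (k in unique(cluster)) {
    members <- which(cluster == k)
    w[members] <- w_raw[members] / sum(w_raw[members]) * length(members)
  }
  w
}
```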

2.1.1. Weights based on the Silhouette index

The Silhouette cluster validity index aims at evaluating a clustering based on the comparison of distances between the groups and distances within the groups [5]. The Silhouette index of a given clustering is defined by using the Silhouette width of individual objects. The Silhouette width si of the object xi is defined as follows:

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, for all i = 1, …, n, (6)

where ai is the average distance between xi and all other objects of its cluster and bi is the smallest average distance between xi and all the objects of the clusters to which xi does not belong. The Silhouette cluster validity index, SI, characterizing the quality of a given clustering solution is defined as follows:

SI = \frac{1}{K} \sum_{k=1}^{K} \left( \frac{1}{n_k} \sum_{x_i \in S_k} s_i \right), (7)

where K is the number of clusters and nk is the number of objects in the cluster Sk. The Silhouette width of an object can be used as its weight because it reflects how well this object is clustered.

Given that the Silhouette width of an object ranges between −1 and 1, we have rescaled it to the interval [0, 1] in order to satisfy our non-negativity constraint applied to the object weights (Constraint A). After this rescaling (Equation 8, WS Silhouette), higher weights will be given to the objects with lower Silhouette indices:

w_i = \frac{1 - s_i}{2}, for all i = 1, …, n, (8)

where si is the Silhouette width of the object xi.
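
The Silhouette weighting scheme can be sketched in R as follows, assuming the cluster package is used to compute the per-object Silhouette widths; the function name silhouette_weights is hypothetical.

```r
# WS Silhouette (Equations 6 and 8): rescale the Silhouette widths from [-1, 1]
# to [0, 1] so that poorly clustered objects receive higher weights.
library(cluster)

silhouette_weights <- function(X, clustering) {
  sil <- silhouette(clustering, dist(X))  # per-object Silhouette widths s_i
  s <- sil[, "sil_width"]
  (1 - s) / 2                             # Equation (8): low s_i -> high weight
}
```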

2.1.2. Weights based on the median

Our second weighting scheme is based on the component-wise median. This intragroup weighting scheme is referred to as Median. It is well known that the median is more robust (less sensitive) to outliers than the mean, and it is usually easier to obtain reliable estimates of the median than of the mean due to this robustness. However, the median is not always more reliable than the mean: when the data come from distributions with thick tails, the sample median is more accurate than the sample mean, whereas when the data come from distributions with thin tails, such as the normal distribution, the sample mean is the more accurate measure [32].

Thus, the weight (i.e., clustering quality) of the object xi ∈ Sk can also be defined using the Euclidean distance between xi and the median of the elements of its cluster (WS Median):

w_i = \sqrt{\sum_{v=1}^{V} (x_{iv} - \mathrm{Mdn}_{kv})^2}, for all i = 1, …, n, (9)

where Mdnkv is the median value of the feature v in the cluster Sk. We also carried out some experiments with the geometric median and the cluster medoids, but these statistics, which take a longer time to calculate, did not bring any detectable advantage in terms of the clustering quality compared to the component-wise median.
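
A sketch of the Median weighting scheme (Equation 9) is given below; median_weights is a hypothetical name.

```r
# WS Median (Equation 9): the weight of an object is its Euclidean distance to
# the component-wise median of its cluster.
median_weights <- function(X, cluster) {
  w <- numeric(nrow(X))
  for (k in unique(cluster)) {
    members <- which(cluster == k)
    med <- apply(X[members, , drop = FALSE], 2, median)  # component-wise median
    w[members] <- sqrt(rowSums((X[members, , drop = FALSE] -
                                matrix(med, length(members), ncol(X), byrow = TRUE))^2))
  }
  w
}
```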

2.1.3. Weights based on the distance to the nearest centroid of a different cluster

This weighting method is based on the calculation of the squared Euclidean distance between a given object xi ∈ Sk and the nearest centroid of a cluster to which xi does not belong. This inter-group weighting scheme is referred to here as Nearest centroid. The weight of the object xi is calculated as follows:

w_i = \frac{1}{\min_{S_{k'} \neq S_k} \sum_{v=1}^{V} (x_{iv} - c_{k'v})^2}, for all i = 1, …, n, (10)

where ck′ is the centroid of the cluster Sk′, different from Sk. The main principle of the Nearest centroid weighting scheme is the following: the farther an object is located from the nearest centroid of a different cluster, the lower its weight will be (i.e., the less it will be penalized).
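
The following R sketch illustrates the Nearest centroid scheme of Equation (10); nearest_centroid_weights is a hypothetical name, and the code assumes that every cluster is non-empty and that no object coincides with a foreign centroid.

```r
# WS Nearest centroid (Equation 10): inverse of the squared distance to the
# nearest centroid of a cluster the object does not belong to.
nearest_centroid_weights <- function(X, cluster) {
  K <- max(cluster)
  centroids <- t(sapply(seq_len(K),
                        function(k) colMeans(X[cluster == k, , drop = FALSE])))
  w <- numeric(nrow(X))
  for (i in seq_len(nrow(X))) {
    d2 <- rowSums((centroids - matrix(X[i, ], K, ncol(X), byrow = TRUE))^2)
    w[i] <- 1 / min(d2[-cluster[i]])  # exclude the object's own cluster
  }
  w
}
```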

2.1.4. Weights based on the sum of distances to the centroids of the other clusters

This weighting scheme, referred to here as Sum of distances, is based on the sum of the squared Euclidean distances between the centroid of the object's cluster and the centroids of all other clusters. In this inter-group weighting scheme, the weight of the object xi ∈ Sk is computed as follows:

w_i = \frac{1}{\sum_{k' \neq k}^{K} \sum_{v=1}^{V} (c_{kv} - c_{k'v})^2}, for all i = 1, …, n, (11)

where ck and ck′ are the centroids of the clusters Sk and Sk′, respectively. The main advantage of this weighting scheme is the speed of the weight calculation. The weights of all objects of a given cluster are equal in this scheme, so the application of Constraint B does not provide any difference in computation here.
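
Since this scheme gives the same weight to all objects of a cluster, it can be computed directly from the centroid matrix, as in the sketch below (hypothetical name sum_of_distances_weights).

```r
# WS Sum of distances (Equation 11): inverse of the summed squared distances
# between the object's cluster centroid and all other cluster centroids.
sum_of_distances_weights <- function(X, cluster) {
  K <- max(cluster)
  centroids <- t(sapply(seq_len(K),
                        function(k) colMeans(X[cluster == k, , drop = FALSE])))
  D2 <- as.matrix(dist(centroids))^2  # pairwise squared centroid distances
  cluster_w <- 1 / rowSums(D2)        # one weight per cluster
  cluster_w[cluster]                  # each object inherits its cluster's weight
}
```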

2.1.5. An overview figure

Figure 1 presents an overview of the four main weighting schemes described in this paper. Here, we consider Objects 1 to 11 belonging to Cluster 1 and five other clusters (Clusters 2 to 6) whose objects and centroids are located at different distances from Cluster 1. According to the Median WS, Objects 1 and 8, located far away from the component-wise median (and the centroid) of Cluster 1, will get high weights (i.e., be penalized) in the objective function of our algorithm. Objects 2, 3 and 4 will have high weights in the Nearest centroid WS as they are located close to the centroid of Cluster 2. Objects 5, 6 and 7 will have high weights in the Silhouette WS as they are located very close to some objects of Cluster 3. Finally, Objects 9, 10 and 11 will have high weights in the Sum of distances to the other centroids WS as they are located closer to the centroids of Clusters 4 to 6 (i.e., the majority of distinct clusters) than the other objects of Cluster 1. It may be difficult to determine in advance which weighting scheme should be applied in each practical situation.

Fig. 1.

Overview of the four main object weighting schemes introduced in this paper: WS Silhouette, Median, Nearest centroid and Sum of distances to the other centroids. Objects 1 to 11 belonging to Cluster 1 are considered. Objects 1 and 8 will have high weights (i.e. be penalized) in the Median WS; Objects 2, 3 and 4 will have high weights in the Nearest centroid WS; Objects 5, 6 and 7 will have high weights in the Silhouette WS; Objects 9, 10 and 11 will have high weights in the Sum of distances to the other centroids WS.

2.2. Object weighting algorithm OW K-means

Here we present our object weighting K-means algorithm that can be used with the objective functions defined by Equations (2) and (3). Our algorithm follows the version of the traditional K-means algorithm described by Hartigan and Wong [30]. These authors proposed a clustering method in which the objects are moved, or not, from one cluster to another depending on the increase or decrease of the value of a specific objective function. The object weighting schemes presented in Section 2.1 cannot be directly incorporated into the conventional MacQueen’s [4] and Lloyd’s [31] versions of K-means because of the explicit use of the objective function (Equation 2 or 3) and repeated updates of the object weights and cluster centroids (Step 3), as described in Algorithm 1 below.

Algorithm 1.

Object weighting K-means

Input:
- the dataset X with n objects
- the object weighting scheme (WS 1 to 7 defined in Section 2.1)
Output:
- the best clustering according to the selected WS
- the object weights and the cluster centroids
1. Parameter setting. Choose the number of clusters K and the weighting scheme (see Equations 5–11). Set wi = 1, for i = 1,…,n.
2. Setting initial centers and clusters. Assign K objects from X, selected at random, to be the initial cluster centers c1, c2,…, cK. Assign each object xi ∈ X to the cluster Sk represented by the nearest ck as per Equation (2) to form the initial clustering S.
3. Cluster update. Let $L(S_k, c_k, w) = \sum_{x_i \in S_k} w_i \sum_{v=1}^{V} (x_{iv} - c_{kv})^2$ be the weighted within-cluster sum of squares of the cluster Sk. For all objects xi ∈ Sk and all clusters Sk (k = 1,…,K) do:
If there exists a cluster Sk′ such that after moving xi from Sk to Sk′, we have:
DIFF = L(Sk, ck, w) + L(Sk′, ck′, w) − L(Sk \ {xi}, ck, w) − L(Sk′ ∪ {xi}, ck′, w) > 0,
then assign xi to the cluster Sk′ that provides the maximum of DIFF. The cluster centers and object weights in both Sk and Sk′ are updated when calculating DIFF. When all objects of X are examined, a new clustering S′ = {S1, S2,…, SK} is generated.
4. Decision step. If S′ = S or the maximum number of K-means iterations I is reached, then output the generated clustering S′, cluster centers C and object weights w, and end the computation; otherwise go to Step 3.

It is worth noting that Step 3 above can also include the update of feature weights. This can be done using Equations 3 and 4. The running time of Algorithm 1 is O(n²VI), where I is the maximum number of K-means iterations. Because the Cluster update step (Step 3) for a given object xi can be completed in O(nV), the running time of Algorithm 1 does not depend on the number of clusters K; it is slightly higher than that of Lloyd's K-means (which is O(nVKI)), but equivalent to that of Hartigan and Wong's version of K-means. Similarly to the traditional K-means algorithm, Algorithm 1 should be carried out with several initial random partitions (100 to 1000 random starts are usually recommended) and different numbers of clusters K. The clustering providing the optimum of the selected cluster validity index (e.g., Silhouette or Calinski-Harabasz) is then chosen as the final solution.
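
To make the structure of Algorithm 1 concrete, a simplified R sketch is given below. It is not the authors' RWeightedKmeans implementation: it recomputes the full objective for each candidate move instead of the incremental DIFF update, performs a single random start, and reuses the hypothetical helpers ow_objective and the weighting functions sketched in Section 2.1 (passed as weight_fun).

```r
# Simplified, single-start sketch of Algorithm 1 (OW K-means).
ow_kmeans <- function(X, K, weight_fun, max_iter = 100) {
  n <- nrow(X)
  # Step 2: random initial centers and nearest-center assignment
  centers <- X[sample(n, K), , drop = FALSE]
  cluster <- apply(X, 1, function(x) which.min(colSums((t(centers) - x)^2)))
  w <- rep(1, n)                                 # Step 1: unit initial weights
  for (iter in seq_len(max_iter)) {
    old <- cluster
    w <- weight_fun(X, cluster)                  # update object weights
    for (i in seq_len(n)) {
      # Step 3: move x_i to the cluster that most decreases the weighted
      # within-cluster sum of squares (equivalent to maximizing DIFF)
      best_k <- cluster[i]
      best_obj <- ow_objective(X, cluster, w)
      for (k in seq_len(K)) {
        if (k == cluster[i]) next
        trial <- cluster; trial[i] <- k
        obj <- ow_objective(X, trial, w)         # centroids recomputed inside
        if (obj < best_obj) { best_obj <- obj; best_k <- k }
      }
      if (best_k != cluster[i]) {
        cluster[i] <- best_k
        w <- weight_fun(X, cluster)              # refresh weights after the move
      }
    }
    if (identical(cluster, old)) break           # Step 4: assignments are stable
  }
  list(cluster = cluster, weights = w)
}
```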

2.3. Impact of object weighting on the Iris data

In order to show the impact of object weighting, we applied our OW K-means to the famous Iris data [33] using two weighting schemes: Silhouette and Median(nk). The Calinski-Harabasz (CH, [34]) cluster validity index was used to select the best clustering from the clustering solutions generated after 1000 random starts of Algorithm 1. We first compared the expected cluster membership of the Iris objects (Fig. 2a) with the cluster membership found by our algorithm using the Silhouette weighting scheme (Fig. 2b). The value of the Adjusted Rand Index (ARI, [35]) for our solution here is 0.78 (the solution obtained using Median(nk), which is not shown here, consisted of a very similar clustering; the ARI value for this solution is 0.76). The large majority of misclassified objects here is concentrated within the cluster overlap zone involving the Virginica and Versicolor species. The values of the object weights are shown for two weighting schemes: Silhouette (Fig. 2c) and Median(nk) (Fig. 2d). The object weights computed using the Silhouette-based scheme have high values within the cluster overlap zone discussed above. They penalize the objects with high instability (i.e., the objects with erratic class membership across different random starts of the clustering algorithm) that are located on the border of two or several clusters and cause cluster overlap. On the contrary, the object weights generated using the Median(nk)-based weighting have high values at the extremities of clusters, far away from the cluster centers. They penalize the objects which can be considered as cluster outliers. Thus, the Silhouette and Median(nk) weighting schemes can be viewed as complementary, penalizing respectively the objects creating cluster overlap and the outliers.
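
For illustration, the hypothetical sketches above could be applied to the Iris data roughly as follows (a single random start with the Median weighting; the mclust package is assumed for the ARI computation, and the resulting values will not reproduce the 0.78/0.76 reported above, which were obtained with the authors' implementation, 1000 random starts and the CH index).

```r
# Toy usage of the sketches on the Iris data.
library(mclust)                                   # for adjustedRandIndex()
X <- scale(as.matrix(iris[, 1:4]))                # Z-score normalization
set.seed(1)
res <- ow_kmeans(X, K = 3, weight_fun = median_weights)
adjustedRandIndex(res$cluster, iris$Species)      # agreement with the true species
```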

Fig. 2.

Impact of object weighting on the clustering of Iris data shown for the first two components of the principal component analysis. (a) The correct cluster membership for the Iris dataset, featuring three Iris species: Setosa (red), Virginica (black) and Versicolor (green), (b) The cluster membership found by our OW K-means algorithm carried out with the Silhouette weighting scheme, (c) the values of the object weights obtained with the Silhouette weighting scheme (the overlapping region between the clusters of Virginica and Versicolor is highlighted by multiple red points), and (d) the values of the object weights obtained with the Median(nk) weighting scheme (the outliers of the three Iris clusters are highlighted by multiple red points).

3. Results

In order to assess the performance of our OW K-means algorithm, we first conducted a comprehensive simulation study to compare its accuracy, in terms of determining the expected number of clusters, to the accuracy of some popular clustering methods, including X-means [25], Prediction strength [26], Discriminant analysis of principal components (DAPC) [27] and Feature Weighting (FW) K-means [15]. We then applied our new algorithm to real bioinformatics data, including some well-known cancer gene expression [28] and DNA polymerase phylogenetic [29] datasets.

3.1. Simulation study

In order to compare our different proposed weighting schemes we used the MixSim R package [36] that simulates mixtures of Gaussian distributions with different levels of overlap between clusters. First, we generated error-free data, comprising a very small cluster overlap (average pairwise cluster overlap BarOmega = 0.01 in MixSim). Second, data with homogeneous noise, generated by the R function jitter from the base R package using the default parameters, were considered. Third, 20 and 40% of noise dimensions (i.e., noise features) were added to error-free data. Finally, to further increase clustering complexity and check the performance of clustering algorithms in the presence of scatter and cluster overlap, we generated and analyzed noise data including 10 and 25% of outliers and the average pairwise cluster overlap of 0.05 (BarOmega = 0.05 in MixSim). In our simulations, the number of objects n was equal to 100, 200, 300, 400 and 500 (before adding the outliers), the number of features V was equal to 5, 10 and 15 (before adding noise dimensions), and the number of clusters K was equal to 2, 3, 4, 5 and 6. For each parameter combination (n, V, K, Type_of_noise), 100 random datasets consisting of Gaussian clusters were generated. All datasets were then normalized using Z-scores. The mean absolute difference between the generated number of clusters and the number of clusters provided by the competing methods was measured (i.e., the average error rate, see also Arbelaitz et al. [37]). Two popular cluster validity indices, Silhouette and Calinski-Harabasz (CH), were used to determine the number of clusters.
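
A sketch of this data-generation step, assuming the MixSim package, is shown below; the parameter values are illustrative, not the exact settings of every simulated configuration.

```r
# Gaussian mixture with a prescribed average pairwise overlap (BarOmega),
# outliers added with simdataset(), then Z-score normalization.
library(MixSim)
set.seed(1)
Q <- MixSim(BarOmega = 0.05, K = 4, p = 10)       # 4 clusters, 10 features
A <- simdataset(n = 300, Pi = Q$Pi, Mu = Q$Mu, S = Q$S,
                n.out = 30)                       # roughly 10% of outliers (id 0)
X <- scale(A$X)                                   # Z-score normalization
true_labels <- A$id
```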

In our first simulation, we compared our median-based OW K-means to the traditional K-medians algorithm [38], and also compared the seven weighting schemes that can be used within our OW K-means (see Section 2.1) among themselves. Fig. SF1 in Supplementary Materials presents the results in terms of the average error rate shown for our Median and Median(nk) weighting schemes and the K-medians clustering using either the Silhouette or CH cluster validity index to select the best number of clusters. The number of clusters allowed by the methods varied from 2 to 10. The number of random starts was equal to 1000. As the results provided by OW K-means using the CH index were generally better than those found using Silhouette, only the average error rates obtained using CH are presented for our method (see Figs. SF1 and SF2, and Fig. 3). The biggest difference in the results provided by the seven different weighting schemes used in the framework of OW K-means occurred when analyzing the data with 25% of outliers and cluster overlap of 0.05, as well as the data with 40% of noise dimensions. Thus, these results are presented in Figs. SF1 and SF2. Our Median(nk) WS outperformed our Median WS and the two versions of K-medians (based on the Silhouette and CH indices, respectively) for the data affected by outliers and cluster overlap (Fig. SF1a), while the Silhouette-based K-medians was generally the best method to deal with the data affected by noise dimensions (Fig. SF1b). When our seven WS were compared among themselves, the Median(nk) weighting outperformed the other weighting strategies for the data with outliers and cluster overlap (Fig. SF2a), whereas the Silhouette weighting was the most effective to deal with noise dimensions (Fig. SF2b). Therefore, these two weighting strategies were selected for further analysis.

Fig. 3.

Average error rate, i.e., the mean absolute difference between the correct and predicted number of clusters, depicted for: Traditional K-means (based on MacQueen’s algorithm), Prediction strength, X-means, DAPC using Bayesian Information Criterion (BIC), Feature Weighting (FW) K-means, Object Weighting (OW) K-means using Silhouette and Median(nk)-based weightings, and feature and object weighting (FW-OW) K-means using Silhouette and Median(nk)-based weightings. The Calinski-Harabasz (CH) cluster validity index was used to determine the number of clusters in the K-means-based methods. The types of noise used in our simulation were as follows: (a) Error-free data, (b) Jitter model of noise, (c) 20% of noise dimensions, (d) 40% of noise dimensions, (e) 10% of outliers and the average pairwise cluster overlap of 0.1, and (f) 25% of outliers and the average pairwise cluster overlap of 0.1.

In our second simulation, nine different clustering methods were contrasted. They include: Traditional K-means, Prediction strength, X-means, Discriminant analysis of principal components (DAPC) using the Bayesian Information Criterion, Feature Weighting K-means, our OW K-means based on the Silhouette and Median(nk) weighting schemes, and the two versions of our algorithm allowing both object and feature weighting (Fig. 3). All versions of our algorithm and traditional K-means used the CH index to determine the number of clusters. As expected, the average error rate was almost null for error-free data for all of the methods (Fig. 3a). Figure 3b depicts the clustering performances of the nine algorithms for the data affected by homogeneous noise. It highlights the robustness of the DAPC algorithm as well as the weakness of X-means for this kind of noise, while the rest of the methods yield very similar clustering results. Regarding the data with noise dimensions, the OW K-means algorithm using the Silhouette and Median(nk) weightings remains among the best performers with 20% of noise dimensions (Fig. 3c). When 40% of noise dimensions are added to the error-free data, the Silhouette-based strategy works better than the Median(nk)-based strategy, but is outperformed by the DAPC and Prediction strength algorithms (Fig. 3d). However, for the data that include both outliers and cluster overlap, the two OW K-means strategies clearly outperform the seven remaining methods (Figs. 3e and 3f). The Median(nk)-based weighting has a slight advantage over the Silhouette-based weighting for the data with 25% of outliers (Fig. 3f). Another important factor allowing one to estimate the quality of the obtained clustering solutions, apart from the recovery of the correct number of clusters, is the quality of the data partitions returned by the clustering algorithms. To estimate the quality of the obtained data partitions we used the ARI index, which is the corrected-for-chance version of the Rand index [35]. Supplementary Figures SF3 and SF4 show the distribution of the absolute differences between the correct and predicted number of clusters and that of the ARI values provided by the nine algorithms compared in our simulations. In terms of the recovery of correct data partitions, our OW K-means based on the Silhouette and Median(nk) weighting schemes provided very competitive results for the datasets affected by outliers (10% and 25% of outliers) and cluster overlap. Moreover, in many cases the results yielded by the OW K-means-based procedures had the smallest standard deviation among the competing methods. On the other hand, the Prediction strength and DAPC algorithms were the best performers when 20% and 40% of noise dimensions were added to the error-free data.

3.2. Clustering cancer gene expression data

We further analyzed the performance of our OW K-means algorithm considering its application to gene expression data related to different types of cancer. We used the data examined in the paper of de Souto et al. [28], consisting of a pool of 35 microarray datasets from different clinical and bioinformatics studies. Single-channel Affymetrix chips and double-channel cDNA microarray technologies were used to obtain these data. The number of classes (i.e., cancer types) was assumed to be known for each dataset. This number was reported by de Souto et al. following the corresponding (original) cancer gene expression study. The number of cancer types varied from 2 to 10 for these data. Prior to their clustering analysis, de Souto et al. performed unsupervised feature selection, as the number of features in the original datasets varied between 4022 and 42640, resulting in the well-known over-fitting effect often referred to as the "curse of dimensionality". After this data pre-processing step, the number of data dimensions was reduced to the range of 85 to 4553 features. To further reduce the number of features and improve clustering results, we carried out supervised feature selection (as we knew the number of classes and the sample distribution per class for each dataset) using the random forest classifier implemented in the function random.forest.importance of the R package FSelector [39], which ranks features according to their importance. This function uses the OneR classifier to determine the feature weights: for each feature it creates a simple rule based only on that feature and then calculates its error rate. The cutoff.biggest.diff function of FSelector was then carried out with default parameters to select a subset of the most significant features. This allowed us to reduce the number of data dimensions to the range of 2 to 110 features (42 features on average).
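
The supervised feature selection step can be sketched as follows, assuming the FSelector package; expr is a hypothetical data frame of expression values with a factor column Class holding the known cancer subtypes.

```r
# Rank features by random forest importance and keep the subset selected by
# cutoff.biggest.diff() (default parameters), as described above.
library(FSelector)
weights <- random.forest.importance(Class ~ ., expr)
selected <- cutoff.biggest.diff(weights)
expr_reduced <- expr[, c(selected, "Class")]      # reduced dataset used for clustering
```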

Our OW K-means algorithm, along with the other clustering methods compared in the previous section, was applied to these reduced cancer gene expression datasets. The good results originally provided by the traditional K-means algorithm can be explained by the fact that, regardless of the expected number of classes, it consistently found that the optimal number of classes was 2 (for 34 of the 35 datasets), and 22 of the 35 datasets consisted of 2 or 3 classes. A class here represented a known subtype of cancer. This undesired constancy of the K-means algorithm artificially boosted its success rate in our original analysis. To evaluate all the methods in an equitable way, we reduced our pool of data to 20 datasets, selecting 5 datasets with 2 classes, 5 datasets with 3 classes, 5 datasets with 4 classes, and 5 datasets with 5 or more classes (see Table 1 and our data archive at: http://www.info2.uqam.ca/~makarenkov_v/gene_expression_data.htm). Whenever possible, different types of cancer were chosen for each number of classes.

Table 1.

Summary of the main characteristics of the 20 cancer gene expression benchmark datasets analyzed in our study. The indicated number of classes is that found in the original study.

Cancer type Dataset Objects Features Classes
Skin Bittner 38 110 2
Liver Chen 179 4 2
Lung Golub-V1 72 93 2
Colon Laiho 37 110 2
Breast West 49 60 2
Leukemia Armstrong-V2 72 110 3
Bladder Dyrskjot 40 60 3
Leukemia Golub-V2 72 93 3
Blood Alizadeh-V2 62 3 3
Brain Liang 37 20 3
Brain Nutt-V1 50 20 4
Lung Garber 66 2 4
Prostate Lapointe-V2 110 3 4
Endometrium Risinger 42 20 4
Prostate Tomlins-V2 92 20 4
Lung Bhattacharjee 203 7 5
Brain Pomeroy-V2 42 69 5
Prostate Tomlins-V1 104 4 5
Bone marrow Yeoh-V2 248 20 6
Multi-tissue Su 174 2 10

The results provided by the nine competing algorithms for the 20 selected datasets are presented in Figure 4 (here we adopted the graphical representation of Arbelaitz et al. [37]). Figure 4a shows that, according to the mean absolute difference between the expected and predicted number of classes, the two best results were obtained by the two tested versions of OW K-means, i.e., those using the Silhouette and Median(nk) weighting schemes. According to the success rate criterion (Fig. 4b), our OW K-means method based on Median(nk) was the best performer here, followed by the Silhouette-based version of the method and the X-means and DAPC algorithms. In terms of the mean ARI rate, the OW K-means algorithm based on Median(nk) also outperformed the other methods, while very close results were provided by DAPC, traditional K-means and OW K-means based on the Silhouette index (Fig. 4c). The mean ARI values presented in this figure vary from 0.499 to 0.698. In all three cases, the best results yielded by OW K-means were obtained without considering feature weighting. Figure SF5 in Supplementary Materials shows the distribution of the absolute differences between the correct and predicted number of classes (Fig. SF5a) and that of the ARI values (Fig. SF5b) obtained for the nine algorithms compared in our work. Supplementary Tables 1 and 2 report the detailed absolute differences and ARI values provided by the competing algorithms for each of the 20 benchmark datasets examined here. The following general pattern can be observed from these results: the larger the number of classes, and thus the likelihood of cluster overlap, the more effective our OW K-means approach is.

Fig. 4.

(a) The mean absolute difference between the true and predicted number of classes, (b) The success rate (given in % and accounting for the number of times when the true number of classes was found by the algorithm) and (c) The mean ARI values provided by: Traditional K-means (MacQueen’s version), Feature Weighting (FW) K-means, our OW K-means using the CH cluster validity index and the Silhouette and Median(nk)-based object weighting schemes with and without FW, X-means, DAPC using BIC and Prediction strength, for the 20 cancer gene expression datasets reported in Table 1.

3.3. Clustering DPO gene phylogenetic data

Next, we considered the problem of clustering DNA polymerase (DPO) gene data, originally examined by Beaudet et al. [29]. We analyzed the amino acid sequences that were found using translated Glomeromycota sequences (a fungal phylum comprising arbuscular mycorrhizal fungi, AMF) as queries for a BLASTp search in the GenBank nr database. Beaudet et al. assessed a possible origin and transmission mode of AMF mitochondrial DPO sequences by constructing a protein similarity network. We examined the 26 amino acid sequences (Supplementary Table 3) of this large network, i.e., those whose evolution was depicted by the phylogeny presented in Fig. 4b of Beaudet et al. [29]. Namely, the DPO proteins belonging to five Glomeromycota species, eight "other fungi", four plants, three eukaryotes, three bacteriophages and three bacteria were considered.

We first inferred a sequence similarity network of the selected DPO proteins (see Fig. 5a) using the Cytoscape program [40]. Similarly to Beaudet et al. [29], we considered the edge-weighted force-directed network model, in which more similar sequences are placed closer together in the layout. The obtained network can be clearly separated into five distinct classes. They are represented by dashed ellipses in Figure 5a. The first two classes include the five Glomeromycota taxa (black nodes in Fig. 5a) and the four plants (green nodes), respectively. These classes share similarities, represented by two edges in the network, due to the well-known mycorrhizal symbiosis, which is an association between Glomeromycota and plant roots [41]. The third class includes the three bacteriophages (red nodes in Fig. 5a), i.e., viruses that infect and replicate within bacteria, and the three bacteria (blue nodes), which have received the DPO protein from these bacteriophages by horizontal gene transfer. A very close similarity between these bacteria and bacteriophages indicates that they should be clustered together. The fourth class includes the eight organisms of other fungi (orange nodes in Fig. 5a) along with two eukaryotes (purple nodes linked to orange nodes). These eukaryotes have probably been contaminated by other fungi through horizontal gene transfer. Other fungi also share some similarity with the Glomeromycota fungi. The fifth class is composed of a single eukaryote (the unlinked purple node in Fig. 5a) that does not share close similarity with the rest of the species.

Fig. 5.

(a) Similarity network of DPO protein sequences, featuring Glomeromycota (5 black nodes), other fungi (8 orange nodes), plants (4 green nodes), bacteria (3 blue nodes), bacteriophages (3 red nodes) and eukaryotes (3 purple nodes), originally examined by Beaudet et al. [29]. The similarity network was constructed using BLASTp with a minimal e-value threshold of 1E-40 and 33% minimal similarity covering at least 20% of the smallest sequence. The network layout was generated by Cytoscape using an edge-weighted force-directed model. Genes sharing more protein similarity appear closer in the display. This network can be clearly separated into five distinct classes shown by dashed ellipses. (b) Data heatmap and resulting clusterings obtained using our OW K-means with weights based on Silhouette, K-means based on CH, X-means, DAPC based on BIC and Prediction strength. The species labels for the DPO protein data correspond to those used in (a). In the presented clustering, the clusters are separated by horizontal lines and a multi-colored cluster specifies that it includes different species. OW K-means provided a solution with 5 clusters, traditional K-means with 6 clusters, X-means with 3 clusters, DAPC with 3 clusters and Prediction strength with 3 clusters as well.

In order to apply our OW K-means to the DPO data, we first converted the amino acid sequences into numerical vectors, which can be processed by clustering methods. In our case, the numeric vectors represented the amino acid proportions in the DPO sequences. These proportions were calculated using the function featurePseudoAAComp of the R package BioSeqClass [42] (a simplified sketch of this conversion is given after this paragraph). Along with our OW K-means with weights based on Silhouette, we also carried out traditional K-means based on CH, X-means, DAPC and Prediction strength. The clusterings provided by these algorithms are reported in Figure 5b. Our OW K-means was the only method that correctly classified the 26 considered species into five groups corresponding to the similarity network patterns described above (i.e., the bacteria and the related bacteriophages were grouped together, two of the three eukaryotes were placed in the same group with other fungi, the third eukaryote was a singleton element in its class, and the Glomeromycota and plant species were clustered separately). The clustering obtained by the traditional K-means algorithm is relatively similar to that found by our method, with the exception that the group of other fungi is divided into two subgroups. The other methods misclassified the DPO data by clustering all the eukaryotes with other fungi, and either by merging plants with the bacteria/phages group (DAPC and X-means) or by clustering plants with Glomeromycota fungi (Prediction strength).
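
A simplified stand-in for this sequence-to-vector conversion is sketched below: each sequence is mapped to its vector of amino acid proportions (the study itself used the featurePseudoAAComp function of BioSeqClass; aa_composition is a hypothetical name).

```r
# Map amino acid sequences to their 20-dimensional composition vectors.
aa_composition <- function(seqs) {
  aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  t(sapply(seqs, function(s) {
    counts <- table(factor(strsplit(s, "")[[1]], levels = aa))
    as.numeric(counts) / nchar(s)                 # proportion of each amino acid
  }))
}

# Toy example with two short (artificial) sequences
aa_composition(c("MKVLAAGG", "MKKLLVAA"))
```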

4. Conclusion

In this paper, we presented a new clustering approach, OW K-means, based on the use of object weights in the objective function of K-means. Seven object weighting schemes, using the Silhouette width, the median and two intercluster distances, were considered in our study. The object weights defined in this work can be used to identify two types of objects that can easily damage cluster organization and make clustering less reliable. Such objects, which are typical for bioinformatics applications, are located either on the frontier with other clusters (i.e., objects causing cluster overlap) or far away from the cluster centers (i.e., outliers). Both of them bring instability to the clustering process. Clustering and outlier detection are strongly related tasks in data mining [13], [14], [15]. One of the goals of our work was to detect the outliers and unstable clustering elements, not by removing them from the clustering process, as is often done in data mining [14], but rather by assigning them higher weights in the method's objective function. Thus, our approach can be seen as a unified tool for simultaneously clustering the data and discovering outliers and elements causing cluster overlap. However, if necessary, the presented method could also be used as a data pre-processing step, allowing one to eliminate outliers and elements causing cluster overlap from the data prior to clustering analysis. Our Algorithm 1 can be carried out with each of the weighting schemes described in Section 2, and the clustering solutions provided by them can then be compared using a selected cluster validity index, such as the Silhouette or Calinski-Harabasz index. The optimum value of the selected index will indicate the most appropriate weighting scheme for the data at hand.

Experiments on synthetic and real bioinformatics data, including gene expression and phylogenetic datasets, demonstrated the effectiveness of our approach. The best of our weighting schemes, based on Silhouette and the median, were compared to a number of popular clustering methods, including traditional K-means [4], X-means [25], DAPC [27] and Prediction Strength [26]. Our method clearly outperformed the existing clustering approaches in the presence of outliers and cluster overlap. However, in the presence of noisy features and jitter noise, DAPC and Prediction Strength slightly outperformed our OW K-means. Given that in real situations the types of noise affecting the data are rarely known in advance, a potential solution when contrasting DAPC and Prediction Strength with our OW K-means algorithm is to identify the outliers that are not common to these procedures and classify them afterwards. In the future, it would be interesting to extend the presented approach to the case where a weight could be given not only to individual objects, but also to clusters of objects. Such cluster weights could be defined to maximize the total distance between clusters. It would also be important to test whether the use of weights helps increase the stability of clustering solutions provided by K-means [43].

The presented OW K-means algorithm was implemented in R and included in the RWeightedKmeans package. It is freely available at CRAN. A faster C version of the program is available at: https://github.com/agondeau/weightedKmeans.

Supplementary Material

TCBB2921577

Acknowledgments

The authors acknowledge the support of the Fonds Québécois de la Recherche sur la Nature et les Technologies (grant no. 173878) and of the Natural Sciences and Engineering Research Council of Canada (grant no. 249644).

Biographies

Alexandre Gondeau received his Master’s degree in Computer Science at the University Bordeaux I (France) in 2013 and a post-graduate diploma in Bioinformatics at the Université du Québec à Montréal (Canada) in 2016. He joined the Bordeaux Bioinformatics Center (CBIB) in 2019. His research interests reside in computational biology.

Mohamed Hijri is a Full professor at the Département de Sciences Biologiques of the Université de Montréal (Canada). His research program centers on the molecular genetics and genomics of arbuscular mycorrhizal fungi (AMF), metagenomics, environmental microbiology, microbial ecology and plant-microbe interactions.

Pedro Peres-Neto is a Full professor at the Department of Biology of Concordia University (Montreal, Canada). He is a Canada Research Chair (Tier I) in Spatial Ecology and Biodiversity. His research lies at the interface of quantitative ecology and biostatistics.

Vladimir Makarenkov is a Full professor and Director of the graduate Bioinformatics program at the Department of Computer Science of the Université du Québec à Montréal (Canada). His research interests are in the fields of bioinformatics, software engineering and data mining.

Contributor Information

Alexandre Gondeau, Département d’Informatique, Université du Québec à Montréal, C.P.8888, s. Centre-Ville, Montreal, QC, Canada, H3C 3P8.

Mohamed Hijri, Department of Biology, Université de Montréal, 4101 Rue Sherbrooke Est, Montreal, QC, Canada, H1X 2B2.

Pedro Peres-Neto, Department of Biology, Concordia University, 7141 Rue Sherbrooke Ouest, Montreal, QC, Canada, H4B 1R6.

Vladimir Makarenkov, Département d’Informatique, Université du Québec à Montréal, C.P.8888, s. Centre-Ville, Montreal, QC, Canada, H3C 3P8.

References

[1] Baldi P and Hatfield GW, DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press, 2002.
[2] Li BN et al., "Integrating spatial fuzzy clustering with level set methods for automated medical image segmentation," Comput. Biol. Med., vol. 41, pp. 1–10, 2011.
[3] Young JH et al., "Computational discovery of pathway-level genetic vulnerabilities in non-small-cell lung cancer," Bioinformatics, vol. 32, pp. 1373–1379, 2016.
[4] MacQueen J, "Some methods for classification and analysis of multivariate observations," Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
[5] Rousseeuw PJ, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, pp. 53–65, 1987.
[6] Tamborero D et al., "OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes," Bioinformatics, vol. 29, pp. 2238–2244, 2013.
[7] Yang Y et al., "SAFE-clustering: single-cell aggregated (from ensemble) clustering for single-cell RNA-seq data," Bioinformatics, bty793, in press, 2018.
[8] Hendrickx DM et al., "Pattern recognition methods to relate time profiles of gene expression with phenotypic data: a comparative study," Bioinformatics, vol. 31, pp. 2115–2122, 2015.
[9] Legendre P and Legendre L, Numerical Ecology, 3rd English edn., Developments in Environmental Modelling, vol. 24, Amsterdam: Elsevier Science BV, 2012.
[10] Hao X et al., "Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering," Bioinformatics, vol. 27, pp. 611–618, 2011.
[11] Jiang P and Singh M, "SPICi: a fast clustering algorithm for large biological networks," Bioinformatics, vol. 26, pp. 1105–1111, 2010.
[12] Choi J et al., "Improved prediction of breast cancer outcome by identifying heterogeneous biomarkers," Bioinformatics, vol. 33, pp. 3619–3626, 2017.
[13] Jiang MF et al., "Two-phase clustering process for outliers detection," Pattern Recogn. Lett., vol. 22, pp. 691–700, 2001.
[14] Hautamäki V et al., "Improving k-means by outlier removal," Proc. Scandinavian Conference on Image Analysis, pp. 978–987, 2005.
[15] de Amorim RC and Mirkin B, "Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering," Pattern Recognit., vol. 45, pp. 1061–1075, 2012.
[16] Oh JH and Gao J, "A kernel-based approach for detecting outliers of high-dimensional biological data," BMC Bioinformatics, vol. 10 (Suppl 4), S7, 2009.
[17] MacDonald JW and Ghosh D, "COPA - cancer outlier profile analysis," Bioinformatics, vol. 22, pp. 2950–2951, 2006.
[18] Gunawardana Y et al., "Outlier detection at the transcriptome-proteome interface," Bioinformatics, vol. 31, pp. 2530–2536, 2015.
[19] Makarenkov V and Legendre P, "Optimal variable weighting for ultrametric and additive trees and k-means partitioning: Methods and software," J. Classif., vol. 18, pp. 245–271, 2001.
[20] de Amorim RC, "A survey on feature weighting based k-means algorithms," J. Classif., vol. 33, pp. 210–242, 2016.
[21] Tseng GC, "Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data," Bioinformatics, vol. 23, pp. 2247–2255, 2007.
[22] Kerdprasop K, Kerdprasop N and Sattayatham P, "Weighted K-Means for density-biased clustering," in Data Warehousing and Knowledge Discovery, pp. 488–497, Springer, Berlin Heidelberg, 2005.
[23] Gebru ID, Alameda-Pineda X, Forbes F and Horaud R, "EM algorithms for weighted-data clustering with application to audio-visual scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, pp. 2402–2415, 2016.
[24] Shen YJ, Sun W and Li KC, "Dynamically weighted clustering with noise set," Bioinformatics, vol. 26, pp. 341–347, 2010.
[25] Pelleg D and Moore AW, "X-means: extending k-means with efficient estimation of the number of clusters," Proc. of ICML, pp. 727–734, June 2000.
[26] Tibshirani R and Walther G, "Cluster validation by prediction strength," J. Comput. Graph. Stat., vol. 14, pp. 511–528, 2005.
[27] Jombart T et al., "DAPC: a new method for the analysis of genetically structured populations," BMC Genetics, vol. 11, 94, 2010.
[28] de Souto MC et al., "Clustering cancer gene expression data: a comparative study," BMC Bioinformatics, vol. 9, 497, 2008.
[29] Beaudet D et al., "Mitochondrial genome rearrangements in Glomus species triggered by homologous recombination between distinct mtDNA haplotypes," Genome Biol. Evol., vol. 5, pp. 1628–1643, 2013.
[30] Hartigan JA and Wong MA, "Algorithm AS 136: A k-means clustering algorithm," J. R. Stat. Soc., vol. 28, pp. 100–108, 1979.
[31] Lloyd S, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, pp. 129–137, 1982.
[32] Manikandan S, "Measures of central tendency: Median and mode," J. Pharmacol. Pharmacother., vol. 2, pp. 214–215, 2011.
[33] Fisher RA, "The use of multiple measurements in taxonomic problems," Ann. Eugen., vol. 7, pp. 179–188, 1936.
[34] Calinski T and Harabasz J, "A dendrite method for cluster analysis," Commun. Stat. Theory Methods, vol. 3, pp. 1–27, 1974.
[35] Hubert L and Arabie P, "Comparing partitions," J. Classif., vol. 2, no. 4, pp. 193–218, 1985.
[36] Melnykov V et al., "MixSim: An R package for simulating data to study performance of clustering algorithms," J. Stat. Softw., vol. 51, pp. 1–25, 2012.
[37] Arbelaitz O et al., "An extensive comparative study of cluster validity indices," Pattern Recognit., vol. 46, pp. 243–256, 2013.
[38] Jain AK and Dubes RC, Algorithms for Clustering Data, Prentice-Hall, 1988.
[39] Romanski P, FSelector: Selecting Attributes, R package, 2009.
[40] Shannon P et al., "Cytoscape: a software environment for integrated models of biomolecular interaction networks," Genome Res., vol. 13, pp. 2498–2504, 2003.
[41] Smith SE and Read DJ, Mycorrhizal Symbiosis, Academic Press, 2008.
[42] Hong L, BioSeqClass: classification for biological sequences, R package, 2017.
[43] Bertrand P and Mufti G, "Loevinger's measures of rule quality for assessing cluster stability," Comput. Stat. Data Anal., vol. 50, pp. 992–1015, 2006.
