Author manuscript; available in PMC: 2021 Feb 24.
Published in final edited form as: Proc IEEE Int Conf Acoust Speech Signal Process. 2018 Sep 13;2018:2831–2835. doi: 10.1109/icassp.2018.8462602

CLUSTERING OF DATA WITH MISSING ENTRIES

Sunrita Poddar 1, Mathews Jacob 1
PMCID: PMC7902244  NIHMSID: NIHMS1667923  PMID: 33633499

Abstract

The analysis of large datasets is often complicated by the presence of missing entries, mainly because most current machine learning algorithms are designed to work with full data. The main focus of this work is to introduce a clustering algorithm that provides good clustering even in the presence of missing data. The proposed technique solves an $\ell_0$ fusion penalty based optimization problem to recover the clusters. We theoretically analyze the conditions needed for the successful recovery of the clusters. We also propose an algorithm to solve a relaxation of this problem using saturating non-convex fusion penalties. The method is demonstrated on simulated and real datasets, and is observed to perform well in the presence of large fractions of missing entries.

Keywords: clustering, missing entries, non-convex penalties

1. INTRODUCTION

Clustering is a popular unsupervised data analysis technique for finding natural groupings in the absence of training data. Specifically, it assigns each data point to a group, such that all points within a group are similar and points in different groups are dissimilar in some sense. Clustering methods are widely used in the analysis of gene expression data, image segmentation, identification of lexemes in handwritten text, search result grouping and recommender systems [1, 2].

Most clustering algorithms cannot be directly applied to datasets with missing entries. For example, gene expression data often contains missing entries due to image corruption, fabrication errors or contaminants [3], rendering gene cluster analysis difficult. Likewise, large databases used by recommender systems (e.g., Netflix) usually have a huge amount of missing data, which makes pattern discovery challenging [4]. Similarly, missing responses in surveys [5] and failing imaging sensors in astronomy [6] are reported to make the analysis in these applications challenging. The most obvious way to apply existing clustering algorithms to data with missing entries is to convert the data to a complete dataset. This can be done using deletion or imputation [7]. An extension of the weighted sum-of-norms algorithm [8] has been proposed where the weights are estimated from the data points by applying imputation techniques to the missing entries [9]. A majorize-minimize algorithm was introduced in [10] to solve for the cluster centres and cluster memberships, which offers a proven reduction in cost with each iteration. However, there is no theoretical analysis of these algorithms, which makes it difficult to determine what fraction of entries needs to be sampled to recover the correct clusters.

In this paper, we introduce an algorithm to cluster data when some of the features are missing in each point. The method is inspired by the recently proposed sum-of-norms clustering technique [8]. This technique assigns a surrogate variable to each data point, which is an estimate of the cluster centre to which that point belongs. When a fusion penalty is used, it is observed that the surrogate variables belonging to the same cluster coalesce to that centre point. These values denote the estimated cluster centres. In prior work, we used a weighted convex fusion penalty to recover under-sampled MRI images lying on a manifold [11, 12], where the weights were estimated using a special navigator acquisition. In this work, we propose an optimization problem with an $\ell_0$ norm based fusion penalty, since we have observed that non-convex fusion penalties provide better clustering performance than convex ones. The main focus is to theoretically analyze the conditions for the successful recovery of the clusters from data using the proposed optimization technique, when several features are missing. This analysis reveals that the clustering performance is determined by factors such as cluster-separation, cluster variance and feature coherence. When two clusters are distinguishable by only very few features, then it is difficult to distinguish between them if these features are not observed, making feature coherence important. As expected, we also obtain a higher probability of successful clustering in the presence of fewer missing entries. We propose an algorithm to efficiently solve a relaxation of this optimization problem, using saturating non-convex fusion penalties. It is demonstrated on simulated and real datasets that the proposed algorithm performs successful clustering in the presence of large fractions of missing entries.

2. CLUSTERING USING $\ell_0$ FUSION PENALTY

2.1. Background

We consider the clustering of points drawn from one of K distinct clusters C1, C2, …, CK. We denote the centers of the clusters by $c_1, c_2, \ldots, c_K \in \mathbb{R}^P$. For simplicity, we assume that there are M points in each of the clusters. The individual points in the kth cluster are modelled as:

$$z_k(m) = c_k + n_k(m); \quad m = 1, \ldots, M, \; k = 1, \ldots, K \tag{1}$$

Here, nk(m) is the noise or the variation of zk(m) from the cluster center ck. The set of input points {xi}, i = 1, ‥,KM is obtained as a random permutation of the points {zk(m)}. The objective of a clustering algorithm is to estimate the cluster labels, denoted by C(xi) for i = 1, ‥,KM.

The sum-of-norms (SON) method is a recently proposed convex clustering algorithm [8]. Here, a surrogate variable ui is introduced for each point xi, which is an estimate of the centre of the cluster to which xi belongs. In order to find the optimal {ui*}, the following optimization problem is solved:

$$\{u_i^*\} = \arg\min_{\{u_i\}} \sum_{i=1}^{KM} \|x_i - u_i\|_2^2 + \lambda \sum_{i=1}^{KM} \sum_{j=1}^{KM} \|u_i - u_j\|_p \tag{2}$$

The fusion penalty ($\|u_i - u_j\|_p$) can be enforced using different $\ell_p$ norms, out of which the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms have been used in the literature [8]. The use of sparsity promoting fusion penalties encourages sparse differences $u_i - u_j$, which facilitates the clustering of the points {ui}.

2.2. Central Assumptions

We make the following assumptions (illustrated in Fig 1), which are key to the successful clustering of the points:

Fig. 1:

Central Assumptions: (a) and (b) show different datasets of points in $\mathbb{R}^2$ lying in 3 clusters (denoted by red, green and blue). A.1 and A.2 are illustrated in both (a) and (b). The importance of A.3 can be appreciated by comparing (a) and (b). In (a), points in the red and blue clusters cannot be distinguished using only feature 1, while the red and green clusters cannot be distinguished using only feature 2. Due to low coherence in (b), this problem does not arise.

A.1: Cluster separation:

Points from different clusters are separated by δ > 0 in the $\ell_2$ sense, i.e.:

$$\min_{\{m,n\}} \|z_k(m) - z_l(n)\|_2 \ge \delta; \quad \forall\, k \ne l \tag{3}$$

A.2: Cluster size:

The maximum separation of points within any cluster in the $\ell_\infty$ sense is ϵ ≥ 0, i.e.:

$$\max_{\{m,n\}} \|z_k(m) - z_k(n)\|_\infty = \epsilon; \quad \forall\, k = 1, \ldots, K \tag{4}$$

A.3: Feature concentration:

The coherence of a vector $y \in \mathbb{R}^P$ is defined as: $\mu(y) = \frac{P \|y\|_\infty^2}{\|y\|_2^2}$. We bound the coherence of the difference between points from different clusters as:

$$\max_{\{m,n\}} \mu\left(z_k(m) - z_l(n)\right) \le \mu_0; \quad \forall\, k \ne l \tag{5}$$

The quantity $\kappa = \frac{\epsilon \sqrt{P}}{\delta}$ is a measure of the difficulty of the clustering problem. The recovery of clusters when κ is small is expected to be easier.
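As a rough illustration of assumptions A.1 to A.3, the sketch below computes δ, ϵ, μ0 and κ for a small labelled dataset. It takes the coherence as μ(y) = P‖y‖∞²/‖y‖2² and κ = ϵ√P/δ; the function name `clustering_difficulty` and the toy points are illustrative and not from the paper. Because ‖y‖2 ≤ √P‖y‖∞, the quantity ϵ√P bounds the ℓ2 diameter of every cluster, so κ < 1 means that each cluster is narrower than the gap separating it from the others.

```python
import math
from itertools import combinations

def clustering_difficulty(clusters):
    """clusters: list of clusters, each a list of points (equal-length lists
    of floats). Returns (delta, eps, mu0, kappa) following A.1 to A.3.
    Assumes points from different clusters never coincide (delta > 0)."""
    P = len(clusters[0][0])
    # A.2: largest l_inf distance between two points of the same cluster.
    eps = 0.0
    for pts in clusters:
        for a, b in combinations(pts, 2):
            eps = max(eps, max(abs(s - t) for s, t in zip(a, b)))
    # A.1 and A.3: scan all pairs of points taken from different clusters.
    delta, mu0 = math.inf, 0.0
    for ck, cl in combinations(clusters, 2):
        for a in ck:
            for b in cl:
                diff = [s - t for s, t in zip(a, b)]
                sq_l2 = sum(d * d for d in diff)
                linf = max(abs(d) for d in diff)
                delta = min(delta, math.sqrt(sq_l2))      # smallest l2 gap
                mu0 = max(mu0, P * linf * linf / sq_l2)   # largest coherence
    kappa = eps * math.sqrt(P) / delta
    return delta, eps, mu0, kappa

# Toy example: two tight clusters in the plane.
toy = [[[0.0, 0.0], [0.1, 0.0]], [[1.0, 1.0], [1.0, 1.1]]]
delta, eps, mu0, kappa = clustering_difficulty(toy)
```

For this toy input κ is well below 1, which is the regime covered by the guarantees of Section 2.3.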

2.3. Theoretical Guarantees

We study the problem of clustering {xi} in the presence of entries missing uniformly at random. We arrange the points {xi} as columns of a matrix X. We assume that each entry of X is observed with probability p0. The entries measured in the ith column are denoted by:

$$y_i = S_i x_i, \quad i = 1, \ldots, KM \tag{6}$$

where Si is the sampling matrix, formed by selecting rows of the identity matrix. We consider solving the following optimization problem to obtain the cluster memberships from data with missing entries:

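$$\{u_i^*\} = \arg\min_{\{u_i\}} \sum_{i=1}^{KM} \sum_{j=1}^{KM} \|u_i - u_j\|_{2,0} \quad \text{s.t.} \quad \|S_i(x_i - u_i)\|_\infty \le \frac{\epsilon}{2}, \; \forall\, i \in \{1, \ldots, KM\} \tag{7}$$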

We claim that the above algorithm can successfully recover the clusters with high probability when the clusters are well separated (low κ), the sampling probability p0 is sufficiently high and the coherence μ0 is small. We state our theoretical guarantees after defining the following quantities:

  • Upper bound for the probability that two points have < $\frac{p_0^2 P}{2}$ commonly observed locations: $\gamma_0 := \left(\frac{e}{2}\right)^{-\frac{p_0^2 P}{2}}$

  • Given that two points from different clusters have > $\frac{p_0^2 P}{2}$ commonly observed locations, upper bound for the probability that they can yield the same u without violating the constraints in (7): $\delta_0 := e^{-\frac{p_0^2 P (1 - \kappa^2)^2}{\mu_0^2}}$

  • Upper bound for the probability that two points from different clusters can yield the same u without violating the constraints in (7): β0 := 1 − (1 − δ0)(1 − γ0)

  • Upper bound for the failure probability of (7): $\eta_0 := \sum_{\{m_j\} \in S} \left[ \beta_0^{\frac{1}{2}\left(M^2 - \sum_j m_j^2\right)} \prod_j \binom{M}{m_j} \right]$, where S is the set of all sets of positive integers {mj} such that $2 \le U(\{m_j\}) \le K$ and $\sum_j m_j = M$. Here, the function U counts the number of non-zero elements in a set. For example, if K = 2 then $\eta_0 = \sum_{i=1}^{M-1} \left[ \beta_0^{i(M-i)} \binom{M}{i}^2 \right]$.

  • For K = 2 and $\log \beta_0 \le \frac{1}{M-1} + \frac{2}{M-2} \log \frac{1}{M-1}$, we have $\eta_0 \le M^3 \beta_0^{M-1}$.
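The bounds listed above are simple to evaluate numerically. The following sketch does so for K = 2, reading the definitions as γ0 = (e/2)^(−p0²P/2), δ0 = exp(−p0²P(1−κ²)²/μ0²), β0 = 1 − (1 − δ0)(1 − γ0) and η0 = Σ_{i=1}^{M−1} β0^{i(M−i)} C(M,i)². The function names, the log-domain summation and the clipping of η0 to 1 are choices made here for illustration, not part of the paper.

```python
import math

def gamma0(p0, P):
    # Bound on the probability that two points share too few observed features.
    return (math.e / 2.0) ** (-(p0 ** 2) * P / 2.0)

def delta0(p0, P, kappa, mu0):
    # Bound on the probability that two points from different clusters remain
    # compatible with one centre, given enough commonly observed features.
    return math.exp(-(p0 ** 2) * P * (1.0 - kappa ** 2) ** 2 / mu0 ** 2)

def beta0(p0, P, kappa, mu0):
    return 1.0 - (1.0 - delta0(p0, P, kappa, mu0)) * (1.0 - gamma0(p0, P))

def eta0_two_clusters(b0, M):
    """eta0 for K = 2 and M >= 2, clipped to 1 because it bounds a probability.
    Terms are summed in the log domain so that large binomial coefficients
    and tiny powers of b0 do not overflow or underflow."""
    if b0 <= 0.0:
        return 0.0
    log_b0 = math.log(b0)

    def log_comb(n, k):
        return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

    logs = [i * (M - i) * log_b0 + 2.0 * log_comb(M, i) for i in range(1, M)]
    top = max(logs)
    log_eta = top + math.log(sum(math.exp(t - top) for t in logs))
    return 1.0 if log_eta >= 0.0 else math.exp(log_eta)

# Example setting in the spirit of Fig 2: P = 50, mu0 = 1.5, kappa = 0.39.
b = beta0(0.5, 50, 0.39, 1.5)
success_lower_bound = 1.0 - eta0_two_clusters(b, 10)
```

Sweeping `p0` over a grid and plotting `1 - eta0_two_clusters(...)` gives curves of the kind discussed in Section 4.1.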

Lemma 2.1.

Consider any two points x1 and x2 from the same cluster. A solution u exists for the following equations:

$$\|S_i(x_i - u)\|_\infty \le \frac{\epsilon}{2}; \quad i = 1, 2 \tag{8}$$

with probability 1.

Lemma 2.2.

Consider any two points x1 and x2 from different clusters, and assume that κ < 1. A solution u exists for the following equations:

$$\|S_i(x_i - u)\|_\infty \le \frac{\epsilon}{2}; \quad i = 1, 2 \tag{9}$$

with probability less than β0.

The above lemmas indicate that two points from the same cluster can always be assigned the same centre u. However, for a pair of points from different clusters, this can happen with a probability < β0. We note that β0 decreases with a decrease in κ. Using lemmas 2.1 and 2.2, we get the following result for a large number of points from multiple clusters:

Lemma 2.3.

Assume that $\{x_i : i \in I, |I| = M\}$ is a set of points chosen randomly from multiple clusters (not all are from the same cluster). If κ < 1, a solution u does not exist for the following equations:

$$\|S_i(x_i - u)\|_\infty \le \frac{\epsilon}{2}; \quad \forall\, i \in I \tag{10}$$

with probability exceeding 1 − η0.
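The constraint sets in Lemmas 2.1 to 2.3 decouple across features, which gives a direct test for whether a common centre u exists. A minimal sketch, assuming each point comes with a boolean mask of observed entries (the function name and the example values are hypothetical): for every feature, a valid u[p] exists exactly when the observed values of that feature span an interval of width at most ϵ.

```python
def common_centre_exists(points, masks, eps):
    """points: list of equal-length feature lists; masks: matching lists of
    booleans (True = entry observed). Returns True if some u satisfies
    |x_i[p] - u[p]| <= eps / 2 for every observed entry of every point."""
    P = len(points[0])
    for p in range(P):
        seen = [x[p] for x, m in zip(points, masks) if m[p]]
        # u[p] must lie within eps/2 of every observed value, which is
        # possible exactly when those values span a width of at most eps.
        if seen and max(seen) - min(seen) > eps:
            return False
    return True

x1 = [0.0, 0.0, 5.0]
x2 = [0.05, 3.0, 5.02]   # differs from x1 only in the second feature
full = [True, True, True]
fully_observed = common_centre_exists([x1, x2], [full, full], eps=0.1)
second_hidden = common_centre_exists([x1, x2], [full, [True, False, True]], eps=0.1)
```

With all entries observed the two points cannot share a centre, but hiding the one discriminating feature of `x2` makes them compatible. This is the failure mode that the coherence assumption A.3 is meant to make unlikely.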

We note here, that for a low value of β0 and a high value of M, we will arrive at a very low value of η0. Lemma 2.3 can be used to arrive at our main result:

Theorem 2.4.

If κ < 1, the solution to the optimization problem (7) is identical to the ground-truth clustering with probability exceeding 1 − η0.

The reasoning follows from the fact that all solutions with cluster sizes smaller than M are associated with a higher cost than the ground-truth solution. In the special case where there are no missing entries, the constraints of optimization problem (7) reduce to: $\|x_i - u_i\|_\infty \le \frac{\epsilon}{2}$. We have the following theorem guaranteeing successful recovery of the clusters:

Theorem 2.5.

If κ < 1, the solution to the optimization problem (7) is identical to the ground-truth clustering in the absence of missing entries.

3. RELAXATION OF THE $\ell_0$ PENALTY

We propose to solve the following relaxation of the optimization problem (7), which is more computationally feasible:

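$$\{u_i^*\} = \arg\min_{\{u_i\}} \sum_{i=1}^{KM} \|S_i(u_i - x_i)\|_2^2 + \lambda \sum_{i=1}^{KM} \sum_{j=1}^{KM} \phi\left(\|u_i - u_j\|_2\right) \tag{11}$$

Here ϕ is a function approximating the $\ell_0$ norm, such as:

  • $\ell_p$ norm: $\phi(x) = |x|^p$, for some 0 < p < 1.

  • H1 penalty: $\phi(x) = 1 - e^{-\frac{x^2}{2\sigma^2}}$.

Similar to [13, 14], we reformulate the problem by majorizing the penalty ϕ using a quadratic surrogate functional: $\phi(x) \le w(x) x^2 + d$, where $w(x) = \frac{\phi'(x)}{2x}$, and d is a constant. We now state the majorize-minimize formulation for problem (11) as:

$$\{u_i^*, w_{ij}^*\} = \arg\min_{\{u_i, w_{ij}\}} \sum_{i=1}^{KM} \|S_i(u_i - x_i)\|_2^2 + \lambda \sum_{i=1}^{KM} \sum_{j=1}^{KM} w_{ij} \|u_i - u_j\|_2^2 \tag{12}$$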

We solve problem (12) by alternating between minimization with respect to {ui} and {wij} till convergence.
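One way to realise the alternation described above is sketched below for the H1 penalty, for which w(x) = ϕ'(x)/(2x) = exp(−x²/(2σ²))/(2σ²). The centre step uses Gauss-Seidel coordinate updates, u_i[p] ← (m_ip x_i[p] + 2λ Σ_j w_ij u_j[p]) / (m_ip + 2λ Σ_j w_ij), with m_ip = 1 for observed entries and 0 otherwise. This solver, the mean initialisation of unobserved entries, the fixed iteration counts and the name `h1_cluster` are assumptions made for illustration and are not taken from the paper.

```python
import math

def h1_cluster(X, mask, lam, sigma, outer_iters=30, inner_sweeps=20):
    """X: list of N points (lists of P floats); mask[i][p] is True when
    X[i][p] is observed. Returns the surrogate centres u_i as a list of lists."""
    N, P = len(X), len(X[0])
    # Initialise: observed entries from the data, unobserved entries from the
    # mean of the observed values of that feature (an assumption of this sketch).
    U = [[0.0] * P for _ in range(N)]
    for p in range(P):
        obs = [X[i][p] for i in range(N) if mask[i][p]]
        mean = sum(obs) / len(obs) if obs else 0.0
        for i in range(N):
            U[i][p] = X[i][p] if mask[i][p] else mean
    for _ in range(outer_iters):
        # Weight step: w_ij = exp(-d_ij^2 / (2 sigma^2)) / (2 sigma^2).
        W = [[0.0] * N for _ in range(N)]
        for i in range(N):
            for j in range(i + 1, N):
                d2 = sum((a - b) ** 2 for a, b in zip(U[i], U[j]))
                W[i][j] = W[j][i] = math.exp(-d2 / (2 * sigma ** 2)) / (2 * sigma ** 2)
        # Centre step: Gauss-Seidel sweeps on the quadratic surrogate (12).
        for _ in range(inner_sweeps):
            for i in range(N):
                wsum = sum(W[i])  # the diagonal W[i][i] is zero
                for p in range(P):
                    pull = sum(W[i][j] * U[j][p] for j in range(N))
                    m = 1.0 if mask[i][p] else 0.0
                    denom = m + 2.0 * lam * wsum
                    if denom > 0.0:
                        # Masked entries of X are multiplied by m = 0 and ignored.
                        U[i][p] = (m * X[i][p] + 2.0 * lam * pull) / denom
    return U
```

Points whose returned centres coincide up to a small tolerance can then be assigned to the same cluster. For larger problems the centre step could instead be solved as one sparse linear system per feature.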

4. RESULTS

4.1. Study of Theoretical Guarantees

We observe the behaviour of γ0, δ0, β0 and η0 as a function of p0, P, κ and M. In Fig 2 (a), the change in γ0 is shown as a function of p0 for different values of P. In subsequent plots, we fix P = 50 and μ0 = 1.5. In Fig 2 (b), the change in δ0 is shown as a function of p0 for different values of κ. In Fig 2 (c), the behaviour of β0 is shown. We consider K = 2 for subsequent plots. (1−η0) is plotted in (d) as a function of p0 for different values of κ and M. As expected, the probability of success of the clustering algorithm increases with decrease in κ and increase in p0 and M.

Fig. 2:

Study of Theoretical Guarantees. Quantities γ0, δ0 and β0 defined in Section 2.3 are studied in (a), (b) and (c). In (b), (c) and (d), P = 50 and μ0 = 1.5. As expected, β0 decreases with increase in p0 and decrease in κ. Considering K = 2 clusters, a lower bound for the probability of successful clustering (1 − η0) is shown in (d) for different κ.

4.2. Clustering of Simulated Data

We simulated datasets with K = 2 disjoint clusters in $\mathbb{R}^{50}$ with a varying number of points per cluster. The points in each cluster follow a uniform random distribution. We study the probability of success of the H1 penalty based clustering algorithm as a function of κ, M and p0. For a particular set of parameters the experiment was conducted 20 times. Fig 3 (a) shows the result for datasets with κ = 0.39 and μ0 = 2.3. The theoretical guarantees for successfully clustering the dataset are shown in (b). Our theoretical guarantees hold for κ < 1. However, we demonstrate in (c) that even with κ = 1.15 and μ0 = 13.2, our clustering algorithm is successful.

Fig. 3:

Experimental results for probability of success. Guarantees are shown for a simulated dataset with K = 2 clusters. For (a) and (b), κ = 0.39 and μ0 = 2.3. (a) and (b) show the experimental and theoretical values for the probability of success respectively. (c) shows the experimentally obtained probability of success for a more challenging dataset with κ = 1.15 and μ0 = 13.2. We do not have theoretical guarantees for this case, since our analysis assumes κ < 1.

Clustering results with K = 3 simulated clusters are shown in Fig 4. We simulated Dataset-1 with K = 3 disjoint clusters in $\mathbb{R}^{50}$ and M = 200 points in each cluster. For each of these 3 cluster centres, 200 noisy instances were generated by adding zero-mean white Gaussian noise of variance 0.1. The dataset was sub-sampled with varying fractions of missing entries (p0 = 1, 0.9, 0.8, …, 0.3, 0.2). We also generate Dataset-2 by halving the distance between the cluster centres, while keeping the intra-cluster variance fixed. We test the proposed algorithm on these datasets using the H1 penalty. Since the points lie in $\mathbb{R}^{50}$, we take a PCA of the points and their estimated centres and plot the 2 most significant components. The 3 colours distinguish the points according to their ground-truth clusters. Each point xi is joined to its centre estimate ui* by a line. We observe that the clustering algorithm is more stable with fewer missing entries.
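A dataset in the style of Dataset-1 can be generated along the following lines. The paper does not state how the three cluster centres were chosen, so this sketch draws them from a zero-mean Gaussian with a hypothetical scale `centre_scale`; the function name, the seed and the mask representation are likewise illustrative.

```python
import random

def simulate_dataset(K=3, M=200, P=50, noise_var=0.1, p0=0.5,
                     centre_scale=2.0, seed=0):
    """Returns (X, labels, mask, centres) with K*M points in R^P.
    mask[i][p] is True when entry p of point i is observed, which happens
    independently with probability p0."""
    rng = random.Random(seed)
    noise_std = noise_var ** 0.5
    # Assumption: centres are i.i.d. Gaussian; the paper does not specify them.
    centres = [[rng.gauss(0.0, centre_scale) for _ in range(P)] for _ in range(K)]
    X, labels, mask = [], [], []
    for k in range(K):
        for _ in range(M):
            # Noisy instance of centre k: zero-mean white Gaussian noise.
            X.append([c + rng.gauss(0.0, noise_std) for c in centres[k]])
            labels.append(k)
            mask.append([rng.random() < p0 for _ in range(P)])
    return X, labels, mask, centres

X, labels, mask, centres = simulate_dataset(p0=0.5, seed=1)
```

Halving the distances between the entries of `centres` before adding noise would give an analogue of Dataset-2, and repeating the call for p0 = 1, 0.9, …, 0.2 reproduces the range of sampling fractions.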

Fig. 4:

Clustering results in simulated datasets. The H1 penalty is used to cluster two datasets with varying fractions of missing entries. We show here the 2 most significant principal components of the solutions. The original points {xi} are connected to their cluster centre estimates {ui*} by lines.

4.3. Clustering of Wine Dataset

We apply the clustering algorithm to the Wine dataset [15]. Each data point has P = 13 features. We created a dataset without outliers by retaining only M = 40 points per cluster, resulting in 120 points. The results are displayed in Fig 5 using the PCA technique as explained in the previous subsection. It is seen that the clustering is quite stable and degrades gradually with increasing fractions of missing entries.

Fig. 5:

Clustering the Wine dataset. The H1 penalty is used for clustering with varying fractions of missing entries.

5. CONCLUSION

We propose a clustering technique that can handle the presence of missing feature values. We derive theoretical guarantees for the successful recovery of the clusters using the proposed optimization problem. We also propose an algorithm to efficiently solve a relaxation of the above problem. This algorithm is demonstrated on simulated and real datasets. It is observed that the proposed scheme can perform clustering even in the presence of a large fraction of missing entries.

Acknowledgments

This work is supported by NIH 1R01EB019961-01A1 and ONR N000141310202.

6. REFERENCES

  • [1] Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, and Lin C-T, “A review of clustering techniques and developments,” Neurocomputing, 2017.
  • [2] Jain AK, Murty MN, and Flynn PJ, “Data clustering: a review,” ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
  • [3] De Souto MC, Jaskowiak PA, and Costa IG, “Impact of missing data imputation methods on gene expression clustering and classification,” BMC Bioinformatics, vol. 16, no. 1, p. 64, 2015.
  • [4] Bell RM, Koren Y, and Volinsky C, “The BellKor 2008 solution to the Netflix Prize,” Statistics Research Department at AT&T Research, 2008.
  • [5] Brick JM and Kalton G, “Handling missing data in survey research,” Statistical Methods in Medical Research, vol. 5, no. 3, pp. 215–238, 1996.
  • [6] Wagstaff KL and Laidler VG, “Making the most of missing values: Object clustering with partial data in astronomy,” in Astronomical Data Analysis Software and Systems XIV, vol. 347, 2005, p. 172.
  • [7] Dixon JK, “Pattern recognition with partly missing data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 10, pp. 617–621, 1979.
  • [8] Hocking TD, Joulin A, Bach F, and Vert J-P, “Clusterpath: an algorithm for clustering using convex fusion penalties,” in 28th International Conference on Machine Learning, 2011, p. 1.
  • [9] Chen GK, Chi EC, Ranola JMO, and Lange K, “Convex clustering: An attractive alternative to hierarchical clustering,” PLoS Comput Biol, vol. 11, no. 5, p. e1004228, 2015.
  • [10] Chi JT, Chi EC, and Baraniuk RG, “k-POD: A method for k-means clustering of missing data,” The American Statistician, vol. 70, no. 1, pp. 91–99, 2016.
  • [11] Poddar S and Jacob M, “Dynamic MRI using smoothness regularization on manifolds (SToRM),” IEEE Transactions on Medical Imaging, vol. 35, no. 4, pp. 1106–1115, April 2016.
  • [12] Poddar S, Lingala SG, and Jacob M, “Joint recovery of under sampled signals on a manifold: Application to free breathing cardiac MRI,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6904–6908.
  • [13] Mohsin YQ, Ongie G, and Jacob M, “Iterative shrinkage algorithm for patch-smoothness regularized medical image recovery,” IEEE Transactions on Medical Imaging, vol. 34, no. 12, pp. 2417–2428, 2015.
  • [14] Yang Z and Jacob M, “Nonlocal regularization of inverse problems: A unified variational framework,” IEEE Transactions on Image Processing, vol. 22, no. 8, pp. 3192–3203, August 2013.
  • [15] Lichman M, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
