CLUSTERING OF DATA WITH MISSING ENTRIES

Sunrita Poddar; Mathews Jacob

doi:10.1109/icassp.2018.8462602

Author manuscript; available in PMC: 2021 Feb 24.

Published in final edited form as: Proc IEEE Int Conf Acoust Speech Signal Process. 2018 Sep 13;2018:2831–2835. doi: 10.1109/icassp.2018.8462602

PMCID: PMC7902244 NIHMSID: NIHMS1667923 PMID: 33633499

Abstract

The analysis of large datasets is often complicated by the presence of missing entries, mainly because most current machine learning algorithms are designed to work with full data. The main focus of this work is to introduce a clustering algorithm that provides good clustering even in the presence of missing data. The proposed technique solves an ℓ₀ fusion penalty based optimization problem to recover the clusters. We theoretically analyze the conditions needed for the successful recovery of the clusters. We also propose an algorithm to solve a relaxation of this problem using saturating non-convex fusion penalties. The method is demonstrated on simulated and real datasets, and is observed to perform well in the presence of large fractions of missing entries.

Keywords: clustering, missing entries, non-convex penalties

1. INTRODUCTION

Clustering is a popular unsupervised data analysis technique for finding natural groupings in the absence of training data. Specifically, it assigns each data point to a group, such that all points within a group are similar and points in different groups are dissimilar in some sense. Clustering methods are widely used in the analysis of gene expression data, image segmentation, identification of lexemes in handwritten text, search result grouping and recommender systems [1, 2].

Most clustering algorithms cannot be directly applied to datasets with missing entries. For example, gene expression data often contains missing entries due to image corruption, fabrication errors or contaminants [3], rendering gene cluster analysis difficult. Likewise, large databases used by recommender systems (e.g. Netflix) usually have a huge amount of missing data, which makes pattern discovery challenging [4]. Missing responses in surveys [5] and failing imaging sensors in astronomy [6] are reported to make the analysis in these applications similarly challenging. The most obvious way to apply existing clustering algorithms to data with missing entries is to first convert the data to a complete dataset. This can be done using deletion or imputation [7]. An extension of the weighted sum-of-norms algorithm [8] has been proposed where the weights are estimated from the data points by applying imputation techniques to the missing entries [9]. A majorize-minimize algorithm was introduced in [10] to solve for the cluster centres and cluster memberships, which offers a proven reduction in cost with each iteration. However, there is no theoretical analysis of these algorithms, which makes it difficult to determine what fraction of entries needs to be sampled to recover the correct clusters.

In this paper, we introduce an algorithm to cluster data when some of the features are missing in each point. The method is inspired by the recently proposed sum-of-norms clustering technique [8]. This technique assigns a surrogate variable to each data point, which is an estimate of the cluster centre to which that point belongs. When a fusion penalty is used, it is observed that the surrogate variables belonging to the same cluster coalesce to that centre point. These values denote the estimated cluster centres. In prior work, we used a weighted convex fusion penalty to recover under-sampled MRI images lying on a manifold [11, 12], where the weights were estimated using a special navigator acquisition. In this work, we propose an optimization problem with an ℓ₀ norm based fusion penalty, since we have observed that non-convex fusion penalties provide better clustering performance than convex ones. The main focus is to theoretically analyze the conditions for the successful recovery of the clusters from data using the proposed optimization technique, when several features are missing. This analysis reveals that the clustering performance is determined by factors such as cluster-separation, cluster variance and feature coherence. When two clusters are distinguishable by only very few features, then it is difficult to distinguish between them if these features are not observed, making feature coherence important. As expected, we also obtain a higher probability of successful clustering in the presence of fewer missing entries. We propose an algorithm to efficiently solve a relaxation of this optimization problem, using saturating non-convex fusion penalties. It is demonstrated on simulated and real datasets that the proposed algorithm performs successful clustering in the presence of large fractions of missing entries.

2. CLUSTERING USING ℓ₀ FUSION PENALTY

2.1. Background

We consider the clustering of points drawn from one of K distinct clusters C₁, C₂, …, C_K. We denote the centres of the clusters by $c_{1}, c_{2}, \dots, c_{K} \in ℝ^{P}$. For simplicity, we assume that there are M points in each of the clusters. The individual points in the k-th cluster are modelled as:

$$z_{k}(m) = c_{k} + n_{k}(m); \quad m = 1, \dots, M, \; k = 1, \dots, K \tag{1}$$

Here, n_k(m) is the noise or the variation of z_k(m) from the cluster centre c_k. The set of input points {x_i}, i = 1, …, KM, is obtained as a random permutation of the points {z_k(m)}. The objective of a clustering algorithm is to estimate the cluster labels, denoted by $C(x_{i})$ for i = 1, …, KM.
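As an illustration of the data model in (1), the following minimal sketch draws M points around each of K centres and permutes them. The function name `simulate_clusters`, the Gaussian noise and the random centres are assumptions made for the example; the model itself does not fix a noise distribution.

```python
import numpy as np

def simulate_clusters(centres, M, noise_std, rng):
    """Draw M points per cluster as z_k(m) = c_k + n_k(m), then permute them."""
    K, P = centres.shape
    # Repeat each centre M times and add i.i.d. Gaussian noise
    # (the Gaussian choice is an assumption of this sketch).
    Z = np.repeat(centres, M, axis=0) + noise_std * rng.standard_normal((K * M, P))
    labels = np.repeat(np.arange(K), M)
    # Random ordering of the input points x_i; labels are permuted identically.
    perm = rng.permutation(K * M)
    return Z[perm], labels[perm]

rng = np.random.default_rng(0)
centres = rng.standard_normal((3, 50))   # hypothetical centres in R^50
X, labels = simulate_clusters(centres, M=200, noise_std=0.1, rng=rng)
```

The returned `labels` play the role of the ground-truth C(x_i) that a clustering algorithm tries to estimate.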

The sum-of-norms (SON) method is a recently proposed convex clustering algorithm [8]. Here, a surrogate variable u_i is introduced for each point x_i, which is an estimate of the centre of the cluster to which x_i belongs. In order to find the optimal $\{u_{i}^{*}\}$, the following optimization problem is solved:

$$\{u_{i}^{*}\} = \arg\min_{\{u_{i}\}} \sum_{i=1}^{KM} \| x_{i} - u_{i} \|_{2}^{2} + λ \sum_{i=1}^{KM} \sum_{j=1}^{KM} \| u_{i} - u_{j} \|_{p} \tag{2}$$

The fusion penalty (∥u_i − u_j∥_p) can be enforced using different ℓ_p norms, out of which the ℓ₁, ℓ₂ and ℓ_∞ norms have been used in literature [8]. The use of sparsity promoting fusion penalties encourages sparse differences u_i−u_j, which facilitates the clustering of the points {u_i}.
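To make the two terms of (2) concrete, the sketch below evaluates the SON cost for a given set of surrogate variables. The name `son_objective` and the array layout (one point per row) are choices made for this example only.

```python
import numpy as np

def son_objective(X, U, lam, p=2):
    """Value of the cost in (2): data fidelity plus pairwise fusion penalty."""
    fidelity = np.sum((X - U) ** 2)
    # diffs[i, j, :] = u_i - u_j for every ordered pair (i, j)
    diffs = U[:, None, :] - U[None, :, :]
    # l_p norm of each pairwise difference (p may be 1, 2 or np.inf)
    fusion = np.sum(np.linalg.norm(diffs, ord=p, axis=2))
    return fidelity + lam * fusion

X = np.array([[0.0, 0.0], [2.0, 0.0]])
cost_unfused = son_objective(X, X, lam=0.5)                 # u_i = x_i
cost_fused = son_objective(X, np.tile(X.mean(0), (2, 1)), lam=0.5)  # u_1 = u_2
```

In this two-point example, keeping u_i = x_i pays only the fusion term, while fusing both surrogates at the mean pays only the fidelity term; λ controls which of the two is cheaper.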

2.2. Central Assumptions

We make the following assumptions (illustrated in Fig 1), which are key to the successful clustering of the points:

A.1: Cluster separation:

Points from different clusters are separated by δ > 0 in the ℓ₂ sense, i.e.:

$$\min_{\{m, n\}} \| z_{k}(m) - z_{l}(n) \|_{2} \geq δ; \quad \forall k \neq l \tag{3}$$

A.2: Cluster size:

The maximum separation of points within any cluster in the ℓ_∞ sense is ϵ ≥ 0, i.e.:

$$\max_{\{m, n\}} \| z_{k}(m) - z_{k}(n) \|_{\infty} = ϵ; \quad \forall k = 1, \dots, K \tag{4}$$

A.3: Feature concentration:

The coherence of a vector $y \in ℝ^{P}$ is defined as $μ(y) = \frac{P \| y \|_{\infty}^{2}}{\| y \|_{2}^{2}}$. We bound the coherence of the difference between points from different clusters as:

$$\max_{\{m, n\}} μ(z_{k}(m) - z_{l}(n)) \leq μ_{0}; \quad \forall k \neq l \tag{5}$$

The quantity $κ = \frac{ϵ \sqrt{P}}{δ}$ is a measure of the difficulty of the clustering problem. The recovery of clusters when κ is small is expected to be easier.
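For a dataset with known labels, the quantities in A.1 to A.3 can be computed directly. The following sketch does so by brute force over all pairs; the function name `difficulty_parameters` is hypothetical, ϵ is taken as the largest intra-cluster ℓ_∞ separation over all clusters, and the O(N²P) memory use makes it suitable only for small datasets.

```python
import numpy as np

def difficulty_parameters(X, labels):
    """Return (delta, eps, mu0, kappa) of assumptions A.1-A.3 for labelled points."""
    P = X.shape[1]
    diffs = X[:, None, :] - X[None, :, :]        # all pairwise differences
    same = labels[:, None] == labels[None, :]    # True where both points share a cluster
    l2 = np.linalg.norm(diffs, axis=2)
    linf = np.abs(diffs).max(axis=2)
    delta = l2[~same].min()                      # A.1: smallest inter-cluster l2 distance
    eps = linf[same].max()                       # A.2: largest intra-cluster l_inf distance
    # A.3: largest coherence of an inter-cluster difference (requires delta > 0)
    mu0 = (P * linf[~same] ** 2 / l2[~same] ** 2).max()
    kappa = eps * np.sqrt(P) / delta
    return delta, eps, mu0, kappa

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 4.0], [3.0, 4.1]])
labels = np.array([0, 0, 1, 1])
delta, eps, mu0, kappa = difficulty_parameters(X, labels)
```

Since 1 ≤ μ(y) ≤ P for any non-zero vector, a value of μ₀ close to 1 indicates that the clusters differ in many features rather than a few.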

2.3. Theoretical Guarantees

We study the problem of clustering {x_i} in the presence of entries missing uniformly at random. We arrange the points {x_i} as columns of a matrix X. We assume that each entry of X is observed with probability p₀. The entries measured in the i-th column are denoted by:

$$y_{i} = S_{i} x_{i}, \quad i = 1, \dots, KM \tag{6}$$

where S_i is the sampling matrix, formed by selecting rows of the identity matrix. We consider solving the following optimization problem to obtain the cluster memberships from data with missing entries:

$$\{u_{i}^{*}\} = \arg\min_{\{u_{i}\}} \sum_{i=1}^{KM} \sum_{j=1}^{KM} \| u_{i} - u_{j} \|_{2,0} \quad \text{s.t.} \quad \| S_{i}(x_{i} - u_{i}) \|_{\infty} \leq \frac{ϵ}{2}, \; i \in \{1, \dots, KM\} \tag{7}$$

Here $\| x \|_{2,0}$ equals 0 if $\| x \|_{2} = 0$ and 1 otherwise, so the cost counts the pairs of surrogate variables that differ.

We claim that the above algorithm can successfully recover the clusters with high probability when the clusters are well separated (low κ), the sampling probability p₀ is sufficiently high and the coherence μ₀ is small. We state our theoretical guarantees after defining the following quantities:

Upper bound for the probability that two points have fewer than $\frac{p_{0}^{2} P}{2}$ commonly observed locations: $γ_{0} := \left(\frac{e}{2}\right)^{-\frac{p_{0}^{2} P}{2}}$
Given that two points from different clusters have more than $\frac{p_{0}^{2} P}{2}$ commonly observed locations, upper bound for the probability that they can yield the same u without violating the constraints in (7): $δ_{0} := e^{-\frac{p_{0}^{2} P (1 - κ^{2})^{2}}{μ_{0}^{2}}}$
Upper bound for the probability that two points from different clusters can yield the same u without violating the constraints in (7): $β_{0} := 1 - (1 - δ_{0})(1 - γ_{0})$
Upper bound for the failure probability of (7): $η_{0} := \sum_{\{m_{j}\} \in S} \left[ β_{0}^{\frac{1}{2} (M^{2} - \sum_{j} m_{j}^{2})} \prod_{j} \binom{M}{m_{j}} \right]$, where $S$ is the set of all sets of positive integers {m_j} such that $2 \leq U(\{m_{j}\}) \leq K$ and $\sum_{j} m_{j} = M$. Here, the function U counts the number of non-zero elements in a set. For example, if K = 2 then $η_{0} = \sum_{i=1}^{M-1} \left[ β_{0}^{i (M - i)} \binom{M}{i}^{2} \right]$.
For K = 2 and $\log β_{0} \leq \frac{1}{M - 1} + \frac{2}{M - 2} \log \frac{1}{M - 1}$, we have the simpler bound $η_{0} \leq M^{3} β_{0}^{M - 1}$.
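The quantities defined above are simple to evaluate numerically. The sketch below transcribes the definitions of γ₀, δ₀, β₀ and the K = 2 form of η₀; the function names are illustrative, and the direct use of binomial coefficients limits `eta0_two_clusters` to moderate M (a log-domain version would be needed for very large M).

```python
import numpy as np
from math import comb

def gamma0(p0, P):
    """Bound on the probability of fewer than p0^2 P / 2 common observations."""
    return (np.e / 2) ** (-(p0 ** 2) * P / 2)

def delta0(p0, P, kappa, mu0):
    """Bound on fusing two different-cluster points, given enough common observations."""
    return np.exp(-(p0 ** 2) * P * (1 - kappa ** 2) ** 2 / mu0 ** 2)

def beta0(p0, P, kappa, mu0):
    """Combined pairwise bound: 1 - (1 - delta0)(1 - gamma0)."""
    return 1 - (1 - delta0(p0, P, kappa, mu0)) * (1 - gamma0(p0, P))

def eta0_two_clusters(beta, M):
    """Failure-probability bound for K = 2: sum_i beta^(i(M-i)) * C(M, i)^2."""
    return sum(beta ** (i * (M - i)) * comb(M, i) ** 2 for i in range(1, M))

b = beta0(p0=0.9, P=50, kappa=0.39, mu0=2.3)
eta = eta0_two_clusters(b, M=50)
```

Note that η₀ is only a bound: for small M or large β₀ it can exceed 1, in which case the guarantee 1 − η₀ is vacuous.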

Lemma 2.1.

Consider any two points x₁ and x₂ from the same cluster. A solution u exists for the following equations:

$$\| S_{i}(x_{i} - u) \|_{\infty} \leq \frac{ϵ}{2}; \quad i = 1, 2 \tag{8}$$

with probability 1.

Lemma 2.2.

Consider any two points x₁ and x₂ from different clusters, and assume that κ < 1. A solution u exists for the following equations:

$$\| S_{i}(x_{i} - u) \|_{\infty} \leq \frac{ϵ}{2}; \quad i = 1, 2 \tag{9}$$

with probability less than β₀.

The above lemmas indicate that two points from the same cluster can always be assigned the same centre u^∗. However, for a pair of points from different clusters, this can happen with a probability < β₀. We note that β₀ decreases with a decrease in κ. Using lemmas 2.1 and 2.2, we get the following result for a large number of points from multiple clusters:

Lemma 2.3.

Assume that $\{x_{i} : i \in I, |I| = M\}$ is a set of points chosen randomly from multiple clusters (not all are from the same cluster). If κ < 1, a solution u does not exist for the following equations:

$$\| S_{i}(x_{i} - u) \|_{\infty} \leq \frac{ϵ}{2}; \quad \forall i \in I \tag{10}$$

with probability exceeding 1 − η₀.

We note here, that for a low value of β₀ and a high value of M, we will arrive at a very low value of η₀. Lemma 2.3 can be used to arrive at our main result:

Theorem 2.4.

If κ < 1, the solution to the optimization problem (7) is identical to the ground-truth clustering with probability exceeding 1 − η₀.

The reasoning follows from the fact that all solutions with cluster sizes smaller than M are associated with a higher cost than the ground-truth solution. In the special case where there are no missing entries, the constraints of optimization problem (7) reduce to $\| x_{i} - u_{i} \|_{\infty} \leq \frac{ϵ}{2}$. We have the following theorem guaranteeing successful recovery of the clusters:

Theorem 2.5.

If κ < 1, the solution to the optimization problem (7) is identical to the ground-truth clustering in the absence of missing entries.
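The feasibility question studied in Lemmas 2.1 to 2.3 (whether a set of points can share a single u under the constraints of (7)) can be checked numerically. The constraints decouple across features, and in each feature the intervals of half-width ϵ/2 around the observed values intersect exactly when the observed values span a range of at most ϵ. The sketch below uses this observation; the name `common_centre_exists` and the boolean `mask` representation of the sampling matrices S_i are choices made for this example.

```python
import numpy as np

def common_centre_exists(X, mask, eps):
    """True if some u satisfies |x_i[j] - u[j]| <= eps/2 on every observed entry.

    X    : (N, P) array of points (values at unobserved entries are ignored)
    mask : (N, P) boolean array, True where an entry is observed
    """
    hi = np.where(mask, X, -np.inf).max(axis=0)   # largest observed value per feature
    lo = np.where(mask, X, np.inf).min(axis=0)    # smallest observed value per feature
    observed = mask.any(axis=0)                   # features seen by at least one point
    # Features observed by no point impose no constraint on u.
    return bool(np.all(hi[observed] - lo[observed] <= eps))

X = np.array([[0.0, 0.0], [0.5, 10.0]])
full = np.ones_like(X, dtype=bool)
partial = np.array([[True, True], [True, False]])
fusible_full = common_centre_exists(X, full, eps=1.0)
fusible_partial = common_centre_exists(X, partial, eps=1.0)
```

In this toy case the two points differ only in the second feature. With full data they cannot share a centre, but once that feature is unobserved in one of them they can, which is the failure mode that low coherence μ₀ and high p₀ make unlikely.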

3. RELAXATION OF THE ℓ₀ PENALTY

We propose to solve the following relaxation of the optimization problem (7), which is more computationally feasible:

$$\{u_{i}^{*}\} = \arg\min_{\{u_{i}\}} \sum_{i=1}^{KM} \| S_{i}(u_{i} - x_{i}) \|_{2}^{2} + λ \sum_{i=1}^{KM} \sum_{j=1}^{KM} ϕ(\| u_{i} - u_{j} \|_{2}) \tag{11}$$

Here ϕ is a function approximating the ℓ₀ norm, such as:

ℓ_p norm: $ϕ(x) = |x|^{p}$, for some 0 < p < 1.
H₁ penalty: $ϕ(x) = 1 - e^{-\frac{x^{2}}{2 σ^{2}}}$.
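The two penalties listed above can be written down directly, which makes the difference in their behaviour easy to inspect. In this sketch the function names and the default values of `p` and `sigma` are assumptions for illustration.

```python
import numpy as np

def lp_penalty(x, p=0.5):
    """l_p penalty |x|^p with 0 < p < 1: concave on x >= 0, but unbounded."""
    return np.abs(x) ** p

def h1_penalty(x, sigma=1.0):
    """H1 penalty 1 - exp(-x^2 / (2 sigma^2)): saturates at 1 for large |x|."""
    return 1.0 - np.exp(-np.asarray(x, dtype=float) ** 2 / (2 * sigma ** 2))

distances = np.array([0.0, 0.5, 1.0, 5.0, 50.0])
lp_values = lp_penalty(distances)
h1_values = h1_penalty(distances)
```

Both penalties are zero at zero distance. The H₁ penalty approaches 1 once the distance is several times σ, mimicking the ℓ₀ indicator, so pairs of well-separated points contribute an almost constant cost.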

Similar to [13, 14], we reformulate the problem by majorizing the penalty ϕ using a quadratic surrogate functional: ϕ(x) ≤ w(x)x² + d, where $w(x) = \frac{ϕ'(x)}{2x}$ is evaluated at the current iterate and d is a constant. We now state the majorize-minimize formulation for problem (11) as:

$$\{u_{i}^{*}, w_{ij}^{*}\} = \arg\min_{\{u_{i}, w_{ij}\}} \sum_{i=1}^{KM} \| S_{i}(u_{i} - x_{i}) \|_{2}^{2} + λ \sum_{i=1}^{KM} \sum_{j=1}^{KM} w_{ij} \| u_{i} - u_{j} \|_{2}^{2} \tag{12}$$

We solve problem (12) by alternating between minimization with respect to {u_i} and {w_ij} until convergence.
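A minimal sketch of this alternation for the H₁ penalty is given below. It is one possible realisation under stated assumptions, not the authors' implementation: the weights are set to w_ij = ϕ'(d)/(2d) = exp(−d²/(2σ²))/(2σ²) with d = ‖u_i − u_j‖₂; the u-update solves the quadratic problem exactly, one feature at a time, through the normal equations (D + 2λL)u = Dx, where D is the diagonal sampling indicator and L the graph Laplacian of the weights; missing entries are initialised with the observed feature mean; and a fixed iteration count replaces a convergence test. All function names are hypothetical.

```python
import numpy as np

def h1_weights(U, sigma):
    """w_ij = phi'(d) / (2 d) for the H1 penalty, with d = ||u_i - u_j||_2."""
    d2 = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2)) / (2 * sigma ** 2)
    np.fill_diagonal(W, 0.0)                 # self-pairs contribute nothing
    return W

def update_centres(X, mask, W, lam):
    """Minimise (12) over {u_i} for fixed weights; the problem separates by feature."""
    N, P = X.shape
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian of the weights
    U = np.empty((N, P))
    for f in range(P):
        D = np.diag(mask[:, f].astype(float))
        rhs = np.where(mask[:, f], X[:, f], 0.0)   # D x, ignoring unobserved values
        # The factor 2 arises because (12) sums over ordered pairs (i, j).
        U[:, f] = np.linalg.lstsq(D + 2 * lam * L, rhs, rcond=None)[0]
    return U

def mm_cluster(X, mask, lam, sigma, n_iter=50):
    """Alternate weight and centre updates for a fixed number of iterations."""
    # Assumed initialisation: observed feature mean at missing entries
    # (requires every feature to be observed in at least one point).
    col_mean = np.nanmean(np.where(mask, X, np.nan), axis=0)
    U = np.where(mask, X, col_mean)
    for _ in range(n_iter):
        W = h1_weights(U, sigma)
        U = update_centres(X, mask, W, lam)
    return U
```

Each u-update costs one N × N solve per feature, so this dense version is only practical for small datasets. Because the quadratic surrogate majorizes the H₁ cost, each pass does not increase the cost in (11).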

4. RESULTS

4.1. Study of Theoretical Guarantees

We observe the behaviour of γ₀, δ₀, β₀ and η₀ as a function of p₀, P, κ and M. In Fig 2 (a), the change in γ₀ is shown as a function of p₀ for different values of P. In subsequent plots, we fix P = 50 and μ₀ = 1.5. In Fig 2 (b), the change in δ₀ is shown as a function of p₀ for different values of κ. In Fig 2 (c), the behaviour of β₀ is shown. We consider K = 2 for subsequent plots. (1−η₀) is plotted in (d) as a function of p₀ for different values of κ and M. As expected, the probability of success of the clustering algorithm increases with decrease in κ and increase in p₀ and M.

4.2. Clustering of Simulated Data

We simulated datasets with K = 2 disjoint clusters in $ℝ^{50}$ with a varying number of points per cluster. The points in each cluster follow a uniform random distribution. We study the probability of success of the H₁ penalty based clustering algorithm as a function of κ, M and p₀. For a particular set of parameters the experiment was conducted 20 times. Fig 3 (a) shows the result for datasets with κ = 0.39 and μ₀ = 2.3. The theoretical guarantees for successfully clustering the dataset are shown in (b). Our theoretical guarantees hold for κ < 1. However, we demonstrate in (c) that even with κ = 1.15 and μ₀ = 13.2, our clustering algorithm is successful.

Fig. 3: Experimental results for probability of success. Guarantees are shown for a simulated dataset with K = 2 clusters. For (a) and (b), κ = 0.39 and μ₀ = 2.3. (a) and (b) show the experimental and theoretical values for the probability of success respectively. (c) shows the experimentally obtained probability of success for a more challenging dataset with κ = 1.15 and μ₀ = 13.2. We do not have theoretical guarantees for this case, since our analysis assumes κ < 1.

Clustering results with K = 3 simulated clusters are shown in Fig 4. We simulated Dataset-1 with K = 3 disjoint clusters in $ℝ^{50}$ and M = 200 points in each cluster. For each of these 3 cluster centres, 200 noisy instances were generated by adding zero-mean white Gaussian noise of variance 0.1. The dataset was sub-sampled with varying fractions of missing entries (p₀ = 1, 0.9, 0.8, …, 0.3, 0.2). We also generate Dataset-2 by halving the distance between the cluster centres, while keeping the intra-cluster variance fixed. We test the proposed algorithm on these datasets using the H₁ penalty. Since the points lie in $ℝ^{50}$ , we take a PCA of the points and their estimated centres and plot the 2 most significant components. The 3 colours distinguish the points according to their ground-truth clusters. Each point x_i is joined to its centre estimate $u_{i}^{*}$ by a line. We observe that the clustering algorithm is more stable with fewer missing entries.
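The two-component PCA view described above can be obtained with a singular value decomposition. The sketch below projects the points and their centre estimates onto the two leading principal components of their union; the function name `pca_2d` is hypothetical, and it assumes the fully observed points are available for plotting.

```python
import numpy as np

def pca_2d(X, U):
    """Project points X and centre estimates U onto their joint top-2 principal components."""
    stacked = np.vstack([X, U])
    mean = stacked.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(stacked - mean, full_matrices=False)
    basis = Vt[:2].T                               # (P, 2) projection matrix
    return (X - mean) @ basis, (U - mean) @ basis

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((20, 5))
U_demo = 0.5 * X_demo                              # stand-in for centre estimates
X2, U2 = pca_2d(X_demo, U_demo)
```

Joining each row of `X2` to the matching row of `U2` with a line segment reproduces the kind of plot described in the text.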

4.3. Clustering of Wine Dataset

We apply the clustering algorithm to the Wine dataset [15]. Each data point has P = 13 features. We created a dataset without outliers by retaining only M = 40 points per cluster, resulting in 120 points. The results are displayed in Fig 5 using the PCA technique as explained in the previous subsection. It is seen that the clustering is quite stable and degrades gradually with increasing fractions of missing entries.

Fig. 5: Clustering the Wine dataset. The H₁ penalty is used for clustering with varying fractions of missing entries.

5. CONCLUSION

We propose a clustering technique that can handle the presence of missing feature values. We derive theoretical guarantees for the successful recovery of the clusters using the proposed optimization problem. We also propose an algorithm to efficiently solve a relaxation of the above problem. This algorithm is demonstrated on simulated and real datasets. It is observed that the proposed scheme can perform clustering even in the presence of a large fraction of missing entries.

Acknowledgments

This work is supported by NIH 1R01EB019961-01A1 and ONR N000141310202.

6. REFERENCES

[1] Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, and Lin C-T, “A review of clustering techniques and developments,” Neurocomputing, 2017.
[2] Jain AK, Murty MN, and Flynn PJ, “Data clustering: a review,” ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
[3] De Souto MC, Jaskowiak PA, and Costa IG, “Impact of missing data imputation methods on gene expression clustering and classification,” BMC Bioinformatics, vol. 16, no. 1, p. 64, 2015.
[4] Bell RM, Koren Y, and Volinsky C, “The BellKor 2008 solution to the Netflix Prize,” Statistics Research Department at AT&T Research, 2008.
[5] Brick JM and Kalton G, “Handling missing data in survey research,” Statistical Methods in Medical Research, vol. 5, no. 3, pp. 215–238, 1996.
[6] Wagstaff KL and Laidler VG, “Making the most of missing values: Object clustering with partial data in astronomy,” in Astronomical Data Analysis Software and Systems XIV, vol. 347, 2005, p. 172.
[7] Dixon JK, “Pattern recognition with partly missing data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 10, pp. 617–621, 1979.
[8] Hocking TD, Joulin A, Bach F, and Vert J-P, “Clusterpath: an algorithm for clustering using convex fusion penalties,” in 28th International Conference on Machine Learning, 2011, p. 1.
[9] Chen GK, Chi EC, Ranola JMO, and Lange K, “Convex clustering: An attractive alternative to hierarchical clustering,” PLoS Comput Biol, vol. 11, no. 5, p. e1004228, 2015.
[10] Chi JT, Chi EC, and Baraniuk RG, “k-POD: A method for k-means clustering of missing data,” The American Statistician, vol. 70, no. 1, pp. 91–99, 2016.
[11] Poddar S and Jacob M, “Dynamic MRI using smoothness regularization on manifolds (SToRM),” IEEE Trans. Medical Imaging, vol. 35, no. 4, pp. 1106–1115, April 2016.
[12] Poddar S, Lingala SG, and Jacob M, “Joint recovery of under sampled signals on a manifold: Application to free breathing cardiac MRI,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6904–6908.
[13] Mohsin YQ, Ongie G, and Jacob M, “Iterative shrinkage algorithm for patch-smoothness regularized medical image recovery,” IEEE Transactions on Medical Imaging, vol. 34, no. 12, pp. 2417–2428, 2015.
[14] Yang Z and Jacob M, “Nonlocal regularization of inverse problems: A unified variational framework,” IEEE Transactions on Image Processing, vol. 22, no. 8, pp. 3192–3203, August 2013.
[15] Lichman M, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
