PLOS ONE. 2023 Apr 13;18(4):e0284114. doi: 10.1371/journal.pone.0284114

A new iterative initialization of EM algorithm for Gaussian mixture models

Jie You 1, Zhaoxuan Li 1, Junli Du 1,*
Editor: Praveen Kumar Donta
PMCID: PMC10101421  PMID: 37053163

Abstract

Background

The expectation maximization (EM) algorithm is a common tool for estimating the parameters of Gaussian mixture models (GMM). However, it is highly sensitive to initial values and easily becomes trapped in a local optimum.

Method

To address these problems, a new iterative method of EM initialization (MRIPEM) is proposed in this paper. It incorporates the ideas of multiple restarts, iterations, and clustering. In particular, the sample mean vector and covariance matrix are calculated as the initial values of the iteration. Then, the optimal feature vector is selected from the candidate feature vectors by the maximum Mahalanobis distance as a new partition vector for clustering. The parameter values are renewed continuously according to the clustering results.

Results

To verify the applicability of MRIPEM, we compared it with two other popular initialization methods on simulated and real datasets. The comparison of the three stochastic algorithms indicates that the MRIPEM algorithm is comparable at relatively high dimensions and overlaps and significantly better at low dimensions and low overlaps.

1 Introduction

The Gaussian mixture model (GMM) is a very useful tool that is widely used for modeling complex probability distributions, for example in data classification [1], image classification and segmentation [2–4], and speech recognition [5]. The Gaussian mixture model is composed of K single Gaussian distributions. For a single Gaussian distribution, the parameters are usually estimated by maximum likelihood estimation (MLE), but this is not directly applicable to GMM. In fact, in a GMM it is not known in advance which component each observation belongs to, because hidden variables are introduced. The expectation-maximization (EM) method, introduced by Dempster et al. [6], can estimate the parameters of the GMM by iteratively constructing a lower bound on the likelihood function and thereby continuously improving its value. However, this procedure is highly sensitive to initialization and easily gets trapped in a local optimum [7, 8]. As shown in Fig 1, when the likelihood function is nonconvex, the EM algorithm may terminate the iteration before reaching the global optimum. Thus, research on initialization is essential for the EM algorithm.

Fig 1. Relationship curve between L(Θ) and Θ.

Θ: the initial parameter of the EM algorithm; L(Θ): the likelihood function of Θ.

At present, many initialization methods of the EM algorithm for GMM have been proposed; they mainly address two situations, a known and an unknown number of components K. When the number of components K is known, some initialization methods follow a deterministic strategy. For example, the initialization method in [9] is based on hierarchical agglomerative clustering (HAC), which uses Ward's criterion to obtain the means of the initial model. The procedure proposed by Maitra relies on detecting the best local modes [10]. Deterministic methods may lead to an incorrect solution, or even no solution, when the likelihood function is unbounded [11]. In comparison, stochastic initialization strategies can improve as the number of runs increases. The standard stochastic initialization procedure is the multiple restart method (MREM) [12], in which the EM algorithm is run many times from different random initial conditions and the parameters corresponding to the highest log-likelihood are returned as the final estimate. When the sample size is large, this method still easily falls into a local optimum. The emEM [13] and RndEM [10] algorithms are two other typical representatives of stochastic initialization strategies. The emEM algorithm starts with a phase called short EM, in which the EM algorithm is run several times from different random points under a lax convergence criterion; the set of parameters that produces the highest likelihood value is then used for the final long EM run. RndEM is a variant of emEM: it involves the same stages, but the short EM phase stops after the very first parameter estimation. In the paper of Blömer and Bujna [14], two new initialization methods are presented based on the well-known K-means++ algorithm and the Gonzalez algorithm. In the rnd-maxmin method [15], the initial means are selected from a random subset of candidate feature vectors by applying the Mahalanobis distance, and the covariance matrix of each component is initialized by randomly generating its eigenvalues and eigenvectors.

The above approaches generally require the number of components K to be given in advance. If K is unknown, it can first be assumed to belong to a set of candidate values 𝒦. For each K ∈ 𝒦, the initialization method and the EM algorithm are run, and the K value of the best fitting model is chosen based on some test procedure or criterion (usually the log-likelihood value). Moreover, many initialization methods integrate the estimation of K into the process of calculating the parameters. A clustering algorithm using fast search of density peak points (DPC) [16] can predict the number of components K and the centers of the initial classes. The Σ-EM method [17] initializes the mean vectors by choosing points that have a higher concentration of neighbors and the covariance matrix by using a truncated normal distribution. In the paper of Verbeek et al. [18], a greedy algorithm is presented which does not need an initialization but instead constructs a whole sequence of mixture models with m = 1, …, K components. Subsequently, optimized greedy initialization methods were proposed [19, 20]. Another method applies Rough-Enhanced-Bayes mixture estimation (REBMIX) to the initialization of the EM algorithm [21]; this approach combines a new type of clustering algorithm with the EM algorithm to improve the accuracy of the results, although the approach itself is not particularly novel.

Clearly, there is no single best initialization algorithm that can be applied to all instances. The performance of an initialization depends on two aspects: one is the data itself, including the degree of overlap, the sample size, and the dimension; the other is the allowable computational cost. Besides, for some algorithms the choice of hyperparameters also has a crucial impact on the final results.

In order to optimize the initial values of the EM algorithm and the resulting estimates, we propose a new iterative initialization method of the EM algorithm for GMM for the case of known K. Rather than starting from many different random initial conditions, this method calculates the sample mean vector and covariance matrix as the initial values of the iteration. Then, the optimal feature vector is selected from the candidate feature vectors by the maximum Mahalanobis distance as a new partition vector for clustering, and the parameter values are renewed continuously according to the clustering results. To verify its applicability, we compared the proposed method with two other popular initialization methods on simulated and real datasets.

2 Materials and methods

2.1 EM algorithm for Gaussian mixture models

For a d-dimensional random variable X with n samples, the probability distribution of a finite Gaussian mixture model can be expressed as a weighted sum of K components [22]:

P(X \mid \Theta) = \sum_{m=1}^{K} \alpha_m \, p_m(X \mid \theta_m), (1)

where αm is the m-th mixing proportion, which must satisfy αm > 0, m = 1, …, K, and α1 + ⋯ + αK = 1. In Eq (1), θm = {μm, Σm} is the set of parameters of the m-th component, where μm and Σm denote the mean vector and covariance matrix of the m-th mixture component, respectively. pm is the probability density function of the m-th component:

p_m(X \mid \theta_m) = \frac{1}{(2\pi)^{d/2} |\Sigma_m|^{1/2}} \exp\left\{ -\frac{1}{2} (X - \mu_m)^{T} \Sigma_m^{-1} (X - \mu_m) \right\}, (2)

where d is the dimension of the feature space and |⋅| denotes the determinant of a matrix. Thus, Θ = {α1, …, αK, μ1, Σ1, …, μK, ΣK} is the unknown set of parameters that has to be estimated in the mixture learning process. The number of components K is either known or must be determined in the learning process; in this paper it is assumed that K is known.
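To make Eqs (1) and (2) concrete, the short sketch below evaluates the mixture density of a single point directly from these formulas. It is only an illustration; the function and variable names (gmm_density, alphas, mus, Sigmas) are ours, not taken from the authors' code.

```python
# Eqs (1)-(2) as code: mixture density of one point x under parameters Theta.
import numpy as np

def gmm_density(x, alphas, mus, Sigmas):
    """x: (d,) point; alphas: (K,) weights; mus: list of (d,); Sigmas: list of (d, d)."""
    d = len(x)
    total = 0.0
    for a, mu, Sig in zip(alphas, mus, Sigmas):
        diff = x - mu
        norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sig)))  # Eq (2) prefactor
        total += a * norm * np.exp(-0.5 * diff @ np.linalg.solve(Sig, diff))  # weighted component density
    return total
```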

Suppose the samples X = {x1, x2, …, xn} are independent and identically distributed. For K = 1, Θ can be obtained by MLE:

\Theta_{\mathrm{MLE}} = \arg\max_{\Theta} \{\log L(\Theta)\}. (3)

where L(Θ) is the likelihood function, given by:

L(\Theta) = \prod_{i=1}^{n} P(x_i \mid \Theta). (4)

For K > 1, the solution of this maximization problem cannot be obtained in closed form. As a numerical optimization method, the EM algorithm is often used in such cases. The EM algorithm is an iterative optimization strategy in which each iteration consists of two steps, the expectation step (E-step) and the maximization step (M-step). For a GMM, given the initial parameters Θ(0), the algorithm proceeds as follows [23]:

• E-step: The posterior probability of the i-th observation belonging to the m-th component, h_m(x_i), is calculated by:

h_m^{(s)}(x_i) = \frac{\alpha_m^{(s-1)} \, p_m(x_i \mid \theta_m^{(s-1)})}{\sum_{j=1}^{K} \alpha_j^{(s-1)} \, p_j(x_i \mid \theta_j^{(s-1)})}, (5)

where i = 1, 2, …, n, m = 1, 2, …, K, and s = 1, 2, … is the iteration number.

• M-step: Given the posterior probabilities h_m^{(s)}(x_i), the updating formulas for the parameters Θ(s) are as follows:

\alpha_m^{(s)} = \frac{1}{n} \sum_{i=1}^{n} h_m^{(s)}(x_i), (6)
\mu_m^{(s)} = \frac{\sum_{i=1}^{n} h_m^{(s)}(x_i) \, x_i}{\sum_{i=1}^{n} h_m^{(s)}(x_i)}, (7)
\Sigma_m^{(s)} = \frac{\sum_{i=1}^{n} h_m^{(s)}(x_i) \, (x_i - \mu_m^{(s)})(x_i - \mu_m^{(s)})^{T}}{\sum_{i=1}^{n} h_m^{(s)}(x_i)}. (8)

The E-steps and M-steps are iterated until a convergence criterion is met. In this paper, a convergence criterion based on the relative improvement of the log-likelihood is adopted:

\frac{\log P(X \mid \Theta^{(s)}) - \log P(X \mid \Theta^{(s-1)})}{\log P(X \mid \Theta^{(s-1)})} < \varepsilon, (9)

where ε ≪ 1 is a pre-specified tolerance level (in this paper, ε = 10^{-5}).
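For readers who prefer code, the following sketch implements one possible version of the EM loop described by Eqs (4)–(9), using SciPy's multivariate normal density. It is a minimal illustration, not the authors' implementation; all names are ours, and the stopping rule uses absolute values as a robustness tweak to Eq (9).

```python
# A minimal sketch of the EM loop for a K-component GMM, following Eqs (4)-(9).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, alphas, mus, Sigmas, eps=1e-5, max_iter=500):
    """Run EM from the given initial parameters; X has shape (n, d)."""
    n, _ = X.shape
    K = len(alphas)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: posterior probability h_m(x_i) for every point and component (Eq 5).
        dens = np.column_stack([
            alphas[m] * multivariate_normal.pdf(X, mean=mus[m], cov=Sigmas[m])
            for m in range(K)
        ])                                          # shape (n, K)
        ll = np.sum(np.log(dens.sum(axis=1)))       # log-likelihood (Eq 4)
        h = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update mixing proportions, means, covariances (Eqs 6-8).
        Nm = h.sum(axis=0)                          # effective component sizes
        alphas = Nm / n
        mus = (h.T @ X) / Nm[:, None]
        Sigmas = []
        for m in range(K):
            diff = X - mus[m]
            Sigmas.append((h[:, m, None] * diff).T @ diff / Nm[m])

        # Stop on relative improvement of the log-likelihood (cf. Eq 9;
        # absolute values are used here so the test also works when logL < 0).
        if np.isfinite(prev_ll) and abs(ll - prev_ll) < eps * abs(prev_ll):
            break
        prev_ll = ll
    return alphas, mus, Sigmas, ll
```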

2.2 Initialization procedure

In this section, the proposed methodology (denoted MRIPEM) is introduced under the premise that the number of components K is known. The initialization method of this paper involves the idea of clustering. Starting from a single cluster, the mean vectors, covariance matrices, and mixing proportions are gradually updated in each iteration. The initialization procedure can be described in detail by the following steps (a code sketch is given after the list):

  1. Compute the population mean vector μ1 and covariance matrix Σ1 by MLE and set m = 2:
    \mu_1 = \frac{1}{n} \sum_{i=1}^{n} x_i, (10)
    \Sigma_1 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_1)(x_i - \mu_1)^{T}. (11)
  2. Choose t (t is a parameter of the method) feature vectors randomly from X to form Xl = {xl1, xl2, …, xlt}. For each xli ∈ Xl, compute the minimal squared Mahalanobis distance [24] to the mean vectors μ1, μ2, …, μm−1 that have already been chosen:
    d_{\min}^{2}(x_{li}) = \min_{j=1,\dots,m-1} d_{M}^{2}(\mu_j, \Sigma_j, x_{li}), (12)
    where d_{M}^{2}(\mu_j, \Sigma_j, x_{li}) is given by:
    d_{M}^{2}(\mu_j, \Sigma_j, x_{li}) = (x_{li} - \mu_j) \, \Sigma_j^{-1} \, (x_{li} - \mu_j)^{T}. (13)
  3. Select p_m = \arg\max_{x_{li} \in X_l} d_{\min}^{2}(x_{li}) as the new partition vector of the m-th component. In this way, pm and the previously calculated μ1, μ2, …, μm−1 constitute m partition vectors.

  4. Divide X into m partitions {C1, C2, …, Cm} by using Euclidean distance, i.e., assign each sample point xi to its nearest partition vector.

  5. According to the partition {C1, C2, …, Cm} obtained in step 4, update the mean vectors, covariance matrices, and mixing proportions of each component:
    \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i, (14)
    \Sigma_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^{T}, (15)
    \pi_j = \frac{|C_j|}{n}, (16)
    where |Cj| denotes the number of sample points in Cj.
  6. If Σj is not positive definite, use a spherical covariance matrix instead:
    \Sigma_j = \frac{1}{d\,|C_j|} \sum_{x_i \in C_j} (x_i - \mu_j)^{T}(x_i - \mu_j) \cdot I_d, (17)
    where I_d denotes the d-dimensional identity matrix.
  7. If m < K, set m = m + 1 and go to step 2; otherwise, terminate the algorithm.

  8. Set the manual loop parameter r, which represents the number of runs of the initialization method, and take the result corresponding to the best Adjusted Rand Index (ARI) [25] value as the final output of the initialization.
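The sketch below translates steps 1–8 into Python as we read them. It assumes, as step 8 states, that reference labels are available for the ARI-based selection; function names such as mripem_init_once are hypothetical, and degenerate (empty or single-point) partitions are not handled.

```python
# A minimal sketch of the MRIPEM initialization, steps 1-8 as described above.
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def mripem_init_once(X, K, t):
    n, d = X.shape
    # Step 1: start from the MLE of a single component (Eqs 10-11).
    mus = [X.mean(axis=0)]
    Sigmas = [np.cov(X, rowvar=False, bias=True)]
    pis = [1.0]
    labels = np.zeros(n, dtype=int)
    for m in range(2, K + 1):
        # Step 2: t random candidates and their minimal squared Mahalanobis
        # distance to the centres chosen so far (Eqs 12-13).
        cand = X[rng.choice(n, size=t, replace=False)]
        d_min = np.full(t, np.inf)
        for mu_j, Sig_j in zip(mus, Sigmas):
            inv = np.linalg.pinv(Sig_j)
            diff = cand - mu_j
            d_min = np.minimum(d_min, np.einsum("ij,jk,ik->i", diff, inv, diff))
        # Step 3: the farthest candidate becomes the new partition vector p_m.
        centers = np.vstack(mus + [cand[np.argmax(d_min)]])
        # Step 4: Euclidean assignment of every point to its nearest partition vector.
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        # Steps 5-6: re-estimate means, covariances (spherical fallback, Eq 17) and weights.
        mus, Sigmas, pis = [], [], []
        for j in range(m):
            Cj = X[labels == j]
            mu_j = Cj.mean(axis=0)
            Sig_j = np.cov(Cj, rowvar=False, bias=True) if len(Cj) > 1 else np.zeros((d, d))
            if np.any(np.linalg.eigvalsh(Sig_j) <= 0):        # not positive definite
                Sig_j = np.sum((Cj - mu_j) ** 2) / (d * len(Cj)) * np.eye(d)
            mus.append(mu_j)
            Sigmas.append(Sig_j)
            pis.append(len(Cj) / n)
        # Step 7: repeat until X has been divided into K partitions.
    return np.array(pis), np.array(mus), np.array(Sigmas), labels

def mripem_init(X, K, t, r, ref_labels):
    # Step 8: keep the run whose partition has the highest ARI w.r.t. the reference labels.
    runs = [mripem_init_once(X, K, t) for _ in range(r)]
    best = max(runs, key=lambda out: adjusted_rand_score(ref_labels, out[3]))
    return best[:3]    # mixing proportions, means, covariances
```

The returned proportions, means, and covariances would then serve as Θ(0) for the EM routine sketched in Section 2.1.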

2.3 Properties of the initialization procedure

The iterative initialization method of this paper is a stochastic strategy that depends entirely on the samples. If K = 1, the initial parameter values are obtained in step 1 by MLE. Otherwise, the algorithm iterates from step 2, gradually increasing the number of partitions until X is divided into K components.

The main idea of steps 2 and 3 is to use the minimal squared Mahalanobis distance to select the next partition vector used for dividing the sample. When dividing the samples, our purpose is to make the selected partition vectors as far from each other as possible. This reduces the probability that samples from different components are assigned to the same component. In step 2, t sample points are selected randomly from X as candidate vectors. Then, the minimal squared Mahalanobis distance from each of the t candidates to the m − 1 centers already determined is calculated, and the m-th partition vector is chosen according to the maximum of these distances (step 3). It should be noted that the Mahalanobis distance is selected as the criterion for determining the partition vector because it is an effective multivariate distance metric that takes into account the relationships among feature vectors [26]. In addition, a sensitivity study on the selection of t for different numbers of components K (K = 2, 3, 4, 5, 7, 9) was carried out according to the ARI value. ARI is an evaluation index proposed by Hubert and Arabie [25] to measure clustering performance. The larger the ARI value, the more consistent the clustering results of the two partitions. A negative ARI value indicates that the labels of the two partitions are independently distributed, and an ARI of 1 indicates that the two partitions are identical.
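As a small concrete check of the ARI behaviour just described, the snippet below uses scikit-learn's implementation (an assumption on our part; the paper does not state which implementation was used):

```python
# ARI is invariant to label names: identical partitions score 1,
# chance-level agreement scores around 0 or below.
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 2]
same_grouping = [2, 2, 2, 0, 0, 1]                      # same partition, renamed labels
print(adjusted_rand_score(truth, same_grouping))        # 1.0: identical partitions
print(adjusted_rand_score(truth, [0, 1, 2, 0, 1, 2]))   # negative: no better than chance
```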

As can be seen from Fig 2, when K ≤ 5 and t ≤ K, the ARI value rises rapidly until t reaches K. When t > K, the ARI value gradually reaches a plateau. Therefore, we set t = K for K < 5. Similarly, when K > 5, the ARI value gradually grows with increasing t until t = 5 and fluctuates little after t > 5. Therefore, we set t = 5 for K ≥ 5. The simulation experiments indicate that excellent results can still be obtained by bringing only t = 5 (for K ≥ 5) sample points into step 2 when the sample size is much larger than t. Choosing t candidates instead of all n sample points greatly reduces the amount of computation per iteration and thus accelerates the algorithm.

Fig 2. Determination of the best t.

For each K, the other parameter settings were fixed: loop parameter r = 1, number of samples per cluster nC = 200, dimension p = 5. Considering the randomness of the algorithm, each group of experiments (K, t) was repeated 30 times, and the ARI value for each (K, t) was taken as the mean over the 30 repetitions.

The sample points are clustered (step 4) using the new partition vector obtained in step 3. The clustering results are used to estimate the parameter values of the m components, namely the mean vectors, covariance matrices, and mixing proportions; step 5 lists the calculation formula for each parameter. In step 6, the covariance matrix is replaced by a spherical covariance matrix when it is not positive definite. This usually happens when a cluster is too small at an early stage of the iteration. The spherical covariance matrix is not only positive definite but can also be constructed quickly without being affected by the iterative process. The parameter values are continuously updated during the iteration. When m = K, the iteration stops and the last results are the outputs of the initialization method (step 7).

Due to the random nature of the algorithm, the results vary from run to run. In order to get the best initialization results, we add a loop parameter r inside the initialization procedure representing the number of times our initialization method is run (step 8). Across the r runs, the ARI is used as the performance criterion to evaluate each initialization, and the initialization result corresponding to the maximum ARI value is selected as the output of our initialization algorithm. Step 8 can be adjusted manually. Generally speaking, the greater the number of runs r, the better the initialization results. However, to balance accuracy against running time, we conducted a sensitivity analysis between the number of runs and the initialization results. Experiments with r = 1, 5, 10, 15, 20 were conducted for three dimensions (p = 2, 5, 10). Considering the randomness of the algorithm, each group of experiments (r, p) was repeated 30 times and the ARI value was taken as the mean over the 30 repetitions. As shown in Fig 3, the best performance is obtained for all three dimensions when r = 10. Therefore, the manual loop parameter r is fixed to 10 in the subsequent experiments of this paper.

Fig 3. Determination of the best r.

Other parameter settings: number of components K = 20, sample size n = 4000.

3 Results

In this section, the proposed initialization method is combined with the EM algorithm to estimate the parameters of the GMM through clustering of the data, and it is tested on artificially generated datasets and real datasets. Meanwhile, our approach is compared with two other popular stochastic initialization strategies, emEM and RndEM.

The ARI value and the log-likelihood (logL(Θ)) are used to measure the performance of the data clustering. logL(Θ) is the objective function of the EM algorithm, and the larger it is, the better. The combination of the two performance criteria can accurately measure the effectiveness of the algorithms and makes the experimental results more credible.

3.1 Simulation study

The artificial datasets for the simulation study are generated using the MixSim package in R [27], which can generate a finite mixture model with Gaussian components for prespecified levels of maximum pairwise overlap ω˘, average pairwise overlap ω¯, number of clusters K, and sample dimension p.

Naturally, the dimensionality is considered to have an impact on the clustering performance of the methods: the higher the dimension of the samples, the worse the clustering effect may be. Additionally, in Fig 4, both panels A and B are generated from Gaussian mixture data (number of clusters K = 10, dimension p = 2, number of samples per cluster nC = 200). The difference is that the overlap of A (ω¯=0.001 and ω˘=0.04) is lower than that of B (ω¯=0.05 and ω˘=0.5). As shown in Fig 4, the degree of data mixing increases with ω¯. When ω¯ reaches 0.05, the data mixing is quite severe; in this case, a very satisfactory result cannot be obtained by any method. It can be concluded that the degree of overlap (ω¯ and ω˘) and the number of dimensions p are important parameters that affect the clustering results of the initialization methods.

Fig 4. Two-dimensional datasets simulated from 10-component mixtures.

A: ω¯=0.001 and ω˘=0.04. B: ω¯=0.05 and ω˘=0.5. The ellipses are centered around the component means and represent 95% confidence regions.

Consequently, considering the different performances of the methods under different dimensions p, average overlaps ω¯, and maximum overlaps ω˘, the number of components K is fixed to 20 and the total sample size n is fixed to 4000 (the number of samples nC per component is 200) in the simulation study. The experiment simulates three different dimensions (p ∈ {2, 5, 10}). Each dimension uses three values of the average overlap (ω¯ ∈ {0.0001, 0.001, 0.01}), and each average overlap uses two values of the maximum overlap. In this way, a total of 18 experiments were carried out for the different combinations (p, ω¯, ω˘).
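The datasets themselves are produced with the MixSim package in R. Purely as a language-consistent illustration, the sketch below draws a sample of the same shape (K = 20 components, p = 5 dimensions, nC = 200 points per component) from a Gaussian mixture with user-supplied parameters; unlike MixSim, it does not control pairwise overlap, and all names are ours.

```python
# Illustration only: draw n points from a K-component Gaussian mixture with
# given parameters. This does not replicate MixSim's overlap control.
import numpy as np

def sample_gmm(pis, mus, Sigmas, n, rng=None):
    rng = rng or np.random.default_rng(123)
    comps = rng.choice(len(pis), size=n, p=pis)                  # true component labels
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in comps])
    return X, comps

# Shape matching the simulation design: K = 20, p = 5, nC = 200 (n = 4000).
K, p, nC = 20, 5, 200
mus = np.random.default_rng(0).uniform(-10, 10, size=(K, p))     # placeholder means
Sigmas = np.stack([np.eye(p) for _ in range(K)])                 # placeholder covariances
X, true_labels = sample_gmm(np.full(K, 1 / K), mus, Sigmas, n=K * nC)
```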

When the three algorithms emEM, RndEM, and MRIPEM are compared, owing to the nature of stochastic strategies, we generated 30 different datasets for each triplet (p, ω¯, ω˘). In addition, to explore the stability of the results over the 30 runs, we define var*:

var*=var×1000, (18)

where var represents the variance of ARI for 30 results. The reason for 1000 times magnification operation is that the range of ARI value is [0, 1], and the calculated variance value is too small to distinguish. Besides, there is a large difference in the value of the log likelihood function calculated from 30 different data, so it is meaningless to calculate its variance. Here we only calculate the variant of ARI variance (var*).

3.2 Results of simulation study

The comparison results for the p = 2, p = 5, and p = 10 Gaussian mixture model datasets under the three stochastic initialization methods are listed in Tables 1–3, respectively. I¯ and L¯ represent the averages of ARI and logL(Θ) over the 30 different datasets, and Iq and Lq represent the corresponding third quartiles. The ARI values are given to 4 decimal places and the other results are rounded to 2 decimal places. The optimal value of each index is highlighted in bold in the tables.
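For concreteness, the table entries for one method and one triplet can be reproduced from its 30 replicate results as sketched below; the two input arrays are placeholders generated at random, not data from the paper.

```python
# Summary indices used in Tables 1-3, computed from 30 replicate results.
import numpy as np

rng = np.random.default_rng(7)
ari_runs = rng.uniform(0.75, 1.0, size=30)        # placeholder ARI values
loglik_runs = rng.normal(12000, 500, size=30)     # placeholder logL(Theta) values

I_bar, L_bar = ari_runs.mean(), loglik_runs.mean()                       # averages
I_q, L_q = np.quantile(ari_runs, 0.75), np.quantile(loglik_runs, 0.75)   # third quartiles
var_star = np.var(ari_runs) * 1000                                       # Eq (18)

print(f"{I_bar:.4f} {L_bar:.2f} {I_q:.4f} {L_q:.2f} {var_star:.2f}")
```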

Table 1. The comparison results of different ω¯ and ω˘ under three methods when p = 2.

Method  ω¯ 0.0001 0.0001 0.001 0.001 0.01 0.01
        ω˘ 0.01   0.015  0.1   0.15  0.3  0.4
MRIPEM I¯ 0.9730 0.9909 0.9537 0.9614 0.7777 0.8071
L¯ 12889.66 13958.63 10068.74 12329.19 4754.87 3778.89
I q 0.9980 0.9988 0.9792 0.9791 0.8161 0.8256
L q 13616.60 14616.13 10915.82 13326.39 4935.85 3716.05
var * 1.02 0.39 0.76 0.55 0.68 0.46
emEM I¯ 0.8580 0.8355 0.8257 0.8584 0.7736 0.7414
L¯ 12178.29 12882.32 9486.79 11767.31 4738.22 3679.89
I q 0.8469 0.8832 0.8587 0.8838 0.7926 0.7543
L q 13014.69 13976.94 10136.48 12900.92 4934.13 3584.34
var * 1.13 3.14 2.18 1.70 1.70 0.58
RndEM I¯ 0.8933 0.8814 0.8535 0.8507 0.7119 0.7559
L¯ 12326.64 13172.46 9673.00 11758.41 4650.84 3695.08
I q 0.9225 0.9225 0.8802 0.8678 0.7381 0.7995
L q 13173.78 14232.17 10165.81 13048.89 4799.56 3609.16
var * 1.04 2.54 2.40 1.36 0.94 2.07

Other parameter settings: K = 20, n = 4000, t = 5, r = 10.

Table 3. The comparison results of different ω¯ and ω˘ under three methods when p = 10.

Method  ω¯ 0.0001 0.0001 0.001 0.001 0.01 0.01
        ω˘ 0.01   0.015  0.1   0.15  0.3  0.4
MRIPEM I¯ 0.9756 0.9800 0.9405 0.9177 0.6738 0.5907
L¯ 20043.87 23450.43 10902.43 14794.39 -9597.66 -13373.93
I q 0.9974 0.9979 0.9646 0.9256 0.6853 0.6334
L q 21915.60 25603.39 12208.31 15998.47 -8279.14 -12860.28
var * 0.79 0.76 0.74 1.09 0.37 4.25
emEM I¯ 0.9183 0.8946 0.8575 0.8998 0.7449 0.6715
L¯ 19620.80 22901.37 10674.20 14746.22 -8901.69 -12316.50
I q 0.9407 0.8890 0.9148 0.9199 0.7649 0.7265
L q 21444.06 24965.52 12195.23 15907.85 -7413.76 -11826.79
var * 1.51 0.90 14.50 1.94 1.66 72.80
RndEM I¯ 0.8716 0.8856 0.8787 0.8723 0.7180 0.6827
L¯ 19352.43 22812.75 10527.50 14557.12 -8985.49 -12396.52
I q 0.8832 0.8854 0.9076 0.8937 0.7369 0.7241
L q 21335.29 24940.08 12136.25 15641.88 -7581.46 -12073.11
var * 0.99 1.18 1.19 1.78 1.35 2.51

Other parameter settings: K = 20, n = 4000, t = 5, r = 10.

Obviously, the three initialization methods perform differently under varying degrees of overlap and dimensionality. As shown in Table 1, the values of the five indexes indicate that MRIPEM is significantly better than emEM and RndEM for the six combinations of overlap. When ω¯=0.0001 and ω¯=0.001, the I¯ values of emEM and RndEM are both below 0.9 and differ greatly from those of MRIPEM. When ω¯=0.01, although the difference in I¯ among the three algorithms is reduced, MRIPEM is still the best. In particular, when ω¯=0.0001, the Iq value of MRIPEM is close to 1, which shows that more than 75% of the results in the 30 repeated trials are consistent with the true classification. Besides, the var* value of MRIPEM is the smallest among the three algorithms, indicating that the results of MRIPEM are relatively stable. The results in Table 2 illustrate that the situation for p = 5 is similar to that for p = 2. In terms of the averages of ARI and logL(Θ), MRIPEM is also better than the other two methods for all six overlap combinations. For the corresponding third quartiles, when ω¯=0.01 and ω˘=0.3, the result of MRIPEM is slightly smaller than that of emEM; however, the difference in Iq is only 0.26% and the difference in Lq is only 1.27%. In Table 3, when the dimension is increased to 10, the results of MRIPEM are still the best for ω¯ ∈ {0.0001, 0.001}. However, MRIPEM provides the worst results of the three methods when ω¯=0.01. Nevertheless, judging from the var* results, the MRIPEM method still remains relatively stable.

Table 2. The comparison results of different ω¯ and ω˘ under three methods when p = 5.

Method  ω¯ 0.0001 0.0001 0.001 0.001 0.01 0.01
        ω˘ 0.01   0.015  0.1   0.15  0.3  0.4
MRIPEM I¯ 0.9863 0.9812 0.9612 0.9651 0.7839 0.8137
L¯ 19603.89 22951.83 14441.47 16458.43 2536.95 3680.60
I q 0.9983 0.9986 0.9818 0.9818 0.7948 0.8341
L q 21850.09 25080.13 15168.96 17192.86 2727.93 4372.49
var * 0.57 0.76 0.65 0.68 0.33 0.41
emEM I¯ 0.8693 0.8294 0.8824 0.8857 0.7791 0.7966
L¯ 19133.66 22498.65 14088.53 16131.29 2536.84 3490.30
I q 0.8952 0.8899 0.9212 0.9166 0.7969 0.8211
L q 21394.62 24686.68 14757.57 16837.87 2762.54 4330.93
var * 8.17 45.57 1.62 1.72 0.31 1.56
RndEM I¯ 0.8739 0.8746 0.8467 0.8343 0.7361 0.7641
L¯ 18967.73 22189.57 13926.75 15827.97 2429.54 3417.45
I q 0.8907 0.9230 0.8729 0.8379 0.7724 0.7873
L q 21227.24 24542.77 14834.54 16288.88 2747.39 4255.13
var * 2.08 2.36 0.88 0.71 2.33 1.14

Other parameter settings: K = 20, n = 4000, t = 5, r = 10.

In summary, the comparisons of ARI reveal that the MRIPEM method has higher accuracy and can reduce the probability of data being wrongly divided. The comparisons of the log-likelihood values show that the parameters calculated by the MRIPEM method fit the data well. According to the values of var*, the MRIPEM method is relatively stable and has little volatility compared with the other two initialization methods. In addition, the performance of all three stochastic initialization methods declines to different degrees as the degree of overlap and the dimension increase. This further reflects that the degree of data overlap and the dimension are important factors affecting algorithm performance.

In order to better present the detailed results for the 30 datasets, we extract four representative groups from the 18 triplets and draw boxplots of ARI and logL(Θ) in Figs 5–8. Fig 5 uses the settings with the least overlap (ω¯=0.0001) and p = 2, and Fig 6 represents the medium overlap (ω¯=0.001) and p = 5. From Figs 5 and 6, it can be seen that MRIPEM is better than the other two methods in both ARI and logL(Θ). The middle line of each box is the median of the data, representing the typical level of the sample. The ARI values of MRIPEM are far ahead of the other two methods, and its median is close to 1. Fig 7 is drawn with the settings of high overlap (ω¯=0.01) and p = 5, and Fig 8 represents the medium overlap (ω¯=0.001) and p = 10. Figs 7 and 8 indicate that, in terms of the medians of ARI and logL(Θ), the MRIPEM method continues to excel. Judging from the closeness of the upper and lower limits of the boxes, the results of MRIPEM also fluctuate less.

Fig 5. Boxplots of the ARI and logL(Θ) for the emEM, RndEM and the proposed MRIPEM algorithm.

A: Comparison of the ARI for the three methods. B: Comparison of the logL(Θ) for the three methods. Each boxplot is constructed from 30 different datasets with {p,ω¯,ω˘}={2,0.0001,0.01}.

Fig 6. Boxplots of the ARI and logL(Θ) for the emEM, RndEM and the proposed MRIPEM algorithm.

A: Comparison of the ARI for the three methods. B: Comparison of the logL(Θ) for the three methods. Each boxplot is constructed from 30 different datasets with {p,ω¯,ω˘}={5,0.001,0.15}.

Fig 7. Boxplots of the ARI and logL(Θ) for the emEM, RndEM and the proposed MRIPEM algorithm.

A: Comparison of the ARI for the three methods. B: Comparison of the logL(Θ) for the three methods. Each boxplot is constructed from 30 different datasets with {p,ω¯,ω˘}={5,0.01,0.4}.

Fig 8. Boxplots of the ARI and logL(Θ) for the emEM, RndEM and the proposed MRIPEM algorithm.

A: Comparison of the ARI for the three methods. B: Comparison of the logL(Θ) for the three methods. Each boxplot is constructed from 30 different datasets with {p,ω¯,ω˘}={10,0.001,0.1}.

On the premise of fixing the total number of samples n and the number of clusters K, the results of the simulation experiments show that MRIPEM is significantly better than the other two classic algorithms when the degrees of overlap (ω¯ ∈ {0.0001, 0.001}) are not high. In these cases the MRIPEM method achieves the highest results on both ARI and logL(Θ), with an Adjusted Rand Index close to 1, i.e., an almost perfect classification. In fact, MRIPEM even produced ARI = 1 several times during the simulations.

For {p,ω¯}={10,0.01}, the MRIPEM method is inferior to the other two methods. The high degree of overlap and the high dimensionality may increase the probability of selecting partition vectors from the same component when using the maximum Mahalanobis distance. Therefore, our initialization algorithm appears to be best suited to well-separated, low-dimensional samples. In fact, the performance of all three algorithms decreases to varying degrees as the overlap and dimension increase; more overlapping data are obviously more difficult to distinguish.

3.3 Real datasets study

In this section, we use four datasets with known class labels from the UCI machine learning database [28] and the KEEL dataset repository [29] to demonstrate the validity of the proposed method, namely Seeds, Aff, Appendicitis, and SKM. These datasets vary in feature-space dimension, sample size, number of classes, and degree of overlap.

  1. Seeds dataset

    The Seeds dataset is a commonly used classification benchmark. It contains 210 instances belonging to three different varieties of wheat: Kama, Rosa, and Canadian, with 70 instances per variety. Each instance contains 7 attributes: area A, perimeter P, compactness C, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. Based on these seven attributes, the variety to which each seed belongs can be predicted.

  2. Aff dataset

    The full name of the Aff dataset is Algerian forest fires. It includes 244 instances covering two regions of Algeria, namely the Bejaia region in the northeast and the Sidi Bel-abbes region in the northwest, with 122 instances per region. Based on 7 attributes, the 244 instances are divided into “fire” (138 instances) and “not fire” (106 instances).

  3. Appendicitis dataset

    The data represent 7 medical measures taken on 106 patients, with a class label indicating whether the patient has appendicitis (class label 1) or not (class label 0).

  4. SKM dataset

    This is a real dataset on students' knowledge status of the subject of Electrical DC Machines. Based on five attributes, including the degree of study time for the goal object materials (STG), the knowledge level of the students can be classified.

The dimensions of the above datasets are known, but their degrees of overlap are not. To gauge the clustering complexity of an existing classification dataset, the average pairwise overlap and the maximum pairwise overlap of the four real datasets were calculated using the overlap function of the MixSim package in R [27]. The information on the four real datasets is listed in Table 4:

Table 4. Information on real datasets.

Datasets n p k ω¯ ω˘
Seeds 210 7 3 0.0080 0.0172
Aff 244 7 2 0.0075 0.0075
Appendicitis 106 7 2 0.1977 0.1977
SKM 403 5 4 0.0202 0.0607

Notation: n, p, k, ω¯, and ω˘ denote the number of samples, the number of features, the number of clusters, the average pairwise overlap, and the maximum pairwise overlap, respectively.

3.4 Results of real datasets study

As in the simulation experiments, considering that the three initialization methods are all stochastic strategies, this experiment was repeated 30 times and each time was started with a different seed of the random number generator. The results of three initialization methods under four real datasets are displayed in Table 5.

Table 5. Comparative results of ARI and logL(Θ) of three stochastic initialization methods on four real datasets.

Method Seeds Aff Appendicitis SKM
ARI logL(Θ) ARI logL(Θ) ARI logL(Θ) ARI logL(Θ)
MRIPEM 0.7882 1250.71 0.7104 4364.10 0.3787 1162.58 0.3347 234.15
emEM 0.6299 1276.66 0.6157 4257.49 0.2227 1157.14 0.2051 231.69
RndEM 0.5869 1281.56 0.6554 4267.09 0.2227 1157.14 0.2676 233.02

Fig 9 intuitively shows the differences in the ARI and logL(Θ) values of the three methods on the four real datasets. Because the logL(Θ) values differ greatly across datasets, the logL(Θ) results are converted into proportions (for each real dataset, the logL(Θ) of each method is divided by the sum of logL(Θ) over the three methods).

Fig 9. Line charts of the ARI and logL(Θ) values of the three methods on the four real datasets.

A: Comparison of the ARI for the three methods. B: Comparison of the logL(Θ) for the three methods.

It can be seen from Table 5 and Fig 9 that the ARI value of MRIPEM is the largest among the three methods on all four datasets. In terms of logL(Θ), MRIPEM is the largest except on the Seeds dataset. These results demonstrate that the classification results of the MRIPEM algorithm are closer to the true partitions of the real datasets. In more detail, in terms of ARI the MRIPEM method improves on the second-ranked method by 25.13% and 8.39% on the Seeds and Aff datasets, respectively, while in terms of logL(Θ) it is only 2.41% worse than the first-ranked RndEM method on the Seeds dataset. Overall, the logL(Θ) values differ little among the three algorithms. The lower ARI values on the last two datasets may be caused by their relatively high degree of overlap. Even so, MRIPEM remains ahead of emEM and RndEM on both ARI and logL(Θ).

4 Discussion and conclusion

In this paper, a novel iterative initialization method of the EM algorithm for estimating the parameters of GMM was proposed. This new method, named MRIPEM, incorporates the ideas of multiple restarts, iterations, and clustering. The performance of MRIPEM was assessed by the ARI and logL(Θ) indexes and compared with two other popular initialization strategies on simulated and real datasets.

In fact, the initial mean and covariance matrix of the MRIPEM iteration are calculated by taking all the sample data as one component, which differs from other algorithms that randomly generate the initial mean and covariance. According to the Mahalanobis distance, the candidate vector farthest from all the centers determined so far is taken as the next partition vector; the greater the distance, the greater the probability of a candidate being selected, so the probability of properly initializing all component centers increases. After clustering the sample data according to the partition vectors, the mean and covariance of each component are updated in each iteration. The MRIPEM algorithm therefore depends entirely on the sample dataset itself, which reduces the randomness to a certain extent. In addition, the manually adjustable loop parameter r further reduces the randomness and improves the accuracy of the results. The comparison results on simulated and real datasets under the three algorithms confirm that the MRIPEM algorithm shows excellent accuracy and robustness in low dimensions and at low overlaps. Therefore, MRIPEM is recommended for well-separated, low-dimensional datasets, or for datasets after dimensionality reduction.

However, it is well known that no single method can outperform the others in all cases, and our algorithm is no exception. Similar to other initialization methods, the performance of the MRIPEM method also decreases when the degree of overlap and the dimensionality are both high. High dimensionality and high overlap may increase the probability that the partition vectors determined by the maximum Mahalanobis distance are misplaced. Nevertheless, on the whole, the MRIPEM method can still be regarded as a promising algorithm for low-dimension, low-overlap situations.

4.1 Statements

The real datasets that support the findings of this study are openly available in UCI machine learning database at https://archive.ics.uci.edu/ml and KEEL-dataset repository at https://sci2s.ugr.es/keel/datasets.php (Ref [28, 29]).

Data Availability

All relevant data are within the manuscript.

Funding Statement

This work was financially supported by Chinese Universities Scientific Fund (Grant Nos. 2452022369 and 2452019546). The funders partially supported the data collection and analysis in this study.

References

  • 1. Delgosha P, Hassani H, Pedarsani R. Robust Classification Under ℓ0 Attack for the Gaussian Mixture Model. SIAM J Math Data Sci. 2022;4(1):362–385. doi: 10.1137/21M1426286 [DOI] [Google Scholar]
  • 2. Jiang Jie, Gao M. Agricultural super green image segmentation method based on Gaussian mixture model combined with Camshift. Arabian J Geosci. 2021;14(11):1–12. doi: 10.1007/s12517-021-07144-w [DOI] [Google Scholar]
  • 3. Xu N. Application of remote sensing image classification based on adaptive Gaussian mixture model in analysis of mountain environment features. Arabian J Geosci. 2021;14(15):1–14. doi: 10.1007/s12517-021-07899-2 [DOI] [Google Scholar]
  • 4. Permuter H, Francos J, Jermyn I. A study of Gaussian mixture models of color and texture features for image classification and segmentation. Pattern Recognit. 2006;39(4):695–706. doi: 10.1016/j.patcog.2005.10.028 [DOI] [Google Scholar]
  • 5. Ghai W, Kumar S, Athavale VA. Using Gaussian mixtures on triphone acoustic modelling-based Punjabi continuous speech recognition. In: Advances in Computational Intelligence and Communication Technology. Springer; 2021. p. 395–406. [Google Scholar]
  • 6. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B. 1977;39(1):1–22. 10.1111/j.2517-6161.1977.tb01600.x [DOI] [Google Scholar]
  • 7. Wu CJ. On the convergence properties of the EM algorithm. Ann Stat. 1983;11(1):95–103. 10.1214/aos/1176346060 [DOI] [Google Scholar]
  • 8. Wu C, Yang C, Zhao H, Zhu J. On the Convergence of the EM Algorithm: From the Statistical Perspective. 2016. 10.48550/arXiv.1611.00519 [DOI] [Google Scholar]
  • 9.Meilă M, Heckerman D. An experimental comparison of model-based clustering methods. Mach Learn. 2001;42(1):9–29. 10.1023/A:1007648401407 [DOI] [Google Scholar]
  • 10. Maitra R. Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinf. 2009;6(1):144–157. doi: 10.1109/TCBB.2007.70244 [DOI] [PubMed] [Google Scholar]
  • 11. McLachlan GJ, Lee SX, Rathnayake SI. Finite mixture models. Annual Review of Statistics and Its Application. 2019;6. 10.1146/annurev-statistics-031017-100325 [DOI] [Google Scholar]
  • 12. McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. 2nd ed. Hoboken, NJ: Wiley; 2007.
  • 13. Biernacki C, Celeux G, Govaert G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal. 2003;41(3-4):561–575. doi: 10.1016/S0167-9473(02)00163-9 [DOI] [Google Scholar]
  • 14. Blömer J, Bujna K. Adaptive seeding for Gaussian mixture models. In: Pacific-asia conference on knowledge discovery and data mining. Springer; 2016.p. 296–308. [Google Scholar]
  • 15. Kwedlo W. A new random approach for initialization of the multiple restart EM algorithm for Gaussian model-based clustering. Pattern Anal Appl. 2015;18(4):757–770. doi: 10.1007/s10044-014-0441-3 [DOI] [Google Scholar]
  • 16. Xie J, Gao H, Xie W. K-nearest neighbors optimized clustering algorithm by fast search and finding the density peaks of a dataset. Sci Sin Inform. 2016;46(2):258–280. doi: 10.1360/N112015-00135 [DOI] [Google Scholar]
  • 17. Melnykov V, Melnykov I. Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal. 2012;56(6):1381–1395. doi: 10.1016/j.csda.2011.11.002 [DOI] [Google Scholar]
  • 18. Verbeek JJ, Vlassis N, Kröse B. Efficient greedy learning of Gaussian mixture models. Neural Comput. 2003;15(2):469–485. doi: 10.1162/089976603762553004 [DOI] [PubMed] [Google Scholar]
  • 19. Vlassis N, Likas A. A greedy EM algorithm for Gaussian mixture learning. Neural Process Lett. 2002;15(1):77–87. doi: 10.1023/A:1013844811137 [DOI] [Google Scholar]
  • 20. Štepánová K, Vavrečka M. Estimating number of components in Gaussian mixture model using combination of greedy and merging algorithm. Pattern Anal Appl. 2018;21(1):181–192. doi: 10.1007/s10044-016-0576-5 [DOI] [Google Scholar]
  • 21. Panić B, Klemenc J, Nagode M. Improved initialization of the EM algorithm for mixture model parameter estimation. Math. 2020;8(3):373. doi: 10.3390/math8030373 [DOI] [Google Scholar]
  • 22. Viroli C, McLachlan GJ. Deep Gaussian mixture models. Stat Comput. 2019;29(1):43–51. doi: 10.1007/s11222-017-9793-z [DOI] [Google Scholar]
  • 23. Redner RA, Walker HF. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 1984;26(2):195–239. doi: 10.1137/1026034 [DOI] [Google Scholar]
  • 24. Maesschalck RD, Jouan-Rimbaud D, Massart DL. The Mahalanobis distance. Chemom Intell Lab Syst. 2000;50(1):1–18. doi: 10.1016/S0169-7439(99)00047-7 [DOI] [Google Scholar]
  • 25. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. doi: 10.1007/BF01908075 [DOI] [Google Scholar]
  • 26.Mirylenka K, Dallachiesa M, Palpanas T. Correlation-Aware Distance Measures for Data Series. In: EDBT; 2017. p. 502–505.
  • 27. Melnykov V, Chen WC, Maitra R. MixSim: An R package for simulating data to study performance of clustering algorithms. J Stat Softw. 2012;51(12):1. doi: 10.18637/jss.v051.i12 [DOI] [Google Scholar]
  • 28.Asuncion A, Newman D. UCI machine learning repository; 2007. https://archive.ics.uci.edu/ml.
  • 29.Alcalá-fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, et al.. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework.; 2011. https://sci2s.ugr.es/keel/datasets.php.
