Journal of Healthcare Engineering. 2020 Jan 20;2020:1051394. doi: 10.1155/2020/1051394

Modified Immune Evolutionary Algorithm for Medical Data Clustering and Feature Extraction under Cloud Computing Environment

Jing Yu 1, Hang Li 2, Desheng Liu 3
PMCID: PMC7201819  PMID: 32399163

Abstract

Medical data are characterized by particularity and complexity, and big data clustering plays a significant role in medicine. Traditional clustering algorithms easily fall into local extrema, which produces clustering deviation and poor clustering results. Therefore, in this paper we propose a new medical big data clustering algorithm based on a modified immune evolutionary method under a cloud computing environment to overcome these disadvantages. Firstly, we analyze the big data structure model under the cloud computing environment. Secondly, we describe the modified immune evolutionary method for clustering medical data in detail, including encoding, constructing the fitness function, and selecting genetic operators. Finally, experiments show that the new approach improves the accuracy of data classification, reduces the error rate, and improves the performance of data mining and feature extraction for medical data clustering.

1. Introduction

With the support of existing technologies, relevant medical research organizations rely only on coupled dictionary technology to classify and store medical images [1]. However, as the number of slices continues to increase, some images begin to show serious frame-rate overlap, which not only causes a sharp decline in the original image gray level but also causes a series of image data redundancy problems, greatly complicating the mining and scheduling of subsequent image information. Image data redundancy refers to uneven or excessive storage caused by data repetition during imaging; it can lead to loss of real information in the image and degrade image sharpness. Frame-rate overlap is a common image fault that is often associated with image data redundancy. Under certain circumstances [2], a moderate degree of frame-rate overlap may slightly increase image sharpness, but excessive frame-rate overlap seriously damages the modal property of the medical image and greatly enlarges the redundant region in the medical image data. Diagnosis in medicine determines the patient's medication and treatment, and many diseases are complex. Integrating data clustering analysis into the diagnosis of diseases such as clinical urology and breast cancer allows doctors to greatly enhance diagnostic accuracy.

With the fast growth of information science, research on biological applications has been applied in computational science to design intelligent bionic optimization algorithms and to improve the ability to process and analyze big data [3]. Intelligent bionic algorithms mainly include the ant colony algorithm [4], the particle swarm optimization (PSO) algorithm [5], and the quantum swarm algorithm [6–8]. Swarm intelligence optimization algorithms have good application value in artificial intelligence design, data clustering analysis, computer control, and other fields.

Clustering technology is an important part of data mining and machine learning. Domestic researchers mainly focus on two aspects: (1) clustering algorithms that dynamically determine the number of clustering centers and (2) clustering algorithms that improve clustering accuracy. Zhao et al. [9] presented a new dynamic clustering method based on a genetic algorithm; its main idea was that, to effectively overcome the clustering algorithm's sensitivity to the initial state, it used a maximum-attribute-value-range partitioning strategy and a two-stage dynamic selection method in mutation, which obtained the optimal clustering center.

Clustering analysis is a kind of unsupervised model in pattern recognition. The task of clustering is to divide an unlabeled pattern set into several subsets according to certain criteria, so that similar samples share the most similar cluster center and dissimilar samples are placed in different classes. Therefore, it is also called unsupervised classification. Clustering analysis has been used extensively in data mining, image processing, object detection, radar target detection, etc. [10, 11]. Zhang et al. [12] proposed a geometric-constrained multiview image matching method based on semiglobal optimization. It is obvious that some features carry more information than others in a dataset, so some features should receive lower importance during clustering or classification because of their lower information content, higher variances, etc. Hence, it has always been a goal of the artificial intelligence community to enforce a weighting mechanism in any task that treats a number of features identically when making a decision. Parvin and Minaei-Bidgoli [13] proposed a weighted locally adaptive clustering algorithm based on the locally adaptive clustering algorithm.

Nowadays, different clustering methods are used to solve various machine learning problems. According to the clustering criterion, clustering algorithms can be divided into algorithms based on fuzzy relations, including hierarchical clustering and graph clustering, and algorithms based on an objective function [14–16]. Objective-function-based clustering algorithms generally use the gradient method to solve the extremum problem. The search direction of the gradient method always follows the direction of energy reduction, so the algorithm easily falls into a local minimum. Sensitivity to initialization is another serious defect of objective-function-based clustering methods. To overcome these shortcomings, the proposed algorithms are all used to optimize the objective function. Meng et al. [17] adopted the MapReduce programming model to combine the Canopy and K-means clustering algorithms within a cloud computing environment, so as to fully utilize the computing and storage capacity of a Hadoop cluster; large numbers of buyers on Taobao were taken as the application context for a case study using Mahout, the Hadoop platform's data mining library. Zhang et al. [18] proposed a high-order possibilistic c-means algorithm (HOPCM) for big data clustering by optimizing the objective function in the tensor space. Li et al. [19] proposed a task scheduling algorithm based on fuzzy clustering. However, problems such as long convergence time remain.

Moreover, deep learning-based methods are used for feature selection. Minaei-Bidgoli et al. [20] proposed an ensemble-based approach for feature selection; the results showed that, although the efficacy of the method was not considerably decreased in most cases, the method became free of any parameter setting. Some algorithms cannot properly represent the data distribution when datasets are imbalanced, and in some cases the cost of misclassifying a sample of a special class can be very high, such as wrongly classifying cancerous individuals or patients as healthy. Hu and Du [21] presented a fast and efficient way to learn from imbalanced data; the method is especially suitable when the minority class contains very little data. Gao et al. [22] explored brain images for early detection of Parkinson's disease; all brain images were analyzed to extract 2D Gabor features, and models built on Gabor features were shown to outperform those built without them. Zhao et al. [23] analyzed the triple-negative breast neoplasm gene regulatory network using gene expression data; they collected triple-negative breast neoplasm gene expression data from the Cancer Genome Atlas to construct a gene regulatory network using least absolute shrinkage and selection operator (LASSO) regression and, in addition, constructed a triple-positive breast neoplasm network for comparison. Nejatian [24] employed the additional information available at different times and conditions together with gold-standard protein complexes to determine fitting thresholds; the problem was thereby converted into an optimization problem, which was then solved with the firefly metaheuristic optimization algorithm.

Hence, we propose a new medical big data clustering algorithm based on a modified immune evolutionary method under a cloud computing environment to overcome the above disadvantages. The remainder of this paper is organized as follows: Section 2 presents the big data structure analysis in the cloud computing environment. The immune evolutionary algorithm is stated in Section 3. Section 4 describes the improved clustering method in detail, Section 5 provides the MapReduce framework, and Section 6 presents the experimental results. Finally, the conclusion is given in Section 7.

2. Analysis on Storage Mechanism and Structure of Medical Big Data in Cloud Computing Environment

Cloud computing [25–28] provides dynamically extensible large storage space and a structural model through the Internet. In order to evaluate data clustering and mining in the cloud computing environment, a big data storage system architecture must be built for that environment. The big data storage structure adopts a virtualized storage pool and depends on a computer cluster; from top to bottom it consists of the I/O (input/output) virtual computer layer, the USB interface sequence layer, and the disk layer. An enterprise data center accesses the application service through various terminals, distributing the computation over a large number of distributed computers. When all virtual machines of the cloud are assigned to physical machines, the following formula is used to calculate the global optimal solution of the clustering process, and the big data feature clustering center BFMi of the cloud can be assigned to physical machine PMi according to the optimal solution:

N = \frac{1}{n}\sum_{j=1}^{n}\frac{U_{tj}^{\mathrm{CPU}}}{U_{tavg}^{\mathrm{CPU}}} + \frac{1}{n}\sum_{j=1}^{n}\frac{U_{tj}^{\mathrm{Mem}}}{U_{tavg}^{\mathrm{Mem}}} + \frac{1}{n}\sum_{j=1}^{n}\frac{U_{tj}^{\mathrm{bw}}}{U_{tavg}^{\mathrm{bw}}}. (1)
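As an illustration, the following minimal Python sketch computes a load metric of the form (1) from per-VM CPU, memory, and bandwidth utilizations. The function and argument names are ours, and the reading of (1) as per-VM utilizations normalized by their average utilization is an assumption, not the authors' code.

```python
def load_metric(cpu, mem, bw, cpu_avg, mem_avg, bw_avg):
    """Compute N of (1) for one physical machine.

    cpu, mem, bw: per-VM utilization lists of length n (assumed reading);
    *_avg: the corresponding average utilizations used for normalization."""
    n = len(cpu)
    return (sum(u / cpu_avg for u in cpu) / n
            + sum(u / mem_avg for u in mem) / n
            + sum(u / bw_avg for u in bw) / n)

# A plausible use: assign a clustering-center block BFMi to the physical
# machine PMi whose load metric N is currently smallest.
```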

The sample is collected and analyzed to determine whether it is a typical sample. Assume that the data information stream sample S = {X̄_1, X̄_2, …, X̄_k} is taken at times (T_1, T_2, …, T_k). We divide the big data set X in the cloud environment into c clusters, 1 < c < n. The data segmentation can be transformed into a space segmentation. The central vector of the big data storage structure is obtained as

V = \{ v_{ij} \mid i = 1, 2, \ldots, c;\; j = 1, 2, \ldots, s \}, (2)

where v_i is the feature vector of the i-th object cluster.

Fuzzy division matrix can be presented as

U = \{ \mu_{ik} \mid i = 1, 2, \ldots, c;\; k = 1, 2, \ldots, n \}. (3)

Redundant data reduction is performed for a single data source. In the multichannel QoS-demand virtual machine clustering process, the parameters are defined as the virtual machine set VMS = {VM_1, VM_2, …, VM_m} and the physical machine set PMS = {PM_1, PM_2, …, PM_n}. The inspiring factor is α, and the expectation of the inspiring factor is β. The maximum mining number is I_max. The uploaded data are provided as fixed-size data blocks, which is beneficial for analyzing cloud clustering. The analysis of the big data storage mechanism in the cloud computing environment thus provides accurate data for big data clustering.

Suppose the information stream time series is {x(t_0 + iΔt)}, i = 0, 1, …, N − 1, and let X and Y be attribute sets. The vector expression of the big data clustering space in the cloud computing environment is

R = \left[ r(t_0), r(t_0 + \Delta t), \ldots, r(t_0 + (K-1)\Delta t) \right], (4)

where r(t) is the information stream time series of big data clustering in the cloud computing environment and Δt is the data sampling interval. The spectral characteristic X_p(u) of discrete samples of big data can be calculated as

X_p(u) = s_c(t)\, e^{j2\pi f_0 t} = \frac{1}{T}\,\mathrm{rect}\!\left(\frac{t}{T}\right) e^{j\left(2\pi f_0 t + K t^2/2\right)}, (5)

where s_c(t) is the characteristic scalar time series of big data, e^{j2πf_0 t} is the discrete sample center of big data clustering, and (F, Q) denotes the high-order Bessel function statistics of the sample data set {X_1, X_2, …, X_N}. Thus, we obtain the confidence and the confidence interval:

z_{i,d}^{k+1} = x_{r1}^{k} + F\cdot\left(x_{r2}^{k} - x_{r3}^{k}\right), \qquad u_{i,d}^{k+1} = \begin{cases} x_{i,d}^{t+1}, & f_{fit}(t) < f_{fit}, \\ z_{i,d}^{k+1}, & f_{fit}(t) \ge f_{fit}. \end{cases} (6)

Suppose the information flow time series in the cloud computing environment is {x(t_0 + iΔt)}, i = 0, 1, …, N − 1, and let X and Y be attribute sets. The expression of the clustering space state vector of big data in the cloud computing environment is as follows:

X = \left[x(t_0), x(t_0+\Delta t), \ldots, x(t_0+(K-1)\Delta t)\right] =
\begin{bmatrix}
x(t_0) & x(t_0+\Delta t) & \cdots & x(t_0+(K-1)\Delta t) \\
x(t_0+J\Delta t) & x(t_0+(J+1)\Delta t) & \cdots & x(t_0+(K-1+J)\Delta t) \\
\vdots & \vdots & & \vdots \\
x(t_0+(m-1)J\Delta t) & x(t_0+(1+(m-1)J)\Delta t) & \cdots & x(t_0+(N-1)\Delta t)
\end{bmatrix}, (7)

where x(t) is the information flow time series of the big data clustering system in the cloud computing environment, J is the time window function of the phase space reconstructed from the big data, m is the target clustering regulator, and Δt is the data sampling interval.
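For concreteness, the following small sketch (ours, not part of the paper) builds the delay-embedded matrix of (7) from a sampled series, with K, J, and m named as in the text:

```python
import numpy as np

def embed(x, K, J, m):
    """Stack m delayed windows of length K from the series x, as in (7).
    Row i starts at offset i*J samples, so the series must contain at
    least (m-1)*J + K samples."""
    return np.array([x[i * J : i * J + K] for i in range(m)])

# Example with a toy series (illustrative values only)
x = np.sin(0.1 * np.arange(1000))
X = embed(x, K=100, J=50, m=10)   # shape (10, 100)
```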

The discrete sample spectral characteristic Xp(u) of big data is calculated, and the main feature component is

X_p(u) = s_c(t)\, e^{j2\pi f_0 t} = \frac{1}{T}\,\mathrm{rect}\!\left(\frac{t}{T}\right) e^{j\left(2\pi f_0 t + K t^2/2\right)}, (8)

where s_c(t) is the characteristic scalar time series of big data and e^{j2πf_0 t} is the center of the discrete samples of big data clustering.

For the data set {X_1, X_2, …, X_n}, (F, Q) denotes the high-order Bessel function statistics of the sample data, which are used to determine the confidence of node data packets and to establish the confidence interval. The obtained confidence and confidence interval are

z_{i,d}^{k+1} = x_{r1}^{k} + F\cdot\left(x_{r2}^{k} - x_{r3}^{k}\right), \qquad u_{i,d}^{k+1} = \begin{cases} x_{i,d}^{t+1}, & f_{fitness}(t) < f_{fitness}, \\ z_{i,d}^{k+1}, & f_{fitness}(t) \ge f_{fitness}. \end{cases} (9)
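Equations (6) and (9) read like a differential-evolution-style mutation followed by a greedy, fitness-based selection. Under that (assumed) reading, a minimal sketch is the following; the index choice and the minimization convention are ours:

```python
import random

def de_style_update(pop, fitness, F=0.5):
    """One pass of the mutation/greedy-selection update suggested by (9).
    pop: list of real-valued vectors (at least 4); fitness: callable to
    minimize. r1, r2, r3 are distinct random indices different from i."""
    new_pop = []
    for i, xi in enumerate(pop):
        r1, r2, r3 = random.sample([j for j in range(len(pop)) if j != i], 3)
        z = [pop[r1][d] + F * (pop[r2][d] - pop[r3][d]) for d in range(len(xi))]
        # keep the trial vector only if it does not worsen the fitness
        new_pop.append(z if fitness(z) <= fitness(xi) else xi)
    return new_pop
```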

3. Immune Evolutionary Algorithm (IEA)

IEA consists of crossover and mutation operators, which represent the two strategies of group search and information exchange and provide an optimization opportunity for each individual. However, this inevitably produces a degradation phenomenon in some cases, and the degradation can be quite obvious.

IEA uses features or knowledge of the original problem to suppress the degradation that appears during optimization. The key operation of IEA is constructing the immune operator, which is accomplished through vaccination and immune selection. The immune evolutionary algorithm can improve the fitness of individuals and prevent group degradation, thereby reducing the oscillation that appears in the late stage of the evolutionary algorithm and improving the convergence speed. The main steps of the immune evolutionary algorithm are as follows (a code sketch is given after the list), and detailed information can be found in [29, 30].

  1. Randomly generate the initial parent group A1.

  2. Extract the vaccine according to prior knowledge.

  3. If the current group contains the best individual, stop and output the result; otherwise, continue.

  4. Perform crossover on the current (k-th) group Ak to obtain population Bk.

  5. Perform mutation on Bk to obtain population Ck.

  6. Perform vaccination on Ck to obtain group Dk.

  7. Perform immune selection on Dk to obtain the new parent group Ak+1, and then return to step 3.
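A compact skeleton of these steps is sketched below. It is ours, not the authors' implementation: the operator functions are placeholders, the vaccine is extracted adaptively from the current best individual (as in Section 4.4), and the optimality test of step 3 is replaced by a fixed generation budget.

```python
def immune_evolutionary_algorithm(init_pop, fitness, crossover, mutate,
                                  vaccinate, extract_vaccine, max_gen=200):
    """Skeleton of IEA steps 1-7 with placeholder operator callables."""
    pop = list(init_pop)                                  # step 1: initial parents
    for _ in range(max_gen):                              # step 3 (simplified stop rule)
        vaccine = extract_vaccine(max(pop, key=fitness))  # step 2, adaptive form
        B = crossover(pop)                                # step 4: crossover
        C = [mutate(b) for b in B]                        # step 5: mutation
        D = [vaccinate(c, vaccine) for c in C]            # step 6: vaccination
        # step 7: immune selection - keep a vaccinated individual only if its
        # fitness did not decrease, then use the result as the new parents
        pop = [d if fitness(d) >= fitness(c) else c for c, d in zip(C, D)]
    return max(pop, key=fitness)
```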

4. Modified Immune Evolutionary Algorithm for Data Clustering

Fuzzy clustering is one of the most commonly used approaches for data analysis. The fuzzy c-means (FCM) algorithm is the best-known and most widely used method for fuzzy clustering and provides an optimal way to construct fuzzy information granules [31]. Cluster prototypes and the membership values of data across all clusters are obtained by optimizing the FCM clustering model. Basically, FCM is a steepest-descent algorithm with a variable step length adjusted according to the majorization principle, which accounts for the simplicity and efficiency of the algorithm. Therefore, we combine the immune evolutionary algorithm and FCM to optimize the clustering result [32, 33]. The improved data clustering process is as follows.

The objective function of FCM is

J(X; U, V) = \sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}^{m} D_{ik}^{2}, (10)
D_{ik}^{2} = (x_k - v_i)^{T}(x_k - v_i), (11)

where D_{ik} is the distance from the k-th data point to the i-th cluster center, V = (v_1, v_2, …, v_c) denotes the cluster centers of the classes, v_i ∈ R^s, and m ∈ (1, ∞) is the fuzzy index:

X = \{x_1, x_2, \ldots, x_n\} \subset R^{s}, \qquad U = \left\{ U \in R^{c\times n} \mid u_{ik} \in [0,1];\; \sum_{i=1}^{c} u_{ik} = 1;\; 0 < \sum_{k=1}^{n} u_{ik} < n \right\}. (12)
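For reference, a small NumPy sketch (ours) of the objective (10)-(11) and of the standard FCM membership update is given below; the membership rule used here is the same update that the algorithm later applies as (14), and the small eps guard against zero distances is our own simplification of the case handled exactly by (15).

```python
import numpy as np

def fcm_objective(X, U, V, m=2.0):
    """J(X;U,V) of (10)-(11): X is (n, s), U is (c, n), V is (c, s)."""
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # squared distances, (c, n)
    return float((U ** m * D2).sum())

def fcm_memberships(X, V, m=2.0, eps=1e-12):
    """Standard FCM membership update (the paper's (14)), guarded by eps."""
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + eps
    ratio = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1.0))  # (c, c, n)
    return 1.0 / ratio.sum(axis=1)                                  # (c, n), columns sum to 1
```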

4.1. Encoding

According to J(X; U, V), the aim of clustering is to obtain the fuzzy division matrix U and the cluster prototype V of sample set X. U and V are associated with each other, so there are two encoding methods. The first encodes U. Suppose n samples need to be divided into c clusters. A gene string a={α1, α2,…, αn} denotes one clustering result, with αi ∈ {1,2,…, c}; when αi=k (1 ≤ k ≤ c), xi belongs to the k-th cluster. Its search space is c^n, so if the number of data samples is large, the search space of this encoding is also very large. Therefore, we adopt the second encoding method, which encodes V. The quantized values are concatenated into a string according to their respective values: a={α1, α2,…, αl}, l=c × p, where the first p quantized values denote the p-dimensional center of the first cluster. The string length does not change with the number of data samples n.
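A minimal sketch (ours) of this prototype encoding, in which an individual is the concatenation of the c cluster centers of dimension p:

```python
import numpy as np

def encode(centers):
    """Flatten c cluster centers of dimension p into one gene string of length c*p."""
    return np.asarray(centers, dtype=float).ravel()

def decode(individual, c, p):
    """Reshape a gene string of length c*p back into the c cluster centers."""
    return np.asarray(individual, dtype=float).reshape(c, p)

# Example: c = 3 clusters in a p = 2 dimensional space
ind = encode(np.array([[0.1, 0.2], [0.5, 0.4], [0.9, 0.8]]))
V = decode(ind, c=3, p=2)
```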

4.2. Constructing Fitness Function

According to J(X; U, V), the better the clustering effect, the smaller the objective function value. Formula (10) is used to construct the fitness function f:

f = \frac{1}{J(X; U, V) + 1}. (13)
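Continuing the sketches above, the fitness (13) of an encoded individual can be written as follows; the helper names decode, fcm_memberships, and fcm_objective come from our earlier sketches, not from the paper.

```python
def fitness(individual, X, c, p, m=2.0):
    """f = 1 / (J(X;U,V) + 1), as in (13), for an individual encoding c centers."""
    V = decode(individual, c, p)       # recover the cluster prototypes
    U = fcm_memberships(X, V, m)       # memberships from the FCM update
    return 1.0 / (fcm_objective(X, U, V, m) + 1.0)
```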

4.3. Genetic Operator Selection

The crossover operator can be one-point, two-point, or multipoint crossover. The immune operator inverts the genes of selected individuals with a certain probability. We can also adopt an inversion mutation operator: a gene segment is randomly chosen in a parent individual and reversed, which largely prevents premature convergence. For genetic selection, the roulette wheel selection method and ranking selection are adopted. The crossover probability is pc ∈ [0.75, 0.95] and the mutation probability is pm ∈ [10−3, 10−2].
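Minimal sketches (ours) of the operators mentioned here follow; the segment-reversal form of the inversion mutation is one common interpretation of the text and is an assumption.

```python
import random

def one_point_crossover(a, b):
    """Single-point crossover of two equal-length gene strings (lists)."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def inversion_mutation(ind, pm=0.005):
    """With probability pm, reverse a randomly chosen gene segment."""
    ind = list(ind)
    if random.random() < pm:
        i, j = sorted(random.sample(range(len(ind)), 2))
        ind[i:j + 1] = reversed(ind[i:j + 1])
    return ind

def roulette_select(pop, fits):
    """Roulette-wheel selection with probability proportional to fitness."""
    total = sum(fits)
    r, acc = random.uniform(0, total), 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]
```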

4.4. Immune Vaccine Selection

Immune vaccine selection can be carried out in two ways. The first collects prior information and then constructs the immune vaccine from it. The other is an adaptive method: during group evolution, useful information is extracted from the genes of the best individual and then used as the vaccine. The former is restricted for two reasons: first, it is difficult to form a mature approach from prior knowledge, so an effective immune vaccine cannot be obtained; second, extracting the vaccine in this way costs too much work. Therefore, in the clustering algorithm based on immune evolution, we adopt the adaptive method to extract the vaccine.
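A minimal sketch (ours) of the adaptive vaccine: a few gene positions and values are taken from the current best individual and later injected into offspring with probability pv. The number of recorded genes is our own illustrative choice.

```python
import random

def extract_vaccine(best_individual, n_genes=2):
    """Adaptive vaccine: remember a few gene positions/values of the best individual."""
    positions = random.sample(range(len(best_individual)), n_genes)
    return {pos: best_individual[pos] for pos in positions}

def vaccinate(individual, vaccine, pv=0.5):
    """With probability pv, overwrite the vaccinated positions with the stored genes."""
    ind = list(individual)
    if random.random() < pv:
        for pos, gene in vaccine.items():
            ind[pos] = gene
    return ind
```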

Therefore, we obtain the new clustering algorithm as follows (Figure 1).

  • Step 1. Fix the number of cluster classes c, 1 ≤ c ≤ n − 1. Set the fuzzy index m ∈ (1, +∞), stop condition τ, population size pn, crossover probability pc, mutation probability pm, vaccination probability pv, and vaccine update probability pu.

  • Step 2. Randomly generate group P(k) with pn individuals.

  • Step 3. Compute fitness of every individual.
    1. Each individual is decoded to calculate each prototype parameter v_i, 1 ≤ i ≤ c.
    2. Use v_i and (11) to calculate D_{ik}^2.
    3. Calculate U = [u_{ik}]_{c×n}.
    4. If I_k = ∅,
u_{ik} = \frac{1}{\sum_{j=1}^{c}\left(d_{ik}^{2}/d_{jk}^{2}\right)^{1/(m-1)}}. (14)
  •   If I_k ≠ ∅,

u_{ik} = 0,\ i \in \bar{I}_k, \qquad \sum_{i \in I_k} u_{ik} = 1, (15)
  •   where I_k = {i | 1 ≤ i ≤ c, d_{ik} = 0} and \bar{I}_k = {1, 2, …, c} − I_k.

  • (4) Use U, D_{ik}, and (10) to calculate the objective function J(X; U, V), and then obtain f for each individual.

  • Step 4. Compute statistics for the parent group, determine the best individual, decompose the best individual, and extract the immune vaccine H = {h_i | i = 1, …, m}.

  • Step 5. Use pc and pm to perform crossover and mutation operations on P(k) and obtain group P′(k).

  • Step 6. Execute vaccination and immune selection on P′(k) and obtain group P(k+1).

  • Step 7. If the stop condition τ is satisfied, go to Step 8; otherwise, return to Step 3.

  • Step 8. Decode the best individual, calculate the clustering prototypes v_i, and compute the classification of each sample; this classification result is the clustering result of data set X.

Figure 1. Proposed clustering algorithm flow diagram.
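Putting the pieces together, a simplified driver for Steps 1-8 might look like the sketch below. It is ours, not the authors' implementation: it reuses the helpers sketched earlier (decode, fitness, one_point_crossover, inversion_mutation, roulette_select, extract_vaccine, vaccinate), and the stop condition τ of Step 7 is replaced by a fixed generation budget.

```python
import numpy as np
import random

def mi_ea_fcm(X, c, pn=30, pc=0.9, pm=0.01, pv=0.5, max_gen=100, m=2.0):
    """Simplified MIEA-FCM driver: X is an (n, p) data matrix, c the cluster count."""
    p = X.shape[1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Step 2: random initial population of encoded cluster centers
    pop = [list(np.random.uniform(lo, hi, size=(c, p)).ravel()) for _ in range(pn)]
    for _ in range(max_gen):                                       # Step 7: fixed budget
        fits = [fitness(ind, X, c, p, m) for ind in pop]           # Step 3
        vaccine = extract_vaccine(pop[int(np.argmax(fits))])       # Step 4
        children = []
        while len(children) < pn:                                  # Step 5
            a, b = roulette_select(pop, fits), roulette_select(pop, fits)
            if random.random() < pc:
                a, b = one_point_crossover(a, b)
            children += [inversion_mutation(a, pm), inversion_mutation(b, pm)]
        next_pop = []
        for ch in children[:pn]:                                   # Step 6
            v = vaccinate(ch, vaccine, pv)
            # immune selection: keep the vaccinated child only if it is no worse
            next_pop.append(v if fitness(v, X, c, p, m) >= fitness(ch, X, c, p, m) else ch)
        pop = next_pop
    best = max(pop, key=lambda ind: fitness(ind, X, c, p, m))      # Step 8
    V = decode(best, c, p)
    return V, fcm_memberships(X, V, m)
```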

5. MapReduce Framework

In order to improve the efficiency of the modified immune evolutionary algorithm (MIEA) in processing large datasets, this paper designs an implementation of MIEA in the MapReduce model. The mechanism for processing big data clustering tasks involves two main operations: updating the class centers and evaluating fitness. The class centers are updated based on MIEA. Fitness evaluation calculates the sum of Euclidean distances between each object and its center of mass and then finds the global optimal value. The clustering program divides data objects into clusters, minimizes the sum of Euclidean distances between all objects and their centers of mass, and takes this sum as the fitness function of MIEA. The data clustering process based on MIEA is shown in Figure 2.

Figure 2. Data clustering flow based on MIEA.
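The map/reduce split described above can be sketched in plain Python as follows (ours, not tied to a specific Hadoop API): the map phase assigns each record in a data partition to its nearest center and emits the distance, and the reduce phase sums the emitted distances into the fitness value of that candidate center set.

```python
import numpy as np
from functools import reduce

def map_phase(chunk, centers):
    """Map: for each record in the chunk, emit (nearest-center index, distance)."""
    d = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    return list(zip(d.argmin(axis=1), d.min(axis=1)))

def reduce_phase(emitted):
    """Reduce: sum the emitted distances into the total within-cluster distance."""
    return reduce(lambda acc, kv: acc + kv[1], emitted, 0.0)

# Fitness evaluation of one candidate center set over partitioned data:
# total = reduce_phase([kv for part in partitions
#                          for kv in map_phase(part, centers)])
```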

6. Experiments and Analysis

In order to verify the clustering and data mining performance in the cloud computing environment, we conduct extensive experiments. Medical data are taken from http://archive.ics.uci.edu/ml/. The database is constantly updated and donations of data are accepted; the database types involve life, engineering, science, etc., and the number of records ranges from a few to hundreds of thousands. The data selected in this paper are the Breast Cancer Wisconsin (Original) Data Set. These data come from clinical case reports of the University of Wisconsin Hospital in the United States, and each record has 11 attributes.

Due to limited space, we display only a few results here. The computing platform is configured with an Intel Core i7 4.0 GHz CPU, 16 GB of memory, and an NVIDIA GTX 780 GPU. The algorithm is implemented on the Apache Hadoop platform. The sampling frequency of big data is fs = 20 kHz, and the time center of big data clustering is t0 = 20 s. The data size ranges from 50 MB to 2 GB. The crossover probability is pc = 0.95, the variation probability is pv = 0.3, and the fuzzy index is m = 2. We also select three state-of-the-art clustering methods for comparison: HGM [34], WPC [35], and ACCH [36].

6.1. Result 1

Table 1 describes the feature attributes of this dataset.

Table 1.

Attribute description.

Name of attribute Value range Feature number
Clump thickness 1–10 1
Cell size uniformity 1–10 2
Cell morphology uniformity 1–10 3
Marginal adhesion 1–10 4
Single epithelial cell size 1–10 5
Bare nuclei 1–10 6
Bland chromatin 1–10 7
Normal nucleoli 1–10 8
Mitoses 1–10 9

In this paper, the proposed algorithm is adopted to calculate the weight of each feature. Features whose weight is less than a certain threshold are removed; according to the actual situation in this paper, features 2 and 3, which have the smallest weights, are removed. During the algorithm a sample R is selected at random, and different random numbers lead to some discrepancy in the resulting weights. Therefore, this paper averages the results of 20 runs; the average value of each weight is shown in Figure 3.

Figure 3. Weight change in 20 runs.
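The averaging over 20 runs can be sketched as follows (ours): weight_fn stands in for the paper's randomized feature-weighting procedure, which is not specified here, and the threshold value is illustrative only.

```python
import numpy as np

def average_feature_weights(weight_fn, X, y, runs=20, threshold=0.05):
    """Average per-feature weights returned by weight_fn over several runs
    and keep the indices of the features at or above the threshold."""
    W = np.mean([weight_fn(X, y) for _ in range(runs)], axis=0)
    kept = np.flatnonzero(W >= threshold)
    return W, kept
```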

By analyzing the data set, the importance of each attribute weight can be obtained, which has reference value for clinical diagnosis and can be used in the analysis of actual cases. This helps avoid misdiagnosis as far as possible and improves diagnosis speed and accuracy. According to the attributes, we obtain the optimal objective function values given in Table 2.

Table 2.

Performance comparison.

Method HGM WPC ACCH Proposed
Accuracy (%) 65 72 76 85
Optimal value 12.54 10.31 8.75 6.59

6.2. Result 2

To evaluate the performance of the proposed algorithm, the composite data sets given in Table 3 are adopted. Four public data sets with different attributes, all from the UCI Machine Learning Repository, are assembled into one large data set: each of the four data sets is randomly replicated several times to form a large data set with about 10^7 records.

Table 3.

Attributes of experimental datasets.

Number Dataset Sample number Cluster number Dimensionality
1 Iris 10000050 3 4
2 CMC 10000197 3 9
3 Wine 10000040 3 13
4 Vowel 10000822 6 3

The F-measure is adopted as the evaluation index of clustering quality. It is calculated from two indexes, precision and recall, and is defined as

F(i, j) = \frac{2\, r(i, j)\, p(i, j)}{r(i, j) + p(i, j)}, (16)

where j represents a cluster generated by the clustering method, i denotes a class label of the original dataset, and r and p represent recall and precision, respectively. Recall is defined as r(i, j) = n_ij / n_i, and precision as p(i, j) = n_ij / n_j, where n_ij is the number of members of class i assigned to cluster j and n_i and n_j are the sizes of class i and cluster j, respectively. For a data set of size n, the F-measure is calculated as

F = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j), (17)

where the upper bound of F is 1; a larger F-measure value indicates higher clustering quality, as shown in Table 4. As the dataset number increases, the F value decreases slightly on the whole. However, the proposed method's value of 0.817 is still higher than those of HGM, WPC, and ACCH.

Table 4.

F comparison with different methods.

Dataset number HGM WPC ACCH Proposed
1 0.678 0.796 0.853 0.912
2 0.312 0.336 0.398 0.423
3 0.493 0.528 0.735 0.796
4 0.597 0.654 0.678 0.817
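For completeness, a small sketch (ours) of the F-measure computation of (16)-(17) from true class labels and predicted cluster labels:

```python
import numpy as np

def f_measure(true_labels, cluster_labels):
    """F of (17): weighted best F(i, j) over clusters j for each class i."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    F = 0.0
    for i in np.unique(true_labels):
        ni = np.sum(true_labels == i)
        best = 0.0
        for j in np.unique(cluster_labels):
            nj = np.sum(cluster_labels == j)
            nij = np.sum((true_labels == i) & (cluster_labels == j))
            if nij == 0:
                continue
            r, p = nij / ni, nij / nj          # recall and precision of (i, j)
            best = max(best, 2 * r * p / (r + p))
        F += (ni / n) * best
    return F
```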

The following experiments concern feature extraction under cloud computing.

The original big data feature distribution is random, as shown in Figure 4, and it is difficult to achieve regular feature extraction in the two-dimensional space. We use the proposed algorithm for feature extraction and data clustering to build the big data feature extraction model. The feature extraction results are shown in Figure 5.

Figure 4. Big data two-dimensional feature distribution in cloud computing.

Figure 5. Feature extraction result with the proposed method.

As can be seen in Figure 5, the proposed algorithm can effectively extract the features of big data in cloud computing, and the beam focusing performance is good, which provides an accurate basis for optimal data clustering. Using different big data clustering optimization algorithms, we obtain the clustering center optimization performance curves shown in Figure 6.

Figure 6. Comparison results.

We also record the best value and mean value within 200 iterations, as shown in Figures 7–10, from which we can see that the best values are obtained with our proposed method.

Figure 7. HGM method.

Figure 8. WPC method.

Figure 9. ACCH method.

Figure 10. Proposed method.

7. Conclusions

In the cloud computing environment, vast amounts of data need to be scheduled and accessed to achieve the goal of medical data mining. This paper puts forward a new medical big data clustering algorithm based on a modified immune algorithm. It first analyzes the big data structure model in the cloud computing environment to build the big data feature extraction and information model, and then designs an immune optimization algorithm for clustering to achieve optimized clustering of big data. Simulation results show that the proposed algorithm improves the clustering performance of big data in the cloud computing environment; applied to IoT data clustering, the new algorithm reduces the error rate and exhibits better performance. In the future, we will study deep learning methods and apply them to actual engineering projects.

Acknowledgments

This research was funded by the Heilongjiang Province Science Foundation for Returnees (grant number: LC2017027), the Jiamusi University Science and Technology Innovation Team Construction Project (grant number: CXTDPY-2016-3), and the Basic Research Project of the Heilongjiang Province Department of Education (grant number: 2016-kyywf-0547).

Contributor Information

Hang Li, Email: lihangsoft@163.com.

Desheng Liu, Email: zdhlds@163.com.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  • 1. Kisekka V., Giboney J. S. The effectiveness of health care information technologies: evaluation of trust, security beliefs, and privacy as determinants of health care outcomes. Journal of Medical Internet Research. 2018;20(4):e107. doi: 10.2196/jmir.9014.
  • 2. Zhao L., Chen Z., Yang L. T., Jamal Deen M., Jane Wang Z. Deep semantic mapping for heterogeneous multimedia transfer learning using co-occurrence data. ACM Transactions on Multimedia Computing, Communications, and Applications. 2019;15(1s):9–21. doi: 10.1145/3241055.
  • 3. Zhang Q., Bai C., Yang L. T., Chen Z., Li P., Yu H. A unified smart Chinese medicine framework for healthcare and medical services. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2019. doi: 10.1109/TCBB.2019.2914447.
  • 4. Yin S., Liu J., Teng L. An improved artificial bee colony algorithm for staged search. TELKOMNIKA Telecommunication, Computing, Electronics and Control. 2016;14(3):1099–1104. doi: 10.12928/telkomnika.v14i3.3609.
  • 5. Liu T., Yin S. An improved particle swarm optimization algorithm used for BP neural network and multimedia course-ware evaluation. Multimedia Tools and Applications. 2016;76(9):11961–11974. doi: 10.1007/s11042-016-3776-5.
  • 6. Jie L., Teng L., Yin S. An improved discrete firefly algorithm used for traveling salesman problem. Proceedings of the International Conference in Swarm Intelligence; July–August 2017; Fukuoka, Japan. Springer; pp. 593–600.
  • 7. Yin S.-L., Liu J. A K-means approach for map-reduce model and social network privacy protection. Journal of Information Hiding and Multimedia Signal Processing. 2016;7(6):1215–1221.
  • 8. Peng L., Chen Z., Yang L. T., et al. Deep convolutional computation model for feature learning on big data in Internet of things. IEEE Transactions on Industrial Informatics. 2018;14(2):790–798. doi: 10.1109/tii.2017.2739340.
  • 9. Zhao L., Chen Z., Yang Y., Zou L., Wang Z. J. ICFS clustering with multiple representatives for large data. IEEE Transactions on Neural Networks and Learning Systems. 2019;30(3):728–738. doi: 10.1109/tnnls.2018.2851979.
  • 10. Gao J., Li P., Chen Z. A canonical polyadic deep convolutional computation model for big data feature learning in internet of things. Future Generation Computer Systems. 2019;99:508–516. doi: 10.1016/j.future.2019.04.048.
  • 11. Yang J., Xie Y., Guo Y. Panel data clustering analysis based on composite PCC: a parametric approach. Cluster Computing. 2019;22(S4):8823–8833. doi: 10.1007/s10586-018-1973-x.
  • 12. Zhang Q., Bai C., Chen Z., et al. Deep learning models for diagnosing spleen and stomach diseases in smart chinese medicine with cloud computing. Concurrency and Computation: Practice and Experience. 2019;(e5252). doi: 10.1002/cpe.5252.
  • 13. Parvin H., Minaei-Bidgoli B. A clustering ensemble framework based on elite selection of weighted clusters. Advances in Data Analysis and Classification. 2013;7(2):181–208. doi: 10.1007/s11634-013-0130-x.
  • 14. Zhao L., Chen Z., Yang Y., Jane Wang Z., Leung V. C. M. Incomplete multi-view clustering via deep semantic mapping. Neurocomputing. 2018;275:1053–1062. doi: 10.1016/j.neucom.2017.07.016.
  • 15. Zhao W., Yan L., Zhang Y. Geometric-constrained multi-view image matching method based on semi-global optimization. Geo-spatial Information Science. 2018;21(2):115–126. doi: 10.1080/10095020.2018.1441754.
  • 16. Li P., Chen Z., Yang L. T., Zhao L., Zhang Q. A privacy-preserving high-order neuro-fuzzy c-means algorithm with cloud computing. Neurocomputing. 2017;256:82–89. doi: 10.1016/j.neucom.2016.08.135.
  • 17. Meng Z., Pan J.-S., Kong L. Parameters with adaptive learning mechanism (PALM) for the enhancement of differential evolution. Knowledge-Based Systems. 2018;141:92–112. doi: 10.1016/j.knosys.2017.11.015.
  • 18. Zhang Q., Yang L. T., Chen Z., Li P. PPHOPCM: privacy-preserving high-order possibilistic c-means algorithm for big data clustering with cloud computing. IEEE Transactions on Big Data. 2017;(99):1. doi: 10.1109/tbdata.2017.2701816.
  • 19. Li J., Ma T., Tang M., Shen W., Jin Y. Improved FIFO scheduling algorithm based on fuzzy clustering in cloud computing. Information. 2017;8(1):25. doi: 10.3390/info8010025.
  • 20. Minaei-Bidgoli B., Asadi M., Hamid P. An ensemble based approach for feature selection. Proceedings of the International Conference on Engineering Applications of Neural Networks; September 2011; Corfu, Greece. EANN; pp. 240–246.
  • 21. Hu G., Du Z. Adaptive kernel-based fuzzy C-means clustering with spatial constraints for image segmentation. International Journal of Pattern Recognition & Artificial Intelligence. 2018;33(1). doi: 10.1142/s021800141954003x.
  • 22. Gao J., Li J., Li Y. Approximate event detection over multi-modal sensing data. Journal of Combinatorial Optimization. 2016;32(4):1002–1016. doi: 10.1007/s10878-015-9847-0.
  • 23. Zhao L., Chen Z., Jane Wang Z. Unsupervised multiview nonnegative correlated feature learning for data clustering. IEEE Signal Processing Letters. 2018;25(1):60–64. doi: 10.1109/lsp.2017.2769086.
  • 24. Nejatian S., Parvin H., Faraji E. Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification. Neurocomputing. 2018;276:55–66. doi: 10.1016/j.neucom.2017.06.082.
  • 25. Tavana M., Parvin H., Rezazadeh F. Parkinson detection: an image processing approach. Journal of Medical Imaging and Health Informatics. 2017;7(2):464–472. doi: 10.1166/jmihi.2017.1788.
  • 26. Jung H. C., Kim S. H., Lee J. H., Kim J. H., Han S. W. Gene regulatory network analysis for triple-negative breast neoplasms by using gene expression data. Journal of Breast Cancer. 2017;20(3):240–245. doi: 10.4048/jbc.2017.20.3.240.
  • 27. Mohammadi Jenghara M., Ebrahimpour-Komleh H., Parvin H. Dynamic protein–protein interaction networks construction using firefly algorithm. Pattern Analysis and Applications. 2018;21(4):1067–1081. doi: 10.1007/s10044-017-0626-7.
  • 28. Langmead B., Nellore A. Cloud computing for genomic data analysis and collaboration. Nature Reviews Genetics. 2018;19(5). doi: 10.1038/nrg.2018.8.
  • 29. Abdel-Basset M., Mohamed M., Chang V. NMCDA: a framework for evaluating cloud computing services. Future Generation Computer Systems. 2018;86:12–29. doi: 10.1016/j.future.2018.03.014.
  • 30. Li P., Chen Z., Yang L. T., et al. An incremental deep convolutional computation model for feature learning on industrial big data. IEEE Transactions on Industrial Informatics. 2018;15(3):1341–1349. doi: 10.1109/tii.2018.2871084.
  • 31. Zilong G., Sun'an W., Jian Z. A novel immune evolutionary algorithm incorporating chaos optimization. Pattern Recognition Letters. 2006;27(1):2–8. doi: 10.1016/j.patrec.2005.06.014.
  • 32. Zhang Q., Yang L. T., Yan Z., Chen Z., Li P. An efficient deep learning model to predict cloud workload for industry informatics. IEEE Transactions on Industrial Informatics. 2018;14(7):3170–3178. doi: 10.1109/tii.2018.2808910.
  • 33. Li P., Chen Z., Yang L. T., Gao J., Zhang Q., Deen M. J. An improved stacked auto-encoder for network traffic flow classification. IEEE Network. 2018;32(6):22–27. doi: 10.1109/mnet.2018.1800078.
  • 34. Manogaran G., Vijayakumar V., Varatharajan R., et al. Machine learning based big data processing framework for cancer diagnosis using hidden markov model and GM clustering. Wireless Personal Communications. 2018;102(3):2099–2116. doi: 10.1007/s11277-017-5044-z.
  • 35. Zhang Q., Yang L. T., Castiglione A., et al. Secure weighted possibilistic c-means algorithm on cloud for clustering big data. Information Sciences. 2019;479:515–525. doi: 10.1016/j.ins.2018.02.013.
  • 36. Li H., Li H., Wei K. Automatic fast double KNN classification algorithm based on ACC and hierarchical clustering for big data. International Journal of Communication Systems. 2018;31(16):e3488. doi: 10.1002/dac.3488.


