PLoS One. 2021 Jan 28;16(1):e0246039. doi: 10.1371/journal.pone.0246039

HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets

Shilan S Hameed 1,2,*, Rohayanti Hassan 3, Wan Haslina Hassan 1, Fahmi F Muhammadsharif 4, Liza Abdul Latiff 5
Editor: Bryan C Daniels
PMCID: PMC7842997  PMID: 33507983

Abstract

The selection and classification of genes are essential for identifying the genes related to a specific disease. Developing a user-friendly application that combines statistical rigor with machine learning functionality to help biomedical researchers and end users is therefore of great importance. In this work, a novel stand-alone application based on a graphical user interface (GUI) was developed to perform the full workflow of gene selection and classification in high dimensional datasets. The so-called HDG-select application was validated on eleven high dimensional datasets in CSV and GEO SOFT formats. The proposed tool uses an efficient combined filter-GBPSO-SVM algorithm and has been made freely available to users. HDG-select was found to outperform other tools reported in the literature, offering competitive performance, accessibility, and functionality.

Introduction

The microarray is a tool used to estimate whether mutations in specific genes are present in a particular individual. The most common type of microarray is used to measure gene expression, where the expression values of thousands of genes are obtained from a single microarray sample [1]. The genes most strongly associated with a specific disease can be identified through gene selection and classification of microarray datasets, in which various statistical and optimization algorithms are involved. Accurate selection of the attributed genes ultimately enables cost-effective and useful studies of the altered genes [2]. Furthermore, the identified genes help classify clinical samples into normal and disease groups. Gene selection methods fall into two main types: filter-based methods and wrapper-based ones [3, 4]. Filter-based methods work independently of any classifier, so they deliver results faster. They are well suited to analyzing high dimensional microarray data with thousands of genes and hundreds of samples [5, 6]. The weakness of filter methods is that most of them cannot capture correlations among genes, so redundant genes may be selected. This drawback reduces the final classifier accuracy if a filter alone is used to select the discriminative genes [4]. Hence, the best approach is to use filters in the preliminary selection steps [6]. Wrappers perform better in selecting discriminative genes since they rely on a model hypothesis trained and tested in the gene space [4]. However, wrapper-based techniques are computationally heavy and can be a poor choice if applied directly to high dimensional datasets without any preprocessing [7].

Many computational methods fail to extract a small subset of attributed genes from high dimensional datasets because of the various correlations and redundancies among the genes. Interestingly, studies in the field of cancer informatics have shown a substantial contribution of data mining and machine learning to finding the attributed genes [8–11]. Although machine learning has proved to perform well in cancer classification, it still requires further improvement and robustness in terms of efficiency and computational cost, especially when high dimensional datasets are investigated. This is because high dimensional datasets contain many redundant and highly variable gene expression values, which reduce the accuracy and efficiency of the computational techniques used to mine the most attributed genes [12]. In general, noise in gene expression levels arises from biological variations associated with the experiments or from alterations in the genes themselves [4, 13]. Therefore, finding the attributed genes in high dimensional datasets is neither easy nor straightforward unless careful analysis and selection rules are applied.

Along this line, a binary variant of the Harris hawks optimizer (HHO) was proposed to boost the efficacy of wrapper-based gene selection in high dimensional datasets [14]. A two-stage sparse logistic regression was also reported, aiming to obtain an efficient subset of genes with high classification capability [15]; it combines a screening approach as the filter method with an adaptive lasso using a new weight as the wrapper method. Gene selection in a high-dimensional colon cancer microarray dataset was enhanced by an ensemble gene selection technique based on the t-test and a genetic algorithm (GA) [16]: after preprocessing the data with the t-test, a Nested-GA was employed to obtain the optimal subset of genes. Various other approaches have been reported in the literature to increase gene selection efficacy in high dimensional datasets, such as a hybrid binary coral reefs optimization algorithm with simulated annealing [17], ensembles of regularized regression models with resampling-based lasso [18], variable-size cooperative coevolutionary particle swarm optimization [19], a hybrid dimensionality reduction forest with pruning [20], hybrid feature selection based on ReliefF and binary dragonfly [21], as well as hybrid rough set theory and hypergraph [22]. It is observed that an effective approach for gene selection in microarray datasets is a combination of filter and wrapper algorithms. Numerous techniques have thus been used to select attributed genes in high dimensional datasets; however, the complexity of the algorithms and their computational cost limit their reproducibility for rapid selection of discriminative genes in massive datasets. Particle swarm optimization (PSO), as a search strategy for gene selection, has proved to be more efficient and easier to implement than other methods [23, 24], because only a few parameters need to be adjusted and it therefore saves memory. The modified geometric binary PSO (GBPSO) was effectively utilized for gene selection in an autism dataset [12]. Details on PSO and its GBPSO variant can be found in the literature [12, 24–27]. GBPSO can be used as a wrapper feature selection method together with a support vector machine (SVM). SVMs are a group of supervised machine-learning methods developed by Vapnik [28]. The various forms of this algorithm are widely used [9, 24, 29], particularly for the classification of medical data [30–34]. Moreover, SVM can classify both linearly and nonlinearly separable data. An important property of SVM is that the number of coefficients to be determined depends primarily on the number of samples rather than the number of genes. For gene classification, SVM utilizes kernel functions to obtain a hyperplane that separates the classes in a transformed feature space. Different types of kernels can be applied [24, 35, 36], with each kernel type appropriate for different data. In the current work, a polynomial kernel was utilized for the SVM because it gave the highest classification accuracy when applied to high dimensional datasets.

It is well known that the process of gene selection and classification becomes tedious and time consuming when the datasets are not curated, such as the GEO SOFT datasets. A review of the literature shows that various tools have been created for sequence and genomic data analysis [37–40], while few applications have been established for gene selection and classification [41–43]. For instance, a Java GUI application was developed for microarray data classification using an SVM classifier [43]. The authors concluded that the application performs well when a radial basis SVM kernel is used; however, their tool is no longer accessible and was created only for classification. The varSelRF package and the GeneSrF tool were developed for gene selection, reporting the associated classification error, using R and Python [41]. The varSelRF package can only be used on Linux and Unix operating systems, while GeneSrF is a web-based tool that is not currently accessible. In another study [42], ArrayMining.net, a web-based tool, was constructed for gene selection and class identification using supervised and unsupervised techniques. In the current work, a novel user-friendly and stand-alone (non-web-based) application is proposed for simple and efficient gene selection and classification in high dimensional datasets. The software was developed by interfacing MATLAB with the Weka tool, combining their benefits in one package. The proposed application is named HDG-select, referring to its capability of high dimensional gene selection. It can be used by researchers and students to reduce the burden of the laborious steps of dataset curation, gene selection, and classification within a one-platform scheme. The main advantages of the proposed application include dataset curation, user-defined gene filtration, handling both numerical and categorical samples, and combining the functionality of MATLAB and Weka in a single tool. Without such a tool, a complete gene selection workflow including dataset curation and filtration would require comprehensive coding in MATLAB, after which the results would need to be transferred to Weka in order to run the GBPSO-SVM algorithm. The proposed HDG-select collects all necessary operations within a single user-friendly graphical interface, giving users simplicity, accuracy, and reduced computational cost. The tool uses a combination of filters and the GBPSO wrapper for gene selection, while SVM is used for classification. Furthermore, the tools reported in the literature accept only CSV files as input datasets, whereas the developed tool can handle both CSV and .soft file formats, which is specifically useful for analysing the non-curated genomic data available in the GEO database.

Materials and methods

Implementation procedure

The process of selecting and classifying genes in high dimensional microarrays using the developed HDG-select tool, and the built-in structure of the application, are shown in Figs 1 and 2, respectively. The first step is to reduce the dimensionality of the datasets by removing the redundant/irrelevant genes whose expression values are close between the control and non-control classes. For this purpose, the mean and median ratios are calculated based on the variance of the gene expression, as discussed in detail later. This step makes the subsequent steps more efficient and easier. Next, two different filters, the t-test (TT) and the Wilcoxon rank sum (WRS) test, are used to retain the desired number of top-ranked relevant genes. These filters and their combination are elaborated in the next sections. As shown in Fig 2, the curation and filtration steps are implemented in MATLAB code, while the wrapper-based GBPSO-SVM algorithm is realized through Java programming, i.e., by interfacing the GBPSO-SVM algorithm from Weka with MATLAB.
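For readers who want to see how this interfacing pattern can look in practice, the sketch below shows how the Weka classes involved in the GBPSO-SVM step can be instantiated from MATLAB through its Java interface. The SMO, PolyKernel, WrapperSubsetEval and AttributeSelection classes are standard Weka classes; PSOSearch refers to the external geometric PSO package [65]; the jar names, file names and option values are illustrative assumptions rather than the exact settings used by HDG-select.

    % Hedged sketch: driving Weka's wrapper-based attribute selection from MATLAB.
    javaaddpath('weka.jar');                                   % Weka on the Java path (assumed location)
    javaaddpath('PSOSearch.jar');                              % geometric PSO search package [65] (assumed location)

    loader = javaObject('weka.core.converters.ArffLoader');
    loader.setFile(javaObject('java.io.File', 'filtered_genes.arff'));   % hypothetical file name
    data = loader.getDataSet();
    data.setClassIndex(data.numAttributes() - 1);              % class label in the last column

    smo    = javaObject('weka.classifiers.functions.SMO');     % SVM classifier
    kernel = javaObject('weka.classifiers.functions.supportVector.PolyKernel');
    smo.setKernel(kernel);                                     % polynomial kernel

    evaluator = javaObject('weka.attributeSelection.WrapperSubsetEval');
    evaluator.setClassifier(smo);
    evaluator.setFolds(10);                                    % 10-fold CV inside the wrapper

    search = javaObject('weka.attributeSelection.PSOSearch');  % GBPSO search (class name assumed from [65])
    attsel = javaObject('weka.attributeSelection.AttributeSelection');
    attsel.setEvaluator(evaluator);
    attsel.setSearch(search);
    attsel.SelectAttributes(data);                             % run the Filter-GBPSO-SVM selection
    selected = attsel.selectedAttributes();                    % indices of the selected genes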

Fig 1. The flowchart of gene selection and classification in high dimensional datasets using the HDG-select application.


The highly irrelevant genes are first removed (dataset curation) by considering the mean and median ratios, followed by the use of different filters in combination with the GBPSO algorithm.

Fig 2. The built-in structure of the developed HDG-select application.


The curation and filtration steps are implemented in MATLAB code, while the wrapper-based GBPSO-SVM algorithm is realized by interfacing Weka with MATLAB through Java code.

Microarray datasets

Validation and assessment of the developed HDG-select application were carried out by testing 11 high dimensional datasets covering different types of diseases. The characteristics of the datasets are given in Tables 1 and 2. The first six datasets in Table 1 are in CSV format; they were pre-processed and previously used for gene expression analysis [11]. The leukemia cancer dataset was obtained from [44] and is given as S1 Dataset. The colon cancer microarray dataset was originally analyzed by Alon et al. [45] and is given as S2 Dataset. The prostate cancer dataset is based on an oligonucleotide microarray and was obtained from [46]; it is given as S3 Dataset. The remaining datasets (Breast, CNS and Ovarian) were obtained from [47] and are given as S4–S6 Datasets, respectively.

Table 1. The main characteristics of the pre-reduced high dimensional datasets in csv format.

Dataset #Genes #Samples #Class (class1:class2)
Leukemia 3051 72 2(25:47)
Colon 2000 62 2(22:40)
Prostate 6033 102 2(50:52)
Breast 24481 97 2(46:51)
CNSa 7129 72 2(21:39)
Ovarian 15154 253 2(162:91)

a Central nervous system.

Table 2. The main characteristics of the original/non-curated GEO datasets in soft format.

Dataset #Genes #Samples #Class (class1:class2)
Inflammatory Breast Cancer (GDS3097) 22283 48 2(35 NIBC:13 IBC)
Breast Cancer (GDS3716) 22283 42 2(24 control:18 breast cancer)
Brain Metastatic Breast Cancer (GDS5306) 61359 38 2(19(BMBC) tumor:19(NBC) tumor)
Autism (GDS4431) 54675 146 2(69 control:77 autism)
Influenza A (GDS6063) 48107 10 2(5 positive:5 negative)

The second batch of investigated datasets is in SOFT format and originally non-curated; these datasets can be downloaded from the well-known public GEO (NCBI) repository [48] under their GDS file names. Their main characteristics are given in Table 2, with brief descriptions as follows:

Inflammatory breast cancer (GDS3097)

Tumor epithelium and underlying stromal cells were extracted from human breast cancer by laser capture microdissection to study gene expression differences between inflammatory and non-inflammatory breast cancer tissue types.

Breast cancer (GDS3716)

In this dataset, 42 laser capture microdissected, histologically normal breast tissue samples were analyzed with Affymetrix HU133A microarrays.

Brain metastatic breast cancer (GDS5306)

Gene expression of 19 HER2+ breast cancer brain metastases was compared with that of HER2+ nonmetastatic primary tumors.

Autism disorder (GDS4431)

Total RNA was extracted for microarray experiments with Affymetrix Human U133 Plus 2.0 3′ Expression Arrays. The autistic samples were diagnosed by medical professionals (a developmental pediatrician and a psychologist) according to the DSM-IV criteria, and the diagnosis was confirmed on the basis of the ADOS and ADI-R criteria [49].

Influenza A (GDS6063)

More than 2600 genes were differentially expressed in pDCs exposed to influenza A compared to control (no virus) blood pDCs.
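As a hedged illustration, the GDS SOFT files listed in Table 2 can be read into MATLAB with the Bioinformatics Toolbox function geosoftread; the struct field names used below (Data, Identifier, ColumnNames) reflect the typical geosoftread output for GDS records and should be treated as assumptions, as should the hard-coded class ranges.

    % Sketch: loading a GEO GDS SOFT file in MATLAB (Bioinformatics Toolbox).
    gds = geosoftread('GDS4431.soft');        % e.g. the autism dataset (file downloaded from GEO)
    X   = gds.Data;                           % genes-by-samples expression matrix (assumed field name)
    genes   = gds.Identifier;                 % probe/gene identifiers (assumed field name)
    samples = gds.ColumnNames;                % sample (GSM) identifiers (assumed field name)
    % Class labels are not stored in the SOFT file itself; as in HDG-select,
    % the sample ranges per class must be taken from the GEO dataset description.
    y = [zeros(1,69), ones(1,77)];            % 69 controls vs 77 autism samples (Table 2)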

Dimensionality reduction of soft format datasets using mean and median ratio

Because of the variance among gene expression values in high dimensional datasets and the non-curated nature of the GEO SOFT datasets [48, 50], it is imperative to apply a pre-processing step that reduces the dimensionality of the datasets by removing redundant and highly irrelevant genes. For this purpose, the overall similarity of gene expression between the two classes was assessed through the mean and median values. When the mean criterion is applied to identify redundant/irrelevant genes, the mean of the gene expression in each class is computed; similarly, when the median criterion is applied, the median of the gene expression is computed.

It was observed that when there is high variance in the gene expression (variance ≥ 15%), applying the median criterion to reduce the dataset dimensionality performs better than applying the mean criterion [12], because the mean of the gene expression is affected by the high variance. Therefore, in this work, the median criterion is applied to genes with variance ≥ 15%, while the mean criterion is used for those with variance < 15%. Consequently, genes whose median or mean ratio between the two classes lies between 0.95 and 1/0.95 are removed from the dataset. This threshold range is chosen intentionally to remove the redundant and less significant genes from the whole dataset, making the subsequent gene selection steps simpler and more cost-effective without compromising selection accuracy.
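A minimal MATLAB sketch of this curation step is given below, assuming a genes-by-samples expression matrix and a logical class vector. The paper does not define the variance measure precisely, so the coefficient of variation is used here as an assumed proxy for the 15% rule, and the function name curateByRatio is hypothetical.

    % Sketch of the mean/median-ratio curation step (assumptions noted above).
    % X : genes-by-samples expression matrix; y : 1-by-samples logical class labels.
    function keepIdx = curateByRatio(X, y)
        thr   = 0.95;                                % ratio threshold from the paper
        ratio = zeros(size(X,1), 1);
        for g = 1:size(X,1)
            a  = X(g,  y);  b = X(g, ~y);            % expression in the two classes
            cv = std(X(g,:)) / abs(mean(X(g,:)));    % assumed variance measure
            if cv >= 0.15                            % high variance: median criterion
                ratio(g) = median(a) / median(b);
            else                                     % low variance: mean criterion
                ratio(g) = mean(a) / mean(b);
            end
        end
        % Remove genes whose ratio falls inside [0.95, 1/0.95] (too similar between classes).
        keepIdx = find(ratio < thr | ratio > 1/thr);
    end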

Gene selection using statistical filters

The second step of gene selection in the GEO SOFT datasets, after dimensionality reduction, was performed using two different statistical filters and their combination, namely the two-sample t-test (TT), the Wilcoxon rank sum test (WRS), and the combined TT-WRS. For pre-curated CSV datasets, this filtering can serve as the first step of gene selection, because the proposed HDG-select tool allows users to bypass the dimensionality reduction step for such datasets. These filters were chosen because the TT and WRS filters performed well when used for gene selection in the high dimensional autism dataset [12]. Each filter is based on different assumptions about the mean, median and variance, all of which can be found in high dimensional datasets. Because the filtration power differs between filters, their combination may yield better selection performance [11, 51, 52]. The TT filter has been applied to microarray genes [53] and shows strong scalability when the number of genes is high [54]; hence, some researchers have used the TT filter as the only gene selection step [55, 56]. The WRS filter has also been used effectively for the pre-selection of genes [57, 58], especially when the data exhibit high variance [59].

In this work, the statistical filters are applied to the datasets in a 10-fold scheme in order to avoid overfitting. The genes are ranked across the 10 folds from the most significant to the least significant, and, based on their ranking position (weight), the desired number of top-ranked genes can be extracted. The genes are weighted and ranked according to their significance using the global weight

$w(f) = \sum_{i=1}^{K} w_i(f)$ (1)

where $w_i(f)$ is the weight assigned to gene f in fold i and K is the total number of folds (here K = 10).
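The ranking step can be sketched in MATLAB as follows; the per-fold weight $w_i(f)$ is supplied as a function handle (scoreFcn), since the exact per-fold weight used by HDG-select is not stated here, and the function name rankGenesByFolds is hypothetical. Example per-fold scores for the TT and WRS filters are sketched in the following subsections.

    % Sketch of the 10-fold global weighting (Eq 1): each fold contributes a
    % per-gene weight, and the weights are summed across folds before ranking.
    % X : genes-by-samples matrix; y : class labels; scoreFcn : per-fold score handle.
    function [topIdx, w] = rankGenesByFolds(X, y, scoreFcn, nTop)
        K = 10;
        c = cvpartition(y, 'KFold', K);
        w = zeros(size(X,1), 1);                       % global weight w(f)
        for i = 1:K
            tr = training(c, i);                       % training split of fold i
            w  = w + scoreFcn(X(:, tr), y(tr));        % accumulate w_i(f)
        end
        [~, order] = sort(w, 'descend');
        topIdx = order(1:nTop);                        % top-ranked genes
    end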

The t-test (TT) filter is a univariate filter commonly used for binary classes [53, 54]. The general assumption of the t-test is that the values within each of the two classes follow a bell-shaped (normal) distribution. The null hypothesis of the t-test assumes equal means (and, in its standard form, equal variances), and this assertion is rejected in favor of the alternative hypothesis. The t-test formula is [60]:

$t = \dfrac{\bar{c}_1 - \bar{c}_2}{\sqrt{\sigma_1^2 / n + \sigma_2^2 / m}}$ (2)

where $\bar{c}_1$ and $\bar{c}_2$ are the mean expression values of the first and second classes, $\sigma_1^2$ and $\sigma_2^2$ are the corresponding variances, and n and m are the first- and second-class sample sizes, respectively. The test returns a decision value that is either 0 or 1: a value of 1 indicates rejection of the null hypothesis at the 5% significance level, while 0 indicates acceptance at the same level. The test also returns the probability (p-value) of the statistic; a lower p-value implies a more pronounced difference between the compared samples. The standard form of the TT filter assumes equal variances and normal distributions, whereas the unequal-variance (Welch-type) form does not assume equal variances. In this work, the unequal-variance TT filter is used because data distributions in high dimensional datasets typically exhibit unequal variances and irregular shapes due to high noise and widely varying expression values.
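A per-fold TT weight can be computed with MATLAB's ttest2 using the unequal-variance option; taking the negative log p-value as the per-gene weight is an assumption for illustration (the tool's internal weighting may differ), as is the 0/1 class coding.

    % Per-fold TT score for one training split (Xtr: genes-by-samples, ytr: 0/1 labels).
    function w = ttScore(Xtr, ytr)
        g1 = Xtr(:, ytr == 1);
        g2 = Xtr(:, ytr ~= 1);
        w  = zeros(size(Xtr,1), 1);
        for g = 1:size(Xtr,1)
            [~, p] = ttest2(g1(g,:), g2(g,:), 'Vartype', 'unequal');   % Welch-type t-test
            w(g) = -log10(p);                                          % smaller p -> larger weight
        end
    end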

The second filter is the Wilcoxon rank sum (WRS) test, a non-parametric filter method [61]. Hence, it does not require the gene values within the classes to follow a normal distribution, which is typically not the case in high dimensional datasets. This method is also known as the Mann-Whitney test [62, 63]. It uses a median-based criterion to distinguish between the two classes: the test compares the sample medians and delivers results as ranks rather than raw numerical values [64]. The index value and rank of each element in the result can be determined by arranging them in ascending order. The null hypothesis of the WRS test is that the two classes originate from the same distribution. The statistical formula of the Wilcoxon rank sum is as follows [57]:

$s(g) = \sum_{i=1}^{N_0} \sum_{j=1}^{N_1} I\left(x_j(g) - x_i(g) \le 0\right)$ (3)

where I is an indicator function: if the condition $x_j(g) - x_i(g) \le 0$ is true, I equals 1; otherwise it equals 0. Here $x_i(g)$ is the expression value of gene g in sample i, $N_0$ and $N_1$ are the numbers of observations in the two classes, respectively, and s(g) reflects the difference in the expression of gene g between the two classes. Depending on whether s(g) approaches 0 or the maximum of $N_0 \times N_1$, the gene is ranked as more important for classification. The gene's importance is calculated with the following equation:

$q(g) = \max\left(s(g),\, N_0 \times N_1 - s(g)\right)$ (4)
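Eqs (3) and (4) can be implemented directly in MATLAB as a per-fold WRS score; the function name wrsScore and the 0/1 class coding are assumptions for illustration.

    % Per-fold WRS score via Eqs (3)-(4): count the sample pairs where class-1
    % expression does not exceed class-0 expression, then fold to q(g).
    function w = wrsScore(Xtr, ytr)
        X0 = Xtr(:, ytr == 0);   X1 = Xtr(:, ytr == 1);
        N0 = size(X0, 2);        N1 = size(X1, 2);
        w  = zeros(size(Xtr,1), 1);
        for g = 1:size(Xtr,1)
            s    = sum(sum(X1(g,:).' - X0(g,:) <= 0));   % Eq (3)
            w(g) = max(s, N0*N1 - s);                    % Eq (4)
        end
    end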

Gene selection using GBPSO-SVM algorithm

In the final step of gene selection, the wrapper-based GBPSO-SVM algorithm is applied. GBPSO uses the prediction accuracy of an SVM to select the best subset of genes. The SVM was chosen as the wrapped classifier because of its ability to give sensible classification accuracy for microarray data regardless of the number of samples, which is a useful property given the low sample-to-gene ratio of such datasets. GBPSO starts from a number of randomly selected genes and then, in each iteration, searches for a better subset of genes. The SVM classifier assesses the performance of each candidate subset using 10-fold cross-validation, so each new candidate subset is generally better than the previous one. The original GBPSO package can be retrieved from [65]. In the current work, a polynomial kernel was utilized for the SVM because it gave the highest classification accuracy when applied to high dimensional datasets.
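Conceptually, the fitness that GBPSO maximizes is the 10-fold cross-validated accuracy of a polynomial-kernel SVM trained on the candidate gene subset. The sketch below uses MATLAB's fitcsvm as a stand-in for the Weka SMO classifier actually wrapped by the tool; the function name subsetFitness and the polynomial order of 3 are assumptions.

    % Sketch of the wrapper fitness evaluated by GBPSO: 10-fold CV accuracy of a
    % polynomial-kernel SVM on the currently selected gene subset.
    % mask : logical vector over genes (a GBPSO particle position)
    % X    : samples-by-genes matrix; y : class labels
    function acc = subsetFitness(mask, X, y)
        mdl = fitcsvm(X(:, mask), y, ...
            'KernelFunction', 'polynomial', 'PolynomialOrder', 3, ...
            'Standardize', true);
        cvmdl = crossval(mdl, 'KFold', 10);      % 10-fold cross-validation
        acc   = 1 - kfoldLoss(cvmdl);            % classification accuracy as fitness
    end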

Development of HDG-select application

The HDG-select application was created using a graphical user interface (GUI) in MATLAB. The developed application has a user-friendly interface that is easy to understand and use for gene selection and classification in both high dimensional and ordinary datasets. The first two steps of gene selection were written in MATLAB, while the third step was written in Java, taking advantage of the Weka packages and Java interfacing [66]; Java code is used to interface the Weka functionality with MATLAB. The application was designed to handle errors and control the user's inputs so that each step is performed correctly, which was achieved by using message handlers throughout the application. The HDG-select application is freely available and can be downloaded from GitHub (https://github.com/Shilan-Jaff/HDG_select). Fig 3 shows the interface of the tool, which is composed of four major sections, described below:

Fig 3. The main interface of the developed HDG-select application used for gene selection and classification in high dimensional datasets.


  1. Input (Dataset import): The user can import datasets in two formats, SOFT and CSV, as shown in part (a) of Fig 3. Most of the curated datasets available online are in CSV format, which is the common format for machine learning applications. The other type is the SOFT file, the format of the gene expression profiling datasets made publicly available in the GEO NCBI database [48]; such datasets are usually non-curated and high dimensional. Before importing CSV files, users must make sure that the last column contains the class label, while for SOFT files the range of each sample class must be supplied manually to the application, because sample class information is not contained in the SOFT files themselves and must be obtained from the dataset description in the NCBI database.

  2. Preprocessing: In this step, a reduction process can be applied to high dimensional datasets, or to datasets that have not yet been reduced/curated by researchers, so that the dataset can be handled easily in the subsequent steps. As shown in part (b) of Fig 3, this section has two options that allow the user to choose between reducing the dataset or leaving it as it is. For SOFT format datasets, however, gene reduction is obligatory; otherwise the process would be computationally costly and would overload memory. At the end of this step, users can save the reduced dataset for future use if required.

  3. Filtration and classification: Once the user finalizes the preprocessing step, the dataset proceeds to the next stage of filtering the most significant genes, assessing the classification, and saving the filtered dataset, as shown in part (c) of Fig 3. It is worth mentioning that the application lets the user decide on the number of genes to be filtered, so one can choose the optimum number of filtered genes based on preference and understanding of the dataset. The default number of filtered genes is set to 200.

  4. Selection and classification using the Filter-GBPSO-SVM algorithm: The final and most important action is to select the genes and apply the SVM classifier to the selected genes, as shown in part (d) of Fig 3. This is applied to the results of the previous steps and is performed on each dataset generated by the filtration step. Here, the user can see how many genes are selected by each approach and can save them.

Fig 7. Accuracy and precision of the classification of datasets after final gene selection using the proposed HDG-select application.


The HDG-select application uses the following equations to determine the accuracy and precision of the classification:

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FN + FP}$ (5)
$\text{Precision} = \dfrac{TP}{TP + FP}$ (6)

where TP, TN, FN, and FP are the numbers of true positive, true negative, false negative, and false positive samples, respectively.
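A minimal MATLAB sketch of Eqs (5) and (6), treating class 1 as the positive class (an assumption for illustration); the function name evalMetrics is hypothetical.

    % Accuracy and precision (Eqs 5-6) from true vs. predicted 0/1 labels.
    function [acc, prec] = evalMetrics(yTrue, yPred)
        TP = sum(yPred == 1 & yTrue == 1);
        TN = sum(yPred == 0 & yTrue == 0);
        FP = sum(yPred == 1 & yTrue == 0);
        FN = sum(yPred == 0 & yTrue == 1);
        acc  = (TP + TN) / (TP + TN + FP + FN);   % Eq (5)
        prec = TP / (TP + FP);                    % Eq (6)
    end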

Results and discussion

To demonstrate the functionality and robustness of the proposed HDG-select tool, a step-by-step analysis is presented. As mentioned earlier, users can choose not to perform dimensionality reduction on the pre-curated CSV datasets. However, the preprocessing step (see Fig 3B) is mandatory for the SOFT GEO datasets, since these are not pre-curated, and applying them directly to gene selection leads to a heavy computational burden and low classification accuracy. Consequently, the SOFT GEO files were preprocessed and the dataset dimensionality was markedly reduced. For example, upon application of the preprocessing step, the genes in the autism and influenza A datasets were reduced from 54613 to 14530 and from 22283 to 17939, respectively. With the help of HDG-select, the impact of the number of filtered genes on the SVM classification accuracy in the SOFT GEO datasets was then investigated, as shown in Fig 4. Results showed that limiting the filtered genes to fewer than 150 negatively affected the classification accuracy, except for the influenza A dataset, which showed stable performance regardless of the number of genes. A dataset with a steady curve indicates a good correlation among the genes, so extra filtration does not further improve the classification accuracy. Nevertheless, filtering the genes down to the lowest possible number saves memory and speeds up execution in the final stage of gene selection and classification.

Fig 4. Effect of the number of filtered genes on the SVM classification accuracy in the first step of gene selection in high dimensional datasets.


Fig 5A and 5B show the SVM accuracy of gene selection in the CSV datasets and the SOFT GEO datasets, respectively, after the application of the TT filter and the TT filter-GBPSO algorithm using the proposed HDG-select. Further results from the application of different filters and GBPSO are given in S1 and S2 Tables. The results show that the accuracy of the SVM is greatly improved when the statistical filters are used in combination with GBPSO. Applying the filters improved the classification accuracy compared with the original dataset, and using the GBPSO algorithm in combination with the filters improved performance further. For instance, the classification accuracy for leukemia and colon cancer reached 100% when the combined TT-WRS filter with GBPSO was used, surpassing the results obtained by the Nested-GA algorithm [16]. Notably, the SVM accuracy for the brain metastatic BC and influenza A datasets remained 100% in both gene selection steps (filters and GBPSO-SVM), indicating that the dimensionality reduction of HDG-select retains the most important genes, which remain useful for the subsequent steps of the gene selection process.

Fig 5.


A representative comparison of SVM accuracy of gene selection in the CSV datasets (a) and soft GEO datasets (b) after the application of TT filter and TT filter-GBPSO algorithm using the HDG-select application.

The results show that the best approach for increasing the classification accuracy of gene selection in high dimensional datasets is to utilize different filters in combination with the GBPSO-SVM algorithm, as shown in Fig 6. The combination of the TT-WRS filter with GBPSO improved the SVM accuracy in eight of the eleven datasets. It is worth mentioning that the HDG-select tool is valuable even when the accuracy is not much improved by the selection process, because it can select a small subset of the attributed genes while maintaining the original accuracy and reducing the computational burden.

Fig 6. Comparison of the SVM accuracy of the genes selected by the different filter-GBPSO-SVM approaches using the proposed HDG-select application.


Fig 7 shows the accuracy and precision achieved in gene selection from the various high dimensional datasets using the proposed HDG-select application. The accuracy and precision values range from 90% to 100% across the datasets. For instance, the precision of gene selection in the Prostate, Ovarian, Inflammatory BC and Influenza A datasets reached 100%, which is close to their classification accuracy. The close agreement between accuracy and precision indicates that the proposed HDG-select performed very well on various datasets, efficiently selecting the most attributed genes and thereby providing competitive classification accuracy.

Furthermore, a comparison of the results obtained by the HDG-select application with those reported in the literature suggests that the filter-GBPSO-SVM algorithm is more efficient than the PCC-BPSO-SVM and PCC-GA-SVM algorithms [11] for the selection and classification of attributed genes in high dimensional datasets, as shown in Fig 8 and S3 Table.

Fig 8. Comparison of the classification accuracy of gene selection by different methods.


To show the effectiveness of the proposed HDG-select application in the final step of selecting the biomarker/attributed genes, heatmap analyses were plotted. Figs 9 and 10 show the heatmaps of seven biomarker genes versus the samples for the breast cancer and influenza A datasets, respectively. The heatmaps show that the biomarker genes have high discrimination ability between the control and non-control samples; for instance, genes number 1 and 3 have the highest discrimination ability in the breast cancer and influenza A datasets, respectively.

Fig 9. The heatmap of seven selected biomarker genes of breast cancer using the proposed HDG-select application.


Fig 10. The heatmap of seven selected biomarker genes of influenza A using the proposed HDG-select application.


Table 3 presents a detailed comparison of the proposed application with those reported in the literature. The proposed HDG-select outperformed the other tools in terms of overall performance, accessibility and functionality. The most competitive tool relative to HDG-select is ArrayMining [42]; however, that tool accepts dataset files in CSV format only, while HDG-select can also perform dataset curation and dimensionality reduction on SOFT GEO datasets. Furthermore, with HDG-select, multiple gene selection and classification tasks can be performed simultaneously and the selected genes with their expression values can be downloaded in CSV format, whereas ArrayMining [42] performs one task at a time and the selected genes are downloadable only in text format. Table 4 compares HDG-select and ArrayMining in terms of accuracy and precision for two representative CSV datasets previously analyzed by ArrayMining.

Table 3. Comparison of the proposed HDG-select tool with those reported in literature for gene selection and classification.

Application name | Language/package | Gene selection/classification | Algorithm | Accessibility (online/offline) | GUI interface | Operating system | Dataset format | User-friendliness
SVM Classifier [43] | Java | No/Yes | SVM | No (online) | Yes | No restriction | Not known | Low
GeneSrF and varSelRF [41] | R, Python | Yes/No | Random forest | No (online) | No | Linux, Unix for R package | CSV | Low
ArrayMining [42] | R, C++ and a PHP interface | Yes/Yes | Filters, classification, clustering | Yes (online) | Yes | No restriction | CSV | Medium
HDG-select (proposed) | Weka, Java and MATLAB | Yes/Yes | Two filters, their combination, GBPSO wrapper and SVM classifier | Yes (offline, no internet required) | Yes | No restriction | CSV and (.soft) GEO dataset | High

Table 4. Comparison of the proposed HDG-select tool with ArrayMining tool in terms of accuracy and precision of gene selection and classification in high dimensional datasets.

Dataset Application name Algorithm (# selected genes) SVM classification accuracy (%) Precision (%)
Colon ArrayMining [42] Filter (80) 80.7±13 86.6
HDG-select (Proposed) Filter-wrapper (30) 98.3±1.7 96.6
Prostate ArrayMining [42] Filter (50) 82.2±13 86.6
HDG-select (Proposed) Filter-wrapper (30) 100±0 100

Conclusions

A novel GUI-based stand-alone application, named HDG-select, was developed to select and classify the attributed genes in high dimensional datasets effectively. The application was validated on 11 datasets, including CSV and GEO SOFT file formats, and was found to perform well on most of the high dimensional datasets. The proposed HDG-select tool uses an efficient combined filter-GBPSO-SVM algorithm. It was observed that the best approach for increasing gene selection efficiency in high dimensional data is to utilize the mixed filter-GBPSO-SVM algorithm. The proposed HDG-select outperformed the other tools in terms of overall performance, accessibility, and functionality.

Supporting information

S1 Dataset. The microarray dataset of leukemia cancer in csv format.

(CSV)

S2 Dataset. The microarray dataset of colon cancer in csv format.

(CSV)

S3 Dataset. The microarray dataset of prostate cancer in csv format.

(CSV)

S4 Dataset. The microarray dataset of breast cancer in csv format.

(CSV)

S5 Dataset. The microarray dataset of central nervous system cancer in csv format.

(CSV)

S6 Dataset. The microarray dataset of ovarian cancer in csv format.

(CSV)

S1 Table. The accuracy percentage of selecting attributed genes using Filter-SVM and Filter-GBPSO-SVM approach compared to that of the original dataset.

(DOCX)

S2 Table. The accuracy percentage of Filter-SVM and Filter-GBPSO-SVM algorithm upon various soft files of high dimensional datasets.

(DOCX)

S3 Table. Comparison of the accuracy result of the proposed HDG-select application to those reported in literature.

(DOCX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

RH received a financial support from the Fundamental Research Grant Scheme (FRGS), Ministry of Education and Universiti Teknologi Malaysia under Vote No: RJ130000.7851.5F037.

References

1. Govindarajan R, Duraiyan J, Kaliyappan K, Palanisamy M. Microarray and its applications. Journal of Pharmacy & Bioallied Sciences. 2012;4(Suppl 2):S310–S2. doi: 10.4103/0975-7406.100283
2. Cosma G, Brown D, Archer M, Khan M, Pockley AG. A survey on computational intelligence approaches for predictive modeling in prostate cancer. Expert Systems with Applications. 2017;70:1–19. doi: 10.1016/j.eswa.2016.11.006
3. Singh RK, Sivabalakrishnan M. Feature selection of gene expression data for cancer classification: a review. Procedia Computer Science. 2015;50:52–7. doi: 10.1016/j.procs.2015.04.060
4. Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Advances in Bioinformatics. 2015;2015. doi: 10.1155/2015/198363
5. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17. doi: 10.1093/bioinformatics/btm344
6. Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis. 2020;143:106839.
7. Bolón-Canedo V, Sánchez-Marono N, Alonso-Betanzos A, Benítez JM, Herrera F. A review of microarray datasets and applied feature selection methods. Information Sciences. 2014;282:111–35. doi: 10.1016/j.ins.2014.05.042
8. Annavarapu CSR, Dara S, Banka H. Cancer microarray data feature selection using multi-objective binary particle swarm optimization algorithm. EXCLI Journal. 2016;15:460. doi: 10.17179/excli2016-481
9. Rejani Y, Selvi ST. Early detection of breast cancer using SVM classifier technique. arXiv preprint arXiv:0912.2314. 2009.
10. Guyon I, Weston J, Barnhill S, Vapnik V. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning. 2002;46(1):389–422. doi: 10.1023/A:1012487302797
11. Hameed SS, Muhammad FF, Hassan R, Saeed F. Gene Selection and Classification in Microarray Datasets using a Hybrid Approach of PCC-BPSO/GA with Multi Classifiers. JCS. 2018;14(6):868–80.
12. Hameed SS, Hassan R, Muhammad FF. Selection and classification of gene expression in autism disorder: Use of a combination of statistical filters and a GBPSO-SVM algorithm. PLoS ONE. 2017;12(11). doi: 10.1371/journal.pone.0187371
13. Han J, Pei J, Kamber M. Data mining: concepts and techniques. Elsevier; 2011.
14. Thaher T, Heidari AA, Mafarja M, Dong JS, Mirjalili S. Binary Harris Hawks Optimizer for High-Dimensional, Low Sample Size Feature Selection. In: Evolutionary Machine Learning Techniques. Springer; 2020. p. 251–72.
15. Algamal ZY, Lee MH. A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Advances in Data Analysis and Classification. 2019;13(3):753–71.
16. Sayed S, Nassef M, Badr A, Farag I. A nested genetic algorithm for feature selection in high-dimensional cancer microarray datasets. Expert Systems with Applications. 2019;121:233–43.
17. Yan C, Ma J, Luo H, Patel A. Hybrid binary coral reefs optimization algorithm with simulated annealing for feature selection in high-dimensional biomedical datasets. Chemometrics and Intelligent Laboratory Systems. 2019;184:102–11.
18. Kim ARP. Combination of Ensembles of Regularized Regression Models with Resampling-Based Lasso Feature Selection in High Dimensional Data. Mathematics. 2020;8(1):110.
19. Song X-f, Zhang Y, Guo Y-n, Sun X-y, Wang Y-l. Variable-size Cooperative Coevolutionary Particle Swarm Optimization for Feature Selection on High-dimensional Data. IEEE Transactions on Evolutionary Computation. 2020.
20. Chen W, Xu Y, Yu Z, Cao W, Chen CP, Han G. Hybrid Dimensionality Reduction Forest With Pruning for High-Dimensional Data Classification. IEEE Access. 2020;8:40138–50.
21. Karizaki AA, Tavassoli M, editors. A novel hybrid feature selection based on ReliefF and binary dragonfly for high dimensional datasets. 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE); 2019: IEEE.
22. Raman MG, Nivethitha S, Kannan K, Sriram VS. A hybrid approach using rough set theory and hypergraph for feature selection on high-dimensional medical datasets. Soft Computing. 2019;23(23):12655–72.
23. Chen L-F, Su C-T, Chen K-H, Wang P-C. Particle swarm optimization for feature selection with application in obstructive sleep apnea diagnosis. Neural Computing and Applications. 2012;21(8):2087–96. doi: 10.1007/s00521-011-0632-4
24. Alba E, Garcia-Nieto J, Jourdan L, Talbi E-G, editors. Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. IEEE Congress on Evolutionary Computation (CEC 2007); 2007: IEEE.
25. Kennedy J, Eberhart RC, editors. A discrete binary version of the particle swarm algorithm. 1997 IEEE International Conference on Systems, Man, and Cybernetics: Computational Cybernetics and Simulation; 1997: IEEE.
26. Zhang Y, Wang S, Phillips P, Ji G. Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowledge-Based Systems. 2014;64:22–31.
27. Moraglio A, Di Chio C, Togelius J, Poli R. Geometric particle swarm optimization. Journal of Artificial Evolution and Applications. 2008;2008.
28. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–97.
29. Ardjani F, Sadouni K, Benyettou M, editors. Optimization of SVM MultiClass by Particle Swarm (PSO-SVM). 2010 2nd International Workshop on Database Technology and Applications; 27–28 Nov 2010.
30. Jirapech-Umpai T, Aitken S. Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinformatics. 2005;6(1):148. doi: 10.1186/1471-2105-6-148
31. Hassanien AE, Al-Shammari ET, Ghali NI. Computational intelligence techniques in bioinformatics. Computational Biology and Chemistry. 2013;47:37–47. doi: 10.1016/j.compbiolchem.2013.04.007
32. Huerta EB, Duval B, Hao J-K, editors. A hybrid GA/SVM approach for gene selection and classification of microarray data. Workshops on Applications of Evolutionary Computation; 2006: Springer.
33. Qian R, Wu Y, Duan X, Kong G, Long H. SVM Multi-Classification Optimization Research based on Multi-Chromosome Genetic Algorithm. International Journal of Performability Engineering. 2018;14(4).
34. Barash E, Sal-Man N, Sabato S, Ziv-Ukelson M. BacPaCS – Bacterial Pathogenicity Classification via Sparse-SVM. Bioinformatics. 2018.
35. Latkowski T, Osowski S, editors. Developing Gene Classifier System for Autism Recognition. International Work-Conference on Artificial Neural Networks; 2015: Springer.
36. García-Nieto J, Alba E, Jourdan L, Talbi E. Sensitivity and specificity based multiobjective approach for feature selection: Application to cancer diagnosis. Information Processing Letters. 2009;109(16):887–96.
37. Duez M, Giraud M, Herbert R, Rocher T, Salson M, Thonier F. Vidjil: a web platform for analysis of high-throughput repertoire sequencing. PLoS ONE. 2016;11(11):e0166126. doi: 10.1371/journal.pone.0166126
38. Kaya H, Hasman H, Larsen J, Stegger M, Johannesen TB, Allesøe RL, et al. SCCmecFinder, a web-based tool for typing of staphylococcal cassette chromosome mec in Staphylococcus aureus using whole-genome sequence data. mSphere. 2018;3(1). doi: 10.1128/mSphere.00612-17
39. Bruyneel AA, Colas AR, Karakikes I, Mercola M. AlleleProfileR: A versatile tool to identify and profile sequence variants in edited genomes. PLoS ONE. 2019;14(12):e0226694. doi: 10.1371/journal.pone.0226694
40. Tamazian G, Dobrynin P, Krasheninnikova K, Komissarov A, Koepfli K-P, O'Brien SJ. Chromosomer: a reference-based genome arrangement tool for producing draft chromosome sequences. GigaScience. 2016;5(1):s13742-016-0141-6. doi: 10.1186/s13742-016-0141-6
41. Diaz-Uriarte R. GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics. 2007;8(1):328. doi: 10.1186/1471-2105-8-328
42. Glaab E, Garibaldi JM, Krasnogor N. ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization. BMC Bioinformatics. 2009;10(1):1–7. doi: 10.1186/1471-2105-10-358
43. Pirooznia M, Deng Y. SVM Classifier – a comprehensive java interface for support vector machine classification of microarray data. BMC Bioinformatics. 2006;7(S4):S25.
44. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7. doi: 10.1126/science.286.5439.531
45. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences. 1999;96(12):6745–50. doi: 10.1073/pnas.96.12.6745
46. Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7(1):3. doi: 10.1186/1471-2105-7-3
47. Zhu Z, Ong YS, Dash M. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognition. 2007;40:3236–48. doi: 10.1016/j.patcog.2007.02.007
48. Autistic children and their father's age: peripheral blood lymphocytes [Internet]. www.ncbi.nlm.nih.gov; 2011. Available from: http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4431.
49. Alter MD, Kharkar R, Ramsey KE, Craig DW, Melmed RD, Grebe TA, et al. Autism and increased paternal age related changes in global levels of gene expression regulation. PLoS ONE. 2011;6(2):e16715. doi: 10.1371/journal.pone.0016715
50. El-Fishawy P, State MW. The genetics of autism: key issues, recent findings, and clinical implications. Psychiatric Clinics of North America. 2010;33(1):83–105.
51. Latkowski T, Osowski S. Computerized system for recognition of autism on the basis of gene expression microarray data. Computers in Biology and Medicine. 2015;56:82–8. doi: 10.1016/j.compbiomed.2014.11.004
52. Latkowski T, Osowski S. Data mining for feature selection in gene expression autism data. Expert Systems with Applications. 2015;42(2):864–72. doi: 10.1016/j.eswa.2014.08.043
53. Lai C, Reinders MJ, van't Veer LJ, Wessels LF. A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC Bioinformatics. 2006;7(1):235. doi: 10.1186/1471-2105-7-235
54. Huertas C, Juárez-Ramírez R, editors. Filter feature selection performance comparison in high-dimensional data: A theoretical and empirical analysis of most popular algorithms. 17th International Conference on Information Fusion (FUSION); 2014: IEEE.
55. Haury A-C, Gestraud P, Vert J-P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE. 2011;6(12):e28210. doi: 10.1371/journal.pone.0028210
56. Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C, et al. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012;9(4):1106–19. doi: 10.1109/TCBB.2012.33
57. Li S, Wu X, Tan M. Gene selection using hybrid particle swarm optimization and genetic algorithm. Soft Computing-A Fusion of Foundations, Methodologies and Applications. 2008;12(11):1039–48.
58. Saha S, Seal DB, Ghosh A, Dey KN. A novel gene ranking method using Wilcoxon rank sum test and genetic algorithm. International Journal of Bioinformatics Research and Applications. 2016;12(3):263–79.
59. Bridge PD, Sawilowsky SS. Increasing physicians' awareness of the impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research. Journal of Clinical Epidemiology. 1999;52(3):229–35. doi: 10.1016/s0895-4356(98)00168-1
60. Ruxton GD. The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test. Behavioral Ecology. 2006;17(4):688–90.
61. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bulletin. 1945;1(6):80–3.
62. Wild C, Seber G. The Wilcoxon rank-sum test. Chapter; 2011.
63. Khoshgoftaar T, Dittman D, Wald R, Fazelpour A, editors. First order statistics based feature selection: A diverse and powerful family of feature selection techniques. 11th International Conference on Machine Learning and Applications (ICMLA); 2012: IEEE.
64. Sprent P, Smeeton NC. Applied nonparametric statistical methods. CRC Press; 2016.
65. Geometric Particle Swarm Optimisation [Internet]. 2016 [cited 28/20/2020]. Available from: https://github.com/sebastian-luna-valero/PSOSearch/.
66. wekalab [Internet]. 2016 [cited 28/20/2020]. Available from: https://github.com/NicholasMcCarthy/wekalab.

Decision Letter 0

Bryan C Daniels

5 Oct 2020

PONE-D-20-11244

A novel GUI based stand-alone application using Filters-GBPSO-SVM algorithm for the selection and classification of attributed genes in high dimensional datasets

PLOS ONE

Dear Dr. Hameed,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Reviewers identified multiple issues with the clarity of the manuscript, particularly with regard to differences of the proposed approach compared to existing approaches.  A major revision may be able to address these issues.  In particular, the revision should clearly articulate the scientific rationale for the submitted work and clearly outline how it differs from past work.

Please submit your revised manuscript by Nov 19 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Bryan C Daniels

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 6 in your text; if accepted, production will need this reference to link the reader to the Table.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dataset reduction == feature engineering/selection

"The SVM classifier assesses the performance of each candidate subset using 10 folds cross-validation" type: should be 10 fold cross validation.

Check grammar on this sentence: "The software application was established by means of interfacing Weka with MATLAB features using graphical user interface (GUI) in MATLAB. It"

Check grammar: "which is usually used by data analytics tool."

Check grammar: "A novel stand-alone application was successfully established"

For Figs 9 and 10, please change color scheme to be colorblind friendly (not red/green).

Where is Table 5? Jumps from 4 to 6.

Tables 3, 4, and 6 would be easier to interpret as figures. If this is changed, please include the tables as supplemental information.

Reviewer #2: The language is unclear and needs to be checked as some parts are difficult to follow or even understand.

Moreover, in the introduction I do not clearly understand why your algorithm is novel in comparison to the approaches currently used in the literature, or why it is so different. I suggest reorganising the introduction to describe the current approaches more clearly and then state the differences from the proposed approach. In particular, the part from line 84 to 103 was not clear to me. Also, it is not clear why you decided to couple GBPSO with SVM and what the advantage is; if GBPSO selects the genes, why do you have to use a machine learning model to classify the genes already selected?

In lines 114-115 there is a repetition of a concept already stated in lines 111-112. Moreover, you developed a tool which seems to have no name (or the name is not clearly stated); you frequently call it "application" or a similar term. I think the manuscript will be clearer if you give the tool a name and use it throughout the text. Also, since it has no name, I was unable to ascertain whether the link in the text points to the GitHub page for downloading it. If it does, then you should provide on the GitHub page installation instructions, a manual, and an example database that can be used for testing.

Figures 1 and 2 can be improved to be clearer and easier to read: why do the boxes have different shapes? Why are they purple? You may try to reduce the text in the flowcharts, explain more in the captions, and use explanatory shapes and colors to help the reader.

The information in Tables 1 and 2 repeats the information given in the text from lines 150 to 204. I think this part could be made more fluent rather than a long list of data; perhaps keeping the tables and removing most of the numbers from the written part would improve clarity. Also, some of the descriptions of the studies from which the databases are taken may be too detailed, giving the reader details that are not needed to understand the proposed algorithm. I suggest lightening this part to help the reader and improve clarity.

The statement in lines 219-220, "since the mean values of gene expression are affected by the high variance", is a repetition of what is already stated in lines 216-217, "Those with high variance affect the mean and median values of the expression of individual genes". Moreover, in line 218 you refer to a "median criterion" which has not been introduced before. I have observed that you frequently use terms that may be self-explanatory to you but, as they are neither introduced nor explained, cannot be clearly understood. Another example is the term "feature", first introduced in line 215, where it is not clear what it refers to. To correctly introduce these terms, you could phrase them as, e.g.: "to simplify the initial selection, similar genes should be removed. For this purpose authors usually use two criteria: the mean and the median criterion. The first implies (...); the second (...)". Also, please clarify the meaning of "similar genes".

The structure of the statements in lines 221-224 can be simplified to be clearer. Also, in this statement you introduce a new approach without giving any information: neither why it is new, nor a short description, nor its strong points and drawbacks. The reader here has to decide either to take your word for it without understanding what is going on, or to read the other article to understand.

The whole "Statistical filters for gene selection" paragraph needs to be fully revised. At line 239 you introduced the term weights with no description or introduction of the usage, meaning, scope.

The statement in lines 250-251 is not formally correct: it is true that the t-test considers equal mean and variance, but between the two groups under analysis; this part is missing in the text and the reader may get lost. Also, it is not true that the t in the t-test is equal to 0 or 1, as it can assume values from 0 to 1 (line 257).

Conversely, I think that describing how the t-test works is not important for explaining the algorithm you have developed. I consider it more important to describe the test briefly, the differences between a parametric and a non-parametric test, why you chose to use a non-parametric one, and how you implement it in your application, also adding some literature to support your decision.

Equations 2 and 3 need to be described in more detail, to help the reader understand how they work and why you decided to use them.

The paragraph "Selection and classification using a wrapper-based GBPSO-SVM algorithm" also needs to be revised: first you have to describe GBPSO, SVM, and what they do (which, in reality, should have been done previously in the introduction), and then describe why you have decided to couple them and how.

Are Figures 3 to 6 portions of Figure 7? If so, please use only Figure 7, which is much more informative and gives an idea of the interface the user will face. You might consider using letters in Figure 7 to help the reader follow the description of the procedure.

The text from line 400 to line 442 needs its own section, as it is no longer a description of the GUI-based application; here you describe the results of the tests performed on your algorithm and application. First of all you have to describe what you are comparing and how you calculated the accuracy, and then present the results. Moreover, the "results and discussion" section is missing from your paper, and it is there that you should present the results of the tests and comment on them.

Your application needs to be compared to other available tools or approaches, and you need to briefly introduce and describe them in the introduction. This part is completely absent.

What is the purpose of comparing biomarkers from different studies? Is there any meaning in comparing biomarkers of influenza A and breast cancer? Why are you making these observations? Are they important for understanding your application and its performance?

In the conclusion, you stated that "A novel stand-alone application was successfully established which can be used to select and classify the most attributed genes in high dimensional data rapidly and efficiently. The application was validated on 11 datasets and it was found to perform well on most of high dimensional datasets." However, you have provided no information about the time needed to perform the analysis with your application or with the other tools used for comparison. Also, you can state that your tool performs well only in comparison to other approaches or tools; having good accuracy is not enough. Consider that accuracy is not the only way to compare performance; it has its own meaning and drawbacks, which you have not considered. An analysis can return 100% accuracy but have poor precision compared to another, for example.

Lastly, please revise your manuscript keeping in mind the structure and the meaning of each part of an article:

-introduction: here you introduce the problem and your solution, also giving the reader all the information and notions needed to fully understand the paper. These need only be introduced and do not have to be described in full detail. You also need to say why your solution is important.

-materials and methods: here you have to describe your work in detail: the algorithm, the application, the datasets used to test it, the other tools and approaches used for comparison, and the statistics used to summarize the performance of the tools/approaches.

-results: here you present the results of the tests and the comparisons performed.

-discussion: here you discuss the results. Results and discussion may sometimes be merged into a combined "results and discussion" section if this makes the paper clearer and easier to understand.

-conclusion: here, and nowhere else in the paper, you draw conclusions based only on the results you have presented, no more and no less. If you carried out tests that are not included in the manuscript, you have to mention them in the text, noting that you will not show the data, and then you can discuss them and draw some conclusions about them.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jan 28;16(1):e0246039. doi: 10.1371/journal.pone.0246039.r002

Author response to Decision Letter 0


4 Nov 2020

Subject: Response to reviewer’s comments

Manuscript Number: PONE-D-20-11244

Manuscript Title: A novel GUI based stand-alone application using Filters-GBPSO-SVM algorithm for the selection and classification of attributed genes in high dimensional datasets

Dear Professor Bryan C. Daniels,

Academic Editor

PLOS ONE

We acknowledge the constructive comments and suggestions received from the reviewers and the editor to improve the contents of our manuscript. We are grateful that the comments have helped us to strengthen our paper. The manuscript was carefully revised, the required amendments were made, and the revised sections were highlighted throughout the manuscript. Furthermore, the manuscript title was modified to appropriately reflect the presented work.

Please find below our response to the reviewers’ comments. We hope the revised manuscript is now acceptable for publication in your journal.

Reviewer #1:

- Dataset reduction == feature engineering/selection.

Response:

By dataset reduction, we meant the dimensionality reduction of the dataset, which was performed by removing the genes whose expression values are close to each other across the samples. This term was revised throughout the manuscript.

- "The SVM classifier assesses the performance of each candidate subset using 10 folds cross-validation" type: should be 10 fold cross validation.

Response:

This change has been made throughout the manuscript.

- Check grammar on this sentence: "The software application was established by means of interfacing Weka with MATLAB features using graphical user interface (GUI) in MATLAB. It"

- Check grammar: "which is usually used by data analytics tool."

- Check grammar: "A novel stand-alone application was successfully established"

Response:

The grammar of these sentences was double-checked and a careful language proofreading was performed on the whole manuscript.

- For Figs 9 and 10, please change color scheme to be colorblind friendly (not red/green).

Response:

The color scheme of the figures was changed to a colorblind-friendly one and further discussion was added.

- Where is Table 5? Jumps from 4 to 6.

Response:

We apologize for the typo; Table 6 should have been Table 5. This was corrected in the revised version.

- Tables 3, 4, and 6 would be easier to interpret as figures. If this is changed, please include the tables as supplemental information.

Response:

The tables have been converted into figures, and the original tables were moved to the supporting information.

Reviewer #2:

- The language is unclear and needs to be checked as some parts are difficult to follow or even understand. Moreover, in the introduction I do not clearly understand why your algorithm is novel in comparison to the approaches currently used in the literature, or why it is so different. I suggest reorganising the introduction to describe the current approaches more clearly and then state the differences from the proposed approach. In particular, the part from line 84 to 103 was not clear to me. Also, it is not clear why you decided to couple GBPSO with SVM and what the advantage is; if GBPSO selects the genes, why do you have to use a machine learning model to classify the genes already selected?

Response:

A careful language proofreading was performed on the whole manuscript. The novelty of the proposed application and the differences between the current algorithm and those reported in the literature are now given in the introduction section. Taking these comments into consideration, the introduction section has been revised accordingly. These changes can be found in lines 82-141.

- In lines 114-115 there is a repetition of a concept already stated in lines 111-112. Moreover, you developed a tool which seems to have no name (or the name is not clearly stated); you frequently call it "application" or a similar term. I think the manuscript will be clearer if you give the tool a name and use it throughout the text. Also, since it has no name, I was unable to ascertain whether the link in the text points to the GitHub page for downloading it. If it does, then you should provide on the GitHub page installation instructions, a manual, and an example database that can be used for testing.

Response:

The repetitive concepts were merged and the tool was given the name HDG-select, which is derived from "high dimensional gene selection". A GitHub link for the application is provided, together with a manual and a dataset for testing.

- Figures 1 and 2 can be improved to be clearer and easier to read: why do the boxes have different shapes? Why are they purple? You may try to reduce the text in the flowcharts, explain more in the captions, and use explanatory shapes and colors to help the reader. The information in Tables 1 and 2 repeats the information given in the text from lines 150 to 204. I think this part could be made more fluent rather than a long list of data; perhaps keeping the tables and removing most of the numbers from the written part would improve clarity. Also, some of the descriptions of the studies from which the databases are taken may be too detailed, giving the reader details that are not needed to understand the proposed algorithm. I suggest lightening this part to help the reader and improve clarity.

Response:

Figures 1 and 2 have been revised accordingly: their color was changed to black and their contents were modified to be better understood by readers. Following standard flowchart conventions, the start and end boxes are oval, the boxes containing input or output commands are parallelograms, and those containing processing commands are rectangular. The tables were retained and most numbers were removed from the written part to improve clarity. The text in the flowcharts was reduced, with more explanation given in the captions.

- The statement in lines 219-220, "since the mean values of gene expression are affected by the high variance", is a repetition of what is already stated in lines 216-217, "Those with high variance affect the mean and median values of the expression of individual genes". Moreover, in line 218 you refer to a "median criterion" which has not been introduced before. I have observed that you frequently use terms that may be self-explanatory to you but, as they are neither introduced nor explained, cannot be clearly understood. Another example is the term "feature", first introduced in line 215, where it is not clear what it refers to. To correctly introduce these terms, you could phrase them as, e.g.: "to simplify the initial selection, similar genes should be removed. For this purpose authors usually use two criteria: the mean and the median criterion. The first implies (...); the second (...)". Also, please clarify the meaning of "similar genes".

Response:

The repetitive statements were merged in the revised version. The term "feature" refers to a gene; hence it was replaced by "gene" throughout the manuscript. Details on the mean and median criteria, along with other required revisions, are given in the section on dimensionality reduction using the mean and median ratio.
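
As a rough illustration of a mean/median-style screen, one plausible reading of this criterion is sketched below in Python; the exact rule, threshold, and data layout used by HDG-select are those defined in the revised manuscript, and everything in the snippet is a synthetic placeholder rather than the authors' MATLAB code.

    import numpy as np

    # Synthetic expression matrix: 5000 genes x 80 samples.
    rng = np.random.default_rng(2)
    expr = rng.lognormal(mean=2.0, sigma=0.5, size=(5000, 80))

    # Ratio of mean to median expression per gene; genes whose ratio stays
    # close to 1 are treated as uninformative and dropped (the threshold is
    # a placeholder, not the authors' value).
    ratio = expr.mean(axis=1) / np.median(expr, axis=1)
    keep = np.abs(ratio - 1.0) > 0.05
    reduced = expr[keep, :]
    print(f"kept {reduced.shape[0]} of {expr.shape[0]} genes")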

- The structure of the statements in lines 221-224 can be simplified to be clearer. Also, in this statement you introduce a new approach without giving any information: neither why it is new, nor a short description, nor its strong points and drawbacks. The reader here has to decide either to take your word for it without understanding what is going on, or to read the other article to understand.

Response:

The statements in lines 221-224 have been revised accordingly and a detailed explanation is given in the section on dimensionality reduction using the mean and median ratio.

- The whole "Statistical filters for gene selection" paragraph needs to be fully revised. At line 239 you introduced the term weights with no description or introduction of the usage, meaning, scope. The statement in line 250-251 is not formally correct: it is true that t-test consider equal mean and variance, but between the two groups in analysis; in the text this part is missing and the reader may get lost. Also, it is not true that the t in the t-test is equal to 0 or 1, as it can assume value from 0 to 1 (line 257). Conversely, I think that the description of the functioning of the t-test is not important for explaining the algorithm you have developed. I consider more important to describe shortly the test, the differences between a parametric and non-parametric test and why you chose to use a non-parametric one and how you implement it in your application, adding also some literature to support your decision. The equations 2 and 3 needs to be described more in detail, to help the reader understand their functioning and why you have decided to use them.

Response:

The description of the t-test and its application has been revised. The statement in lines 250-251 has also been corrected. The difference between parametric and non-parametric filters is now given, and the reason for applying the non-parametric version of the test is presented in the revised section on gene selection using statistical filters.
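
For readers unfamiliar with the distinction, a common non-parametric counterpart of the two-sample t-test is the Wilcoxon rank-sum (Mann-Whitney U) test, which ranks genes without assuming normally distributed expression values. The Python snippet below is only an illustration of such a rank-based filter on synthetic data; the exact filter implemented in HDG-select is the one described in the revised section.

    import numpy as np
    from scipy.stats import mannwhitneyu

    # Synthetic data: 1000 genes x 40 samples, two classes of 20 samples each.
    rng = np.random.default_rng(3)
    expr = rng.normal(size=(1000, 40))
    labels = np.array([0] * 20 + [1] * 20)

    # Rank-sum p-value per gene between the two classes.
    pvals = np.array([
        mannwhitneyu(expr[g, labels == 0], expr[g, labels == 1],
                     alternative="two-sided").pvalue
        for g in range(expr.shape[0])
    ])

    # Keep the genes with the smallest p-values (the cutoff here is arbitrary).
    top_genes = np.argsort(pvals)[:50]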

- The paragraph "Selection and classification using a wrapper-based GBPSO-SVM algorithm" also needs to be revised: first you have to describe GBPSO, SVM, and what they do (which, in reality, should have been done previously in the introduction), and then describe why you have decided to couple them and how.

Response:

The paragraph "Selection and classification using a wrapper-based GBPSO-SVM algorithm" was changed to “genes selection using GBPSO-SVM algorithm”. Both GBPSO and SVM algorithms were described in the introduction section and the reasons of why SVM has chosen was also given.

- Are Figures 3 to 6 portions of Figure 7? If so, please use only Figure 7, which is much more informative and gives an idea of the interface the user will face. You might consider using letters in Figure 7 to help the reader follow the description of the procedure.

Response:

Yes, they were parts of Figure 7. Therefore, we now use only one main interface figure, with a complete description, in the revised paper.

- The text from line 400 to line 442 needs its own section, as it is no longer a description of the GUI-based application; here you describe the results of the tests performed on your algorithm and application. First of all you have to describe what you are comparing and how you calculated the accuracy, and then present the results. Moreover, the "results and discussion" section is missing from your paper, and it is there that you should present the results of the tests and comment on them.

Response:

The text from line 400 to line 442 was moved into the results and discussion section. The equations used to determine the accuracy and precision of HDG-select are also given in the section "Development of HDG-select application".
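
The two metrics referred to here have their standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN) and precision = TP / (TP + FP). A generic Python illustration follows; it is not the authors' evaluation code, and the labels are invented.

    from sklearn.metrics import accuracy_score, precision_score

    # Hypothetical true labels and SVM predictions for eight samples.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / all samples
    print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)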

- Your application needs to be compared to other available tools or approaches, and you need to briefly introduce and describe them in the introduction. This part is completely absent.

Response:

A comparison of the results achieved by the HDG-select application with previously reported methods/tools has been made accordingly, and these methods are mentioned and cited in the introduction section.

- What is the purpose of comparing biomarkers from different studies? Is there any meaning in comparing biomarkers of influenza A and breast cancer? Why are you making these observations? Are they important for understanding your application and its performance?

Response:

The heatmap results for two representative datasets, influenza A and breast cancer, are presented to show the performance and effectiveness of the proposed HDG-select tool.

- In the conclusion, you stated that "A novel stand-alone application was successfully established which can be used to select and classify the most attributed genes in high dimensional data rapidly and efficiently. The application was validated on 11 datasets and it was found to perform well on most of high dimensional datasets." However, you have provided no information about the time needed to perform the analysis with your application or with the other tools used for comparison. Also, you can state that your tool performs well only in comparison to other approaches or tools; having good accuracy is not enough. Consider that accuracy is not the only way to compare performance; it has its own meaning and drawbacks, which you have not considered. An analysis can return 100% accuracy but have poor precision compared to another, for example.

Response:

In the revised version of the manuscript, the application was updated to also report the precision of the gene selection and classification. In addition, the computational time that the proposed application spends on completing gene selection and classification from beginning to end is given in the revised version. Furthermore, the results are compared with those of other tools and approaches reported in the literature.
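
One simple way to report the end-to-end runtime mentioned here is to wrap the whole selection-and-classification run in a wall-clock timer. The Python sketch below uses time.perf_counter with a placeholder for the actual pipeline call; the tool itself is MATLAB-based, so this only illustrates the measurement and is not the authors' code.

    import time

    start = time.perf_counter()
    # ... run the full gene selection and classification pipeline here ...
    elapsed = time.perf_counter() - start
    print(f"selection + classification took {elapsed:.1f} s")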

Lastly, please revise your manuscript keeping in mind the structure and the meaning of each part of an article:

-introduction: here you introduce the problem and your solution, also giving the reader all the information and notions needed to fully understand the paper. These need only be introduced and do not have to be described in full detail. You also need to say why your solution is important.

-materials and methods: here you have to describe your work in detail: the algorithm, the application, the datasets used to test it, the other tools and approaches used for comparison, and the statistics used to summarize the performance of the tools/approaches.

-results: here you present the results of the tests and the comparisons performed.

-discussion: here you discuss the results. Results and discussion may sometimes be merged into a combined "results and discussion" section if this makes the paper clearer and easier to understand.

-conclusion: here, and nowhere else in the paper, you draw conclusions based only on the results you have presented, no more and no less. If you carried out tests that are not included in the manuscript, you have to mention them in the text, noting that you will not show the data, and then you can discuss them and draw some conclusions about them.

Response:

Thank you for the constructive comments and the guidance on each section. We have made the necessary corrections and modifications in order to strengthen the contents of our paper.

The revised manuscript is submitted for your kind consideration, and we hope that it is now acceptable for publication.

Yours sincerely

Shilan S. Hameed

On-behalf of all the co-authors

shilansamin@gmail.com

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Bryan C Daniels

13 Jan 2021

HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets

PONE-D-20-11244R1

Dear Dr. Hameed,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Bryan C Daniels

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

One reviewer suggests additional references that you may decide to include.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

Reviewer #4: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: None.

Reviewer #4: 1. The selection and classification of genes is essential for the identification of genes related to a specific disease. Developing a user-friendly application with combined statistical rigor and machine learning functionality to help biomedical researchers and end users is of great importance. The authors developed a new stand-alone application, based on a graphical user interface (GUI), to perform the full functionality of gene selection and classification in high dimensional datasets. The HDG-select application is validated on eleven high dimensional datasets in CSV and GEO soft formats.

2. In gene selection research, I now consider this revised contribution good enough to make a valuable scientific paper for the field of interest. I am a strong advocate of introducing the rigor of feature selection into classification research. Having read the manuscript, I believe the work, which is significant in introducing the HDG technique to the problem of interest, is suitable for publication.

3. Good paper and improved revised article.

4. Include some of the latest relevant references for the benefit of readers and authors working on evolutionary and feature selection methods. The following citations would be very useful for current and future research scholars in this field.

a. Feature selection inspired by human intelligence for improving classification accuracy of cancer types.

b. Knowledge discovery in medical and biological datasets by integration of Relief-F and correlation feature selection techniques.

c. Multi-population adaptive genetic algorithm for selection of microarray biomarkers.

d. A study on metaheuristics approaches for gene selection in microarray data: algorithms, applications and open challenges

e. Hybrid approach for gene selection and classification using filter and genetic algorithm.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: Yes: Zakariya Yahya Algamal

Reviewer #4: No

Acceptance letter

Bryan C Daniels

18 Jan 2021

PONE-D-20-11244R1

HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets

Dear Dr. Hameed:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Bryan C Daniels

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Dataset. The microarray dataset of leukemia cancer in csv format.

    (CSV)

    S2 Dataset. The microarray dataset of colon cancer in csv format.

    (CSV)

    S3 Dataset. The microarray dataset of prostate cancer in csv format.

    (CSV)

    S4 Dataset. The microarray dataset of breast cancer in csv format.

    (CSV)

    S5 Dataset. The microarray dataset of central nervous system cancer in csv format.

    (CSV)

    S6 Dataset. The microarray dataset of ovarian cancer in csv format.

    (CSV)

    S1 Table. The accuracy percentage of selecting attributed genes using Filter-SVM and Filter-GBPSO-SVM approach compared to that of the original dataset.

    (DOCX)

    S2 Table. The accuracy percentage of Filter-SVM and Filter-GBPSO-SVM algorithm upon various soft files of high dimensional datasets.

    (DOCX)

    S3 Table. Comparison of the accuracy result of the proposed HDG-select application to those reported in literature.

    (DOCX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.

