Learning misclassification costs for imbalanced classification on gene expression data

Huijuan Lu; Yige Xu; Minchao Ye; Ke Yan; Zhigang Gao; Qun Jin

BMC Bioinformatics. 2019 Dec 24;20(Suppl 25):681. doi: 10.1186/s12859-019-3255-x

PMCID: PMC6929277 PMID: 31874599

Abstract

Background

Cost-sensitive learning is an effective strategy for solving imbalanced classification problems. However, the misclassification costs are usually determined empirically based on user expertise, which leads to unstable performance of cost-sensitive classification. Therefore, an efficient and accurate method is needed to calculate the optimal cost weights.

Results

In this paper, two approaches are proposed to search for the optimal cost weights, targeting the highest weighted classification accuracy (WCA). One is optimal cost weight grid searching and the other is function fitting. The two approaches are compared with each other. In the experiments, we classify imbalanced gene expression data using an extreme learning machine to test the cost weights obtained by the two approaches.

Conclusions

Comprehensive experimental results show that the function fitting method is generally more efficient and can find the optimal cost weights with acceptable WCA.

Keywords: Cost-sensitive, Misclassification cost, Weighted classification accuracy, Parameter fitting

Background

Classification of gene expression data reveals tremendous information in various fields of biomedical research, such as cancer diagnosis, prognosis and prediction [1–3]. However, gene expression data are composed of high-dimensional, noisy and imbalanced samples [4]. Imbalanced data are characterized by a serious imbalance in the proportion of positive and negative samples [5, 6]. Gene expression data require a series of pre-processing steps to eliminate misleading classification results [7]. Moreover, the classification of gene expression data is a cost-sensitive problem, although both positive and negative classifications of cancer genes provide important evidence for doctors making a treatment plan.

Traditional machine learning algorithms usually assume that the training set is balanced. For imbalanced datasets, such as gene expression datasets, classical classification algorithms that optimize the correct classification rate (CCR) may be biased towards the majority classes. However, misclassifications of minority classes usually have a greater impact than those of majority classes. Therefore, the introduction of cost-sensitive learning (CSL) is necessary to overcome the defects of traditional classification algorithms on imbalanced datasets. Traditionally, oversampling the minority class, undersampling the majority class, and synthesizing new minority samples can be used to handle this problem. In this work, we instead search directly for the optimal cost weights.

In CSL, misclassification cost is an important factor to evaluate the classification performance of imbalanced datasets. However, solving the misclassification cost matrix is not a trivial task in many situations [8–10]. A direct solution for finding the misclassification costs is to assign them manually according to user expertise or inversely calculate the costs based on class distribution [11–13]. More sophisticated solutions can be found by fitting the importance of features to adaptive equations.

In this paper, we learn the misclassification cost from the evaluation functions of cost-sensitive algorithms, using weighted classification accuracy (WCA) as the measure of cost-sensitive classification performance. The cost weights that lead to optimal classification performance are learned by a grid searching strategy, which gives researchers a reference weight. Then, three fitting functions are found to represent the optimal cost weights. A series of comprehensive experimental results show that the function fitting approach is an effective way of finding the optimal cost weights, targeting high WCA. The fitting functions locate the optimal weights closely, and appropriate weights improve the accuracy of the model.

Imbalanced data greatly affect the accuracy of classification, so we first review cost-sensitive classification algorithms for the imbalance problem. CSL is one of the hottest topics in the field of machine learning. Many works have studied CSL and embedded the misclassification costs into various classifiers, such as decision trees (DTs), support vector machines (SVMs) and extreme learning machines (ELMs). Chai et al. [14] considered the testing costs of missing values in naive Bayes (NB) and DT algorithms. Feng [15] defined a customized objective function for misclassification costs and designed a score evaluation based cost-sensitive DT. For multi-class classification problems, Feng’s method generally achieves higher classification accuracy or lower misclassification costs. Zhao and Li [16] extended the evaluation function by including the weighted information gain ratio and the test cost for the cost-sensitive DT. The proposed cost-sensitive DT algorithm not only reduced the misclassification cost, but also improved the classification efficiency of the original C4.5 algorithm [17, 18]. Lu et al. [19] used cost-sensitive DTs as base classifiers and constructed a cost-sensitive rotation forest. Two kinds of DTs, i.e., EG2 and C4.5, were considered and tested [20]. These experiments show that integrating cost sensitivity into classification algorithms can effectively improve classification efficiency.

Combining cost sensitivity with classification algorithms yields efficient classification methods. Cao et al. [21] proposed to embed evaluation measures into the objective function to improve the performance of a cost-sensitive support vector machine (CS-SVM). He et al. [22] integrated the Gaussian Mixture Model (GMM) into the CS-SVM to deal with the imbalanced classification problem. Cheng and Wu [23] added weights to features and introduced a weighted features cost-sensitive SVM (WF-CSSVM). The WF-CSSVM algorithm showed significant performance improvement in both accuracy and cost. Silva et al. [24] combined CS-SVM with a semi-supervised learning method to form a hybrid classification algorithm. The effectiveness of the proposed hybrid method is shown by experimental results on Earth monitoring and landscape mapping. Cao et al. [25] tackled the problem of multi-label imbalanced data classification. They assigned different misclassification costs to different label sets to reduce the overall misclassification cost.

CS-ELM has been studied by many researchers in various aspects. Zong et al. [26] introduced a weighted extreme learning machine (WELM) for imbalanced data learning. It was claimed that the WELM can be extended to a cost-sensitive ELM (CS-ELM). Zheng et al. [27] formally applied the concept of the cost-sensitivity to extreme learning machine (ELM). Yan et al. [28, 29] extended Zheng et al.’s work and introduced a cost-sensitive dissimilar ELM (CS-D-ELM). Compared to traditional ELM algorithms, the CS-ELM algorithms guarantee the classification accuracy and reduce the misclassification cost. More recently, Zhang and Zhang [30] solved the problem of defining and optimizing the cost matrix for CS-ELM to make it more robust and stable [31, 32]. Zhu and Wang [33] treated CS-ELM as a base classifier to solve a semi-supervised learning problem. Incremental results show that the CS-ELM has better performance in terms of accuracy, cost, efficiency and robustness over other existing classifiers.

Classical definition of cost matrix

Considering the binary classification problem, the confusion matrix shows four types of classification results according to the prediction values, namely, true positive, false positive, false negative and true negative (Table 1) [34, 35].

Table 1.

The confusion matrix for binary classification

	Prediction of Positive	Prediction of Negative
Positive samples	True Positive TP	False Negative FN
Negative samples	False Positive FP	True Negative TN

The CSL seeks the overall minimum cost by introducing sensitive costs, rather than only aiming at high CCR. While there are several types of classification costs, it should be noted that this work only focuses on the misclassification cost.

Misclassification cost can be viewed as penalties for errors in the classification process. In binary classification problems, costs caused by different types of errors may be different. We define the minority class as positive (P), the majority class as negative (N), and construct the cost matrix C as shown in Table 2.

Table 2.

Cost matrix

Actual \ Predicted	P	N
P	C₀₀	C₀₁
N	C₁₀	C₁₁

In Table 2, C₀₀ and C₁₁ show the cost of correct classification. By default, we set the costs of correct classifications as 0. C₀₁ and C₁₀ show the costs of error classifications, where C₀₁ denotes the misclassification costs of samples from P class, and C₁₀ denotes the misclassification costs of samples from N class. Therefore, the cost matrix in Table 2 can be simplified as:

C = \begin{bmatrix} 0 & C_{01} \\ C_{10} & 0 \end{bmatrix}
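
To illustrate how the simplified cost matrix is used, the total misclassification cost of a classifier can be computed from its error counts: each false negative (a misclassified P sample) costs C₀₁ and each false positive (a misclassified N sample) costs C₁₀. The following is a minimal Python sketch; the function name `total_cost` and the example counts are hypothetical and are not taken from the experiments.

```python
def total_cost(fn: int, fp: int, c01: float, c10: float) -> float:
    """Total misclassification cost when correct predictions cost 0.

    fn  -- number of P samples predicted as N (each costs c01)
    fp  -- number of N samples predicted as P (each costs c10)
    """
    return fn * c01 + fp * c10


# Hypothetical error counts: 3 missed positives and 10 false alarms.
equal_costs = total_cost(fn=3, fp=10, c01=1.0, c10=1.0)
skewed_costs = total_cost(fn=3, fp=10, c01=5.0, c10=1.0)
print(equal_costs, skewed_costs)
```

With equal costs the ten false positives dominate the total, whereas raising C₀₁ makes the three missed positives the larger contribution, which is the effect a cost-sensitive classifier exploits.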

Correct classification rates versus weighted classification accuracy

For classical machine learning problems, the classification accuracy usually refers to the correct classification rate (CCR) [36–38], also called overall accuracy (OA) [39–42], which is the proportion of all correctly classified samples:

OA = \frac{TP + TN}{TP + FN + TN + FP} \times 100\%

However, for imbalanced datasets where the numbers of positive and negative samples differ significantly, the CCR might be misleading [43, 44]. Consider a test set containing 99 negative samples and only one positive sample [45, 46]: a poorly designed classifier that simply labels all samples as negative achieves an overall accuracy of 99/100 = 0.99, even though its accuracy on the positive class is 0. To resolve this issue, we introduce the notion of adaptive classification accuracy (ACA), defined as follows:

ACA = \frac{1}{2} \cdot (\frac{TP}{TP + FN} + \frac{TN}{TN + FP})

By embedding a weight w_i into the i-th class, we get the weighted classification accuracy (WCA) as:

WCA = \frac{w_{1}}{w_{1} + w_{2}} \cdot \frac{TP}{TP + FN} + \frac{w_{2}}{w_{1} + w_{2}} \cdot \frac{TN}{TN + FP}

By enforcing w₁ + w₂ = 1, the formula above reduces to:

WCA = w_{1} \cdot \frac{TP}{TP + FN} + w_{2} \cdot \frac{TN}{TN + FP}

This formula extends naturally to multi-class problems:

WCA_{n} = \sum_{i = 1}^{n} w_{i} \frac{CM_{i}}{M_{i}}, \quad \sum_{i = 1}^{n} w_{i} = 1

where n denotes the number of classes, M_i (i = 1, 2,..., n) denotes the number of samples belonging to the i-th class, and CM_i (i = 1, 2,..., n) denotes the number of correctly classified samples within the i-th class. Since the WCA describes the classification accuracy on imbalanced data more faithfully than the CCR, we use the WCA to evaluate the classification performance of cost-sensitive classifiers on gene expression data.
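
The difference between the three measures can be checked on the 99-to-1 example given earlier. The sketch below is only illustrative: the function names `overall_accuracy` and `weighted_accuracy` are assumptions introduced here, and the weights (0.9, 0.1) are chosen simply to emphasize the positive class.

```python
from typing import Sequence


def overall_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Overall accuracy (OA): share of all samples classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def weighted_accuracy(correct: Sequence[int], totals: Sequence[int],
                      weights: Sequence[float]) -> float:
    """WCA for n classes: sum of w_i * CM_i / M_i, with the w_i summing to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("class weights must sum to 1")
    return sum(w * c / m for w, c, m in zip(weights, correct, totals))


# 99 negatives, 1 positive, and a classifier that labels everything negative.
tp, fn, fp, tn = 0, 1, 0, 99
oa = overall_accuracy(tp, fn, fp, tn)
# ACA is the special case of WCA with equal weights.
aca = weighted_accuracy([tp, tn], [tp + fn, tn + fp], [0.5, 0.5])
# Illustrative weights that emphasize the positive (minority) class.
wca = weighted_accuracy([tp, tn], [tp + fn, tn + fp], [0.9, 0.1])
print(f"OA={oa:.2f} ACA={aca:.2f} WCA={wca:.2f}")
```

The overall accuracy stays at 0.99, while the ACA drops to 0.5 and the WCA with w₁ = 0.9 drops to 0.1, reflecting that the positive class is never recognized.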

Methods

Optimal cost weights searching

From the University of California Irvine (UCI) standard classification datasets, we choose the Leukemia, Colon, Prostate, Lung and Ovarian cancer gene expression datasets for cost weight searching and further testing. Details of these datasets are shown in Table 3.

Table 3.

Specifications of datasets

Dataset	Number of samples	Feature dimension	Number of classes
Leukemia	34	7130	2
Colon	62	2000	2
Prostate	136	12600	2
Lung	181	12533	2
Ovarian	253	15154	2

Optimal cost weights searching by grid searching strategy

The optimal weights are searched by an adaptive algorithm using grid searching. There are two crucial factors to consider: the sample importance w and the sample categorical distribution p. The sample categorical distribution p is the ratio between the numbers of positive-class and negative-class samples in the test set, which is constructed by random sampling. It is therefore necessary to study the relationship among the three quantities w, p and WCA, where WCA is the fitness value for the grid searching strategy. In general, the grid searching strategy can be described as follows (the detailed algorithm steps are listed in Table 4):

Table 4.

Grid Searching Strategy

Grid Searching Strategy
procedure GRIDSEARCHING(M, T, P₀)
    P = P₀
    f_max = WCA(P)
    P_max = P
    while P + T ≤ M do
        P = P + T
        f = WCA(P)
        if f > f_max then
            f_max = f
            P_max = P
        end if
    end while
    return P_max, f_max
end procedure

1) Set the searching region as M, the grid searching step size as T, and the initial position as P₀;

2) Calculate the fitness of the current position (the WCA obtained there) and record it as the best fitness f_max, with the best position P_max;

3) Update the current position, P = P + T, and calculate its fitness;

4) If the current fitness value is greater than f_max, update f_max and P_max;

5) Repeat steps 3) and 4) until the searching region is covered, then return f_max and P_max.
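
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: `grid_search` and its parameter names are introduced here for illustration, the searching region M is treated as an upper bound on the position, and `toy_wca` is a made-up stand-in for training and evaluating CS-ELM at a given cost weight.

```python
from typing import Callable, Tuple


def grid_search(fitness: Callable[[float], float], start: float,
                upper: float, step: float) -> Tuple[float, float]:
    """Scan positions start, start + step, ... up to upper; keep the best one."""
    if step <= 0:
        raise ValueError("step must be positive")
    best_pos, best_fit = start, fitness(start)
    k = 1
    # Positions are computed as start + k * step to avoid accumulating
    # floating-point error from repeated addition.
    while start + k * step <= upper + 1e-12:
        pos = start + k * step
        fit = fitness(pos)
        if fit > best_fit:
            best_pos, best_fit = pos, fit
        k += 1
    return best_pos, best_fit


def toy_wca(cost_weight: float) -> float:
    """Hypothetical fitness curve peaking at 1.65 (NOT a real classifier)."""
    return 0.9 - 0.05 * (cost_weight - 1.65) ** 2


best_weight, best_fitness = grid_search(toy_wca, start=0.5, upper=5.0, step=0.01)
print(round(best_weight, 2), round(best_fitness, 4))
```

In practice the fitness call dominates the running time, since every grid position requires training and testing the classifier; this is the cost that the function fitting approach avoids.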

Extreme learning machine (ELM) is an effective learning algorithm for single hidden-layer feed-forward neural networks (SLFNs). Cost-sensitive extreme learning machine (CS-ELM) is a kind of ELM that attaches a cost matrix to the output layer. In this research, we set the number of hidden neurons to 10; with fewer neurons, the results respond more visibly to changes in the weights. Seven different gene expression datasets are used to obtain the classification results with CS-ELM as the classifier. CS-ELM minimizes the conditional risk by embedding the misclassification cost in ELM:

\arg\min_{i} R(i \mid x) = \arg\min_{i} \sum_{j} P(j \mid x) \cdot C(i, j)

where R(i|x) is the conditional risk when the sample x is assigned to the class i, and P(j|x) is the conditional probability that x belongs to j, C(i, j) is the risk of misclassifying j to class i, where i, j ∈ {c₁, c₂, …, c_m} and m is the number of classification categories.
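
The minimum conditional risk rule can be illustrated with a small Python sketch. It covers only the decision step, not the ELM itself; the function name `min_risk_class` and the numbers are hypothetical. Note that `cost[i][j]` follows the C(i, j) convention of the formula above (predicted class i, true class j).

```python
from typing import List, Sequence


def conditional_risks(posteriors: Sequence[float],
                      cost: Sequence[Sequence[float]]) -> List[float]:
    """R(i|x) = sum_j P(j|x) * C(i, j) for every candidate class i."""
    return [sum(p * cost[i][j] for j, p in enumerate(posteriors))
            for i in range(len(cost))]


def min_risk_class(posteriors: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> int:
    """Index of the class with the smallest conditional risk."""
    risks = conditional_risks(posteriors, cost)
    return min(range(len(risks)), key=risks.__getitem__)


# Class 0 = positive (minority), class 1 = negative (majority).
posteriors = [0.3, 0.7]          # hypothetical P(j|x) for one sample
equal_cost = [[0.0, 1.0],
              [1.0, 0.0]]
# Predicting negative for a truly positive sample is four times as costly.
skewed_cost = [[0.0, 1.0],
               [4.0, 0.0]]

print(min_risk_class(posteriors, equal_cost))   # follows the larger posterior
print(min_risk_class(posteriors, skewed_cost))  # shifts towards the minority class
```

With equal costs the sample is assigned to the negative class (risks 0.7 versus 0.3), whereas the skewed costs give risks 0.7 versus 1.2 and the positive class is chosen despite its lower posterior.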

Results

Optimal cost weights searching by function fitting

In this subsection, we use w and p as independent variables, and define a function fitting problem as:

w_{c} = f (w, p)

where w_c = C₀₁/C₁₀, w = w₁/(w₁ + w₂) and p represents the proportion of positive and negative classes. We set C₁₀ to 1 to reduce the complexity of calculation, i.e., w_c = C₀₁.

Table 5.

Optimal weights for different data set

Data set	Sample categorical distribution p	w₁/(w₁+w₂)	w₂/(w₁+w₂)	Optimal weight w_c	WCA
Colon	1	0.2	0.8	1.03	0.6167
Leukemia	1.33	0.9	0.1	0.9	0.9179
Ovarian1	1.68	0.9	0.1	1.65	0.9055
Prostate1	2	0.9	0.1	1.06	0.939
Prostate2	2.5	0.9	0.1	1.04	0.9372
Lung1	3	0.9	0.1	0.93	0.92
Ovarian2	4	0.1	0.9	3.45	0.9094
Lung2	5	0.1	0.9	4.26	0.9078
Ovarian3	6.5	0.9	0.1	0.8	0.9075
Lung3	8	0.9	0.1	0.92	0.9009

The sample distribution p, the optimal weight w_c = C₀₁/C₁₀ and the highest fitness value of each dataset are listed in Table 5.

Table 6.

Datasets, cost weights and WCAs with the two approaches proposed

Dataset	p	w	Cost weight: optimal	Cost weight: w_c1	Cost weight: w_c2	Cost weight: w_c3	WCA: optimal	WCA: w_c1	WCA: w_c2	WCA: w_c3	WCA: ECSELM
Ovarian	1.68	0.1	1.65	1.63	1.53	1.58	0.9055	0.9695	0.1966	0.2084	0.1017
Prostate	2.5	0.9	1.04	1.05	1.05	1	0.9372	0.9815	0.9509	0.9869	0.8985
Lung1	5	0.1	4.26	4.03	4.1	3.94	0.9078	0.9778	0.9786	0.9779	0.875
Lung2	8	0.9	0.92	0.9	0.66	0.61	0.9009	0.9564	0.9762	0.9675	0.9

We use an automatic fitting software package named 1STOPT to do the function fitting [47]. In 1STOPT, the Levenberg-Marquardt and Universal Global Optimization algorithms are used to fit the functions. We compared 500 functions of different types and selected the three with the highest correlation coefficients:

w_{c 1} = f_{1} (w, p) = \frac{a_{1} + a_{2} \cdot w + a_{3} \cdot w^{2} + a_{4} \cdot w^{3} + a_{5} \cdot a_{12} \cdot ln p + a_{6} \cdot {(a_{12} \cdot ln p)}^{2}}{1 + a_{7} \cdot w + a_{8} \cdot w^{2} + a_{9} \cdot a_{12} \cdot ln p + a_{10} \cdot {(a_{12} \cdot ln p)}^{2} + a_{11} \cdot {(a_{12} \cdot ln p)}^{3}}

where a₁ = 1.323, a₂ = − 2.278, a₃ = 3.047, a₄ = − 1.286, a₅ = − 1.746, a₆ = 0.998, a₇ = − 0.400, a₈ = 0.369, a₉ = − 2.606, a₁₀ = 2.544, a₁₁ = − 0.818, a₁₂ = 0.482. The correlation coefficient R₁ of f₁ is 0.96346.

w_{c 2} = f_{2} (w, p) = \frac{b_{1} + b_{3} \cdot w + b_{5} \cdot ln p + b_{7} \cdot w^{2} + b_{9} \cdot {ln}^{2} p + b_{11} \cdot w \cdot ln p}{1 + b_{2} \cdot w + b_{4} \cdot ln p + b_{6} \cdot w^{2} + b_{8} \cdot {ln}^{2} p + b_{10} \cdot w \cdot ln p}

where b₁ = 1.008, b₂ = 2.618, b₃ = 1.743, b₄ = − 0.808, b₅ = 0.297, b₆ = 2.327, b₇ = 4.605, b₈ = 0.406, b₉ = 0.699, b₁₀ = − 2.343, b₁₁ = − 4.984. The correlation coefficient R₂ of f₂ is 0.95903.

w_{c 3} = f_{3} (w, p) = \frac{c_{1} + c_{3} \cdot ln w + c_{5} \cdot p + c_{7} \cdot {ln}^{2} w + c_{9} \cdot p^{2} + c_{11} \cdot p \cdot ln w}{1 + c_{2} \cdot ln w + c_{4} \cdot p + c_{6} \cdot {ln}^{2} w + c_{8} \cdot p^{2} + c_{10} \cdot p \cdot ln w}

where c₁ = 1.279, c₂ = 0.574, c₃ = 0.943, c₄ = − 0.152, c₅ = − 0.291, c₆ = 0.113, c₇ = 0.154, c₈ = 0.009, c₉ = 0.018, c₁₀ = − 0.062, c₁₁ = − 0.250. The correlation coefficient R₃ of f₃ is 0.95244.
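
For readers who want to evaluate the fitted functions directly, the three formulas can be transcribed into Python as below. This is a sketch under stated assumptions: ln² is read as (ln ·)², the b₆ term of f₂ is taken to be b₆·w², and the coefficients are the rounded values printed above, so the results agree with the reported cost weights only up to rounding. The helper names `f1`, `f2` and `f3` simply mirror the notation of the text.

```python
import math

# Rounded coefficients as printed in the text; index 0 is a placeholder so
# that a[1] corresponds to a_1, b[1] to b_1, and c[1] to c_1.
a = (None, 1.323, -2.278, 3.047, -1.286, -1.746, 0.998,
     -0.400, 0.369, -2.606, 2.544, -0.818, 0.482)
b = (None, 1.008, 2.618, 1.743, -0.808, 0.297, 2.327,
     4.605, 0.406, 0.699, -2.343, -4.984)
c = (None, 1.279, 0.574, 0.943, -0.152, -0.291, 0.113,
     0.154, 0.009, 0.018, -0.062, -0.250)


def f1(w: float, p: float) -> float:
    u = a[12] * math.log(p)  # the recurring term a_12 * ln p
    num = a[1] + a[2] * w + a[3] * w**2 + a[4] * w**3 + a[5] * u + a[6] * u**2
    den = 1 + a[7] * w + a[8] * w**2 + a[9] * u + a[10] * u**2 + a[11] * u**3
    return num / den


def f2(w: float, p: float) -> float:
    lp = math.log(p)
    num = b[1] + b[3] * w + b[5] * lp + b[7] * w**2 + b[9] * lp**2 + b[11] * w * lp
    den = 1 + b[2] * w + b[4] * lp + b[6] * w**2 + b[8] * lp**2 + b[10] * w * lp
    return num / den


def f3(w: float, p: float) -> float:
    lw = math.log(w)
    num = c[1] + c[3] * lw + c[5] * p + c[7] * lw**2 + c[9] * p**2 + c[11] * p * lw
    den = 1 + c[2] * lw + c[4] * p + c[6] * lw**2 + c[8] * p**2 + c[10] * p * lw
    return num / den


# Lung1 setting from Table 6: p = 5, w = 0.1.
for name, f in (("w_c1", f1), ("w_c2", f2), ("w_c3", f3)):
    print(name, round(f(0.1, 5.0), 2))
```

For p = 5 and w = 0.1 the three functions evaluate to values close to the w_c1, w_c2 and w_c3 entries of the Lung1 row in Table 6 (4.03, 4.1 and 3.94). Because the denominators can become small, the results are sensitive to coefficient rounding, and f₃ requires w > 0.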

We compare the fitting functions with the optimal weights in Figs. 1, 2 and 3.

Fig. 1 — The values of function w_c1 compared with the optimal weights

Fig. 2 — The values of function w_c2 compared with the optimal weights

Fig. 3 — The values of function w_c3 compared with the optimal weights

Figures 1, 2 and 3 compare the three-dimensional interpolation of the optimal weights with the fitting functions. The red surface represents the optimal weights. The green, yellow and blue surfaces are the fitted surfaces of f₁, f₂ and f₃, respectively. The correlation coefficients of f₁, f₂ and f₃ indicate that the overall fit of f₁ is better than that of the other two. The function f₂ gradually deviates from the optimal weights as w increases and p decreases. The function f₃ is slightly coarser than f₁ in general.

Discussion

Comparison of grid searching and function fitting

Using different gene expression datasets, we compared the optimal cost weights obtained from the grid searching strategy with those from the fitted functions f₁, f₂ and f₃. In Table 6, we compare the WCAs on four different datasets, namely Ovarian, Prostate, Lung1 and Lung2. The majority-to-minority class ratios of the four datasets are 1.68, 2.5, 5 and 8, respectively. All WCAs are computed using ELM as the base classifier. We also compare the two approaches with ECSELM. The best-fitting datasets are listed in Table 6.

For each dataset, we plot how the cost weight varies with w. The best-fitting function (among f₁, f₂ and f₃) may differ from dataset to dataset (Fig. 4).

Fig. 4 — Cost weight comparison using Ovarian, Prostate, Lung1, Lung2 dataset (p = 1.68, 2.5, 5, 8)

Figure 4 shows that the more imbalanced the dataset is, the better the fit: the cost weights obtained from the fitting functions are closer to the optimal weights. In addition, the cost weights from f₁ and f₃ are slightly closer to the optimum than those from f₂. We put all cost weights obtained by the different methods in a three-dimensional plot and show the results in Fig. 5.

Fig. 5 — Cost weight comparison in overall

For each dataset, we also compare the WCAs for different values of w (Fig. 6). In addition, we compare the WCAs of the optimal weights and of f₁, f₂ and f₃ with those of ECSELM [48].

In Fig. 6, we can see that the WCAs of the three fitting functions are lower than the optimal accuracy when w is less than 0.5. The reason is that the cost weights are fitted less closely in this range. Moreover, Fig. 6 shows that the WCAs of the fitting functions approach the optimal accuracy as p increases. Furthermore, the WCAs of our approaches are better than those of ECSELM in most regions. Compared with ECSELM, our methods are more stable while still achieving high WCA, which supports the robustness of our strategy. As with the cost weights, we combine all WCAs obtained by the different methods in a three-dimensional plot (Fig. 7). In summary, we find that f₁ generally provides better classification performance than the other two functions, while f₃ and f₂ perform better when p is large (above 5).

Fig. 7 — The WCA comparison in 3-dimension

Conclusions

In this paper, we have proposed two approaches to calculate the optimal cost weights for gene expression data: a grid searching strategy and a function fitting method. They enrich the ways of calculating the cost weights for imbalanced data classification problems. In general, the function fitting approach is more efficient than the grid searching strategy. The experimental results also show that the function fitting approach can accurately find the optimal cost weights for imbalanced gene expression datasets.

The limitation of this work is that, although the ELM classifier is tested, the stability of the function fitting method is not proven, especially for other significantly different datasets. The exploration of the proposed algorithm’s stability is left as future work.

Acknowledgments

We thank T. R. Golub, D. K. Slonim et al. for providing us with the gene expression data.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 25, 2019: Proceedings of the 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical Informatics (ICBI) 2018 conference: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-25.

Abbreviations

ACA: Adaptive classification accuracy
CCR: Correct classification rates
CSL: Cost sensitive learning
OA: Overall accuracy
WCA: Weighted classification accuracy

Authors’ contributions

HL and MY conceived the project. YX developed the methodology, analyzed the results and wrote the manuscript. KY, ZG and QJ provided administrative and technical support. All authors read and approved the final manuscript.

Funding

This study is supported by National Natural Science Foundation of China (Nos. 61272315, 61402417, 61602431, 61701468, 61850410531, 61572164 and 61877015), Zhejiang Provincial Natural Science Foundation (Nos. Y1110342, LY15F020037 and LY19F020016) and International Cooperation Project of Zhejiang Provincial Science and Technology Department (No. 2017C34003).

Availability of data and materials

The datasets analyzed in this manuscript are available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(2):531–537. doi: 10.1126/science.286.5439.531.
2. Shipp M, Ross K, Tamayo P, Weng A, Kutok J, Aguiar R, Gaasenbeek M, Angelo M, Reich M, Pinkus G. Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002;8(1):68–74. doi: 10.1038/nm0102-68.
3. Veer L, Dai H, Vijver M, He Y, Hart A, Mao M, Peterse H, Kooy K, Marton M, Witteveen A. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–536. doi: 10.1038/415530a.
4. Deng S, Zhu L, Huang D. Predicting hub genes associated with cervical cancer through gene co-expression networks. IEEE Comput Soc Press. 2016;13(1):27–35.
5. Deng S, Zhu L, Huang D. Mining the bladder cancer-associated genes by an integrated strategy for the construction and analysis of differential co-expression networks. BMC Genomics. 2015;16(Suppl 3):1–10. doi: 10.1186/1471-2164-16-S3-S4.
6. Wang W, Lou B, Li X, Lou X, Jin N, Yan K. Intelligent maintenance frameworks of large-scale grid using genetic algorithm and k-mediods clustering methods. World Wide Web. 2019;2019(7):1573–1413.
7. Deng S, Cao S, Huang D, Wang Y. Identifying stages of kidney renal cell carcinoma by combining gene expression and DNA methylation data. IEEE/ACM Trans Comput Biol Bioinform. 2016;14(5):1147–1153. doi: 10.1109/TCBB.2016.2607717.
8. Elkan C. Seventeenth International Joint Conference on Artificial Intelligence. 2001. The foundations of cost-sensitive learning; pp. 973–978.
9. Zhou Z, Liu X. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng. 2006;18(1):63–77. doi: 10.1109/TKDE.2006.17.
10. Yan K, Zhong C, Ji Z, Huang J. Semi-supervised learning for early detection and diagnosis of various air handling unit faults. 2018. pp. 75–83.
11. Liu X, Zhou Z. International Conference on Data Mining. 2007. The influence of class imbalance on cost-sensitive learning: An empirical study; pp. 970–974.
12. Maheshwari S, Jain R, Jadon R. An insight into rare class problem: analysis and potential solutions. J Comput Sci. 2018;14(8):777–792. doi: 10.3844/jcssp.2018.777.792.
13. Hu M, Li W, Yan K, Ji Z, Hu H. Modern machine learning techniques for univariate tunnel settlement forecasting: A comparative study. 2019. pp. 1–12.
14. Chai X, Deng L, Yang Q, Ling C. IEEE International Conference on Data Mining. 2004. Test-cost sensitive naive bayes classification; pp. 51–58.
15. Feng S. International Conference on Logistics Engineering, Management and Computer Science. 2015. A cost-sensitive decision tree under the condition of multiple classes; pp. 1212–1218.
16. Zhao H, Li X. A cost sensitive decision tree algorithm based on weighted class distribution with batch deleting attribute mechanism. Inf Sci. 2016;378(2):303–316.
17. Quinlan J. C4.5: Programs for machine learning: Morgan Kaufmann Publishers Inc; 1992.
18. Liu K, Huang D. Cancer classification using rotation forest. Comput Biol Med. 2008;38(5):601–610. doi: 10.1016/j.compbiomed.2008.02.007.
19. Lu H, Yang L, Yan K, Xue Y, Gao Z. A cost-sensitive rotation forest algorithm for gene expression data classification. Neurocomputing. 2017;228(C):270–276. doi: 10.1016/j.neucom.2016.09.077.
20. Turney P. Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm. AI Access Foundation. 1994;2(1):369–409.
21. Cao P, Zhao D, Zaiane O. Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2013. An optimized cost-sensitive svm for imbalanced data learning; pp. 280–292.
22. Yuan H, Zhang X. Multiscale fragile watermarking based on the gaussian mixture model. IEEE Trans Image Process. 2006;15(10):3189–3200. doi: 10.1109/TIP.2006.877310.
23. Cheng D, Wu M. IEEE International Conference on Internet of Things. 2017. A novel classifier - weighted features cost-sensitive svm; pp. 598–603.
24. Silva J, Bacao F, Caetano M. Specific land cover class mapping by semi-supervised weighted support vector machines. Remote Sens. 2017;9(2):1–16. doi: 10.3390/rs9020181.
25. Cao P, Liu X, Zhao D, Zaiane O. International Conference on Hybrid Intelligent Systems. 2016. Cost sensitive ranking support vector machine for multi-label data learning; pp. 244–255.
26. Zong W, Huang G, Chen Y. Weighted extreme learning machine for imbalance learning. Neurocomputing. 2013;101(3):229–242. doi: 10.1016/j.neucom.2012.08.010.
27. Zheng E, Zhang C, Liu X, Lu H, Sun J. International Conference on Advanced Data Mining and Applications. 2013. Cost-sensitive extreme learning machine; pp. 478–488.
28. Liu Y, Lu H, Yan K, Xia H, An C. Applying cost-sensitive extreme learning machine and dissimilarity integration to gene expression data classification. Comput Intell Neurosci. 2016;2016(8):1–9. doi: 10.1155/2016/8056253.
29. Yan K, Ji Z, Lu H, Huang J, Shen W, Xue Y. Fast and accurate classification of time series data using extended elm: application in fault diagnosis of air handling units. IEEE Trans Syst Man Cybern Syst. 2017;49(7):1–8.
30. Zhang L, Zhang D. Evolutionary cost-sensitive extreme learning machine. IEEE Trans Neural Netw Learn Syst. 2017;28(12):3045–3060. doi: 10.1109/TNNLS.2016.2607757.
31. Wang S, Li X, Zhang S, Gui J, Huang D. Tumor classification by combining pnn classifier ensemble with neighborhood rough set based gene reduction. Comput Biol Med. 2010;40(2):179–89.
32. Wang S, Zhu Y, Jia W, Huang D. Robust classification method of tumor subtype by using correlation filters. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(2):580–591. doi: 10.1109/TCBB.2011.135.
33. Zhu H, Wang X. A cost-sensitive semi-supervised learning model based on uncertainty. Neurocomputing. 2017;251(8):106–114. doi: 10.1016/j.neucom.2017.04.010.
34. Ailijiang A, Charapko A, Demirbas M. Consensus in the cloud: Paxos systems demystified. In: 25th International Conference on Computer Communication and Networks (ICCCN). Waikoloa: IEEE; 2016. p. 1–10.
35. Zheng C, Huang D, Kong X, Zhao X. Gene expression data classification using consensus independent component analysis. Genomics Proteomics Bioinformatics. 2008;6(2):74–82. doi: 10.1016/S1672-0229(08)60022-4.
36. Yan K, Ji Z, Shen W. Online fault detection methods for chillers combining extended kalman filter and recursive one-class svm. Neurocomputing. 2017;228(3):205–12.
37. Wang X, Wang J, Yan K. Gait recognition based on gabor wavelets and (2d) 2 pca. Multimed Tools Appl. 2017;77(10):1–17.
38. Pei S, Huang D. Cooperative competition clustering for gene selection. J Clust Sci. 2006;17(4):637–651. doi: 10.1007/s10876-006-0077-6.
39. Zheng C, Zhang L, Ng V, Shiu S, Huang D. Molecular pattern discovery based on penalized matrix decomposition. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(6):1592–1603. doi: 10.1109/TCBB.2011.79.
40. Zheng C, Zhang L, Ng T, Shiu C, Huang D. Metasample-based sparse representation for tumor classification. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(5):1273–1282. doi: 10.1109/TCBB.2011.20.
41. Zheng C. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE/ACM Trans Comput Biol Bioinform. 2009;4(13):599–607. doi: 10.1109/TITB.2009.2018115.
42. Huang D, Zheng C. Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics. 2006;22(15):1855–1862. doi: 10.1093/bioinformatics/btl190.
43. Huang D, Yu H. Normalized feature vectors: a novel alignment-free sequence comparison method based on the numbers of adjacent amino acids. IEEE/ACM Trans Comput Biol Bioinform. 2013;10(2):457–467. doi: 10.1109/TCBB.2013.10.
44. Zhao X, Dheung Y, Huang D. Analysis of gene expression data using rpem algorithm in normal mixture model with dynamic adjustment of learning rate. Int J Pattern Recognit Artif Intell. 2010;24(4):651–666. doi: 10.1142/S0218001410008056.
45. Zheng C, Huang D, Shang L. Feature selection in independent component subspace for microarray data classification. Neurocomputing. 2006;69(16):2407–2410. doi: 10.1016/j.neucom.2006.02.006.
46. Zheng C, Huang D, Sun Z, Lyu M, Lok T. Nonnegative independent component analysis based on minimizing mutual information technique. Neurocomputing. 2006;69(7):878–883. doi: 10.1016/j.neucom.2005.06.008.
47. Cheng X, Chai F, Gao J, Zhang K. Proceedings of the 4th IEEE International Conference on Computer Science and Information Technology. 2011. 1stopt and global optimization platform-comparison and case study; pp. 18–21.
48. Alejo R, Sotoca JM, García V, Valdovinos RM. Cost-sensitive neural networks and editing techniques for imbalance problems. In: Mexican Conference on Pattern Recognition. Berlin, Heidelberg: Springer; 2010. p. 180–8.

Associated Data

Data Availability Statement

The datasets analyzed in this manuscript are available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
