Abstract
Essential proteins comprise the minimum set of proteins required to support cell life. Identifying essential proteins is important for understanding the cellular processes of an organism. However, identifying essential proteins experimentally is extremely time-consuming and labor-intensive, so alternative methods must be developed. There were two goals in this study: identifying the important features and building learning machines for discriminating essential proteins. Data for Saccharomyces cerevisiae and Escherichia coli were used. We first collected information from a variety of sources. We then proposed a modified backward feature selection method and built support vector machine (SVM) predictors based on the selected features. To evaluate the performance, we conducted cross-validations for the originally imbalanced data set and for a down-sampled balanced data set. Statistical tests were applied to the performance measures associated with the obtained feature subsets to confirm their significance. In the first data set, our best values of F-measure and Matthews correlation coefficient (MCC) were 0.549 and 0.495 in the imbalanced experiments. For the balanced experiment, the best values of F-measure and MCC were 0.770 and 0.545, respectively. In the second data set, our best values of F-measure and MCC were 0.421 and 0.407 in the imbalanced experiments. For the balanced experiment, the best values of F-measure and MCC were 0.718 and 0.448, respectively. The experimental results show that our selected feature subsets are compact and that the prediction performance is improved. Predictions can also be made by users at the following internet address: http://bio2.cse.nsysu.edu.tw/esspredict.aspx.
Keywords: support vector machine, feature selection, protein-protein interaction, essential protein, statistical test
Introduction
Identifying essential proteins is important for understanding the cellular processes in an organism because no other proteins can perform the functions of essential proteins. Once an essential protein is removed, dysfunction or cell death results. Thus, several studies have been conducted to identify essential proteins. Experimental approaches for identifying essential proteins include gene deletion,1 RNA interference,2 and conditional knockouts.3 However, these methods are labor-intensive and time-consuming. Hence, alternative methods for identifying essential proteins are necessary.
The essential protein classification problem involves determining the necessity of a protein for sustaining cellular function or life. Among the methods available for identifying essential proteins, machine-learning based methods are promising approaches. Therefore, several studies have been conducted to examine the effectiveness of this technique. Chin4 proposed a double-screening scheme and constructed a framework known as the hub analyzer (http://hub.iis.sinica.edu.tw/Hubba/index.php) to rank the proteins. Acencio and Lemke5 used Waikato Environment for Knowledge Analysis (WEKA)6 to predict the essential proteins. Hwang et al7 applied a support vector machine (SVM) to classify the proteins.
Protein-protein interactions (PPIs) are well known to be significant characteristics of protein function. Several studies have attempted to predict and classify protein function8 as well as analyze protein phenotype9 by studying interactions. A previous study10 further suggested that essential proteins and nonessential proteins can be discriminated by means of topological properties derived from the PPI network. Despite these useful properties, however, determining PPIs experimentally is time-consuming. With the advent of high-throughput techniques such as the yeast two-hybrid system,11 which can identify several PPIs in one experiment, obtaining PPI information has become easier. Since a PPI network is similar to a social network in many aspects, some researchers have applied social network techniques to analyze PPI networks. Thus, several topological properties have been extensively explored and studied in recent years.
Fundamental properties, such as sequence or protein physicochemical properties, were not subjected to detailed examination in previous studies. This may be because each of these preliminary properties alone is only weakly related to essentiality. However, this information is highly accessible because only sequence information is required to derive these properties. Hence, we included these properties in our study. For topological properties, in addition to physical interactions, we incorporated a variety of interaction information, including metabolic, transcriptional regulation, integrated functional, and genomic context interactions. Our experimental results revealed that these features provide either complementary information for essentiality identification or other biological justification.
To identify a reduced feature subset that is crucial for biological processes, previous studies have used feature selection techniques. The advantages of this approach include storage reduction, performance improvement, and data interpretation.12 According to whether the feature selection procedure is coupled with the predictor, the methods are roughly classified into three categories: filter, wrapper, and embedded. Filter methods often provide a complete ordering of the available features in terms of relevance measures. Methods such as Fisher score,12 mutual information, minimal redundancy and maximal relevance (mRMR),13 conditional mutual information maximization (CMIM),14 and minimal relevant redundancy (mRR)15 belong to this category. Both wrapper and embedded methods involve the selection process as part of the learning algorithm. The former utilizes a learning machine to evaluate subsets of features according to some performance measurements; for example, sequential backward and forward feature selection12 falls into this category. Embedded methods perform feature selection directly in the learning process and are usually specific to given learning machines. Examples include C4.5,16 Classification and Regression Trees (CART),17 and ID3.18 Additionally, some researchers proposed an information-gain-based feature selection method,19 which examines the effectiveness of classifier combination.
In this paper, we used two data sets. The first was from Saccharomyces cerevisiae. The corresponding PPI data set was Scere 20070107, which was obtained from the DIP database. The data set contains a total of 4873 proteins and 17,166 interactions. Our feature set consisted of the features obtained or extracted from the methods proposed by Acencio and Lemke,5 Chin,4 Hwang et al,7 and Lin et al.20 The second data set was from Escherichia coli and was first compiled by Gustafson et al.21 This data set contained a total of 3569 proteins. The associated network information included physical, integrated functional, and genomic context interactions and was collected from Hu et al.22 For both data sets, we propose a modified sequential backward feature selection method for selecting important features.
Next, SVM models were built using the selected feature subsets. In this study, the SVM software LIBSVM23 was adopted for classification models. Each model was applied to both imbalanced and balanced data sets. The results were compared with those of previous studies and statistical tests were conducted to examine significance. For the imbalanced S. cerevisiae data, our best results for F-measure and MCC were 0.549 and 0.495, respectively, which outperform the best previous method7 with results of 0.354 and 0.36, respectively. We obtained values of 0.770 and 0.545 for F-measure and MCC in the balanced data experiment, which was superior to the best previous method7 with 0.737 and 0.492, respectively. For experiments examining the E. coli data set, our best values for F-measure and MCC were 0.421 and 0.407, respectively, in the imbalanced data set. In the balanced experiment, the best values for F-measure and MCC were 0.718 and 0.448, respectively. The results are similar to those of Gustafson et al,21 who examined 29 features, but in our method, only five or seven features were used for prediction. To verify whether our improvement was statistically significant, we performed bootstrap cross-validation24 on performance measures.
Background
The data set
In this paper, we used two data sets for experiments: S. cerevisiae and E. coli. The former included PPI network data. We downloaded the data set from the DIP website (http://dip.doe-mbi.ucla.edu/).25 The original data set contained 4873 proteins and 17,166 interactions. To comply with previous studies, we adopted the largest connected component of the network data. There were a total of 4815 proteins, including 975 essential proteins and 3840 nonessential proteins. Information on protein essentiality was obtained from the Saccharomyces Genome Database (SGD), located at http://www.yeastgenome.org/. Since this data set has been used in several previous studies, we obtained and incorporated various related features for our experiments.
The E. coli data set was obtained from Gustafson et al.21 It contained 3569 proteins, among which 611 were essential. Due to availability and coverage issues, we used additional information from three networks: the physical interaction (PI) network, the integrated functional interaction network, and the integrated PI and genomic context (GC) network. This information was collected from Hu et al.22
In the above two data sets, the ratios of nonessential proteins to essential proteins were approximately 4:1 and 5:1, respectively. This imbalance inevitably leads to biased fitting toward nonessential proteins during learning. Thus, we constructed another, balanced data set. Taking the first data set as an example, we randomly selected 975 nonessential proteins and mixed them with the essential proteins to form a balanced data set in which the numbers of nonessential and essential observations were equal.
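As an illustration, the following is a minimal sketch of this down-sampling step; the pandas DataFrame layout and the `essential` label column are assumptions of the example, not part of the original study.

```python
import pandas as pd

def downsample_balanced(df: pd.DataFrame, label_col: str = "essential", seed: int = 0) -> pd.DataFrame:
    """Randomly down-sample the majority (nonessential) class to the size of the minority class."""
    essential = df[df[label_col] == 1]
    nonessential = df[df[label_col] == 0]
    # Draw as many nonessential proteins as there are essential ones, without replacement.
    sampled = nonessential.sample(n=len(essential), replace=False, random_state=seed)
    # Concatenate and shuffle so that observations are not ordered by class.
    return pd.concat([essential, sampled]).sample(frac=1, random_state=seed).reset_index(drop=True)
```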
Bootstrap cross-validation
We used bootstrap cross-validation (BCV) to compare the performance of two classifiers on the basis of k-fold cross-validation. Assume that a sample S = {(x1, y1), (x2, y2), …, (xn, yn)} is composed of n observations, where xi represents the feature vector of the ith observation and yi denotes the class label associated with xi. A bootstrap sample Sb* = {(x1*, y1*), (x2*, y2*), …, (xn*, yn*)} consists of n observations sampled from S with replacement, where 1 ≤ b ≤ B, and B is a constant between 50 and 200. For each sample Sb*, a k-fold cross-validation was carried out and a performance measure cb, such as the error rate, was calculated on Sb*. The procedure was repeated B times and the average performance measure CB = (Σ_{b=1}^{B} cb)/B was evaluated over the B bootstrap samples. Since the distribution of the bootstrap performance measures was approximately normal, the confidence interval and significance were estimated accordingly.
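A minimal sketch of this BCV scheme is given below; the SVM classifier and accuracy score are placeholders for whichever classifier and performance measure cb are being compared, and X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def bootstrap_cv(X, y, B=200, k=10, seed=0):
    """Average a k-fold cross-validation measure over B bootstrap samples (C_B)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    measures = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                    # draw n observations with replacement
        Xb, yb = X[idx], y[idx]
        c_b = cross_val_score(SVC(), Xb, yb, cv=k).mean()   # k-fold CV on the bootstrap sample
        measures.append(c_b)
    measures = np.asarray(measures)
    return measures.mean(), measures                        # C_B and the B individual measures
```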
Performance measures
In this study, the performance measures included precision, recall, F-measure (F1), Matthews correlation coefficient (MCC), and top percentage of essential proteins. Their formulas are given as follows:
Precision: TP/(TP + FP)
Recall: TP/(TP + FN)
F-measure: 2 × precision × recall/(precision + recall)
MCC: (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Top percentage of essential proteins: TP/n.
Here, an essential protein is represented by the positive observation. True positive (TP), true negative (TN), false positive (FP), and false negative (FN) represent the numbers of true positive, true negative, false positive, and false negative proteins, respectively. The value n denotes the total number of predictions. In addition, receiver operating characteristic (ROC) curve18 and area under curve (AUC) were used to evaluate the classification performance.
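For reference, a small sketch that computes the first four measures directly from the confusion-matrix counts (assuming none of the denominators is zero):

```python
import math

def performance_measures(tp: int, tn: int, fp: int, fn: int):
    """Precision, recall, F-measure, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f_measure, mcc
```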
Feature extraction
The feature set we used included sequence properties (S), such as amino acid occurrence and average amino acid PSSM; protein properties (P), such as cell cycle and metabolic process; topological properties (T), such as the bit string of the double screening scheme and betweenness centrality related to physical interactions; and other properties (O), such as phyletic retention and essential index. There were a total of 45 groups and 90 features in the S. cerevisiae data set. For the E. coli data set, there were 35 groups and 80 features. All names and sources are shown in Table 1. Only the bit string of the double screening scheme is described in detail below.
Table 1.
ID | Property name | Type | Size | Sub-names | S. cere | E. coli |
---|---|---|---|---|---|---|
1 | Amino acid occurrence20 | S | 20 | A … Y | • | • |
2 | Average amino acid PSSM20 | S | 20 | A … Y | • | • |
3 | Average cysteine position20 | S | 1 | • | • | |
4 | Average distance of every two cysteines20 | S | 1 | • | • | |
5 | Average hydrophobic20 | S | 1 | • | • | |
6 | Average hydrophobicity around cysteine20 | S | 4 | 1 … 4 | • | • |
7 | Cysteine count20 | S | 1 | • | • | |
8 | Cysteine location20 | S | 5 | 1 … 5 | • | • |
9 | Cysteine odd-even index20 | S | 1 | • | • | |
10 | Protein length20 | S | 1 | • | • | |
11 | Cell cycle5 | P | 1 | • | ||
12 | Cytoplasm5 | P | 1 | • | ||
13 | Endoplasmic reticulum5 | P | 1 | • | ||
14 | Metabolic process5 | P | 1 | • | ||
15 | Mitochondrion5 | P | 1 | • | ||
16 | Nucleus5 | P | 1 | • | ||
17 | Other process5 | P | 1 | • | ||
18 | Other localization5 | P | 1 | • | ||
19 | Signal transduction5 | P | 1 | • | ||
20 | Transport5 | P | 1 | • | ||
21 | Transcription5 | P | 1 | • | ||
22 | Betweenness centrality related to all interactions41 | T | 1 | • | • | |
23 | Betweenness centrality related to metabolic interactions5 | T | 1 | • | ||
24 | Betweenness centrality related to physical interactions5 | T | 1 | • | • | |
25 | Betweenness centrality related to transcriptional regulation interactions5 | T | 1 | • | |
26 | Bit string of double screening scheme [this paper] | T | 1 | • | • | |
27 | Bottleneck8,41 | T | 1 | • | • | |
28 | Clique level7 | T | 1 | • | • | |
29 | Closeness centrality42 | T | 1 | • | • | |
30 | Clustering coefficient7 | T | 1 | • | • | |
31 | Degree related to all interactions43 | T | 1 | • | • | |
32 | Degree related to physical interactions5 | T | 1 | • | • | |
33 | Density of maximum neighborhood component4 | T | 1 | • | • | |
34 | Edge percolated component9 | T | 1 | • | • | |
35 | Indegree related to metabolic interaction5 | T | 1 | • | ||
36 | Indegree related to transcriptional regulation5 | T | 1 | • | ||
37 | Maximum neighborhood component4 | T | 1 | • | • | |
38 | Neighbors’ intra-degree7 | T | 1 | • | • | |
39 | Outdegree related to metabolic interaction5 | T | 1 | • | ||
40 | Outdegree related to transcriptional regulation interaction5 | T | 1 | • | ||
41 | Betweenness centrality related to integrated functional interaction22 | T | 1 | • | ||
42 | Betweenness centrality related to integrated PI and GC network22 | T | 1 | • | ||
43 | Degree related to integrated functional interaction22 | T | 1 | • | ||
44 | Degree related to integrated PI and GC network22 | T | 1 | • | ||
45 | Common function degree7 | O | 1 | • | ||
46 | Essential index7 | O | 1 | • | ||
47 | Identicalness5 | O | 1 | • | ||
48 | Open reading frame length7 | O | 1 | • | • | |
49 | Phyletic retention21 | O | 1 | • | • | |
50 | Number of paralogous genes21 | O | 1 | • | |
51 | Codon Adaptation Index (CAI)21,44 | O | 1 | • | ||
52 | Codon Bias Index (CBI)21,44 | O | 1 | • | ||
53 | Frequency of optimal codons21,44 | O | 1 | • | ||
54 | Aromaticity score21,44 | O | 1 | • | ||
55 | Leading strand of the circular chromosome21 | O | 1 | • | ||
Total | | | 100 | | 90 | 80 |
Notes: S. cere and E. coli mean Saccharomyces cerevisiae and Escherichia coli data sets, respectively. For topological features, if not particularly mentioned, they are related to physical interactions. Due to coverage or availability issues, we adopted different features for the S. cere and E. coli data sets. For example, interactions in the E. coli data set contain integrated functional, PI, and GC network information, while those in S. cere include metabolic, transcriptional regulation, and PI network information.
Abbreviations: GC, genomic context; PI, physical interactions.
The remaining features are detailed in the Appendix.
Lin et al26 and Chin4 proposed the double screening scheme. They used multiple ranking scores to sort essential proteins. The drawback is that each protein does not have a unique score. Thus, we propose a bit string implementation to incorporate these two properties into a single score.
An example of our bit string implementation is shown in Tables 2 and 3. Suppose that four proteins, W, X, Y, and Z, are to be ranked. In the first iteration, we want to find the top one protein. We first select the top 2 proteins using ranking method A, which are W and X. Next, we use method B to rank these two proteins; the ranks of W and X are 2 and 1, respectively. Hence, in the first iteration, X is selected, so the bit M[X, 1] is set to 1, and the others, M[W, 1], M[Y, 1], and M[Z, 1], are set to 0. In the second iteration, the top 2 proteins are to be found. First, the four proteins W, X, Y, and Z are selected because they are the top 4 proteins according to ranking method A. Next, with ranking method B, we select the top 2 proteins from them, which are X and Y. Hence, the bits M[X, 2] and M[Y, 2] are set to 1, and the others are set to 0 in this iteration. Finally, we sum up the bits of each protein, as shown in the fourth column of Table 3.
Table 2.
Protein name | Rank by method A (DMNC) | Rank by method B (MNC)
---|---|---
W | 1 | 4 |
X | 2 | 2 |
Y | 3 | 1 |
Z | 4 | 3 |
Table 3.
Protein name | 1st iteration | 2nd iteration | Sum of bit string | n − r | Sum
---|---|---|---|---|---
W | 0 | 0 | 0 | 0 | 0 |
X | 1 | 1 | 2 | 2 | 4 |
Y | 0 | 1 | 1 | 3 | 4 |
Z | 0 | 0 | 0 | 1 | 1 |
There is still an issue with the bit string implementation: M may be too sparse to be handled by classifiers. Since the number of proteins selected in each iteration is at most around n/2, the sums over the roughly n/2 bits are close to 0 for many proteins. In our experience, this makes it difficult to distinguish between proteins. To overcome this problem, for each protein we added another score, n − r, to the sum of the bit string, where r is the rank of the protein under ranking method B. In this study, we used DMNC as ranking method A and MNC as ranking method B. In this example, n = 4, so the values of n − r for W, X, Y, and Z are 0, 2, 3, and 1, respectively. We summed these values with the bit strings; hence, the final scores are 0, 4, 4, and 1. The overall procedure is given in Procedure: Bit String Implementation of DSS.
Sequential backward feature selection method
SVM is a well-established tool for data analysis that has been shown to be useful in various fields, such as text summarization,27 intrusion detection,28 and image coding.29 In this study, we utilized the SVM software developed by Chang and Lin, called LIBSVM.23 To address the data imbalance, we propose a modified sequential backward feature selection method.
Since most data were nonessential, using accuracy alone as the objective or adopting conventional feature ranking schemes favored the negative data. As more and more features were excluded, overall accuracy declined. Since the number of negative observations was much larger than that of positive ones, the true-positive rate decreased faster than the true-negative rate. Features should therefore be selected such that most positive samples are correctly classified while the overall accuracy does not deteriorate too much. In this sense, rather than using accuracy alone to guide the feature selection, we used a composite score C as the objective function. The composite score was expressed in terms of precision (P), recall (R), F-measure (F), and MCC (M) as C = wP * P + wR * R + wF * F + wM * M. The four adjustable weights, wP, wR, wF, and wM, allow a compromise among the associated performance measures. An additional penalty was imposed on C so that scores associated with fewer features could compete with those obtained with more features. That is,

C(S) = wP * P + wR * R + wF * F + wM * M − e * (|S| − t) * u(|S| − t),

where S denotes the selected feature subset, and |S| and t denote the size of S and the goal number of features specified by the user, respectively. The unit step function u(|S| − t) = 0 for |S| − t ≤ 0 and u(|S| − t) = 1 otherwise. Finally, a threshold ρ was adopted to ensure that the improvement over feature changes was not due to randomness. The value of ρ was estimated by comparing the average score difference between feature subsets of sizes p and p + 1 in preliminary runs for several values of p. The value e denotes the penalty score incurred when an additional feature is selected. It is also specified by the user and should be slightly larger than ρ to encourage feature subsets of smaller sizes. The feature selection procedure is described in Procedure: Backward Feature Selection.
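A minimal sketch of this composite objective, assuming the per-feature penalty form written above; the function name and argument layout are illustrative only.

```python
def composite_score(P, R, F, M, subset_size, t, e,
                    wP=1.0, wR=1.0, wF=1.0, wM=1.0):
    """Composite objective C(S): weighted sum of precision, recall, F-measure, and MCC,
    minus a penalty of e for each feature beyond the goal size t (unit step u(|S| - t))."""
    score = wP * P + wR * R + wF * F + wM * M
    excess = subset_size - t
    if excess > 0:          # u(|S| - t) = 1 only when |S| > t
        score -= e * excess
    return score
```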
Procedure: Bit String Implementation of DSS
input: P: proteins for ranking, where |P| = n
       A: ranking methods, where |A| = m = 2
       M: bit matrix of size n × ⌊n/2⌋, initialized to 0
output: protein ranking scores R
begin
  for i = 1 to n do R[i] = 0;
  for i = 1 to ⌊n/2⌋ do
    T1 = top 2i proteins ranked by A(1) in P;
    T2 = top i proteins ranked by A(2) in T1;
    foreach x ∈ T2 do M[x, i] = 1;
  end
  for i = 1 to n do
    for j = 1 to ⌊n/2⌋ do R[i] = R[i] + M[i, j];
  end
  T = protein ranks given by A(m);
  for i = 1 to n do R[i] = R[i] + (n − T[i]);
end
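For illustration, the following is a runnable Python sketch of this procedure; the list-based input format (protein orderings from best to worst for methods A and B) is an assumption of this example.

```python
def bit_string_dss(proteins, order_a, order_b):
    """Bit-string double screening scheme.

    order_a, order_b: protein names ordered from best to worst by ranking
    methods A (e.g. DMNC) and B (e.g. MNC). Returns {protein: final score}.
    """
    n = len(proteins)
    scores = {p: 0 for p in proteins}
    rank_b = {p: i + 1 for i, p in enumerate(order_b)}     # 1-based rank under method B
    for i in range(1, n // 2 + 1):
        t1 = order_a[:2 * i]                               # top 2i proteins by method A
        t2 = sorted(t1, key=lambda p: rank_b[p])[:i]       # top i of those by method B
        for p in t2:
            scores[p] += 1                                 # bit M[p, i] = 1
    for p in proteins:
        scores[p] += n - rank_b[p]                         # add n - r
    return scores

# Worked example from Tables 2 and 3: final scores W=0, X=4, Y=4, Z=1.
print(bit_string_dss(["W", "X", "Y", "Z"],
                     order_a=["W", "X", "Y", "Z"],         # DMNC ranks 1..4
                     order_b=["Y", "X", "Z", "W"]))        # MNC ranks 1..4
```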
Experimental procedure and results
For comparison purposes, we used two feature selection methods: mRMR and CMIM. In the S. cerevisiae data set, our results were also compared to those of Acencio and Lemke5 and Hwang et al.7 For the E. coli data set, we also compared our results with those of Gustafson et al.21
Procedure: Backward Feature Selection
input: e: penalty score
       k: number of folds for cross-validation
       r: maximal number of rounds to retry
       S0: set of all available features, where |S0| = n
       t: goal number of selected features
       wP, wR, wF and wM: weights of performance measures
       ρ: minimal improvement to proceed
output: selected feature subset S
data: 50% of all available data elements
begin
  S = S0;
  if t ≥ n then stop else m = n − t;
  for p = 1 to m do
    improved = False;
    for i = 1 to r do
      Conduct k-fold CV and calculate C(S);
      foreach s ∈ S do Conduct k-fold CV and calculate C(S − {s});
      q = argmax_{s∈S} {C(S − {s}) − C(S)};
      if C(S − {q}) − C(S) ≥ ρ then
        S = S − {q}; improved = True; break;
      end
    end
    if not improved then stop and output S;
  end
end
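A simplified Python sketch of the selection loop above; `evaluate` stands for the k-fold cross-validation that returns the composite score C(S), and is assumed to use a fresh random fold partition on each call, which is why the retry loop can help a removal clear the threshold.

```python
def backward_feature_selection(all_features, evaluate, rho, t, max_retries):
    """Greedy backward selection: drop the feature whose removal gives the
    largest composite-score gain, as long as that gain is at least rho."""
    S = set(all_features)
    while len(S) > t:
        improved = False
        for _ in range(max_retries):                         # retry with fresh CV partitions
            base = evaluate(S)
            gains = {f: evaluate(S - {f}) - base for f in S}
            best, gain = max(gains.items(), key=lambda kv: kv[1])
            if gain >= rho:
                S = S - {best}
                improved = True
                break
        if not improved:
            break
    return S
```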
Experimental procedure
The overall procedure of our experiments is illustrated in Figure 1 and is described as follows.
Stage 1: Determine benchmark feature set
For the S. cerevisiae data set, we used Hwang’s feature set as the benchmark. For the E. coli data set, we used Gustafson’s feature set as the benchmark. These two feature sets are considerably effective for various performance measures.
Stage 2: Tune SVM parameters for best performance
For the above two feature sets, we first ran the SVM software using the feature sets of Hwang or Gustafson and tuned the SVM parameters to achieve the highest average performances.
Stage 3: Adopt best performances as reference performances
After determining the best SVM parameters for the feature sets of Hwang and Gustafson, we recorded the SVM parameters and results. To compare these results with those of the other models, such as the models obtained using our method, mRMR, and CMIM, we used the same SVM software and adjusted the cost parameters of the SVM in order to achieve similar levels of precision.
Stage 4: Perform feature selection
We randomly chose 50% of available data. Next, the backward feature selection procedure was applied to these selected data. In the beginning of our feature selection procedure, we imposed no penalty on the score calculation. Hence, the procedure attempts to achieve the highest score. In the subsequent runs, we added penalties for feature sizes to the score calculation. Subsets with smaller feature size but only slightly inferior in performance were selected. To compare our results with those of other methods, we also used the mRMR and CMIM feature ranking methods and chose subsets as in Procedure backward feature selection.
Stage 5: Perform 10-fold and bootstrap cross-validations
The data were prepared in both balanced and imbalanced manners. For each data set, we randomly partitioned all data into 10 disjoint groups and used the feature subsets selected in the previous stage to calculate various performance measures. The 10-fold cross-validation was repeated 10 times and the average performance measures were computed. Next, a bootstrap sampling procedure was conducted and 200 bootstrap samples were produced, including both balanced and imbalanced samples. Each bootstrap sample was also partitioned for 10-fold cross-validations and performance measures were calculated. Note that all models were examined with the same sets of data partitions for both conventional and bootstrap cross-validations.
Stage 6: Perform significance tests
Once bootstrap cross validations were carried out, the significance tests were adopted accordingly. In addition to the average values of AUC, precision, recall, F-measure, and MCC, we conducted a statistical significance test for these performance measures. Additionally, we calculated ROC curves and top percentage values for imbalanced experiments.
Backward feature selection and mRMR/CMIM feature ranking
We used 50% of the available data elements for feature selection. Taking the S. cerevisiae data set as an example, only (3840 + 975) × 50% of the observations were randomly chosen for the Backward Feature Selection procedure. During the procedure, several performance scores were calculated by means of k-fold cross-validations. In the first run, the parameters were set as follows: k = 2, wP = 1, wR = 1, wF = 1, wM = 1, ρ = 0.005, e = 0, t = 0, and r = 5. Since all associated weights were equal, the procedure sought the best compromise among all performance measures. Setting t = e = 0 means that no goal number of selected features was imposed, giving the procedure the opportunity to exploit all available feature combinations to achieve the best performance. In this initial run, we obtained a feature subset of size 18.
For the subsequent runs, the value of t was decreased, starting from 17 (= 18 − 1), until performances were significantly worse than those of Hwang et al.7 In order to obtain feature subsets of reduced sizes, the parameters were set as follows: k = 2, wP = 1.03, wR = 1, wF = 1, wM = 1, ρ = 0.005, e = 0.01, and r = 5. The reason for setting wP = 1.03 was to prevent the true positive rate from decreasing too much. In addition, setting e > ρ allowed the procedure to proceed toward fewer features. These settings were used to encourage the selection of reduced feature subsets.
For each setting of t, we executed the procedure 10 times with different k-fold partitions and obtained 10 feature subsets of the same size. Since these 10 resultant feature subsets were slightly different, we performed another 5-fold cross-validation with these feature subsets and compared their performance scores. The one with the highest score was finally preserved as our feature subset.
In addition to the methods of Hwang et al7 and Gustafson et al,21 we also used the mRMR13 and CMIM14 feature selection methods for comparison. Using mRMR as an example, the data used in our feature selection procedure were input into the mRMR program, which produced a ranking score for each feature. The feature with the lowest score was removed first, and a subsequent 5-fold cross-validation with the preserved features was performed to calculate the composite score C(S). That is, in the ith iteration, the features with the lowest i ranking scores were removed and C(S) was calculated. The removal and cross-validation procedure was repeated until no feature was preserved. The entire process (including the random choice of 50% of the data and feature removal) was executed 10 times, and for each subset size the feature subset with the highest score was recorded.
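A small sketch of this ranking-based evaluation, assuming the features are already ordered by their mRMR (or CMIM) scores and that `evaluate` performs the 5-fold cross-validation returning C(S):

```python
def evaluate_ranked_subsets(features_by_score, evaluate):
    """features_by_score: features ordered from lowest to highest ranking score.
    At step i, the i lowest-scoring features are removed and the rest are scored."""
    results = []                                    # (subset size, subset, score)
    for i in range(1, len(features_by_score)):      # stop before the subset becomes empty
        remaining = features_by_score[i:]           # drop the i lowest-ranked features
        results.append((len(remaining), tuple(remaining), evaluate(remaining)))
    return results
```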
Table 4 shows the selected feature subsets of different sizes for the S. cerevisiae data. The second column of the table lists all selected features. Each Ni in the first row represents the feature subset of size i, 4 ≤ i ≤ 18, found using our backward feature selection procedure. For each feature subset Ni, a bullet (•) in the same column indicates that the corresponding feature was included. The most competent feature subsets selected by CMIM and mRMR are denoted by C32 and m31, meaning that 32 and 31 features were selected, respectively. A total of 60 features were selected across all subsets; these represent the prominent properties used to identify essential proteins.
Table 4.
Feature | N5 | N6 | N7 | N8 | N9 | N10 | N11 | N12 | N13 | N14 | N15 | N16 | N17 | N18 | m31 | C32 | TOT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | PR (phyletic retention) | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | 16 |
2 | EI (essentiality index) | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | 16 |
3 | Cytoplasm | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | 15 | |
4 | Nucleus | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | 15 | |
5 | Occurrence of A.A. I | • | • | • | • | • | • | • | • | • | • | • | • | • | 13 | |||
6 | Bit string of DSS | • | • | • | • | • | • | • | • | • | • | • | • | 12 | ||||
7 | Occurrence of A.A. W | • | • | • | • | • | • | • | • | • | • | • | • | 12 | ||||
8 | Endoplasmic reticulum | • | • | • | • | • | • | • | • | • | • | • | 11 | |||||
9 | Other process | • | • | • | • | • | • | • | 7 | |||||||||
10 | Occurrence of A.A. S | • | • | • | • | • | • | • | 7 | |||||||||
11 | Occurrence of A.A. G | • | • | • | • | • | • | 6 | ||||||||||
12 | KLV (clique level) | • | • | • | • | • | • | 6 | ||||||||||
13 | Cell cycle | • | • | • | • | • | 5 | |||||||||||
14 | Average hydrophobic | • | • | • | • | • | 5 | |||||||||||
15 | Average PSSM of A.A. R | • | • | • | • | • | 5 | |||||||||||
16 | B.C. related to PI | • | • | • | • | 4 | ||||||||||||
17 | Occurrence of A.A. E | • | • | • | • | 4 | ||||||||||||
18 | Average PSSM of A.A. P | • | • | • | • | 4 | ||||||||||||
19 | ID related to T.R. | • | • | • | 3 | |||||||||||||
20 | B.C. T.R. interactions | • | • | • | 3 | |||||||||||||
21 | Other localization | • | • | • | 3 | |||||||||||||
22 | DMNC | • | • | • | 3 | |||||||||||||
23 | Average HYD around C-2 | • | • | • | 3 | |||||||||||||
24 | Signal transduction | • | • | 2 | ||||||||||||||
25 | Edge percolated component | • | • | 2 | ||||||||||||||
26 | Occurrence of A.A. P | • | • | 2 | ||||||||||||||
27 | Occurrence of A.A. T | • | • | 2 | ||||||||||||||
28 | Occurrence of A.A. Y | • | • | 2 | ||||||||||||||
29 | Average PSSM of A.A. Q | • | • | 2 | ||||||||||||||
30 | Average PSSM of A.A. E | • | • | 2 | ||||||||||||||
31 | CLC (clustering coefficient) | • | • | 2 | ||||||||||||||
32 | FunK (common function degree) | • | • | 2 | ||||||||||||||
33 | OD related to T.R. interaction | • | 1 | |||||||||||||||
34 | OD related to M.I. | • | 1 | |||||||||||||||
35 | ID related to M.I. | • | 1 | |||||||||||||||
36 | B.C. related to M.I. | • | 1 | |||||||||||||||
37 | Degree related to PI | • | 1 | |||||||||||||||
38 | Metabolic process | • | 1 | |||||||||||||||
39 | Bottleneck | • | 1 | |||||||||||||||
40 | MNC | • | 1 | |||||||||||||||
41 | Occurrence of A.A. A | • | 1 | |||||||||||||||
42 | Occurrence of A.A. C | • | 1 | |||||||||||||||
43 | Occurrence of A.A. D | • | 1 | |||||||||||||||
44 | Occurrence of A.A. H | • | 1 | |||||||||||||||
45 | Occurrence of A.A. K | • | 1 | |||||||||||||||
46 | Occurrence of A.A. M | • | 1 | |||||||||||||||
47 | Average C position | • | 1 | |||||||||||||||
48 | Protein length | • | 1 | |||||||||||||||
49 | Cysteine count | 1 | ||||||||||||||||
50 | Cysteine odd-even index | • | • | 1 | ||||||||||||||
51 | Average HYD around C-1 | • | 1 | |||||||||||||||
52 | Cysteine location-1 | • | 1 | |||||||||||||||
53 | Average PSSM of A.A. A | • | 1 | |||||||||||||||
54 | Average PSSM of A.A. D | • | 1 | |||||||||||||||
55 | Average PSSM of A.A. S | • | 1 | |||||||||||||||
56 | Average PSSM of A.A. W | • | 1 | |||||||||||||||
57 | Average PSSM of A.A. Y | • | 1 | |||||||||||||||
58 | ORFL (ORF length) | • | 1 | |||||||||||||||
59 | CC (closeness centrality) | • | 1 | |||||||||||||||
60 | BC (B.C.) | • | 1 |
Abbreviations: DSS, double screening scheme; A.A., amino acid; B.C., betweenness centrality; T.R., transcriptional regulation; HYD, hydrophobicity; PI, physical interaction; A … Y, amino acid abbreviation; M.I., metabolic interaction; OD, outdegree; ID, indegree; m31, mRMR31; C32, CMIM32; FunK, Common function degree; TOT, total.
After the feature subsets were selected, to conduct the performance comparison as well as to cope with randomness, we used Hwang's method to perform ten 10-fold cross-validations. Here, the true positive rates and false positive rates were input into a different software program to calculate ROC curves and AUC values. In this study, the software package we used was ROCR, developed by Sing et al.30,31 Thus, the reported performance measures, including AUC, F1, MCC, precision, and recall values and ROC curves, were averaged over the ten 10-fold cross-validations.
For the S. cerevisiae data set, the predictor with Hwang's 10 features served as a benchmark because it yielded excellent results in terms of both feature size and performance. Additionally, mRMR and CMIM were adopted for comparison.
We applied the same procedure to the E. coli data set. The selected features are shown in Table 5; a total of 43 features were selected. In the table, the feature subsets selected by CMIM and mRMR are denoted by C9 and m13, meaning that 9 and 13 features were selected, respectively.
Table 5.
Feature | N4 | N5 | N6 | N7 | N8 | N9 | N10 | N11 | N12 | N13 | C9 | m13 | TOT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | PR (phyletic retention) | • | • | • | • | • | • | • | • | • | • | • | • | 12 |
2 | Open reading frame length | • | • | • | • | • | • | • | • | 8 | ||||
3 | Average PSSM of A.A. C | • | • | • | • | • | • | 6 | ||||||
4 | Degree related to F.I. | • | • | • | • | • | • | 6 | ||||||
5 | Degree related to A.I. | • | • | • | • | • | 5 | |||||||
6 | Degree related to PI | • | • | • | • | • | 5 | |||||||
7 | Average PSSM of A.A. A | • | • | • | • | 4 | ||||||||
8 | Average PSSM of A.A. R | • | • | • | • | 4 | ||||||||
9 | Average hydrophobic | • | • | • | • | 4 | ||||||||
10 | Bit string of DSS for PI | • | • | • | • | 4 | ||||||||
11 | Paralog count | • | • | • | • | 4 | ||||||||
12 | Occurrence of A.A. M | • | • | • | 3 | |||||||||
13 | Occurrence of A.A. W | • | • | • | 3 | |||||||||
14 | Occurrence of A.A. E | • | • | 2 | ||||||||||
15 | Occurrence of A.A. F | • | • | 2 | ||||||||||
16 | Occurrence of A.A. G | • | • | 2 | ||||||||||
17 | Occurrence of A.A. I | • | • | 2 | ||||||||||
18 | Average PSSM of A.A. Y | • | • | 2 | ||||||||||
19 | Cysteine location-4 | • | • | 2 | ||||||||||
20 | KLV (clique level) for PI | • | • | 2 | ||||||||||
21 | Degree related to PI and GC | • | • | 2 | ||||||||||
22 | Strand bias | • | • | 2 | ||||||||||
23 | Occurrence of A.A. A | • | 1 | |||||||||||
24 | Occurrence of A.A. C | • | 1 | |||||||||||
25 | Occurrence of A.A. H | • | 1 | |||||||||||
26 | Occurrence of A.A. P | • | 1 | |||||||||||
27 | Occurrence of A.A. S | • | 1 | |||||||||||
28 | Average PSSM of A.A. N | • | 1 | |||||||||||
29 | Average PSSM of A.A. G | • | 1 | |||||||||||
30 | Average PSSM of A.A. K | • | 1 | |||||||||||
31 | Average PSSM of A.A. F | • | 1 | |||||||||||
32 | Average PSSM of A.A. T | • | 1 | |||||||||||
33 | Average PSSM of A.A. V | • | 1 | |||||||||||
34 | Average distance of every two Cs | • | 1 | |||||||||||
35 | Average HYD around C-2 | • | 1 | |||||||||||
36 | Cysteine location-1 | • | 1 | |||||||||||
37 | Cysteine location-5 | • | 1 | |||||||||||
38 | Cysteine odd-even index | • | 1 | |||||||||||
39 | Protein length | • | 1 | |||||||||||
40 | Bottleneck for PI | • | 1 | |||||||||||
41 | CC (closeness centrality) for PI | • | 1 | |||||||||||
42 | MNC for PI | • | 1 | |||||||||||
43 | B.C. related to all F.I. | • | 1 |
Abbreviations: C9, CMIM09; m13, mRMR13; TOT, total; DSS, double screening scheme; F.I., integrated functional interaction; A.I., all interactions. PI, physical interaction; HYD, hydrophobicity; A.A., amino acid; A … Y, amino acid abbreviation.
Bootstrap cross validations
During the bootstrapping stage, for each bootstrap sample, an identical 10-fold partition was employed for all feature subsets to carry out cross-validations and compute various average performance measures. The procedure was repeated for 200 distinct bootstrap samples. In order to perform parametric significance tests, we evaluated whether the distribution of the resultant performance measures was normal and whether the variances obtained from different feature subsets were similar. Consequently, the 200 results of each performance measure for each feature subset were subjected to the Kolmogorov-Smirnov test.31 This test examines the null hypothesis that no systematic difference exists between the standard normal distribution and the underlying distribution against the alternative that asserts a systematic difference. The threshold was set to 0.05; if the P-value was less than 0.05, we rejected the null hypothesis. For CMIM and mRMR, only the most prominent values are shown. Figures 2 and 3 illustrate the results for the S. cerevisiae and E. coli data sets, respectively, in which the test values were recorded according to feature subset, performance measure, and experiment type. For the S. cerevisiae data set, the lowest P-value, 0.186, was observed for the AUC of the N8-imbalanced experiment. Therefore, it is likely that there was no significant difference between the normal distribution and the distribution of any performance measure of any feature subset. For the E. coli data set, most performance measures were normal, with the exception of the recall values associated with N12-balanced (P-value = 0.035), for which comparison results were not reliable. For the performance measures associated with each model, we also list their confidence intervals and information odds ratios,32 which are shown in the Appendix.
For a given performance measure, since the variances obtained with various feature subsets were quite similar, we used an analysis of variance (ANOVA)33 test to examine whether differences existed among the performance measures of different feature subsets. Here, one variance was obtained from the multiple experiments with each feature subset. According to the ANOVA, differences did exist. Next, all of these measures were compared with their associated benchmark to calculate performance deviations. The average deviation corresponding to each type of performance measure was then examined; significance was determined by whether its 95% confidence interval covered 0.
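For illustration, both tests can be carried out with SciPy as sketched below; standardizing the bootstrap values before the Kolmogorov-Smirnov comparison with the standard normal is an assumption of this sketch.

```python
import numpy as np
from scipy import stats

def ks_normality_pvalue(values):
    """KS test of the standardized bootstrap measures against the standard normal."""
    z = (np.asarray(values) - np.mean(values)) / np.std(values, ddof=1)
    return stats.kstest(z, "norm").pvalue        # reject normality if < 0.05

def anova_pvalue(*groups):
    """One-way ANOVA across the bootstrap measures of several feature subsets."""
    return stats.f_oneway(*groups).pvalue        # reject equal means if < 0.05
```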
Performance comparison and significance tests
In this section, we compare our experimental results with those of other feature selection methods and previous studies. For conciseness, we show only the most prominent results associated with mRMR and CMIM. We observed that the feature subsets identified by these two methods were relatively large. Comparisons involving their smaller feature subsets, together with their working principles, are detailed in the Appendix.
S. cerevisiae
Table 6 lists the average values of five performance measures associated with a variety of feature subsets, obtained by ten 10-fold cross-validations for the imbalanced data. We adjusted the SVM cost parameters in order to achieve similar levels of precision. The first four rows show the results of CMIM32 (32 features), mRMR31 (31 features), Hwang's (10 features), and Acencio's (23 features); the values enclosed in parentheses after these names represent the numbers of features. Results produced by our method are listed in the subsequent rows of the table. Significance tests were carried out with the bootstrap cross-validations over 200 bootstrap samples. The first three symbols following each numerical value, which can be plus (+) or minus (−), indicate that the value is significantly higher or lower than the corresponding benchmark result; the benchmarks themselves are marked by star (*) symbols for clarity. For example, the recall of N6 was significantly higher than that of Hwang, while its AUC was significantly lower than those of mRMR31 and Hwang. For the feature subsets prefixed with 'N', the fourth symbol following each numerical value indicates the significance with respect to the neighboring rows. For example, for N7, the AUC was significantly higher than that of N6 and the recall was significantly higher than that of N8. For values of the same performance measure in each column, the best is underlined. The last row shows the results with the full set of 90 features.
Table 6.
AUC | Precision | Recall | F-measure | MCC | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CMIM32 | 0.825 | * | (+) | (+) | 0.744 | * | (+) | 0.369 | * | (+) | 0.493 | * | (+) | 0.450 | * | (+) | |||||||||
mRMR31 | 0.821 | (−) | * | (+) | 0.738 | * | 0.372 | * | (+) | 0.495 | * | (+) | 0.449 | * | (+) | ||||||||||
Hwang(10) | 0.775 | (−) | (−) | * | 0.743 | (−) | * | 0.343 | (−) | (−) | * | 0.469 | (−) | (−) | * | 0.432 | (−) | (−) | * | ||||||
Acencio(23) | 0.707 | (−) | (−) | (−) | 0.675 | (−) | (−) | (−) | 0.121 | (−) | (−) | (−) | 0.204 | (−) | (−) | (−) | 0.228 | (−) | (−) | (−) | |||||
N4 | 0.744 | (−) | (−) | (−) | 0.782 | 0.327 | (−) | (−) | (−) | 0.461 | (−) | (−) | (−) | 0.439 | (−) | (−) | |||||||||
N5 | 0.727 | (−) | (−) | (−) | 0.741 | (−) | (−) | 0.387 | (−) | (−) | (+) | 0.509 | (−) | (−) | (+) | 0.461 | (−) | (−) | (+) | ||||||
N6 | 0.730 | (−) | (−) | (−) | 0.752 | (−) | 0.395 | (−) | (−) | (+) | 0.518 | (−) | (−) | 0.472 | (−) | (−) | |||||||||
N7 | 0.761 | (−) | (−) | (+) | 0.767 | (−) | 0.386 | (−) | (−) | (+) | 0.513 | (−) | (−) | (+) | 0.473 | (−) | (−) | ||||||||
N8 | 0.772 | (−) | (−) | 0.755 | 0.371 | (−) | (−) | (−) | 0.498 | (−) | (−) | (+) | (−) | 0.457 | (−) | (−) | (−) | ||||||||
N9 | 0.782 | (−) | (−) | (+) | 0.749 | 0.382 | (−) | (+) | (+) | 0.506 | (−) | (+) | 0.462 | (−) | (+) | ||||||||||
N10 | 0.781 | (−) | (−) | (+) | 0.751 | 0.399 | (+) | 0.521 | (+) | 0.474 | (+) | ||||||||||||||
N11 | 0.786 | (−) | (−) | (+) | 0.752 | 0.402 | (+) | 0.524 | (+) | 0.476 | (+) | ||||||||||||||
N12 | 0.798 | (−) | (−) | (+) | 0.759 | 0.409 | (+) | 0.532 | (+) | 0.485 | (+) | ||||||||||||||
N13 | 0.789 | (−) | (−) | (+) | 0.748 | 0.433 | (+) | (+) | 0.549 | (+) | 0.495 | (+) | |||||||||||||
N14 | 0.802 | (−) | (+) | 0.749 | 0.397 | (+) | (−) | 0.519 | (+) | (−) | 0.471 | (+) | (−) | ||||||||||||
N15 | 0.801 | (−) | (+) | 0.763 | 0.406 | (+) | 0.530 | (+) | 0.485 | (+) | (+) | ||||||||||||||
N16 | 0.814 | (−) | (+) | (+) | 0.762 | 0.401 | (+) | 0.525 | (+) | 0.480 | (+) | ||||||||||||||
N17 | 0.814 | (−) | (+) | 0.761 | 0.407 | (+) | 0.530 | (+) | 0.484 | (+) | |||||||||||||||
N18 | 0.811 | (−) | (+) | 0.751 | 0.411 | (+) | 0.531 | (+) | 0.482 | (+) | |||||||||||||||
N90 | 0.829 | (+) | (+) | (+) | (+) | 0.738 | (+) | (+) | 0.355 | (+) | (+) | (−) | 0.479 | (+) | (+) | (+) | (−) | 0.438 | (+) | (+) | (+) | (−) |
Note: With the polynomial kernel function, the values of precision, recall and MCC are reported as 0.77, 0.23, and 0.36, respectively, in the original paper of Hwang et al.7
Based on Table 6, the CMIM32, mRMR31, and Hwang predictors outperformed Acencio's in all performance measures. For our feature subsets, the performance measures were slightly higher than Hwang's. For N8, there was no difference from Hwang's in AUC, while the remaining measures were higher than Hwang's. When the feature size exceeded 8, except for the precision values, the improvement over Hwang's was consistently significant in most cases. Compared with mRMR, our method performed nearly as well as mRMR31 when the feature size was between 9 and 13, with the exception of the AUC values. When the feature size ranged from 14 to 18, there was no performance difference between our model and mRMR31. The most prominent predictor was CMIM32; except for the AUC values, our results achieved similar levels of performance when the feature size exceeded 14. Note that the numbers of features in CMIM32 and mRMR31 were 32 and 31, respectively, which is much higher than ours.
Table 7 shows the average performance measures in balanced experiments of the S. cerevisiae data set, which were also obtained via 10 10-fold cross-validations. For those of our feature subsets with size ranging from 5 to 18, nearly all performance measures were the same as or higher than those of Hwang’s. This shows that the feature subsets with sizes exceeding 5 are at least as good as Hwang’s. Additionally, those with 12 or more features achieved significant improvement. Compared with CMIM32 and mRMR31, our results showed similar levels of performance when the size of features exceeded 15. The results with the full set of 90 features are shown in the last row, whose performance measures were similar to those from N14 to N18.
Table 7.
AUC | Precision | Recall | F-measure | MCC | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CMIM32 | 0.842 | * | (+) | 0.772 | * | 0.766 | * | (+) | 0.769 | * | (+) | 0.540 | * | (+) | |||||||||||
mRMR31 | 0.836 | * | (+) | 0.765 | * | 0.741 | * | (+) | 0.752 | * | (+) | 0.513 | * | (+) | |||||||||||
Hwang(10) | 0.822 | (−) | (−) | * | 0.778 | * | 0.720 | (−) | (−) | * | 0.748 | (−) | (−) | * | 0.516 | (−) | (−) | * | |||||||
Acencio(23) | 0.768 | (−) | (−) | (−) | 0.696 | (−) | (−) | (−) | 0.734 | (−) | (−) | 0.714 | (−) | (−) | (−) | 0.414 | (−) | (−) | (−) | ||||||
N4 | 0.811 | (−) | (−) | (−) | (+) | 0.777 | (−) | (+) | 0.716 | (−) | (−) | 0.745 | (−) | (−) | (+) | 0.512 | (−) | (−) | (+) | ||||||
N5 | 0.824 | (−) | (−) | (+) | 0.778 | (−) | 0.735 | (−) | (−) | (+) | 0.756 | (−) | (−) | 0.527 | (−) | (−) | |||||||||
N6 | 0.827 | (−) | (−) | (+) | 0.778 | (−) | 0.739 | (−) | (−) | 0.758 | (−) | (−) | 0.530 | (−) | (−) | ||||||||||
N7 | 0.831 | (−) | (−) | 0.779 | 0.733 | (−) | (−) | 0.755 | (−) | (−) | 0.526 | (−) | (−) | ||||||||||||
N8 | 0.826 | (−) | (−) | 0.786 | 0.721 | (−) | (−) | 0.752 | (−) | (−) | 0.527 | (−) | (−) | ||||||||||||
N9 | 0.833 | (−) | (−) | (+) | 0.791 | 0.735 | (−) | (−) | (+) | 0.762 | (−) | (−) | 0.541 | (−) | (−) | ||||||||||
N10 | 0.834 | (−) | (−) | 0.789 | 0.736 | (−) | (−) | 0.761 | (−) | (−) | 0.540 | (−) | (−) | ||||||||||||
N11 | 0.831 | (−) | (−) | 0.784 | 0.737 | (−) | (−) | 0.760 | (−) | (−) | 0.535 | (−) | (−) | ||||||||||||
N12 | 0.829 | (−) | (−) | 0.779 | 0.732 | 0.755 | (+) | 0.526 | (+) | ||||||||||||||||
N13 | 0.834 | (−) | (−) | 0.788 | 0.730 | (−) | (−) | 0.758 | (−) | (−) | 0.535 | (−) | (−) | ||||||||||||
N14 | 0.836 | (−) | (−) | (+) | 0.777 | 0.743 | (−) | (−) | (+) | 0.759 | (−) | (−) | (+) | 0.530 | (−) | (−) | |||||||||
N15 | 0.843 | (−) | (−) | (+) | 0.784 | 0.748 | (−) | (−) | (+) | 0.766 | (−) | (+) | 0.542 | (−) | (+) | ||||||||||
N16 | 0.842 | (+) | 0.777 | 0.756 | (+) | 0.767 | (+) | 0.540 | (+) | ||||||||||||||||
N17 | 0.847 | (+) | 0.778 | 0.763 | (+) | 0.770 | (+) | 0.545 | (+) | ||||||||||||||||
N18 | 0.840 | (−) | (−) | (+) | (−) | 0.779 | (−) | 0.740 | (−) | (−) | (+) | (−) | 0.759 | (−) | (−) | 0.531 | (−) | (−) | |||||||
N90 | 0.839 | (+) | (+) | 0.760 | 0.753 | (+) | 0.757 | (+) | 0.516 | (+) |
Note: In the original paper of Hwang et al7 the values of precision, recall, F-measure and MCC are reported as 0.763, 0.713, 0.737, and 0.492, respectively, with the polynomial kernel function.
In Table 6, we can observe that feature subsets N5, N7, N9, N13, N15, and N16 showed significant improvement in performance but were smaller in feature sizes when compared with neighboring rows. In Table 7, the significant subsets were N5, N6, and N9. In addition, as shown in Tables 6 and 7, our models performed equally well as CMIM32 and mRMR31 when the feature size was 16 or 17. We used N5, N9, and N16 to draw ROC curves.
E. coli
Tables 8 and 9 show the average values of five performance measures associated with a variety of feature subsets, obtained by ten 10-fold cross-validations for the imbalanced and balanced experiments, respectively. The first three rows show the results of CMIM09 (9 features), mRMR13 (13 features), and Gustafson's (29 features).
Table 8.
AUC | Precision | Recall | F1 | MCC | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CMIM09 | 0.701 | * | (−) | 0.720 | * | 0.271 | * | (−) | 0.394 | * | (−) | 0.382 | * | (−) | ||||||||||
mRMR13 | 0.715 | * | (−) | 0.713 | * | 0.250 | * | (−) | 0.370 | * | (−) | 0.360 | * | (−) | ||||||||||
Gustafson(29) | 0.711 | (+) | (+) | * | 0.720 | * | 0.290 | (+) | (+) | * | 0.420 | (+) | (+) | * | 0.413 | (+) | (+) | * | ||||||
N4 | 0.691 | (−) | (−) | (−) | 0.725 | 0.280 | (+) | (−) | (−) | 0.404 | (+) | (−) | 0.391 | (−) | ||||||||||
N5 | 0.690 | (−) | (−) | (−) | 0.737 | 0.295 | (+) | (+) | (+) | 0.421 | (+) | (+) | (−) | (+) | 0.407 | (+) | (−) | (+) | ||||||
N6 | 0.701 | (−) | 0.742 | 0.287 | (+) | (+) | 0.414 | (+) | (+) | 0.403 | (+) | (+) | ||||||||||||
N7 | 0.714 | (−) | 0.735 | 0.275 | (+) | (+) | (−) | 0.400 | (+) | (+) | (−) | 0.392 | (+) | |||||||||||
N8 | 0.705 | (−) | 0.742 | 0.288 | (+) | (+) | (−) | (+) | 0.415 | (+) | (+) | (−) | (+) | 0.405 | (+) | (−) | ||||||||
N9 | 0.707 | (−) | 0.726 | (−) | 0.293 | (+) | (+) | 0.417 | (+) | (+) | 0.401 | (+) | (+) | |||||||||||
N10 | 0.711 | (−) | 0.724 | 0.294 | (+) | (+) | 0.418 | (+) | (+) | 0.401 | (+) | (+) | ||||||||||||
N11 | 0.714 | (−) | 0.732 | 0.278 | (+) | (+) | (−) | 0.403 | (+) | (+) | 0.393 | (+) | (+) | |||||||||||
N12 | 0.712 | (+) | 0.725 | 0.292 | (+) | (+) | 0.416 | (+) | (+) | 0.400 | (+) | (+) | ||||||||||||
N13 | 0.714 | (+) | 0.733 | 0.287 | (+) | (+) | 0.413 | (+) | (+) | 0.400 | (+) | (+) | ||||||||||||
N80 | 0.716 | (+) | (+) | (+) | 0.677 | (+) | (+) | (+) | (−) | 0.237 | (+) | (+) | (−) | 0.352 | (+) | (+) | (−) | 0.339 | (+) | (+) | (+) | (−) |
Table 9.
AUC | Precision | Recall | F1 | MCC | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CMIM09 | 0.767 | * | (+) | 0.720 | * | 0.700 | * | (+) | 0.710 | * | (+) | 0.421 | * | |||||||||
mRMR13 | 0.762 | (−) | * | (−) | 0.728 | * | 0.654 | (−) | * | (−) | 0.689 | (−) | * | (−) | 0.396 | * | (−) | |||||
Gustafson(29) | 0.777 | (+) | * | 0.722 | * | 0.715 | (+) | * | 0.719 | (+) | * | 0.440 | (+) | * | ||||||||
N4 | 0.780 | (+) | 0.733 | 0.701 | (+) | 0.717 | (+) | 0.446 | (+) | |||||||||||||
N5 | 0.779 | (+) | 0.730 | 0.706 | (+) | 0.718 | (+) | 0.445 | (+) | |||||||||||||
N6 | 0.762 | (−) | (−) | 0.735 | 0.663 | (−) | (−) | 0.696 | 0.425 | |||||||||||||
N7 | 0.783 | (+) | (+) | 0.737 | 0.696 | (+) | 0.716 | (+) | (+) | 0.448 | (+) | |||||||||||
N8 | 0.781 | (+) | 0.723 | 0.711 | (+) | 0.717 | (+) | 0.439 | (+) | |||||||||||||
N9 | 0.782 | (+) | 0.715 | 0.703 | (+) | 0.709 | (+) | 0.423 | ||||||||||||||
N10 | 0.781 | (+) | 0.725 | 0.702 | (+) | 0.713 | (+) | 0.436 | (+) | |||||||||||||
N11 | 0.777 | (+) | 0.719 | 0.700 | (+) | 0.709 | (+) | 0.426 | ||||||||||||||
N12 | 0.776 | (+) | 0.715 | 0.695 | (+) | 0.705 | (+) | 0.418 | ||||||||||||||
N13 | 0.776 | (+) | 0.731 | 0.695 | (+) | 0.712 | (+) | 0.439 | (+) | |||||||||||||
N80 | 0.769 | 0.711 | 0.715 | (+) | (+) | 0.713 | (+) | 0.424 |
Table 8 shows that Gustafson's predictor outperformed CMIM09 in most performance measures in the imbalanced experiments. For our feature subsets, the performance measures were slightly higher than those of CMIM09. When the feature size exceeded 6, the improvement over CMIM09 was consistently significant. Comparing Gustafson's method with ours, our method performed almost as well as Gustafson's when the feature size exceeded 11. Note that the number of features in Gustafson's method was 29, which is higher than ours. In Table 9, except for the least effective predictor, mRMR13, there was almost no performance difference among the feature subsets in the balanced experiments. For further ROC analysis, in addition to CMIM09, mRMR13, and Gustafson's, we used N4, N8, N11, and N80 to draw ROC curves, allowing us to observe the performance of small, medium-sized, and full feature sets.
ROC analysis
S. cerevisiae
Figure 4 illustrates the average ROC curves and AUCs of various feature subsets for the imbalanced data experiments. Apart from the most competent predictor CMIM32, although the AUC of N5 is higher than that of Acencio’s, an intersection can be observed at 0.5 on the horizontal axis. This indicates that N5 was a better predictor when the allowed maximal false positive rate was below 0.5. In contrast, when the allowed false positive rate exceeded 0.5, Acencio’s was better than N5. Comparing N9 and Hwang’s method, both AUC values were similar. For the feature subsets with sizes exceeding 8 (not all shown in this figure), all true positive rates were either higher or at least close to Hwang’s. This was also supported by the significance tests in Table 6 and suggests that the feature subsets with sizes exceeding 8 achieved higher performance in AUC than Hwang’s predictor.
Figure 5 illustrates the average ROC curves and AUCs of various feature subsets for the balanced data experiments. CMIM32 again was the most competent predictor. Additionally, N16 also achieved the same level of AUC. For the feature subsets of sizes ranging from 5 to 18 (not all shown), their true positive rates were either higher or at least close to Hwang’s level. Thus, N5, N6, …, N18 outperformed or performed equally well for various combinations of true and false positive rates in the balanced experiments. Similarly to the imbalanced data set, the more features, the higher the AUC values. However, the improvement in AUC over the feature addition was not as significant as those in the imbalanced experiments. It should be noted that both the ROC curve and AUC of Acencio’s predictor were reproduced by our experiments and thus they were slightly different from the original values reported by Acencio and Lemke.5
E. coli
For the imbalanced data set, Figure 6 illustrates the average ROC curves and AUCs of various feature subsets. All curves were similar below the 10% false positive rate, indicating that there was little difference when the allowable false positive rate was less than 10%. Above 10%, N80 was the highest performer, Gustafson's and N11 were second, and N4 was the worst. In contrast, for the balanced data set (Figure 7), N4 and N8 were the best performers, and the remaining predictors showed few differences.
Top percentage analysis
S. cerevisiae
Table 10 shows the average top percentage information for the imbalanced data set. The top θ probability is defined as the ratio of the number of correctly predicted essential proteins among the top-ranked θ × 975 proteins, where 975 is the total number of true essential proteins. The top θ probability shows how likely the proteins are to be essential if the user decides to choose a specific number of top-ranked candidates. It differs slightly from precision because the top-ranked candidates (the denominator) are not necessarily classified as essential. The results of CMIM32, mRMR31, and Hwang again serve as benchmarks and are denoted by star (*) symbols in the table. The minus symbol following a value indicates that the value was lower than the corresponding benchmark result.
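A small sketch of this measure under our reading of the definition above; `scores` are a predictor's ranking scores and `is_essential` is the ground-truth label vector, both assumptions of this example.

```python
import numpy as np

def top_theta_probability(scores, is_essential, theta, n_essential=975):
    """Fraction of true essential proteins among the top theta * n_essential ranked candidates."""
    k = int(round(theta * n_essential))
    top_idx = np.argsort(scores)[::-1][:k]      # indices of the k highest-scoring proteins
    return np.asarray(is_essential)[top_idx].sum() / k
```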
Table 10. Average top percentage values for the S. cerevisiae imbalanced data set.

 | Top 5% | Top 10% | Top 15% | Top 20% | Top 25% | Top 30% | Top 50% | Top 75% | Top 100%
---|---|---|---|---|---|---|---|---|---
CMIM32 | 0.939––* | 0.918* | 0.910* | 0.892* | 0.870* | 0.839* | 0.743* | 0.645* | 0.582*
mRMR31 | 0.955–* | 0.905––* | 0.884–* | 0.862–* | 0.834––* | 0.820–* | 0.740–* | 0.641–* | 0.572–*
Hwang(10) | 0.959* | 0.918* | 0.871––* | 0.853––* | 0.843–* | 0.816––* | 0.720––* | 0.637––* | 0.563––*
Acencio(23) | 0.800––– | 0.741––– | 0.693––– | 0.661––– | 0.646––– | 0.625––– | 0.578––– | 0.519––– | 0.457–––
N4 | 0.980 | 0.930 | 0.905– | 0.877– | 0.865– | 0.850 | 0.727–– | 0.632––– | 0.559–––
N5 | 0.843––– | 0.861––– | 0.859––– | 0.852––– | 0.841–– | 0.827– | 0.751 | 0.641– | 0.530–––
N6 | 0.908––– | 0.894––– | 0.875–– | 0.857–– | 0.850– | 0.834– | 0.763 | 0.635––– | 0.526–––
N7 | 0.861––– | 0.892––– | 0.897– | 0.885– | 0.854– | 0.832– | 0.770 | 0.645– | 0.570––
N8 | 0.892––– | 0.904––– | 0.895– | 0.877– | 0.868– | 0.850 | 0.751 | 0.657 | 0.574–
N9 | 0.880––– | 0.911–– | 0.895– | 0.875– | 0.860– | 0.832– | 0.753 | 0.665 | 0.585
N10 | 0.882––– | 0.896––– | 0.893– | 0.882– | 0.858– | 0.846 | 0.762 | 0.665 | 0.581–
N11 | 0.900––– | 0.900––– | 0.888– | 0.875– | 0.861– | 0.856 | 0.769 | 0.667 | 0.580–
N12 | 0.941–– | 0.924 | 0.899– | 0.872– | 0.866– | 0.854 | 0.776 | 0.664 | 0.588
N13 | 0.941–– | 0.910–– | 0.886– | 0.870– | 0.853– | 0.840 | 0.781 | 0.672 | 0.578–
N14 | 0.949–– | 0.932 | 0.916 | 0.897 | 0.867– | 0.845 | 0.759 | 0.667 | 0.587
N15 | 0.906––– | 0.894––– | 0.897– | 0.884– | 0.866– | 0.851 | 0.776 | 0.672 | 0.584
N16 | 0.933––– | 0.901––– | 0.895– | 0.886– | 0.864– | 0.851 | 0.771 | 0.677 | 0.599
N17 | 0.943–– | 0.903––– | 0.879–– | 0.871– | 0.866– | 0.856 | 0.777 | 0.679 | 0.595
N18 | 0.937––– | 0.892––– | 0.880–– | 0.870– | 0.864– | 0.854 | 0.778 | 0.683 | 0.595
N90 | 0.939–– | 0.911–– | 0.884– | 0.869– | 0.856– | 0.835– | 0.728–– | 0.639–– | 0.572–
Both mRMR31 and Hwang's predictors were extremely effective within the top-10% range. This indicates that these predictors are preferable when the total number of true essential proteins is known and the allowable number of top-ranked candidates is within 10%. Most of our predictors outperformed them beyond 10%, and outperformed CMIM32 beyond 30%. N14 may thus be a good choice because it remains relatively effective beyond 10%. Figure 8 depicts the average top percentage curves.
E. coli
Table 11 shows the average top percentage information for the imbalanced data set. CMIM09, mRMR13, and Gustafson's results served as benchmarks. The CMIM09 predictor was the most effective over the entire range. Most of our predictors outperformed these benchmarks beyond 15%. N9 was the most prominent predictor since it was relatively effective over the entire range. Figure 9 depicts the average top percentage curves.
Table 11. Average top percentage values for the E. coli imbalanced data set.

 | Top 5% | Top 10% | Top 15% | Top 20% | Top 25% | Top 30% | Top 50% | Top 75% | Top 100%
---|---|---|---|---|---|---|---|---|---
CMIM09 | 0.745* | 0.775* | 0.730* | 0.730* | 0.737* | 0.727* | 0.644* | 0.542* | 0.440–*
mRMR13 | 0.719–* | 0.725–* | 0.714–* | 0.701––* | 0.690––* | 0.679––* | 0.614––* | 0.531–* | 0.446*
Gustafson(29) | 0.706––* | 0.692––* | 0.705––* | 0.707–* | 0.695–* | 0.685–* | 0.624–* | 0.522––* | 0.436––*
N4 | 0.610––– | 0.689––– | 0.765 | 0.760 | 0.749 | 0.752 | 0.649 | 0.534– | 0.449
N5 | 0.655––– | 0.705–– | 0.747 | 0.747 | 0.745 | 0.743 | 0.653 | 0.535– | 0.443–
N6 | 0.719– | 0.723–– | 0.717– | 0.736 | 0.744 | 0.748 | 0.658 | 0.525–– | 0.435–––
N7 | 0.568––– | 0.700–– | 0.713–– | 0.730 | 0.748 | 0.762 | 0.655 | 0.540– | 0.464
N8 | 0.671––– | 0.703–– | 0.723– | 0.736 | 0.728– | 0.731 | 0.652 | 0.535– | 0.449
N9 | 0.813 | 0.785 | 0.766 | 0.754 | 0.734– | 0.726– | 0.685 | 0.548 | 0.459
N10 | 0.794 | 0.751– | 0.728– | 0.732 | 0.741 | 0.745 | 0.668 | 0.550 | 0.463
N11 | 0.719– | 0.721–– | 0.734 | 0.742 | 0.748 | 0.748 | 0.668 | 0.539– | 0.457
N12 | 0.735– | 0.738– | 0.745 | 0.750 | 0.743 | 0.740 | 0.667 | 0.548 | 0.458
N13 | 0.655––– | 0.672––– | 0.713–– | 0.739 | 0.736– | 0.732 | 0.668 | 0.548 | 0.457
N80 | 0.674––– | 0.690––– | 0.703––– | 0.705–– | 0.703– | 0.691– | 0.632– | 0.529–– | 0.452
Discussion
By inspecting the S. cerevisiae feature subsets listed in Table 4, we observed that the most prominent features indeed come from diverse sources, including sequence, protein, topology and other properties. Among these features, amino acid occurrence I, amino acid occurrence W, bit string of the double screening scheme, cytoplasm, endoplasmic reticulum, EI (essentiality index), nucleus, and PR (phyletic retention) were selected more than 10 times. Two of these, EI and PR, were included in all feature subsets and are thus regarded as the most important factors for identifying essential proteins. N9 and N8 were the feature subsets that covered most of the above 8 features. The predictors associated with these two feature subsets outperformed Hwang's results in all performance measures, except for AUC and the top percentage probability at very low thresholds. Furthermore, two amino acid occurrence features, which are relatively easy to extract, were included in these two feature subsets. Predictors built from feature subsets of 10 or more features were consistently superior to Hwang's in nearly all performance measures. Interestingly, mRMR and CMIM selected several sequence-derived features, such as PSSM and amino acid occurrence; these features thus appear useful for essentiality prediction in terms of relevance and feature independence. Based on Tables 6 and 7, we recommend N16, N9, and N6. N16 performed nearly as well as CMIM32 (or mRMR31) and was more compact in feature size. N9 and N6, each obtained by adding one feature, were significantly higher than N8 and N5, respectively, in some performance measures.
For the E. coli features listed in Table 5, the most and second most important features were PR (phyletic retention) and open reading frame length. The remaining important features, selected more than five times, were the average PSSM of amino acid C and the degree related to integrated functional interactions. N5, N7, N8, and N13 covered most of these features. Among the 43 listed features, 21 were sequence-related, such as amino acid occurrence and average PSSM. For this data set, we recommend feature subsets of sizes exceeding 11 because of their effectiveness and compactness in feature size.
Based on the experimental results for the two data sets, we conclude that phyletic retention is the most important feature for identifying essential proteins. It is defined as the number of organisms in which an ortholog of the protein is present. Gustafson et al21 analyzed different sets of organisms to calculate phyletic retention for the E. coli and S. cerevisiae data sets, which is sensible because different reference organisms are appropriate for different species. From a biological viewpoint, retention of a protein over long evolutionary periods suggests that it is crucial for certain cell functionality. By inspecting the top 5 amino acid occurrence features in Tables 4 and 5, we find that tryptophan (W) and glycine (G) were two top-ranking features. Since both amino acids are non-polar and hydrophobic, we may hypothesize either that essentiality is related to these physicochemical properties or that these features carry discriminative information not captured by the other top-ranking features.
In this study, we compiled various types of interaction information, including physical, metabolic, transcriptional regulation, and integrated functional and genomic context interactions. The experimental results revealed that properties derived from these networks, such as degrees, were to some extent identified as important features. This implies that interaction information, not limited to physical interactions, may also be closely related to essentiality. According to the literature, hubs of the networks, which possess abundant interaction partners, are important because they play central roles in mediating interactions among numerous less-connected proteins. Thus, proteins involved in these mediation processes are more likely to be crucial for cellular activity or survival.
For the feature selection proposed in this study, let m be the number of all available features, t the target number of selected features, and r the maximal number of retries. The number of SVM cross-validations lies between (m + t + 1)(m − t)/2 and (m + t + 1)(m − t)r/2. The LIBSVM software takes approximately 1 minute to perform a 2-fold cross-validation on one Power5+ processor of an IBM P595 computer. Assuming m = 90, t = 10, and r = 5, the total running time is therefore between roughly 4,000 and 20,000 minutes. The IBM P595 allows users to submit several processes at once to speed up the execution; for example, we can invoke at most 10 SVM processes simultaneously, achieving up to a 10-fold speed-up and reducing the total running time accordingly.
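For instance, a few lines reproduce the bounds quoted above (a sketch; the one-minute-per-run figure is the estimate given in the text):

```python
def cv_run_bounds(m, t, r):
    """Bounds on the number of SVM cross-validations in the backward search.
    Each step with k features remaining evaluates k candidate removals (k runs from m
    down to t + 1), and each step may be repeated up to r times."""
    lower = (m + t + 1) * (m - t) // 2
    return lower, lower * r

print(cv_run_bounds(m=90, t=10, r=5))  # (4040, 20200): roughly 4,000-20,000 minutes at ~1 minute per run
```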
Inspecting Tables 4 and 5, we find that more than one-third of the features were not significantly relevant and thus were not selected. These features are relatively easy to remove during the early stage of the backward feature selection procedure; in our experience, the number of retries r is not critical at this stage. As more features are removed, the required number of retries must increase because identifying the relatively less competent features becomes increasingly difficult. The number of retries r, together with the other user-specified parameters (such as the minimal improvement ρ), must be set appropriately to ensure that the feature selection procedure can proceed.
Conclusion and Future Work
In this study, we incorporated several protein properties, including sequence, protein, topology, and other properties, for a total of 55 groups and 96 features. The features were compiled into two data sets for the experiments: S. cerevisiae and E. coli. We used a modified sequential backward feature selection to identify good feature subsets and used SVM software to classify essential proteins. In addition, we built several SVM models for both the imbalanced and balanced data sets. As our experimental results illustrate, some features were indeed effective for essentiality prediction. The feature subsets selected by our method were effective in terms of both feature size and performance, because our method takes both into consideration and consequently the resulting feature subsets are considerably compact. We compared our experimental results by carrying out significance tests for several types of performance measures. This provides researchers studying essential proteins with a practical guide to which features or methods are more prominent.
In the imbalanced S. cerevisiae data experiment, our best values of F-measure and MCC were 0.549 and 0.495, respectively, both associated with the N13 predictor. For the same performance measures in the balanced data experiment, we achieved 0.770 and 0.545, both associated with the N17 predictor. The experimental results showed that the performance of our models was better than Hwang's when more than 9 features were selected. If higher accuracy is the main concern, we recommend the N16 model (16 features); when a compact feature set is preferred, we suggest the N9 model (9 features). We also list the important features, which may be crucial for identifying essential proteins.
For the E. coli data set, our best values of F-measure and MCC were 0.421 and 0.407 in the imbalanced experiments. In the balanced experiment, the best values of F-measure and MCC were 0.718 and 0.448, respectively. Both of these best results were associated with the N5 predictor. For this data set, we found that predictors with feature sizes above 11 were indeed comparable to Gustafson's.
There are several possible ways to further improve the prediction capability. Additional features related to protein sequence properties may be useful for identifying essentiality. Furthermore, since proteins with similar primary structures may possess similar functions, essentiality may also be addressed from the sequence motif perspective.34 In addition, performance can be improved by incorporating other tools or constructing hybrid predictors. Among these, the majority vote35 is a strategy for combining classifiers and represents the simplest method for categorical data fusion. According to the literature,36 the prerequisite for improvement is that each individual classifier must contain distinct information for discrimination; otherwise, negative effects may be imposed on the constructed ensemble.
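As a rough illustration of this strategy, a minimal majority-vote sketch (hypothetical inputs, independent of the predictors built in this study) is given below:

```python
import numpy as np

def majority_vote(predictions):
    """Fuse binary predictions (rows: classifiers, columns: proteins) by simple majority."""
    votes = np.asarray(predictions)
    return (2 * votes.sum(axis=0) > votes.shape[0]).astype(int)

# three hypothetical classifiers voting on four proteins
print(majority_vote([[1, 0, 1, 0],
                     [1, 1, 0, 0],
                     [0, 1, 1, 0]]))  # -> [1 1 1 0]
```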
Table 13.
AUC | Precision | Recall | F1 | MCC | IOR | |
---|---|---|---|---|---|---|
CMIM32 | 84.2 ± 1.5 | 77.2 ± 2.2 | 76.6 ± 2.9 | 76.9 ± 2.1 | 54.0 ± 3.9 | 3.3 ± 0.4 |
mRMR31 | 83.6 ± 1.6 | 76.5 ± 2.4 | 74.1 ± 3.0 | 75.2 ± 2.1 | 51.3 ± 4.1 | 3.0 ± 0.3 |
Hwang | 82.2 ± 1.7 | 77.8 ± 2.6 | 72.0 ± 3.7 | 74.8 ± 2.3 | 51.6 ± 3.9 | 3.0 ± 0.3 |
Acencio | 76.8 ± 2.2 | 69.6 ± 2.4 | 73.4 ± 4.0 | 71.4 ± 2.3 | 41.4 ± 4.3 | 2.5 ± 0.3 |
N4 | 81.1 ± 1.8 | 77.7 ± 2.5 | 71.6 ± 3.6 | 74.5 ± 2.3 | 51.2 ± 3.9 | 3.0 ± 0.3 |
N5 | 82.4 ± 1.8 | 77.8 ± 2.6 | 73.5 ± 3.5 | 75.6 ± 2.2 | 52.7 ± 4.1 | 3.1 ± 0.3 |
N6 | 82.7 ± 1.8 | 77.8 ± 2.6 | 73.9 ± 3.6 | 75.8 ± 2.3 | 53.0 ± 4.1 | 3.1 ± 0.3 |
N7 | 83.1 ± 1.8 | 77.9 ± 2.5 | 73.3 ± 3.5 | 75.5 ± 2.2 | 52.6 ± 4.0 | 3.1 ± 0.3 |
N8 | 82.6 ± 1.8 | 78.6 ± 2.5 | 72.1 ± 3.5 | 75.2 ± 2.3 | 52.7 ± 4.0 | 3.1 ± 0.3 |
N9 | 83.3 ± 1.8 | 79.1 ± 2.4 | 73.5 ± 3.4 | 76.2 ± 2.2 | 54.1 ± 3.8 | 3.2 ± 0.3 |
N10 | 83.4 ± 1.7 | 78.9 ± 2.4 | 73.6 ± 3.3 | 76.1 ± 2.1 | 54.0 ± 3.8 | 3.2 ± 0.3 |
N11 | 83.1 ± 1.7 | 78.4 ± 2.4 | 73.7 ± 3.5 | 76.0 ± 2.2 | 53.5 ± 3.8 | 3.2 ± 0.3 |
N12 | 82.9 ± 1.8 | 77.9 ± 2.5 | 73.2 ± 3.1 | 75.5 ± 2.2 | 52.6 ± 4.2 | 3.1 ± 0.3 |
N13 | 83.4 ± 1.7 | 78.8 ± 2.4 | 73.0 ± 3.5 | 75.8 ± 2.2 | 53.5 ± 3.9 | 3.1 ± 0.3 |
N14 | 83.6 ± 1.6 | 77.7 ± 2.3 | 74.3 ± 3.4 | 75.9 ± 2.1 | 53.0 ± 3.8 | 3.2 ± 0.3 |
N15 | 84.3 ± 1.7 | 78.4 ± 2.4 | 74.8 ± 3.3 | 76.6 ± 2.1 | 54.2 ± 3.9 | 3.3 ± 0.4 |
N16 | 84.2 ± 1.6 | 77.7 ± 2.2 | 75.6 ± 3.1 | 76.7 ± 2.0 | 54.0 ± 3.7 | 3.3 ± 0.4 |
N17 | 84.7 ± 1.6 | 77.8 ± 2.3 | 76.3 ± 3.0 | 77.0 ± 2.0 | 54.5 ± 3.8 | 3.3 ± 0.4 |
N18 | 84.0 ± 1.6 | 77.9 ± 2.4 | 74.0 ± 3.3 | 75.9 ± 2.0 | 53.1 ± 3.8 | 3.1 ± 0.3 |
N90 | 83.9 ± 1.4 | 76.0 ± 2.0 | 75.3 ± 2.7 | 75.7 ± 1.8 | 51.6 ± 3.5 | 3.1 ± 0.3 |
Table 14.
AUC | Precision | Recall | F1 | MCC | IOR | |
---|---|---|---|---|---|---|
CMIM09 | 70.1 ± 0.9 | 72.0 ± 1.4 | 27.1 ± 0.7 | 39.4 ± 0.9 | 38.2 ± 0.9 | 5.2 ± 0.6 |
mRMR13 | 71.5 ± 2.4 | 71.3 ± 5.4 | 25.0 ± 5.8 | 37.0 ± 6.8 | 36.0 ± 5.8 | 4.7 ± 0.6 |
Gustafson | 71.1 ± 2.3 | 66.5 ± 4.6 | 25.5 ± 5.0 | 36.8 ± 5.2 | 34.7 ± 4.8 | 4.9 ± 0.6 |
N4 | 69.1 ± 1.8 | 72.5 ± 1.0 | 28.0 ± 0.6 | 40.4 ± 0.7 | 39.1 ± 0.7 | 5.5 ± 0.6 |
N5 | 69.0 ± 2.0 | 73.7 ± 1.4 | 29.5 ± 0.8 | 42.1 ± 0.9 | 40.7 ± 1.0 | 5.7 ± 0.6 |
N6 | 70.1 ± 1.7 | 74.2 ± 1.4 | 28.7 ± 0.9 | 41.4 ± 1.1 | 40.3 ± 1.0 | 5.7 ± 0.6 |
N7 | 71.4 ± 1.4 | 73.5 ± 1.3 | 27.5 ± 0.7 | 40.0 ± 0.9 | 39.2 ± 0.9 | 5.5 ± 0.6 |
N8 | 70.5 ± 1.3 | 74.2 ± 1.1 | 28.8 ± 0.8 | 41.5 ± 0.9 | 40.5 ± 0.9 | 5.7 ± 0.6 |
N9 | 70.7 ± 1.5 | 72.6 ± 1.4 | 29.3 ± 1.0 | 41.7 ± 1.2 | 40.1 ± 1.2 | 5.6 ± 0.6 |
N10 | 71.1 ± 1.5 | 72.4 ± 1.6 | 29.4 ± 1.0 | 41.8 ± 1.1 | 40.1 ± 1.2 | 5.6 ± 0.6 |
N11 | 71.4 ± 1.4 | 73.2 ± 1.3 | 27.8 ± 0.8 | 40.3 ± 1.0 | 39.3 ± 1.0 | 5.5 ± 0.6 |
N12 | 71.2 ± 1.4 | 72.5 ± 1.9 | 29.2 ± 1.1 | 41.6 ± 1.2 | 40.0 ± 1.3 | 5.6 ± 0.6 |
N13 | 71.4 ± 1.3 | 73.3 ± 1.8 | 28.7 ± 1.2 | 41.3 ± 1.4 | 40.0 ± 1.4 | 5.6 ± 0.6 |
N80 | 71.6 ± 0.9 | 67.7 ± 2.1 | 23.7 ± 1.3 | 35.2 ± 1.6 | 33.9 ± 1.6 | 4.9 ± 0.6 |
Appendix
Feature extraction
a. Outdegree and indegree related to transcriptional regulation interactions: This feature represents the number of outgoing (or incoming) links of the gene g corresponding to a protein. Links are represented in terms of transcriptional regulation interactions.
b. Betweenness centrality related to transcriptional regulation interactions: Let σ(gi, gj) denote the number of shortest paths between gi and gj, and σ(gi, gj; g) the number of those paths passing through g. Paths are represented in terms of transcriptional regulation interactions.
c. Betweenness centrality related to physical interactions: The value τ(gi, gj; g) is defined as the number of shortest paths between gi and gj passing through g. The definition is similar to the previous one, but the paths here are represented in terms of protein physical interactions.
d. Protein properties: Acencio and Lemke5 discovered that the integration of topological properties, cellular components, and biological processes possesses good capability for predicting essential proteins. Hence, our features also contained cellular components (cytoplasm, endoplasmic reticulum, mitochondrion, nucleus or other localization) and biological processes (cell cycle, metabolic process, signal transduction, transcription, transport or other process).
The above four feature sets were obtained from Acencio and Lemke.5
e. Betweenness centrality related to the integrated functional, PI and GC network: The values are defined identically to those mentioned above, while the paths here are represented in terms of integrated functional, PI and GC network interactions.
f. Degree related to the integrated functional, PI and GC network: The values are defined identically to those mentioned above, while the links here are represented in terms of integrated functional, PI and GC network interactions.
For the above two feature sets, we first collected network information from Hu et al22 and then conducted calculations using iGraph software.37
g. Maximum neighborhood component and density of maximum neighborhood component: The maximum neighborhood component (MNC) and density of maximum neighborhood component (DMNC) properties were proposed by Lin et al26 and Chin.4 For a protein i, let N(i) be the set of its neighbors. The MNC of i is the connected component of N(i) with maximum size, denoted MN(i).
For a protein i, the numbers of proteins and edges in MN(i) are denoted ni and di, respectively. The DMNC of i is di/ni^α for some 1 ≤ α ≤ 2; in their system, α was set to 1.7.
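A minimal sketch of these two properties on an undirected PPI graph, written here with the networkx package (the study itself used the iGraph software37 for network calculations; α = 1.7 as above):

```python
import networkx as nx

def mnc_dmnc(graph, i, alpha=1.7):
    """Return the size of MN(i) and the DMNC of protein i."""
    neighborhood = graph.subgraph(graph.neighbors(i))    # N(i) and the edges among the neighbors
    if neighborhood.number_of_nodes() == 0:
        return 0, 0.0
    largest = max(nx.connected_components(neighborhood), key=len)
    mn_i = neighborhood.subgraph(largest)                # maximum neighborhood component MN(i)
    n_i, d_i = mn_i.number_of_nodes(), mn_i.number_of_edges()
    return n_i, d_i / (n_i ** alpha)
```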
h. Sequence features: We used ten feature sets from Lin et al.20 Let L be the protein length, Fi be the occurrence number of amino acid i in the protein, H (i) be the hydrophobic coefficient38 of amino acid i, Pi,j be the position of the jth occurrence of amino acid i, A(j) be the amino acid of position j and Sm,n be the score of row m and column n in PSSM.39
The sequence features are listed as follows.
Protein length: L
Cysteine count: FC
Amino acid occurrence: The composition of amino acid i is Fi/L, where 1 ≤ i ≤ 20.
Average cysteine position:
Average distance of every two cysteines:
Cysteine odd-even index: FC mod 2.
Average hydrophobicity:
Average hydrophobicity around cysteine: The kth values of average hydrophobicity around all cysteines were defined as . Here we set k = −2, −1, +1, +2.
Cysteine position distribution: For 1 ≤ d ≤ ρ, the dth cysteine position distribution was
We set ρ = 5
Average PSSM of amino acid: The average PSSM of residue i is .
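A minimal sketch of a few of the simpler sequence features defined above (amino acid occurrence, cysteine count, cysteine odd-even index, and average hydrophobicity, here taken as the mean Kyte-Doolittle value38 over all residues; the remaining, more involved features follow Lin et al20 and are omitted):

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale (reference 38)
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5, 'E': -3.5,
      'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8,
      'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}

def basic_sequence_features(seq):
    L = len(seq)                                   # protein length
    counts = Counter(seq)                          # F_i: occurrences of each amino acid
    f_c = counts.get('C', 0)                       # cysteine count
    return {
        'length': L,
        'cysteine_count': f_c,
        'cysteine_odd_even': f_c % 2,                                   # F_C mod 2
        'aa_occurrence': {aa: counts.get(aa, 0) / L for aa in KD},      # F_i / L
        'avg_hydrophobicity': sum(KD.get(a, 0.0) for a in seq) / L,     # mean H(A(j))
    }
```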
i. Phyletic retention: Gustafson et al21 discovered that the essential proteins are generally more conserved than nonessential proteins. Phyletic retention of protein i is the number of organisms in which an ortholog is present. The ortholog of each protein was obtained from Hwang et al.7
j. Essential index:7 Essential index measures the ratio of essential proteins in the neighbors N(i) of node i. Essential index of node i is defined as p(i)/di, where p(i) is the number of essential proteins in N(i) and di is the degree of node i.
k. Clique level:7 The clique level of protein i is defined as the maximal clique containing i. Here, only cliques with sizes between 3 and 10 were taken into consideration.
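Both of these neighborhood-based features can be illustrated with a short networkx sketch (a simplified reading of the definitions above; `essential` is an assumed set of known essential proteins):

```python
import networkx as nx

def essential_index(graph, i, essential):
    """EI: fraction of essential proteins among the neighbors N(i) of node i."""
    neighbors = list(graph.neighbors(i))
    return sum(1 for v in neighbors if v in essential) / len(neighbors) if neighbors else 0.0

def clique_level(graph, i, lo=3, hi=10):
    """Size of the largest maximal clique containing i, considering only sizes lo..hi."""
    sizes = (len(c) for c in nx.find_cliques(graph) if i in c and lo <= len(c) <= hi)
    return max(sizes, default=0)
```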
l. Number of paralogous genes: Genes are more likely to be essential if no duplicate exists in the same genome.21 This feature is defined as the number of paralogous genes present in the same genome, where paralogs are gene pairs whose BLASTP E-values are less than 10−20 and whose ratio of the larger gene to the smaller does not exceed 1.33.
m. Open reading frame length: Gustafson et al21 observed that ancestral genes are more likely to be essential and that proteins generally become larger throughout evolution. Consequently, the open reading frame length may indicate essentiality.
Confidence intervals of performance measures and informational odds ratios
All performance measures were multiplied by 100, and the confidence intervals were set at 95%. We used the informational odds ratio (IOR)32 to represent the association between essentiality and the predictions. The IOR measures how much more likely a protein is to be essential when a learning machine predicts essentiality rather than non-essentiality; a value of 1.0 indicates no association between essentiality and the predictions. The confidence intervals of the performance measures and informational odds ratios corresponding to each prediction model are shown in Tables 12–15.
Table 12.
AUC | Precision | Recall | F1 | MCC | IOR | |
---|---|---|---|---|---|---|
CMIM32 | 82.5 ± 1.2 | 74.4 ± 3.1 | 36.9 ± 4.5 | 49.3 ± 3.8 | 45.0 ± 3.6 | 5.2 ± 0.5 |
mRMR31 | 82.1 ± 1.6 | 73.8 ± 3.2 | 37.2 ± 4.3 | 49.5 ± 3.6 | 44.9 ± 3.4 | 5.2 ± 0.5 |
Hwang | 77.5 ± 2.2 | 74.3 ± 3.7 | 34.3 ± 4.3 | 46.9 ± 4.0 | 43.2 ± 3.6 | 5.1 ± 0.4 |
Acencio | 70.7 ± 3.4 | 67.5 ± 6.3 | 12.1 ± 5.5 | 20.4 ± 7.6 | 22.8 ± 6.0 | 3.7 ± 0.4 |
N4 | 74.4 ± 2.7 | 78.2 ± 3.7 | 32.7 ± 4.1 | 46.1 ± 4.1 | 43.9 ± 3.5 | 5.3 ± 0.4 |
N5 | 72.7 ± 3.6 | 74.1 ± 4.1 | 38.7 ± 4.7 | 50.9 ± 4.1 | 46.1 ± 3.8 | 5.3 ± 0.5 |
N6 | 73.0 ± 3.2 | 75.2 ± 4.2 | 39.5 ± 4.4 | 51.8 ± 3.8 | 47.2 ± 3.6 | 5.5 ± 0.5 |
N7 | 76.1 ± 2.4 | 76.7 ± 3.7 | 38.6 ± 4.4 | 51.3 ± 3.9 | 47.3 ± 3.6 | 5.5 ± 0.5 |
N8 | 77.2 ± 2.4 | 75.5 ± 3.4 | 37.1 ± 4.9 | 49.8 ± 4.3 | 45.7 ± 3.9 | 5.3 ± 0.5 |
N9 | 78.2 ± 2.4 | 74.9 ± 3.4 | 38.2 ± 4.5 | 50.6 ± 3.9 | 46.2 ± 3.6 | 5.4 ± 0.5 |
N10 | 78.1 ± 2.2 | 75.1 ± 3.5 | 39.9 ± 4.1 | 52.1 ± 3.6 | 47.4 ± 3.5 | 5.5 ± 0.5 |
N11 | 78.6 ± 2.1 | 75.2 ± 3.2 | 40.2 ± 4.2 | 52.4 ± 3.6 | 47.6 ± 3.4 | 5.5 ± 0.5 |
N12 | 79.8 ± 2.0 | 75.9 ± 3.2 | 40.9 ± 4.2 | 53.2 ± 3.6 | 48.5 ± 3.4 | 5.7 ± 0.5 |
N13 | 78.9 ± 1.9 | 74.8 ± 3.2 | 43.3 ± 4.3 | 54.9 ± 3.4 | 49.5 ± 3.4 | 5.8 ± 0.5 |
N14 | 80.2 ± 1.8 | 74.9 ± 3.2 | 39.7 ± 4.3 | 51.9 ± 3.5 | 47.1 ± 3.4 | 5.5 ± 0.5 |
N15 | 80.1 ± 1.9 | 76.3 ± 3.3 | 40.6 ± 4.2 | 53.0 ± 3.5 | 48.5 ± 3.5 | 5.7 ± 0.5 |
N16 | 81.4 ± 1.7 | 76.2 ± 3.2 | 40.1 ± 4.6 | 52.5 ± 3.8 | 48.0 ± 3.6 | 5.6 ± 0.5 |
N17 | 81.4 ± 1.7 | 76.1 ± 3.3 | 40.7 ± 4.5 | 53.0 ± 3.8 | 48.4 ± 3.6 | 5.7 ± 0.5 |
N18 | 81.1 ± 1.8 | 75.1 ± 3.2 | 41.1 ± 4.3 | 53.1 ± 3.6 | 48.2 ± 3.5 | 5.6 ± 0.5 |
N90 | 82.9 ± 1.0 | 73.8 ± 2.8 | 35.5 ± 4.5 | 47.9 ± 3.6 | 43.8 ± 3.4 | 5.1 ± 0.4 |
Table 15.
AUC | Precision | Recall | F1 | MCC | IOR | |
---|---|---|---|---|---|---|
CMIM09 | 76.7 ± 1.7 | 72.0 ± 2.4 | 70.0 ± 3.7 | 71.0 ± 2.2 | 42.1 ± 4.0 | 2.4 ± 0.3 |
mRMR13 | 76.2 ± 2.0 | 72.8 ± 3.1 | 65.4 ± 6.4 | 68.9 ± 3.3 | 39.6 ± 3.9 | 2.2 ± 0.2 |
Gustafson | 77.7 ± 2.6 | 72.2 ± 3.3 | 71.5 ± 4.0 | 71.9 ± 2.8 | 44.0 ± 5.5 | 2.6 ± 0.3 |
N4 | 78.0 ± 1.6 | 73.3 ± 2.5 | 70.1 ± 2.7 | 71.7 ± 1.9 | 44.6 ± 3.8 | 2.6 ± 0.3 |
N5 | 77.9 ± 1.7 | 73.0 ± 2.5 | 70.6 ± 2.8 | 71.8 ± 1.8 | 44.5 ± 3.8 | 2.6 ± 0.3 |
N6 | 76.2 ± 1.7 | 73.5 ± 2.6 | 66.3 ± 4.3 | 69.6 ± 2.6 | 42.5 ± 4.1 | 2.4 ± 0.3 |
N7 | 78.3 ± 1.7 | 73.7 ± 2.5 | 69.6 ± 2.7 | 71.6 ± 1.8 | 44.8 ± 3.6 | 2.6 ± 0.3 |
N8 | 78.1 ± 1.7 | 72.3 ± 2.3 | 71.1 ± 3.3 | 71.7 ± 2.1 | 43.9 ± 3.8 | 2.5 ± 0.3 |
N9 | 78.2 ± 1.6 | 71.5 ± 2.2 | 70.3 ± 4.1 | 70.9 ± 2.3 | 42.3 ± 3.8 | 2.4 ± 0.3 |
N10 | 78.1 ± 1.6 | 72.5 ± 2.4 | 70.2 ± 3.3 | 71.3 ± 2.1 | 43.6 ± 3.9 | 2.5 ± 0.3 |
N11 | 77.7 ± 1.7 | 71.9 ± 2.2 | 70.0 ± 3.3 | 70.9 ± 2.0 | 42.6 ± 3.6 | 2.5 ± 0.3 |
N12 | 77.6 ± 1.9 | 71.5 ± 2.3 | 69.5 ± 4.5 | 70.5 ± 2.5 | 41.8 ± 4.0 | 2.4 ± 0.3 |
N13 | 77.6 ± 1.7 | 73.1 ± 2.4 | 69.5 ± 3.0 | 71.2 ± 2.1 | 43.9 ± 3.9 | 2.5 ± 0.3 |
N80 | 76.9 ± 1.8 | 71.1 ± 2.4 | 71.5 ± 2.4 | 71.3 ± 1.8 | 42.4 ± 3.8 | 2.5 ± 0.3 |
Comparison with other feature selection methods
We first introduce the two feature selection methods that served as benchmarks, mRMR13 and CMIM,14 both of which are information-theoretic methods. We then compare them with our feature selection method when feature subsets of equal size were selected.
Unlike other methods that select top-ranking features based on F-score or mutual information without considering relationships among features, mRMR accommodates both the relevance of a feature to the class label and its dependency on the already selected features. The strategy combines the maximal relevance and minimal redundancy criteria. To take both criteria into consideration while avoiding an exhaustive search, mRMR adopts an incremental search: the rth feature is chosen to maximize
$$\max_{X_j \in X \setminus X_{r-1}} \left[ I(X_j; Y) - \frac{1}{r-1} \sum_{X_i \in X_{r-1}} I(X_j; X_i) \right] \qquad (1)$$
where X is the full feature set, Xr−1 is the set of r − 1 features already selected, Xi is a feature within Xr−1, and Xj is any feature not yet selected. I(Xj; Y) is the mutual information between feature Xj and the class label Y, quantifying their dependence (or relevance). Thus the rth selected feature should be as relevant to the class label as possible while, on average, having the least dependency on the already selected features.
For CMIM, a feature Xj is good if I(Y; Xj | Xi) is large for every already selected feature Xi. The conditional mutual information I(Y; Xj | Xi) quantifies the information that Xj carries about the class label Y beyond what Xi already provides. The feature selection procedure is likewise carried out incrementally as follows:
$$v(1) = \arg\max_{j} I(Y; X_j) \qquad (2)$$

$$v(k+1) = \arg\max_{j} \ \min_{l \le k} I\left(Y; X_j \mid X_{v(l)}\right) \qquad (3)$$
where v(k) denotes the kth selected feature.
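To make the two greedy criteria concrete, here is a minimal sketch of both incremental searches (assuming discretized feature columns in a NumPy array and estimating mutual information with scikit-learn's mutual_info_score; an illustration only, not the exact implementation compared in this study):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mutual_info(y, xj, xi):
    """I(Y; Xj | Xi) for discrete variables, averaged over the values of Xi."""
    return sum((xi == v).mean() * mutual_info_score(y[xi == v], xj[xi == v])
               for v in np.unique(xi))

def greedy_mrmr(X, y, k):
    """mRMR (Eq. 1): relevance to Y minus average redundancy with the chosen features."""
    relevance = [mutual_info_score(X[:, j], y) for j in range(X.shape[1])]
    chosen = [int(np.argmax(relevance))]
    while len(chosen) < k:
        scores = [relevance[j] - np.mean([mutual_info_score(X[:, j], X[:, i]) for i in chosen])
                  if j not in chosen else -np.inf for j in range(X.shape[1])]
        chosen.append(int(np.argmax(scores)))
    return chosen

def greedy_cmim(X, y, k):
    """CMIM (Eqs. 2 and 3): maximize the worst-case conditional mutual information."""
    relevance = [mutual_info_score(X[:, j], y) for j in range(X.shape[1])]
    chosen = [int(np.argmax(relevance))]
    while len(chosen) < k:
        scores = [min(cond_mutual_info(y, X[:, j], X[:, i]) for i in chosen)
                  if j not in chosen else -np.inf for j in range(X.shape[1])]
        chosen.append(int(np.argmax(scores)))
    return chosen
```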
In the following paragraphs, we compare our method with the above two feature selection methods when feature subsets of equal size were selected. We first ran the SVM software with Hwang's or Gustafson's feature sets and tuned the SVM parameters to achieve the highest average performance. To fairly compare the methods given feature subsets of the same size obtained by our method, mRMR, and CMIM, we used the same SVM software and adjusted the cost parameters to achieve similar levels of precision. For the S. cerevisiae and E. coli data sets, the feature numbers k were 4 ≤ k ≤ 18 and 4 ≤ k ≤ 13, respectively, consistent with the feature subsets reported above.
For the S. cerevisiae data set in the imbalanced experiment, Tables 16 and 17 show the performance comparison of our method versus mRMR and CMIM. Our method performed significantly better when the size of a feature subset exceeded 7. For the balanced experiment, as illustrated in Tables 18 and 19, our method was significantly better only when the number of selected features exceeded 16.
Table 16.
AUC | Precision | Recall | F-measure | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.744 | 0.762 | 0.782 | 0.756 | 0.327 | 0.331 | 0.461 | 0.461 | 0.439 | 0.430 |
N5 | 0.727 | 0.718 | 0.741 | 0.740 | 0.387 | 0.359 | 0.509 | 0.484 | 0.461 | 0.442 |
N6 | 0.730 | 0.753 | 0.752 | 0.753 | 0.395 > | 0.333 | 0.518 > | 0.462 | 0.472 > | 0.430 |
N7 | 0.761 | 0.763 | 0.767 | 0.761 | 0.386 > | 0.330 | 0.513 > | 0.460 | 0.473 > | 0.431 |
N8 | 0.772 > | 0.771 | 0.755 | 0.757 | 0.371 > | 0.326 | 0.498 > | 0.456 | 0.457 > | 0.427 |
N9 | 0.782 | 0.776 | 0.749 | 0.749 | 0.382 > | 0.341 | 0.506 > | 0.469 | 0.462 > | 0.434 |
N10 | 0.781 > | 0.778 | 0.751 | 0.752 | 0.399 > | 0.340 | 0.521 > | 0.469 | 0.474 > | 0.434 |
N11 | 0.786 > | 0.774 | 0.752 | 0.750 | 0.402 > | 0.341 | 0.524 > | 0.469 | 0.476 > | 0.434 |
N12 | 0.798 > | 0.781 | 0.759 | 0.757 | 0.409 > | 0.334 | 0.532 > | 0.463 | 0.485 > | 0.432 |
N13 | 0.789 > | 0.774 | 0.748 | 0.746 | 0.433 > | 0.342 | 0.549 > | 0.469 | 0.495 > | 0.432 |
N14 | 0.802 > | 0.775 | 0.749 | 0.750 | 0.397 > | 0.340 | 0.519 > | 0.468 | 0.471 > | 0.433 |
N15 | 0.801 > | 0.798 | 0.763 | 0.764 | 0.406 > | 0.318 | 0.530 > | 0.449 | 0.485 > | 0.424 |
N16 | 0.814 > | 0.799 | 0.762 | 0.762 | 0.401 > | 0.318 | 0.525 > | 0.449 | 0.480 > | 0.423 |
N17 | 0.814 > | 0.799 | 0.761 | 0.759 | 0.407 > | 0.326 | 0.530 > | 0.456 | 0.484 > | 0.427 |
N18 | 0.811 > | 0.797 | 0.751 | 0.749 | 0.411 > | 0.342 | 0.531 > | 0.469 | 0.482 > | 0.434 |
Table 17.
AUC | Precision | Recall | F1 | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.744 | 0.761 | 0.782 | 0.762 | 0.327 | 0.344 | 0.461 | 0.474 | 0.439 | 0.442 |
N5 | 0.727 | 0.735 | 0.741 | 0.738 | 0.387 | 0.371 | 0.509 | 0.494 | 0.461 | 0.449 |
N6 | 0.730 | 0.757 | 0.752 | 0.749 | 0.395 | 0.359 | 0.518 | 0.485 | 0.472 | 0.446 |
N7 | 0.761 | 0.779 | 0.767 | 0.763 | 0.386 > | 0.339 | 0.513 > | 0.470 | 0.473 > | 0.439 |
N8 | 0.772 | 0.779 | 0.755 | 0.754 | 0.371 > | 0.350 | 0.498 > | 0.478 | 0.457 > | 0.442 |
N9 | 0.782 | 0.776 | 0.749 | 0.750 | 0.382 > | 0.357 | 0.506 > | 0.483 | 0.462 > | 0.445 |
N10 | 0.781 | 0.782 | 0.751 | 0.751 | 0.399 > | 0.353 | 0.521 > | 0.480 | 0.474 > | 0.443 |
N11 | 0.786 | 0.786 | 0.752 | 0.752 | 0.402 > | 0.363 | 0.524 > | 0.490 | 0.476 > | 0.450 |
N12 | 0.798 | 0.799 | 0.759 | 0.758 | 0.409 > | 0.354 | 0.532 > | 0.483 | 0.485 > | 0.447 |
N13 | 0.789 | 0.797 | 0.748 | 0.750 | 0.433 > | 0.360 | 0.549 > | 0.487 | 0.495 > | 0.447 |
N14 | 0.802 > | 0.801 | 0.749 | 0.749 | 0.397 > | 0.348 | 0.519 > | 0.475 | 0.471 > | 0.438 |
N15 | 0.801 > | 0.797 | 0.763 | 0.760 | 0.406 > | 0.330 | 0.530 > | 0.460 | 0.485 > | 0.430 |
N16 | 0.814 > | 0.796 | 0.762 | 0.759 | 0.401 > | 0.338 | 0.525 > | 0.468 | 0.480 > | 0.436 |
N17 | 0.814 > | 0.795 | 0.761 | 0.756 | 0.407 > | 0.339 | 0.530 > | 0.469 | 0.484 > | 0.435 |
N18 | 0.811 | 0.799 | 0.751 | 0.756 | 0.411 > | 0.338 | 0.531 > | 0.467 | 0.482 > | 0.435 |
Table 18.
AUC | Precision | Recall | F-measure | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.811 | 0.815 | 0.777 | 0.770 | 0.716 | 0.725 | 0.745 | 0.747 | 0.512 | 0.510 |
N5 | 0.824 | 0.818 | 0.778 | 0.771 | 0.735 | 0.722 | 0.756 | 0.745 | 0.527 | 0.508 |
N6 | 0.827 | 0.814 | 0.778 | 0.775 | 0.739 | 0.709 | 0.758 | 0.740 | 0.530 | 0.504 |
N7 | 0.831 | 0.824 | 0.779 | 0.779 | 0.733 | 0.718 | 0.755 | 0.747 | 0.526 | 0.516 |
N8 | 0.826 | 0.827 | 0.786 | 0.781 | 0.721 | 0.721 | 0.752 | 0.750 | 0.527 | 0.521 |
N9 | 0.833 | 0.834 | 0.791 | 0.783 | 0.735 | 0.734 | 0.762 | 0.758 | 0.541 | 0.531 |
N10 | 0.834 | 0.835 | 0.789 | 0.783 | 0.736 | 0.733 | 0.761 | 0.757 | 0.540 | 0.531 |
N11 | 0.831 | 0.834 | 0.784 | 0.780 | 0.737 | 0.730 | 0.760 | 0.754 | 0.535 | 0.525 |
N12 | 0.829 | 0.834 | 0.779 | 0.778 | 0.732 | 0.734 | 0.755 | 0.755 | 0.526 > | 0.525 |
N13 | 0.834 | 0.834 | 0.788 | 0.779 | 0.730 | 0.732 | 0.758 | 0.754 | 0.535 | 0.525 |
N14 | 0.836 | 0.832 | 0.777 | 0.777 | 0.743 | 0.731 | 0.759 | 0.753 | 0.530 | 0.522 |
N15 | 0.843 | 0.835 | 0.784 | 0.778 | 0.748 | 0.734 | 0.766 | 0.756 | 0.542 | 0.526 |
N16 | 0.842 > | 0.836 | 0.777 | 0.777 | 0.756 > | 0.735 | 0.767 > | 0.755 | 0.540 > | 0.525 |
N17 | 0.847 > | 0.834 | 0.778 | 0.777 | 0.763 | 0.733 | 0.770 > | 0.754 | 0.545 > | 0.523 |
N18 | 0.840 | 0.835 | 0.779 | 0.778 | 0.740 | 0.735 | 0.759 | 0.756 | 0.531 | 0.526 |
Table 19.
AUC | Precision | Recall | F1 | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.811 | 0.813 | 0.777 | 0.777 | 0.716 | 0.724 | 0.745 | 0.749 | 0.512 | 0.517 |
N5 | 0.824 | 0.817 | 0.778 | 0.775 | 0.735 | 0.740 | 0.756 | 0.757 | 0.527 | 0.526 |
N6 | 0.827 | 0.821 | 0.778 | 0.777 | 0.739 | 0.742 | 0.758 | 0.759 | 0.530 | 0.529 |
N7 | 0.831 | 0.830 | 0.779 | 0.772 | 0.733 | 0.744 | 0.755 | 0.758 | 0.526 | 0.524 |
N8 | 0.826 | 0.833 | 0.786 | 0.776 | 0.721 | 0.738 | 0.752 | 0.756 | 0.527 | 0.525 |
N9 | 0.833 | 0.834 | 0.791 | 0.775 | 0.735 | 0.740 | 0.762 | 0.757 | 0.541 | 0.526 |
N10 | 0.834 | 0.835 | 0.789 | 0.776 | 0.736 | 0.739 | 0.761 | 0.757 | 0.540 | 0.527 |
N11 | 0.831 | 0.836 | 0.784 | 0.778 | 0.737 | 0.739 | 0.760 | 0.758 | 0.535 | 0.528 |
N12 | 0.829 | 0.838 | 0.779 | 0.779 | 0.732 | 0.742 | 0.755 | 0.760 | 0.526 | 0.532 |
N13 | 0.834 | 0.837 | 0.788 | 0.778 | 0.730 | 0.741 | 0.758 | 0.759 | 0.535 | 0.530 |
N14 | 0.836 | 0.836 | 0.777 | 0.777 | 0.743 | 0.743 | 0.759 | 0.759 | 0.530 | 0.530 |
N15 | 0.843 | 0.836 | 0.784 | 0.777 | 0.748 | 0.739 | 0.766 | 0.758 | 0.542 | 0.528 |
N16 | 0.842 > | 0.837 | 0.777 | 0.776 | 0.756 | 0.741 | 0.767 > | 0.758 | 0.540 > | 0.528 |
N17 | 0.847 > | 0.838 | 0.778 | 0.777 | 0.763 > | 0.744 | 0.770 > | 0.760 | 0.545 > | 0.531 |
N18 | 0.840 | 0.837 | 0.779 | 0.778 | 0.740 | 0.746 | 0.759 | 0.762 | 0.531 | 0.533 |
For the E. coli data set in the imbalanced experiment, Tables 20 and 21 show the performance comparison of our method versus mRMR and CMIM with configurations similar to those for the S. cerevisiae data set. Our method performed significantly better when the size of a feature subset exceeded 9. For the balanced experiment, as illustrated in Tables 22 and 23, our method performed significantly better than mRMR when the number of selected features exceeded 7. The experimental results showed almost no difference between our method and CMIM except for the AUC of N6.
Table 20.
AUC | Precision | Recall | F1 | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.691 | 0.651 | 0.725 | 0.678 | 0.280 | 0.269 | 0.404 | 0.385 | 0.391 | 0.363 |
N5 | 0.690 | 0.675 | 0.737 | 0.687 | 0.295 | 0.254 | 0.421 | 0.371 | 0.407 | 0.356 |
N6 | 0.701 | 0.681 | 0.742 | 0.708 | 0.287 | 0.220 | 0.414 | 0.336 | 0.403 | 0.338 |
N7 | 0.714 | 0.686 | 0.735 | 0.712 | 0.275 | 0.212 | 0.400 | 0.326 | 0.392 | 0.333 |
N8 | 0.705 | 0.692 | 0.742 | 0.713 | 0.288 | 0.209 | 0.415 | 0.323 | 0.405 | 0.330 |
N9 | 0.707 | 0.692 | 0.726 | 0.713 | 0.293 > | 0.199 | 0.417 > | 0.312 | 0.401 | 0.322 |
N10 | 0.711 | 0.697 | 0.724 | 0.703 | 0.294 > | 0.193 | 0.418 > | 0.302 | 0.401 | 0.313 |
N11 | 0.714 < | 0.702 | 0.732 | 0.697 | 0.278 > | 0.187 | 0.403 | 0.295 | 0.393 | 0.306 |
N12 | 0.712 < | 0.704 | 0.725 | 0.683 | 0.292 | 0.192 | 0.416 | 0.300 | 0.400 | 0.305 |
N13 | 0.714 < | 0.715 | 0.733 | 0.713 | 0.287 | 0.250 | 0.413 | 0.370 | 0.400 | 0.360 |
Table 21.
AUC | Precision | Recall | F1 | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.691 > | 0.663 | 0.725 | 0.717 | 0.280 | 0.271 | 0.404 | 0.393 | 0.391 | 0.381 |
N5 | 0.690 | 0.686 | 0.737 | 0.710 | 0.295 > | 0.264 | 0.421 > | 0.385 | 0.407 > | 0.373 |
N6 | 0.701 | 0.697 | 0.742 | 0.715 | 0.287 > | 0.265 | 0.414 > | 0.387 | 0.403 > | 0.376 |
N7 | 0.714 | 0.693 | 0.735 | 0.711 | 0.275 > | 0.261 | 0.400 > | 0.382 | 0.392 > | 0.371 |
N8 | 0.705 | 0.690 | 0.742 | 0.709 | 0.288 > | 0.254 | 0.415 > | 0.373 | 0.405 > | 0.364 |
N9 | 0.707 | 0.701 | 0.726 | 0.720 | 0.293 > | 0.271 | 0.417 > | 0.394 | 0.401 > | 0.382 |
N10 | 0.711 | 0.702 | 0.724 | 0.692 | 0.294 > | 0.248 | 0.418 > | 0.364 | 0.401 > | 0.353 |
N11 | 0.714 | 0.698 | 0.732 | 0.690 | 0.278 > | 0.247 | 0.403 > | 0.363 | 0.393 > | 0.351 |
N12 | 0.712 | 0.690 | 0.725 | 0.683 | 0.292 > | 0.239 | 0.416 > | 0.353 | 0.400 > | 0.342 |
N13 | 0.714 | 0.688 | 0.733 | 0.678 | 0.287 > | 0.236 | 0.413 > | 0.349 | 0.400 > | 0.337 |
Table 22.
AUC | Precision | Recall | F1 | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.780 > | 0.773 | 0.733 | 0.726 | 0.701 > | 0.651 | 0.717 > | 0.686 | 0.446 > | 0.407 |
N5 | 0.779 | 0.772 | 0.730 | 0.720 | 0.706 > | 0.654 | 0.718 > | 0.684 | 0.445 > | 0.401 |
N6 | 0.762 | 0.771 | 0.735 | 0.717 | 0.663 | 0.649 | 0.696 | 0.680 | 0.425 | 0.394 |
N7 | 0.783 > | 0.768 | 0.737 | 0.716 | 0.696 | 0.649 | 0.716 > | 0.680 | 0.448 > | 0.394 |
N8 | 0.781 > | 0.764 | 0.723 | 0.715 | 0.711 > | 0.641 | 0.717 > | 0.675 | 0.439 > | 0.387 |
N9 | 0.782 > | 0.764 | 0.715 | 0.713 | 0.703 | 0.643 | 0.709 > | 0.675 | 0.423 > | 0.386 |
N10 | 0.781 > | 0.765 | 0.725 | 0.716 | 0.702 > | 0.636 | 0.713 > | 0.673 | 0.436 > | 0.386 |
N11 | 0.777 > | 0.766 | 0.719 | 0.720 | 0.700 | 0.643 | 0.709 > | 0.678 | 0.426 | 0.394 |
N12 | 0.776 | 0.765 | 0.715 | 0.714 | 0.695 | 0.643 | 0.705 | 0.676 | 0.418 | 0.388 |
N13 | 0.776 > | 0.762 | 0.731 | 0.728 | 0.695 > | 0.654 | 0.712 > | 0.689 | 0.439 > | 0.396 |
Table 23.
AUC | Precision | Recall | F1 | MCC | ||||||
---|---|---|---|---|---|---|---|---|---|---|
N4 | 0.780 | 0.769 | 0.733 | 0.719 | 0.701 | 0.696 | 0.717 | 0.707 | 0.446 | 0.424 |
N5 | 0.779 | 0.771 | 0.730 | 0.715 | 0.706 | 0.696 | 0.718 | 0.705 | 0.445 | 0.419 |
N6 | 0.762 < | 0.771 | 0.735 | 0.716 | 0.663 | 0.684 | 0.696 | 0.699 | 0.425 | 0.413 |
N7 | 0.783 | 0.769 | 0.737 | 0.711 | 0.696 | 0.696 | 0.716 | 0.703 | 0.448 | 0.413 |
N8 | 0.781 | 0.767 | 0.723 | 0.709 | 0.711 | 0.697 | 0.717 | 0.703 | 0.439 | 0.412 |
N9 | 0.782 | 0.767 | 0.715 | 0.720 | 0.703 | 0.700 | 0.709 | 0.710 | 0.423 | 0.421 |
N10 | 0.781 | 0.767 | 0.725 | 0.705 | 0.702 | 0.702 | 0.713 | 0.704 | 0.436 | 0.409 |
N11 | 0.777 | 0.765 | 0.719 | 0.706 | 0.700 | 0.700 | 0.709 | 0.703 | 0.426 | 0.408 |
N12 | 0.776 | 0.764 | 0.715 | 0.703 | 0.695 | 0.698 | 0.705 | 0.700 | 0.418 | 0.404 |
N13 | 0.776 | 0.765 | 0.731 | 0.704 | 0.695 | 0.700 | 0.712 | 0.702 | 0.439 | 0.406 |
Methods such as mRMR and CMIM take both relevance and information redundancy into consideration, so the obtained feature subsets are quite compact as well as effective. However, relevance alone may suit only some performance measures, such as classification accuracy or precision. Our method takes both performance and feature size into consideration; consequently, the resulting feature subsets were more effective on several other performance measures given an equal number of features and similar precision values.
Footnotes
Author Contributions
Conceived and designed the experiments: CBY, CYH. Analysed the data: CYH, ZJY. Wrote the first draft of the manuscript: CYH, ZJY, CTT. Contributed to the writing of the manuscript: CYH, CBY, ZJY. Agree with manuscript results and conclusions: CYH, CBY. Jointly developed the structure and arguments for the paper: CYH, CBY, ZJY, CTT. Made critical revisions and approved final version: CBY. All authors reviewed and approved the final manuscript.
Competing Interests
Author(s) disclose no potential conflicts of interest.
Disclosures and Ethics
As a requirement of publication the authors have provided signed confirmation of their compliance with ethical and legal obligations including but not limited to compliance with ICMJE authorship and competing interests guidelines, that the article is neither under consideration for publication nor published elsewhere, of their compliance with legal and ethical guidelines concerning human and animal research participants (if applicable), and that permission has been obtained for reproduction of any copyrighted material. This article was subject to blind, independent, expert peer review. The reviewers reported no competing interests. A preliminary version of this paper was presented at the International Conference on Machine Learning and Applications.40
Funding
This research work was partially supported by the National Science Council of Taiwan under contract NSC100-2221-E-110-050.
References
- 1. Cadigan KM, Grossniklaus U, Gehring WJ. Functional redundancy: The respective roles of the two sloppy paired genes in drosophila segmentation. Proc Natl Acad Sci U S A. 1994;91(14):6324–8. doi: 10.1073/pnas.91.14.6324.
- 2. Cullen LM, Arndt GM. Genome-wide screening for gene function using RNAi in mammalian cells. Immunol Cell Biol. 2005;83(3):217–23. doi: 10.1111/j.1440-1711.2005.01332.x.
- 3. Roemer T, Jiang B, Davison J, et al. Large-scale essential gene identification in candida albicans and applications to antifungal drug discovery. Mol Microbiol. 2003;50(1):167–81. doi: 10.1046/j.1365-2958.2003.03697.x.
- 4. Chin CH. Prediction of Essential Proteins and Functional Modules From Protein-Protein Interaction Networks [dissertation]. Chung-Li, Taiwan: National Central University; 2010.
- 5. Acencio ML, Lemke N. Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinformatics. 2009;10:290. doi: 10.1186/1471-2105-10-290.
- 6. Witten IH, Frank E, Hall MA. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Burlington, MA: Morgan Kaufmann; 2000.
- 7. Hwang YC, Lin CC, Chang JY, Mori H, Juan HF, Huang HC. Predicting essential genes based on network and sequence analysis. Mol Biosyst. 2009;5(12):1672–8. doi: 10.1039/B900611G.
- 8. Przulj N, Wigle DA, Jurisica I. Functional topology in a network of protein interactions. Bioinformatics. 1998;20(3):340–8. doi: 10.1093/bioinformatics/btg415.
- 9. Chin CS, Samanta MP. Global snapshot of a protein interaction network: a percolation based approach. Bioinformatics. 2003;19(18):2413–9. doi: 10.1093/bioinformatics/btg339.
- 10. Barabási AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004;5(2):101–13. doi: 10.1038/nrg1272.
- 11. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A. 2001;98(8):4569–74. doi: 10.1073/pnas.061034498.
- 12. Lal TN, Chapelle O, Schölkopf B. In: Guyon I, Gunn S, Nikravesh M, Zadeh LA, editors. Feature Extraction: Foundations and Applications. Berlin: Springer; 2006.
- 13. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005;27(8):1226–38. doi: 10.1109/TPAMI.2005.159.
- 14. Fleuret F. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research. 2004;5:1531–55.
- 15. Sotoca JM, Pla SF. Supervised feature selection by clustering using conditional mutual information-based distances. Pattern Recognition. 2010;43(6):2068–81.
- 16. Quinlan JR. C4.5: Programs for Machine Learning. Burlington, MA: Morgan Kaufmann; 1993.
- 17. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. London: Chapman and Hall/CRC; 1984.
- 18. Tan PN, Steinbach M, Kumar V. Introduction to Data Mining. Boston, MA: Addison-Wesley; 2005.
- 19. Wang G, Ma J, Yan S. IGF-bagging: Information gain based feature selection for bagging. International Journal of Innovative Computing, Information and Control. 2011;7(11):6247–59.
- 20. Lin CY, Yang CB, Hor CY, Huang KS. Disulfide bonding state prediction with SVM based on protein types. Bio-Inspired Computing: Theories and Applications. 2010:1436–42.
- 21. Gustafson AM, Snitkin ES, Parker SC, DeLisi C, Kasif S. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics. 2006;7:265. doi: 10.1186/1471-2164-7-265.
- 22. Hu P, Janga SC, Babu M, et al. Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biol. 2009;7(4):e96. doi: 10.1371/journal.pbio.1000096.
- 23. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm. Updated April 1, 2013; accessed February 1, 2012.
- 24. Fu WJ, Carroll RJ, Wang S. Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics. 2005;21(9):1979–86. doi: 10.1093/bioinformatics/bti294.
- 25. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32:D449–51. doi: 10.1093/nar/gkh086.
- 26. Lin CY, Chin CH, Wu HH, Chen SH, Ho CW, Ko MT. Hubba: hub objects analyzer—a framework of interactome hubs identification for network biology. Nucleic Acids Res. 2008;36(Web Server issue):W438–43. doi: 10.1093/nar/gkn257.
- 27. Begum N, Fattah MA, Ren F. Automatic text summarization using support vector machine. International Journal of Innovative Computing, Information and Control. 2009;7(5):1987.
- 28. Chen RC, Chen SP. Intrusion detection using a hybrid support vector machine based on entropy and TF-IDF. International Journal of Innovative Computing, Information and Control. 2008;4(2):413–24.
- 29. She QS, Su HY, Dong L, Chu J. Support vector machine with adaptive parameters in image coding. International Journal of Innovative Computing, Information and Control. 2008;4(2):359–67.
- 30. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21(20):3940–1. doi: 10.1093/bioinformatics/bti623.
- 31. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2008.
- 32. Efird JT, Lea S, Toland A, Phillips CJ. Informational odds ratio: a useful measure of epidemiologic association in environment exposure studies. Environ Health Insights. 2012;6:17–25. doi: 10.4137/EHI.S9236.
- 33. Demsar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. 2006;7:1–30.
- 34. Chang YI, Wu CC, Chen JR, Jeng YH. Mining sequence motifs from protein databases based on a bit pattern approach. International Journal of Innovative Computing, Information and Control. 2012;8(1B):647–57.
- 35. Ruta D, Gabrys B. A theoretical analysis of the limits of majority voting errors for multiple classifier systems. Pattern Analysis and Applications. 2002;5(4):333–50.
- 36. Kuncheva L. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning. 2003;51(2):181–207.
- 37. Csárdi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems:1695.
- 38. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157(1):105–32. doi: 10.1016/0022-2836(82)90515-0.
- 39. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. doi: 10.1093/nar/25.17.3389.
- 40. Hor CY, Yang CB, Yang ZJ, Tseng CT. Prediction of protein essentiality by the support vector machine with statistical tests. In: 11th International Conference on Machine Learning and Applications; Boca Raton, Florida. 2012:96–101.
- 41. Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol. 2007;3(4):e59. doi: 10.1371/journal.pcbi.0030059.
- 42. Wuchty S, Stadler PF. Centers of complex networks. J Theor Biol. 2003;223(1):43–53. doi: 10.1016/s0022-5193(03)00071-7.
- 43. Jeong H, Mason SP, Barabási AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411(6833):41–2. doi: 10.1038/35075138.
- 44. Peden J. Analysis of Codon Usage [dissertation]. Nottingham, UK: University of Nottingham; 1999.