BMC Bioinformatics. 2006 May 2;7:236. doi: 10.1186/1471-2105-7-236

Table 3.

Accuracy estimates (100% – error rate) using different base classifiers and feature selection techniques, based on twenty repetitions, each utilizing ten-fold cross-validation, for a total of 200 runs

Base classifier (each cell gives accuracy (SD)):

| Expression | Lower | Upper | IB1 | NaiveBayes | KStar | J48 | IBk |
|---|---|---|---|---|---|---|---|
| Threshold | 0.1 | 0.9 | 79.58% (4.79%) | 79.38% (3.86%) | 84.17% (2.72%) | 85.69% (1.63%) | 84.09% (2.64%) |
| Threshold | 0.2 | 0.8 | 91.56% (2.35%) | 90.47% (2.85%) | 91.67% (2.53%) | 88.50% (2.70%) | 93.75% (1.46%) |
| Threshold | 0.33 | 0.66 | 91.89% (2.95%) | 91.17% (2.39%) | 90.44% (2.90%) | 89.10% (2.88%) | 92.81% (1.91%) |
| Absolute | 50 | 200 | 90.28% (1.84%) | 90.74% (1.55%) | 90.46% (1.92%) | 88.54% (2.20%) | 90.62% (1.92%) |
| Tanh | 0.25 | 0.75 | 92.71% (2.43%) | 93.45% (1.92%) | 91.67% (2.13%) | 89.93% (3.46%) | 91.56% (1.19%) |


Feature selection (each cell gives accuracy (SD)):

| Expression | Lower | Upper | InformationGain | ChiSquared | GainRatio | Wrapper Subset |
|---|---|---|---|---|---|---|
| Threshold | 0.2 | 0.8 | 91.56% (2.35%) | 89.96% (2.74%) | 92.41% (1.72%) | 87.30% (3.36%) |
| Tanh | 0.25 | 0.75 | 92.71% (2.43%) | 92.55% (2.72%) | 92.56% (2.34%) | |

Table 3 explores the effect of varying the parameters involved in classification based on network features. Each entry gives the overall accuracy and standard deviation (SD) for a parameter set, based on twenty repetitions of ten-fold cross-validation, for a total of 200 runs. The upper block shows the effect of varying the base classifier in combination with the thresholds used to determine the condition-specific expression state of genes. All classification was done using the WEKA package; complete documentation of each method is available at the WEKA website [63]. Briefly, IB1 and IBk are nearest-neighbor classifiers using 1 and k neighbors, respectively; results here are reported for IBk with k = 3. J48 is a standard implementation of the C4.5 decision-tree algorithm, and KStar is an instance-based classifier that differs from the nearest-neighbor learners in its use of an entropy-based distance function.

Up- and down-regulation in the co-expression networks were determined either at the 80th and 20th percentiles of expression levels, respectively, or at absolute expression intensities of 200 and 50 (for Affymetrix arrays only). The continuous co-expression matrix (labeled Tanh in the table) was constructed by preprocessing the gene expression data g with the hyperbolic tangent transformation G = tanh[(g − μ)/δ], where μ and δ are the average and inter-quartile range, respectively, of the expression levels of all genes across all experiments.

The lower block of Table 3 shows the effect of different feature selection methods. The first three methods evaluate and rank individual attributes. Results using information gain are reported in the main text, and that method is described in the Methods section. The chi-squared method computes a chi-squared statistic of each attribute with respect to the class; the gain ratio method scores an attribute as the ratio of the entropy of the class minus the entropy of the class conditional on the attribute, divided by the entropy of the attribute. Additionally, a wrapper method was assessed in which feature subsets were explored with a greedy forward hill-climbing search through the space of attribute subsets. In testing these feature selection methods on the ALL vs. AML dataset, we find that the top ten links were identical across all methods, and over 90% of the top 25 links were selected by all methods.
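The evaluation protocol behind each table entry (twenty repetitions of ten-fold cross-validation) was run in WEKA, but the same scheme can be sketched with scikit-learn as an analogue. This is not the authors' setup: KNeighborsClassifier merely stands in for WEKA's IBk, and since the legend does not say whether the reported SD is taken over the 200 individual fold scores or over the 20 repetition means, the fold-level SD used below is an assumption.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def repeated_cv_accuracy(X, y, clf=None, repeats=20, folds=10, seed=0):
    """Mean accuracy and SD over `repeats` repetitions of `folds`-fold
    cross-validation (20 x 10 = 200 runs, as in Table 3)."""
    # KNeighborsClassifier with k = 3 is an illustrative stand-in for WEKA's IBk.
    clf = clf if clf is not None else KNeighborsClassifier(n_neighbors=3)
    cv = RepeatedStratifiedKFold(n_splits=folds, n_repeats=repeats, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    # Assumption: SD computed over all 200 fold-level accuracies.
    return scores.mean(), scores.std()
```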
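To make the expression-state preprocessing concrete, a minimal NumPy sketch of the two discretization schemes and the tanh transformation is given below. It is illustrative only: the function names are invented here, and whether the percentile cut-offs were computed globally or per array is not stated, so the global computation is an assumption.

```python
import numpy as np

def expression_states(expr, lower=0.2, upper=0.8, absolute=False):
    """Discretize expression into condition-specific states:
    -1 = down-regulated, 0 = unchanged, +1 = up-regulated.

    With absolute=False, `lower`/`upper` are quantile fractions (e.g. the
    20th/80th percentiles of all expression values); with absolute=True
    they are raw intensity cut-offs (e.g. 50 and 200 for Affymetrix arrays).
    """
    if absolute:
        lo, hi = lower, upper
    else:
        lo, hi = np.quantile(expr, [lower, upper])
    states = np.zeros(expr.shape, dtype=int)
    states[expr <= lo] = -1
    states[expr >= hi] = 1
    return states

def tanh_transform(expr):
    """Continuous alternative: G = tanh((g - mu) / delta), with mu the mean
    and delta the inter-quartile range of all expression values."""
    mu = expr.mean()
    q1, q3 = np.quantile(expr, [0.25, 0.75])
    return np.tanh((expr - mu) / (q3 - q1))
```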
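The two entropy-based ranking criteria described in the legend can be written down directly. The sketch below, assuming NumPy arrays of discrete attribute values and class labels, computes information gain and the gain ratio exactly as defined above (the class-entropy reduction normalized by the attribute's own entropy); it is an illustration, not WEKA's implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a vector of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute, cls):
    """H(class) - H(class | attribute) for a discrete attribute."""
    h_cond = sum((attribute == v).mean() * entropy(cls[attribute == v])
                 for v in np.unique(attribute))
    return entropy(cls) - h_cond

def gain_ratio(attribute, cls):
    """Information gain divided by the attribute's own entropy."""
    h_attr = entropy(attribute)
    return information_gain(attribute, cls) / h_attr if h_attr > 0 else 0.0
```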