Table 3.
Stability of variable (gene) selection evaluated using 200 bootstrap samples. "# Genes": number of genes selected on the original data set. "# Genes boot.": median (1st quartile, 3rd quartile) of the number of genes selected from the bootstrap samples. "Freq. genes": median (1st quartile, 3rd quartile) of the frequency with which each gene selected on the original data set appears among the genes selected from the bootstrap samples. Parameters for backwards elimination with random forest: mtryFactor = 1, s.e. = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2.
Data set | Error | # Genes | # Genes boot. | Freq. genes |
Backwards elimination of genes from random forest | ||||
s.e. = 0 | ||||
Leukemia | 0.087 | 2 | 2 (2, 2) | 0.38 (0.29, 0.48)¹ |
Breast 2 cl. | 0.337 | 14 | 9 (5, 23) | 0.15 (0.1, 0.28) |
Breast 3 cl. | 0.346 | 110 | 14 (9, 31) | 0.08 (0.04, 0.13) |
NCI 60 | 0.327 | 230 | 60 (30, 94) | 0.1 (0.06, 0.19) |
Adenocar. | 0.185 | 6 | 3 (2, 8) | 0.14 (0.12, 0.15) |
Brain | 0.216 | 22 | 14 (7, 22) | 0.18 (0.09, 0.25) |
Colon | 0.159 | 14 | 5 (3, 12) | 0.29 (0.19, 0.42) |
Lymphoma | 0.047 | 73 | 14 (4, 58) | 0.26 (0.18, 0.38) |
Prostate | 0.061 | 18 | 5 (3, 14) | 0.22 (0.17, 0.43) |
Srbct | 0.039 | 101 | 18 (11, 27) | 0.1 (0.04, 0.29) |
s.e. = 1 | ||||
Leukemia | 0.075 | 2 | 2 (2, 2) | 0.4 (0.32, 0.5)¹ |
Breast 2 cl. | 0.332 | 14 | 4 (2, 7) | 0.12 (0.07, 0.17) |
Breast 3 cl. | 0.364 | 6 | 7 (4, 14) | 0.27 (0.22, 0.31) |
NCI 60 | 0.353 | 24 | 30 (19, 60) | 0.26 (0.17, 0.38) |
Adenocar. | 0.207 | 8 | 3 (2, 5) | 0.06 (0.03, 0.12) |
Brain | 0.216 | 9 | 14 (7, 22) | 0.26 (0.14, 0.46) |
Colon | 0.177 | 3 | 3 (2, 6) | 0.36 (0.32, 0.36) |
Lymphoma | 0.042 | 58 | 12 (5, 73) | 0.32 (0.24, 0.42) |
Prostate | 0.064 | 2 | 3 (2, 5) | 0.9 (0.82, 0.99)¹ |
Srbct | 0.038 | 22 | 18 (11, 34) | 0.57 (0.4, 0.88) |
Alternative approaches | ||||
SC.s | ||||
Leukemia | 0.062 | 82² | 46 (14, 504) | 0.48 (0.45, 0.59) |
Breast 2 cl. | 0.326 | 31 | 55 (24, 296) | 0.54 (0.51, 0.66) |
Breast 3 cl. | 0.401 | 2166 | 4341 (2379, 4804) | 0.84 (0.78, 0.88) |
NCI 60 | 0.246 | 5118³ | 4919 (3711, 5243) | 0.84 (0.74, 0.92) |
Adenocar. | 0.179 | 0 | 9 (0, 18) | NA (NA, NA) |
Brain | 0.159 | 4177 | 1257 (295, 3483) | 0.38 (0.3, 0.5) |
Colon | 0.122 | 15 | 22 (15, 34) | 0.8 (0.66, 0.87) |
Lymphoma | 0.033 | 2796 | 2718 (2030, 3269) | 0.82 (0.68, 0.86) |
Prostate | 0.089 | 4 | 3 (2, 4) | 0.72 (0.49, 0.92) |
Srbct | 0.025 | 37⁴ | 18 (12, 40) | 0.45 (0.34, 0.61) |
NN.vs | ||||
Leukemia | 0.056 | 512 | 23 (4, 134) | 0.17 (0.14, 0.24) |
Breast 2 cl. | 0.337 | 88 | 23 (4, 110) | 0.24 (0.2, 0.31) |
Breast 3 cl. | 0.424 | 9 | 45 (6, 214) | 0.66 (0.61, 0.72) |
NCI 60 | 0.237 | 1718 | 880 (360, 1718) | 0.44 (0.34, 0.57) |
Adenocar. | 0.181 | 9868 | 73 (8, 1324) | 0.13 (0.1, 0.18) |
Brain | 0.194 | 1834 | 158 (52, 601) | 0.16 (0.12, 0.25) |
Colon | 0.158 | 8 | 9 (4, 45) | 0.57 (0.45, 0.72) |
Lymphoma | 0.04 | 15 | 15 (5, 39) | 0.5 (0.4, 0.6) |
Prostate | 0.081 | 7 | 6 (3, 18) | 0.46 (0.39, 0.78) |
Srbct | 0.031 | 11 | 17 (11, 33) | 0.7 (0.66, 0.85) |
¹ Only two genes are selected from the complete data set; the values are the actual frequencies of those two genes.
² [33] select 21 genes after visually inspecting the plot of cross-validation error rate vs. amount of shrinkage and number of genes. Their procedure is hard to automate, so it is very difficult to obtain estimates of its error rate.
³ [31] report obtaining more than 2000 genes when using shrunken centroids with this data set and show that the minimum error rate is achieved with about 5000 genes.
⁴ [33] select 43 genes. The difference is likely due to differences in the random partitions for cross-validation. When the gene selection process is repeated 100 times with the full data set, the median, 1st quartile, and 3rd quartile of the number of selected genes are 13, 8, and 147, respectively. For these data, [31] obtain 72 genes with shrunken centroids, which also falls within the above interval.
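The stability summaries reported in the table ("# Genes", "# Genes boot.", "Freq. genes") can be illustrated with a minimal Python sketch. It is not the code used for the analysis: `bootstrap_stability`, `quartile_summary`, and `toy_selector` are hypothetical names introduced here, the selector is a simple stand-in for backwards elimination with random forest, SC.s, or NN.vs, the data are simulated, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption.

```python
import random
import statistics


def quartile_summary(values):
    """Return (median, 1st quartile, 3rd quartile); needs at least 2 values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q1, q3


def bootstrap_stability(samples, select_genes, n_boot=200, seed=0):
    """Summarize how stable a gene-selection procedure is under resampling.

    samples      : list of observations (any objects the selector understands)
    select_genes : hypothetical callable mapping a list of samples to a set of genes
    """
    rng = random.Random(seed)
    original = select_genes(samples)  # genes selected on the original data

    boot_sets = []
    for _ in range(n_boot):
        # Bootstrap sample: draw len(samples) observations with replacement.
        resample = [rng.choice(samples) for _ in samples]
        boot_sets.append(select_genes(resample))

    sizes = [len(s) for s in boot_sets]
    # For each originally selected gene, the fraction of bootstrap
    # selections that also contain it.
    freqs = {g: sum(g in s for s in boot_sets) / n_boot for g in original}

    return {
        "n_selected": len(original),                 # "# Genes"
        "n_selected_boot": quartile_summary(sizes),  # "# Genes boot."
        "gene_freqs": freqs,                         # per-gene frequencies
        # "Freq. genes"; None when fewer than two genes were selected.
        "freq_summary": (quartile_summary(list(freqs.values()))
                         if len(freqs) >= 2 else None),
    }


def toy_selector(samples):
    """Stand-in selector: keep genes whose class means differ by more than 1."""
    by_class = {}
    for label, expr in samples:
        by_class.setdefault(label, []).append(expr)
    if len(by_class) != 2:  # a resample may miss one class entirely
        return set()
    a, b = by_class.values()
    selected = set()
    for g in range(len(a[0])):
        mean_a = sum(row[g] for row in a) / len(a)
        mean_b = sum(row[g] for row in b) / len(b)
        if abs(mean_a - mean_b) > 1.0:
            selected.add(g)
    return selected


# Simulated data: 20 samples, 5 genes; genes 0 and 1 are shifted in class 1.
data_rng = random.Random(1)
toy_data = []
for i in range(20):
    label = i % 2
    expr = [data_rng.gauss(2.0 * label if g < 2 else 0.0, 1.0) for g in range(5)]
    toy_data.append((label, expr))

result = bootstrap_stability(toy_data, toy_selector, n_boot=200, seed=42)
print(result["n_selected"], result["n_selected_boot"], result["freq_summary"])
```

When no gene is selected on the original data the frequency summary is undefined, which corresponds to the NA entry in the table; when only two genes are selected, the per-gene values in `gene_freqs` can be reported directly, as in footnote 1.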