. 2006 Jan 6;7:3. doi: 10.1186/1471-2105-7-3

Table 3.

Stability of variable (gene) selection evaluated using 200 bootstrap samples. "# Genes": number of genes selected on the original data set. "# Genes boot.": median (1st quartile, 3rd quartile) of the number of genes selected on the bootstrap samples. "Freq. genes": median (1st quartile, 3rd quartile) of the frequency with which each gene selected on the original data set appears among the genes selected from the bootstrap samples. Parameters for backwards elimination with random forest: mtryFactor = 1, s.e. = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2.

Data set	Error	# Genes	# Genes boot.	Freq. genes
Backwards elimination of genes from random forest

s.e. = 0

Leukemia	0.087	2	2 (2, 2)	0.38 (0.29, 0.48)¹
Breast 2 cl.	0.337	14	9 (5, 23)	0.15 (0.1, 0.28)
Breast 3 cl.	0.346	110	14 (9, 31)	0.08 (0.04, 0.13)
NCI 60	0.327	230	60 (30, 94)	0.1 (0.06, 0.19)
Adenocar.	0.185	6	3 (2, 8)	0.14 (0.12, 0.15)
Brain	0.216	22	14 (7, 22)	0.18 (0.09, 0.25)
Colon	0.159	14	5 (3, 12)	0.29 (0.19, 0.42)
Lymphoma	0.047	73	14 (4, 58)	0.26 (0.18, 0.38)
Prostate	0.061	18	5 (3, 14)	0.22 (0.17, 0.43)
Srbct	0.039	101	18 (11, 27)	0.1 (0.04, 0.29)

s.e. = 1

Leukemia	0.075	2	2 (2, 2)	0.4 (0.32, 0.5)¹
Breast 2 cl.	0.332	14	4 (2, 7)	0.12 (0.07, 0.17)
Breast 3 cl.	0.364	6	7 (4, 14)	0.27 (0.22, 0.31)
NCI 60	0.353	24	30 (19, 60)	0.26 (0.17, 0.38)
Adenocar.	0.207	8	3 (2, 5)	0.06 (0.03, 0.12)
Brain	0.216	9	14 (7, 22)	0.26 (0.14, 0.46)
Colon	0.177	3	3 (2, 6)	0.36 (0.32, 0.36)
Lymphoma	0.042	58	12 (5, 73)	0.32 (0.24, 0.42)
Prostate	0.064	2	3 (2, 5)	0.9 (0.82, 0.99)¹
Srbct	0.038	22	18 (11, 34)	0.57 (0.4, 0.88)

Alternative approaches

SC.s

Leukemia	0.062	82²	46 (14, 504)	0.48 (0.45, 0.59)
Breast 2 cl.	0.326	31	55 (24, 296)	0.54 (0.51, 0.66)
Breast 3 cl.	0.401	2166	4341 (2379, 4804)	0.84 (0.78, 0.88)
NCI 60	0.246	5118³	4919 (3711, 5243)	0.84 (0.74, 0.92)
Adenocar.	0.179	0	9 (0, 18)	NA (NA, NA)
Brain	0.159	4177	1257 (295, 3483)	0.38 (0.3, 0.5)
Colon	0.122	15	22 (15, 34)	0.8 (0.66, 0.87)
Lymphoma	0.033	2796	2718 (2030, 3269)	0.82 (0.68, 0.86)
Prostate	0.089	4	3 (2, 4)	0.72 (0.49, 0.92)
Srbct	0.025	37⁴	18 (12, 40)	0.45 (0.34, 0.61)

NN.vs

Leukemia	0.056	512	23 (4, 134)	0.17 (0.14, 0.24)
Breast 2 cl.	0.337	88	23 (4, 110)	0.24 (0.2, 0.31)
Breast 3 cl.	0.424	9	45 (6, 214)	0.66 (0.61, 0.72)
NCI 60	0.237	1718	880 (360, 1718)	0.44 (0.34, 0.57)
Adenocar.	0.181	9868	73 (8, 1324)	0.13 (0.1, 0.18)
Brain	0.194	1834	158 (52, 601)	0.16 (0.12, 0.25)
Colon	0.158	8	9 (4, 45)	0.57 (0.45, 0.72)
Lymphoma	0.04	15	15 (5, 39)	0.5 (0.4, 0.6)
Prostate	0.081	7	6 (3, 18)	0.46 (0.39, 0.78)
Srbct	0.031	11	17 (11, 33)	0.7 (0.66, 0.85)

¹Only two genes are selected from the complete data set; the values are the actual frequencies of those two genes.

²[33] select 21 genes after visually inspecting the plot of cross-validation error rate vs. amount of shrinkage and number of genes. Their procedure is hard to automate, which makes it very difficult to estimate its error rate.

³[31] report obtaining more than 2000 genes when using shrunken centroids with this data set and show that the minimum error rate is achieved with about 5000 genes.

⁴[33] select 43 genes. The difference is likely due to differences in the random partitions for cross-validation. When the gene selection process is repeated 100 times with the full data set, the median, 1st quartile, and 3rd quartile of the number of selected genes are 13, 8, and 147. For these data, [31] obtain 72 genes with shrunken centroids, which also falls within the above interval.
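The summary columns defined in the table caption ("# Genes", "# Genes boot.", "Freq. genes") can be sketched in Python as follows. This is only an illustration, not the code used to produce the table: the function names `quartile_summary` and `stability_summary` are hypothetical, the gene selections are assumed to be already available as sets of gene identifiers (one per bootstrap sample), and the quartile convention (the "inclusive" method, which matches R's default) is an assumption, since the caption does not state one.

```python
from statistics import median, quantiles


def quartile_summary(values):
    """Return (median, 1st quartile, 3rd quartile) of the values.

    Returns (None, None, None) for an empty input, mirroring the
    "NA (NA, NA)" entry shown when no gene is selected.
    """
    values = list(values)
    if not values:
        return (None, None, None)
    if len(values) == 1:
        return (values[0], values[0], values[0])
    # Assumption: "inclusive" quartiles (the convention R uses by default).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (median(values), q1, q3)


def stability_summary(original_genes, bootstrap_selections):
    """Summarize the stability of a gene selection (hypothetical helper).

    original_genes: genes selected on the original data set.
    bootstrap_selections: one collection of selected genes per bootstrap sample.
    """
    original_genes = set(original_genes)
    boot_sets = [set(s) for s in bootstrap_selections]
    if not boot_sets:
        raise ValueError("at least one bootstrap selection is required")

    # "# Genes boot.": number of genes selected in each bootstrap sample.
    sizes = [len(s) for s in boot_sets]

    # "Freq. genes": for each gene selected on the original data, the
    # fraction of bootstrap selections that also contain it.
    freqs = [
        sum(gene in s for s in boot_sets) / len(boot_sets)
        for gene in sorted(original_genes)
    ]

    return {
        "n_genes": len(original_genes),
        "n_genes_boot": quartile_summary(sizes),
        "freq_genes": quartile_summary(freqs),
    }


# Toy example with made-up gene names and four bootstrap selections.
toy = stability_summary(
    {"g1", "g2", "g3"},
    [{"g1", "g2"}, {"g1"}, {"g1", "g3", "g4"}, {"g2", "g5"}],
)
print(toy)
```

In the toy example, gene `g1` appears in three of the four bootstrap selections, so its frequency is 0.75; the table reports the median and quartiles of such per-gene frequencies rather than the individual values (except where footnote 1 applies).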