Quantitative PCR on 5 genes reliably identifies CTCL patients with 5% to 99% circulating tumor cells with 90% accuracy
Blood Nebozhyn et al. 107: 3189

Supplemental materials for: Nebozhyn et al

An additional analysis of the array data was carried out to assess the expected accuracy of our classifier as a function of the number of genes used. For this purpose, we followed an extension of the 10-fold cross-validation procedure described by Ambroise and McLachlan, which is designed to provide estimates of classification accuracy that are unbiased with respect to gene selection and is especially useful when the number of available samples is limited.

Our implementation of the procedure can be viewed as a two-loop process. In the outer loop, we randomly partitioned the original high-SS dataset (18 patients and 12 controls) 100 times into training sets (90% of the samples) and associated testing sets (the remaining 10% of the samples). The original proportion of patients to controls was maintained in each training/testing pair. In the inner loop, the discriminant model was fitted on each of the 100 generated training sets and then applied to the withheld samples in the associated test set. Recursive feature elimination was carried out by sequentially removing the genes that contributed least to the discriminant model, and accuracy was tested on the withheld samples at each elimination step. This process was repeated for each of the 100 sample sets. Thus, both model training and feature reduction were always carried out on samples that were independent of the test set. The accuracies were then averaged across partitions for each reduced gene set. We also recorded the frequency with which each gene was retained in the final discriminators and the average accuracy of the various gene combinations.

As can be seen in Figure S1, PDA successfully distinguished high tumor burden SS patients from controls using as few as five genes. SVM, on the other hand, appears to need at least 15 genes to keep the error rate below 5%.
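The two-loop procedure described above can be illustrated with a minimal Python sketch. It is not the code used in the study. As a loudly stated assumption, a nearest-centroid classifier stands in for PDA and SVM, and a gene's "contribution" is approximated by the absolute difference between class means computed on the training samples only. All function and parameter names (`stratified_split`, `rfe_cross_validation`, `final_size`, and so on) are hypothetical.

```python
import random
from collections import Counter


def stratified_split(labels, test_frac, rng):
    """Split sample indices into train/test, roughly preserving class proportions."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_frac))
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test


def centroids(X, y, rows, genes):
    """Per-class mean expression over the given samples (rows) and genes."""
    out = {}
    for cls in set(y[i] for i in rows):
        members = [i for i in rows if y[i] == cls]
        out[cls] = {g: sum(X[i][g] for i in members) / len(members) for g in genes}
    return out


def predict(cent, sample, genes):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    return min(cent, key=lambda c: sum((sample[g] - cent[c][g]) ** 2 for g in genes))


def rfe_cross_validation(X, y, n_partitions=100, test_frac=0.1, final_size=5, seed=0):
    """Return (mean accuracy per gene-set size, retention count per gene).

    Assumes exactly two classes. The classifier and the gene ranking are
    illustrative stand-ins, not the PDA or SVM models of the study.
    """
    rng = random.Random(seed)
    correct, total, retained = Counter(), Counter(), Counter()
    for _ in range(n_partitions):                 # outer loop: random partitions
        train, test = stratified_split(y, test_frac, rng)
        genes = list(range(len(X[0])))
        while genes:                              # inner loop: feature elimination
            cent = centroids(X, y, train, genes)  # fit on training samples only
            for i in test:                        # score on withheld samples
                total[len(genes)] += 1
                correct[len(genes)] += predict(cent, X[i], genes) == y[i]
            if len(genes) == final_size:
                retained.update(genes)            # genes surviving to the final set
            a, b = sorted(cent)
            weakest = min(genes, key=lambda g: abs(cent[a][g] - cent[b][g]))
            genes.remove(weakest)                 # drop the least informative gene
    accuracy = {k: correct[k] / total[k] for k in sorted(total)}
    return accuracy, retained
```

Here `X` is a list of samples, each a list of expression values, and `y` holds the matching class labels. Because both fitting and gene ranking use only the training indices, the withheld samples never influence which genes are kept, which is the property the procedure is meant to guarantee.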

Files in this Data Supplement: