Quantitative PCR on 5 genes reliably identifies CTCL patients with 5% to 99% circulating tumor cells with 90% accuracy
Blood Nebozhyn et al. 107: 3189
Supplemental materials for: Nebozhyn et al
An additional study of the array data was carried out to assess the expected accuracy of our classifier as a function of the number of genes used. For this purpose, we followed an extension of the 10-fold cross-validation procedure described by Ambroise and McLachlan, which is designed to provide estimates of classification accuracy that are unbiased with respect to gene selection and is especially useful when the number of available samples is limited.

Our implementation of the procedure can be considered a two-loop process. In the outer loop, we randomly partitioned the original high-SS dataset (18 patients and 12 controls) 100 times into training sets (90% of the samples) and associated testing sets (the remaining 10% of the samples). The original proportion of patients to controls was maintained in each training/testing pair. In the inner loop, the discriminant model was fitted on each of the 100 generated training sets and then applied to the withheld samples in the associated test set. Recursive feature elimination was carried out by sequentially removing the genes with the least contribution to the discriminant model, and accuracy was tested on the withheld samples at each elimination step. This process was repeated for each of the 100 sample sets. Thus, both training of the model and feature reduction were always carried out on samples that are independent of the test set.

After this training-testing procedure, the accuracy was averaged over the partitions for each reduced gene set. We recorded the frequency with which each gene was retained in the final discriminators and the average accuracy of the various gene combinations. Figure S1 shows that penalized discriminant analysis (PDA) was able to distinguish high tumor burden SS patients from the controls using as few as five genes. The support vector machine (SVM), on the other hand, seems to need at least 15 genes to keep the error rates below 5%.
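The two-loop procedure described above can be illustrated with a minimal sketch. It is not the code used in the study: a simple diagonal linear discriminant stands in for PDA and SVM, a gene's "contribution" is assumed to be its squared class-mean difference divided by the pooled variance, and all function names (`stratified_split`, `fit_discriminant`, `predict`, `resampled_rfe_accuracy`) are hypothetical. The point it shows is that gene elimination happens inside each resampling, using only the training samples, so the withheld samples never influence gene selection.

```python
import random
from statistics import mean


def stratified_split(labels, test_fraction, rng):
    """Split sample indices into train/test, keeping the class proportions."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_test = max(1, round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def fit_discriminant(X, y, genes):
    """Fit a diagonal linear discriminant on the listed genes (stand-in for PDA).

    Returns per-gene weights, class midpoints, and contributions, where the
    contribution (an assumption of this sketch) is the squared class-mean
    difference divided by the pooled variance.
    """
    weights, midpoints, contributions = {}, {}, {}
    for g in genes:
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        m_pos, m_neg = mean(pos), mean(neg)
        ss = sum((v - m_pos) ** 2 for v in pos) + sum((v - m_neg) ** 2 for v in neg)
        var = ss / (len(pos) + len(neg) - 2) + 1e-9  # small constant avoids division by zero
        weights[g] = (m_pos - m_neg) / var
        midpoints[g] = (m_pos + m_neg) / 2
        contributions[g] = (m_pos - m_neg) ** 2 / var
    return weights, midpoints, contributions


def predict(model, row):
    """Return 1 (patient) if the discriminant score is positive, else 0 (control)."""
    weights, midpoints, _ = model
    score = sum(w * (row[g] - midpoints[g]) for g, w in weights.items())
    return 1 if score > 0 else 0


def resampled_rfe_accuracy(X, y, n_splits=100, test_fraction=0.1, seed=0):
    """Average test accuracy for every gene-set size reached during elimination."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    correct = {k: 0 for k in range(1, n_genes + 1)}
    n_tested = 0
    for _ in range(n_splits):  # outer loop: random stratified 90%/10% partitions
        train, test = stratified_split(y, test_fraction, rng)
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        genes = list(range(n_genes))
        while genes:  # inner loop: fit, test on withheld samples, drop weakest gene
            model = fit_discriminant(X_train, y_train, genes)
            correct[len(genes)] += sum(predict(model, X[i]) == y[i] for i in test)
            weakest = min(genes, key=lambda g: model[2][g])
            genes.remove(weakest)
        n_tested += len(test)
    return {k: c / n_tested for k, c in correct.items()}
```

Calling `resampled_rfe_accuracy(X, y)` with `X` as a list of per-sample expression rows and `y` as 0/1 labels returns a dictionary mapping each gene-set size to its averaged accuracy, which corresponds to one curve of the kind plotted against the number of genes.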
Files in this Data Supplement:
- Figure S1. Estimated error rates as a function of the number of genes in the discriminant model (PDF, 194 KB) -
The results shown were obtained on a previously published ([13]) microarray dataset consisting of 18 high tumor burden SS patients and 12 controls. Results are shown for two classification techniques: Penalized Discriminant Analysis (PDA) and Support Vector Machine (SVM). The estimated error rates are based on the sets of discriminant scores that the discriminant model assigned to each sample for a given number of genes, in the context of a resampling procedure that eliminates selection bias, similar to that of Ambroise and McLachlan. The particular implementation employed 100 random resamplings without replacement (jackknife) of the given dataset. In each resampling, 10% of the samples are withheld for testing and the model is fitted on the remaining 90%. Recursive feature elimination (RFE) is performed after each round of training and testing. The false positive rate is the percentage of control samples misclassified as patients by the model; conversely, the false negative rate is the percentage of patient samples misclassified as controls.
- Table S1 (XLS, 80.5 KB) -
The experimental data for the samples used in this study, along with the values of the LDA and SVM classification scores. The first column contains the sample names for 172 samples in the following order: first, the 125 samples used for fitting and validating the LDA- and SVM-based discriminant models; then the 30 samples used for gene selection; and finally 17 PBMC samples from an additional independent test set consisting of 12 Mycosis Fungoides patients, 2 recovered Sezary Syndrome patients, and 3 Atopic Dermatitis patients. Columns 2 through 7 contain the individual measurements for each of the 5 discriminant genes, along with the MBD4 housekeeping gene used as an internal control. Columns 8 to 12 contain the gene expression levels normalized by the housekeeping gene. Columns 13 to 17 contain the same values as columns 8 to 12 after natural log transformation. The last 4 columns contain the mean classification scores from the LDA and SVM classifiers, along with the corresponding standard deviations.
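The relationship between the measurement, normalized, and log-transformed columns of Table S1 can be sketched as follows. This is an illustration only: it assumes that "normalized by the housekeeping gene" means dividing each gene's measurement by the MBD4 measurement from the same sample, which the description above does not state explicitly, and the function name `normalize_and_log` is hypothetical.

```python
import math


def normalize_and_log(gene_values, housekeeping_value):
    """Return (normalized, log-transformed) expression values for one sample.

    Assumes normalization is a simple ratio to the housekeeping gene (MBD4)
    and that all measurements are strictly positive.
    """
    if housekeeping_value <= 0:
        raise ValueError("housekeeping measurement must be positive")
    normalized = [v / housekeeping_value for v in gene_values]
    logged = [math.log(v) for v in normalized]  # natural logarithm
    return normalized, logged
```

Under this assumption, passing the five discriminant-gene measurements of a sample together with its MBD4 measurement yields the five values of columns 8 to 12 and the five values of columns 13 to 17.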