Abstract
Background:
The quality of predictive modeling in biomedicine depends on the amount of data available for model building.
Objective:
To study the effect of combining microarray data sets on feature selection and predictive modeling performance.
Methods:
Empirical evaluation of stability of feature selection and discriminatory power of classifiers using three previously published gene expression data sets, analyzed both individually and in combination.
Results:
Feature selection was not robust for either the individual or the combined data sets. The classification performance of models built on individual and combined data sets depended heavily on the data set from which the features were extracted.
Conclusion:
We identified the volatility of feature selection as a contributing factor to some of the problems faced by predictive modeling using microarray data.
Introduction
Analysis of data from high-throughput experiments has become pervasive in biomedical research. The development of this technology, along with independent (but concurrent) advances in machine learning algorithms, has significantly changed our perception of how biomedical data should be analyzed and interpreted: It is conceivable that in the future, a patient’s health status will be assessed more accurately by combining many such data items.
It is well-known in the field of machine learning that the quality of data analysis depends on two factors: the machine learning algorithm (mostly the choices of parameters for the algorithm), and the data that serves as input to the algorithm. If we restrict our attention to the task of predictive modeling, and consider that several machine learning algorithms have exhibited similar performance1–3 (albeit on lower-dimensional data), then the data becomes the main determining factor for the quality (predictive power) of the model.
While it is hard to assess the quality of data at a level not involving intricate knowledge of the data generation process (for example, whether data acquisition was biased, or measurement devices malfunctioning), from a machine learning perspective it is clear that the quantity of data has a direct influence on model quality: The more data is available, the better the resulting model will be. This is because the goal of predictive modeling as an inference mechanism is to make predictions for cases from the data population as a whole, and not merely for cases from a particular data set. With increasing amounts of data, the sample data set becomes an increasingly accurate representation of the whole population.
Unfortunately, the size of the data space grows exponentially with the number of attributes in the data (the data dimensionality). This is especially problematic for high-throughput data generated by microarray or mass spectrometry experiments. So while more data will result in better models, the amount of data required to obtain substantially better models may need to grow exponentially as well. Thus, the ideal situation (from a predictive modeling perspective) is to build models using a large number of low-dimensional cases. Microarray experiments, however, exhibit the reverse characteristics: Case numbers are limited to a few hundred (at most), while the data dimensionality is several orders of magnitude higher. The analysis of microarray experiments is therefore a major challenge for the field of machine learning.
Several publications have investigated how and to what extent predictive modeling of microarray data can benefit from the increased number of cases made available by combining data sets.4–7 Although the challenges posed by combining data from different microarray experiments are substantial,8 there are software packages available that help with this task.9, 10
We investigate the effect that data combination has on predictive modeling by focusing on feature selection, an important first step in predictive model building. We evaluate data from three different publications that analyze gene expression profiles for breast cancer prognosis, both individually and after combining them. The machine learning task is binary classification: distinguishing patients with good outcome from those with poor outcome. Previous work has focused on which data sets to include in a study,7 on predictions based on pair-wise expression comparisons,5 on robust greedy feature selection,6 and on linear additive models based on Cox regression coefficients.4
In contrast to these studies, we evaluate how the performance of machine learning algorithms, in particular logistic regression models, depends on the number and choice of selected features (gene expressions). Our use of logistic regression over alternatives such as support vector machines and artificial neural networks is motivated by the fact that the performance of the latter models depends crucially on the choice of parameter settings. Since the goal of our work is to isolate and make comparable the influence of the choice of features on the predictive power of models, it is advantageous to use a parameter-free model for this comparison. We therefore apply logistic regression to data sets obtained from three different studies, analyzed both individually and as one merged data set.
Material and methods
Gene expression data
We downloaded three breast cancer data sets from the NCBI Gene Expression Omnibus (GEO): the data sets used by Wang et al.11 (GSE2034), by Sotiriou et al.12 (GSE2990), and by Miller et al.13 (GSE3494). We used the raw CEL files and the associated clinical information of the patients. For within-study normalization, we split the GSE3494 samples into two groups, GSE3494-A and GSE3494-B, according to each sample’s Affymetrix platform (HGU133A or HGU133B).
To make the samples comparable, we selected patients based on the criteria used by Xu et al.,5 meaning that we only used patients who did not receive any adjuvant treatment and had negative lymph node status. We then selected the patients with extreme outcomes: those with poor outcome (either recurrence or metastasis within five years) and those with good outcome (neither recurrence nor metastasis within eight years). After filtering, the number of samples in each data set was as follows: 209 for GSE2034 (114 good/95 poor), 90 for GSE2990 (60 good/30 poor), and 242 for GSE3494 (224 good/18 poor).
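As an illustration, the following is a minimal sketch of this cohort selection step in Python. The clinical-metadata layout and column names (adjuvant_treatment, lymph_node_status, event, followup_years) are hypothetical stand-ins, not the actual GEO field names.

```python
import pandas as pd

def select_extreme_outcomes(clinical: pd.DataFrame) -> pd.DataFrame:
    # Criteria of Xu et al.: no adjuvant treatment, negative lymph nodes
    eligible = clinical[(clinical["adjuvant_treatment"] == "none")
                        & (clinical["lymph_node_status"] == "negative")]
    # Poor outcome: recurrence or metastasis within five years
    poor = eligible[eligible["event"] & (eligible["followup_years"] <= 5)]
    # Good outcome: event-free for at least eight years
    good = eligible[~eligible["event"] & (eligible["followup_years"] >= 8)]
    return pd.concat([good.assign(outcome="good"),
                      poor.assign(outcome="poor")])
```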
We performed quantile normalization, preserving the ranks of each sample’s genes, within each study.14 For probe-to-gene annotation, we first aligned each probe against the genome and then matched its positions to the exon positions in AceView, a comprehensive transcript database.15 We then summarized mRNA abundance at the gene level from the multiple probe-level measurements using the median polish method.16 In their original study, Xu et al.5 used only the samples of the HGU133A platform from the publication of Miller et al.,13 discarding 201 samples. With a sequence-based annotation approach, we were able to include the samples of both the HGU133A and HGU133B platforms. The combined data set thus contained 541 samples, 398 with good prognosis and 143 with poor prognosis. The number of genes in the combined data set (the intersection of genes in all data sets) was 4103.
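The within-study normalization can be sketched as follows. This is a generic rank-preserving quantile normalization in the style of Bolstad et al.,14 not the exact pipeline code, and it ignores tie handling for brevity.

```python
import numpy as np

def quantile_normalize(expr: np.ndarray) -> np.ndarray:
    """Quantile-normalize a genes x samples matrix so that all samples
    share the same value distribution while each sample's gene ranks
    are preserved (ties ignored for brevity)."""
    order = np.argsort(expr, axis=0)                 # sort order per sample
    reference = np.sort(expr, axis=0).mean(axis=1)   # mean value at each rank
    normalized = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        # Give the gene at rank i in sample j the reference value at rank i
        normalized[order[:, j], j] = reference
    return normalized
```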
Feature selection
Feature selection is usually the first step in predictive modeling, because machine learning algorithms are better able to generalize from training data to unseen cases after irrelevant features (genes, in our case) have been discarded. Ideally, feature selection is performed by experts in the field who can identify irrelevant features that may safely be omitted from the analysis. Because this knowledge is often not available (as is the case with microarray data), a number of automated algorithms for feature selection have been developed.
In our study, we performed feature ranking via Student’s t-test, which measures the difference between a gene’s mean expression values in the good- and poor-outcome groups, relative to the dispersion of these values. Preliminary experiments (data not shown) using information gain and ReliefF as univariate and multivariate alternatives, respectively, to the t-test exhibited similar or worse performance compared to the numbers reported below.
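A minimal sketch of this ranking step, using SciPy’s two-sample t-test (the original analysis pipeline is not specified at the code level):

```python
import numpy as np
from scipy import stats

def rank_genes_by_ttest(expr: np.ndarray, good: np.ndarray) -> np.ndarray:
    """Order genes (rows of expr) by decreasing |t| of a two-sample
    t-test between good- and poor-outcome patients (columns of expr).
    good is a boolean mask over the samples."""
    t, _ = stats.ttest_ind(expr[:, good], expr[:, ~good], axis=1)
    return np.argsort(-np.abs(t))  # index 0 = most discriminative gene
```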
We used 10-fold cross-validation for the feature selection step to assess the variability of feature selection on microarray data sets. This means that a t-test is applied 10 times, each time to 9/10 of the data. We thus obtained 10 different rankings of feature relevance. A composite ranking can be computed by taking the mean of the 10 ranks of each feature and sorting the means. Additionally, the standard deviation of a feature’s rank gives an indication of the volatility of this rank.
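The rank-aggregation procedure might look like the following sketch, reusing rank_genes_by_ttest from above; scikit-learn’s KFold stands in for whatever fold-splitting code was actually used.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_feature_ranks(expr, good, n_splits=10, seed=0):
    """Rank genes on each of 10 training folds and return, per gene,
    the mean and standard deviation of its rank across folds."""
    n_genes = expr.shape[0]
    ranks = np.empty((n_splits, n_genes))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (train_idx, _) in enumerate(kf.split(expr.T)):
        order = rank_genes_by_ttest(expr[:, train_idx], good[train_idx])
        ranks[fold, order] = np.arange(n_genes)  # invert: rank of each gene
    return ranks.mean(axis=0), ranks.std(axis=0)
```

Sorting the returned means yields the composite ranking; the standard deviations correspond to the error bars shown in Figure 1.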
Model building and evaluation
Logistic regression models were built using WEKA, the free machine learning software suite*. Data was normalized to zero mean and unit variance before model building. No wrapper feature selection scheme, such as forward selection or backward elimination, was used, because the relevant features had already been determined in the ranking step.
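The models in this paper were built in WEKA; for readers working in Python, an equivalent setup might look like the following sketch (scikit-learn’s default regularization settings are an assumption, not a reproduction of WEKA’s configuration).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_model():
    # Standardize to zero mean / unit variance, then fit a plain
    # logistic regression; no wrapper feature selection is performed.
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
```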
The performance of machine learning models is judged by how well they generalize, that is, how well they perform on new cases. Therefore, a number of cases are set aside for later use as a test set. A common measure for evaluating classification performance on this test set is the area under the ROC curve (AUC), which measures how well the two classes can be separated by the model,17 regardless of where the optimal separating threshold lies. Perfect separation yields an AUC value of 1, no discrimination an AUC value of 0.5.
Model validation was done by randomly splitting the data into training and test sets five times. We chose this approach due to the small number of cases with poor prognosis in two of the three data sets: 10-fold cross-validation would have resulted in only a few (one to three) such cases in the test set. The performance numbers reported below are averages over the five test sets.
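Combining the pieces above, the evaluation loop might look like this sketch, which uses build_model from the previous sketch. The 70/30 split proportion and the stratification are assumptions; the paper does not state them.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def mean_test_auc(expr, good, features, n_repeats=5, seed=0):
    """Average test-set AUC over five random train/test splits, using
    only the given feature (gene row) indices."""
    X = expr[features, :].T                    # samples x selected genes
    aucs = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, good, test_size=0.3, stratify=good, random_state=seed + r)
        model = build_model().fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return float(np.mean(aucs))
```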
To guard against the possibility of the choice of logistic regression models unduly influencing the results, we also investigated the use of decision tree models as classifiers. The results of these models, however, were worse than those of the logistic regression models, regardless of whether the t-test or information gain was used as the feature selection method.
Results
Using the methods described above, we investigated the following hypotheses:
1. Feature selection methods are robust across different data sets.
2. Machine learning performance increases for larger data sets.
Robustness of feature selection
Using 10-fold cross-validation in the feature selection step, we were able to assess the variability of a feature ranking. Figure 1 illustrates this volatility for data set GSE2034. One can observe that the standard deviations are quite large, with an average of 9.46 over the 20 top-ranked features shown. The corresponding average standard deviations were 20.73 for GSE2990, 21.33 for GSE3494, and 12.94 for the combined data set.
Figure 1:
Average ranks (vertical axis) over ten cross-validation runs for the 20 top ranked features in GSE2034 (horizontal axis). Vertical bars indicate standard deviations of ranks.
Volatility was high not only within data sets, but also between data sets. To assess this volatility, we plotted, for each feature, its ranks in the three individual data sets against its rank in the combined data set. This plot of inter-study ranking variability is shown in Figure 2. For example, the top-ranked gene for the combined data set (ZNF250, avg. rank = 5.1) occupies positions 109.5, 517.1, and 10.5 in the feature rankings of the three individual data sets (the fractional positions are average ranks over the ten cross-validation runs). In the ideal case, all features would have the same rank in each of the studies, and Figure 2 would show dots along the diagonal (due to the large difference in axis scales, this is essentially the horizontal axis). The large deviation from this ideal case illustrates that feature rankings are heavily dependent on the data set from which they were generated.
Figure 2:
Ranks of genes obtained from individual data set rankings (vertical axis) plotted against rank from combined data set ranking (horizontal axis).
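A simple way to quantify the agreement that Figure 2 visualizes is a rank correlation between the two rankings. The paper presents this comparison only graphically, so the Spearman summary below is an added illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import spearmanr

def ranking_agreement(rank_combined, rank_individual, top_k=200):
    """For the top_k genes of the combined ranking, look up their mean
    ranks in an individual data set's ranking (the dots of Figure 2)
    and summarize the agreement as a Spearman correlation."""
    top = np.argsort(rank_combined)[:top_k]   # best genes, combined data
    rho, _ = spearmanr(rank_combined[top], rank_individual[top])
    return rank_individual[top], rho
```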
Benefit of data set combination
To assess the influence of the observed feature volatility on classifier performance, we tested how the features obtained from one data set performed on the other data sets. More precisely, we took the 200 top-ranked features from each data set and divided them into 40 feature sets containing the best 5, 10, 15, ..., 200 features. We then trained 16 × 40 = 640 logistic regression models, one for each combination of feature set (40 per source data set) and evaluation data set.
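In outline, this experiment could be driven by a loop like the following sketch, which reuses mean_test_auc from above. The dictionary structure and the assumption of a shared gene order across data sets are illustrative.

```python
import numpy as np

def cross_dataset_experiment(datasets, mean_ranks, sizes=range(5, 205, 5)):
    """Evaluate every top-k feature set on every data set.
    datasets maps a name to (expr, good); mean_ranks maps a name to
    the mean cross-validated rank per gene."""
    results = {}
    for src, ranks in mean_ranks.items():
        order = np.argsort(ranks)             # best gene first
        for k in sizes:                       # 40 sizes: 5, 10, ..., 200
            for tgt, (expr, good) in datasets.items():
                results[(src, k, tgt)] = mean_test_auc(expr, good, order[:k])
    return results
```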
Not surprisingly, the results showed that the feature sets performed best on the data sets from which they were selected. Figure 3 depicts the classification performance on all four data sets of models trained on the best k features of the combined data set and of the GSE3494 data set, respectively. Similar characteristics of classification performance can be observed for the features extracted from the other two data sets as well (not shown). It was thus not the case that the features from the combined data set were better than the others, in the sense that these features would increase classifier performance on the other data sets.
Figure 3:
Discriminatory power of logistic regression models (measured by the AUC, vertical axis) trained on the k best features (horizontal axis) of the combined data set (a) and the GSE3494 data set (b) for the combined data set (thick black line), GSE2034 (thin red line), GSE2990 (thin green line), and GSE3494 (thin blue line).
Discussion
High-throughput experiments yield data on a large number of potential predictors, but the number of cases is usually several orders of magnitude smaller than the number of predictors (i.e., hundreds of thousands of predictors to hundreds of cases). Combining data sets to correct this imbalance is a tantalizing prospect. While some studies have shown that for predictive modeling, the data processing pipeline of combining high-throughput measurements is sufficiently advanced to yield improved results,7, 18 others report no benefits from combining data sets.4 Still others analyze combined data sets, but do not compare them to the individual parts from which they were formed.6, 19
Of the above, the only publication to specifically address the role of feature selection in predictive modeling on combined data sets is the work of Yasrebi et al.4 While Wang et al.6 acknowledge that “predictors were inconsistent”, the other publications ignore this issue, instead using feature selection as a necessary black-box step in model building that does not warrant further scrutiny.
In this paper, we showed that, on the contrary, feature selection in the context of high-throughput measurements is a highly volatile process that has a great influence on predictive modeling performance. We demonstrated the high variability of the highest-ranked features within and across data sets, as well as the effect that this variability has on predictive performance. Although these results were obtained on only one combination of data sets, they highlight the need to specifically address feature selection when combining data sets for predictive modeling.
Acknowledgments
This work was funded in part by the Austrian Genome Program (GEN-AU), project Bioinformatics Integration Network (BIN), the Komen Foundation (FAS0703850), and the National Library of Medicine (R01LM009520).
Footnotes
* Available from http://www.cs.waikato.ac.nz/ml/weka/
References
1. Adler W, Peters A, Lausen B. Comparison of classifiers applied to confocal scanning laser ophthalmoscopy data. Methods Inf Med. 2008;47:38–46. doi: 10.3414/me0348.
2. Chan K, Lee TW, Sample PA, Goldbaum MH, Weinreb RN, Sejnowski TJ. Comparison of machine learning and traditional classifiers in glaucoma diagnosis. IEEE Trans Biomed Eng. 2002;49:963–974. doi: 10.1109/TBME.2002.802012.
3. Dreiseitl S, Ohno-Machado L, Vinterbo S, Billhardt H, Binder M. A comparison of machine learning methods for the diagnosis of pigmented skin lesions. J Biomed Inform. 2001;34:28–36. doi: 10.1006/jbin.2001.1004.
4. Yasrebi H, Sperisen P, Praz V, Bucher P. Can survival prediction be improved by merging gene expression data sets? PLoS One. 2009;4:e7431. doi: 10.1371/journal.pone.0007431.
5. Xu L, Tan AC, Winslow RL, Geman D. Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics. 2008;9:125. doi: 10.1186/1471-2105-9-125.
6. Wang J, Do KA, Wen S, et al. Merging microarray data, robust feature selection, and predicting prognosis in prostate cancer. Cancer Inform. 2006;2:87–97.
7. Ng SK, Tan SH, Sundararajan VS. On combining multiple microarray studies for improved functional classification by whole-dataset feature selection. Genome Inform. 2003;14:44–53.
8. Cahan P, Rovegno F, Mooney D, Newman JC, St Laurent G 3rd, McCaffrey TA. Meta-analysis of microarray results: challenges, opportunities, and recommendations for standardization. Gene. 2007;401:12–18. doi: 10.1016/j.gene.2007.06.016.
9. Johnson WE, Rabinovic A, Li C. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037.
10. Benito M, Parker J, Du Q, et al. Adjustment of systematic microarray data biases. Bioinformatics. 2004;20:105–114. doi: 10.1093/bioinformatics/btg385.
11. Wang Y, Klijn JG, Zhang Y, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–679. doi: 10.1016/S0140-6736(05)17947-1.
12. Sotiriou C, Wirapati P, Loi S, et al. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006;98:262–272. doi: 10.1093/jnci/djj052.
13. Miller LD, Smeds J, George J, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A. 2005;102:13550–13555. doi: 10.1073/pnas.0506230102.
14. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185.
15. Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006;7(Suppl 1):S12.1–14. doi: 10.1186/gb-2006-7-s1-s12.
16. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15. doi: 10.1093/nar/gng015.
17. Hanley JA, McNeil BJ. The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
18. van Vliet M, Reyal F, Horlings H, van de Vijver M, Reinders M, Wessels L. Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability. BMC Genomics. 2008;9:375. doi: 10.1186/1471-2164-9-375.
19. Kim K, Ki D, Jeong H, Jeung H, Chung H, Rha S. Novel and simple transformation algorithm for combining microarray data sets. BMC Bioinformatics. 2007;8:218. doi: 10.1186/1471-2105-8-218.