PLOS Computational Biology. 2022 Mar 11;18(3):e1009926. doi: 10.1371/journal.pcbi.1009926

The ability to classify patients based on gene-expression data varies by algorithm and performance metric

Stephen R Piccolo 1,*, Avery Mecham 1, Nathan P Golightly 1, Jérémie L Johnson 1, Dustin B Miller 1
Editor: Xing Chen
PMCID: PMC8942277  PMID: 35275931

Abstract

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist—and most support diverse hyperparameters—so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.

Author summary

When a patient is treated in a medical setting, a clinician may extract a tissue sample and use transcriptome-profiling technologies to quantify the extent to which thousands of genes are expressed in the sample. These measurements reflect biological activity that may influence disease development, progression, and/or treatment responses. Patterns that differ between patients in distinct groups (for example, patients who do or do not have a disease or do or do not respond to a treatment) may be used to classify future patients into these groups. This study is a large-scale benchmark comparison of algorithms that can be used to perform such classifications. Additionally, we evaluated feature-selection algorithms, which can be used to identify which variables (genes and/or patient characteristics) are most relevant for classification. Through a series of analyses that build on each other, we show that classification performance varies considerably, depending on which algorithms are used, whether feature selection is used, which settings are used when executing the algorithms, and which metrics are used to evaluate the algorithms’ performance. Researchers can use these findings as a resource for deciding which algorithms and settings to prioritize when deriving transcriptome-based biomarkers in future efforts.

Introduction

Researchers use observational data to derive categories, or classes, into which patients can be assigned. Such classes might include patients who have a given disease subtype, patients at a particular disease stage, patients who respond to a particular treatment, patients who have poor outcomes, patients who have a particular genomic lesion, etc. Subsequently, a physician may use these classes to tailor patient care, rather than using a one-size-fits-all approach[1–3]. However, physicians typically do not know in advance which class labels are most relevant for each patient. Thus, a key challenge is defining objective and reliable criteria for assigning individual patients to known class labels. When such criteria have been identified and sufficiently validated, they can be used in medical “expert systems” for classifying individual patients[4].

In this study, we focused on using gene-expression profiles to perform classification. Gene-expression profiling technologies are relatively mature and are used widely in research[5,6]. In addition, gene-expression profiling is now used in clinical applications. For example, physicians use the PAM50 classifier, based on the expression of 58 genes, to assign breast-cancer patients to “intrinsic subtypes”[7–11]. The success of this classifier has motivated additional research. In breast cancer alone, more than 100 gene-expression profiles have been proposed for predicting breast-cancer prognosis[12].

Classification algorithms learn from data much as a physician does—past observations inform decisions about new patients. Thus, the first step in developing a gene-expression biomarker is to profile a patient cohort that represents the population of interest. Alternatively, a researcher might use publicly available data for this step. Second, the researcher performs a preliminary evaluation of the potential to assign patients to a particular clinically relevant class based on gene-expression profiles and accompanying clinical information. Furthermore, the researcher may undertake an effort to select a classification algorithm that will perform relatively well for this particular task. Such efforts may be informed by prior experience, a literature review, or trial and error. Using some form of subsampling[13] and a given classification algorithm, the researcher derives a classification model from a subset of the patients’ data (training data); to derive this model, the researcher exposes the classification algorithm to the true class labels for each patient. Then, using a disjoint subset of patient observations for which the true class labels have been withheld (test data), the model predicts the label of each patient. Finally, the researcher compares the predictions against the true labels. If the predictive performance approaches or exceeds what can be attained using currently available models, the researcher may continue to refine and test the model. Such steps might include tuning the algorithm, reducing the number of predictor variables, and testing the model on multiple, independent cohorts. In this study, we focus on the need to select algorithm(s).
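The following sketch illustrates this general workflow (train/test split, model fitting, prediction, and evaluation) using scikit-learn, one of the libraries benchmarked below. The simulated data, the choice of logistic regression, and all parameter values are illustrative placeholders, not the study's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # 100 patients x 5,000 gene-expression features (simulated)
y = rng.integers(0, 2, size=100)   # hypothetical binary class labels

# Hold out a test set whose true labels are withheld from the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, stratify=y, random_state=0)

# Derive a classification model from the training data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict the withheld labels and compare the predictions against the truth.
test_probs = model.predict_proba(X_test)[:, 1]
print("Test-set AUROC:", roc_auc_score(y_test, test_probs))
```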

Modern, high-throughput technologies can produce more than 10,000 gene-expression measurements per biological sample. Thus, instead of a traditional approach that uses prior knowledge to determine which genes are included in a predictive model, researchers can use a data-driven approach to infer which genes are most relevant and to identify expression patterns that differ among patient groups[14]. These patterns may be complex, representing subtle differences in expression that span many genes[15]. Due to dependencies among biomolecules and limitations in measurement technologies, high-throughput gene-expression measurements are often redundant and noisy[16]. Thus, to be effective at inferring relevant patterns, classification algorithms must be able to overcome these challenges. One approach is to perform feature selection using algorithms that identify predictor variables (features) that are most relevant to the class of interest.

Many machine-learning algorithms and algorithmic variants have been developed and are available in open-source software packages. These include classification algorithms as well as feature-selection algorithms. Gene-expression datasets are abundant in public repositories, affording opportunities for large-scale benchmark comparisons. Furthermore, many of these datasets are accompanied by clinically oriented predictor variables. To our knowledge, no benchmark study has systematically compared the ability to classify patients using clinical data versus gene-expression data—or combined these two types of data—for a large number of datasets. Moreover, previous benchmarks have not systematically evaluated the benefits of optimizing an algorithm’s hyperparameters versus using defaults. We address these gaps with a benchmark study spanning 50 datasets (143 class variables representing diverse phenotypes), 52 classification algorithms (1116 hyperparameter combinations), and 14 feature-selection algorithms. We perform this study in a staged design, comparing the ability to classify patients using gene-expression data alone, clinical data alone, or both data types together. In addition, we evaluate the effects of performing hyperparameter optimization and/or feature selection.

Results

General trends

We evaluated the predictive performance of 52 classification algorithms on 50 gene-expression datasets. Across the 50 datasets, we made predictions for a total of 143 class variables. We divided the analysis into 5 stages to assess benefits that might come from including clinical predictors, optimizing an algorithm’s hyperparameters, or performing feature selection (Fig 1).

Fig 1. Overview of analysis scenarios.


This study consisted of five separate but related analyses. This diagram indicates which data type(s) was/were used and whether we attempted to improve predictive performance via hyperparameter optimization or feature selection in each analysis.

In Analysis 1, we used only gene-expression data as predictors and used default hyperparameters for each classification algorithm. S1 Fig illustrates the performance of these algorithms using area under the receiver operating characteristic curve (AUROC) as a performance metric. As a method of normalization, we ranked the classification algorithms for each combination of dataset and class variable. Two patterns emerged. Firstly, 15 of the 18 top-performing algorithms use kernel functions and/or ensemble approaches. Secondly, although some algorithms performed consistently well overall, they performed quite poorly in some cases. For example, the sklearn/logistic_regression algorithm—which used the LibLinear solver[17], a C value of 1.0, and no class weighting—resulted in the best average rank; yet for 7 (4.9%) of the dataset/class combinations, its performance ranked in the bottom quartile. The keras/snn algorithm resulted in the second-best average rank; yet for 4 (2.8%) of dataset/class combinations, its performance ranked in the bottom quartile.
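As a hedged illustration of this rank-based normalization, the snippet below ranks algorithms within each dataset/class combination by AUROC and then averages the ranks across combinations; the data frame contents are invented for illustration.

```python
import pandas as pd

# Illustrative benchmark results: one AUROC per algorithm per dataset/class combination.
results = pd.DataFrame({
    "dataset_class": ["GSE_A/relapse", "GSE_A/relapse", "GSE_B/stage", "GSE_B/stage"],
    "algorithm": ["sklearn/logistic_regression", "weka/ZeroR"] * 2,
    "auroc": [0.82, 0.50, 0.74, 0.49],
})

# Rank algorithms within each dataset/class combination (rank 1 = highest AUROC) ...
results["rank"] = results.groupby("dataset_class")["auroc"].rank(ascending=False)

# ... then average the ranks across combinations to compare algorithms overall.
print(results.groupby("algorithm")["rank"].mean().sort_values())
```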

This study focuses primarily on AUROC because it is widely used and is relatively robust to moderate levels of class imbalance. However, the performance rankings differed considerably depending on which evaluation metric we used. For example, in Analysis 1, many of the same algorithms that performed well according to AUROC also performed well according to classification accuracy (S2 Fig). However, classification accuracy does not account for class imbalance and thus may rank algorithms in a misleading way. For example, the weka/ZeroR algorithm was ranked 18th among the algorithms according to classification accuracy, even though the algorithm simply selects the majority class. (Our analysis included two-class and multi-class problems.) Rankings for the Matthews correlation coefficient (MCC) were relatively similar to those for AUROC. For example, sklearn/logistic_regression had the 2nd-best average rank according to this metric. However, in other cases, the rankings were considerably different. For example, the mlr/sda algorithm performed 3rd-best according to MCC but 28th according to AUROC (S3 Fig). The area under the precision-recall curve (AUPRC) is an alternative to the AUROC. In Analysis 1, AUROC and AUPRC scores and ranks were moderately correlated (S4, S5, and S6 Figs). AUPRC is recommended over AUROC when class imbalance is extreme[18,19]. Fig 2 shows the rankings for each algorithm across all metrics that we evaluated, highlighting that conclusions drawn from benchmark comparisons depend heavily on which metric(s) are considered important.
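The contrast between metrics can be reproduced with a toy example: on an imbalanced two-class problem, a majority-class predictor (analogous to weka/ZeroR) scores well on accuracy yet shows no skill by MCC, while AUROC and AUPRC are computed from predicted scores. This is a sketch with simulated labels, not data from the study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             roc_auc_score, average_precision_score)

y_true = np.array([0] * 90 + [1] * 10)                 # 90% majority class
y_pred = np.zeros(100, dtype=int)                      # always predict the majority class
y_prob = np.random.default_rng(1).uniform(size=100)    # uninformative prediction scores

print("Accuracy:", accuracy_score(y_true, y_pred))           # 0.90 despite no skill
print("MCC:", matthews_corrcoef(y_true, y_pred))             # 0.0, exposes the lack of skill
print("AUROC:", roc_auc_score(y_true, y_prob))               # ~0.5 for random scores
print("AUPRC:", average_precision_score(y_true, y_prob))     # ~0.1, tracks the positive rate
```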

Fig 2. Comparison of ranks for classification algorithms across performance metrics.


We calculated 14 performance metrics for each classification task. This graph shows results for Analysis 1 (using only gene-expression predictors). For each combination of dataset and class variable, we averaged the metric scores across all Monte Carlo cross-validation iterations. For some metrics (such as Accuracy), a relatively high value is desirable, whereas the opposite is true for other metrics (such as FDR). We ranked the classification algorithms such that relatively low ranks indicated more desirable performance for the metrics and averaged these ranks across the dataset/class combinations. This graph illustrates that the best-performing algorithms for some metrics do not necessarily perform optimally according to other metrics. AUROC = area under the receiver operating characteristic curve. AUPRC = area under the precision-recall curve. FDR = false discovery rate. FNR = false negative rate. FPR = false positive rate. MCC = Matthews correlation coefficient. MMCE = mean misclassification error. NPV = negative predictive value. PPV = positive predictive value.

Execution times differed substantially across the algorithms. For Analysis 1, Fig 3 categorizes each algorithm according to its ability to make effective predictions in combination with the computer time required to execute the classification tasks. The sklearn/logistic_regression algorithm not only outperformed other algorithms in terms of predictive ability but also was one of the fastest algorithms. In contrast, the mlr/randomForest algorithm was among the most predictive algorithms but was orders-of-magnitude slower than other top-performing algorithms. Execution time is a less-critical factor than predictive performance; however, when the eventual goal is to provide useful tools for clinical applications, execution times may be an important consideration.

Fig 3. Tradeoff between execution time and predictive performance for classification algorithms.


When using gene-expression predictors only (Analysis 1), we calculated the median area under the receiver operating characteristic curve (AUROC) across 50 iterations of Monte Carlo cross validation for each combination of dataset, class variable, and classification algorithm. Simultaneously, we measured the median execution time (in seconds) for each algorithm across these scenarios. sklearn/logistic_regression attained the top predictive performance and was the 4th fastest algorithm (median = 5.3 seconds). The coordinates for the y-axis have been transformed to a log-10 scale. We used arbitrary AUROC thresholds to categorize the algorithms based on low, moderate, and high predictive ability.

Some classification algorithms are commonly used and thus have been implemented in multiple machine-learning packages. For example, all three open-source libraries that we used in this study have implementations of the SVM and random forests algorithms. However, these implementations differ from each other, often supporting different hyperparameters or using different default values. For example, mlr/svm and weka/LibSVM are both wrappers for the LibSVM package[20]; both use a value of 1.0 for the C parameter and use the Radial Basis Function kernel. However, by default, mlr/svm scales numeric values to zero mean and unit variance, whereas weka/LibSVM performs no normalization by default. In Analysis 1, the predictive performance was similar for these different implementations. Their AUROC values were significantly correlated (r = 0.87; CI = 0.82–0.90; p = 2.2e-16). However, in some instances, their performance differed dramatically. For example, when predicting drug responses for dataset GSE20181, weka/LibSVM performed 2nd best, but mlr/svm performed worst among all algorithms. S7 and S8 Figs illustrate, for two representative datasets, that algorithms with similar methodologies often produced similar predictions; but these predictions were never perfectly correlated. Execution times also differed from one implementation to another; for example, the median execution time for weka/LibSVM was 27.9 seconds, but for mlr/svm it was 114.4 seconds. The median execution times differed significantly across the software packages (Kruskal-Wallis test; p-value = 1.1e-06). In general, the sklearn algorithms executed faster than algorithms from other packages (Fig 3).
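The effect of such defaults can be sketched with scikit-learn's LibSVM-based SVC by comparing a pipeline that standardizes features (as mlr/svm does by default) against one that does not (as with weka/LibSVM). The simulated data and the direction of any difference are illustrative only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 2000)) * rng.uniform(0.1, 100.0, size=2000)  # features on very different scales
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

unscaled = SVC(C=1.0, kernel="rbf")                                    # no normalization
scaled = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))     # zero mean, unit variance

print("Unscaled AUROC:", cross_val_score(unscaled, X, y, cv=5, scoring="roc_auc").mean())
print("Scaled AUROC:", cross_val_score(scaled, X, y, cv=5, scoring="roc_auc").mean())
```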

Some classification labels were easier to predict than others. Across the dataset/class combinations in Analysis 1, the median AUROC across all algorithms ranged between 0.44 and 0.97 (S1 Data). For a given dataset/class combination, algorithm performance varied considerably, though this variation was influenced partially by the weka/ZeroR results, which we used as controls. To gain insight into predictive performance for different types of class labels, we assigned a category to each class variable (S9 Fig); the best predictive performance was attained for class variables representing molecular markers, histological statuses, and diagnostic labels. Class variables in the “patient characteristics” category performed worst; these variables represented miscellaneous factors such as the patient’s family history of cancer, whether the patient had been diagnosed with multiple tumors, and the patient’s physical and cognitive “performance status” at the time of diagnosis.

Effects of using gene-expression predictors, clinical predictors, or both

In Analysis 2, we used only clinical predictors (for the dataset / class-variable combinations with available clinical data). Three linear-discriminant classifiers performed particularly well: mlr/sda, sklearn/lda, and mlr/glmnet (S10 Fig). Two Naïve Bayes algorithms also ranked among the top performers, whereas these algorithms had performed poorly in Analysis 1. Only two kernel-based algorithms were ranked among the top 10: weka/LibLINEAR and sklearn/logistic_regression. Both of these algorithms use the LibLINEAR solver. Most of the remaining kernel-based algorithms were among the worst performers. As with Analysis 1, most ensemble-based algorithms ranked in the top 25; however, none ranked in the top 5.

S2 Data shows the performance of each combination of dataset and class variable in Analysis 2. As with Analysis 1, the ability to predict particular classes and categories varied considerably (S11 Fig). For approximately two-thirds of the dataset/class combinations, AUROC values decreased relative to Analysis 1—sometimes by more than 0.3 (Fig 4A); however, in a few cases, predictive performance increased. The most dramatic improvement was for GSE58697, in which we predicted progression-free survival for desmoid tumors. The clinical predictors were age at diagnosis, biological sex, and tumor location. Salas, et al. previously found in a univariate analysis that age at diagnosis was significantly correlated with progression-free survival [21]. In contrast, we focused on patients who experienced relatively long or short survival times and used multivariate methods.

Fig 4. Relative predictive performance when training on gene-expression predictors alone vs. using clinical predictors alone or gene-expression predictors in combination with clinical predictors.


In both A and B, we used as a baseline the predictive performance that we attained using gene-expression predictors alone (Analysis 1). We quantified predictive performance using the area under the receiver operating characteristic curve (AUROC). In A, we show the relative increase or decrease in performance when using clinical predictors alone (Analysis 2). In most cases, AUROC values decreased; however, in a few cases, AUROC values increased (by as much as 0.42). In B, we show the relative change in performance when using gene-expression predictors in combination with clinical predictors (Analysis 3). For 82/109 (75%) of dataset/class combinations, including clinical predictors had no effect on performance. However, for the remaining combinations, the AUROC improved by as much as 0.15 and decreased by as much as 0.09.

In Analysis 3, we combined clinical and gene-expression predictors. We limited this analysis to the 108 dataset / class-variable combinations for which clinical predictors were available (S3 Data and S12 Fig). As with Analysis 1, kernel- and ensemble-based algorithms performed best overall (S13 Fig). For 90 (83.3%) of the dataset / class-variable combinations, the AUROC values were identical to those in Analysis 1 (Fig 4B). Except in three cases, the absolute change in AUROC was smaller than 0.05, including for GSE58697 (0.026 increase). These results suggest that standard classification algorithms (using default parameters) may not be well suited to datasets in which gene-expression and clinical predictors have simply been merged. The abundance of gene-expression variables may distract the algorithms and/or obfuscate signal from the relatively few clinical variables. Additionally, gene-expression and clinical predictors may carry redundant signals.
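For context, merging the two data types can be as simple as one-hot encoding the categorical clinical variables and concatenating them column-wise with the expression matrix; the sketch below uses invented variable names and values.

```python
import pandas as pd

expression = pd.DataFrame(
    {"ENSG_A": [5.1, 2.3, 7.8], "ENSG_B": [0.4, 1.9, 0.2]},
    index=["patient1", "patient2", "patient3"])

clinical = pd.DataFrame(
    {"age_at_diagnosis": [61, 48, 72], "sex": ["F", "M", "F"]},
    index=["patient1", "patient2", "patient3"])

# One-hot encode categorical clinical variables, then concatenate column-wise.
merged = pd.concat([expression, pd.get_dummies(clinical, columns=["sex"])], axis=1)
print(merged.head())  # in practice, a handful of clinical columns alongside thousands of genes
```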

Effects of performing hyperparameter optimization

In Analysis 4, we performed hyperparameter optimization via nested cross validation. Across all 52 classification algorithms, we employed 1116 distinct hyperparameter combinations under the assumption that the default settings may be suboptimal for the datasets we evaluated. When clinical predictors were available, we included them (as in Analysis 3). When no clinical predictors were available, we used gene-expression data only (as in Analysis 1). Again, kernel- and ensemble-based algorithms performed well overall (S14 Fig), although the individual rankings differed modestly from the previous analyses. The weka/LibLINEAR algorithm had the best median rank, and algorithms based on random forests were generally ranked lower than in previous analyses. For most dataset / class-variable combinations, the AUROC (median across all classification algorithms) improved with hyperparameter optimization (Fig 5A); however, in some cases, performance decreased.
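A minimal sketch of this nested procedure, assuming scikit-learn's GridSearchCV for the inner search and a small illustrative grid (not the study's actual 1116 combinations), is shown below.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 1000))      # simulated expression matrix
y = rng.integers(0, 2, size=150)      # simulated class labels

inner = StratifiedShuffleSplit(n_splits=5, train_size=2 / 3, random_state=0)
outer = StratifiedShuffleSplit(n_splits=5, train_size=2 / 3, random_state=1)

# Inner loop: choose hyperparameters within each training set, scored by AUROC.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner)

# Outer loop: refit the tuned model on each training set and score the held-out set.
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print("Median outer AUROC:", np.median(scores))
```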

Fig 5. Relative predictive performance when using default algorithm hyperparameters and all features vs. tuning hyperparameters or selecting features.


In both A and B, we use as a baseline the predictive performance that we attained using default hyperparameters for the classification algorithms (Analysis 3). We quantified predictive performance using the area under the receiver operating characteristic curve (AUROC). In A, we show the relative increase or decrease in performance when tuning hyperparameters within each training set (Analysis 4). In most cases, AUROC values increased. In B, we show the relative change in performance when performing feature selection within each training set (Analysis 5). AUROC increased for most dataset / class-variable combinations. The horizontal dashed lines indicate the median improvement across all dataset / class-variable combinations.

The best- and worst-performing class variables and categories were similar to the previous analyses (S15 Fig and S4 Data). We observed a positive trend in which datasets with larger sample sizes resulted in higher median AUROC values (S16 Fig); however, this relationship was not statistically significant (Spearman’s rho = 0.13; p = 0.13). We observed a slightly negative trend between the number of genes in a dataset and median AUROC (S17 Fig), but again this relationship was not statistically significant (rho = -0.07; p = 0.43).

Evaluating many hyperparameter combinations enabled us to quantify how much the predictive performance varied for different combinations. Some variation is desirable because it enables algorithms to adapt to diverse analysis scenarios; however, large amounts of variation make it difficult to select hyperparameter combinations that are broadly useful. For some classification algorithms, AUROC values varied widely across hyperparameter combinations when applied to a given dataset / class variable (S18 Fig). These variations were often different for algorithms with similar methodological approaches. For example, the median coefficient of variation was 0.22 for the sklearn/svm algorithm but 0.08 for mlr/svm and 0.06 for weka/LibSVM. In other cases, AUROC varied little across hyperparameter combinations. For example, the four algorithms with the highest median AUROC—weka/LibLINEAR, mlr/glmnet, sklearn/logistic_regression, and sklearn/extra_trees—had median coefficients of variation of 0.02, 0.03, 0.01, and 0.03, respectively. For each of these algorithms, we plotted the performance of all hyperparameter combinations across all dataset / class-variable combinations (S19, S20, S21, and S22 Figs). The default hyperparameter combination failed to perform best for any of these algorithms. Indeed, for two of the four algorithms, the default combination performed worst.
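For reference, the coefficient of variation summarized here is simply the standard deviation of an algorithm's AUROC values across its hyperparameter combinations divided by their mean; the values below are invented, and the use of the sample standard deviation is an assumption.

```python
import numpy as np

aurocs = np.array([0.71, 0.74, 0.69, 0.80, 0.77])   # one AUROC per hyperparameter combination
coef_of_variation = aurocs.std(ddof=1) / aurocs.mean()
print(round(coef_of_variation, 3))
```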

Of the 1116 total combinations, 1084 were considered best for at least one dataset / class-variable combination (based on average performance in inner cross-validation folds).

Effects of performing feature selection

In Analysis 5, we performed feature selection via nested cross validation. We used 14 feature-selection algorithms in combination with each classification algorithm. Due to the computational demands of evaluating these 728 combinations, we initially used default hyperparameters for both types of algorithms. The feature-selection algorithms differed in their methodological approaches (Table 1). In addition, some were univariate methods, while others were multivariate. Some feature-selection algorithms mirrored the behavior of classification algorithms (e.g., SVMs or random forests); others were based on statistical inference or entropy-based metrics.

Table 1. Summary of feature-selection algorithms.

We evaluated 14 feature-selection algorithms. The abbreviation for each algorithm contains a prefix that indicates which machine-learning library implemented the algorithm (mlr = Machine learning in R, sklearn = scikit-learn, weka = WEKA: The workbench for machine learning). For each algorithm, we provide a brief description of the algorithmic approach; we extracted these descriptions from the libraries that implemented the algorithms. In addition, we assigned high-level categories that indicate whether the algorithms evaluate a single feature (univariate) or multiple features (multivariate) at a time. In some cases, the individual machine-learning libraries aggregated algorithm implementations from third-party packages. In these cases, we cite the machine-learning library and the third-party package. When available, we also cite papers that describe the algorithmic methodologies used.

Abbreviation Description Category
mlr/cforest.importance Uses the permutation principle (based on Random Forests) to calculate standard and conditional importance of features[22–24] Multivariate
mlr/kruskal.test Uses the Kruskal-Wallis rank sum test[22,25] Univariate
mlr/randomForestSRC.rfsrc Uses the error rate for trees grown with and without a given feature[22,26,27] Multivariate
mlr/randomForestSRC.var.select Selects variables using minimal depth (Random Forests)[22,26,27] Multivariate
sklearn/mutual_info Calculates the mutual information between two feature clusterings[28,29] Univariate
sklearn/random_forest_rfe Recursively eliminates features based on Random Forests classification[28,30] Multivariate
sklearn/svm_rfe Recursively eliminates features based on support vector classification[28,31] Multivariate
weka/Correlation Calculates Pearson’s correlation coefficient between each feature and the class[32,33] Univariate
weka/GainRatio Measures the gain ratio of a feature with respect to the class[32,34] Univariate
weka/InfoGain Measures the information gain of a feature with respect to the class[32,34] Univariate
weka/OneR Evaluates the worth of a feature using the OneR classifier[32,35] Univariate
weka/ReliefF Repeatedly samples an instance and considers the value of a given attribute for the nearest instance of the same and different class[32,36] Multivariate
weka/SVMRFE Recursively eliminates features based on support vector classification[31,32] Multivariate
weka/SymmetricalUncertainty Measures the symmetrical uncertainty of a feature with respect to the class[32,37] Univariate

Once again, kernel- and ensemble-based classification algorithms performed best overall when feature selection was used (Fig 6). The median improvement per dataset / class-variable combination was slightly larger for feature selection than for hyperparameter optimization, and the maximal gains in predictive performance were larger for feature selection (Fig 5B and S5 Data). Overall, there was a strong positive correlation between AUROC values for Analyses 4 and 5 (Spearman’s rho = 0.73; S23 Fig). Among the 10 dataset / class-variable combinations that improved most after feature selection, 8 were associated with prognostic, stage, or patient-characteristic variables—categories that were most difficult to predict overall (S24 Fig). The remaining two combinations were molecular markers (HER2-neu and progesterone receptor status). Generally, the best performance was attained using 100 or 1000 features (S25 Fig).

Fig 6. Relative performance of classification algorithms using gene-expression and clinical predictors and performing feature selection.


We predicted patient states using gene-expression and clinical predictors with feature selection (Analysis 5). We used nested cross validation to estimate which features would be optimal for each algorithm in each training set. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

Across all classification algorithms, the weka/Correlation feature-selection algorithm resulted in the best predictive performance (S26 Fig), despite being a univariate method. This algorithm calculates Pearson’s correlation coefficient between each feature and the class values, a relatively simple approach that also ranked among the fastest (S27 Fig). Other univariate algorithms were among the top performers. To characterize algorithm performance further, we compared the feature ranks between all algorithm pairs for two of the datasets. Some pairs produced highly similar gene rankings, whereas in other cases the similarity was low (S28 and S29 Figs). The weka/Correlation and mlr/kruskal.test algorithms produced similar feature ranks; both rely on statistical inference, although the former is a parametric method and the latter is nonparametric.
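In the spirit of such a univariate correlation filter, the sketch below scores each gene by the absolute Pearson correlation between its expression values and a numerically encoded class, then keeps the top-ranked genes. It is a simplified stand-in, not the weka/Correlation implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5000))     # 80 patients x 5,000 genes (simulated)
y = rng.integers(0, 2, size=80)     # binary class labels encoded as 0/1

# Pearson correlation of each gene (column of X) with the class vector.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

k = 100
top_genes = np.argsort(-np.abs(corr))[:k]   # indices of the k highest-ranked genes
print(top_genes[:10])
```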

Some classification algorithms (e.g., weka/ZeroR and sklearn/decision_tree) performed poorly irrespective of feature-selection algorithm, whereas other classification algorithms (e.g., mlr/ranger and weka/LibLINEAR) performed consistently well across feature-selection algorithms (S30 Fig). The performance of other algorithms was more variable.

To provide guidance to practitioners, we examined interactions between individual feature-selection algorithms and classification algorithms (Fig 7). If a researcher had identified a particular classification algorithm to use, they might wish to select a feature-selection algorithm that performs well in combination with that classification algorithm. For example, the weka/Correlation feature-selection algorithm performed best overall, but it was only the 6th-best algorithm on average when sklearn/logistic_regression was used for classification. In contrast, a feature-selection algorithm that underperforms in general may perform well in combination with a given classification algorithm. For example, sklearn/svm_rfe performed poorly overall but was effective in combination with mlr/svm.

Fig 7. Relative classification performance per combination of feature-selection and classification algorithm.


For each combination of dataset and class variable, we averaged area under receiver operating characteristic curve (AUROC) values across all Monte Carlo cross-validation iterations. Then for each classification algorithm, we ranked the feature-selection algorithms based on AUROC scores across all datasets and class variables. Lower ranks indicate better performance. Dark-red boxes indicate cases where a particular feature-selection algorithm was especially effective for a particular classification algorithm. The opposite was true for dark-blue boxes.

We evaluated two alternatives for performing feature selection. Firstly, for 5 dataset/class combinations and 7 feature-selection algorithms, we used hyperparameter combinations for the feature-selection algorithms that differed from the defaults (a total of 59 hyperparameter combinations). The results were similar to Analysis 5 (S31 Fig and S6 Data), and the median change in AUROC per dataset/class combination was a decrease of 0.007. Secondly, all of the feature-selection algorithms in Analysis 5 are filter-based; that is, they rank features independently of classification. As an alternative, wrapper-based approaches evaluate the extent to which features improve classification performance. We evaluated this approach with two classification algorithms (sklearn/svm and sklearn/knn), selecting the top 0.01%, 0.1%, or 1% of features. The median change in AUROC per dataset/class combination was a decrease of 0.011. Additional benchmarks involving more algorithms and datasets are warranted in future studies.

Finally, we note that feature selection can be used to provide biological insight. Features that are consistently ranked highly for a given disease may be more likely to play a role in disease development or progression. For GSE10320 and GSE46691, we identified the 50 top-ranked genes, averaged across all algorithms (S7 Data), and used the Molecular Signatures Database to quantify the overlap between these gene lists and a curated “hallmark” set of gene sets known to play a role in tumorigenesis[38]. Three and four gene sets, respectively, significantly overlapped with the top-ranked genes (S8 and S9 Data). More extensive analysis and lab work would be required to validate these insights.

Discussion

The machine-learning community has developed hundreds of classification algorithms, spanning diverse methodological approaches[39]. Historically, most datasets available for testing had fewer than 100 predictor variables, so most algorithms were created and optimized for that use case[40]. Consequently, the execution time and predictive performance of many classification algorithms may be unsatisfactory when datasets consist of thousands of predictor variables; the algorithms may have difficulty identifying the most informative features in the data[41,42].

This benchmark study is considerably larger than any prior study of classification algorithms applied to gene-expression data. When gene-expression microarrays became common in biomedical research in the early 2000s, researchers began exploring the potential to make clinically relevant predictions and overcome these challenges[43–47]. As a result of data-sharing policies, gene-expression datasets were increasingly available in the public domain, and researchers performed benchmark studies, comparing the effectiveness of classification algorithms on gene-expression data[14,48–50]. Each of these studies evaluated between 5 and 21 algorithmic variants. In addition, the authors typically used at least one method of feature selection to reduce the number of predictor variables. The studies used as many as 7 datasets, primarily from tumor cells (and often adjacent normal cells). The authors focused mostly on classical algorithms, including k-Nearest Neighbors[51], linear discriminant analysis[52], and the multi-layer perceptron[53]. Pochet, et al. also explored the potential for nonlinear Support Vector Machine (SVM) classifiers to increase predictive performance relative to linear methods[49,54]. Later benchmark studies highlighted two types of algorithm—SVM and random forests[30]—that performed relatively well on gene-expression data[42,55–57]. Statnikov, et al. examined 22 datasets and specifically compared the predictive capability of these two algorithm types. Overall, they found that SVMs significantly outperformed random forests, although random forests outperformed SVMs in some cases[42]. Perhaps in part due to these highly cited studies, SVMs and random forests have been used heavily in diverse types of biomedical research over the past two decades[58].

Community efforts—especially the Sage Bionetworks DREAM Challenges and Critical Assessment of Massive Data Analysis challenges[59–61]—have encouraged the development and refinement of predictive models to address biomedical questions. In these benchmark studies, the priority is to maximize predictive performance and thus increase the potential that the models will have in practical use. Accordingly, participants have flexibility to use alternative normalization or summarization methods, to use alternative subsets of the training data, to combine algorithms, etc. These strategies often prove useful; however, this heterogeneity makes it difficult to deconvolve the relationship between a given solution’s performance and the underlying algorithm(s), hyperparameters, and features used.

Our primary motivation is to provide helpful advice for practitioners who perform biomarker studies. Identifying algorithm(s) and hyperparameter(s) that perform consistently well in this setting may ultimately lead to patient benefits. In situations where a biomarker is applied to thousands of cancer patients, even modest increases in accuracy can benefit hundreds of patients. Accordingly, we questioned whether SVM and random forests algorithms would continue to be the top performers when compared against diverse types of classification algorithms. We also questioned whether there would be scenarios in which these algorithms would perform poorly. Furthermore, relatively little has been known about the extent to which algorithm choice affects predictive success for a given dataset. Thus, we questioned how much variance in predictive performance we would see across the algorithms. In addition, we evaluated practical matters such as tradeoffs between predictive performance and execution time, the extent to which algorithm rankings are affected by the performance metric used, and which algorithms behave most similarly—or differently—to each other.

Our secondary motivation was to help bridge the gap between machine-learning researchers who develop general-purpose algorithms and biomedical researchers who seek to apply them in a specific context. When selecting algorithm(s), hyperparameters, and features to use in a biomarker study, researchers might base their decisions on what others have reported in the literature for a similar study; or they might consider anecdotal experiences that they or their colleagues have had. However, these decisions may lack an empirical basis and not generalize from one analysis to another. Alternatively, researchers might apply many algorithms to their data to estimate which algorithm(s) will perform best. However, this approach is time- and resource-intensive and may lead to bias if the comparisons are not performed in a rigorous manner. In yet another approach, researchers might develop a custom classification algorithm, perhaps one that is specifically designed for the target data. However, it is difficult to know whether such an algorithm would outperform existing, classical algorithms.

Many factors can affect predictive performance in a biomarker study. These factors include data-generation technologies, data normalization / summarization processes, validation strategies, and evaluation metrics used. Although such factors must be considered, we have shown that when holding them constant, the choice of algorithm, hyperparameter combination, and features usually affects predictive performance for a given dataset—sometimes dramatically. Despite these variations, we have demonstrated that particular algorithms and algorithm categories consistently outperform others across diverse gene-expression datasets and class variables. However, even the best algorithms performed poorly in some cases. These findings support the theory that no single algorithm is universally optimal[62]. But they also suggest that researchers can increase the odds of success in developing accurate biomarkers by focusing on a few top-performing algorithms and using hyperparameter optimization and/or feature selection, despite the additional computational demands in performing these steps. However, it is subjective to decide which characteristics to optimize and whether such optimization will reap rewards.

We deliberately focused on general-purpose algorithms because they are readily available in well-maintained, open-source packages. Of necessity, the list of algorithms and hyperparameter combinations that we evaluated was not exhaustive. Other algorithms or hyperparameter combinations may have performed better than those that we used. Many studies have proposed algorithm variations designed for feature selection and/or classification of gene-expression data[63–72]. Some algorithms in our study had more hyperparameter combinations than others, which may have enabled those algorithms to adapt better in Analysis 4. Additionally, in some cases, our hyperparameter combinations were inconsistent between two algorithms of the same type because different software libraries support different options. Despite these limitations, a key advantage of our benchmarking approach is that we performed these comparisons in an impartial manner, not having developed any of the algorithms that we evaluated nor having other conflicts of interest that might bias our results.

Generally, kernel- and ensemble-based algorithms outperformed other types of algorithms in our analyses. Other algorithm types—such as linear-discriminant and neural-network algorithms—performed well in some scenarios. Deep neural networks have received much attention in the biomedical literature over the past decade[73]. This study included three types of deep neural networks. keras/snn and keras/dnn used fully connected networks; the hyperparameter combinations differed in the number of nodes, number of layers, dropout rate, regularization rate, number of epochs, and whether batch normalization was used. The mlr/h2o.deeplearning algorithm provided many of the same options. In Analysis 1, the keras/snn and keras/dnn algorithms ranked among the top 11; however, their performance dropped in subsequent analyses. The mlr/h2o.deeplearning algorithm performed at mediocre levels in all of our analyses. Custom adaptations to this (or any other) deep-learning algorithm may improve predictive performance in future studies. Efforts to improve predictive ability might also include optimizing hyperparameters of feature-selection algorithms, combining hyperparameter-optimized classification algorithms with feature selection, and using multiple classifier systems[74]. Transfer learning across datasets may also prove fruitful[75].
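As a hedged sketch of a fully connected, self-normalizing network in the spirit of keras/snn (SELU activations with a matching dropout variant), the example below uses illustrative layer sizes, dropout rate, and epoch count rather than the study's hyperparameter grid.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 5000
model = keras.Sequential([
    layers.Dense(256, activation="selu", kernel_initializer="lecun_normal",
                 input_shape=(n_features,)),
    layers.AlphaDropout(0.1),   # dropout variant designed for SELU networks
    layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    layers.Dense(1, activation="sigmoid"),  # probabilistic prediction for a two-class problem
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auroc")])

X = np.random.default_rng(5).normal(size=(100, n_features)).astype("float32")
y = np.random.default_rng(6).integers(0, 2, size=100)
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```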

Our findings are specific to high-throughput gene-expression datasets that have either no clinical predictors or a small set of clinical predictors. However, our conclusions may have relevance to other datasets that include many features and that include a combination of numeric and categorical features.

We applied Monte Carlo cross validation to each dataset separately and thus did not evaluate predictive performance on independent datasets. This approach was suitable for our benchmark comparison because our priority was to compare algorithms against each other rather than to optimize their performance for clinical use. On another note, comparisons across machine-learning packages are difficult to make. For example, some sklearn algorithms provided the ability to automatically address class imbalance, whereas other software packages did not always provide this functionality. Adapting these weights manually was infeasible for this study. Accordingly, future research that specifically focuses on under-sampling, over-sampling, and other methods to correct for class imbalance is warranted. In addition, some classification algorithms are designed to produce probabilistic predictions, whereas other algorithms produce only discrete predictions. The latter algorithms may have been at a disadvantage in our benchmark for the AUROC and other metrics.

Methods

Ethics statement

Brigham Young University’s Institutional Review Board approved this study under exemption status. This study uses data collected from public repositories only. We played no part in patient recruiting or in obtaining consent.

Data preparation

We used 50 datasets spanning diverse diseases and tissue types but focused primarily on cancer-related conditions. We used data from two sources. The first was a resource created by Golightly, et al.[76] that includes 45 datasets from Gene Expression Omnibus[77]. For these datasets, the gene-expression data were generated using Affymetrix microarrays, normalized using Single Channel Array Normalization[78], summarized using BrainArray annotations[79], quality checked using IQRay[80] and DoppelgangR[81], and batch-adjusted (where applicable) using ComBat[82]. Depending on the Affymetrix platform used, expression levels were available for 11,832 to 21,614 genes. For the remaining 5 datasets, we used RNA-Sequencing data from The Cancer Genome Atlas (TCGA)[83], representing 5 tumor types: colorectal adenocarcinoma (COAD), bladder urothelial carcinoma (BLCA), kidney renal clear cell carcinoma (KIRC), prostate adenocarcinoma (PRAD), and lung adenocarcinoma (LUAD). These data had been aligned and quantified using the Rsubread and featureCounts packages[84,85], resulting in transcripts-per-million values for 22,833 genes[86]. All gene-expression data were labeled using Ensembl gene identifiers[87].

For the microarray datasets, we used the class variables and clinical variables identified by Golightly, et al. (an average of 2.8 class variables per dataset)[76]. For the RNA-Sequencing datasets, we identified a total of 16 class variables. When a given sample was missing data for a given class variable, we excluded that sample from the analyses. Some class variables were continuous in nature (e.g., overall survival). We discretized these variables to enable classification, taking into account censor status where applicable. To support consistency and human interpretability across datasets, we assigned a standardized name and category to each class variable; the original and standardized names are available in S10 Data.

For most of the Golightly, et al. datasets, at least one clinical variable had been identified as a potential predictor variable. For TCGA datasets, we selected multiple clinical-predictor variables per dataset. Across all datasets, the mean and median number of clinical predictors per dataset were 3.1 and 2.0, respectively (S10 Data). We avoided combinations of clinical-predictor variables and class variables that were potentially confounded. For example, when a dataset included cancer stage as a class variable, we excluded predictor variables such as tumor grade or histological status. In some cases, no suitable predictor variable was available for a given class variable, leaving only gene-expression variables as predictors; this was true for 35 class variables.

Algorithms used

We used 52 classification algorithms that were implemented in the ShinyLearner tool, which enables researchers to benchmark algorithms from open-source machine-learning libraries and is redistributed as software containers[88,89]. We used implementations from the mlr R package (version 2; R version 3.5)[22], sklearn Python module (versions 0.18–0.22)[28], Weka Java application (version 3.6)[32], and keras Python module (2.6.0). Table 2 lists each algorithm, along with a description and methodological category for each algorithm. Furthermore, it indicates the open-source software package that implemented the algorithm, as well as the number of unique hyperparameter combinations that we evaluated for each algorithm. A full list can be found in S11 Data. Among the classification algorithms was Weka’s ZeroR, which predicts all instances to have the majority class. We included this algorithm as a sanity check[90] and a baseline against which all other algorithms could be compared. Beyond the 52 classification algorithms that we used, additional algorithms were available in ShinyLearner. We excluded algorithms that raised exceptions when we used default hyperparameters, required excessive amounts of random access memory (75 gigabytes or more), or were orders of magnitude slower than the other algorithms.

Table 2. Summary of classification algorithms.

We compared the predictive ability of 52 classification algorithms that were available in ShinyLearner and had been implemented across 4 open-source machine-learning libraries. The abbreviation for each algorithm contains a prefix indicating which machine-learning library implemented the algorithm (mlr = Machine learning in R, sklearn = scikit-learn, weka = WEKA: The workbench for machine learning; keras = Keras). For each algorithm, we provide a brief description of the algorithmic approach; we extracted these descriptions from the libraries that implemented the algorithms. In addition, we assigned high-level categories that characterize the algorithmic methodology used by each algorithm. In some cases, the individual machine-learning libraries aggregated algorithm implementations from third-party packages. In these cases, we cite the machine-learning library and the third-party package. When available, we also cite papers that describe the algorithmic methodologies used. Finally, for each algorithm, we indicate the number of unique hyperparameter combinations evaluated in Analysis 4.

Abbreviation Description Category Combos
keras/dnn Multi-layer neural network with Exponential Linear Unit activation[91,92] Artificial neural network 54
keras/snn Multi-layer neural network with Scaled Exponential Linear Unit activation[91,92] Artificial neural network 54
mlr/C50 C5.0 Decision Trees[22,93] Tree- or rule-based 32
mlr/ctree Conditional Inference Trees[22,94] Tree- or rule-based 4
mlr/earth Multivariate Adaptive Regression Splines[22,95] Linear discriminant 36
mlr/gausspr Gaussian Processes[22,96] Kernel-based 3
mlr/glmnet Generalized Linear Models with Lasso or Elasticnet Regularization[22,97] Linear discriminant 3
mlr/h2o.deeplearning Deep Neural Networks[22,92,98] Artificial neural network 32
mlr/h2o.gbm Gradient Boosting Machines[22,98,99] Ensemble 16
mlr/h2o.randomForest Random Forests[22,30,98] Ensemble 12
mlr/kknn k-Nearest Neighbor[22,100] Miscellaneous 6
mlr/ksvm Support Vector Machines[22,54,96] Kernel-based 40
mlr/mlp Multi-Layer Perceptron[22,53,101] Artificial neural network 14
mlr/naiveBayes Naive Bayes[22,102] Miscellaneous 2
mlr/randomForest Breiman and Cutler’s Random Forests[22,103] Ensemble 12
mlr/randomForestSRC Fast Unified Random Forests for Survival, Regression, and Classification[22,26,27] Ensemble 108
mlr/ranger A Fast Implementation of Random Forests[22,104] Ensemble 12
mlr/rpart Recursive Partitioning and Regression Trees[22,105,106] Tree- or rule-based 1
mlr/RRF Regularized Random Forests[22,107] Ensemble 24
mlr/sda Shrinkage Discriminant Analysis[22,108] Linear discriminant 2
mlr/svm Support Vector Machines[20,22,102] Kernel-based 28
mlr/xgboost eXtreme Gradient Boosting[22,109] Ensemble 3
sklearn/adaboost AdaBoost[28,110] Ensemble 8
sklearn/decision_tree A decision tree classifier[28] Tree- or rule-based 96
sklearn/extra_trees An extra-trees classifier[28] Ensemble 24
sklearn/gradient_boosting Gradient Boosting for classification[28,99] Ensemble 6
sklearn/knn k-nearest neighbors vote[28,51] Miscellaneous 12
sklearn/lda Linear Discriminant Analysis[28] Linear discriminant 3
sklearn/logistic_regression Logistic Regression[28,111] Kernel-based 32
sklearn/multilayer_perceptron Multi-layer Perceptron[28,53] Artificial neural network 24
sklearn/random_forest Random Forests[28,30] Ensemble 24
sklearn/sgd Linear classifiers with stochastic gradient descent training[28,112] Linear discriminant 36
sklearn/svm C-Support Vector Classification[28,54] Kernel-based 32
weka/Bagging Bagging a classifier to reduce variance[32,113] Ensemble 32
weka/BayesNet Bayes Network learning using various search algorithms and quality measures[32,114] Miscellaneous 2
weka/DecisionTable Simple decision table majority classifier[32,115] Tree- or rule-based 6
weka/HoeffdingTree Hoeffding tree[32,116] Tree- or rule-based 32
weka/HyperPipes HyperPipe classifier[32] Miscellaneous 1
weka/J48 Pruned or unpruned C4.5 decision tree[32,117] Tree- or rule-based 96
weka/JRip Repeated Incremental Pruning to Produce Error Reduction[32,118] Tree- or rule-based 12
weka/LibLINEAR LIBLINEAR—A Library for Large Linear Classification[17,32] Kernel-based 16
weka/LibSVM Support vector machines[20,32] Kernel-based 32
weka/NaiveBayes A Naive Bayes classifier using estimator classes[32,119] Miscellaneous 3
weka/OneR 1R (1 rule) classifier[32,35] Tree- or rule-based 3
weka/RandomForest Forest of random trees[30,32] Ensemble 18
weka/RandomTree Tree that considers K randomly chosen attributes at each node[32] Tree- or rule-based 2
weka/RBFNetwork Normalized Gaussian radial basis function network[32] Miscellaneous 18
weka/REPTree Fast decision tree learner (reduced-error pruning with backfitting)[32] Tree- or rule-based 16
weka/SimpleLogistic Linear logistic regression models[32,120,121] Linear discriminant 5
weka/SMO Sequential minimal optimization for a support vector classifier[32,122–124] Kernel-based 20
weka/VFI Voting feature intervals[32,125] Miscellaneous 6
weka/ZeroR 0-R classifier (predicts the mean for a numeric class or the mode for a nominal class)[32] Baseline 1

For feature selection, we used 14 algorithms that had been implemented in ShinyLearner[89]. Table 1 lists each algorithm, along with a description and high-level category for each algorithm. S12 Data lists hyperparameters evaluated for these algorithms.

For all software implementations that supported it, we set the hyperparameters so that the classification algorithms would produce probabilistic predictions and use a single process/thread. Unless otherwise noted, we used default hyperparameter values for each algorithm, as dictated by the respective software implementations. For feature selection, we used n_features_to_select = 5 and step = 0.1 for the sklearn/random_forest_rfe and sklearn/svm_rfe methods to balance computational efficiency with the size of the datasets. For sklearn/random_forest_rfe, we specified n_estimators = 50 because execution failed when fewer estimators were used.
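The sketch below shows how these recursive-feature-elimination settings map onto scikit-learn's RFE interface; the linear SVM kernel is an assumption for illustration, since RFE requires an estimator that exposes coefficients or feature importances.

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

svm_rfe = RFE(
    estimator=SVC(kernel="linear"),   # linear kernel exposes coef_ for ranking
    n_features_to_select=5,
    step=0.1)                         # drop 10% of the remaining features per iteration

rf_rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50),  # 50 trees, as noted above
    n_features_to_select=5,
    step=0.1)

# Example usage (X_train and y_train are placeholders):
# ranking = svm_rfe.fit(X_train, y_train).ranking_   # rank 1 marks the selected features
```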

To analyze the benchmark results, we wrote scripts for Python (version 3.6)[126] and the R statistical software (version 4.0.2)[127]. We also used the corrplot[128], cowplot[129], ggrepel[130], and tidyverse[131] packages.

Analysis phases

We performed this study in five phases (Fig 1). In each phase, we modulated either the data used or the optimization approach. In Analysis 1, we used gene-expression predictors only and default hyperparameter values for each classification algorithm. In Analysis 2, we used clinical predictors only and default hyperparameter values for each classification algorithm. In Analysis 3, we used gene-expression and clinical predictors and default hyperparameter values. In Analysis 4, we used both types of predictors and selected hyperparameter values via nested cross-validation. In Analysis 5, we used both types of predictors and selected the most relevant n features via nested cross validation before performing classification.

In each phase, we used Monte Carlo cross validation. For each iteration, we randomly assigned the patient samples to either a training set or test set, stratified by class. We assigned approximately 2/3 of the patient samples to the training set. We then made predictions for the test set and evaluated the predictions using diverse metrics (see below). We repeated this process (an iteration) multiple times and used the iteration number as a random seed when assigning samples to the training or test set (unless otherwise noted). ShinyLearner relays this seed to the underlying algorithms, where applicable.
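A minimal sketch of this Monte Carlo scheme, assuming scikit-learn's StratifiedShuffleSplit and using the iteration number as the random seed, is shown below.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(7)
X = rng.normal(size=(90, 1000))    # simulated expression matrix
y = rng.integers(0, 2, size=90)    # simulated class labels

n_iterations = 50
for iteration in range(1, n_iterations + 1):
    # Stratified split with roughly 2/3 of samples assigned to the training set.
    splitter = StratifiedShuffleSplit(n_splits=1, train_size=2 / 3, random_state=iteration)
    train_idx, test_idx = next(splitter.split(X, y))
    # ... train on X[train_idx], y[train_idx]; evaluate predictions for X[test_idx] ...
```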

During Analysis 1, we evaluated the number of Monte Carlo iterations that would be necessary to provide a stable performance estimate. For the mlr/randomForest, sklearn/svm, and weka/Bagging classification algorithms, we executed 100 iterations for datasets GSE10320 (predicting relapse vs. non-relapse for Wilms tumor patients) and GSE46691 (predicting early metastasis following radical prostatectomy). As the number of iterations increased, we calculated the cumulative average of the AUROC for each algorithm. After performing at most 40 iterations, the cumulative averages did not change more than 0.01 over sequences of 10 iterations (S32 and S33 Figs). To be conservative, we used 50 iterations in Analysis 1, Analysis 2, and Analysis 3. In Analysis 4 and Analysis 5, we used 5 iterations because hyperparameter optimization and feature selection are CPU and memory intensive. When optimizing hyperparameters (Analysis 4), we used Monte Carlo cross validation for each training set (5 nested iterations) to estimate which hyperparameter combination was most effective for each classification algorithm; we used AUROC as a metric in these evaluations. When performing feature selection (Analysis 5), we used nested Monte Carlo cross validation (5 iterations). In each iteration, we ranked the features using each feature-selection algorithm and performed classification using the top-n features. We repeated this process for each classification algorithm and used n values of 1, 10, 100, 1000, and 10000. For a given combination of feature-selection algorithm and classification algorithm, we identified the n value that resulted in the highest AUROC. We used this n value in the respective outer fold. Finally, when identifying the most informative features across Monte Carlo iterations, we used the Borda Count method to combine the ranks[74].
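The sketch below illustrates the general shape of this procedure under simplifying assumptions: a univariate F-test stands in for the 14 feature-selection algorithms, logistic regression stands in for the classification algorithms, and the data are synthetic. It shows how an n value can be chosen via inner Monte Carlo cross validation and how feature ranks can be combined across iterations with a Borda count; it is not the study's implementation.

```python
# Illustrative sketch only: choose the number of top-ranked features, n, via
# nested Monte Carlo cross validation, then combine ranks with a Borda count.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=2000, random_state=0)
candidate_n = [1, 10, 100, 1000]  # the study also evaluated n = 10000

def rank_features(X, y):
    """Return feature indices ordered from most to least informative."""
    scores, _ = f_classif(X, y)
    return np.argsort(-scores)

def inner_cv_auroc(X_tr, y_tr, n, iterations=5):
    """Estimate AUROC for the top-n features via inner Monte Carlo CV."""
    aurocs = []
    for i in range(1, iterations + 1):
        X_in, X_val, y_in, y_val = train_test_split(
            X_tr, y_tr, train_size=2 / 3, stratify=y_tr, random_state=i)
        top = rank_features(X_in, y_in)[:n]
        clf = LogisticRegression(max_iter=1000).fit(X_in[:, top], y_in)
        aurocs.append(roc_auc_score(y_val, clf.predict_proba(X_val[:, top])[:, 1]))
    return np.mean(aurocs)

# Outer fold: choose n on the training set, then evaluate on the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, stratify=y, random_state=1)
best_n = max(candidate_n, key=lambda n: inner_cv_auroc(X_train, y_train, n))
top = rank_features(X_train, y_train)[:best_n]
clf = LogisticRegression(max_iter=1000).fit(X_train[:, top], y_train)
outer_auroc = roc_auc_score(y_test, clf.predict_proba(X_test[:, top])[:, 1])
print("best n:", best_n, "outer AUROC:", round(outer_auroc, 3))

# Borda count: sum each feature's rank across iterations; smaller totals indicate
# features that were consistently ranked as more informative.
def borda(rank_orders, n_features):
    totals = np.zeros(n_features)
    for order in rank_orders:
        ranks = np.empty(n_features)
        ranks[order] = np.arange(n_features)
        totals += ranks
    return np.argsort(totals)

orders = []
for i in range(1, 6):
    X_tr, _, y_tr, _ = train_test_split(
        X, y, train_size=2 / 3, stratify=y, random_state=i)
    orders.append(rank_features(X_tr, y_tr))
print("Top 5 features by Borda count:", borda(orders, X.shape[1])[:5])
```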

While executing each analysis phase, we encountered situations in which we did not obtain valid results for every combination of class variable and algorithm, as noted below.

Analysis 1. On iteration 34, the weka/RBFNetwork algorithm did not converge after 24 hours of execution time for one of the datasets. We manually changed the random seed from 34 to 134, and it converged in minutes.

Analysis 2. The mlr/glmnet algorithm failed three times due to an internal error. We limited the results for this algorithm to the iterations that completed successfully.

Analysis 3. On iteration 34, the weka/RBFNetwork algorithm did not converge after 24 hours of execution time for one of the datasets. We manually changed the random seed from 34 to 134, and it converged in minutes.

Analysis 4. During nested Monte Carlo cross validation, we specified a time limit of 168 hours under the assumption that some hyperparameter combinations would be especially time intensive. A total of 1022 classification tasks failed either due to this limit or due to small sample sizes. We ignored these hyperparameter combinations when determining the top-performing combinations. Most failures were associated with the mlr/h2o.gbm and mlr/ksvm classification algorithms.

Analysis 5. During nested Monte Carlo cross validation, we specified a time limit of 168 hours. A total of 1408 classification tasks failed either due to this limit or due to small sample sizes. We ignored these tasks when performing hyperparameter optimization.

Computing resources

We performed these analyses using Linux servers supported by Brigham Young University’s Office of Research Computing and Life Sciences Information Technology. In addition, we used virtual servers in Google’s Compute Engine environment supported by the Institute for Systems Biology and the United States National Cancer Institute Cancer Research Data Commons[132]. When multiple central-processing cores were available on a given server, we executed tasks in parallel using GNU Parallel [133].

Performance metrics

In outer cross-validation folds, we used diverse metrics to quantify classification performance. These included accuracy (proportion of correct predictions), AUROC[134], AUPRC, balanced accuracy (the mean of recall across classes)[135], Brier score, F1 score[136], false discovery rate (false positives divided by the total number of predicted positives), false positive rate, Matthews correlation coefficient[137], mean misclassification error (MMCE), negative predictive value, positive predictive value (precision), and recall (sensitivity). Many of these metrics require discretized predictions; we relied on the machine-learning packages that implemented each algorithm to convert probabilistic predictions to discretized predictions.
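For readers who wish to reproduce these calculations, the sketch below shows how several of the listed metrics can be computed with scikit-learn for a toy set of binary labels and probabilistic predictions. The variable names, the 0.5 threshold (standing in for each package's own discretization), and the use of average precision as an AUPRC estimate are assumptions, not the study's code.

```python
# Illustrative computation of several performance metrics for binary predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, brier_score_loss, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)

y_test = np.array([0, 0, 1, 1, 1, 0, 1, 0])          # placeholder test labels
probs = np.array([0.2, 0.4, 0.9, 0.6, 0.3, 0.1, 0.8, 0.7])  # placeholder probabilities
preds = (probs >= 0.5).astype(int)                    # discretized predictions

print("accuracy:            ", accuracy_score(y_test, preds))
print("balanced accuracy:   ", balanced_accuracy_score(y_test, preds))
print("AUROC:               ", roc_auc_score(y_test, probs))
print("AUPRC (avg precision):", average_precision_score(y_test, probs))
print("Brier score:         ", brier_score_loss(y_test, probs))
print("F1 score:            ", f1_score(y_test, preds))
print("MCC:                 ", matthews_corrcoef(y_test, preds))
print("precision (PPV):     ", precision_score(y_test, preds))
print("recall (sensitivity):", recall_score(y_test, preds))
# FDR and MMCE follow directly: FDR = 1 - precision; MMCE = 1 - accuracy.
```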

Supporting information

S1 Fig. Relative performance of classification algorithms using gene-expression predictors and area under the receiver operating characteristic curve as the metric.

We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination. The top 15 performers (relatively low ranks) were algorithms that use linear decision boundaries, kernel functions, and/or ensembles of decision trees.

(PDF)

S2 Fig. Relative performance of classification algorithms using gene-expression predictors and classification accuracy as the metric.

We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of classification accuracy across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

(PDF)

S3 Fig. Relative performance of classification algorithms using gene-expression predictors and Matthews Correlation Coefficient as the metric.

We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of the Matthews Correlation Coefficient across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

(PDF)

S4 Fig. Relative performance of classification algorithms using gene-expression predictors and area under the precision-recall curve as the metric.

We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the precision-recall curve across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

(PDF)

S5 Fig. Comparison of area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) scores for Analysis 1.

(PDF)

S6 Fig. Comparison of area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) scores for Analysis 1, based on ranks (relative performance per algorithm).

(PDF)

S7 Fig. Pairwise correlations of sample-level, probabilistic predictions between classification algorithms for dataset GSE10320.

We used each classification algorithm to make probabilistic predictions of relapse in Wilms tumor patients (GSE10320). Based on these predictions, we calculated the Spearman correlation coefficient for each pair of algorithms. These coefficients, averaged across Monte Carlo cross-validation iterations, are illustrated as a correlation plot, clustered based on similarity.

(PDF)

S8 Fig. Pairwise correlations of sample-level, probabilistic predictions between classification algorithms for dataset GSE46691.

We used each classification algorithm to make probabilistic predictions of early metastasis following radical prostatectomy (GSE46691). Based on these predictions, we calculated the Spearman correlation coefficient for each pair of algorithms. These coefficients, averaged across Monte Carlo cross-validation iterations, are illustrated as a correlation plot, clustered based on similarity.

(PDF)

S9 Fig. Dataset performance by class category when using gene-expression predictors.

For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 1, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as the metric. The dashed, red line indicates the performance expected by random chance. The top-performing category was “Molecular Marker,” which includes class variables associated with mutation status, immunohistochemistry markers of protein expression, presence or absence of chromosomal aberrations, etc. The lowest-performing category was “Patient Characteristic,” which includes variables such as family history of cancer, diagnosis with multiple tumors, and patient performance status.

(PDF)

S10 Fig. Relative performance of classification algorithms using clinical predictors and area under the receiver operating characteristic curve as the metric.

We predicted patient states using clinical predictors only (Analysis 2). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination (some datasets did not have clinical predictors). The top-performing algorithms (relatively low ranks) were similar overall to Analysis 1; however, some differences were large. For example, weka/NaiveBayes performed best overall in Analysis 2 but was ranked 28th in Analysis 1.

(PDF)

S11 Fig. Dataset performance by class category when using clinical predictors.

For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 2, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as the metric. The dashed, red line indicates the performance expected by random chance. The top-performing category was “Diagnosis,” which includes class variables associated with a particular disease or subtype. The lowest-performing category was “Patient Characteristic,” which includes variables such as family history of cancer, diagnosis with multiple tumors, and patient performance status.

(PDF)

S12 Fig. Dataset performance by class category when using gene-expression and clinical predictors.

For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 3, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as a metric. The dashed, red line indicates the performance expected by random chance. As with Analysis 1 (S9 Fig), the top-performing category was “Molecular Marker,” which includes class variables associated with mutation status, immunohistochemistry markers of protein expression, presence or absence of chromosomal aberrations, etc. The lowest-performing category was “Patient Characteristic,” which includes variables such as family history of cancer, diagnosis with multiple tumors, and patient performance status.

(PDF)

S13 Fig. Relative performance of classification algorithms using gene-expression and clinical predictors.

We predicted patient states using gene-expression and clinical predictors (Analysis 3). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

(PDF)

S14 Fig. Relative performance of classification algorithms using gene-expression and clinical predictors and performing hyperparameter optimization.

We predicted patient states using gene-expression and clinical predictors with hyperparameter optimization (Analysis 4). We used nested cross validation to estimate which hyperparameter combination would be optimal for each algorithm in each training set. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination. The algorithm rankings followed similar trends as Analysis 3 (no hyperparameter optimization); however, some differences are notable. For example, the weka/LibLINEAR and mlr/glmnet algorithms were ranked 11th and 16th in Analysis 3 (S13 Fig), but they were ranked 1st and 2nd in this analysis.

(PDF)

S15 Fig. Dataset performance by class category when using gene-expression and clinical predictors and performing hyperparameter optimization.

For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 4, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as a metric. The dashed, red line indicates the performance expected by random chance.

(PDF)

S16 Fig. Correlation between predictive performance and number of samples per dataset.

The number of patient samples differed by dataset. This scatterplot shows the relationship between the median area under the receiver operating characteristic curve (AUROC) and the number of samples in each dataset. We did not observe a significant correlation between these variables.

(PDF)

S17 Fig. Correlation between predictive performance and number of genes per dataset.

Due to differences in gene-expression profiling platforms, we had data for more genes in some datasets than in others. This scatterplot shows the relationship between the median area under the receiver operating characteristic curve (AUROC) and the number of genes in each dataset. We did not observe a significant correlation between these variables.

(PDF)

S18 Fig. Variation in predictive performance across hyperparameter combinations.

In Analysis 4, we used nested cross validation to evaluate multiple hyperparameter combinations for each classification algorithm. We assessed the extent to which the area under the receiver operating characteristic curve (AUROC) varied across the hyperparameter combinations for each algorithm. For each combination of dataset, class variable, classification algorithm, and hyperparameter set, we averaged AUROC values across 5 Monte Carlo cross-validation iterations. Then we calculated the coefficient of variation for these averaged values across each combination of dataset/class and classification algorithm. Relatively low values indicate that the hyperparameter sets resulted in similar predictive performance. No results are available for 3 algorithms that used only a single hyperparameter option.

(PDF)

S19 Fig. Relative performance of different hyperparameter combinations for the weka/LIBLINEAR classification algorithm.

The ShinyLearner software supports 16 hyperparameter combinations for the weka/LIBLINEAR classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Some combinations consistently outperformed other combinations, and the default combination performed suboptimally. Using relatively small cost values appeared to improve the performance more than any other option. This hyperparameter controls the regularization strength.

(PDF)

S20 Fig. Relative performance of different hyperparameter combinations for the mlr/glmnet classification algorithm.

The ShinyLearner software supports 3 hyperparameter combinations for the mlr/glmnet classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Using an alpha value of 0.5 or 0 resulted in better performance than a value of 1.

(PDF)

S21 Fig. Relative performance of different hyperparameter combinations for the sklearn/logistic_regression classification algorithm.

The ShinyLearner software supports 32 hyperparameter combinations for the sklearn/logistic_regression classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Some combinations consistently outperformed other combinations, and the default combination performed suboptimally. Using relatively small cost values appeared to improve the performance more than any other option. This hyperparameter controls the regularization strength.

(PDF)

S22 Fig. Relative performance of different hyperparameter combinations for the sklearn/extra_trees classification algorithm.

The ShinyLearner software supports 24 hyperparameter combinations for the sklearn/extra_trees classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Some combinations consistently outperformed other combinations, and the default combination performed suboptimally. Using a larger number (n = 1000) of estimators (trees) appeared to improve the performance more than any other option.

(PDF)

S23 Fig. Relative predictive performance when using hyperparameter optimization vs. feature selection.

We used as a baseline the predictive performance that we attained using default hyperparameters for the classification algorithms (Analysis 3). We quantified predictive performance using the area under the receiver operating characteristic curve (AUROC). This graph shows the increase or decrease in performance when selecting hyperparameters or selecting features relative to the baseline. Each point represents a particular combination of dataset and class variable. Generally, the dataset/class combinations that benefitted from hyperparameter optimization also benefitted from feature selection. However, some dataset/class combinations that did not benefit from hyperparameter optimization did benefit from feature selection.

(PDF)

S24 Fig. Dataset performance by class category when using gene-expression and clinical predictors and performing feature selection.

For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 5, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as a metric. The dashed, red line indicates the performance expected by random chance. The results are similar to those of Analyses 3 and 4 (S12 and S15 Figs).

(PDF)

S25 Fig. Predictive performance according to the number of features selected via nested cross-validation.

Relative area under the receiver operating characteristic curve (AUROC) values were calculated by comparing against the mean for each combination of classification algorithm and feature-selection algorithm.

(PDF)

S26 Fig. Relative performance of feature-selection algorithms.

For Analysis 5, we used nested cross validation to estimate which features would be most informative for each algorithm in each training set. For each combination of dataset, class variable, and classification algorithm, we ranked the performance of the feature-selection algorithms based on area under the receiver operating characteristic curve (AUROC) and averaged the rankings across 5 iterations of Monte Carlo cross-validation. Each data point that overlays the box plots represents a particular dataset/class combination. Relatively low average ranks are considered optimal. The weka/Correlation feature-selection algorithm performed best overall.

(PDF)

S27 Fig. Execution time per feature-selection algorithm.

In Analysis 5, we used nested cross validation to estimate which features were most informative for each training set. We calculated the time (in seconds) required by each feature-selection algorithm to rank the features. Then we averaged these times across all combinations of dataset, class variable, classification algorithm, and (outer) Monte Carlo cross-validation iteration. Some feature-selection algorithms were much more computationally intensive than others.

(PDF)

S28 Fig. Pairwise correlations of feature ranks between feature-selection algorithms for dataset GSE10320.

We used each feature-selection algorithm to rank the genes based on their informativeness for discriminating between relapse and non-relapse outcomes in Wilms tumor patients (GSE10320). After averaging the ranks across cross-validation iterations, we calculated the Spearman correlation coefficient for the feature ranks produced by each pair of algorithms. These coefficients are illustrated as a correlation plot.

(PDF)

S29 Fig. Pairwise correlations of feature ranks between feature-selection algorithms for dataset GSE46691.

We used each feature-selection algorithm to rank the genes based on their informativeness for predicting early metastasis following radical prostatectomy (GSE46691). After averaging the ranks across cross-validation iterations, we calculated the Spearman correlation coefficient for the feature ranks produced by each pair of algorithms. These coefficients are illustrated as a correlation plot.

(PDF)

S30 Fig. Absolute classification performance per combination of feature-selection and classification algorithm.

For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all Monte Carlo cross-validation iterations. Then for each combination of feature-selection algorithm and classification algorithm, we calculated the median AUROC across all datasets and class variables.

(PDF)

S31 Fig. Relative performance of classification algorithms using gene-expression and clinical predictors and performing feature selection with hyperparameter optimization.

We predicted patient states using gene-expression and clinical predictors, performing feature selection and optimizing the feature-selection algorithms’ hyperparameters (a follow-up analysis to Analysis 5). We used nested cross validation to estimate which features and hyperparameter combinations would be optimal for each algorithm in each training set.

(PDF)

S32 Fig. Stability of classification performance for increasing numbers of cross-validation iterations on dataset GSE10320.

When using gene-expression predictors (Analysis 1), we estimated the number of Monte Carlo cross-validation iterations that would be sufficient to characterize algorithm performance. For three classification algorithms, we executed 100 cross-validation iterations on dataset GSE10320 (predicting relapse vs. non-relapse for Wilms tumor patients). As the number of iterations increased, we calculated the cumulative average of the area under the receiver operating characteristic curve (AUROC) for each algorithm. After performing at most 40 iterations, the cumulative averages did not change more than 0.01 over sequences of 10 iterations.

(PDF)

S33 Fig. Stability of classification performance for increasing numbers of cross-validation iterations on dataset GSE46691.

When using gene-expression predictors (Analysis 1), we estimated the number of Monte Carlo cross-validation iterations that would be sufficient to characterize algorithm performance. For three classification algorithms, we executed 100 cross-validation iterations on dataset GSE46691 (predicting early metastasis following radical prostatectomy). As the number of iterations increased, we calculated the cumulative average of the area under the receiver operating characteristic curve (AUROC) for each algorithm. After performing at most 22 iterations, the cumulative averages did not change more than 0.01 over sequences of 10 iterations.

(PDF)

S1 Data. Summary of predictive performance per dataset when using gene-expression predictors.

We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination.

(XLSX)

S2 Data. Summary of predictive performance per dataset when using clinical predictors.

We predicted patient states using clinical predictors only (Analysis 2). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination. For some dataset/class combinations, no clinical predictors were available; these combinations are excluded from this file.

(XLSX)

S3 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors.

We predicted patient states using gene-expression and clinical predictors (Analysis 3). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination. For some dataset/class combinations, no clinical predictors were available; these combinations are excluded from this file.

(XLSX)

S4 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors and performing hyperparameter optimization.

We predicted patient states using gene-expression and clinical predictors (Analysis 4). For classification algorithms that included multiple hyperparameter combinations (n = 47), we performed hyperparameter optimization using the respective training sets. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 (outer) iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination.

(XLSX)

S5 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors and performing feature selection.

We predicted patient states using gene-expression and clinical predictors (Analysis 5). Using each respective training set, we performed feature selection for each of 14 feature-selection algorithms and performed classification using n top-ranked features. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 (outer) iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination.

(XLSX)

S6 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors and performing feature selection with hyperparameter optimization.

(XLSX)

S7 Data. Top 50 genes according to average rank across feature-selection algorithms for GSE10320 and GSE46691.

(XLSX)

S8 Data. Gene-set overlap results for top 50 genes according to average rank across feature-selection algorithms for GSE10320.

(XLSX)

S9 Data. Gene-set overlap results for top 50 genes according to average rank across feature-selection algorithms for GSE46691.

(XLSX)

S10 Data. Summary of datasets used.

This file contains a unique identifier for each dataset, indicates whether gene-expression microarrays or RNA-Sequencing were used to generate the data, and indicates the name of the class variable from the original dataset. In addition, we assigned standardized names and categories as a way to support consistency across datasets. The file lists any clinical predictors that were used in the analyses as well as the number of samples and genes per dataset.

(XLSX)

S11 Data. Classification algorithm hyperparameter combinations.

This file indicates all hyperparameter combinations that we evaluated via nested cross-validation in Analysis 4.

(XLSX)

S12 Data. Feature-selection algorithm hyperparameter combinations.

This file indicates all hyperparameter combinations that we evaluated via nested cross-validation in the follow-up analysis to Analysis 5.

(XLSX)

Acknowledgments

Results from this study are in part based upon data generated by TCGA and managed by the United States National Cancer Institute and National Human Genome Research Institute (see http://cancergenome.nih.gov). We thank the patients who participated in this study and shared their data publicly. We thank the Fulton Supercomputing Laboratory at Brigham Young University for providing computational facilities. This work was supported in part with a cloud credits allocation provided by the ISB-CGC Cloud Resource, part of the NCI Cancer Research Data Commons.

Data Availability

Source code for each algorithm used in this study can be found in the repositories of the respective software libraries:
  • https://github.com/scikit-learn/scikit-learn
  • https://github.com/mlr-org/mlr
  • https://github.com/Waikato/weka-3.8
  • https://github.com/keras-team/keras
Code used to integrate these libraries within software containers, to perform cross validation, to calculate performance metrics, etc. is part of the ShinyLearner tool; its source code can be found at https://github.com/srp33/ShinyLearner. Data and code used to execute this analysis are available at https://osf.io/fv8td/. This repository contains raw and summarized versions of the analysis results, as well as the code that we used to generate the figures and tables for this manuscript. The repository is freely available under the Creative Commons Universal 1.0 license. All other data are within the manuscript and its Supporting Information files.

Funding Statement

NPG was funded by a student fellowship from the Simmons Center for Cancer Research at Brigham Young University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.National Research Council (US) Committee on A Framework for Developing a New Taxonomy of Disease. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington (DC): National Academies Press (US); 2011. (The National Academies Collection: Reports funded by National Institutes of Health). [PubMed] [Google Scholar]
  • 2.Collins FS, Varmus H. A New Initiative on Precision Medicine. N Engl J Med. 2015. Feb;372(9):793–5. doi: 10.1056/NEJMp1500523 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Krumholz HM. Big Data And New Knowledge In Medicine: The Thinking, Training, And Tools Needed For A Learning Health System. Health Aff (Millwood). 2014. Jul;33(7):1163–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Obermeyer Z, Emanuel EJ. Predicting the Future—Big Data, Machine Learning, and Clinical Medicine. N Engl J Med. 2016. Sep;375(13):1216–9. doi: 10.1056/NEJMp1606181 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Butte A. The use and analysis of microarray data. Nat Rev Drug Discov. 2002. Dec;1(12):951–60. doi: 10.1038/nrd961 [DOI] [PubMed] [Google Scholar]
  • 6.Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008. Jul;5(7):621–8. doi: 10.1038/nmeth.1226 [DOI] [PubMed] [Google Scholar]
  • 7.Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes. JCO. 2009. Feb;27(8):1160–7. doi: 10.1200/JCO.2008.18.1370 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Wallden B, Storhoff J, Nielsen T, Dowidar N, Schaper C, Ferree S, et al. Development and verification of the PAM50-based Prosigna breast cancer gene signature assay. BMC Med Genomics. 2015. Aug;8. doi: 10.1186/s12920-015-0129-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Gnant M, Filipits M, Greil R, Stoeger H, Rudas M, Bago-Horvath Z, et al. Predicting distant recurrence in receptor-positive breast cancer patients with limited clinicopathological risk: Using the PAM50 Risk of Recurrence score in 1478 postmenopausal patients of the ABCSG-8 trial treated with adjuvant endocrine therapy alone. Ann Oncol. 2014. Feb;25(2):339–45. doi: 10.1093/annonc/mdt494 [DOI] [PubMed] [Google Scholar]
  • 10.Dowsett M, Sestak I, Lopez-knowles E, Sidhu K, Dunbier A, Cowens J, et al. Comparison of PAM50 Risk of Recurrence Score With Oncotype DX and IHC4 for Predicting Risk of Distant Recurrence After Endocrine Therapy. Journal of clinical oncology: official journal of the American Society of Clinical Oncology. 2013. Jul;31. doi: 10.1200/JCO.2012.46.1558 [DOI] [PubMed] [Google Scholar]
  • 11.Nielsen T, Wallden B, Schaper C, Ferree S, Liu S, Gao D, et al. Analytical validation of the PAM50-based Prosigna Breast Cancer Prognostic Gene Signature Assay and nCounter Analysis System using formalin-fixed paraffin-embedded breast tumor specimens. BMC Cancer. 2014. Mar;14(1):177. doi: 10.1186/1471-2407-14-177 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Tofigh A, Suderman M, Paquet ER, Livingstone J, Bertos N, Saleh SM, et al. The Prognostic Ease and Difficulty of Invasive Breast Carcinoma. Cell Reports. 2014. Oct;9(1):129–42. doi: 10.1016/j.celrep.2014.08.073 [DOI] [PubMed] [Google Scholar]
  • 13.Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Ser B Methodol. 1974;36(2):111–33. [Google Scholar]
  • 14.Dudoit S, Fridlyand J. Classification in microarray experiments. In: Speed T, editor. Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall/CRC; 2003. [Google Scholar]
  • 15.Fielden MR, Zacharewski TR. Challenges and Limitations of Gene Expression Profiling in Mechanistic and Predictive Toxicology. Toxicol Sci. 2001. Mar;60(1):6–10. doi: 10.1093/toxsci/60.1.6 [DOI] [PubMed] [Google Scholar]
  • 16.Eling N, Morgan MD, Marioni JC. Challenges in measuring and understanding biological noise. Nat Rev Genet. 2019. Sep;20(9):536–48. doi: 10.1038/s41576-019-0130-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J. LIBLINEAR—a library for large linear classification. J Mach Learn Res. 2008;9:1871–4. [Google Scholar]
  • 18.Whalen S, Schreiber J, Noble WS, Pollard KS. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet. 2021. Nov; doi: 10.1038/s41576-021-00434-9 [DOI] [PubMed] [Google Scholar]
  • 19.Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd international conference on Machine learning. New York, NY, USA: Association for Computing Machinery; 2006. p. 233–40. (ICML ‘06).
  • 20.Chang C-C, Lin C-J. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol TIST. 2011;2(3):1–27. [Google Scholar]
  • 21.Salas S, Brulard C, Terrier P, Ranchere-Vince D, Neuville A, Guillou L, et al. Gene Expression Profiling of Desmoid Tumors by cDNA Microarrays and Correlation with Progression-Free Survival. Clin Cancer Res. 2015. Sep;21(18):4194–200. doi: 10.1158/1078-0432.CCR-14-2910 [DOI] [PubMed] [Google Scholar]
  • 22.Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, et al. Mlr: Machine learning in r. J Mach Learn Res. 2016;17(1):5938–42. [Google Scholar]
  • 23.Strobl C, Boulesteix A-L, Zeileis A, Hothorn T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics. 2007. Jan;8(1):25. doi: 10.1186/1471-2105-8-25 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008. Jul;9(1):307. doi: 10.1186/1471-2105-9-307 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952;47(260):583–621. [Google Scholar]
  • 26.Ishwaran H, Kogalur UB, Kogalur MUB. Package ‘randomForestSRC.’ 2020; [Google Scholar]
  • 27.Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS, others. Random survival forests. Ann Appl Stat. 2008;2(3):841–60. [Google Scholar]
  • 28.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30. [Google Scholar]
  • 29.Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423. [Google Scholar]
  • 30.Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. [Google Scholar]
  • 31.Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46(1–3):389–422. [Google Scholar]
  • 32.Hall M, National H, Frank E, Holmes G, Pfahringer B, Reutemann P, et al. The WEKA data mining software. ACM SIGKDD Explor Newsl. 2009. Nov;11(1):10. [Google Scholar]
  • 33.Pearson K. Note on regression and inheritance in the case of two parents. In: Proceedings of the Royal Society of London. Taylor & Francis; 1895. p. 240–2. [Google Scholar]
  • 34.Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106. [Google Scholar]
  • 35.Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11:63–91. [Google Scholar]
  • 36.Kononenko I. Estimating attributes: Analysis and extensions of RELIEF. In: Bergadano F, Raedt LD, editors. European conference on machine learning. Springer; 1994. p. 171–82. [Google Scholar]
  • 37.Witten IH, Frank E. Data mining: Practical machine learning tools and techniques with Java implementations. Acm Sigmod Rec. 2002;31(1):76–7. [Google Scholar]
  • 38.Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015;1(6):417–25. doi: 10.1016/j.cels.2015.12.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Fernández-Delgado M, Cernadas E, Barro S, Amorim D. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? J Mach Learn Res. 2014;15:3133–81. [Google Scholar]
  • 40.Bay SD, Kibler D, Pazzani MJ, Smyth P. The UCI KDD archive of large data sets for data mining research and experimentation. ACM SIGKDD Explor Newsl. 2000;2(2):81–5. [Google Scholar]
  • 41.Domingos P. A few useful things to know about machine learning. Commun ACM. 2012;55(10):78–87. [Google Scholar]
  • 42.Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC bioinformatics. 2008;9(1):319. doi: 10.1186/1471-2105-9-319 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999. Oct;286(5439):531–7. doi: 10.1126/science.286.5439.531 [DOI] [PubMed] [Google Scholar]
  • 44.Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000. Feb;403(6769):503–11. doi: 10.1038/35000501 [DOI] [PubMed] [Google Scholar]
  • 45.Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001. Sep;98(19):10869–74. doi: 10.1073/pnas.191367098 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.van ‘t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002. Jan;415(6871):530–6. doi: 10.1038/415530a [DOI] [PubMed] [Google Scholar]
  • 47.Ramaswamy S, Ross KN, Lander ES, Golub TR. A molecular signature of metastasis in primary solid tumors. Nat Genet. 2003. Jan;33(1):49–54. doi: 10.1038/ng1060 [DOI] [PubMed] [Google Scholar]
  • 48.Cho S-B, Won H-H. Machine learning in DNA microarray analysis for cancer classification. In: Proceedings of the First Asia-Pacific bioinformatics conference on Bioinformatics 2003-Volume 19. 2003. p. 189–98.
  • 49.Pochet N, De Smet F, Suykens JA, De Moor BL. Systematic benchmarking of microarray data classification: Assessing the role of non-linearity and dimensionality reduction. Bioinformatics. 2004;20(17):3185–95. doi: 10.1093/bioinformatics/bth383 [DOI] [PubMed] [Google Scholar]
  • 50.Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal. 2005;48(4):869–85. [Google Scholar]
  • 51.Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–85. [Google Scholar]
  • 52.Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936;7(2):179–88. [Google Scholar]
  • 53.Rosenblatt F. Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Cornell Aeronautical Lab Inc Buffalo; NY; 1961. [Google Scholar]
  • 54.Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97. [Google Scholar]
  • 55.Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005. Mar;21(5):631–43. doi: 10.1093/bioinformatics/bti033 [DOI] [PubMed] [Google Scholar]
  • 56.Pirooznia M, Yang JY, Yang MQ, Deng Y. A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics. 2008;9 Suppl 1:S13. doi: 10.1186/1471-2164-9-S1-S13 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006. Jan;7:3. doi: 10.1186/1471-2105-7-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Koohy H. The rise and fall of machine learning methods in biomedical research. F1000Research. 2018. Jan;6:2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Jarchum I, Jones S. DREAMing of benchmarks. Nat Biotechnol. 2015. Jan;33(1):49–50. doi: 10.1038/nbt.3115 [DOI] [PubMed] [Google Scholar]
  • 60.Saez-Rodriguez J, Costello JC, Friend SH, Kellen MR, Mangravite L, Meyer P, et al. Crowdsourcing biomedical research: Leveraging communities as innovation engines. Nat Rev Genet. 2016;17(8):470. doi: 10.1038/nrg.2016.69 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Sumsion GR, Bradshaw MS, Beales JT, Ford E, Caryotakis GRG, Garrett DJ, et al. Diverse approaches to predicting drug-induced liver injury using gene-expression profiles. Biol Direct. 2020. Jan;15(1):1. doi: 10.1186/s13062-019-0257-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Ho YC, Pepyne DL. Simple Explanation of the No-Free-Lunch Theorem and Its Implications. Journal of Optimization Theory and Applications. 2002. Dec;115(3):549–70. [Google Scholar]
  • 63.Li L, Darden TA, Weingberg CR, Levine AJ, Pedersen LG. Gene Assessment and Sample Classification for Gene Expression Data Using a Genetic Algorithm / k-nearest Neighbor Method. Combinatorial Chemistry & High Throughput Screening. 2001. Dec;4(8):727–39. doi: 10.2174/1386207013330733 [DOI] [PubMed] [Google Scholar]
  • 64.Dettling M. BagBoosting for tumor classification with gene expression data. Bioinformatics. 2004. Dec;20(18):3583–93. doi: 10.1093/bioinformatics/bth447 [DOI] [PubMed] [Google Scholar]
  • 65.Au W-H, Chan KCC, Wong AKC, Wang Y. Attribute clustering for grouping, selection, and classification of gene expression data. IEEE/ACM Trans Comput Biol Bioinform. 2005. Apr;2(2):83–101. doi: 10.1109/TCBB.2005.17 [DOI] [PubMed] [Google Scholar]
  • 66.He H, Shen X. A ranked subspace learning method for gene expression data classification. In: IC-AI. 2007. p. 358–64. [Google Scholar]
  • 67.Chandra B, Gupta M. An efficient statistical feature selection approach for classification of gene expression data. Journal of Biomedical Informatics. 2011. Aug;44(4):529–35. doi: 10.1016/j.jbi.2011.01.001 [DOI] [PubMed] [Google Scholar]
  • 68.Alonso-González CJ, Moro-Sancho QI, Simon-Hurtado A, Varela-Arrabal R. Microarray gene expression classification with few genes: Criteria to combine attribute selection and classification methods. Expert Systems with Applications. 2012. Jun;39(8):7270–80. [Google Scholar]
  • 69.Buza K. Classification of gene expression data: A hubness-aware semi-supervised approach. Computer Methods and Programs in Biomedicine. 2016. Apr;127:105–13. doi: 10.1016/j.cmpb.2016.01.016 [DOI] [PubMed] [Google Scholar]
  • 70.Liu S, Xu C, Zhang Y, Liu J, Yu B, Liu X, et al. Feature selection of gene expression data for Cancer classification using double RBF-kernels. BMC Bioinformatics. 2018. Oct;19(1):396. doi: 10.1186/s12859-018-2400-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Lu H, Chen J, Yan K, Jin Q, Xue Y, Gao Z. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing. 2017. Sep;256:56–62. [Google Scholar]
  • 72.Masud Rana Md, Ahmed K. Feature Selection and Biomedical Signal Classification Using Minimum Redundancy Maximum Relevance and Artificial Neural Network. In: Uddin MS, Bansal JC, editors. Proceedings of International Joint Conference on Computational Intelligence. Singapore: Springer; 2020. p. 207–14. (Algorithms for Intelligent Systems).
  • 73.Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface. 2018. Apr;15(141):20170387. doi: 10.1098/rsif.2017.0387 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Ho TK, Hull JJ, Srihari SN. Decision combination in multiple classifier systems. IEEE Trans Pattern Anal Mach Intell. 1994;16(1):66–75. [Google Scholar]
  • 75.López-García G, Jerez JM, Franco L, Veredas FJ. Transfer learning with convolutional neural networks for cancer survival prediction using gene-expression data. PLOS ONE. 2020. Mar;15(3):e0230536. doi: 10.1371/journal.pone.0230536 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Golightly NP, Bell A, Bischoff AI, Hollingsworth PD, Piccolo SR. Curated compendium of human transcriptional biomarker data. Sci Data. 2018. Apr;5:180066. doi: 10.1038/sdata.2018.66 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, et al. NCBI GEO: Archive for functional genomics data sets—10 years on. Nucleic Acids Res. 2011. Jan;39(Database issue):D1005–10. doi: 10.1093/nar/gkq1184 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Piccolo SR, Sun Y, Campbell JD, Lenburg ME, Bild AH, Johnson WE. A single-sample microarray normalization method to facilitate personalized-medicine workflows. Genomics. 2012;100(6):337–44. doi: 10.1016/j.ygeno.2012.08.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005. Jan;33(20):e175. doi: 10.1093/nar/gni179 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Rosikiewicz M, Robinson-Rechavi M. IQRray, a new method for Affymetrix microarray quality control, and the homologous organ conservation score, a new benchmark method for quality control metrics. Bioinformatics. 2014;30(10):1392–9. doi: 10.1093/bioinformatics/btu027 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Waldron L, Riester M, Ramos M, Parmigiani G, Birrer M. The Doppelgänger effect: Hidden duplicates in databases of transcriptome profiles. JNCI J Natl Cancer Inst. 2016;108(11). doi: 10.1093/jnci/djw146 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27. doi: 10.1093/biostatistics/kxj037 [DOI] [PubMed] [Google Scholar]
  • 83.The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008. Oct;455(7216):1061–8. doi: 10.1038/nature07385 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Liao Y, Smyth GK, Shi W. The Subread aligner: Fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013. May;41(10):e108. doi: 10.1093/nar/gkt214 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Liao Y, Smyth GK, Shi W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30. doi: 10.1093/bioinformatics/btt656 [DOI] [PubMed] [Google Scholar]
  • 86.Rahman M, Jackson LK, Johnson WE, Li DY, Bild AH, Piccolo SR. Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics. 2015. Nov;31(22):3666–72. doi: 10.1093/bioinformatics/btv377 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Yates AD, Achuthan P, Akanni W, Allen J, Allen J, Alvarez-Jarreta J, et al. Ensembl 2020. Nucleic Acids Research. 2020. Jan;48(D1):D682–8. doi: 10.1093/nar/gkz966 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Piccolo SR, Frampton MB. Tools and techniques for computational reproducibility. Gigascience. 2016. Jul;5(1):30. doi: 10.1186/s13742-016-0135-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Piccolo SR, Lee TJ, Suh E, Hill K. ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data. Gigascience. 2020. Apr;9(4). doi: 10.1093/gigascience/giaa026 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Sculley D, Snoek J, Wiltschko A, Rahimi A. Winner’s Curse? On Pace, Progress, and Empirical Rigor. 2018. Feb; [Google Scholar]
  • 91.Gulli A, Pal S. Deep learning with keras. Packt Publishing Ltd; 2017. [Google Scholar]
  • 92.Bengio Y. Learning deep architectures for AI. Now Publishers Inc; 2009. [Google Scholar]
  • 93.Kuhn M, Quinlan R. C50: C5.0 decision trees and rule-based models. 2020. [Google Scholar]
  • 94.Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: A conditional inference framework. J Comput Graph Stat. 2006;15(3):651–74. [Google Scholar]
  • 95.Milborrow S. Earth: Multivariate adaptive regression splines; derived from mda:mars by T. Hastie and R. Tibshirani, uses Alan Miller’s Fortran utilities with Thomas Lumley’s leaps wrapper. 2020. [Google Scholar]
  • 96.Karatzoglou A, Smola A, Hornik K, Zeileis A. Kernlab an S4 package for kernel methods in R. J Stat Softw. 2004;11(9):1–20. [Google Scholar]
  • 97.Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1. [PMC free article] [PubMed] [Google Scholar]
  • 98.LeDell E, Gill N, Aiello S, Fu A, Candel A, Click C, et al. H2o: R interface for the ‘H2O’ scalable machine learning platform. 2020. [Google Scholar]
  • 99.Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobotics. 2013;7:21. doi: 10.3389/fnbot.2013.00021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Schliep K, Hechenbichler K. Kknn: Weighted k-Nearest neighbors. 2016. [Google Scholar]
  • 101.Bergmeir C, Benítez JM. Neural networks in R using the Stuttgart neural network simulator: RSNNS. J Stat Softw. 2012;46(7):1–26. [Google Scholar]
  • 102.Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. E1071: Misc functions of the department of statistics, probability theory group (formerly: E1071), TU wien. 2019. [Google Scholar]
  • 103.Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22. [Google Scholar]
  • 104.Wright MN, Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw. 2017;77(1):1–7. [Google Scholar]
  • 105.Therneau T, Atkinson B. Rpart: Recursive partitioning and regression trees. 2019. [Google Scholar]
  • 106.Therneau TM, Atkinson EJ, others. An introduction to recursive partitioning using the RPART routines. Technical report Mayo Foundation; 1997. [Google Scholar]
  • 107.Deng H, Runger G. Gene selection with guided regularized random forest. Pattern Recognit. 2013;46(12):3483–9. [Google Scholar]
  • 108.Ahdesmaki M, Zuber V, Gibb S, Strimmer K. Sda: Shrinkage discriminant analysis and CAT score variable selection. 2015. [Google Scholar]
  • 109.Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, et al. Xgboost: Extreme gradient boosting. 2020. [Google Scholar]
  • 110.Freund Y, Schapire R, Abe N. A short introduction to boosting. J-Jpn Soc Artif Intell. 1999;14(771–780):1612. [Google Scholar]
  • 111.Berkson J. Application of the logistic function to bio-assay. J Am Stat Assoc. 1944;39(227):357–65. [Google Scholar]
  • 112.Saad D. Online algorithms and stochastic approximations. Online Learn. 1998;5:6–3. [Google Scholar]
  • 113.Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40. [Google Scholar]
  • 114.Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Mach Learn. 1997;29(2–3):131–63. [Google Scholar]
  • 115.Kohavi R. The power of decision tables. In: 8th european conference on machine learning. Springer; 1995. p. 174–89.
  • 116.Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: ACM SIGKDD intl Conf On knowledge discovery and data mining. ACM Press; 2001. p. 97–106.
  • 117.Quinlan R. C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993. [Google Scholar]
  • 118.Cohen WW. Fast effective rule induction. In: Twelfth international conference on machine learning. Morgan Kaufmann; 1995. p. 115–23.
  • 119.John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Eleventh conference on uncertainty in artificial intelligence. San Mateo: Morgan Kaufmann; 1995. p. 338–45.
  • 120.Landwehr N, Hall M, Frank E. Logistic model trees. Machine learning. 2005;95(1–2):161–205. [Google Scholar]
  • 121.Sumner M, Frank E, Hall M. Speeding up logistic model tree induction. In: 9th european conference on principles and practice of knowledge discovery in databases. Springer; 2005. p. 675–83.
  • 122.Platt J. Fast training of support vector machines using sequential minimal optimization. In: Schoelkopf B, Burges C, Smola A, editors. Advances in kernel methods—support vector learning. MIT Press; 1998. [Google Scholar]
  • 123.Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to platt’s SMO algorithm for SVM classifier design. Neural Comput. 2001;13(3):637–49. [Google Scholar]
  • 124.Hastie T, Tibshirani R. Classification by pairwise coupling. In: Jordan MI, Kearns MJ, Solla SA, editors. Advances in neural information processing systems. MIT Press; 1998. [Google Scholar]
  • 125.Demiroz G, Guvenir A. Classification by voting feature intervals. In: 9th european conference on machine learning. Springer; 1997. p. 85–92.
  • 126.Van Rossum G, others. Python Programming Language. In: USENIX Annual Technical Conference. 2007. p. 36.
  • 127.R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. [Google Scholar]
  • 128.Wei T, Simko V. R package "corrplot": Visualization of a correlation matrix. 2017. [Google Scholar]
  • 129.Wilke CO. Cowplot: Streamlined Plot Theme and Plot Annotations for ‘Ggplot2’. 2017. [Google Scholar]
  • 130.Slowikowski K. Ggrepel: Automatically Position Non-Overlapping Text Labels with ‘Ggplot2’. 2018. [Google Scholar]
  • 131.Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the tidyverse. J Open Source Softw. 2019;4(43):1686. [Google Scholar]
  • 132.Reynolds SM, Miller M, Lee P, Leinonen K, Paquette SM, Rodebaugh Z, et al. The ISB Cancer Genomics Cloud: A Flexible Cloud-Based Platform for Cancer Genomics Research. Cancer Res. 2017. Nov;77(21):e7–10. doi: 10.1158/0008-5472.CAN-17-0617 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 133.Tange O. GNU Parallel—The Command-Line Power Tool. Login USENIX Mag. 2011. Feb;36(1):42–7. [Google Scholar]
  • 134.Green DM, Swets JA, others. Signal detection theory and psychophysics. Vol. 1. Wiley; New York; 1966. [Google Scholar]
  • 135.Brier GW. Verification of forecasts expressed in terms of probability. Mon Wea Rev. 1950. Jan;78(1):1–3. [Google Scholar]
  • 136.Vickery BC. Techniques of Information Retrieval. London: Butterworths; 1970. [Google Scholar]
  • 137.Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta BBA-Protein Struct. 1975;405(2):442–51. doi: 10.1016/0005-2795(75)90109-9 [DOI] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009926.r001

Decision Letter 0

Edwin Wang, Xing Chen

23 Jul 2021

Dear Dr. Piccolo,

Thank you very much for submitting your manuscript "Benchmarking 50 classification algorithms on 50 gene-expression datasets" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Xing Chen, Ph.D.

Guest Editor

PLOS Computational Biology

Edwin Wang

Benchmarking Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Line 344: Given the trend you have observed on the datasets you analyzed using clinical and/or gene-expression data, please provide examples, if possible, of other algorithms that have not been studied but could potentially be promising, and explain why.

Line 356: Trying out convolutional neural networks, with optimization of the number of layers and a hyperparameter search, would be useful.

Line 370: The remark about class imbalance being handled well by sklearn is interesting and valid.

Line 377: It is interesting that data on co-occurring tumors did not have significance in feature selection.

Reviewer #2: Piccolo et al. benchmark 50 common classification algorithms from multiple publicly available packages to evaluate algorithm performance in a robust comparative framework. This is a very nice study. While many of their results are not particularly surprising (e.g., parameter optimization improves performance), I believe this study will be an important resource for a broad research community and one that is appropriate for publication in PLOS Computational Biology. I have a few suggestions that I feel would improve the utility and breadth of audience for this work:

1) The AUROC analysis of this manuscript is great and will be a beneficial set of benchmarks for many studies. As the authors acknowledge, however, many studies have unbalanced classes that may see poor results compared to those expected from considering only AUROC scores. To address this (and make their results more broadly applicable), it would be nice to see precision-recall curves in addition to their current analyses. The authors should have the data already to generate these plots, so I believe this should be a relatively easy addition.

2) While I appreciate the authors’ focus on classification algorithms for classifying biomedical datasets, I believe there could be more attention given to other uses for classification algorithms. A discussion of classification algorithms as a discovery tool—such as using feature selection to identify potentially novel disease or phenotype-associated genes—would increase the breadth of their audience. Since the quality of feature selection is always dependent on the quality of the classification algorithm, but feature extraction is not equally accessible for all algorithms, this could lead to a very interesting additional contrast for the algorithms studied. The authors do touch on feature selection a bit, but mostly in reference to classification. It could be useful to have a brief discussion of how these algorithms perform for feature selection in a discovery context.

3) I like that the authors contrast algorithm performance with running time. That said, I’m less certain that execution time should be valued as strongly as performance. Unless the difference is a matter of days or weeks, and outside of (possibly) a few real-world clinical scenarios, I suspect the vast majority of studies would choose higher-quality predictions over even substantially longer runtimes.

Minor comment: Figure 3 claims the y-axis is log10 transformed, but this does not seem to match the values along the axis.

Reviewer #3: 1. Despite the great efforts of the authors, the research is poorly organized; it is difficult for a naive researcher to benefit from it.

2. Is SVM-RFE multivariate? Kindly check the types of the feature-selection methods.

Reviewer #4: In this paper, the authors performed a benchmark comparison, applying 50 classification algorithms to 50 gene-expression datasets (143 class variables). The findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies. The paper may be useful for researchers and students who are interested in this area, especially in identifying characteristic genes of tumors. However, a minor revision is required as indicated below:

1. The selection of tumor characteristic genes is an NP-hard problem. Generally, feature-selection algorithms can be divided into three categories: filter, wrapper, and embedded. The wrapper approach has the advantages of large search-space coverage, flexible classification accuracy, and computational efficiency. Wrapper methods use metaheuristic algorithms to obtain an optimal feature subset and use a machine-learning classification algorithm as the evaluation criterion, which has achieved good results in feature selection for high-dimensional medical and health data. The authors should add an analysis of tumor gene feature selection using metaheuristics.

2. It is suggested that the authors simplify the introduction and provide a more detailed analysis in the discussion.

3. Traditional machine-learning methods require hyperparameters to be tuned during feature selection, and it is difficult to determine the best combination of parameters analytically; the setting of optimal parameters is itself an optimization problem. Therefore, the parameter settings of the algorithms are worth exploring, and the authors should provide a detailed discussion.

4. It is suggested that the authors provide the source code for all 50 algorithms for better understanding and application of these methods.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: None

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: muhammed abd-elnaby sadek

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: reviewer_ploscompio_june2021.docx

Attachment

Submitted filename: New Rich Text Document.rtf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009926.r003

Decision Letter 1

Edwin Wang, Xing Chen

15 Feb 2022

Dear Dr. Piccolo,

We are pleased to inform you that your manuscript 'The ability to classify patients based on gene-expression data varies considerably across algorithms and performance metrics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Xing Chen, Ph.D.

Guest Editor

PLOS Computational Biology

Edwin Wang

Benchmarking Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The revision made by the authors of the study, with respect to the analysis and additional results, has been satisfactory and serves the scope of the study well. It is noteworthy that they added two neural-network-based classification algorithms to their analysis; indeed, the approach of using CNNs to classify gene-expression data is emerging. Overall, I think the research in the manuscript is well organized, carefully done, and useful for a broader audience than before. I also appreciate the discussion of gene discovery in addition to benchmarking. I recognize the useful work done towards the revision in various sections of the manuscript.

Reviewer #2: The authors have done an excellent job of addressing my concerns and those of the other reviewers in my opinion.

Reviewer #4: The paper has been revised according to the revision suggestions, and we have no comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #4: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009926.r004

Acceptance letter

Edwin Wang, Xing Chen

7 Mar 2022

PCOMPBIOL-D-21-00860R1

The ability to classify patients based on gene-expression data varies by algorithm and performance metric

Dear Dr Piccolo,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Relative performance of classification algorithms using gene-expression predictors and area under the receiver operating characteristic curve as the metric.

    We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination. The top 15 performers (relatively low ranks) were algorithms that use linear decision boundaries, kernel functions, and/or ensembles of decision trees.

    (PDF)
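    To make the ranking procedure above concrete, the following minimal sketch (not the study's actual ShinyLearner code) shows one way to average AUROC across cross-validation iterations and then rank algorithms; the table layout and column names (dataset, class_variable, algorithm, iteration, auroc) are hypothetical.

```python
# Minimal sketch of the ranking procedure described for S1 Fig (not the study's actual code).
# Assumes a long-format results table with hypothetical column names.
import pandas as pd

results = pd.DataFrame({
    "dataset": ["GSE1", "GSE1", "GSE1", "GSE1"],
    "class_variable": ["relapse"] * 4,
    "algorithm": ["svm", "svm", "random_forest", "random_forest"],
    "iteration": [1, 2, 1, 2],
    "auroc": [0.81, 0.79, 0.74, 0.76],
})

# 1. Arithmetic mean of AUROC across Monte Carlo cross-validation iterations.
mean_auroc = (results
              .groupby(["dataset", "class_variable", "algorithm"], as_index=False)["auroc"]
              .mean())

# 2. Rank algorithms within each dataset/class combination (rank 1 = highest mean AUROC).
mean_auroc["rank"] = (mean_auroc
                      .groupby(["dataset", "class_variable"])["auroc"]
                      .rank(ascending=False))

# 3. Average rank per algorithm across all dataset/class combinations (lower is better).
overall = mean_auroc.groupby("algorithm")["rank"].mean().sort_values()
print(overall)
```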

    S2 Fig. Relative performance of classification algorithms using gene-expression predictors and classification accuracy as the metric.

    We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of classification accuracy across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

    (PDF)

    S3 Fig. Relative performance of classification algorithms using gene-expression predictors and Matthews Correlation Coefficient as the metric.

    We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of the Matthews Correlation Coefficient across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

    (PDF)
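    For readers unfamiliar with this metric, the Matthews correlation coefficient can be computed directly from predicted and true class labels; the short example below uses scikit-learn (one of the libraries evaluated in this study), with made-up labels.

```python
# Minimal example of computing the Matthews correlation coefficient (MCC)
# with scikit-learn; the labels below are illustrative only.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

# MCC ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect
# prediction) and remains informative when classes are imbalanced.
print(matthews_corrcoef(y_true, y_pred))  # 0.5 for this toy example
```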

    S4 Fig. Relative performance of classification algorithms using gene-expression predictors and area under the precision-recall curve as the metric.

    We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the precision-recall curve across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

    (PDF)
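    As a companion to the AUROC-based figures, the sketch below shows one way to compute AUROC alongside a precision-recall summary for the same probabilistic predictions using scikit-learn. Average precision is used here as an approximation of AUPRC, and the labels and scores are simulated, so this is illustrative only and may differ from the study's exact computation.

```python
# Sketch: AUROC and a precision-recall summary for the same probabilistic predictions.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # illustrative binary labels
y_prob = np.clip(y_true * 0.3 + rng.random(200) * 0.7, 0, 1)   # illustrative scores

print("AUROC:", roc_auc_score(y_true, y_prob))
print("AUPRC (average precision):", average_precision_score(y_true, y_prob))
```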

    S5 Fig. Comparison of area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) scores for Analysis 1.

    (PDF)

    S6 Fig. Comparison of area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) scores for Analysis 1, based on ranks (relative performance per algorithm).

    (PDF)

    S7 Fig. Pairwise correlations of sample-level, probabilistic predictions between classification algorithms for dataset GSE10320.

    We used each classification algorithm to make probabilistic predictions of relapse in Wilms tumor patients (GSE10320). Based on these predictions, we calculated the Spearman correlation coefficient for each pair of algorithms. These coefficients, averaged across Monte Carlo cross-validation iterations, are illustrated as a correlation plot, clustered based on similarity.

    (PDF)
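    A minimal sketch of this pairwise-correlation calculation follows; it assumes a hypothetical table with one column of probabilistic predictions per algorithm (rows corresponding to patient samples) and is not the code used to produce S7 Fig.

```python
# Sketch of pairwise Spearman correlations between algorithms' probabilistic predictions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.random(60)  # illustrative underlying "relapse probability" signal
predictions = pd.DataFrame({
    "svm": base + rng.normal(0, 0.05, 60),
    "random_forest": base + rng.normal(0, 0.10, 60),
    "naive_bayes": rng.random(60),
})

# Spearman (rank-based) correlation for each pair of algorithms; in the study these
# coefficients were averaged across cross-validation iterations before plotting.
spearman_matrix = predictions.corr(method="spearman")
print(spearman_matrix.round(2))
```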

    S8 Fig. Pairwise correlations of sample-level, probabilistic predictions between classification algorithms for dataset GSE46691.

    We used each classification algorithm to make probabilistic predictions of early metastasis following radical prostatectomy (GSE46691). Based on these predictions, we calculated the Spearman correlation coefficient for each pair of algorithms. These coefficients, averaged across Monte Carlo cross-validation iterations, are illustrated as a correlation plot, clustered based on similarity.

    (PDF)

    S9 Fig. Dataset performance by class category when using gene-expression predictors.

    For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 1, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as the metric. The dashed, red line indicates the performance expected by random chance. The top-performing category was “Molecular Marker,” which includes class variables associated with mutation status, immunohistochemistry markers of protein expression, presence or absence of chromosomal aberrations, etc. The lowest-performing category was “Patient Characteristic,” which includes variables that indicate whether patients had a family history of cancer, had been diagnosed with multiple tumors, patient performance status, etc.

    (PDF)

    S10 Fig. Relative performance of classification algorithms using clinical predictors and area under the receiver operating characteristic curve as the metric.

    We predicted patient states using clinical predictors only (Analysis 2). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination (some datasets did not have clinical predictors). The top-performing algorithms (relatively low ranks) were similar overall to Analysis 1; however, some differences were large. For example, weka/NaiveBayes performed best overall in Analysis 2 but was ranked 28th in Analysis 1.

    (PDF)

    S11 Fig. Dataset performance by class category when using clinical predictors.

    For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 2, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as the metric. The dashed, red line indicates the performance expected by random chance. The top-performing category was “Diagnosis,” which includes class variables associated with a particular disease or subtype. The lowest-performing category was “Patient Characteristic,” which includes variables that indicate whether patients had a family history of cancer, had been diagnosed with multiple tumors, patient performance status, etc.

    (PDF)

    S12 Fig. Dataset performance by class category when using gene-expression and clinical predictors.

    For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 3, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as a metric. The dashed, red line indicates the performance expected by random chance. As with Analysis 1 (S9 Fig), the top-performing category was “Molecular Marker,” which includes class variables associated with mutation status, immunohistochemistry markers of protein expression, presence or absence of chromosomal aberrations, etc. The lowest-performing category was “Patient Characteristic,” which includes variables that indicate whether patients had a family history of cancer, had been diagnosed with multiple tumors, patient performance status, etc.

    (PDF)

    S13 Fig. Relative performance of classification algorithms using gene-expression and clinical predictors.

    We predicted patient states using gene-expression and clinical predictors (Analysis 3). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.

    (PDF)

    S14 Fig. Relative performance of classification algorithms using gene-expression and clinical predictors and performing hyperparameter optimization.

    We predicted patient states using gene-expression and clinical predictors with hyperparameter optimization (Analysis 4). We used nested cross validation to estimate which hyperparameter combination would be optimal for each algorithm in each training set. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination. The algorithm rankings followed similar trends as Analysis 3 (no hyperparameter optimization); however, some differences are notable. For example, the weka/LibLINEAR and mlr/glmnet algorithms were ranked 11th and 16th in Analysis 3 (S13 Fig), but they were ranked 1st and 2nd in this analysis.

    (PDF)
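    The following generic scikit-learn sketch illustrates the structure of nested cross-validation for hyperparameter optimization; it is not the study's pipeline (which used ShinyLearner, Monte Carlo cross-validation, and the hyperparameter grids listed in S11 Data), and the algorithm, grid, and simulated data below are placeholders.

```python
# Generic sketch of nested cross-validation for hyperparameter optimization.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score

X, y = make_classification(n_samples=200, n_features=50, n_informative=10, random_state=0)

# Inner loop: choose the regularization strength using only the training portion.
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)

# Outer loop: repeated stratified train/test splits to estimate the performance of
# the "tune, then fit" procedure as a whole.
outer = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer, scoring="roc_auc")
print(scores.mean())
```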

    S15 Fig. Dataset performance by class category when using gene-expression and clinical predictors and performing hyperparameter optimization.

    For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 4, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as a metric. The dashed, red line indicates the performance expected by random chance.

    (PDF)

    S16 Fig. Correlation between predictive performance and number of samples per dataset.

    The number of patient samples differed by dataset. This scatterplot shows the relationship between the median area under the receiver operating characteristic curve (AUROC) and the number of samples in each dataset. We did not observe a significant correlation between these variables.

    (PDF)

    S17 Fig. Correlation between predictive performance and number of genes per dataset.

    Due to differences in gene-expression profiling platforms, we had data for more genes in some datasets than in others. This scatterplot shows the relationship between the median area under the receiver operating characteristic curve (AUROC) and the number of genes in each dataset. We did not observe a significant correlation between these variables.

    (PDF)

    S18 Fig. Variation in predictive performance across hyperparameter combinations.

    In Analysis 4, we used nested cross validation to evaluate multiple hyperparameter combinations for each classification algorithm. We assessed the extent to which the area under the receiver operating characteristic curve (AUROC) varied across the hyperparameter combinations for each algorithm. For each combination of dataset, class variable, classification algorithm, and hyperparameter set, we averaged AUROC values across 5 Monte Carlo cross-validation iterations. Then we calculated the coefficient of variation for these averaged values across each combination of dataset/class and classification algorithm. Relatively low values indicate that the hyperparameter sets resulted in similar predictive performance. No results are available for 3 algorithms that used only a single hyperparameter option.

    (PDF)
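    The coefficient-of-variation calculation described above reduces to a standard deviation divided by a mean; the sketch below illustrates it with hypothetical averaged AUROC values for one dataset/class/algorithm combination.

```python
# Sketch of the coefficient-of-variation calculation described for S18 Fig
# (not the study's actual code). One hypothetical mean-AUROC value per
# hyperparameter combination, already averaged across cross-validation iterations.
import numpy as np

auroc_per_hyperparameter_set = np.array([0.78, 0.81, 0.80, 0.79, 0.66])

# Coefficient of variation = standard deviation / mean; low values indicate that the
# hyperparameter combinations performed similarly for this dataset/class/algorithm.
cv = auroc_per_hyperparameter_set.std(ddof=1) / auroc_per_hyperparameter_set.mean()
print(round(cv, 3))
```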

    S19 Fig. Relative performance of different hyperparameter combinations for the weka/LIBLINEAR classification algorithm.

    The ShinyLearner software supports 16 hyperparameter combinations for the weka/LibLINEAR classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Some combinations consistently outperformed other combinations, and the default combination performed suboptimally. Using relatively small cost values appeared to improve the performance more than any other option. This hyperparameter controls the regularization strength.

    (PDF)

    S20 Fig. Relative performance of different hyperparameter combinations for the mlr/glmnet classification algorithm.

    The ShinyLearner software supports 3 hyperparameter combinations for the mlr/glmnet classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Using an alpha value of 0.5 or 0 resulted in better performance than a value of 1.

    (PDF)

    S21 Fig. Relative performance of different hyperparameter combinations for the sklearn/logistic_regression classification algorithm.

    The ShinyLearner software supports 32 hyperparameter combinations for the sklearn/logistic_regression classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Some combinations consistently outperformed other combinations, and the default combination performed suboptimally. Using relatively small cost values appeared to improve the performance more than any other option. This hyperparameter controls the regularization strength.

    (PDF)

    S22 Fig. Relative performance of different hyperparameter combinations for the sklearn/extra_trees classification algorithm.

    The ShinyLearner software supports 24 hyperparameter combinations for the sklearn/extra_trees classification algorithm. In Analysis 4, we used nested cross validation for hyperparameter optimization. For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all (outer) Monte Carlo cross-validation iterations and then ranked the averages for each hyperparameter combination. Some combinations consistently outperformed other combinations, and the default combination performed suboptimally. Using a larger number (n = 1000) of estimators (trees) appeared to improve the performance more than any other option.

    (PDF)

    S23 Fig. Relative predictive performance when using hyperparameter optimization vs. feature selection.

    We used as a baseline the predictive performance that we attained using default hyperparameters for the classification algorithms (Analysis 3). We quantified predictive performance using the area under the receiver operating characteristic curve (AUROC). This graph shows the increase or decrease in performance when selecting hyperparameters or selecting features relative to the baseline. Each point represents a particular combination of dataset and class variable. Generally, the dataset/class combinations that benefitted from hyperparameter optimization also benefitted from feature selection. However, some dataset/class combinations that did not benefit from hyperparameter optimization did benefit from feature selection.

    (PDF)

    S24 Fig. Dataset performance by class category when using gene-expression and clinical predictors and performing feature selection.

    For each class variable across all datasets, we assigned a category representing the type of patient state being predicted. For Analysis 5, we show the predictive performance for each combination of dataset, class variable, and classification algorithm in each class category. We use area under the receiver operating characteristic curve (AUROC) as a metric. The dashed, red line indicates the performance expected by random chance. The results are similar to those of Analyses 3 and 4 (S12 and S15 Figs).

    (PDF)

    S25 Fig. Predictive performance according to the number of features selected via nested cross-validation.

    Relative area under the receiver operating characteristic curve (AUROC) values were calculated by comparing against the mean for each combination of classification algorithm and feature-selection algorithm.

    (PDF)

    S26 Fig. Relative performance of feature-selection algorithms.

    For Analysis 5, we used nested cross validation to estimate which features would be most informative for each algorithm in each training set. For each combination of dataset, class variable, and classification algorithm, we ranked the performance of the feature-selection algorithms based on area under the receiver operating characteristic curve (AUROC) and averaged the rankings across 5 iterations of Monte Carlo cross-validation. Each data point that overlays the box plots represents a particular dataset/class combination. Relatively low average ranks are considered optimal. The weka/Correlation feature-selection algorithm performed best overall.

    (PDF)
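    As a rough analogue of univariate feature ranking (the study's best-performing selector, weka/Correlation, is a different implementation), the sketch below scores features with scikit-learn's f_classif and keeps the top-ranked ones; the data are simulated, and the number of retained features is fixed here, whereas the study chose it via nested cross-validation.

```python
# Sketch of univariate feature ranking and selection (illustrative analogue only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=150, n_features=200, n_informative=15, random_state=0)

scores, _ = f_classif(X, y)              # one univariate score per gene/feature
ranking = np.argsort(scores)[::-1]       # feature indices, best first

n_top = 20                               # fixed here; chosen via nested cross-validation in the study
X_selected = X[:, ranking[:n_top]]
print(X_selected.shape)                  # (150, 20)
```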

    S27 Fig. Execution time per feature-selection algorithm.

    In Analysis 5, we used nested cross validation to estimate which features were most informative for each training set. We calculated the time (in seconds) required by each feature-selection algorithm to rank the features. Then we averaged these times across all combinations of dataset, class variable, classification algorithm, and (outer) Monte Carlo cross-validation iteration. Some feature-selection algorithms were much more computationally intensive than others.

    (PDF)
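    Execution-time measurements of this kind can be illustrated with a simple wall-clock timer; the standalone sketch below (not the ShinyLearner instrumentation used in the study) times one feature-ranking step on simulated data.

```python
# Minimal sketch of timing a single feature-selection step on simulated data.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=150, n_features=2000, n_informative=20, random_state=0)

start = time.perf_counter()
scores = mutual_info_classif(X, y, random_state=0)   # rank features by mutual information
elapsed = time.perf_counter() - start

# In the study, such timings were averaged across datasets, class variables,
# classification algorithms, and cross-validation iterations.
print(f"Feature ranking took {elapsed:.2f} seconds for {X.shape[1]} features")
```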

    S28 Fig. Pairwise correlations of feature ranks between feature-selection algorithms for dataset GSE10320.

    We used each feature-selection algorithm to rank the genes based on their informativeness for discriminating between relapse and non-relapse outcomes in Wilms tumor patients (GSE10320). After averaging the ranks across cross-validation iterations, we calculated the Spearman correlation coefficient for the feature ranks produced by each pair of algorithms. These coefficients are illustrated as a correlation plot.

    (PDF)

    S29 Fig. Pairwise correlations of feature ranks between feature-selection algorithms for dataset GSE46691.

    We used each feature-selection algorithm to rank the genes based on their informativeness for predicting early metastasis following radical prostatectomy (GSE46691). After averaging the ranks across cross-validation iterations, we calculated the Spearman correlation coefficient for the feature ranks produced by each pair of algorithms. These coefficients are illustrated as a correlation plot.

    (PDF)

    S30 Fig. Absolute classification performance per combination of feature-selection and classification algorithm.

    For each combination of dataset and class variable, we averaged the area under the receiver operating characteristic curve (AUROC) across all Monte Carlo cross-validation iterations. Then for each combination of feature-selection algorithm and classification algorithm, we calculated the median AUROC across all datasets and class variables.

    (PDF)

    S31 Fig. Relative performance of classification algorithms using gene-expression and clinical predictors and performing feature selection with hyperparameter optimization.

    We predicted patient states using gene-expression and clinical predictors with feature selection and optimization of the feature-selection algorithm hyperparameters (Analysis 6). We used nested cross validation to estimate which features and hyperparameter combinations would be optimal for each algorithm in each training set.

    (PDF)

    S32 Fig. Stability of classification performance for increasing numbers of cross-validation iterations on dataset GSE10320.

    When using gene-expression predictors (Analysis 1), we estimated the number of Monte Carlo cross-validation iterations that would be sufficient to characterize algorithm performance. For three classification algorithms, we executed 100 cross-validation iterations on dataset GSE10320 (predicting relapse vs. non-relapse for Wilms tumor patients). As the number of iterations increased, we calculated the cumulative average of the area under the receiver operating characteristic curve (AUROC) for each algorithm. After performing at most 40 iterations, the cumulative averages did not change more than 0.01 over sequences of 10 iterations.

    (PDF)
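    The stability criterion described above can be expressed as a small calculation over cumulative means; the sketch below uses simulated per-iteration AUROC values and a hypothetical implementation of the 0.01-change-over-10-iterations rule, so it approximates rather than reproduces the study's procedure.

```python
# Sketch of the stability check described for S32 Fig (not the study's actual code):
# track the cumulative mean AUROC as iterations accumulate and flag the point after
# which it changes by no more than 0.01 over a window of 10 further iterations.
import numpy as np

rng = np.random.default_rng(2)
auroc_per_iteration = rng.normal(loc=0.75, scale=0.04, size=100)  # illustrative values

cumulative_mean = np.cumsum(auroc_per_iteration) / np.arange(1, 101)

stable_after = None
for i in range(len(cumulative_mean) - 10):
    window = cumulative_mean[i:i + 11]
    if window.max() - window.min() <= 0.01:
        stable_after = i + 1  # 1-based iteration index
        break

print("Cumulative mean stabilized after iteration:", stable_after)
```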

    S33 Fig. Stability of classification performance for increasing numbers of cross-validation iterations on dataset GSE46691.

    When using gene-expression predictors (Analysis 1), we estimated the number of Monte Carlo cross-validation iterations that would be sufficient to characterize algorithm performance. For three classification algorithms, we executed 100 cross-validation iterations on dataset GSE46691 (predicting early metastasis following radical prostatectomy). As the number of iterations increased, we calculated the cumulative average of the area under the receiver operating characteristic curve (AUROC) for each algorithm. After performing at most 22 iterations, the cumulative averages did not change more than 0.01 over sequences of 10 iterations.

    (PDF)

    S1 Data. Summary of predictive performance per dataset when using gene-expression predictors.

    We predicted patient states using gene-expression predictors only (Analysis 1). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination.

    (XLSX)

    S2 Data. Summary of predictive performance per dataset when using clinical predictors.

    We predicted patient states using clinical predictors only (Analysis 2). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination. For some dataset/class combinations, no clinical predictors were available; these combinations are excluded from this file.

    (XLSX)

    S3 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors.

    We predicted patient states using gene-expression and clinical predictors (Analysis 3). For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 50 iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination. For some dataset/class combinations, no clinical predictors were available; these combinations are excluded from this file.

    (XLSX)

    S4 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors and performing hyperparameter optimization.

    We predicted patient states using gene-expression and clinical predictors (Analysis 4). For classification algorithms that included multiple hyperparameter combinations (n = 47), we performed hyperparameter optimization using the respective training sets. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 (outer) iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination.

    (XLSX)

    S5 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors and performing feature selection.

    We predicted patient states using gene-expression and clinical predictors (Analysis 5). Using each respective training set, we performed feature selection for each of 14 feature-selection algorithms and performed classification using n top-ranked features. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 (outer) iterations of Monte Carlo cross-validation. Next, we calculated the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for these values across the algorithms. Finally, we sorted the algorithms in descending order based on median values. Each row represents a particular dataset/class combination.

    (XLSX)

    S6 Data. Summary of predictive performance per dataset when using gene-expression and clinical predictors and performing feature selection with hyperparameter optimization.

    (XLSX)

    S7 Data. Top 50 genes according to average rank across feature-selection algorithms for GSE10320 and GSE46691.

    (XLSX)

    S8 Data. Gene-set overlap results for top 50 genes according to average rank across feature-selection algorithms for GSE10320.

    (XLSX)

    S9 Data. Gene-set overlap results for top 50 genes according to average rank across feature-selection algorithms for GSE46691.

    (XLSX)

    S10 Data. Summary of datasets used.

    This file contains a unique identifier for each dataset, indicates whether gene-expression microarrays or RNA-Sequencing were used to generate the data, and indicates the name of the class variable from the original dataset. In addition, we assigned standardized names and categories as a way to support consistency across datasets. The file lists any clinical predictors that were used in the analyses as well as the number of samples and genes per dataset.

    (XLSX)

    S11 Data. Classification algorithm hyperparameter combinations.

    This file indicates all hyperparameter combinations that we evaluated via nested cross-validation in Analysis 4.

    (XLSX)

    S12 Data. Feature-selection algorithm hyperparameter combinations.

    This file indicates all hyperparameter combinations that we evaluated via nested cross-validation in the follow-up analysis to Analysis 5.

    (XLSX)

    Attachment

    Submitted filename: reviewer_ploscompio_june2021.docx

    Attachment

    Submitted filename: New Rich Text Document.rtf

    Attachment

    Submitted filename: Response_to_Reviewers_PLOS_CompBio.pdf

    Data Availability Statement

    Source code for each algorithm used can be found in the repositories for the respective software libraries used in this study: https://github.com/scikit-learn/scikit-learn, https://github.com/mlr-org/mlr, https://github.com/Waikato/weka-3.8, and https://github.com/keras-team/keras. Code used to integrate the software libraries within software containers, to perform cross validation, to calculate performance metrics, etc. is part of the ShinyLearner tool; its source code can be found at https://github.com/srp33/ShinyLearner. Data and code used to execute this analysis are available at https://osf.io/fv8td/. This repository contains raw and summarized versions of the analysis results, as well as the code that we used to generate the figures and tables for this manuscript. The repository is freely available under the Creative Commons Universal 1.0 license. All other data are within the manuscript and its Supporting Information files.

