Table 9. Main advantages and limitations of the gene selection methods considered.
Gene Selection Method Name | Gene Selection Method Acronym | Main Advantages | Main Limitations |
---|---|---|---|
Least Absolute Shrinkage and Selection Operator | LASSO | (1) Smaller mean squared error (MSE) than conventional methods; (2) It performs simultaneous coefficient estimation and elimination of trivial genes; (3) Its coefficient estimates are easy to implement. | (1) It does not perform grouped selection well; (2) For highly correlated variables, conventional methods have been empirically observed to predict better than the LASSO; (3) It has been shown not to always provide consistent variable selection; (4) Its estimators are always biased; (5) Its efficiency depends greatly on the dimensionality of the gene data. |
Adaptive Least Absolute Shrinkage and Selection Operator | Adapt. LASSO | (1) It retains all the advantages of the LASSO; (2) It uses adaptive weights to penalize individual coefficients differently; (3) It provides more consistent variable selection than the LASSO. | (1) It does not perform grouped selection well; (2) For highly correlated variables, conventional methods have been empirically observed to predict better than the adaptive LASSO; (3) Its estimators are always biased; (4) Its efficiency depends greatly on the dimensionality of the gene data. |
Elastic net regularization | Elastic net | (1) It selects groups of correlated variables together and shares the desirable properties of both the LASSO and ridge regression; (2) It is suitable for p > n settings, since it allows the number of selected features to exceed the sample size; (3) Its predictive performance is better than that of the LASSO and ridge regression. | (1) It applies only to two-class feature selection problems and cannot handle multi-class feature selection directly; (2) Its estimators are not robust to outliers. |
Ridge Logistic Regression | Ridge | (1) It handles the multicollinearity problem; (2) It can reduce variance (at the cost of increased bias); (3) It can improve predictive performance over the ordinary least squares approach. | (1) It is not able to shrink coefficients to exactly zero; (2) It cannot perform variable selection: all predictors (e.g., genes) are included in the final model; (3) It cannot handle the overfitting problem. |
Support Vector Machines | SVM | (1) It has a regularization parameter for avoiding overfitting; (2) It uses the kernel trick; (3) It is defined by a convex optimization problem (no local optima); (4) It is a powerful classifier that works well on a wide range of classification problems, even when little is known about the data; (5) It can be applied to high-dimensional and non-linearly-separable problems. | (1) Choosing a good kernel function is not easy; (2) Several key parameters must be set correctly to achieve the best classification results for a given problem; (3) Training is slow and computationally intensive for large datasets and large amounts of training data, especially the grid search used to tune its parameters; (4) The final model, variable weights, and individual impacts are difficult to understand and interpret. |
Gradient Boosting Machines (stochastic) | GBM | (1) It can be applied to high-dimensional problems; (2) It works well when there are many main and interaction effects; (3) It can automatically select variables; (4) It is robust to outliers and missing data; (5) It can handle numerous correlated and irrelevant variables; (6) It is an ensemble learning method. | (1) Long training time on large datasets; (2) The model is difficult to understand and interpret; (3) It is prone to overfitting. |
Naive Bayes | NB | (1) It is easy to implement as a single learner; (2) If its conditional independence assumption actually holds, a naive Bayes classifier converges more quickly than discriminative models (e.g., logistic regression); (3) It needs less training data than other algorithms. | (1) It assumes class-conditional independence of all variables (e.g., genes); (2) It is defined by a local optimization problem. |
Random Forest | RF | (1) It can be applied to high-dimensional problems; (2) It is robust to outliers and missing data; (3) It has lower variance than a single decision tree; (4) Each tree is trained independently. | (1) It is complex; (2) It requires more computational resources and is less intuitive; (3) Its prediction process is more time-consuming than that of a single decision tree; (4) It assumes that model errors are uncorrelated and uniform. |
Artificial Neural Network | ANN | (1) It is easy to implement; (2) It can approximate any function between the independent and dependent variables; (3) It handles all possible interactions between the predictor variables; (4) It does not require distributional assumptions, so it works well when little is known about the data. | (1) It is solved by local optimization; (2) Its parameters are hard to interpret; (3) Long training time for large networks. |
Decision Trees | DT | (1) Easy to interpret and explain as a single learner; (2) It is very fast; (3) Its estimators are robust to outliers; (4) It can be combined with other decision techniques; (5) It handles missing values by filling them in with the most probable value. | (1) Prone to overfitting; (2) It is unstable; (3) Its predictive performance is worse than that of random forests; (4) It is solved by local optimization. |
AdaBoost Classification Trees (Adaptive Boosting) | ABCT | (1) It can be less susceptible to overfitting than most learning algorithms; (2) It combines a set of weak learners to form a strong classifier, and the weak classifiers are easy to select; (3) It is a machine learning meta-algorithm. | (1) It can be sensitive to noisy data and outliers; (2) It requires a large amount of training data and a long training time. |
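
To make the contrast between the penalized regression methods in Table 9 concrete, the minimal sketch below fits the LASSO, ridge, and elastic net to a simulated p > n expression-like dataset and counts how many coefficients each sets exactly to zero. It assumes scikit-learn is available; the sample size, gene count, and penalty strengths are illustrative choices, not values taken from any study discussed here.

```python
# Illustrative comparison of LASSO, ridge, and elastic net shrinkage
# on a "many genes, few samples" (p > n) simulation.
# All dataset sizes and penalty strengths below are assumed values.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# 60 samples, 500 simulated genes, only 10 of which carry signal.
X, y = make_regression(n_samples=60, n_features=500, n_informative=10,
                       noise=1.0, random_state=0)

models = {
    "LASSO": Lasso(alpha=1.0, max_iter=10_000),
    "Ridge": Ridge(alpha=1.0),
    "Elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10_000),
}

for name, model in models.items():
    model.fit(X, y)
    # Ridge only shrinks coefficients toward zero, so it keeps every gene;
    # the L1 penalty lets the LASSO and elastic net drop genes entirely.
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name:12s} coefficients set to zero: {n_zero} / {X.shape[1]}")
```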
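The adaptive LASSO row notes that the method penalizes each coefficient with its own weight. One common way to realize this, shown below as a rough two-step approximation rather than a procedure taken from the source, is to derive the weights from an initial ridge fit and then solve a weighted LASSO by rescaling the design columns; all tuning values here are assumptions.

```python
# Two-step sketch of the adaptive LASSO's per-gene weighting
# (an illustrative approximation; penalty values are assumed).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=60, n_features=200, n_informative=8,
                       noise=1.0, random_state=1)

# Step 1: initial coefficient estimates from a ridge fit.
beta_init = Ridge(alpha=1.0).fit(X, y).coef_
weights = 1.0 / (np.abs(beta_init) + 1e-6)  # small initial estimate -> heavier penalty

# Step 2: the weighted L1 penalty sum_j weights_j * |beta_j| is obtained by
# rescaling each column by 1 / weights_j, fitting an ordinary LASSO, and
# mapping the coefficients back to the original scale.
lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X / weights, y)
beta_adaptive = lasso.coef_ / weights

print("Genes retained:", int(np.sum(beta_adaptive != 0)), "of", X.shape[1])
```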
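Several of the tree-based entries in Table 9 (random forest, GBM, AdaBoost) perform embedded variable selection through their importance scores. The sketch below, again with assumed data sizes and hyperparameters, ranks simulated genes by a random forest's impurity-based importances and keeps a top-ranked subset.

```python
# Embedded gene ranking with a random forest's importance scores
# (simulated data; the number of trees and the cutoff of 20 genes are assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Two-class "expression" data: 80 samples, 300 simulated genes.
X, y = make_classification(n_samples=80, n_features=300, n_informative=15,
                           n_redundant=5, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank genes by impurity-based importance and keep the 20 highest.
top_genes = np.argsort(rf.feature_importances_)[::-1][:20]
print("Top-ranked gene indices:", top_genes)
```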