Sci Rep. 2018 Oct 25;8:15775. doi: 10.1038/s41598-018-33986-8

Table 9. Gene selection methods: definitions, acronyms, and main advantages and limitations.

Least Absolute Shrinkage and Selection Operator (LASSO)
Main advantages:
(1) Smaller mean squared error (MSE) than conventional methods;
(2) Well suited to simultaneous estimation and the elimination of trivial genes;
(3) Its coefficient estimates are easy to implement.
Main limitations:
(1) It is not well suited to grouped selection;
(2) For highly correlated variables, conventional methods have been empirically observed to predict better than the LASSO;
(3) It does not always provide consistent variable selection;
(4) Its estimators are always biased;
(5) Its efficiency depends strongly on the dimensionality of the gene data.
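As an illustration (not part of the original table), a minimal scikit-learn sketch of LASSO-style gene selection via L1-penalised logistic regression; the expression matrix X, outcome y, and the penalty strength C are toy placeholders.

```python
# Minimal sketch (assumed setup, not from the paper): L1-penalised logistic
# regression retains only genes with non-zero coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))                       # toy expression matrix (samples x genes)
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

X_std = StandardScaler().fit_transform(X)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
lasso.fit(X_std, y)

selected = np.flatnonzero(lasso.coef_.ravel() != 0.0)  # genes kept by L1 shrinkage
print(f"{selected.size} genes retained:", selected[:10])
```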
Adaptive Least Absolute Shrinkage and Selection Operator (Adaptive LASSO)
Main advantages:
(1) It retains all of the advantages of the LASSO;
(2) It uses adaptive weights to penalize individual coefficients differently;
(3) It provides more consistent variable selection than the LASSO.
Main limitations:
(1) It is not well suited to grouped selection;
(2) For highly correlated variables, conventional methods have been empirically observed to predict better than the adaptive LASSO;
(3) Its estimators are always biased;
(4) Its efficiency depends strongly on the dimensionality of the gene data.
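One common way to realise the adaptive weights (an assumption on my part, not a procedure described in the paper) is to rescale each gene by weights taken from an initial ridge fit and then run an ordinary LASSO; X, y, and all tuning values below are placeholders.

```python
# Minimal sketch (assumed setup): adaptive LASSO via column rescaling.
# An initial ridge fit supplies weights w_j = 1/|beta_j|; dividing each gene
# by w_j and fitting a plain LASSO applies a gene-specific penalty.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=100)

beta_init = Ridge(alpha=1.0).fit(X, y).coef_
w = 1.0 / (np.abs(beta_init) + 1e-8)        # adaptive penalty weights
theta = Lasso(alpha=0.1).fit(X / w, y).coef_  # LASSO on rescaled genes
beta_adaptive = theta / w                     # map back to the original scale
print(np.flatnonzero(beta_adaptive != 0.0)[:10])
```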
Elastic Net Regularization (Elastic net)
Main advantages:
(1) It selects groups of correlated variables together and shares the desirable properties of both the LASSO and ridge regression;
(2) It can be used when p > n, since it allows the number of selected features to exceed the sample size;
(3) Its predictive performance is better than that of the LASSO and ridge regression.
Main limitations:
(1) It applies only to two-class feature selection problems and cannot handle multi-class feature selection directly;
(2) Its estimators are not robust against outliers.
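As an illustrative sketch only, an elastic-net-penalised logistic regression in scikit-learn, mixing the L1 (sparsity) and L2 (grouping) penalties; X, y, l1_ratio, and C are placeholder choices.

```python
# Minimal sketch (assumed setup, not from the paper): elastic net logistic
# regression; l1_ratio balances the LASSO and ridge penalties.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(120, 300)))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=120) > 0).astype(int)

enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.5, max_iter=5000)
enet.fit(X, y)
print(np.count_nonzero(enet.coef_), "genes with non-zero coefficients")
```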
Ridge Logistic Regression (Ridge)
Main advantages:
(1) It handles the multicollinearity problem;
(2) It can reduce variance (at the cost of increased bias);
(3) It can improve predictive performance relative to the ordinary least squares approach.
Main limitations:
(1) It cannot shrink coefficients to exactly zero;
(2) It cannot perform variable selection; all predictors (e.g. genes) are retained in the final model;
(3) It cannot handle the overfitting problem.
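A minimal sketch of ridge (L2-penalised) logistic regression, illustrating the limitation that coefficients shrink but none become exactly zero; the data and penalty value are placeholders, not from the paper.

```python
# Minimal sketch (assumed setup): L2-penalised logistic regression keeps
# every gene in the model, so it cannot act as a variable selector.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
y = rng.integers(0, 2, size=100)

ridge = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X, y)
print("coefficients shrunk to exactly zero:", np.count_nonzero(ridge.coef_ == 0.0))
```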
Support Vector Machines (SVM)
Main advantages:
(1) A regularization parameter helps avoid overfitting;
(2) It uses the kernel trick;
(3) It is defined by a convex optimization problem (no local optima);
(4) It is a powerful classifier that works well on a wide range of classification problems; in other words, it performs well even when little is known about the data;
(5) It can be applied to high-dimensional and non-linearly separable problems.
Main limitations:
(1) Choosing a good kernel function is not easy;
(2) Several key parameters must be set correctly to achieve the best classification results for a given problem;
(3) Training time is long for large datasets and large amounts of training data; it is computationally intensive, especially the grid search for tuning its parameters;
(4) The final model, variable weights, and individual variable impacts are difficult to understand and interpret.
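A minimal sketch of an RBF-kernel SVM with the grid search over C and gamma that the table flags as computationally costly; the data and parameter grid are illustrative placeholders.

```python
# Minimal sketch (assumed setup, not from the paper): RBF-kernel SVM tuned
# by a small cross-validated grid search over C and gamma.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 100))
y = rng.integers(0, 2, size=150)

grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```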
Gradient Boosting Machines, stochastic (GBM)
Main advantages:
(1) It can be applied to high-dimensional problems;
(2) It works well when there are many main-effect and interaction parameters;
(3) It selects variables automatically;
(4) It is robust to outliers and missing data;
(5) It can handle numerous correlated and irrelevant variables;
(6) It is an ensemble learning method.
Main limitations:
(1) Long training time for large datasets;
(2) The model is difficult to understand and interpret;
(3) It is prone to overfitting.
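A minimal sketch of a stochastic gradient boosting classifier (subsample below 1.0) whose built-in feature importances act as the implicit gene selection the table mentions; all hyperparameters and the data are placeholders.

```python
# Minimal sketch (assumed setup, not from the paper): stochastic GBM; the
# impurity-based importances rank genes after fitting.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)

gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 subsample=0.8, max_depth=3, random_state=0)
gbm.fit(X, y)
top_genes = np.argsort(gbm.feature_importances_)[::-1][:20]
print(top_genes)
```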
Naive Bayes (NB)
Main advantages:
(1) It is easy to implement as a single learner;
(2) If its conditional independence assumption actually holds, a Naive Bayes classifier converges more quickly than discriminative models (e.g. logistic regression);
(3) It needs less training data than other algorithms.
Main limitations:
(1) It assumes class-conditional independence of all variables (e.g. genes);
(2) It is defined by a local optimization problem.
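A minimal sketch of a Gaussian Naive Bayes classifier, which treats every gene as conditionally independent given the class label; the data are toy placeholders.

```python
# Minimal sketch (assumed setup, not from the paper): Gaussian Naive Bayes
# with 5-fold cross-validated accuracy.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)

print(cross_val_score(GaussianNB(), X, y, cv=5).mean())
```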
Random Forest (RF)
Main advantages:
(1) It can be applied to high-dimensional problems;
(2) It is robust to outliers and missing data;
(3) It has lower variance than a single decision tree;
(4) Each tree is trained independently.
Main limitations:
(1) It is complex;
(2) It requires more computational resources and is less intuitive;
(3) Prediction with a random forest is more time-consuming than with a single decision tree;
(4) It assumes that model errors are uncorrelated and uniform.
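A minimal sketch of a random forest whose trees are fitted independently (and so can be parallelised) and whose importances can be used to rank genes; the data and forest size are placeholder choices.

```python
# Minimal sketch (assumed setup, not from the paper): random forest;
# n_jobs=-1 fits the independent trees in parallel.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = rng.integers(0, 2, size=200)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(np.argsort(rf.feature_importances_)[::-1][:20])   # top-ranked genes
```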
Artificial Neural Network (ANN)
Main advantages:
(1) It is easy to implement;
(2) It can approximate any function between the independent and dependent variables;
(3) It handles all possible interactions between the independent variables;
(4) It does not require any distributional assumptions; in other words, it performs well even when little is known about the data.
Main limitations:
(1) It is solved by local optimization;
(2) Its parameters are hard to interpret;
(3) Long training time for large neural networks.
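A minimal sketch of a small multilayer perceptron; like any neural network it is fitted by non-convex (local) optimization. The architecture, data, and iteration budget are illustrative placeholders.

```python
# Minimal sketch (assumed setup, not from the paper): a two-hidden-layer
# MLP classifier on standardised toy expression data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(150, 100)))
y = rng.integers(0, 2, size=150)

ann = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
ann.fit(X, y)
print(ann.score(X, y))
```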
Decision Trees (RT)
Main advantages:
(1) Easy to interpret and explain as a single learner;
(2) It is very fast;
(3) Its estimators are robust against outliers;
(4) It can be combined with other decision techniques;
(5) It handles missing values by filling them in with the most probable value.
Main limitations:
(1) Prone to overfitting;
(2) Instability;
(3) Its predictive performance is worse than that of a random forest;
(4) It is solved by local optimization.
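A minimal sketch of a single decision tree; printing its rules illustrates the interpretability credited to it, while limiting the depth is the usual guard against the overfitting listed above. The data and depth are placeholders.

```python
# Minimal sketch (assumed setup, not from the paper): a shallow decision
# tree whose fitted rules can be printed as human-readable text.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))                  # interpretable if-then rules
print("training accuracy:", tree.score(X, y))
```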
AdaBoost Classification Trees, Adaptive Boosting (ABCT)
Main advantages:
(1) It can be less susceptible to overfitting than most learning algorithms;
(2) It combines a set of weak learners to form a strong classifier, and selecting the weak classifier is easy;
(3) It is a machine learning meta-algorithm.
Main limitations:
(1) It can be sensitive to noisy data and outliers;
(2) It requires a large amount of training data and a long training time.
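A minimal sketch of AdaBoost, which boosts shallow decision trees (scikit-learn's default weak learner is a depth-1 tree) into a stronger classifier; the data, number of rounds, and learning rate are placeholders.

```python
# Minimal sketch (assumed setup, not from the paper): AdaBoost over decision
# stumps, reweighting misclassified samples at each boosting round.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```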