Table 2. Benefits and limitations of feature selection methods.
Category | Method | Benefits | Limitations
---|---|---|---
Univariate filter method | Information gain | Quantifies the relevance of an attribute or feature | Biased towards multi-valued attributes and prone to overfitting
 | Chi-square | Reduces training time and avoids overfitting | Highly sensitive to sample size
 | Fisher's score | Evaluates features individually to reduce the feature set | Does not handle feature redundancy
 | Pearson's correlation coefficient | Simple and fast; measures the linear correlation between features | Only sensitive to linear relationships
 | Variance threshold | Removes features with variance below a given cutoff | Does not consider the relationship with the target variable
Multi-variate filter method | mRMR (minimal redundancy maximum relevance) | Measures the nonlinear relationship between features and the target variable and achieves low error rates | Selected features may be mutually as dissimilar to each other as possible
 | Multi-variate relative discriminative criterion | Effectively determines the contribution of individual features to the underlying dimensions | Not suited to small sample sizes
Linear multi-variate wrapper method | Recursive feature elimination | Retains the high-quality top-N features and removes the weakest features | Computationally expensive; feature correlation is not considered
 | Forward/backward stepwise selection | Computationally efficient greedy optimization | Sometimes impossible to find features with no correlation between them
 | Genetic algorithm | Accommodates data sets with a large number of features and requires no problem-specific knowledge | Stochastic and computationally expensive
Nonlinear multi-variate wrapper methods | Nonlinear kernel multiplicative | De-emphasizes the least useful features by multiplying them with a scaling factor | Complexity of kernel computation and multiplication
 | Relief | Feasible for binary classification, based on nearest-neighbor instance pairs, and noise-tolerant | Does not discriminate between redundant features; not suitable for small training sets
Embedded methods | LASSO | L1 regularization reduces overfitting and can be applied even when there are more features than samples | Selects features arbitrarily when they are highly correlated
 | Ridge regression | L2 regularization is preferred over L1 when features are highly correlated | Does not shrink coefficients to zero, so feature reduction is a challenge
 | Elastic net | Handles highly correlated features better than L1 or L2 alone and is flexible in combining the two penalties | High computational cost
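
To make the families in Table 2 concrete, the sketch below shows how a filter, a wrapper, and an embedded method can be applied with scikit-learn. This is an illustrative example rather than part of the original study; the data set, variance threshold, regularization strength, and number of retained features are assumed purely for demonstration.

```python
# Illustrative sketch (assumed, not from the original study): filter, wrapper,
# and embedded feature selection on a small public data set with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (
    RFE, SelectFromModel, SelectKBest, VarianceThreshold, chi2,
)
from sklearn.linear_model import Lasso, LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # 30 non-negative features, binary target

# Univariate filter: drop near-constant features, then keep the top 10 by chi-square
# (chi-square requires non-negative inputs, which holds for this data set).
X_var = VarianceThreshold(threshold=1e-3).fit_transform(X)
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X_var, y)

# Wrapper: recursive feature elimination around a linear classifier,
# repeatedly discarding the weakest feature until 10 remain.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: L1 (LASSO) regularization drives weak coefficients to exactly zero;
# the surviving features form the selected subset (alpha chosen arbitrarily here).
lasso = SelectFromModel(Lasso(alpha=0.01)).fit(X, y)

print("chi2 kept:", X_chi2.shape[1],
      "| RFE kept:", int(rfe.support_.sum()),
      "| LASSO kept:", int(lasso.get_support().sum()))
```

The three approaches trade off the costs listed in the table: the filter steps are cheapest but ignore feature interactions, the wrapper refits the classifier many times, and the embedded LASSO performs selection as a side effect of fitting a single regularized model.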