Univariate filter methods

| Method | Strengths | Limitations |
|---|---|---|
| Information gain | Measures the relevance of an attribute or feature to the target | Biased towards multi-valued attributes and prone to overfitting |
| Chi-square | Reduces training time and avoids overfitting | Highly sensitive to sample size |
| Fisher's score | Evaluates features individually to reduce the feature set | Does not handle feature redundancy |
| Pearson's correlation coefficient | Simple and fast; measures linear correlation between variables | Only sensitive to linear relationships |
| Variance threshold | Removes features whose variance falls below a given cutoff | Ignores the relationship with the target variable |
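
As a quick illustration, the sketch below exercises several of these univariate filters with scikit-learn. The synthetic data set, the choice of k=10, the min-max scaling used for chi-square, and the use of the ANOVA F-score as a stand-in for Fisher's score are assumptions made for the example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, VarianceThreshold, chi2,
                                        f_classif, mutual_info_classif)
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Variance threshold: drop (near-)constant features; the target is never consulted.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Information gain (mutual information) between each feature and the target.
mi = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)

# Chi-square needs non-negative inputs, hence the min-max scaling here.
chi = SelectKBest(score_func=chi2, k=10).fit(MinMaxScaler().fit_transform(X), y)

# ANOVA F-score, used here in place of Fisher's score.
fisher_like = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Pearson's correlation between each feature and the target.
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

print(mi.get_support(indices=True))
print(np.argsort(-np.abs(pearson))[:10])
```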
|
Multi-variate filter methods

| Method | Strengths | Limitations |
|---|---|---|
| mRMR (minimal redundancy maximum relevance) | Captures the nonlinear relationship between features and the target variable and yields low-error accuracy | Selected features may be mutually as dissimilar to each other as possible |
| Multi-variate relative discriminative criterion | Best determines the contribution of individual features to the underlying dimensions | Not suitable for small sample sizes |
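
mRMR has no implementation in scikit-learn itself, so the sketch below is a simplified greedy variant: mutual information supplies the relevance term and absolute Pearson correlation stands in for the redundancy term. The data set, this scoring simplification, and the value of n_selected are assumptions of the example, not part of the original formulation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def greedy_mrmr(X, y, n_selected=5):
    """Greedy mRMR sketch: maximise relevance to y, minimise redundancy with chosen features."""
    relevance = mutual_info_classif(X, y, random_state=0)   # relevance of each feature
    corr = np.abs(np.corrcoef(X, rowvar=False))              # redundancy proxy (|Pearson|)
    selected = [int(np.argmax(relevance))]                   # start from the most relevant
    while len(selected) < n_selected:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        # Score = relevance minus mean redundancy with already-selected features.
        scores = [relevance[j] - corr[j, selected].mean() for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)
print(greedy_mrmr(X, y))
```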
|
Linear multi-variate wrapper methods

| Method | Strengths | Limitations |
|---|---|---|
| Recursive feature elimination | Keeps the top-N highest-quality features by iteratively removing the weakest ones | Computationally expensive; feature correlation is not considered |
| Forward/backward stepwise selection | Computationally efficient greedy optimization | It is sometimes impossible to find features with no mutual correlation |
| Genetic algorithm | Handles data sets with a large number of features; no domain knowledge of the problem is required | Stochastic in nature and computationally expensive |
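
The first two wrapper schemes are available directly in scikit-learn, as sketched below; the logistic-regression estimator, the target of five features, and the cross-validation setting are illustrative assumptions. Genetic-algorithm-based selection is usually delegated to third-party libraries and is omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
estimator = LogisticRegression(max_iter=1000)

# Recursive feature elimination: repeatedly refit and drop the weakest feature.
rfe = RFE(estimator, n_features_to_select=5).fit(X, y)

# Forward stepwise selection (set direction="backward" for backward elimination).
sfs = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                direction="forward", cv=3).fit(X, y)

print(rfe.get_support(indices=True))
print(sfs.get_support(indices=True))
```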
|
Nonlinear multi-variate wrapper methods

| Method | Strengths | Limitations |
|---|---|---|
| Nonlinear kernel multiplicative | De-emphasizes the least useful features by multiplying them with a scaling factor | High complexity of kernel computation and multiplication |
| Relief | Feasible for binary classification, based on nearest-neighbor instance pairs, and noise-tolerant | Does not discriminate between redundant features; not suitable for small training sets |
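
The kernel multiplicative-update scheme is not sketched here, but basic Relief is compact enough to write out directly. The NumPy sketch below follows the classic binary-classification formulation (nearest hit and nearest miss per sampled instance); the synthetic data, the Manhattan distance, and the iteration count are assumptions of the example.

```python
import numpy as np

def relief(X, y, n_iterations=100, random_state=0):
    """Basic Relief weights for binary classification.

    For each sampled instance, find the nearest hit (same class) and nearest
    miss (other class); decrease weights for features that differ from the hit
    and increase them for features that differ from the miss.
    """
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    span = X.max(axis=0) - X.min(axis=0)          # normalise feature differences
    span[span == 0] = 1.0
    weights = np.zeros(n_features)
    for i in rng.integers(0, n_samples, size=n_iterations):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                           # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))   # nearest same-class neighbor
        miss = np.argmin(np.where(y != y[i], dist, np.inf))  # nearest other-class neighbor
        weights -= np.abs(X[i] - X[hit]) / span / n_iterations
        weights += np.abs(X[i] - X[miss]) / span / n_iterations
    return weights

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # only features 0 and 1 are informative
print(relief(X, y).round(3))
```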
|
Embedded methods

| Method | Strengths | Limitations |
|---|---|---|
| LASSO | L1 regularization reduces overfitting and can be applied even when there are more features than samples | Selects arbitrarily among highly correlated features |
| Ridge regression | L2 regularization is preferred over L1 when features are highly correlated | Coefficients are not shrunk to exactly zero, so reducing the feature set is difficult |
| Elastic net | Better than L1 or L2 alone for highly correlated features; flexible; its optimization problem can be solved efficiently | High computational cost |
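
A minimal sketch of the three embedded methods via scikit-learn's SelectFromModel; the regression data set, the alpha values, the l1_ratio, and the median threshold applied to ridge are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, noise=1.0,
                       random_state=0)

# LASSO (L1): drives many coefficients to exactly zero, selecting features implicitly.
lasso = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)

# Ridge (L2): shrinks coefficients but keeps them non-zero, so an explicit threshold is needed.
ridge = SelectFromModel(Ridge(alpha=1.0), threshold="median").fit(X, y)

# Elastic net: combines L1 and L2; l1_ratio balances sparsity against grouped selection.
enet = SelectFromModel(ElasticNet(alpha=0.1, l1_ratio=0.5)).fit(X, y)

print(lasso.get_support(indices=True))
print(enet.get_support(indices=True))
```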