Table 2. Summary of major data splitting, data transformation and feature selection techniques. Notation: $x_{ij}$ and $\tilde{x}_{ij}$ are the raw and normalized values of variable $j$ in observation $i$, respectively; $\bar{x}_j$ and $s_j$ are the mean and standard deviation of variable $j$ in the available dataset, respectively; $x_j^{\min}$ and $x_j^{\max}$ are the minimum and maximum values of variable $j$ observed in the available dataset.
| Class | Principle | Advantages | Disadvantages | Methods |
|---|---|---|---|---|
| Data splitting | | | | |
| Random Sampling (RS) | The training, validation and test sets are randomly chosen from the population | Minimal bias introduced during sampling; when repeated iteratively, can be used to evaluate model generalization | Does not account for the data distribution; the model may perform poorly if required to extrapolate; not appropriate for small sample sizes | Train/test split; k-fold cross-validation (CV); hold-out cross-validation; nested cross-validation; leave-one-out cross-validation (LOOCV); bootstrapping |
| Data Transformation | | | | |
| z-score normalization | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$ | Accounts for both the mean and the variability of the dataset | Assumes a normal distribution; can over-amplify small differences; increases the impact of measurement error | |
| Pareto scaling | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_j}}$ | Reduction of the effect of large values on model training | Reduction of large variance in the data | |
| Range scaling | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{x_j^{\max} - x_j^{\min}}$ | Transformed features are equally important | Outliers can undermine the correct interpretation of data variation | |
| min-max normalization | $\tilde{x}_{ij} = \frac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$ | Most applicable when the data does not follow a normal distribution | Sensitive to outliers; does not account for data dispersion | |
| Mean centering | $\tilde{x}_{ij} = x_{ij} - \bar{x}_j$ | Mean of all features is zero; can partially alleviate multicollinearity | Does not scale the data; usually applied in combination with other scaling methods | |
| Log transformation | $\tilde{x}_{ij} = \log(x_{ij})$ | Can alleviate heteroskedasticity and bring the data closer to a normal distribution | Can be problematic when values reach the boundaries of the transformation function (e.g., zero or negative values) | |
| Feature Selection | | | | |
| Filter | Features are selected based on their performance in statistical tests | High efficiency; independent of the predictor | No interaction with the predictor; results can be relatively poor | Correlation-based Feature Selection (CFS); Information Gain (IG); minimum Redundancy-Maximum Relevance (mRMR); Hilbert-Schmidt Independence Criterion Lasso (HSIC-Lasso) |
| Wrapper | Feature subsets are evaluated based on the performance of an ML model; composed of a search algorithm and an evaluation algorithm | Results in feature subsets with good performance | Computationally expensive (greedy), as they require multiple model evaluations; biased towards the examined model; exhaustive search and fast evaluation are necessary; danger of overfitting; optimization of the evaluation method may be necessary | Forward selection; backward elimination; exhaustive search |
| Embedded | Feature selection is incorporated into model training | Non-contributing features are usually penalized; less computationally expensive than wrapper methods; robust to overfitting | Selection is biased towards the model in use | Least absolute shrinkage and selection operator (LASSO); Support Vector Machine (SVM); SVM-Recursive Feature Elimination (SVM-RFE); Random Forest (RF) |
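As a practical illustration of the data-splitting techniques summarized in Table 2, the sketch below contrasts a single random train/test split with k-fold cross-validation. It assumes scikit-learn and a synthetic regression dataset; the estimator (`Ridge`), the split ratio and the number of folds are illustrative choices rather than recommendations.

```python
# Minimal data-splitting sketch: random hold-out split vs. k-fold CV.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic dataset standing in for the experimental data (illustrative only).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Random train/test split: fast and nearly unbiased, but a single split can be
# unrepresentative when the sample size is small.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = Ridge().fit(X_train, y_train)
print("hold-out R^2:", model.score(X_test, y_test))

# k-fold cross-validation: each observation is held out exactly once, and
# repeating the fit over the folds gives an estimate of generalization.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv)
print("5-fold CV R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Nested cross-validation and bootstrapping follow the same resampling pattern, adding an inner tuning loop or sampling with replacement, respectively.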
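The transformation formulas in the Data Transformation block of Table 2 reduce to one-liners when applied column-wise with NumPy. The small array `X` below (rows are observations, columns are variables) is only a placeholder.

```python
import numpy as np

# Placeholder data matrix: 3 observations x 2 variables (illustrative values).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

mean = X.mean(axis=0)            # per-variable mean
std = X.std(axis=0, ddof=1)      # per-variable sample standard deviation
x_min = X.min(axis=0)            # per-variable minimum
x_max = X.max(axis=0)            # per-variable maximum

z_score      = (X - mean) / std               # z-score normalization
pareto       = (X - mean) / np.sqrt(std)      # Pareto scaling
range_scaled = (X - mean) / (x_max - x_min)   # range scaling
min_max      = (X - x_min) / (x_max - x_min)  # min-max normalization
centered     = X - mean                       # mean centering
log_scaled   = np.log(X)                      # log transformation (x > 0 only)
```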
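The three feature-selection families in Table 2 (filter, wrapper, embedded) can be sketched with common scikit-learn utilities, assuming scikit-learn is available; the specific scoring function, estimators and number of selected features below are illustrative assumptions, not the only valid choices.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import (
    SelectKBest, SequentialFeatureSelector, mutual_info_regression,
)
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic dataset with 5 informative features out of 20 (illustrative).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: rank features with a model-independent statistic (here mutual
# information, related to Information Gain) and keep the top k.
filter_sel = SelectKBest(mutual_info_regression, k=5).fit(X, y)

# Wrapper: greedy forward selection, refitting the model to score each
# candidate feature subset (computationally more expensive).
wrapper_sel = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward"
).fit(X, y)

# Embedded: LASSO penalizes non-contributing features during training,
# shrinking their coefficients to exactly zero.
embedded_mask = Lasso(alpha=0.1).fit(X, y).coef_ != 0

print("filter:  ", filter_sel.get_support())
print("wrapper: ", wrapper_sel.get_support())
print("embedded:", embedded_mask)
```

The wrapper step is noticeably slower than the other two because every candidate subset triggers a new model fit, mirroring the computational-cost trade-off noted in the table.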