Comput Struct Biotechnol J. 2020 Oct 16;18:3287–3300. doi: 10.1016/j.csbj.2020.10.011

Table 2.

Summary of major data splitting, data transformation and feature selection techniques. Notation: x_ij and x̃_ij are the raw and normalized values of variable i in observation j, respectively; x̄_i and σ_i are the mean and standard deviation of variable i in the available dataset; x_i,min and x_i,max are the minimum and maximum values of variable i observed in the available dataset.

Data splitting

Random Sampling (RS)
  Principle: The training, validation and test sets are chosen at random from the population.
  Advantages: Introduces minimal bias during sampling; when repeated iteratively, it can be used to evaluate model generalization.
  Disadvantages: Does not account for the data distribution; the model may perform poorly if required to extrapolate; not appropriate for small sample sizes.
  Methods: Train/test split; k-fold cross-validation (CV); hold-out cross-validation; nested cross-validation; leave-one-out cross-validation (LOOCV); bootstrapping.
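
The splitting strategies listed above map onto standard library routines. Below is a minimal sketch, assuming a Python environment with scikit-learn and a synthetic placeholder dataset; the model, parameters and data are illustrative, not taken from the article, and nested cross-validation is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold, LeaveOneOut, cross_val_score, train_test_split
)
from sklearn.utils import resample

# Synthetic placeholder data and model (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Random train/test (hold-out) split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("hold-out accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation (here k = 5).
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Leave-one-out cross-validation: one observation held out per fold.
print("LOOCV accuracy:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())

# Bootstrapping: train on a resample drawn with replacement,
# evaluate on the out-of-bag observations.
idx = np.arange(len(y))
boot = resample(idx, replace=True, random_state=0)
oob = np.setdiff1d(idx, boot)
print("bootstrap OOB accuracy:", model.fit(X[boot], y[boot]).score(X[oob], y[oob]))
```
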
Data Transformation

z-score normalization
  Principle: x̃_ij = (x_ij − x̄_i) / σ_i
  Advantages: Accounts for both the mean and the variability of the dataset.
  Disadvantages: Assumes a normal distribution; can over-amplify small differences; increases the impact of measurement error.

Pareto scaling
  Principle: x̃_ij = (x_ij − x̄_i) / √σ_i
  Advantages: Reduces the effect of large values on model training.
  Disadvantages: Reduces large variance in the data.

Range scaling
  Principle: x̃_ij = (x_ij − x̄_i) / (x_i,max − x_i,min)
  Advantages: Transformed features are equally important.
  Disadvantages: Outliers can undermine the correct interpretation of data variation.

min-max normalization
  Principle: x̃_ij = (x_ij − x_i,min) / (x_i,max − x_i,min)
  Advantages: Most applicable when the data do not follow a normal distribution.
  Disadvantages: Sensitive to outliers; does not account for the data dispersion.

Mean centering
  Principle: x̃_ij = x_ij − x̄_i
  Advantages: The mean of all features is zero; can partially alleviate multicollinearity.
  Disadvantages: Does not scale the data; usually applied in combination with other scaling methods.

Log transformation
  Principle: x̃_ij = log10(x_ij)
  Advantages: Can alleviate heteroskedasticity and impose an approximately normal distribution.
  Disadvantages: Can be problematic when values approach the boundaries of the transformation function (e.g., zero or negative values).
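
The transformation formulas above can be applied column-wise in a few lines of NumPy. This is a minimal sketch on a toy matrix; the array and variable names are illustrative, and Pareto scaling is written with the conventional square-root denominator.

```python
import numpy as np

# Toy matrix: rows are observations j, columns are variables i (illustrative).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 900.0]])

mean, std = X.mean(axis=0), X.std(axis=0)
xmin, xmax = X.min(axis=0), X.max(axis=0)

z_score     = (X - mean) / std                 # z-score normalization
pareto      = (X - mean) / np.sqrt(std)        # Pareto scaling
range_scale = (X - mean) / (xmax - xmin)       # range scaling
min_max     = (X - xmin) / (xmax - xmin)       # min-max normalization
centered    = X - mean                         # mean centering
log10       = np.log10(X)                      # log transformation (x > 0 only)

print(np.round(z_score, 2))
print(np.round(min_max, 2))
```
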
Feature Selection

Filter
  Principle: Features are selected based on their performance in statistical algorithms.
  Advantages: High efficiency; independent of the predictor.
  Disadvantages: Do not interact with the predictor; results can be relatively poor.
  Methods: Correlation-based Feature Selection (CFS); Information Gain (IG); minimum Redundancy-Maximum Relevance (mRMR); Hilbert-Schmidt Independence Criterion Lasso (HSIC-Lasso).
Wrapper
  Principle: Feature subsets are evaluated based on ML model performance; composed of a search algorithm and an evaluation algorithm.
  Advantages: Results in feature subsets with good performance.
  Disadvantages: Computationally expensive (greedy), as they require multiple model runs; biased towards the examined model; exhaustive search and fast evaluation are necessary; danger of overfitting; optimization of the evaluation method may be necessary.
  Methods: Forward selection; backward elimination; exhaustive search.

Embedded
  Principle: Feature selection is incorporated into model training.
  Advantages: Non-contributing features are usually penalized; less computationally expensive than wrapper methods; robust to overfitting.
  Disadvantages: Selection is biased towards the model in use.
  Methods: Least absolute shrinkage and selection operator (LASSO); Support Vector Machine (SVM); SVM-Recursive Feature Elimination (SVM-RFE); Random Forest (RF).
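
To make the three families concrete, the sketch below runs one representative method from each class with scikit-learn on synthetic data: a mutual-information filter (closely related to Information Gain), a forward-selection wrapper, and LASSO plus SVM-RFE from the embedded column. The estimators, parameters and data are illustrative assumptions, not the article's pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import (
    RFE, SelectKBest, SequentialFeatureSelector, mutual_info_regression
)
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.svm import SVR

# Synthetic data: 20 features, only 5 of which carry signal (illustrative).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# Filter: rank features with a model-independent statistic (mutual information)
# and keep the top k.
filt = SelectKBest(score_func=mutual_info_regression, k=5).fit(X, y)
print("filter:", np.flatnonzero(filt.get_support()))

# Wrapper: forward selection, refitting a model to score each candidate subset.
wrap = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                 direction="forward").fit(X, y)
print("wrapper:", np.flatnonzero(wrap.get_support()))

# Embedded: LASSO selects during training by shrinking coefficients to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print("embedded (LASSO):", np.flatnonzero(lasso.coef_ != 0))

# Embedded: SVM-RFE, recursive feature elimination around a linear SVM.
rfe = RFE(SVR(kernel="linear"), n_features_to_select=5).fit(X, y)
print("embedded (SVM-RFE):", np.flatnonzero(rfe.support_))
```
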