Table 2. Summary of major data splitting, data transformation and feature selection techniques. Notation: $x_{ij}$ and $\tilde{x}_{ij}$ are the raw and normalized values of variable $j$ in observation $i$, respectively; $\bar{x}_j$ and $s_j$ are the mean and standard deviation of variable $j$ in the available dataset, respectively; $x_j^{\min}$ and $x_j^{\max}$ are the minimum and maximum values of variable $j$ observed in the available dataset.
| Class | Principle | Advantages | Disadvantages | Methods |
|---|---|---|---|---|
| Data splitting | | | | |
| Random Sampling (RS) | The training, validation and test sets are randomly chosen from the population | Minimal bias introduced during sampling; when repeated iteratively, can be used to evaluate model generalization | Does not account for the data distribution; the model may perform poorly if required to extrapolate; not appropriate for small sample sizes | Train/test split; k-fold cross-validation (CV); hold-out cross-validation; nested cross-validation; leave-one-out cross-validation (LOOCV); bootstrapping |
| Data Transformation | | | | |
| z-score normalization | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$ | Accounts for both the mean and the variability of the dataset | Assumes a normal distribution; can over-amplify small differences; increases the impact of measurement error | |
| Pareto scaling | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_j}}$ | Reduction of the effect of large values on model training | Reduction of large variance in the data | |
| Range scaling | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{x_j^{\max} - x_j^{\min}}$ | Transformed features are equally important | Outliers can undermine the correct interpretation of data variation | |
| min-max normalization | $\tilde{x}_{ij} = \frac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$ | Most applicable when the data does not follow a normal distribution | Sensitive to outliers; does not account for data dispersion | |
| Mean centering | $\tilde{x}_{ij} = x_{ij} - \bar{x}_j$ | Mean of all features is zero; can partially alleviate multicollinearity | Does not scale the data; usually applied in combination with other scaling methods | |
| Log transformation | $\tilde{x}_{ij} = \log(x_{ij})$ | Can alleviate heteroskedasticity and bring the data closer to a normal distribution | Can be problematic when values reach the boundaries of the transformation function (e.g., zero or negative values) | |
| Feature Selection | | | | |
| Filter | Features are selected based on their performance in statistical tests | High efficiency; independent of the predictor | No interaction with the predictor; results can be relatively poor | Correlation-based Feature Selection (CFS); Information Gain (IG); minimum Redundancy-Maximum Relevance (mRMR); Hilbert-Schmidt Independence Criterion Lasso (HSIC-Lasso) |
| Wrapper | Feature subsets are evaluated based on the performance of an ML model; composed of a search algorithm and an evaluation algorithm | Results in feature subsets with good performance | Computationally expensive (greedy), as they require multiple model evaluations; biased towards the examined model; exhaustive search and fast evaluation are necessary; danger of overfitting; optimization of the evaluation method may be necessary | Forward selection; backward elimination; exhaustive search |
| Embedded | Feature selection is incorporated into model training | Non-contributing features are usually penalized; less computationally expensive than wrapper methods; robust to overfitting | Selection is biased towards the model in use | Least absolute shrinkage and selection operator (LASSO); Support Vector Machine (SVM); SVM-Recursive Feature Elimination (SVM-RFE); Random Forest (RF) |
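As a practical illustration of the data-splitting techniques summarized in Table 2, the sketch below contrasts a single random train/test split with k-fold cross-validation. It assumes scikit-learn and a synthetic regression dataset; the estimator (`Ridge`), the split ratio and the number of folds are illustrative choices rather than recommendations.

```python
# Minimal data-splitting sketch: random hold-out split vs. k-fold CV.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic dataset standing in for the experimental data (illustrative only).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Random train/test split: fast and nearly unbiased, but a single split can be
# unrepresentative when the sample size is small.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = Ridge().fit(X_train, y_train)
print("hold-out R^2:", model.score(X_test, y_test))

# k-fold cross-validation: each observation is held out exactly once, and
# repeating the fit over the folds gives an estimate of generalization.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv)
print("5-fold CV R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Nested cross-validation and bootstrapping follow the same resampling pattern, adding an inner tuning loop or sampling with replacement, respectively.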
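The transformation formulas in the Data Transformation block of Table 2 reduce to one-liners when applied column-wise with NumPy. The small array `X` below (rows are observations, columns are variables) is only a placeholder.

```python
import numpy as np

# Placeholder data matrix: 3 observations x 2 variables (illustrative values).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

mean = X.mean(axis=0)            # per-variable mean
std = X.std(axis=0, ddof=1)      # per-variable sample standard deviation
x_min = X.min(axis=0)            # per-variable minimum
x_max = X.max(axis=0)            # per-variable maximum

z_score      = (X - mean) / std               # z-score normalization
pareto       = (X - mean) / np.sqrt(std)      # Pareto scaling
range_scaled = (X - mean) / (x_max - x_min)   # range scaling
min_max      = (X - x_min) / (x_max - x_min)  # min-max normalization
centered     = X - mean                       # mean centering
log_scaled   = np.log(X)                      # log transformation (x > 0 only)
```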
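The three feature-selection families in Table 2 (filter, wrapper, embedded) can be sketched with common scikit-learn utilities, assuming scikit-learn is available; the specific scoring function, estimators and number of selected features below are illustrative assumptions, not the only valid choices.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import (
    SelectKBest, SequentialFeatureSelector, mutual_info_regression,
)
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic dataset with 5 informative features out of 20 (illustrative).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: rank features with a model-independent statistic (here mutual
# information, related to Information Gain) and keep the top k.
filter_sel = SelectKBest(mutual_info_regression, k=5).fit(X, y)

# Wrapper: greedy forward selection, refitting the model to score each
# candidate feature subset (computationally more expensive).
wrapper_sel = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward"
).fit(X, y)

# Embedded: LASSO penalizes non-contributing features during training,
# shrinking their coefficients to exactly zero.
embedded_mask = Lasso(alpha=0.1).fit(X, y).coef_ != 0

print("filter:  ", filter_sel.get_support())
print("wrapper: ", wrapper_sel.get_support())
print("embedded:", embedded_mask)
```

The wrapper step is noticeably slower than the other two because every candidate subset triggers a new model fit, mirroring the computational-cost trade-off noted in the table.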