
Table 4. Data analysis characteristics

| Authors | Train, Validate and Test Strategy | Data Pre-processing | Feature Selection / Dimensionality Reduction | Machine Learning Classification Methods | Deficits of ML Analysis |
| --- | --- | --- | --- | --- | --- |
| Ayala et al. | Threefold stratified cross-validation for comparison of 68 algorithms | Data imputation (missing data replaced by the mean values of players in the same division); data discretization | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| Carey et al. | Split into training (2014 and 2015 data) and test (2016 data) datasets; hyperparameter tuning via tenfold cross-validation; each analysis repeated 50 times | NR | Principal Component Analysis | Decision tree ensembles (Random Forests) and Support Vector Machines; adjusted for imbalance via undersampling and synthetic minority oversampling | Dependency between training and test dataset |
| López-Valenciano et al. | Fivefold stratified cross-validation for comparison of 68 algorithms | Data imputation (missing data replaced by the mean values of players in the same division); data discretization using literature and Weka software | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling, random oversampling and random undersampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| McCullagh et al. | Tenfold cross-validation for testing | NR | No | Artificial Neural Networks with backpropagation | Dependency between training and test dataset |
| Oliver et al. | Fivefold cross-validation for comparison of 57 models | Data discretization using literature and Weka software | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling, random oversampling and random undersampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| Rodas et al. | Outer fivefold cross-validation for model testing; inner tenfold cross-validation for hyperparameter tuning | Synthetic variant imputation | Least Absolute Shrinkage and Selection Operator (LASSO) | Decision tree ensembles (Random Forests) and Support Vector Machines | |
| Rommers et al. | Split into training (80%) and test (20%) datasets; cross-validation for hyperparameter tuning | NR | No | Decision tree ensembles; aggregated using boosting methods | |
| Rossi et al. | Split into dataset 1 (30%) for feature elimination and dataset 2 (70%) for training and testing; stratified twofold cross-validation on dataset 2; repeated 10,000 times | NR | Recursive Feature Elimination with Cross-Validation | Decision tree ensembles; adjusted for imbalance via adaptive synthetic sampling; aggregated using Random Forests | Dependency between training and test dataset |
| Ruddy et al. | Between-year approach: training on 2013 data, testing on 2015 data; within-year approach: split into training (70%) and test (30%) datasets; both approaches: tenfold cross-validation for hyperparameter tuning, each analysis repeated 10,000 times | Data standardization | No | Single decision tree, decision tree ensembles (Random Forests), Artificial Neural Networks and Support Vector Machines; adjusted for imbalance via synthetic minority oversampling | Standardization performed independently in training and test datasets |
| Thornton et al. | Split into training (70%), validation (15%) and test (15%) datasets | NR | No | Decision tree ensembles; aggregated using Random Forests | |
| Whiteside et al. | Fivefold cross-validation for comparison of models | NR | Brute-force feature selection (every possible subset of features tested) | Support Vector Machines | |
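
The most frequent entries in the "Deficits of ML Analysis" column are forms of information leakage: discretization or standardization applied before the data were split, and dependency between training and test datasets. The sketch below illustrates, under stated assumptions, how such leakage can be avoided by keeping preprocessing, oversampling and hyperparameter tuning inside the training folds of a nested cross-validation (outer folds for testing, inner folds for tuning, the strategy described for Rodas et al.). It uses scikit-learn and imbalanced-learn on hypothetical data and is not the code of any reviewed study.

```python
# Illustrative sketch only: preprocessing, oversampling and tuning are kept inside
# the training folds so that no information from a test fold leaks into model
# fitting. Hypothetical data; not the pipeline used by any of the reviewed studies.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn pipeline lets SMOTE run on training folds only

# Hypothetical imbalanced injury dataset (assumption: ~10% "injured" class).
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=42)

# All preprocessing lives inside the pipeline, so the scaler and the oversampler
# are re-fitted on each training fold and never see the corresponding test fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Inner CV: hyperparameter tuning (analogous to the inner tenfold CV of Rodas et al.).
param_grid = {"clf__n_estimators": [100, 300], "clf__max_depth": [3, None]}
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner_cv)

# Outer CV: model testing on folds never used for tuning (outer fivefold CV).
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"Nested CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because scaling and oversampling sit inside the pipeline, they are estimated from each training fold alone, which avoids the "discretization before data splitting", "dependency between training and test dataset" and independent-standardization deficits listed in the table.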
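Several studies adjusted for class imbalance via synthetic minority oversampling (SMOTE), random oversampling, random undersampling or adaptive synthetic sampling (ADASYN). The minimal sketch below applies these resamplers to the training split only, using the imbalanced-learn package as one possible implementation (an assumption; the studies name the techniques, not their software).

```python
# Minimal sketch of the imbalance-handling techniques named in the table, using
# imbalanced-learn as one possible implementation. Hypothetical data only.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Hypothetical imbalanced dataset (~10% minority "injured" class).
X, y = make_classification(n_samples=1000, n_features=15, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

resamplers = {
    "Synthetic minority oversampling (SMOTE)": SMOTE(random_state=0),
    "Random oversampling": RandomOverSampler(random_state=0),
    "Random undersampling": RandomUnderSampler(random_state=0),
    "Adaptive synthetic sampling (ADASYN)": ADASYN(random_state=0),
}

# Resampling is applied to the training split only; the test split keeps its
# natural class distribution so that evaluation remains unbiased.
for name, sampler in resamplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(f"{name}: {Counter(y_res)}")
```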