Table 4.
| Authors | Train, Validate and Test Strategy | Data Pre-processing | Feature Selection / Dimensionality Reduction | Machine Learning Classification Methods | Deficits of ML Analysis |
|---|---|---|---|---|---|
| Ayala et al. | Threefold stratified cross-validation for comparison of 68 algorithms | Data imputation: missing data replaced by the mean values of the players in the same division; data discretization | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| Carey et al. | Split into training dataset (2014 and 2015 data) and test dataset (2016 data); hyperparameter tuning via tenfold cross-validation; each analysis repeated 50 times | NR | Principal Component Analysis | Decision tree ensembles (Random Forests) and Support Vector Machines; adjusted for imbalance via undersampling and synthetic minority oversampling | Dependency between training and test dataset |
| López-Valenciano et al. | Fivefold stratified cross-validation for comparison of 68 algorithms | Data imputation: missing data replaced by the mean values of the players in the same division; data discretization using literature and Weka software | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling, random oversampling, and random undersampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| McCullagh et al. | Tenfold cross-validation for testing | NR | No | Artificial Neural Networks with backpropagation | Dependency between training and test dataset |
| Oliver et al. | Fivefold cross-validation for comparison of 57 models | Data discretization using literature and Weka software | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling, random oversampling, and random undersampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| Rodas et al. | Outer fivefold cross-validation for model testing; inner tenfold cross-validation for hyperparameter tuning | Synthetic variant imputation | Least Absolute Shrinkage and Selection Operator (LASSO) | Decision tree ensembles (Random Forests) and Support Vector Machines | |
| Rommers et al. | Split into training (80%) and test (20%) datasets; cross-validation for hyperparameter tuning | NR | No | Decision tree ensembles; aggregated using boosting methods | |
| Rossi et al. | Split into dataset 1 (30%) for feature elimination and dataset 2 (70%) for training and testing; stratified twofold cross-validation on dataset 2; repeated 10,000 times | NR | Recursive Feature Elimination with Cross-Validation | Decision tree ensembles; adjusted for imbalance via adaptive synthetic sampling; aggregated using Random Forests | Dependency between training and test dataset |
| Ruddy et al. | Between-year approach: split into training (2013 data) and test (2015 data) datasets; within-year approach: split into training (70%) and test (30%) datasets; both approaches: tenfold cross-validation for hyperparameter tuning, each analysis repeated 10,000 times | Data standardization | No | Single decision tree, decision tree ensembles (Random Forests), Artificial Neural Networks, and Support Vector Machines; adjusted for imbalance via synthetic minority oversampling | Standardization performed independently in training and test datasets |
| Thornton et al. | Split into training (70%), validation (15%), and test (15%) datasets | NR | No | Decision tree ensembles; aggregated using Random Forests | |
| Whiteside et al. | Fivefold cross-validation for comparison of models | NR | Brute-force feature selection: every possible subset of features is tested | Support Vector Machines | |

NR = not reported.
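The most frequent deficit in the table, "discretization before data splitting", is a form of data leakage: bin edges computed on the full dataset let the test set shape the transform that is later evaluated on it. A minimal sketch of the distinction, using hypothetical toy data and a simple equal-frequency binner (not the Weka procedure the studies used):

```python
# Sketch: why discretizing before the train/test split leaks information.
# Toy data and a simple equal-frequency binner, for illustration only.

def fit_bins(values, n_bins=3):
    """Learn equal-frequency bin edges from the given values."""
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

def discretize(value, edges):
    """Map a value to a bin index using previously learned edges."""
    return sum(value >= e for e in edges)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
train, test = data[:6], data[6:]

# Flawed: edges computed on ALL data, so the test set influences the transform.
leaky_edges = fit_bins(data)    # [4.0, 7.0]

# Correct: edges learned on the training data only, then applied to the test set.
clean_edges = fit_bins(train)   # [3.0, 5.0]
test_binned = [discretize(v, clean_edges) for v in test]
```

The two sets of edges differ, which is exactly the point: any statistic fitted before the split encodes information about the held-out data.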
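The nested scheme used by Rodas et al. (an outer loop for unbiased testing, an inner loop for hyperparameter tuning) can be sketched as index bookkeeping: each outer test fold is excluded from every inner tuning split, so tuning never sees the data used to score the final model. The fold counts and contiguous splitting below are illustrative simplifications, not the authors' exact procedure:

```python
# Sketch of nested cross-validation: the outer loop estimates test
# performance, the inner loop tunes hyperparameters on the outer
# training data only. Contiguous folds and small k values for brevity.

def k_folds(indices, k):
    """Split a list of indices into k contiguous folds."""
    n = len(indices)
    return [indices[n * i // k : n * (i + 1) // k] for i in range(k)]

def nested_cv(indices, outer_k, inner_k):
    """Yield (outer_train, outer_test, inner_splits) triples."""
    for test_fold in k_folds(indices, outer_k):
        train = [i for i in indices if i not in test_fold]
        inner = [([i for i in train if i not in val], val)
                 for val in k_folds(train, inner_k)]
        yield train, test_fold, inner

splits = list(nested_cv(list(range(10)), outer_k=5, inner_k=3))
```

Each of the five outer iterations holds out two samples for testing and hands the remaining eight to the inner threefold loop for tuning.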
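Most studies in the table address class imbalance with synthetic minority oversampling, which generates new minority-class samples by interpolating between existing ones. A heavily simplified one-dimensional sketch of that idea follows; real SMOTE interpolates between k-nearest minority neighbours in feature space, which is omitted here:

```python
import random

# Simplified SMOTE-style oversampling: synthesize new minority samples
# by interpolating between random pairs of existing minority samples.
# 1-D toy data; no k-NN neighbour search as in the full algorithm.

def oversample(minority, n_new, rng):
    """Return n_new synthetic samples interpolated between minority pairs."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # pick two distinct minority samples
        t = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(a + t * (b - a))
    return synthetic

rng = random.Random(0)
minority = [1.0, 1.2, 1.5, 2.0]          # hypothetical injured-player feature values
synthetic = oversample(minority, 4, rng)
```

Because each synthetic point lies on a segment between two real minority samples, it always falls within the minority class's observed range, unlike naive random oversampling, which only duplicates existing points.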