Table 3.
Summary of predictive performance of the best ML models for classification tasks.
| Phenotype | Dataset | Model | F1 score per class (test) | F1 score (test) | F1 score (train) | Precision per class (test) | Recall per class (test) | Mean F1 score (CV) | Std F1 score (CV) |
|---|---|---|---|---|---|---|---|---|---|
| Menopausal status | 62 first samples | LightGBM | [0.86, 0.95] | 0.92 | 0.98 | [1.0, 0.9] | [0.75, 1.0] | 0.93 | 0.06 |
| Menopausal status | 1200 samples blocked by individual | XGBoost | [0.89, 0.75] | 0.85 | 1.0 | [0.89, 0.75] | [0.89, 0.75] | 0.82 | 0.07 |
| Smoking status | 62 first samples | XGBoost | [0.89, 0.75] | 0.85 | 0.98 | [0.89, 0.75] | [0.89, 0.75] | 0.72 | 0.12 |
| Smoking status | 1200 samples blocked by individual | LightGBM | [0.88, 0.10] | 0.74 | 1.0 | [0.82, 0.21] | [0.94, 0.07] | 0.93 | 0.08 |
Three ML models (RF, LightGBM, XGBoost) were evaluated on different subsets of the Canada cohort: 62 samples corresponding to the first time point of each subject, and 1200 time-series samples blocked by individual. When applied to the time-series samples, the ML models were tuned and trained blocking by individual, i.e., samples from the same subject are never present in both the training and test datasets. The table reports the per-class F1 score, precision, and recall computed on the test dataset, together with the weighted-average F1 score on the test dataset, on the training dataset, and in cross-validation (mean and standard deviation). Only the scores of the best fine-tuned model per dataset and phenotype are shown here; Supplementary Table 2 reports the full list.
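A minimal sketch of the blocking-by-individual evaluation described above, assuming a scikit-learn/LightGBM workflow with hypothetical placeholder arrays `X`, `y`, and `subject_ids` (the actual feature matrix, phenotype labels, and subject identifiers are not shown in the source):

```python
# Sketch: group-blocked cross-validation so that samples from the same subject
# never appear in both the training and test folds. Data below are synthetic
# placeholders, not the cohort data.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
n_samples, n_features = 1200, 20
X = rng.normal(size=(n_samples, n_features))        # placeholder feature matrix
y = rng.integers(0, 2, size=n_samples)              # binary phenotype label
subject_ids = rng.integers(0, 62, size=n_samples)   # one id per individual

model = LGBMClassifier(n_estimators=200, learning_rate=0.05)

# GroupKFold keeps all samples of a given subject within a single fold,
# which is what "blocked by individual" requires.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=subject_ids,
                         scoring="f1_weighted", cv=cv)
print(f"weighted F1 (CV): mean={scores.mean():.2f}, std={scores.std():.2f}")
```

The `f1_weighted` scorer corresponds to the weighted-average F1 score reported in the table; hyperparameter values here are illustrative only.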