Sci Rep. 2021 Feb 25;11:4565. doi: 10.1038/s41598-021-83922-6

Table 3.

Summary of predictive performance of the best ML models for classification tasks.

| Phenotype | Dataset | Model | F1 score per class (test) | F1 score (test) | F1 score (train) | Precision per class (test) | Recall per class (test) | Avg F1-score (CV) | Std F1-score (CV) |
|---|---|---|---|---|---|---|---|---|---|
| Menopausal status | 62 first samples | LightGBM | [0.86, 0.95] | 0.92 | 0.98 | [1.0, 0.9] | [0.75, 1.0] | 0.93 | 0.06 |
| Menopausal status | 1200 samples blocked by individual | XGBoost | [0.89, 0.75] | 0.85 | 1.0 | [0.89, 0.75] | [0.89, 0.75] | 0.82 | 0.07 |
| Smoking status | 62 first samples | XGBoost | [0.89, 0.75] | 0.85 | 0.98 | [0.89, 0.75] | [0.89, 0.75] | 0.72 | 0.12 |
| Smoking status | 1200 samples blocked by individual | LightGBM | [0.88, 0.10] | 0.74 | 1.0 | [0.82, 0.21] | [0.94, 0.07] | 0.93 | 0.08 |

Three ML models (RF, LightGBM, XGBoost) were evaluated on different subsets of the Canada cohort: the 62 samples taken at each subject's first time point, and the 1200 time-series samples blocked by individual. When applied to the time-series samples, the ML models were tuned and trained blocking by individual, i.e., samples from the same subject never appear in both the training and test datasets. The table reports the per-class F1-score, precision, and recall computed on the test dataset, together with the weighted-average F1-score on the test dataset, on the training dataset, and in cross-validation. Only the performance scores of the best fine-tuned model per dataset and phenotype are shown; Supplementary Table 2 lists the full set of models.
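As an illustration of this subject-blocked evaluation scheme, the sketch below (not the authors' code) uses scikit-learn's GroupKFold so that all samples from one individual stay in a single fold, then reports per-class and weighted-average scores of the kind listed in Table 3. The feature matrix, labels, subject IDs, and hyperparameters are placeholders; LGBMClassifier stands in for whichever tuned model (RF, LightGBM, or XGBoost) was selected per dataset and phenotype.

```python
# Minimal sketch of subject-blocked cross-validation with per-class metrics.
# Assumptions: X, y, and `groups` are placeholders; LGBMClassifier is one of
# several candidate models and is used here without the paper's tuned settings.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score, precision_score, recall_score
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.random((1200, 50))                 # placeholder features (e.g., taxon abundances)
y = rng.integers(0, 2, 1200)               # placeholder binary phenotype labels
groups = rng.integers(0, 60, 1200)         # placeholder subject IDs (blocking variable)

cv = GroupKFold(n_splits=5)
fold_f1 = []
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # Blocking by individual: no subject contributes to both train and test.
    model = LGBMClassifier()
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])

    # Per-class scores on the held-out fold, as reported in Table 3.
    print("F1 per class:       ", f1_score(y[test_idx], y_pred, average=None))
    print("Precision per class:", precision_score(y[test_idx], y_pred, average=None))
    print("Recall per class:   ", recall_score(y[test_idx], y_pred, average=None))

    # Weighted-average F1 per fold, aggregated into the CV mean and std.
    fold_f1.append(f1_score(y[test_idx], y_pred, average="weighted"))

print("CV weighted F1: %.2f +/- %.2f" % (np.mean(fold_f1), np.std(fold_f1)))
```

The key design point is passing the subject IDs as `groups` to the splitter: an ordinary (unblocked) split would let repeated samples from the same individual leak between training and test sets and inflate the reported scores.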