MAbs. 2023 Aug 23;15(1):2248671. doi: 10.1080/19420862.2023.2248671

Figure 1.

Four-part image. On the left are the datasets used in this study, together with structure generation and the computed features. In the middle left, the feature selection module shows cycles of XGBoost leading to feature selection, which feeds into the machine learning module at the middle right, where PyCaret generates ~250K models. On the right is a flowchart of the machine learning models tested on the validation set and the final prediction on the test data.

Schematic of the machine learning workflow in this study.

For each bioassay endpoint in the Datasets category, the data are split into training, validation, and test sets, ensuring low sequence identity between splits and representative assay variation in the validation set. Fv sequences are modeled using ABodyBuilder2, and features are generated using three popular software packages. Individual or multiple feature sets are trained with XGBoost regression models using 32 grouped splits and reduced individually to X features (X is a hyperparameter ranging from 5 to 150). Multiple reduced feature sets are then combined, resubmitted, and reduced again to X features. The top features selected by XGBoost are submitted with the training set to a PyCaret workflow used to train 19 regression models. For each X value, the top five models are then tested for prediction and scored on the validation set over ten random seeds. Across 5,000 cycles, ~250K models are evaluated and ranked on the validation data. The top-performing model on the validation data is confirmed by prediction on the test set.
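The sketch below illustrates the two central steps of this workflow, importance-based feature reduction over grouped splits followed by a PyCaret model sweep, using the scikit-learn, XGBoost, and PyCaret 3.x APIs. The synthetic DataFrame, the column names (endpoint, cluster_id, feat_*), the 80/20 grouped split, and the XGBoost settings are illustrative assumptions rather than the authors' code; only the overall pattern (reduction to X features over 32 grouped splits, then compare_models over PyCaret's regressor library, scored on held-out data) mirrors the figure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from xgboost import XGBRegressor
from pycaret.regression import setup, compare_models, predict_model

# Synthetic stand-in data: 200 antibodies, 20 computed features, random
# sequence-cluster groups, and a continuous bioassay endpoint.
rng = np.random.default_rng(0)
feature_cols = [f"feat_{i}" for i in range(20)]
train_df = pd.DataFrame(rng.normal(size=(200, 20)), columns=feature_cols)
train_df["endpoint"] = rng.normal(size=200)
train_df["cluster_id"] = rng.integers(0, 40, size=200)

def reduce_features(df, feature_cols, target="endpoint", group_col="cluster_id",
                    n_keep=10, n_splits=32, seed=0):
    """Rank features by mean XGBoost importance over grouped splits and
    keep the top n_keep (the 'X' hyperparameter in the figure)."""
    importances = np.zeros(len(feature_cols))
    splitter = GroupShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    for train_idx, _ in splitter.split(df, groups=df[group_col]):
        fold = df.iloc[train_idx]
        model = XGBRegressor(n_estimators=200, random_state=seed)
        model.fit(fold[feature_cols], fold[target])
        importances += model.feature_importances_
    order = np.argsort(importances)[::-1]
    return [feature_cols[i] for i in order[:n_keep]]

# Reduce the feature set, then hand the reduced table to a PyCaret sweep over
# its built-in regressors, keeping the five best by cross-validated score.
selected = reduce_features(train_df, feature_cols, n_keep=10)
setup(data=train_df[selected + ["endpoint"]], target="endpoint", session_id=42)
top_models = compare_models(n_select=5)

# Score each candidate on a held-out validation table (here reusing synthetic
# data purely for illustration) before confirming the final model on the test set.
val_df = train_df.sample(40, random_state=1)
for model in top_models:
    preds = predict_model(model, data=val_df[selected + ["endpoint"]])
```

In the published workflow this loop is repeated for each X value and random seed, which is what produces the ~250K candidate models ranked on the validation data.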