Figure 2.
Performance gain across cell lines for each modeling choice introduced during the exploratory analysis of FG data. Each boxplot represents the distribution of the cell line models' test set performances (Rp) at a given step. Analysis steps are carried out sequentially: (I) RF with 1,000 trees and all n features tried to split a node, 80% training set, 20% test set, MACCS (Molecular ACCess System) keys as features; (II) MFPC (Morgan fingerprint counts) are used as features instead; (III) physico-chemical features are added for each drug; (IV) training set rows are duplicated with the reverse order of drugs (data augmentation); (V) a 90% training set and a 10% test set are used instead of the initial 80/20 partition; (VI) RF with 250 trees and n/3 features tried to split a node; (VII) XGB models with recommended settings; (VIII) tuned XGB models. Note that steps I–V employ RF with the same hyperparameter values (RF is tuned in step VI) and that steps V–VIII use the same training and test sets. The modeling choices that introduce the largest improvements are the choice of molecular features and the data augmentation strategies.
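The augmentation in step IV can be illustrated with a minimal sketch. It assumes each training row holds a drug pair and a measured response. The field names `drug_a`, `drug_b`, and `response`, the function name, and the example values are hypothetical and are not taken from the study's code or data.

```python
from typing import Dict, List

Row = Dict[str, object]


def augment_with_reversed_pairs(rows: List[Row]) -> List[Row]:
    """Return the original rows plus a copy of each row with the two drugs swapped.

    The response is left unchanged, so the model sees each combination
    in both drug orders (hypothetical field names, for illustration only).
    """
    augmented = list(rows)
    for row in rows:
        swapped = dict(row)  # shallow copy so the original row is not modified
        swapped["drug_a"], swapped["drug_b"] = row["drug_b"], row["drug_a"]
        augmented.append(swapped)
    return augmented


# Illustrative values only; these are not data from the study.
train = [
    {"drug_a": "D1", "drug_b": "D2", "response": 0.5},
    {"drug_a": "D3", "drug_b": "D4", "response": -1.2},
]
augmented_train = augment_with_reversed_pairs(train)
```

Consistent with the caption, only training rows are duplicated in this sketch. In a real pipeline the per-drug feature blocks, rather than drug identifiers, would be swapped.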
