Figure - PMC

. 2019 Jul 16;7:509. doi: 10.3389/fchem.2019.00509

Copyright © 2019 Sidorov, Naulaerts, Ariey-Bonnet, Pasquier and Ballester.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Performance gain across cell lines for each modeling choice introduced during the exploratory analysis of FG data. Each boxplot represents the distribution of the cell line models' test set performances (R_p) at a given step. The analysis steps are carried out sequentially: (I) RF with 1,000 trees and all n features tried at each node split, 80% training set, 20% test set, MACCS (Molecular ACCess System) keys as features; (II) MFPC (Morgan fingerprint counts) are used as features instead; (III) physico-chemical features are added for each drug; (IV) training set rows are duplicated with the drugs in reverse order (data augmentation); (V) a 90% training set and a 10% test set are used instead of the initial 80/20 partition; (VI) RF with 250 trees and n/3 features tried at each node split; (VII) XGB models with recommended settings; (VIII) tuned XGB models. Note that steps I–V employ RF with the same hyperparameter values (RF is tuned in VI), and steps V–VIII use the same training and test sets. The modeling choices that introduce the largest improvements are the choice of molecular features and the data augmentation strategies.
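
The data augmentation in step IV can be illustrated with a minimal sketch. This is not the authors' code: the row layout (features of the first drug, features of the second drug, measured response), the function names `augment_with_reversed_pairs` and `to_feature_vector`, and the toy values are all assumptions made for illustration. The idea is that a drug pair (A, B) and its reverse (B, A) describe the same combination, so each training row is copied with the two drugs swapped and the response left unchanged.

```python
from typing import List, Sequence, Tuple

# Hypothetical row layout: (features of drug A, features of drug B, response).
Row = Tuple[Tuple[float, ...], Tuple[float, ...], float]


def augment_with_reversed_pairs(rows: Sequence[Row]) -> List[Row]:
    """Return the original rows followed by copies with the two drugs swapped."""
    augmented = list(rows)
    for drug_a, drug_b, response in rows:
        # Same combination and same response, opposite drug order.
        augmented.append((drug_b, drug_a, response))
    return augmented


def to_feature_vector(row: Row) -> List[float]:
    """Concatenate the two drugs' features into one model input vector."""
    drug_a, drug_b, _ = row
    return list(drug_a) + list(drug_b)


# Toy training rows with made-up feature values and responses.
train_rows: List[Row] = [
    ((1.0, 0.0), (0.0, 2.0), 0.5),
    ((3.0, 1.0), (1.0, 1.0), -0.2),
]

augmented_rows = augment_with_reversed_pairs(train_rows)
```

In this sketch only the training rows are augmented, as the caption states for step IV; the test rows are left in their original order.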