. 2015 Aug 19;14:958–970. doi: 10.17179/excli2015-374

Figure 1. Schematic workflow of this study, which consisted of 4 major steps: (1) data set preparation, (2) determination of informative molecular descriptors, (3) handling of imbalanced data and (4) multivariate analysis. In step 1, redundant compounds, overlapping compounds, and compounds with MW > 1000 Da were identified and removed. In step 2, the compounds remaining in the pre-processed data sets were geometrically optimized at the PM6 level, a set of 13 descriptors was calculated, and feature selection was applied to select informative descriptors for multivariate analysis. In step 3, the imbalance between the positive and negative classes was resolved by making the positive class clusters equal in number to those of the negative class; the clusters providing the best predictive performance were selected as the representative clusters for model construction. Finally, in step 4, the balanced data set was split by random selection into a training set (85 %) and an external test set (15 %). Predictive models were constructed using the DT, ANN and SVM algorithms. The predictive performance of the models was assessed by a set of statistical parameters. I = inhibitors, NI = non-inhibitors, S = substrates, NS = non-substrates.
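
The random 85 %/15 % split described in step 4 can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_train_test`, the fixed seed, and the toy compound records are assumptions made purely for illustration, and the caption does not say whether the original split was stratified by class.

```python
import random


def split_train_test(records, test_fraction=0.15, seed=0):
    """Randomly partition records into (training, external test) lists.

    Hypothetical helper; the 15 % default mirrors the caption, the seed
    is an arbitrary choice to make the sketch reproducible.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)                    # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)       # seeded shuffle = random selection
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy balanced data set (assumed): 100 positives (1) and 100 negatives (0).
compounds = [(f"cmpd_{i}", i % 2) for i in range(200)]
train, test = split_train_test(compounds)
print(len(train), len(test))
```

With 200 records the sketch yields 170 training and 30 external test compounds. Because the selection is purely random, the class ratio within each subset is only approximately balanced.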
