2020 Dec 18;9:e62208. doi: 10.7554/eLife.62208

Figure 5. Supervised machine learning can predict Lineage-Specific (LS) regions based on epigenome and physical genome characteristics.

(A) Receiver operating characteristic (ROC) curves plotting sensitivity against false positive rate (FPR) for four machine learning algorithms: BCT, boosted classification tree; GBM, stochastic gradient boosting; LR, logistic regression; RF, random forest. The area under the ROC curve (auROC) for each model is shown next to the algorithm key in the gray box. The black dotted line represents the performance of a random classifier. Perfect model performance would be a curve through point (0,1) in the upper left corner. (B) Precision-recall curves for the same four models shown in A. The area under each curve is shown in the figure key in the gray box. The black dashed line shows the performance of a random classifier, calculated as TP / (TP + FN). Perfect model performance would be a curve through point (1,1) in the upper right corner.
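The two summary statistics in this caption can be illustrated with a minimal sketch. It assumes binary labels (1 for the positive class) and one classifier score per example. The function names, the toy labels and scores, and the step-wise (average precision) estimate of the precision-recall area are illustrative assumptions, not the code or data behind the figure. The baseline precision in the sketch is taken as the fraction of positive examples, a common convention and an assumption here rather than the calculation used for the figure.

```python
def ranked_indices(scores):
    """Indices of the examples ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def roc_points(labels, scores):
    """(FPR, sensitivity) points from sweeping the score threshold high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = ranked_indices(scores)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Examples with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_roc(labels, scores):
    """Trapezoidal area under the ROC curve."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def average_precision(labels, scores):
    """Step-wise precision-recall area: sum of precision times recall gain."""
    n_pos = sum(labels)
    order = ranked_indices(scores)
    tp = 0
    area = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        while i < len(order) and scores[order[i]] == threshold:
            tp += labels[order[i]]
            i += 1
        recall = tp / n_pos
        precision = tp / i  # i examples are predicted positive at this threshold
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy data (hypothetical): 4 positives and 4 negatives with made-up scores.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

auroc = area_under_roc(labels, scores)
avg_precision = average_precision(labels, scores)
# Assumed random-classifier baseline for precision: fraction of positives.
pr_baseline = sum(labels) / len(labels)
```

A classifier that ranks every positive above every negative gives an ROC curve through (0,1) and an area of 1, matching the description of perfect performance in panel A.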


Figure 5—figure supplement 1. Results from model parameter tuning and assessment.


(A) The random forest model was trained using 10-fold cross-validation (CV) repeated three times under varying values of the parameter ‘randomly selected predictor’. The plot shows the average accuracy across the 30 trials for each parameter value as a black square. (B) Average accuracy from 10-fold CV repeated three times using the boosted classification tree algorithm. The variables ‘number of trees’ (x-axis) and ‘max tree depth’ (blue, green, black lines) were varied across the trials. Each data point represents the average accuracy across the CV. (C) Average accuracy from 10-fold CV repeated three times using the stochastic gradient boosting algorithm. The variables ‘number of boosting iterations’ (x-axis), ‘shrinkage’ (y-axis), ‘minimum terminal node size’ (columns), and ‘max tree depth’ (blue, green, black lines) were varied across the trials. Each data point represents the average accuracy across the CV. (D) The individual accuracy measurements and box plot for the final models picked for each algorithm. Results are from the 30 CV runs.
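The tuning procedure described above (10 folds repeated three times, giving 30 accuracy values per parameter setting, which are then averaged) can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, all function names are hypothetical, and a one-dimensional k-nearest-neighbour classifier stands in for the four algorithms in the figure, with its neighbour count playing the role of the tuned parameter.

```python
import random
from statistics import mean


def repeated_kfold_indices(n, folds=10, repeats=3, seed=0):
    """Yield (train, test) index lists: `folds` splits for each of `repeats` shuffles."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        parts = [idx[i::folds] for i in range(folds)]
        for held_out in range(folds):
            test = parts[held_out]
            train = [i for p, part in enumerate(parts) if p != held_out for i in part]
            yield train, test


def knn_predict(train_x, train_y, x, k):
    """Stand-in classifier: majority vote among the k nearest 1-D neighbours."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracies(xs, ys, k, folds=10, repeats=3, seed=0):
    """Accuracy on each held-out fold (folds * repeats values) for one setting."""
    accuracies = []
    for train, test in repeated_kfold_indices(len(xs), folds, repeats, seed):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test)
        accuracies.append(correct / len(test))
    return accuracies


# Synthetic data (hypothetical): two classes with shifted Gaussian features.
rng = random.Random(1)
ys = [i % 2 for i in range(100)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

# Candidate parameter values; odd counts avoid tied votes.
grid = [1, 5, 15]
results = {k: cv_accuracies(xs, ys, k) for k in grid}
mean_accuracy = {k: mean(acc) for k, acc in results.items()}
best_k = max(mean_accuracy, key=mean_accuracy.get)
```

Using the same seed for every candidate value means each setting is scored on identical splits, so differences in mean accuracy reflect the parameter rather than the partitioning. The 30 per-fold values kept in `results` are the kind of measurements a box plot such as panel D would summarize.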