
Fig. 2.

(A) Machine learning pipeline workflow. Input data consisted of instances (gene pairs) with labels (redundant or nonredundant) and values of features (characteristics of gene pairs). Example features, as shown in the table, include DNA sequence similarity, the number of genes in a pair annotated as having TF activity, maximum gene expression level, and the average level of CpG methylation among genes in the pair. The full input data are provided in supplementary data, Supplementary Material online. Instances were first split into training and testing sets. The training set was further split into a training subset (90%) and a validation subset (10%) in a 10-fold cross-validation scheme. The optimal model after parameter tuning was used to provide performance metrics based on cross-validation, to predict labels in the testing set for model evaluation, and to obtain feature importance scores. (B, C) Cross-validation performance of models built using six of the nine RDs based on (B) AUC-ROC and (C) AU-PRC for each RD. RDs 1, 2, and 6 were excluded because of their small training data sizes. A model classifying gene pairs perfectly would have AUC-ROC and AU-PRC scores of 1.0; black dotted lines represent the performance of a model classifying at random, for which AUC-ROC and AU-PRC scores would be 0.5 given that we used balanced data (i.e., equal numbers of redundant and nonredundant instances). These curves represent the average scores from 100 iterations of model building; curves with standard deviations from this process are shown in supplementary figure S3, Supplementary Material online. (D) AUC-ROC and (E) AU-PRC when a model trained on extreme redundancy (RD4) gene pairs and balanced nonredundant pairs was applied to inclusive redundancy (RD9) gene pairs (excluding RD4) and to nonredundant pairs that did not overlap with those used to train the RD4 model.
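
To make the workflow in panel (A) concrete, the sketch below illustrates the general pattern of a train/test split, 10-fold cross-validation with parameter tuning, AUC-ROC and AU-PRC evaluation, and feature importance extraction. It is a minimal sketch assuming a scikit-learn setup; the input file name, the random forest classifier, the parameter grid, and the test-set proportion are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the panel (A) workflow, assuming scikit-learn.
# File name, classifier choice, and parameter grid are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical input table: one row per gene pair (instance), one column per
# feature, plus a binary label (1 = redundant, 0 = nonredundant).
data = pd.read_csv("gene_pair_features.csv")
X = data.drop(columns=["label"]).values
y = data["label"].values

# Split instances into training and testing sets (test-set size assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# Tune model parameters with 10-fold cross-validation on the training set;
# each fold fits on 90% of the training set and validates on 10%.
param_grid = {"n_estimators": [100, 500], "max_depth": [None, 10]}  # illustrative
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)
best_model = search.best_estimator_

# Evaluate the tuned model on the held-out test set.
proba = best_model.predict_proba(X_test)[:, 1]
print("AUC-ROC:", roc_auc_score(y_test, proba))
print("AU-PRC :", average_precision_score(y_test, proba))

# Feature importance scores from the tuned model.
importances = pd.Series(best_model.feature_importances_,
                        index=data.drop(columns=["label"]).columns)
print(importances.sort_values(ascending=False).head())
```

The same fitted model can then be scored against a disjoint set of gene pairs (as in panels D and E, where the RD4-trained model is applied to nonoverlapping RD9 and nonredundant pairs) by passing those instances to predict_proba and recomputing the two metrics.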