Table 1.
Summary of Top Performing v3 Models After Feature Selection Using RFE-CV
| Protein (AA sequence) | Optimal features (ap) | Residue positions (Zm) | ROC AUC | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| RbcL | 101, 143, 281, 309, 418, 468 | 101, 143, 281, 309, 418, 468 | 1.0 | 1.0 | 0.9997 | 0.9998 | 0.9998 |
| RpoC2 | 357, 368, 450, 774, 1050, 1132, 1290 | 357, 368, 450, 706, 942, 1011, 1151 | 0.9639 | 0.9728 | 0.9471 | 0.9591 | 0.955 |
| NdhI | 5, 25, 38, 49, 84, 88, 89, 147, 153 | 5, 25, 38, 49, 84, 88, 89, 147, 153 | 0.9583 | 0.9784 | 0.9197 | 0.9464 | 0.9423 |
| NdhA | 5, 29, 36, 98, 110, 187, 298, 301, 319, 320 | 5, 27, 34, 96, 108, 184, 293, 296, 314, 315 | 0.9805 | 0.9910 | 0.9031 | 0.9436 | 0.9401 |
| RpoA | 6, 14, 146, 163, 180, 243, 279, 326, 329, 336 | 6, 14, 146, 161, 176, 237, 270, 317, 327 | 0.969 | 0.9444 | 0.9305 | 0.9359 | 0.9295 |
| MatK | 49, 147, 159, 314, 378, 417, 436 | 16, 111, 123, 274, 338, 377, 396 | 0.9673 | 0.8491 | 0.9716 | 0.93 | 0.9191 |
| NdhD | 64, 76, 114, 334, 364, 376, 442, 451, 497, 501 | 62, 74, 112, 332, 362, 374, 440, 449, 495, 499 | 0.963 | 0.9214 | 0.9197 | 0.9183 | 0.9089 |
| NdhF | 89, 145, 287, 340, 400, 566, 568, 597, 659 | 89, 145, 287, 340, 400, 555, 557, 586, 646 | 0.9653 | 0.9860 | 0.8497 | 0.9114 | 0.9082 |
ap: feature positions numbered according to alignment position.
Zm: residue positions numbered according to the Zea mays protein sequence.
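The relationship between the two numbering schemes can be sketched as follows. This is a minimal illustration, assuming gaps in the aligned reference sequence are written as `-`; the function name `alignment_to_residue` and the toy sequence are hypothetical and are not taken from the study.

```python
from typing import Optional


def alignment_to_residue(aligned_seq: str, ap: int, gap: str = "-") -> Optional[int]:
    """Map a 1-based alignment column (ap) to the 1-based residue index
    of one aligned sequence, e.g. the reference protein.

    Returns None when the sequence has a gap at that column.
    """
    if not 1 <= ap <= len(aligned_seq):
        raise ValueError("alignment position out of range")
    if aligned_seq[ap - 1] == gap:
        return None
    # Residue index = column index minus the gaps that precede the column.
    return ap - aligned_seq[: ap - 1].count(gap)


# Toy aligned reference sequence (hypothetical, not real data).
toy_reference = "--MKT-LV"
first = alignment_to_residue(toy_reference, 3)   # column 3 holds 'M'
later = alignment_to_residue(toy_reference, 7)   # column 7 holds 'L'
gapped = alignment_to_residue(toy_reference, 6)  # column 6 is a gap
```

When the reference sequence has no gaps before a column, the two numbers coincide, which is consistent with rows such as RbcL and NdhI where both columns list the same positions.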
Values are mean classification metrics (ROC AUC, precision, recall, F1, accuracy) from cross-validation by repeated random subsampling (n = 500, 70/30 train/test split). Gray cells indicate models and model performances that are potential artifacts of corresponding sequence length/variation. ROC AUC, area under receiver operating characteristic curve.
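The averaging scheme described in this note can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline used to produce the table: `majority_residue_rule` is a hypothetical stand-in classifier, the toy data are invented, the splits are not stratified, and ROC AUC is omitted because it requires classifier scores rather than hard labels.

```python
import random
from statistics import mean


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for 0/1 labels.
    Ratios with a zero denominator are reported as 0.0 (a convention chosen here)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": (tp + tn) / len(pairs)}


def repeated_random_subsampling(X, y, fit_predict, n_repeats=500,
                                test_fraction=0.3, seed=0):
    """Average metrics over n_repeats independent random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(y)))
    n_test = round(test_fraction * len(y))
    runs = []
    for _ in range(n_repeats):
        rng.shuffle(indices)  # a fresh random split on every repeat
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        y_pred = fit_predict([X[i] for i in train_idx],
                             [y[i] for i in train_idx],
                             [X[i] for i in test_idx])
        runs.append(binary_metrics([y[i] for i in test_idx], y_pred))
    return {name: mean(run[name] for run in runs) for name in runs[0]}


def majority_residue_rule(X_train, y_train, X_test):
    """Hypothetical classifier: predict class 1 when the residue in column 0
    matches the most common residue among class-1 training rows."""
    positives = [row[0] for row, label in zip(X_train, y_train) if label == 1]
    target = max(set(positives), key=positives.count)
    return [1 if row[0] == target else 0 for row in X_test]


# Toy data: one residue feature per sequence, perfectly separable.
X = [["A"]] * 10 + [["S"]] * 10
y = [1] * 10 + [0] * 10
scores = repeated_random_subsampling(X, y, majority_residue_rule, n_repeats=500)
```

Because each split is drawn independently, a sample may appear in the test set of many repeats or of none, which is the main difference from k-fold cross-validation.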