Table 2. Performance benchmarks for the identification of CPDs based on the two machine-learning models prototyped in this study.
Model | TPR (%) | FPR (%) | MCC | AUC of ROC (%) |
---|---|---|---|---|
CPD-trained model | 78.05 | 21.75 | 0.56 | 86.21 |
DM-trained model (no MSA features) | 54.84 | 21.98 | 0.34 | 75.10 |
Abbreviations: AUC, area under the curve; FPR, false-positive rate; MCC, Matthews correlation coefficient; MSA, multiple sequence alignments; ROC, receiver operating characteristic; TPR, true-positive rate.
The first model, termed the 'CPD-trained model', was trained using the sets of CPDs and common SNPs employed in this study (CPD and SNP sets). The second model, the 'DM-trained model', was trained using disease-causing mutations and common SNPs (DM and SNP sets) but excluded any features derived from MSA. Both models used the Random Forest machine-learning algorithm and were evaluated using a variation of 10-fold cross-validation in which the positive evaluation set in each fold comprised unseen examples from the CPD set, for the DM-trained model as well as the CPD-trained model. An MCC of −1 represents the worst possible prediction, 0 a random prediction and +1 a perfect prediction.
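As a reading aid for the TPR, FPR and MCC columns, the following minimal Python sketch shows how these measures are conventionally computed from the counts of a binary confusion matrix. The function name `binary_metrics` and all counts are hypothetical illustrations, not code or data from this study. Returning an MCC of 0 when a marginal total is zero is a common convention that is assumed here.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (TPR %, FPR %, MCC) from confusion-matrix counts.

    tp/fn: positive examples predicted as positive/negative.
    fp/tn: negative examples predicted as positive/negative.
    """
    # True-positive rate: fraction of positives that were recovered.
    tpr = 100.0 * tp / (tp + fn) if (tp + fn) else float("nan")
    # False-positive rate: fraction of negatives wrongly flagged.
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else float("nan")
    # Matthews correlation coefficient, ranging from -1 to +1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return tpr, fpr, mcc


# Hypothetical counts illustrating the three reference points of the MCC.
perfect = binary_metrics(tp=50, fp=0, tn=50, fn=0)     # every call correct
chance = binary_metrics(tp=25, fp=25, tn=25, fn=25)    # no better than random
inverted = binary_metrics(tp=0, fp=50, tn=0, fn=50)    # every call wrong

for name, (tpr, fpr, mcc) in [("perfect", perfect), ("chance", chance), ("inverted", inverted)]:
    print(f"{name}: TPR={tpr:.2f}% FPR={fpr:.2f}% MCC={mcc:+.2f}")
```

The three hypothetical cases correspond to the MCC reference values of +1, 0 and −1 given in the legend. Unlike TPR or FPR alone, the MCC uses all four cells of the confusion matrix, which is why it is reported alongside them.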