Table 2. Performance benchmarks for the identification of CPDs based on the two machine-learning models prototyped in this study.

Model	TPR (%)	FPR (%)	MCC	AUC of ROC (%)
CPD-trained model	78.05	21.75	0.56	86.21
DM-trained model (no MSA features)	54.84	21.98	0.34	75.10

Abbreviations: AUC, area under the curve; FPR, false-positive rate; MCC, Matthews correlation coefficient; MSA, multiple sequence alignment; ROC, receiver operating characteristic; TPR, true-positive rate.

The first model, termed the 'CPD-trained model', was trained on the sets of CPDs and common SNPs used in this study (CPD and SNP sets). The second model, the 'DM-trained model', was trained on disease-causing mutations and common SNPs (DM and SNP sets) and excluded all features derived from MSA. Both models used the Random Forest machine-learning algorithm and were evaluated with a variation of 10-fold cross-validation in which the positive evaluation set in each fold comprised unseen examples from the CPD set. An MCC of −1 represents the worst possible prediction, 0 a random prediction and +1 a perfect prediction.
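
For readers who want to see how the reported measures relate to one another, the following is a minimal sketch of how TPR, FPR and MCC are conventionally computed from confusion-matrix counts. The function name `binary_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the study, and the sketch assumes that both the positive and the negative class contain at least one example.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Return (TPR %, FPR %, MCC) from confusion-matrix counts.

    tp/fn: positives predicted correctly/incorrectly
    fp/tn: negatives predicted incorrectly/correctly
    Assumes tp + fn > 0 and fp + tn > 0.
    """
    tpr = 100.0 * tp / (tp + fn)  # true-positive rate (sensitivity)
    fpr = 100.0 * fp / (fp + tn)  # false-positive rate (1 - specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes
    # (for example, when every example receives the same predicted label).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return tpr, fpr, mcc


# Hypothetical counts illustrating the three reference points of the MCC scale.
perfect = binary_metrics(tp=50, fn=0, fp=0, tn=50)       # MCC = +1.0
random_like = binary_metrics(tp=25, fn=25, fp=25, tn=25)  # MCC = 0.0
inverted = binary_metrics(tp=0, fn=50, fp=50, tn=0)       # MCC = -1.0
```

Because MCC uses all four cells of the confusion matrix, it can stay low even when TPR alone looks acceptable, which is why it is listed alongside TPR and FPR.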