Editorial. 2016 Oct 5;25(1):2–7. doi: 10.1038/ejhg.2016.129

Table 2. Performance benchmarks for the identification of CPDs based on the two machine-learning models prototyped in this study.

Data set                              TPR (%)   FPR (%)   MCC    AUC of ROC (%)
CPD-trained model                     78.05     21.75     0.56   86.21
DM-trained model (no MSA features)    54.84     21.98     0.34   75.10

Abbreviations: FPR, false-positive rate; MCC, Matthews correlation coefficient; MSA, multiple sequence alignment; TPR, true-positive rate.

The first model, termed the 'CPD-trained model', was trained on the sets of CPDs and common SNPs employed in this study (the CPD and SNP sets). The second model, the 'DM-trained model', was trained on disease-causing mutations and common SNPs (the DM and SNP sets) but excluded all features derived from MSA. Both models used the Random Forest machine-learning algorithm and were evaluated with a variation of 10-fold cross-validation in which the positive evaluation set in each fold comprised unseen examples from the CPD set, for both the DM-trained and the CPD-trained model. An MCC of −1 represents the worst possible prediction, 0 a random prediction and +1 a perfect prediction.
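The MCC summarises the full confusion matrix rather than the TPR/FPR pair alone. As a minimal sketch (the counts below are illustrative, not the study's data), the coefficient can be computed directly from true/false positives and negatives; note that under a hypothetical balanced class split, rates close to the CPD-trained model's TPR and FPR yield an MCC near the reported 0.56:

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient: -1 worst, 0 random, +1 perfect."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Illustrative balanced example (not the study's counts):
# TPR = tp / (tp + fn) = 78%, FPR = fp / (fp + tn) = 22%
tp, fn = 78, 22
fp, tn = 22, 78
print(round(mcc(tp, fp, tn, fn), 2))  # → 0.56
```

Because MCC depends on all four cells of the confusion matrix, it remains informative even when the positive and negative classes are imbalanced, which is why it complements the TPR/FPR columns in Table 2.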