2019 Oct 15;20:496. doi: 10.1186/s12859-019-3026-8

Table 1.

Feature selection

Feature label                               RF (sklearn)    BRF (imblearn)
HPO-cosine                                  0.2895          0.2471
PyxisMap                                    0.2207          0.2079
CADD Scaled                                 0.1031          0.1007
phylop100 conservation                      0.0712          0.0817
phylop conservation                         0.0641          0.0810
phastcon100 conservation                    0.0572          0.0628
GERP rsScore                                0.0357          0.0416
HGMD assessment type_DM                     0.0373          0.0344
HGMD association confidence_High            0.0309          0.0311
Gnomad Genome total allele count            0.0192          0.0322
ClinVar Classification_Pathogenic           0.0228          0.0200
ADA Boost Splice Prediction                 0.0081          0.0109
Random Forest Splice Prediction             0.0077          0.0105
Meta Svm Prediction_D                       0.0088          0.0092
PolyPhen HV Prediction_D                    0.0075          0.0071
Effects_Premature stop                      0.0049          0.0057
SIFT Prediction_D                           0.0026          0.0056
PolyPhen HD Prediction_D                    0.0025          0.0049
Effects_Possible splicing modifier          0.0029          0.0035
ClinVar Classification_Likely Pathogenic    0.0034          0.0020

This table shows the top 20 features used to train the classifiers, ordered from most to least important. After training, each of the two random forest classifiers reports the importance of every feature (importances sum to 1.00 per classifier). We average the two importance values and order the features by that average. Feature labels containing an ‘_’ represent a single category of a multi-category feature (e.g. “HGMD assessment type_DM” is the “DM” bin-count feature from the “HGMD assessment type” annotation in Codicem).
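
The averaging and ordering step described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the importance values are copied from five rows of Table 1, while the helper name `rank_by_mean_importance` and the dictionary layout are assumptions made for this example. In scikit-learn and imbalanced-learn, a fitted forest exposes these per-feature values through its `feature_importances_` attribute.

```python
# (RF, BRF) importances copied from five rows of Table 1.
importances = {
    "HPO-cosine": (0.2895, 0.2471),
    "PyxisMap": (0.2207, 0.2079),
    "CADD Scaled": (0.1031, 0.1007),
    "HGMD assessment type_DM": (0.0373, 0.0344),
    "GERP rsScore": (0.0357, 0.0416),
}


def rank_by_mean_importance(table):
    """Return (label, mean importance) pairs, most important first.

    Hypothetical helper: averages the per-classifier importances of each
    feature and sorts the features by that average in descending order.
    """
    means = {label: sum(values) / len(values) for label, values in table.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_mean_importance(importances)
for label, mean in ranking:
    print(f"{label:<28}{mean:.4f}")
```

Ranking by the average explains why “GERP rsScore” is listed above “HGMD assessment type_DM” even though its RF importance alone is lower: its mean over the two classifiers is higher.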