Table 1. Summary of variations in the ENTPRISE-X training and testing data sets.
Data set | Usage
Training set | (1) Training a model for use in future applications. (2) Feature reduction. (3) Large-scale ten-fold cross-validation on nonsense mutations in comparison with DDIG-in, and on frameshift mutations in comparison with DDIG-in and SIFT-indel (see Table 3).
Frameshift, pathogenic | Frameshift, neutral | Nonsense, pathogenic | Nonsense, neutral
ClinVar: 6,513 | ESP6500: 1,604 | ClinVar: 5,023 | ESP6500: 181
─ | 1000 GP: 366 | ─ | 1000 GP: 3,171
Total numbers (sum of each column)
6,513 | 1,970 | 5,023 | 3,352
Independent testing sets (not used in training) | Usage
VEST-indel set | Testing on frameshift variations in comparison with the VEST-indel and DDIG-in methods (see Table 2).
Frameshift, pathogenic | Frameshift, neutral | Nonsense, pathogenic | Nonsense, neutral
ClinVar: 82 | Inter-species: 1,025 | ─ | ─
ExAC set | Large-scale false-positive-rate testing on frameshift and nonsense variations in comparison with the VEST-indel and DDIG-in methods.
─ | ExAC: 56,917 | ─ | ExAC: 45,131 |
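
The training-set totals can be rechecked from the per-source counts. The following is a minimal Python sketch, assuming the counts are transcribed by hand from the training-set rows of Table 1; the dictionary layout and the function name `column_totals` are illustrative and are not part of ENTPRISE-X.

```python
# Per-source counts copied from the training-set rows of Table 1.
# The dictionary layout and all names here are illustrative assumptions,
# not taken from the ENTPRISE-X implementation.
TRAINING_COUNTS = {
    ("frameshift", "pathogenic"): {"ClinVar": 6513},
    ("frameshift", "neutral"): {"ESP6500": 1604, "1000 GP": 366},
    ("nonsense", "pathogenic"): {"ClinVar": 5023},
    ("nonsense", "neutral"): {"ESP6500": 181, "1000 GP": 3171},
}


def column_totals(counts):
    """Sum the per-source counts within each (variation type, label) column."""
    return {column: sum(sources.values()) for column, sources in counts.items()}


if __name__ == "__main__":
    # Print one line per column: variation type, label, and summed count.
    for (variation, label), total in column_totals(TRAINING_COUNTS).items():
        print(f"{variation:<10} {label:<10} {total:>6,}")
```

Keeping the per-source counts separate, rather than storing only the totals, makes it straightforward to see how much each database contributes to a column.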