Rationale and General Design of DOMINO
(A) A typical exome analysis identifies approximately 20,000 variants relative to the human reference genome. After filtering by rarity in the general population (minor allele frequency, or MAF, < 1%) and by the functional impact of each variant, approximately 400 DNA changes remain. These changes affect 300–400 genes as heterozygous variants (red dots) and 5–10 genes as homozygous or compound heterozygous variants (blue dots).
(B) Workflow of DOMINO methodology, showing the different steps of gene selection, annotation, and scoring.
(C) Details of the LDA algorithm. Relevant features are first preselected and then iteratively removed from, replaced in, or added to the model, subject to specific acceptance criteria. At each iteration, 10 × 10-fold cross-validation is performed.
(D) Performance of the model as a function of the iterations performed. AUCs of the training, testing, and validation sets, as well as the number of features at each iteration, are shown. The retained cut-off corresponds to the 14th iteration and a set of 8 features. The model converges starting from the 36th iteration.
(E) ROC curves for the complete training, testing and validation sets, displaying AUC values of 0.912, 0.908, and 0.920, respectively.
(F) Features composing the selected model. Average values for AD and AR genes of the training set are shown, along with their relative weight. Units are as follows: for STRING entries, number of interactions [17]; for ExAC-pRec, probability of being intolerant to homozygous but not heterozygous loss-of-function variants [18]; for ExAC-missense Z score, value with respect to a distribution of the expected number of missense variants [18]; for PhyloP, average PhyloP score over a 1,000-bp window centered on the TSS [19]; for ExAC-don./syn., number of variants at the donor splice site, normalized to the number of synonymous variants in the coding sequence [20]; for mRNA half-life, 0 if ≤ 10 hr or 1 if > 10 hr [21].
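The filtering described for panel A can be illustrated with a minimal sketch. It assumes each variant is a dictionary with hypothetical keys `gene`, `maf`, `consequence`, and `genotype`; the set of consequences counted as functionally relevant is also an assumption, not the authors' definition. Two heterozygous variants in the same gene are treated only as a candidate compound heterozygote, since phase is not known here.

```python
from collections import defaultdict

MAF_CUTOFF = 0.01  # rarity threshold from panel A (MAF < 1%)
# Assumed (hypothetical) set of consequences treated as functionally relevant.
IMPACTFUL = {"missense", "nonsense", "frameshift", "splice_site"}


def filter_variants(variants):
    """Keep variants that are rare and have an assumed functional impact."""
    return [
        v for v in variants
        if v["maf"] < MAF_CUTOFF and v["consequence"] in IMPACTFUL
    ]


def genes_by_model(variants):
    """Split genes into heterozygous hits and recessive-compatible hits."""
    het_count = defaultdict(int)
    hom_genes = set()
    for v in variants:
        if v["genotype"] == "hom":
            hom_genes.add(v["gene"])
        else:
            het_count[v["gene"]] += 1
    # Genes carrying at least one heterozygous variant.
    heterozygous = set(het_count)
    # Homozygous genes, plus genes with two or more heterozygous variants
    # (candidate compound heterozygotes; phase is not checked).
    recessive = hom_genes | {g for g, n in het_count.items() if n >= 2}
    return heterozygous, recessive


# Toy input; gene names and values are invented for illustration only.
variants = [
    {"gene": "GENE1", "maf": 0.0001, "consequence": "missense", "genotype": "het"},
    {"gene": "GENE2", "maf": 0.2, "consequence": "missense", "genotype": "het"},
    {"gene": "GENE3", "maf": 0.001, "consequence": "synonymous", "genotype": "hom"},
    {"gene": "GENE4", "maf": 0.002, "consequence": "frameshift", "genotype": "het"},
    {"gene": "GENE4", "maf": 0.004, "consequence": "missense", "genotype": "het"},
    {"gene": "GENE5", "maf": 0.0, "consequence": "nonsense", "genotype": "hom"},
]

kept = filter_variants(variants)
heterozygous, recessive = genes_by_model(kept)
```

In this toy input, GENE2 is dropped for being common and GENE3 for lacking an assumed functional impact; GENE4 appears in both sets because it carries two rare heterozygous variants.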
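The evaluation scheme named in panels C to E can be sketched as follows, reading "10 × 10-fold cross-validation" as 10 repetitions of a shuffled 10-fold split (an interpretation, not a detail stated in the caption). For brevity the sketch fits a single-feature linear discriminant, to which LDA reduces in one dimension; it does not reproduce the 8-feature model or the iterative feature selection. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_lda_1d(x, y):
    """One-feature Fisher discriminant: weight and midpoint between class means."""
    x1 = [a for a, b in zip(x, y) if b == 1]
    x0 = [a for a, b in zip(x, y) if b == 0]
    m1, m0 = mean(x1), mean(x0)
    ss = sum((a - m1) ** 2 for a in x1) + sum((a - m0) ** 2 for a in x0)
    pooled_var = ss / (len(x) - 2)
    return (m1 - m0) / pooled_var, (m1 + m0) / 2


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def repeated_cv_auc(x, y, k=10, repeats=10, seed=0):
    """Return one AUC per repeat, computed on pooled out-of-fold scores."""
    rng = random.Random(seed)
    n = len(x)
    aucs = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint test folds
        oof = [0.0] * n  # out-of-fold score for every sample
        for test in folds:
            held_out = set(test)
            train = [j for j in idx if j not in held_out]
            w, c = fit_lda_1d([x[j] for j in train], [y[j] for j in train])
            for j in test:
                oof[j] = w * (x[j] - c)
        aucs.append(auc(oof, y))
    return aucs


# Toy data: one feature, two well-separated classes (values are synthetic).
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(30)] + [data_rng.gauss(3, 1) for _ in range(30)]
y = [0] * 30 + [1] * 30
aucs = repeated_cv_auc(x, y)
```

Each sample is scored exactly once per repeat by a model that did not see it during fitting, so the mean of `aucs` gives a cross-validated estimate comparable in spirit to the testing-set AUC shown in the figure.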