Fig. 2: Low-signal signed iterative random forest (lo-siRF) prioritizes risk loci and epistatic interactions for left ventricular hypertrophy.
a-e, Workflow of lo-siRF. a, Lo-siRF took as input single-nucleotide variant (SNV) data and cardiac MRI-derived left ventricular mass indexed by body surface area (LVMi) from 29,661 UK Biobank participants. b, Dimension reduction was performed via a genome-wide association study (GWAS) to concentrate the analysis on a smaller set of SNVs. c, LVMi was binarized into high and low LVMi categories according to three different binarization thresholds (represented by the stacked boxes). d, For each of the three binarization thresholds, a signed iterative random forest was fitted using the GWAS-filtered SNVs to predict the binarized LVMi phenotype. The validation prediction accuracy was assessed before the model fit was interpreted. e, SNVs used in the fitted signed iterative random forests were aggregated into genetic loci based on ANNOVAR78 annotations. Genetic loci and pairwise interactions between loci were then ranked according to their importance across the three signed iterative random forest fits, as measured by our proposed stability-driven importance score. f, Lo-siRF-prioritized risk loci and epistatic interactions. (1) Loci stably prioritized by lo-siRF as epistasis participants are highlighted in green. (2) nIndSigSNVs, the number of independent significant SNVs that are stably prioritized by lo-siRF across the three different LVMi binarization thresholds (panel c). (3) nSNVs, the number of candidate SNVs extracted by FUMA37 (v1.5.4) that are in strong linkage disequilibrium (LD; r² > 0.6) with any of the lo-siRF-prioritized independent significant SNVs. (4) Lo-siRF p-value, the mean p value from lo-siRF, averaged across the three LVMi binarization thresholds. (5) Max CADD, the maximum CADD40 score of SNVs within, or in LD with, the given locus. A higher CADD score indicates that a variant is more likely to be deleterious; Kircher et al.40 suggested a threshold of 12.37 for deleteriousness.
(6) Min RDB, the minimum RegulomeDB39 score of SNVs within, or in LD with, the given locus. RDB is a categorical score that guides the interpretation of regulatory variants (from 1a to 7, with 1a indicating the strongest biological evidence that an SNV is a regulatory element)37,39. (7) The top-ranked SNV or SNV-SNV pair, that is, the one with the highest occurrence frequency (Extended Data Fig. 4) averaged across lo-siRF fits from the three LVMi binarization thresholds. A full list of lo-siRF-prioritized SNVs and SNV-SNV pairs can be found in Extended Data 3. (8) Genomic location (hg38) and GWAS statistics (computed using PLINK34) of the top SNV for each lo-siRF-prioritized locus. Abbreviations: MAF, minor allele frequency; NEA/EA, non-effect allele/effect allele; SE, standard error. (9) nPartnerSNVs, the number of partner SNVs that interact with the given SNV in lo-siRF. Each of these SNV-SNV pairs interacted in at least one lo-siRF decision path for every LVMi binarization threshold (details in Methods).
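The aggregation across binarization thresholds described in items (4) and (9) can be illustrated with a minimal Python sketch. It is not the lo-siRF implementation: the function names, the threshold labels (`t1` to `t3`), the SNV identifiers and all numeric values are hypothetical placeholders, and the sketch assumes that the per-threshold p values and interacting SNV pairs have already been extracted from the three fits.

```python
from collections import defaultdict
from statistics import mean


def mean_p_value(p_by_threshold):
    """Average a locus's p values over the binarization thresholds (item 4)."""
    return mean(p_by_threshold.values())


def stable_pairs(pairs_by_threshold):
    """Keep only SNV-SNV pairs observed under every threshold (item 9).

    Pairs are treated as unordered, so (A, B) and (B, A) are the same pair.
    """
    per_threshold = [
        {frozenset(pair) for pair in pairs}
        for pairs in pairs_by_threshold.values()
    ]
    return set.intersection(*per_threshold)


def partner_counts(pairs):
    """Count the distinct partner SNVs of each SNV (nPartnerSNVs)."""
    partners = defaultdict(set)
    for pair in pairs:
        a, b = sorted(pair)  # assumes the two SNVs in a pair are distinct
        partners[a].add(b)
        partners[b].add(a)
    return {snv: len(found) for snv, found in partners.items()}


# Hypothetical inputs: one entry per LVMi binarization threshold.
p_values = {"t1": 0.01, "t2": 0.03, "t3": 0.02}
pairs_by_threshold = {
    "t1": [("snvA", "snvB"), ("snvA", "snvC"), ("snvB", "snvD")],
    "t2": [("snvB", "snvA"), ("snvA", "snvC")],
    "t3": [("snvA", "snvB"), ("snvC", "snvA"), ("snvC", "snvD")],
}

stable = stable_pairs(pairs_by_threshold)
print(mean_p_value(p_values))
print(sorted(partner_counts(stable).items()))
```

Taking the intersection over thresholds, rather than the union, mirrors the requirement that a pair appears at every threshold, so a pair seen under only one binarization does not count towards nPartnerSNVs.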
