Summary
Mode of inheritance (MOI) is necessary for clinical interpretation of pathogenic variants; however, the majority of variants lack this information. Furthermore, variant effect predictors are fundamentally insensitive to recessive-acting diseases. Here, we present MOI-Pred, a variant pathogenicity prediction tool that accounts for MOI, and ConMOI, a consensus method that integrates variant MOI predictions from three independent tools. MOI-Pred integrates evolutionary and functional annotations to produce variant-level predictions that are sensitive to both dominant-acting and recessive-acting pathogenic variants. Both MOI-Pred and ConMOI show state-of-the-art performance on standard benchmarks. Importantly, dominant and recessive predictions from both tools are enriched in individuals with pathogenic variants for dominant- and recessive-acting diseases, respectively, in a real-world electronic health record (EHR)-based validation approach of 29,981 individuals. ConMOI outperforms its component methods in benchmarking and validation, demonstrating the value of consensus among multiple prediction methods. Predictions for all possible missense variants are provided in the “Data and code availability” section.
Keywords: mode of inheritance, variant effect prediction, autosomal recessive, autosomal dominant, pathogenic variants, electronic health records
Graphical abstract

Highlights
• Mode-of-inheritance (MOI) information is missing from most variants
• We present MOI-Pred and ConMOI to predict MOI at the variant level
• We demonstrate extensive validation on clinical biobank data
• We provide MOI predictions for 71 million missense variants
Motivation
Mode-of-inheritance (MOI) information is crucial to clinically evaluate disease-causing genetic variation. However, most pathogenic variants lack MOI information. Furthermore, despite advances in the computational prediction of pathogenic variants, these tools remain fundamentally insensitive to recessive-acting diseases. Hence, there is a pressing need for robust MOI prediction tools.
Petrazzini et al. use machine learning to build a variant-level mode-of-inheritance (MOI) prediction tool called MOI-Pred and an ensemble predictor of three independent tools called ConMOI. The authors validated their approach in real-world clinical settings using electronic health record data and provide publicly available pre-computed MOI predictions.
Introduction
Computational methods can predict the effect of coding variants to help diagnose genetic diseases,1,2,3,4 inform genetic association studies,5,6,7,8 and accelerate drug design.9,10 Currently available methods perform very well at discriminating pathogenic and benign missense variants, typically reporting accuracy in the range of 58%–86%.11,12,13 Each prediction uses a unique set of variant characteristics, making ensemble approaches often more accurate than individual methods.14,15,16 Accordingly, current guidelines recommend considering multiple prediction tools to inform decision making.13 While these methods perform very well, they predict a variant’s effect on disease at a very coarse level. The vast majority of these methods make a simple binary prediction: is the variant pathogenic, or is it benign? Some methods predict whether variants cause particular phenotypes,17 but even these are still binary predictions. The true genotype-to-phenotype map is much more complex and high-dimensional. A full assessment of a variant’s effect would include potentially pleiotropic effects on a variety of different phenotypes, from the molecular level to the systems level, as well as features that modify its genetic impact, such as penetrance and mode of inheritance (MOI). Large-scale computational approaches that incorporate different axes of genomic information can potentially be used to inform various aspects of variant function.18
Here, we focus on MOI as the next level of granularity to include in computational prediction of variant effect. The concept behind MOI is foundational to the field of genetics and considered one of the most important features to report about a pathogenic variant.13,19,20 Studies have shown that disease diagnosis can be greatly improved by incorporating pedigree information.21,22,23,24 In spite of this, MOI has practically no role in current variant annotation pipelines. Efforts to resolve MOI mechanisms have fallen behind the gene discovery rate,25 limiting the availability of such information in databases of validated clinically relevant variants. Even among databases that do provide MOI information, most notably Online Mendelian Inheritance in Man (OMIM),26 these annotations are present only for a small fraction of genes (4,417 out of 17,125 in OMIM). They are also not necessarily reliable, since they derive almost entirely from anecdotal case reports with small pedigrees, and very few have been replicated across studies. Currently, 35.5% of variants reported as “pathogenic” or “likely pathogenic” in ClinVar27 have no annotated MOI and cannot confidently be assigned one based on existing annotations. While some molecular and evolutionary features are known to be enriched in genes implicated in autosomal recessive (AR) disease,28,29,30,31,32,33 these features are not widely used at the variant level to distinguish variants causing AR disease from variants causing autosomal dominant (AD) disease or benign variants. 
Additionally, evidence suggests that current variant effect prediction methods are fundamentally insensitive to AR disease,34,35 reinforcing the need for new methods specifically aimed at predicting genes and variants causing AR disease.36 Previous efforts at developing such methods underperform binary prediction tools, lack robust validation, and have not achieved widespread use in the field.16,28,37,38 Binary prediction tools have benefited greatly from the development of multiple approaches with diverse architectures, and standard practice for clinical applications is to use consensus methods that combine these multiple independent predictions.13,39 We therefore propose a similar approach to MOI-aware prediction.
Here, we present two methods: MOI-Pred, a three-way prediction method that labels missense variants as pathogenic for AR disease, pathogenic for AD disease, or benign; and ConMOI, a consensus approach that combines predictions from MOI-Pred with two previously published methods that produce predictions incorporating MOI, MAPPIN37 and MAVERICK.40 MOI-Pred uses a random forest classifier to combine variant effect estimations with gene-level features that are predictive of AR or AD disease. It accurately predicts disease case-control status for homozygous and heterozygous carriers in an external validation using real-world electronic health record (EHR)-based clinical data. MOI-Pred also has performance comparable to state-of-the-art methods on standard benchmarks for binary variant effect prediction as well as on modified benchmarks testing variant-level MOI prediction. It also substantially improves on the performance of existing methods when incorporated into a consensus prediction: ConMOI outperforms all three of its component methods on all these metrics, as well as outperforming a consensus of any two. MOI-Pred and ConMOI address a shortcoming in current annotation pipelines by advancing the role of MOI in variant effect prediction, especially differentiating AR pathogenic variants from benign variants. These tools provide an improvement in reliability that is critical for adoption of MOI prediction methods in downstream research.
Results
Clinical variants missing MOI information
ClinVar does not explicitly annotate MOI. Instead, this information is extracted from external resources such as OMIM or the Human Gene Mutation Database (HGMD).41 These databases provide mainly gene-level information and only for a subset of diseases. Thus, most variants in ClinVar either lack a MOI annotation entirely or are simply labeled with the annotation of their corresponding gene. Out of 817,984 ClinVar variants we tested, 307,800 (37.63%) were in genes with no clearly annotated MOI (Table S2). This includes variants from every clinical significance annotation, including 35.51% of all variants annotated as pathogenic (49,745 variants), 41.38% of all variants annotated as benign (119,532 variants), 35.30% of all variants annotated as uncertain significance (122,600 variants), and 38.19% of all variants annotated as conflicting interpretation (15,923 variants) (Table S2).
Model training
We collected a training set of 2,481 recessive and 1,248 dominant pathogenic missense variants from ExoVar16 and 3,729 presumed non-pathogenic missense variants from the Genome Aggregation Database (gnomAD)42 annotated with a wide range of features capturing functional and biological aspects of MOI. Variants from gnomAD were frequency matched to pathogenic variants to avoid overfitting on allele frequency (AF), which is used explicitly as an input to several of our features (Table S1). While this approach likely results in some contamination of the variants labeled non-pathogenic with pathogenic variants, the non-pathogenic variants remain highly depleted for pathogenic variants relative to both classes of pathogenic variants.43 This approach has successfully been used previously to train several widely used variant effect prediction methods, including CADD,44 DANN,45 and M-CAP.46 We fitted a random forest model on this training set, using 10-fold cross-validation and 100 different random train-test splits to assess performance (Figure 1). Cross-validation was stratified by gene so that variants from the same gene could never be in both training and test sets. Feature selection was performed independently on each iteration, reducing the number of features to a minimum of 10 and a maximum of 18 (median across 100 models is 13 features; Figure S1). In total, 19 unique features were selected across all 100 iterations for training, incorporating a range of functional, evolutionary, and combined information (STAR Methods, Table S1).
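The gene-stratified splitting described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the round-robin fold assignment, and the toy gene labels are all assumptions made for illustration. It only shows how grouping by gene keeps every variant of a gene on the same side of a train-test split.

```python
import random

def gene_grouped_folds(variant_genes, n_folds=10, seed=0):
    """Assign each variant to a fold so that all variants of a gene share a fold.

    variant_genes: list of gene symbols, one per variant (hypothetical input).
    Returns a list of fold indices aligned with variant_genes.
    """
    genes = sorted(set(variant_genes))
    rng = random.Random(seed)
    rng.shuffle(genes)
    # Deal the shuffled genes round-robin into folds (an assumed scheme).
    gene_to_fold = {gene: i % n_folds for i, gene in enumerate(genes)}
    return [gene_to_fold[g] for g in variant_genes]

def train_test_indices(folds, test_fold):
    """Split variant indices into train and test for one held-out fold."""
    train = [i for i, f in enumerate(folds) if f != test_fold]
    test = [i for i, f in enumerate(folds) if f == test_fold]
    return train, test

# Toy example: gene labels are placeholders, not the training data.
variant_genes = ["HFE", "HFE", "DSP", "EXT2", "EXT2", "EXT2", "JAK2", "LRRK2"]
folds = gene_grouped_folds(variant_genes, n_folds=3, seed=1)
train, test = train_test_indices(folds, test_fold=0)
train_genes = {variant_genes[i] for i in train}
test_genes = {variant_genes[i] for i in test}
# No gene appears on both sides of the split.
print(train_genes.isdisjoint(test_genes))
```

Repeating this with different seeds would give different random splits, in the spirit of the 100 splits reported here.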
Figure 1.
Study design and machine-learning workflow
Train and test sets correspond to the 90% and 10% balanced datasets, built from ExoVar and gnomAD variants, used for training and testing respectively. Benchmark set corresponds to the balanced dataset, built from ClinVar and GEMJ-WGA variants, used for external benchmarking. EHR, electronic health record.
See also Figure S1.
The prediction models performed well in the test set, with a mean area under the receiver operating characteristic curve (AUROC) = 0.94/0.96/0.95 (standard deviation [SD] = 1.2 × 10−2/6.8 × 10−3/1.3 × 10−2), sensitivity = 0.75/0.76/0.92 (SD = 3.8 × 10−2/3.0 × 10−2/2.8 × 10−2), and specificity = 0.94/0.95/0.82 (SD = 1.3 × 10−2/1.4 × 10−2/2.2 × 10−2) (Figure 2; reported values represent mean and standard deviation across cross-validation runs for recessive/dominant/non-pathogenic variants, with each class tested against the other two; see STAR Methods). This represents good overall performance with similar discrimination power across classes. The non-pathogenic class has higher sensitivity and lower specificity than the other two classes, representing a higher rate of false positives for non-pathogenic variants and a higher rate of false negatives for both classes of pathogenic variants. This is likely partly due to the above-mentioned contamination of the non-pathogenic dataset with pathogenic variants. Additionally, there is only minor loss of performance between the training set and the cross-validation test set, indicating minimal overfitting (Table S3). Feature importances for the MOI prediction models are provided in Figure 3 (see “Model interpretation” section).
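To make the "each class tested against the other two" convention concrete, the following sketch computes one-vs-rest sensitivity and specificity for a three-class prediction. The labels and predictions are invented toy data, and the function name is an assumption. It is meant only to show how the per-class values above are defined.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity and specificity for one class against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Toy labels: AR = recessive, AD = dominant, benign = non-pathogenic.
y_true = ["AR", "AR", "AR", "AD", "AD", "benign", "benign", "benign", "benign"]
y_pred = ["AR", "AR", "benign", "AD", "benign", "benign", "benign", "benign", "AD"]

for label in ("AR", "AD", "benign"):
    sens, spec = one_vs_rest_metrics(y_true, y_pred, label)
    print(label, round(sens, 2), round(spec, 2))
```

In this toy data, pathogenic variants misclassified as benign lower the benign specificity. The same mechanism would produce the lower non-pathogenic specificity discussed above.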
Figure 2.
Receiver operating characteristic curves and bar plots showing sensitivity and specificity for three-class MOI prediction models
Receiver operating characteristic curves for three-class MOI prediction models (A). Bar plots showing sensitivity and specificity for three-class MOI prediction models (B). Test set corresponds to the 10% balanced dataset, built from ExoVar and gnomAD variants, used for testing. Benchmarking set corresponds to the balanced dataset, built from ClinVar and GEMJ-WGA variants, used for external benchmarking. Reported AUROC corresponds to the mean across 100 models. Reported sensitivity and specificity correspond to mean (SD) across 100 models.
Figure 3.
Feature importance on MOI prediction models
Bar plot showing feature importance on three-class MOI prediction models (A). Word clouds representing feature importance on two-class MOI prediction models (B). Feature importance is reported as the median across 100 models. Word clouds represent feature importance on benign-dominant (left), benign-recessive (middle), and dominant-recessive (right) models, respectively. Exact feature importance values in two-class prediction models can be found in Figures S10–S12.
EHR-based clinical validation
To test the performance of the model on real-world clinical data, we collected a total of 1,845,623 variants present in 29,981 individuals from the BioMe biobank.47 Of these, 56,706 were missense variants present in ClinVar (2,301 pathogenic, 9,865 benign, 35,629 uncertain significance, and 8,911 conflicting interpretation), and 19,134 remained after restricting to 2-star or higher in ClinVar review status (1,047 pathogenic, 6,303 benign, and 11,784 uncertain significance) (Table S4). The model used to predict all variants shows good performance in the train/test sets, as well as in external benchmarks (see below and Table S5). For each variant, we marked each patient as positive if the EHR included a diagnosis reported for the variant in ClinVar, and negative otherwise. We then used a Cochran-Mantel-Haenszel (CMH) stratified contingency test to assess the association between homozygous or heterozygous carriers of ClinVar-annotated variants and actual diagnoses, stratified by disease. An association between carrier status and disease status indicates that the variants being tested are, in aggregate, associated with disease with the specified MOI. By separating variants that receive different predictions from our model, we can test whether our model’s prediction is actually predictive of carrier disease status in a real clinical population. Critically, while this analysis uses ClinVar annotations to group variants, the evaluation of model performance relies only on patient diagnoses observed in the EHR. 
This means this analysis will not overestimate the model’s performance if the same ClinVar variants appear in the method’s training data, which is otherwise a major concern for training and validation of variant effect prediction methods.48,49,50 This is particularly important for an ensemble predictor that uses trained machine-learning classifiers as input features, where data leakage can come not only from the training set of the ensemble predictor itself but also from those of the individual features.
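As a rough illustration of the stratified contingency analysis, the sketch below computes a Mantel-Haenszel pooled odds ratio and a CMH chi-square statistic over a list of 2 × 2 tables, one per stratum (for example, one per disease). The counts are invented, the continuity correction is an assumed choice, and the function name is hypothetical. The p value uses the closed form for a chi-square distribution with 1 degree of freedom.

```python
import math

def cmh_test(tables):
    """Mantel-Haenszel pooled odds ratio and CMH test over 2 x 2 strata.

    tables: list of (a, b, c, d) per stratum, where
      a = carriers with diagnosis,     b = carriers without diagnosis,
      c = non-carriers with diagnosis, d = non-carriers without diagnosis.
    Returns (pooled_odds_ratio, chi_square, p_value).
    """
    num = den = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        num += a * d / n
        den += b * c / n
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    odds_ratio = num / den
    delta = abs(observed - expected)
    # Continuity correction (assumed here; some implementations omit it).
    correction = 0.5 if delta >= 0.5 else 0.0
    chi2 = (delta - correction) ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value

# Invented counts for two disease strata.
strata = [(10, 90, 20, 880), (5, 45, 15, 435)]
pooled_or, chi2, p = cmh_test(strata)
print(pooled_or, chi2, p)
```

A pooled odds ratio above 1 with a small p value would indicate that carriers are, in aggregate, more likely to have the matching diagnosis than non-carriers.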
The contingency table analysis showed that MOI-Pred is highly predictive, with all ClinVar categories showing associations in the expected directions. We found that a recessive prediction from MOI-Pred significantly increases the odds ratio (OR) of association between homozygous genotype and disease status for ClinVar variants annotated as pathogenic (OR = 4.30 for recessive predicted variants vs. OR = 1.07 for dominant or benign predicted variants; p = 1.4 × 10−43, Q test), uncertain significance (OR = 5.45 for recessive predicted variants vs. OR = 0.31 for dominant or benign predicted variants; p = 1.4 × 10−152, Q test), and conflicting interpretation (OR = 4.11 for recessive predicted variants vs. OR = 1.11 for dominant or benign predicted variants; p = 4.6 × 10−51, Q test). Likewise, a dominant prediction from MOI-Pred significantly increases the association between carrier status and disease status for ClinVar variants annotated as pathogenic (OR = 1.98 for dominant predicted variants vs. OR = 1.56 for recessive or benign predicted variants; p = 4.2 × 10−7, Q test) or uncertain significance (OR = 1.40 for dominant predicted variants vs. OR = 0.87 for recessive or benign predicted variants; p = 6.5 × 10−33, Q test). Finally, as expected, a benign prediction from MOI-Pred significantly decreases the association between carrier status and disease status for ClinVar variants annotated as pathogenic (OR = 1.23 for benign predicted variants vs. OR = 2.97 for dominant or recessive predicted variants; p = 5.6 × 10−94, Q test) and uncertain significance (OR = 0.87 for benign predicted variants vs. OR = 1.02 for dominant or recessive predicted variants; p = 1.8 × 10−4, Q test) (Figure 4; Tables S6–S8). Restricting to ClinVar variants with 2-star or higher review status showed similar results (Figure S2; Tables S9–S11). 
Notably, among uncertain-significance variants, those not predicted recessive showed a particularly strong protective association with disease (Figure 4).
Figure 4.
Inheritance-specific disease association for variants with MOI prediction by MOI-Pred or ConMOI
Forest plots showing disease association with variants predicted by MOI-Pred (A, B, and C) or ConMOI (D, E, and F) to be recessive (A and D), dominant (B and E), and benign (C and F). Effect sizes (odds ratios) and 95% confidence intervals were obtained for individual ancestries using a Cochran-Mantel-Haenszel (CMH) test. The reported effect sizes correspond to an inverse-variance meta-analysis across ancestries. p values for heterogeneity between odds ratios are derived from a Q test.
See also Tables S21–S24.
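The inverse-variance meta-analysis and Q test named in the Figure 4 legend can be sketched as follows. This is a generic fixed-effect formulation, not necessarily the exact procedure used in the study. The standard errors in the example are hypothetical, since none are given in the text. Only the two odds ratios (4.30 and 1.07) are taken from the results above.

```python
import math

def inverse_variance_pool(log_ors, ses):
    """Fixed-effect pooled log odds ratio and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * x for w, x in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

def q_test_two_groups(log_or_1, se_1, log_or_2, se_2):
    """Cochran's Q for heterogeneity between two log odds ratios (1 df)."""
    pooled, _ = inverse_variance_pool([log_or_1, log_or_2], [se_1, se_2])
    q = ((log_or_1 - pooled) / se_1) ** 2 + ((log_or_2 - pooled) / se_2) ** 2
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(q / 2.0))
    return q, p_value

# Odds ratios from the text; standard errors are hypothetical placeholders.
q, p = q_test_two_groups(math.log(4.30), 0.2, math.log(1.07), 0.1)
print(q, p)
```

With two groups, Q reduces to the squared difference of the log odds ratios divided by the sum of their variances, which is a convenient sanity check.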
Comparison with other methods and consensus approach
At present, there are two additional tools that produce predictions for three distinct MOI classes: MAPPIN37 and MAVERICK.40 We performed the same EHR-based clinical validation analyses using predictions from these two tools (Figures S3 and S4; Tables S12–S17). Additionally, we performed the same EHR-based clinical validation using predictions from REVEL,14 which is often ranked as the best-performing ensemble method for binary pathogenicity prediction (Tables S18–S20). As expected, REVEL showed enrichment for dominant association with disease that was similar to or stronger than that of the MOI-aware methods but substantially worse enrichment for recessive association with disease. In contrast, both MAPPIN and MAVERICK showed a higher enrichment for association with recessive disease, comparable to each other and to MOI-Pred.
It has previously been observed that the consensus of multiple variant effect predictor methods performs better than any single method, and it is now standard in variant annotation to use a consensus of multiple methods rather than relying on a single method.13 Based on this observation, we performed the same analysis using the consensus of all three available MOI-aware prediction methods, MOI-Pred, MAPPIN, and MAVERICK (Figure 4; Tables S21–S24). As expected, we found that the consensus approach outperforms any of the individual methods on most metrics, particularly for recessive predictions. This improvement is most pronounced in the uncertain significance and conflicting interpretations categories. For example, for variants annotated as uncertain significance, the three individual methods show OR in the range of 0.33–6.59 for individuals homozygous for a predicted recessive variant having a corresponding disease diagnosis, while the consensus of all three methods shows an OR of 17.31, substantially higher than each individual prediction. Likewise, for variants annotated as conflicting interpretations, the three individual methods show OR in the range of 1.15–8.09 for the same recessive predicted variant (homozygous disease association), while the consensus shows an OR of 10.48. This demonstrates that, as with traditional variant effect prediction methods, the consensus of multiple MOI-aware prediction methods can contribute to resolving variants of uncertain significance (VUSs), a major outstanding problem in the field of medical genetics.51,52
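A consensus of three MOI-aware tools can be sketched as a simple agreement rule. The rule below (a label is returned only when all three tools agree, otherwise no consensus call is made) is an assumption for illustration. The exact combination rule used by ConMOI is not spelled out in this section, and the example predictions are invented.

```python
def consensus_label(predictions):
    """Return the shared label if every tool agrees, otherwise None.

    predictions: dict mapping tool name to a predicted class
    ("recessive", "dominant", "benign") or None when a tool gives no call.
    """
    labels = list(predictions.values())
    if not labels or any(label is None for label in labels):
        return None
    first = labels[0]
    return first if all(label == first for label in labels) else None

# Invented predictions for two hypothetical variants.
agree = {"MOI-Pred": "recessive", "MAPPIN": "recessive", "MAVERICK": "recessive"}
disagree = {"MOI-Pred": "recessive", "MAPPIN": "dominant", "MAVERICK": "recessive"}
print(consensus_label(agree))
print(consensus_label(disagree))
```

A unanimous rule trades coverage for confidence: fewer variants receive a call, but each call is backed by three independent methods.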
Variant effect prediction benchmarks
To assess the model’s performance on standard benchmarks for variant effect prediction tasks, we collected an external benchmark set containing 735 recessive, 1,402 dominant, and 6,327 benign missense variants from the previously published ClinVar 2020 benchmark dataset.53 This benchmark was designed to exclude variants found in the training sets of many published variant effect prediction methods, including all methods with published training sets that are used as input features to MOI-Pred, as well as the ExoVar dataset used as the training data for MOI-Pred. We also collected an additional non-pathogenic benchmark set of 1,010 missense variants found at high AF in the Genome Medical Alliance Japan Whole Genome Aggregation panel v.1 (GEMJ-WGA)54 and absent from gnomAD. These GEMJ-WGA non-pathogenic variants mimic the bulk of benign variants reported as VUSs in a clinical setting: variants common in an undersampled population but absent from existing databases of variation.13,55,56,57,58 Since these non-pathogenic variants are absent from existing databases of variation, they also do not overlap with the training data for any input feature of MOI-Pred or the training data of MOI-Pred itself. Performance on this external benchmark set using ClinVar recessive, ClinVar dominant, and GEMJ-WGA non-pathogenic variants was similar to performance measured on the test set, with AUROC = 0.99/0.99/0.96 (SD = 2.7 × 10−3/1.8 × 10−3/6.9 × 10−3), sensitivity 0.86/0.99/0.95 (SD 1.8 × 10−2/2.8 × 10−2/1.7 × 10−2), and specificity 0.98/0.95/0.95 (SD 7.4 × 10−3/4.8 × 10−3/1.5 × 10−2) for recessive/dominant/benign classes (Figure 2). A secondary benchmark set using ClinVar benign variants instead of GEMJ-WGA variants showed similar performance (Figure S5). This suggests that the model is not overfitting the data sources used for training and testing (ExoVar and gnomAD), which would be indicated by a significant drop in performance between the internal blind test set and external benchmarks. 
Indeed, the model appears to perform better on the external benchmark set. This may reflect the fact that the ClinVar datasets used for external benchmarking contain more confident annotations and less noise than the ExoVar datasets used for training and testing.
To expand on this observation, we grouped variants by ClinVar review status (level of evidence for pathogenicity) and assessed how the confidence of variant annotations affects the sensitivity of our predictions. We found that sensitivity improved with higher review status: sensitivity for the recessive/dominant classes was 0.58/0.49 (SD 0.03/0.00) for 0-star review status, 0.68/0.59 (SD 0.04/0.00) for 1-star review status, and 0.86/0.89 (SD 0.04/0.00) for 2-star or higher review status (Figure S6). This is as expected if lower-confidence variants are less likely to be truly pathogenic. In this case, an accurate predictor would predict a smaller fraction of 0-star variants to be pathogenic, because the fraction of those variants that are truly pathogenic is smaller.
We also tested the inheritance prediction model on variants unique to a single ancestry group (European American, African American, or Hispanic American) to evaluate whether performance is consistent across ancestries. We found that sensitivity was uniformly high across all three ancestries, with no specific ancestry having substantially higher power (Figures S7 and S8). We also found that ancestry-specific variants across all three ancestries showed the same trend as the full dataset, with sensitivity improving in higher-confidence annotations. This demonstrates that MOI-Pred is not primarily powered to detect variants observed in Europeans but has similar performance regardless of ancestry.
Model interpretation
Examining the importance of different features in the model shows that a combination of functional, evolutionary, and combined information drives the inheritance prediction. One functional feature (ISPP, AD.rank score59 with 23.8%), two combined features (MutPred60 and M-CAP46 with 14.2% and 13.6%, respectively), and two evolutionary features (OE42 and FATHMM61 with 11.2% and 11%, respectively) together carry 73.8% of the models’ weight (Figure 3).
These feature weights represent the overall importance of features to the three-way classifier. To examine which features are important to identify each individual class, we trained three two-way classifiers to distinguish dominant from benign, recessive from benign, and dominant from recessive. Both benign-pathogenic binary prediction models (benign-dominant and benign-recessive) are dominated by features carrying combined functional and evolutionary information, namely M-CAP, MutPred, and VEST3,62 in addition to FATHMM, which primarily carries evolutionary information (Figures 3 and S8). In contrast, the dominant-recessive prediction is mainly driven by gene-level features carrying either functional or evolutionary information, like ISPP and OE (Figures 3 and S8).
Single-nucleotide variant association discovery
To test the utility of MOI-Pred and ConMOI for clinical assessment of individual variants, we first tested 6,382 ClinVar variants predicted pathogenic by MOI-Pred for association with 433 groups of International Classification of Diseases, 10th Revision (ICD-10) codes in 25,326 individuals from the BioMe biobank. We identified 18 variants showing significant associations with a single phenotype with an MOI corresponding to MOI-Pred’s prediction (Table 1). Of these, three had a corresponding consensus prediction from ConMOI (Table 1), which we also evaluated in independent ancestries (Tables S25–S28). Three variants were found to have significant recessive associations with disease, one of which (rs1800562 in the HFE gene) was also predicted recessive by the consensus. Interestingly, none of these are currently labeled as pathogenic in ClinVar. rs1800562 is labeled conflicting interpretation of pathogenicity and the other two are labeled benign. Moreover, 15 variants showed dominant association with disease, two of which were predicted dominant by the consensus. Six of these are labeled uncertain significance or conflicting interpretation of pathogenicity (including one of those with a consensus prediction, rs142885240 in the DSP gene) and six are labeled benign or likely benign; the remaining three are labeled pathogenic or likely pathogenic. 
In standard clinical practice, a consensus prediction of multiple computational methods is considered strong suggestive evidence and may prompt further investigation and reconsideration of the variant’s label, especially where the existing evidence for the variant is weak or conflicting.63,64 We suggest that ConMOI predictions should be treated the same way for rs1800562 and rs142885240, both of which have conflicting annotations of pathogenicity in ClinVar and conflicting annotations of MOI in OMIM but are confidently predicted pathogenic with a specific MOI by ConMOI, with observed phenotypic associations matching the consensus predictions. Similarly, of the six variants annotated benign or likely benign that MOI-Pred predicts as pathogenic, none have supporting evidence for the annotation recorded in ClinVar. While predictions from a single method are not as compelling as consensus predictions, MOI-Pred’s predictions still suggest a need for further study of these variants possibly leading to reclassification. These examples show the potential utility of ConMOI and MOI-Pred for discovery of disease associations.
Table 1.
Description of variant associations with clinical disease
A

| Variant ID | Gene | AF | ICD10 codes | p value | OR (95% CI) | DOM p value | DOM OR (95% CI) | OMIM | ClinVar |
|---|---|---|---|---|---|---|---|---|---|
| rs79985808 | SUMF1 | 0.016 | other degenerative diseases of basal ganglia (G23) and disorders of sphingolipid metabolism and other lipid storage disorders (E75) | 4.03 × 10−5 | 126.7 (118.5–134.9) | 1.61 × 10−2 | 4.52 (4.40–4.67) | recessive | benign |
| rs17144835 | DNAH11 | 0.061 | other congenital malformations of respiratory system (Q34) | 4.04 × 10−7 | 35.9 (31.9–39.9) | 5.64 × 10−1 | 1.34 (0.95–1.73) | recessive | benign |
| rs1800562 | HFE | 0.022 | disorders of mineral metabolism (E83) and genetic susceptibility to disease (Z15) | 1.10 × 10−9 | 13.7 (10.7–16.7) | 3.18 × 10−1 | 1.15 (0.63–1.67) | conflicting | conflicting interpretations of pathogenicity |

B

| Variant ID | Gene | AF | ICD10 codes | p value | OR (95% CI) | OMIM | ClinVar |
|---|---|---|---|---|---|---|---|
| rs145214720 | COL10A1 | 3.41 × 10−4 | osteochondrodysplasias (Q78) and osteochondrodysplasia with defects of growth of tubular bones and spine (Q77) | 3.64 × 10−8 | 352.3 (348.9–355.7) | dominant | likely benign |
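| rs34539681 | COL10A1 | 3.90 × 10−3 | osteochondrodysplasias (Q78) and osteochondrodysplasia with defects of growth of tubular bones and spine (Q77) | 2.03 × 10−6 | 23.4 (15.5–31.3) | dominant | benign |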
| rs140075817 | EXT2 | 2.27 × 10−4 | osteochondrodysplasias (Q78) | 2.39 × 10−6 | 248.1 (239.8–256.4) | conflicting | conflicting interpretations of pathogenicity |
| rs770821909 | EXT2 | 1.30 × 10−4 | osteochondrodysplasias (Q78) | 2.36 × 10−6 | 471.8 (461.1–482.4) | conflicting | uncertain significance |
| rs146098187 | EXT2 | 3.57 × 10−4 | osteochondrodysplasias (Q78) | 3.17 × 10−6 | 262.8 (254.2–271.4) | conflicting | benign |
| rs138495222 | EXT2 | 3.57 × 10−4 | osteochondrodysplasias (Q78) | 5.93 × 10−7 | 161.0 (153.9–168.1) | conflicting | conflicting interpretations of pathogenicity |
| rs35221558 | LEMD3 | 1.99 × 10−3 | osteochondrodysplasias (Q78) | 4.24 × 10−7 | 40.3 (36.1–44.5) | dominant | likely benign |
| rs142885240 | DSP | 1.44 × 10−3 | other congenital malformations of skin (Q82) | 5.72 × 10−7 | 16.04 (14.2–17.9) | conflicting | conflicting interpretations of pathogenicity |
| rs36105360 | LMNB1 | 9.89 × 10−3 | degenerative diseases of basal ganglia (G23) and disorders of sphingolipid metabolism and other lipid storage disorders (E75) | 7.74 × 10−7 | 7.7 (5.4–9.9) | dominant | benign |
| rs139644798 | RARS1 | 3.40 × 10−4 | degenerative diseases of basal ganglia (G23) and disorders of sphingolipid metabolism and other lipid storage disorders (E75) | 2.39 × 10−6 | 98.1 (91.5–104.6) | recessive | likely pathogenic |
| rs34637584 | LRRK2 | 1.78 × 10−3 | neoplasms of unspecified behavior (D49), Parkinson’s disease (G20) and leprosy (Hansen’s disease) (A30) | 1.61 × 10−6 | 4.7 (2.4–6.9) | dominant | pathogenic |
| rs141230910 | SDHB | 5.84 × 10−4 | genetic susceptibility to disease (Z15), phakomatoses (Q85), malignant neoplasm of other endocrine glands and related structures (C74), neoplasm of uncertain behavior of endocrine glands (D44), and malignant neoplasm of other and ill-defined digestive organs (C26) | 1.98 × 10−6 | 18.4 (14.9–21.8) | dominant | conflicting interpretations of pathogenicity |
| rs372115732 | TBX4 | 1.13 × 10−4 | primary disorders of muscles (G71) and congenital malformations of limb(s) (Q74) | 8.81 × 10−7 | 228.6 (220.1–237.1) | conflicting | likely benign |
| rs141707850 | FBN2 | 1.29 × 10−4 | other congenital musculoskeletal deformities (Q68) and other specified congenital malformation syndromes affecting multiple systems (Q87) | 1.34 × 10−6 | 153.3 (145.6–160.9) | dominant | uncertain significance |
| rs147272790 | MBD5 | 2.75 × 10−4 | nonrheumatic aortic valve disorders (I35); congenital malformations of great arteries (Q25); monosomies and deletions from the autosomes, not elsewhere classified (Q93); and other congenital malformations of skin (Q82) | 1.42 × 10−6 | 19.2 (15.6–22.8) | dominant | conflicting interpretations of pathogenicity |
| rs77375493 | JAK2 | 6.00 × 10−4 | other venous embolism and thrombosis (I82), polycythemia vera (D45), mast cell neoplasms of uncertain behavior (D47), myeloid leukemia (C92), and other and unspecified diseases of blood and blood-forming organs (D75) | 3.84 × 10−13 | 18.9 (16.6–21.2) | dominant | pathogenic |
Variants predicted recessive (A) and dominant (B) by MOI-Pred. Shaded rows are also predicted recessive or dominant by the consensus of MOI-Pred, MAPPIN, and MAVERICK (ConMOI). The significance threshold is set to p = 1.09 × 10−4 for the recessive association test and p = 7.83 × 10−6 for the dominant association test after a Bonferroni correction based on 455 and 6,382 tests, respectively. No homozygous carriers were found in the BioMe exome data for significant dominant associations. AF, allele frequency; OR, odds ratio; CI, confidence interval; MOI-Pred, mode-of-inheritance predictor; OMIM, Online Mendelian Inheritance in Man. “Conflicting” corresponds to genes having autosomal dominant and autosomal recessive inheritance label in OMIM; “dominant” corresponds to genes having autosomal dominant inheritance label in OMIM; “recessive” corresponds to genes having autosomal recessive inheritance label in OMIM; “Conf. Int.” corresponds to variants having conflicting interpretation of pathogenicity label in ClinVar; “benign” corresponds to variants having benign, likely benign, and/or benign/likely benign label in ClinVar; “pathogenic” corresponds to variants having pathogenic, likely pathogenic, and/or pathogenic/likely pathogenic label in ClinVar.
To further demonstrate the utility of MOI-Pred and ConMOI, we also tested 151,164 predicted recessive variants for association with 171 metabolites in 44,118 individuals from the UK Biobank. Five variants showed significant associations with metabolites, one of which (rs1800562 in HFE) is also predicted recessive by the consensus approach (Table 2). Of these five, two genes lack MOI annotation in OMIM (ACADL and SCFD1) and two have conflicting annotations (MTHFR and HFE). Importantly, while all of these are known disease genes, none of the metabolite associations reported are directly diagnostic of the disease reported in OMIM. These results demonstrate that MOI-Pred and ConMOI can identify both known and previously unannotated recessive associations.
Table 2.
Description of recessive variant associations with metabolites
| Variant ID | Gene | Metabolite | p value | Beta (SE) | OMIM |
|---|---|---|---|---|---|
| rs2286963 | ACADL | glycine | 1.12 × 10−4 | −0.05 (−0.90) | N/A |
| rs61754480 | SCFD1 | cholesterol in very small VLDL | 3.14 × 10−4 | 2.30 (−1.50) | N/A |
| rs61754480 | SCFD1 | free cholesterol in very small VLDL | 2.44 × 10−4 | 2.37 (−1.53) | N/A |
| rs35088381 | LOXHD1 | docosahexaenoic acid | 1.46 × 10−4 | −1.21 (0.02) | recessive |
| rs1801133 | MTHFR | glycine | 1.18 × 10−4 | 0.05 (−1.05) | conflicting |
| rs1800562 | HFE | clinical LDL cholesterol | 8.69 × 10−5 | −0.23 (−0.24) | conflicting |
| rs1800562 | HFE | LDL cholesterol | 1.50 × 10−4 | −0.22 (−0.26) | conflicting |
| rs1800562 | HFE | phospholipids in LDL | 1.31 × 10−4 | −0.23 (−0.25) | conflicting |
| rs1800562 | HFE | free cholesterol in LDL | 5.80 × 10−6 | −0.27 (−0.26) | conflicting |
| rs1800562 | HFE | total lipids in LDL | 2.73 × 10−4 | −0.22 (−0.29) | conflicting |
| rs1800562 | HFE | 3-hydroxybutyrate | 3.44 × 10−4 | −0.22 (−0.83) | conflicting |
| rs1800562 | HFE | total lipids in large LDL | 3.13 × 10−4 | −0.21 (−0.30) | conflicting |
| rs1800562 | HFE | phospholipids in large LDL | 1.80 × 10−4 | −0.22 (−0.28) | conflicting |
| rs1800562 | HFE | cholesterol in large LDL | 1.56 × 10−4 | −0.22 (−0.30) | conflicting |
| rs1800562 | HFE | free cholesterol in large LDL | 1.28 × 10−5 | −0.25 (−0.25) | conflicting |
| rs1800562 | HFE | phospholipids in medium LDL | 1.88 × 10−4 | −0.22 (−0.28) | conflicting |
| rs1800562 | HFE | free cholesterol in medium LDL | 3.57 × 10−6 | −0.27 (−0.23) | conflicting |
| rs1800562 | HFE | total lipids in small LDL | 2.86 × 10−4 | −0.22 (−0.26) | conflicting |
| rs1800562 | HFE | phospholipids in small LDL | 6.77 × 10−5 | −0.23 (−0.26) | conflicting |
| rs1800562 | HFE | cholesterol in small LDL | 1.51 × 10−4 | −0.23 (−0.25) | conflicting |
| rs1800562 | HFE | free cholesterol in small LDL | 5.94 × 10−7 | −0.29 (−0.20) | conflicting |
| rs1800562 | HFE | triglycerides in medium HDL | 1.38 × 10−4 | 0.22 (−0.46) | conflicting |
All rows are predicted recessive by MOI-Pred; shaded rows are also predicted recessive by the consensus of MOI-Pred, MAPPIN, and MAVERICK (ConMOI). SE, standard error; LDL, low-density lipoprotein; VLDL, very-low-density lipoprotein. “Conflicting” corresponds to genes having autosomal dominant and autosomal recessive inheritance label in OMIM; “recessive” corresponds to genes having autosomal recessive inheritance label in OMIM; “N/A” corresponds to genes having no inheritance annotation in OMIM.
Discussion
Here we introduce MOI-Pred, a computational tool that predicts variant pathogenicity including MOI for missense variants using a unique combination of evolutionary and functional information. We also present a consensus method, ConMOI, which combines the predictions of MOI-Pred and two other published methods, MAPPIN37 and MAVERICK.40 Each of these methods computes a three-way prediction, classifying each variant as pathogenic for AR disease, pathogenic for AD disease, or benign. While many existing binary methods can identify pathogenic variants in AD disease (e.g., O/E,42 CADD,44 phyloP65), MOI-Pred specifically targets the problem of discriminating AR pathogenic variants from benign variants. This is a long-standing issue in genetics, and current annotation pipelines are known to underperform on AR variants.34,36,66,67 There are several pre-existing methods designed to predict recessive disease genes (pRec,42 ISPP,59 srML66), but all these methods report poor performance on known AR disease genes, and in all cases the authors recommend against using them in real-world applications.
MOI-Pred and ConMOI benefit from several key innovations. First, we apply a consensus prediction approach to MOI predictions. Such consensus approaches are widely used to augment binary predictions of variant effect, but they have not previously been applied to three-way predictions including MOI. In the context of binary variant effect prediction, it is well established that the consensus of multiple methods is substantially more reliable than any individual method.13,68 As expected, our analyses show that the same holds for predictions including MOI. As methods that make more detailed predictions about phenotypes or mechanism of pathogenicity become more common, we anticipate that consensus approaches will also be able to incorporate these predictions.
Second, we validated the utility of ConMOI and its component methods in a real-world clinical case scenario with a benchmark based on predicting disease case-control status in EHR data from the BioMe biobank.47 We demonstrated both that MOI-Pred’s predictions of variant effect are significantly associated with the likelihood of carriers developing Mendelian disease in clinical settings, and, as discussed above, that consensus predictions from ConMOI are substantially more predictive of Mendelian disease than individual predictions from any one method. This is true for variants annotated as pathogenic, benign, and VUS, as well as novel and ancestry-specific variants. We also found individual variants where the MOI-Pred and ConMOI predictions differed from their clinical annotation in ClinVar/OMIM. In these cases, we investigated both the evidence supporting the original annotation and our own EHR-based validation and verified that MOI-Pred and ConMOI are likely correct in reclassifying these variants’ pathogenicity predictions. These analyses demonstrate that computational MOI predictions can inform clinical decision making, particularly in large-scale electronic health systems. With increasingly available EHR-linked biobanks, we anticipate the clinical validation introduced in this study will be applied more broadly to evaluate variant prediction tools in the future.
Third, the MOI-Pred method in particular combines evolutionary and functional annotations on both the gene and variant level to predict variant effect, including MOI for individual variants. Previous methods predict MOI on a gene level, and indeed some of these methods are among the most informative features of MOI-Pred, although they perform poorly as standalone scores at distinguishing AR disease genes, as mentioned above. However, even if these scores could reliably identify which genes are implicated in AR disease, existing predictors of variant pathogenicity would still have reduced power to predict pathogenicity for variants in these genes. By jointly predicting MOI and pathogenicity, MOI-Pred is able to make meaningful predictions of pathogenicity for AR variants. The combination of annotations from multiple sources also provides an important advantage when incorporating MOI information into predictions, since different annotations are known to have different error profiles. In particular, it has recently been shown that evolutionary scores are primarily sensitive to heterozygote effects, making these methods very likely to misclassify AR pathogenic variants as benign.34,66 Most predictors of pathogenicity rely primarily on such scores and are therefore insensitive to AR pathogenic variants. By integrating multiple sources of annotation, MOI-Pred can inform the scores that are sensitive to a specific inheritance mechanism. For example, O/E, a score that relies exclusively on evolutionary constraint, is very likely to confuse AR with benign; while MutPred, a score that incorporates biophysical properties of proteins,60,69 is more likely to categorize AR variants as pathogenic. Accordingly, MOI-Pred relies on O/E to discriminate AD from benign and on MutPred to discriminate AR from benign.
Fourth, MOI-Pred is trained with a population-derived list of benign variants and benchmarked on population-derived benign variants unknown to its constituent scores. A rigorous benign training set is crucial to distinguishing AR from benign, the most difficult classification task addressed by MOI-Pred. Since we use clinically validated pathogenic variants for training, it is tempting to use clinically validated benign variants as well, but this can bias the training set. Clinically validated benign variants were suspected pathogenic at some point and therefore may have features distinct from true benign variants.70 The ideal source of benign variants should be found at sufficiently high frequency in a healthy human population to ensure they do not affect gene function.13,71 Thus, we used frequency-matched variants from a large population database (gnomAD) as presumed non-pathogenic controls in our training set, an approach that has been used by previous methods such as CADD44 and VEST3.62 However, using these variants introduces an additional problem: population genetics scores used as components in our prediction model are often themselves derived from the same populations, introducing bias and the risk of overfitting.48 We addressed this problem in part by validating our method with common variants from a recently released database of 7,609 aggregated Japanese genomes as a benchmark set. At the time of analysis, nearly all of these samples had not yet been incorporated into widely used population databases, and all genetics scores were therefore naive to it. 
In addition to protecting against overfitting on variant frequency in gnomAD, this dataset is also a much better approximation of real VUS, since it represents likely benign variants that are common in an undersampled population but are absent from existing population databases due to this undersampling.13,55,56,57,58 Such training and benchmarking sets allow for more precise discrimination between AR and benign variants while providing reliable performance metrics comparable to clinical case scenarios dealing with unknown variants.
Conclusions
A three-way variant effect classifier including MOI, built using functional and evolutionary information, can accurately discriminate missense variants that are pathogenic for recessive-acting diseases and adds substantially to existing methods when incorporated in a consensus classifier. Additionally, we introduce an EHR-based validation approach using real-world clinical data and show recessive predictions are enriched for known and novel recessive mechanisms of diseases and metabolites.
Limitations of the study
Our methods have several limitations and areas for future work. First, it remains uncertain whether the performance we observe in the test and benchmark sets will hold in real applications. Many existing tools have reported similarly high performance in their authors’ internal testing and lower performance in unbiased replication analyses.12 Many have also failed to find clinical utility despite numerically high performance.64,72 One likely reason for this is the problem of data leakage, where the same sources of information used to label the training data are also informative about the validation data.48,49,50 This problem arises because there is a limited number of variants that are confidently known to be pathogenic or benign, and supposedly independent datasets that identify these known variants typically have large overlap. Ensemble methods like MOI-Pred have the additional problem that many of the individual features used as inputs to the method have their own training data, which may overlap with validation data without the authors knowing. We have attempted to exclude training variants from our benchmark data, both by using a published benchmark that explicitly excludes the training sets of most available methods73 and by developing a benchmark using novel benign variants exclusive to the Japanese population. There is also a similar concern about leakage of gene-level information, since MOI-Pred incorporates gene-level annotations that are shared between all variants in the same gene. We addressed this issue by modifying our cross-validation procedure so that no variant in the test set can be in the same gene as any variant in the training set, which prevents overfitting on these gene-level annotations. These steps are likely to mitigate the problem of data leakage and improve the credibility of our performance estimates. Even more promisingly, our EHR-based clinical validation completely avoids the problem of data leakage. 
Instead of comparing predictions of variant effect to expert annotations of variant effect, the EHR-based validation procedure compares predictions of variant effect to actual phenotypes diagnosed in carriers, without any reference to external annotations. Since no external annotations are used in evaluating model performance, there is no opportunity for the model’s training data to leak into external annotations. In addition to this benefit, the EHR-based validation also more closely resembles the real-world applications of variant effect prediction methods, suggesting that results will hold74 in real-world clinical data. Nevertheless, we believe validation in other clinical datasets and by other groups is needed.
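The gene-level safeguard described above (no test variant may share a gene with any training variant) can be illustrated with a minimal sketch. This is not the authors' R implementation; the function names, the round-robin fold assignment, and the example gene list are assumptions made for illustration only.

```python
import random

def gene_grouped_folds(variant_genes, n_folds=5, seed=0):
    """Assign each variant to a fold so that all variants of a gene share one fold.

    variant_genes: list of gene symbols, one per variant (hypothetical input).
    Returns a list of fold indices, one per variant.
    """
    genes = sorted(set(variant_genes))
    rng = random.Random(seed)
    rng.shuffle(genes)
    # Round-robin over the shuffled genes: each fold gets a similar number of genes.
    gene_to_fold = {gene: i % n_folds for i, gene in enumerate(genes)}
    return [gene_to_fold[gene] for gene in variant_genes]

def train_test_indices(folds, test_fold):
    """Split variant indices into training and test sets for one held-out fold."""
    train = [i for i, fold in enumerate(folds) if fold != test_fold]
    test = [i for i, fold in enumerate(folds) if fold == test_fold]
    return train, test

# Illustrative input only: ten variants spread over seven genes.
genes = ["LRRK2", "LRRK2", "SDHB", "HFE", "HFE", "HFE", "JAK2", "FBN2", "TBX4", "MTHFR"]
folds = gene_grouped_folds(genes, n_folds=3, seed=1)
for k in range(3):
    train, test = train_test_indices(folds, k)
    # No gene appears on both sides of the split.
    assert {genes[i] for i in train}.isdisjoint({genes[i] for i in test})
```

Because folds are balanced by gene count rather than variant count, genes with many variants can make fold sizes uneven; a production version would likely need to account for that.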
Second, our three-way predictions, although more complete than typical binary predictions, do not completely account for all forms of MOI. Phenomena such as incomplete dominance, overdominance, and heterozygote advantage, all of which are well documented in human disease,74,75,76 are unaccounted for in our simple recessive-dominant-benign classification, as well as the equivalent classification made by MAPPIN and MAVERICK. These phenomena are also severely under-reported in current clinical practice and databases: variants are generally labeled AD if they cause a phenotype in heterozygotes and AR if they cause a phenotype in homozygotes.77 Variants with intermediate inheritance, which cause severe disease in homozygotes and less severe disease in heterozygotes, are labeled inconsistently based on whether the variant was detected in a homozygous or heterozygous patient. The situation is even more confusing for variants that cause different phenotypes in homozygotes and heterozygotes. Likewise, MOI itself is far from the only refinement that can be added to variant and gene annotations. The field would benefit enormously from better annotation of gain-of-function variants, disease-suppressor variants, or uniparental imprinted variants, to name just a few. Our decision to set these complexities aside and classify Mendelian diseases as either AR or AD is based on existing databases of clinical gene and variant annotations, which use these labels almost exclusively. Hence, a more complete system of nomenclature is warranted to annotate the full range of Mendelian inheritance.
Third, the MOI label used to build the classifier model for MOI-Pred was obtained only at the gene level. This is also true of the other methods incorporated in ConMOI and highlights a limitation in the field, where MOI information is incomplete and lacks granularity. Accordingly, variant-level MOI is only known for a handful of phenotypes, and, where it does exist, it is often annotated inconsistently. On the other hand, the extensive set of variant-level features used to build MOI-Pred enables variant-level prediction of MOI, at least in principle. As we would expect given the gene-level MOI labels in the training data, MOI-Pred usually predicts all pathogenic variants in the same gene to have the same MOI. In the small number of cases where the same gene contains both dominant and recessive predicted variants, it is unclear whether these predictions represent a reliable variant-level prediction or uncertainty in the MOI predicted at a gene level. MAPPIN and MAVERICK each used slightly different strategies from MOI-Pred to assign MOI in the training set, and it is also unclear how the biases introduced by each of these strategies interact in the ConMOI consensus predictions. More complete and accessible annotations of variant-level MOI would allow us to test this rigorously and to collect a training set that does not have this limitation. We believe that these methods are a step toward this future of more complete and granular annotations of functional effects of variants. This will enable variant function prediction to go beyond a binary prediction of pathogenicity so that the picture of variant effects formed by computational annotation begins to resemble the true complexity of actual phenotypes.
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Dr. Ron Do (ron.do@mssm.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
- Pre-computed predictions of variant-level MOI using ConMOI and MOI-Pred are publicly available at https://doi.org/10.5281/zenodo.5565246. Variant sets used to benchmark the tools will be made available upon request.
- Computational programs and scripts to reproduce the results are publicly available at https://github.com/rondolab/mode-of-inheritance. An archival DOI is listed in the key resources table.
- Any additional information needed to re-analyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
I.S.F. is supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) (T32-GM007280). R.D. is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836) and the National Heart, Lung, and Blood Institute of the NIH (R01-HL139865 and R01-HL155915). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author contributions
Conceptualization, B.O.P., D.M.J., and R.D.; methodology, B.O.P., D.J.B., D.M.J., and R.D.; formal analysis, B.O.P., D.M.J., and R.D.; resources, B.O.P., D.J.B., I.S.F., J.C., G.R., D.M.J., and R.D.; data curation, B.O.P., D.J.B., I.S.F., J.C., G.R., D.M.J., and R.D.; writing – original draft, B.O.P., D.M.J., and R.D.; writing – review & editing, B.O.P., D.J.B., I.S.F., J.C., G.R., D.M.J., and R.D.; project administration, J.C. and R.D.; supervision, D.M.J. and R.D.; funding acquisition, R.D.
Declaration of interests
R.D. is a scientific co-founder, consultant, and equity holder for Pensieve Health (pending) and is a consultant for Variant Bio and Character Bio, outside of the submitted work.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | ||
| Variant-level MOI predictions from MOI-Pred and ConMOI for all possible missense variants | This paper | https://doi.org/10.5281/zenodo.5565246 |
| Variant-level pathogenicity information used for external evaluation | ClinVar | https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/archive/2020/variant_summary_2020-06.txt.gz |
| Variant-level pathogenicity information used for training | ExoVar | https://pmglab.top/kggseq/download.htm |
| Gene-level MOI information | OMIM | May 2020 release |
| General population variants used for external evaluation | GEM | v.1 |
| Curated dataset of pathogenic and benign variants used for benchmarking | ClinVar Benchmark | Pejaver et al.53 |
| Software and algorithms | ||
| Software for statistical analyses | R | v3.5.3 |
| Algorithm to compute area under the receiver operator characteristic | pROC | v1.14.0 |
| Package grouping algorithm for model training and evaluation | caret | v6.0.84 |
| Package to perform meta-analyses in R | metafor | v3.0.2 |
| Package to compute Cochran-Mantel-Haenszel in R | stats | v3.6.2 |
| Codes and scripts to reproduce MOI-Pred predictions | This paper | https://doi.org/10.5281/zenodo.5565246 |
Method details
Variant collection
Missense variants from publicly available resources were used to generate all datasets. For the training set, pathogenic variants were obtained from ExoVar.16 Presumed non-pathogenic variants were selected from the Genome Aggregation Database (gnomAD)42 v2.1.1, excluding variants already present in the pathogenic training set. GnomAD variants were chosen to match the allele frequency (AF) of pathogenic variants to within 0.1%, based on minor allele frequency in the entire gnomAD population; singletons were chosen to match variants not present in gnomAD. This approach is designed to prevent overfitting on features incorporating population frequency. While this ExoVar dataset likely contains some overlap with the training data of the input features, it has been used previously to train ensemble methods without apparent overfitting.37,48,78,79,80 For the external benchmark, pathogenic variants were obtained from the “ClinVar 2020” benchmark set, recently used to calibrate computational tools73 (ClinVar,27 release June 2020), and presumed non-pathogenic variants were selected from the Genome Medical Alliance Japan Whole Genome Aggregation panel v.1 (GEMJ-WGA),54 defined as variants with AF ≥ 1% in GEMJ-WGA that are absent or singleton in gnomAD. Unlike ExoVar, these benchmarks are unlikely to overlap the training sets for any of the input features, as the ClinVar 2020 benchmark explicitly excludes the training sets of nearly all available variant effect prediction methods, while the GEMJ-WGA benchmark is defined to exclude variants that were known to be common before the release of the GEMJ-WGA resource. Gene-level MOI information for pathogenic variants was obtained from the Online Mendelian Inheritance in Man (OMIM)26 (release May 2020).
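The two selection rules in this section (frequency-matched gnomAD controls for training, and the GEMJ-WGA filter for benchmark controls) can be sketched as follows. This is an illustrative sketch, not the authors' code: the tuple layout of the control pool is hypothetical, and "within 0.1%" is interpreted here as an absolute AF difference of at most 0.001, which is an assumption.

```python
import random

def match_controls_by_af(pathogenic_afs, control_pool, tolerance=0.001, seed=0):
    """Draw one unused control per pathogenic variant, matched on allele frequency.

    pathogenic_afs: AF of each pathogenic variant (0.0 if absent from the database).
    control_pool: list of (variant_id, af, allele_count) tuples (hypothetical schema).
    Pathogenic variants absent from the database are matched to singleton controls.
    Returns a list of matched variant IDs (None where no control qualifies).
    """
    rng = random.Random(seed)
    available = list(control_pool)
    rng.shuffle(available)  # random choice among equally valid controls
    matched = []
    for af in pathogenic_afs:
        for j, (variant_id, control_af, allele_count) in enumerate(available):
            if af == 0.0:
                qualifies = allele_count == 1
            else:
                qualifies = abs(control_af - af) <= tolerance
            if qualifies:
                matched.append(variant_id)
                del available[j]  # sample without replacement
                break
        else:
            matched.append(None)
    return matched

def is_benchmark_control(af_gemj, gnomad_allele_count):
    """Benchmark rule from the text: AF >= 1% in GEMJ-WGA, absent or singleton in gnomAD."""
    return af_gemj >= 0.01 and gnomad_allele_count <= 1

# Illustrative pool only; IDs and frequencies are made up.
pool = [("c1", 0.0100, 2500), ("c2", 0.0004, 100), ("c3", 0.000004, 1), ("c4", 0.2, 50000)]
matches = match_controls_by_af([0.0, 0.0102, 0.05], pool)
```

In this toy pool, the absent variant is matched to the only singleton (`c3`), the variant with AF 0.0102 is matched to `c1`, and the variant with AF 0.05 has no control within tolerance.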
Variant annotation
Variants were characterized using functional and evolutionary information. ANNOVAR81 was used to annotate variant-level features. We included all available features from ANNOVAR that could be applied to missense variants. These include 15 features built on evolutionary information (e.g., phyloP,65 FATHMM,61 GERP,82 PROVEAN,83 etc.), 42 features built on both evolutionary and functional information (e.g., M-CAP,46 CADD,44 VEST3,62 MutationTaster,84 etc.), and 14 population frequency features (e.g., cg69,85 Kaviar,86 GME,87 etc.). To these we added 7 gene-level features, which were retrieved manually from their original sources: 2 gene-level features built on evolutionary information (OE score42 and s_het88), 4 gene-level features built on functional information (Episcore,89 AD rank,59 StringAD and StringAR90), and 1 gene-level feature combining the two types of annotation (HI91). The full list of features and their source of information can be found in Table S1. All features were retrieved in 2020, prior to the publication of the ClinVar 2020 or GEMJ-WGA resources used as external benchmarks.
Data trimming and imputation
Features with more than 60% missing values and/or high correlation (Pearson’s r ≥ 0.8) in the training set were removed. For each pair of correlated features, the one with the higher mean absolute correlation across all features was removed. Variants with more than 60% missing values in the remaining set of features were removed from both the training and external benchmark sets. Missing values were imputed first on variant-level features. The resulting dataset was then used to impute gene-level features, ensuring low intra-gene variation in these annotations. A random forest-based algorithm (missForest v1.492) was used for both imputations, based on previous analyses.93 The final dataset comprised 30 features on 5,872 and 1,526 variants from the training and external benchmark sets, respectively.
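The feature-trimming rule can be sketched in a few lines. The sketch below is a simplified illustration rather than the authors' R code: it applies the threshold to the absolute value of Pearson's r, computes correlations on pairwise-complete observations, and visits correlated pairs from strongest to weakest, all of which are assumptions. Feature names and values are made up, and constant columns are not handled.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def trim_features(columns, max_missing=0.6, max_abs_r=0.8):
    """columns: dict of feature name -> list of values (None = missing).

    Returns the names of the features retained after both filters.
    """
    n = len(next(iter(columns.values())))
    # Step 1: drop features with more than max_missing missing values.
    kept = [name for name, values in columns.items()
            if sum(v is None for v in values) / n <= max_missing]
    if len(kept) < 2:
        return kept

    def abs_r(a, b):
        pairs = [(x, y) for x, y in zip(columns[a], columns[b])
                 if x is not None and y is not None]
        xs, ys = zip(*pairs)
        return abs(pearson(xs, ys))

    # Step 2: absolute correlation for every ordered pair of retained features.
    r = {(a, b): abs_r(a, b) for a in kept for b in kept if a != b}
    mean_r = {a: sum(r[a, b] for b in kept if b != a) / (len(kept) - 1) for a in kept}
    # Step 3: for each correlated pair, drop the member with the higher mean |r|.
    dropped = set()
    for (a, b), value in sorted(r.items(), key=lambda item: -item[1]):
        if value >= max_abs_r and a not in dropped and b not in dropped:
            dropped.add(a if mean_r[a] >= mean_r[b] else b)
    return [name for name in kept if name not in dropped]

# Illustrative data: score_b nearly duplicates score_a; 'sparse' is 80% missing.
columns = {
    "score_a": [0.1, 0.4, 0.5, 0.9, 0.7],
    "score_b": [0.2, 0.8, 1.1, 1.8, 1.3],
    "score_c": [0.9, 0.1, 0.6, 0.2, 0.8],
    "sparse": [None, None, None, None, 1.0],
}
retained = trim_features(columns)
```

Here `sparse` fails the missingness filter, and `score_b` is dropped in favor of `score_a` because it has the higher mean absolute correlation with the other features.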
Workflow to train inheritance prediction models
A machine learning (ML) workflow was used to develop MOI and pathogenicity prediction models. To minimize sampling biases, 100 models were trained, tested and validated using random sets of variants. The workflow is described below for a single iteration (Figure 1). A random sample of 90% of available Dominant variants from the training set plus equal numbers of Recessive and non-Pathogenic variants constituted a balanced Train set. The remaining variants from the training set were used to sample a balanced 10% Test set. The validation benchmark set consisted of all Recessive variants available in the external benchmark dataset plus equal numbers of randomly sampled Dominant and Non-Pathogenic variants. Scaling and feature selection (using recursive feature elimination, a wrapper random forest-based approach) were performed on the Train set using the caret package v6.0.8494 available in R, then applied accordingly to the Test and Benchmark sets. A three-class (Recessive, Dominant, Benign/Non-Pathogenic) random forest algorithm95 was then fitted to the Train set using 10-fold cross validation to optimize parameter tuning and limit overfitting. A tree-based model was used, as such models are easily interpretable and outperform state-of-the-art deep learning methods on tabular data.96 Within tree-based models, a random forest algorithm was preferred because it decorrelates individual trees by considering only a random subset of features at each split, which reduces overfitting. Three two-class random forest algorithms (Dominant vs. Recessive, Dominant vs. Benign, Recessive vs. Benign) were fitted in parallel for subsequent feature importance analyses. Variant effect labels were then predicted on the Test and Benchmark sets to compute performance metrics. This entire procedure was repeated 100 times; reported performance statistics (see results) correspond to the mean and standard deviation (SD) across all 100 runs. 
To allow easy reproducibility of our analysis, an adaptable version of the R code used to train MOI-Pred is available at https://github.com/rondolab/mode-of-inheritance.
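The class-balanced sampling step of a single iteration can be sketched as below. This is an illustrative sketch, not the published R code: the function name is hypothetical, it assumes every class has at least as many variants as 90% of the Dominant class, and it sizes the balanced Test set by the smallest number of leftover variants in any class, which is an assumption about how "balanced 10%" was realized.

```python
import random

def balanced_split(labels, train_fraction=0.9, seed=0):
    """Return (train_idx, test_idx), each with equal numbers of every class.

    labels: list containing 'Dominant', 'Recessive', or 'Benign' for each variant.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
    # Train size per class is set by the Dominant class, as described in the text.
    n_train = int(train_fraction * len(by_class["Dominant"]))
    # Test size per class is capped by the class with the fewest leftover variants.
    n_test = min(len(indices) - n_train for indices in by_class.values())
    train, test = [], []
    for indices in by_class.values():
        train += indices[:n_train]
        test += indices[n_train:n_train + n_test]
    return train, test

# Illustrative labels only: 10 Dominant, 12 Recessive, 15 Benign variants.
labels = ["Dominant"] * 10 + ["Recessive"] * 12 + ["Benign"] * 15
train_idx, test_idx = balanced_split(labels, seed=42)
```

Repeating this with 100 different seeds and averaging the resulting metrics would mirror the repeated-sampling design described above.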
The area under the receiver operating characteristic curve (AUROC) was calculated using the pROC package v1.14.097 available in R v3.5.3.98 To obtain a per-class discrimination metric, the remaining two labels were treated as negative classes. Accuracy, sensitivity, specificity and positive/negative predictive values (PPV/NPV) were computed, and the ML framework was implemented, using the caret package.
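The per-class (one-vs-rest) AUROC can be illustrated with a small rank-based computation. This sketch does not reproduce pROC; it uses the equivalent definition of AUROC as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The class probabilities below are made up.

```python
def auroc(scores, is_positive):
    """AUROC as P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auroc(class_probs, labels, target):
    """Treat `target` as the positive class and the other two labels as negatives.

    class_probs: list of dicts mapping class name -> predicted probability.
    """
    scores = [probs[target] for probs in class_probs]
    return auroc(scores, [label == target for label in labels])

# Illustrative predictions for four variants.
probs = [
    {"Recessive": 0.7, "Dominant": 0.2, "Benign": 0.1},
    {"Recessive": 0.1, "Dominant": 0.8, "Benign": 0.1},
    {"Recessive": 0.4, "Dominant": 0.1, "Benign": 0.5},
    {"Recessive": 0.3, "Dominant": 0.3, "Benign": 0.4},
]
labels = ["Recessive", "Dominant", "Benign", "Recessive"]
recessive_auc = one_vs_rest_auroc(probs, labels, "Recessive")  # 0.75
dominant_auc = one_vs_rest_auroc(probs, labels, "Dominant")    # 1.0
```

The pairwise loop is quadratic in the number of variants, which is fine for illustration but would be replaced by a rank-based formula for large datasets.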
Three-way variant effect predictions (pathogenic for AR disease, pathogenic for AD disease, or benign) for all possible missense variants in the human genome build hg38 are available at https://github.com/rondolab/mode-of-inheritance.
Clinical validation of inheritance prediction models in electronic health records
A single three-class random forest algorithm was fitted, tested and validated as described above to predict MOI and pathogenicity in genotype data from 29,981 individuals in the BioMe biobank.47 The latter is a multiethnic, EHR-linked, clinical care biobank of more than 60,000 samples from individuals recruited at the Mount Sinai Health System between 2007 and 2015. Participants were genotyped using the Illumina Global Screening Array, imputation was performed using the 1000 Genomes Phase 3 reference panel, and genetic ancestry was determined through k-means clustering of principal components. Longitudinal biomedical traits including diagnostic codes and laboratory test results were obtained mainly through ambulatory care practices, resulting in a high median number of encounters per patient.99 Only variants present in ClinVar (release June 2020) were considered for subsequent analyses. ClinVar’s phenotype information was mapped to 456 categories of International Classification of Diseases, 10th Revision (ICD-10) diagnostic codes using information from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT)100 and Orphanet.101
Stratified contingency table analyses were performed to test recessive, dominant and benign models on variants predicted with the corresponding effect. Variants were grouped by (1) ClinVar clinical significance label (Pathogenic/Likely Pathogenic, Benign/Likely Benign, Uncertain Significance, or Conflicting Interpretation); (2) ICD-10 code category corresponding to the disease phenotype associated with that clinical significance label; and (3) variant effect prediction produced by MOI-Pred (Dominant, Recessive, Benign). 2 × 2 contingency tables of phenotype vs. genotype were constructed for each grouping. In each grouping, an individual’s phenotype was considered “affected” if that individual was diagnosed with an ICD-10 code matching the given disease phenotype, and “unaffected” otherwise. For alleles predicted Recessive, an individual’s genotype was considered “carrier” if that individual was homozygous for any allele annotated in ClinVar with the given clinical significance label and the given disease, and “non-carrier” otherwise; for all other alleles (predicted Dominant or predicted Benign), an individual’s genotype was considered “carrier” if that individual carried any such allele at all, regardless of genotype, and “non-carrier” if that individual carried no such allele. 2 × 2 contingency tables of phenotype vs. genotype for each of the 456 disease categories were combined using a Cochran–Mantel–Haenszel (CMH) test, weighting each disease table by the inverse of disease prevalence, to obtain a single odds ratio (OR), 95% confidence interval (CI) and corresponding p-value for each combination of ClinVar significance label and MOI-Pred prediction. Individuals were divided into four self-reported ancestry groups (European-American, African American, Hispanic-American, and Other), and the analysis was conducted independently for each ancestry group. The results were then aggregated across ancestries using an inverse variance meta-analysis.
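The carrier definitions and the pooling of per-disease 2 × 2 tables can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it computes the standard Mantel-Haenszel pooled odds ratio and does not reproduce the inverse-prevalence weighting, the confidence interval, or the p-value described above. Function names and the example counts are hypothetical.

```python
def is_carrier(alt_allele_counts, predicted_class):
    """Carrier status for one individual within one variant grouping.

    alt_allele_counts: alternate-allele count (0, 1, or 2) at each variant in the group.
    Recessive predictions require homozygosity at some variant; Dominant and Benign
    predictions require at least one copy of any variant.
    """
    threshold = 2 if predicted_class == "Recessive" else 1
    return any(count >= threshold for count in alt_allele_counts)

def two_by_two(carrier_flags, affected_flags):
    """Count (a, b, c, d) = (carrier/affected, carrier/unaffected,
    non-carrier/affected, non-carrier/unaffected)."""
    a = b = c = d = 0
    for carrier, affected in zip(carrier_flags, affected_flags):
        if carrier and affected:
            a += 1
        elif carrier:
            b += 1
        elif affected:
            c += 1
        else:
            d += 1
    return a, b, c, d

def mantel_haenszel_or(tables):
    """Standard Mantel-Haenszel pooled odds ratio across strata (disease categories)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# Illustrative strata: two disease categories with 100 individuals each.
tables = [(4, 6, 10, 80), (2, 8, 5, 85)]
pooled_or = mantel_haenszel_or(tables)  # (3.2 + 1.7) / (0.6 + 0.4) = 4.9
```

Under this definition a compound heterozygote (one copy at each of two different variants) is not a carrier for a Recessive grouping, which follows the text's wording "homozygous for any allele".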
For each of the three MOI-Pred prediction classes (Recessive, Dominant, and Benign), the analysis was repeated twice, once for all variants with the corresponding prediction and once for all other variants. A Q-test for heterogeneity was performed to test whether there was a significant difference between each set of variants and its complement. For example, the Recessive version of this Q-test tests whether being homozygous for a variant with a Recessive prediction produces more risk of disease than being homozygous for a variant with a Dominant or Benign prediction. A secondary analysis restricting to variants with ClinVar review status of two stars or higher was also performed. The stats package v3.6.298 was used to perform the CMH test and the metafor package v3.0.2102 was used for the Q-test.
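The inverse-variance aggregation and the Q-test comparing a set of variants with its complement can be sketched as below. This is an illustrative stand-in for the metafor-based analysis, not a reproduction of it: it operates on log odds ratios with their standard errors, and the two-group Q statistic is referred to a chi-square distribution with one degree of freedom. The input values are made up.

```python
from math import erfc, log, sqrt

def inverse_variance_pool(estimates, standard_errors):
    """Fixed-effect pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))

def q_test_two_groups(est1, se1, est2, se2):
    """Cochran's Q for two estimates; returns (Q, p) with 1 degree of freedom."""
    pooled, _ = inverse_variance_pool([est1, est2], [se1, se2])
    q = (est1 - pooled) ** 2 / se1 ** 2 + (est2 - pooled) ** 2 / se2 ** 2
    # Survival function of a chi-square variable with 1 df: P(X > q) = erfc(sqrt(q / 2)).
    p = erfc(sqrt(q / 2))
    return q, p

# Illustrative log odds ratios: predicted-Recessive variants vs. all other variants.
log_or_recessive, se_recessive = log(4.0), 0.3
log_or_other, se_other = log(1.2), 0.2
q, p = q_test_two_groups(log_or_recessive, se_recessive, log_or_other, se_other)
```

For two groups, Q reduces to the squared difference of the estimates divided by the sum of their variances, which gives a quick sanity check on the result.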
Single nucleotide variant association discovery for ClinVar variants
ClinVar variants with an ICD-10 code mapping and an MOI-Pred prediction were tested for dominant and recessive association with groups of ICD-10 codes. The analysis was performed within each ancestry group (7,473 European-Americans, 6,222 African-Americans, 8,380 Hispanic-Americans, and 3,251 individuals of other ancestry) using standard logistic regression models, followed by a fixed-effects meta-analysis, using plink v1.9 (ref. 103). Whole-exome sequencing data and EHRs from the BioMe biobank were used in the analysis; 10 principal components were used as covariates to account for population stratification. For statistical significance, we used Bonferroni-corrected p-value thresholds of 1.09 × 10⁻⁴ for the 455 recessive association tests and 7.83 × 10⁻⁶ for the 6,382 dominant association tests.
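The Bonferroni thresholds can be checked with a short calculation. The sketch below assumes a family-wise significance level of 0.05, which is not stated explicitly in the text but is consistent with the reported values; `bonferroni_threshold` is a hypothetical helper name.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold under the Bonferroni correction.

    alpha=0.05 is an assumed family-wise error rate, not a value quoted in the text.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


recessive_threshold = bonferroni_threshold(455)   # close to the reported 1.09e-4
dominant_threshold = bonferroni_threshold(6382)   # close to the reported 7.83e-6
```

Under this assumption, both thresholds agree with the reported values to within the last reported digit.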
Single nucleotide variant association discovery for metabolites
We included all missense variants that were predicted recessive by MOI-Pred and were polymorphic in the whole-exome sequencing data of the UK Biobank participants with metabolite data. The analysis was performed within each ancestry group (42,278 European, 972 Central/South Asian, 387 African, 260 East Asian, 132 Middle Eastern, and 89 Admixed American ancestry individuals) using standard logistic regression models, followed by a fixed-effects meta-analysis, using plink v1.9 (ref. 103). The first 10 principal components were used as covariates to account for population stratification. For statistical significance, we used a p-value threshold of 3.31 × 10⁻⁷, corrected for 151,164 recessive association tests and two orders of magnitude lower than the dominant p-value.
Quantification and statistical analysis
Details regarding computational performance metrics, validation procedures, and statistical tests are described in the sections above.
Published: December 9, 2024
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2024.100914.
References
- 1. Yang Y., Muzny D.M., Reid J.G., Bainbridge M.N., Willis A., Ward P.A., Braxton A., Beuten J., Xia F., Niu Z., et al. Clinical Whole-Exome Sequencing for the Diagnosis of Mendelian Disorders. N. Engl. J. Med. 2013;369:1502–1511. doi: 10.1056/NEJMoa1306555.
- 2. Posey J.E., Harel T., Liu P., Rosenfeld J.A., James R.A., Coban Akdemir Z.H., Walkiewicz M., Bi W., Xiao R., Ding Y., et al. Resolution of Disease Phenotypes Resulting from Multilocus Genomic Variation. N. Engl. J. Med. 2017;376:21–31. doi: 10.1056/NEJMoa1516767.
- 3. Adams D.R., Eng C.M. Next-Generation Sequencing to Diagnose Suspected Genetic Disorders. N. Engl. J. Med. 2018;379:1353–1362. doi: 10.1056/NEJMra1711801.
- 4. Monies D., Abouelhoda M., Assoum M., Moghrabi N., Rafiullah R., Almontashiri N., Alowain M., Alzaidan H., Alsayed M., Subhani S., et al. Lessons Learned from Large-Scale, First-Tier Clinical Exome Sequencing in a Highly Consanguineous Population. Am. J. Hum. Genet. 2019;104:1182–1201. doi: 10.1016/j.ajhg.2019.04.011.
- 5. Akawi N., McRae J., Ansari M., Balasubramanian M., Blyth M., Brady A.F., Clayton S., Cole T., Deshpande C., Fitzgerald T.W., et al. Discovery of four recessive developmental disorders using probabilistic genotype and phenotype matching among 4,125 families. Nat. Genet. 2015;47:1363–1369. doi: 10.1038/ng.3410.
- 6. Turro E., Astle W.J., Megy K., Gräf S., Greene D., Shamardina O., Allen H.L., Sanchis-Juan A., Frontini M., Thys C., et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature. 2020;583:96–102. doi: 10.1038/s41586-020-2434-2.
- 7. Van Hout C.V., Tachmazidou I., Backman J.D., Hoffman J.D., Liu D., Pandey A.K., Gonzaga-Jauregui C., Khalid S., Ye B., Banerjee N., et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature. 2020;586:749–756. doi: 10.1038/s41586-020-2853-0.
- 8. Do R., Stitziel N.O., Won H.-H., Jørgensen A.B., Duga S., Angelica Merlini P., Kiezun A., Farrall M., Goel A., Zuk O., et al. Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature. 2015;518:102–106. doi: 10.1038/nature13917.
- 9. Spreafico R., Soriaga L.B., Grosse J., Virgin H.W., Telenti A. Advances in Genomics for Drug Development. Genes. 2020;11. doi: 10.3390/genes11080942.
- 10. Plenge R.M., Scolnick E.M., Altshuler D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 2013;12:581–594. doi: 10.1038/nrd4051.
- 11. Dong C., Wei P., Jian X., Gibbs R., Boerwinkle E., Wang K., Liu X. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 2015;24:2125–2137. doi: 10.1093/hmg/ddu733.
- 12. Li J., Zhao T., Zhang Y., Zhang K., Shi L., Chen Y., Wang X., Sun Z. Performance evaluation of pathogenicity-computation methods for missense variants. Nucleic Acids Res. 2018;46:7793–7804. doi: 10.1093/nar/gky678.
- 13. Richards S., Aziz N., Bale S., Bick D., Das S., Gastier-Foster J., Grody W.W., Hegde M., Lyon E., Spector E., et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 2015;17:405–424. doi: 10.1038/gim.2015.30.
- 14. Ioannidis N.M., Rothstein J.H., Pejaver V., Middha S., McDonnell S.K., Baheti S., Musolf A., Li Q., Holzinger E., Karyadi D., et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 2016;99:877–885. doi: 10.1016/j.ajhg.2016.08.016.
- 15. Alirezaie N., Kernohan K.D., Hartley T., Majewski J., Hocking T.D. ClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants. Am. J. Hum. Genet. 2018;103:474–483. doi: 10.1016/j.ajhg.2018.08.005.
- 16. Li M.-X., Kwan J.S.H., Bao S.-Y., Yang W., Ho S.-L., Song Y.-Q., Sham P.C. Predicting Mendelian Disease-Causing Non-Synonymous Single Nucleotide Variants in Exome Sequencing Studies. PLoS Genet. 2013;9. doi: 10.1371/journal.pgen.1003143.
- 17. Zhang X., Walsh R., Whiffin N., Buchan R., Midwinter W., Wilk A., Govind R., Li N., Ahmad M., Mazzarotto F., et al. Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions. Genet. Med. 2021;23:69–79. doi: 10.1038/s41436-020-00972-3.
- 18. Lappalainen T., MacArthur D.G. From variant to function in human disease genetics. Science. 2021;373:1464–1468. doi: 10.1126/science.abi8207.
- 19. Claustres M., Kožich V., Dequeker E., Fowler B., Hehir-Kwa J.Y., Miller K., Oosterwijk C., Peterlin B., van Ravenswaaij-Arts C., Zimmermann U., et al. Recommendations for reporting results of diagnostic genetic testing (biochemical, cytogenetic and molecular genetic). Eur. J. Hum. Genet. 2014;22:160–170. doi: 10.1038/ejhg.2013.125.
- 20. MacArthur D.G., Manolio T.A., Dimmock D.P., Rehm H.L., Shendure J., Abecasis G.R., Adams D.R., Altman R.B., Antonarakis S.E., Ashley E.A., et al. Guidelines for investigating causality of sequence variants in human disease. Nature. 2014;508:469–476. doi: 10.1038/nature13127.
- 21. Eldomery M.K., Coban-Akdemir Z., Harel T., Rosenfeld J.A., Gambin T., Stray-Pedersen A., Küry S., Mercier S., Lessel D., Denecke J., et al. Lessons learned from additional research analyses of unsolved clinical exome cases. Genome Med. 2017;9:26. doi: 10.1186/s13073-017-0412-6.
- 22. Ewans L.J., Schofield D., Shrestha R., Zhu Y., Gayevskiy V., Ying K., Walsh C., Lee E., Kirk E.P., Colley A., et al. Whole-exome sequencing reanalysis at 12 months boosts diagnosis and is cost-effective when applied early in Mendelian disorders. Genet. Med. 2018;20:1564–1574. doi: 10.1038/gim.2018.39.
- 23. Lee H., Deignan J.L., Dorrani N., Strom S.P., Kantarci S., Quintero-Rivera F., Das K., Toy T., Harry B., Yourshaw M., et al. Clinical Exome Sequencing for Genetic Identification of Rare Mendelian Disorders. JAMA. 2014;312:1880–1887. doi: 10.1001/jama.2014.14604.
- 24. Retterer K., Juusola J., Cho M.T., Vitazka P., Millan F., Gibellini F., Vertino-Bell A., Smaoui N., Neidich J., Monaghan K.G., et al. Clinical application of whole-exome sequencing across clinical indications. Genet. Med. 2016;18:696–704. doi: 10.1038/gim.2015.148.
- 25. Chong J.X., Buckingham K.J., Jhangiani S.N., Boehm C., Sobreira N., Smith J.D., Harrell T.M., McMillin M.J., Wiszniewski W., Gambin T., et al. The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am. J. Hum. Genet. 2015;97:199–215. doi: 10.1016/j.ajhg.2015.06.009.
- 26. Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University; Baltimore, MD: 2022. World Wide Web. https://omim.org/
- 27. Landrum M.J., Lee J.M., Benson M., Brown G.R., Chao C., Chitipiralla S., Gu B., Hart J., Hoffman D., Jang W., et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46:D1062–D1067. doi: 10.1093/nar/gkx1153.
- 28. Furney S.J., Albà M.M., López-Bigas N. Differences in the evolutionary history of disease genes affected by dominant or recessive mutations. BMC Genom. 2006;7:165. doi: 10.1186/1471-2164-7-165.
- 29. Jimenez-Sanchez G., Childs B., Valle D. Human disease genes. Nature. 2001;409:853–855. doi: 10.1038/35057050.
- 30. Kondrashov F.A., Koonin E.V. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet. 2004;20:287–290. doi: 10.1016/j.tig.2004.05.001.
- 31. López-Bigas N., Blencowe B.J., Ouzounis C.A. Highly consistent patterns for inherited human diseases at the molecular level. Bioinformatics. 2006;22:269–277. doi: 10.1093/bioinformatics/bti781.
- 32. Blekhman R., Man O., Herrmann L., Boyko A.R., Indap A., Kosiol C., Bustamante C.D., Teshima K.M., Przeworski M. Natural selection on genes that underlie human disease susceptibility. Curr. Biol. 2008;18:883–889. doi: 10.1016/j.cub.2008.04.074.
- 33. Rapaport F., Boisson B., Gregor A., Béziat V., Boisson-Dupuis S., Bustamante J., Jouanguy E., Puel A., Rosain J., Zhang Q., et al. Negative selection on human genes underlying inborn errors depends on disease outcome and both the mode and mechanism of inheritance. Proc. Natl. Acad. Sci. USA. 2021;118. doi: 10.1073/pnas.2001248118.
- 34. Fuller Z.L., Berg J.J., Mostafavi H., Sella G., Przeworski M. Measuring intolerance to mutation in human genetics. Nat. Genet. 2019;51:772–776. doi: 10.1038/s41588-019-0383-1.
- 35. Balick D.J., Jordan D.M., Sunyaev S., Do R. Overcoming constraints on the detection of recessive selection in human genes from population frequency data. Am. J. Hum. Genet. 2022;109:33–49. doi: 10.1016/j.ajhg.2021.12.001.
- 36. Antonarakis S.E. Carrier screening for recessive disorders. Nat. Rev. Genet. 2019;20:549–561. doi: 10.1038/s41576-019-0134-2.
- 37. Gosalia N., Economides A.N., Dewey F.E., Balasubramanian S. MAPPIN: a method for annotating, predicting pathogenicity and mode of inheritance for nonsynonymous variants. Nucleic Acids Res. 2017;45:10393–10402. doi: 10.1093/nar/gkx730.
- 38. Quinodoz M., Royer-Bertrand B., Cisarova K., Di Gioia S.A., Superti-Furga A., Rivolta C. DOMINO: Using Machine Learning to Predict Genes Associated with Dominant Disorders. Am. J. Hum. Genet. 2017;101:623–629. doi: 10.1016/j.ajhg.2017.09.001.
- 39. Bendl J., Stourac J., Salanda O., Pavelka A., Wieben E.D., Zendulka J., Brezovsky J., Damborsky J. PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations. PLoS Comput. Biol. 2014;10. doi: 10.1371/journal.pcbi.1003440.
- 40. Danzi M.C., Dohrn M.F., Fazal S., Beijer D., Rebelo A.P., Cintra V., Züchner S. Deep structured learning for variant prioritization in Mendelian diseases. Nat. Commun. 2023;14:4167. doi: 10.1038/s41467-023-39306-7.
- 41. Stenson P.D., Ball E.V., Mort M., Phillips A.D., Shiel J.A., Thomas N.S.T., Abeysinghe S., Krawczak M., Cooper D.N. Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat. 2003;21:577–581. doi: 10.1002/humu.10212.
- 42. Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
- 43. Lek M., Karczewski K.J., Minikel E.V., Samocha K.E., Banks E., Fennell T., O’Donnell-Luria A.H., Ware J.S., Hill A.J., Cummings B.B., et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
- 44. Rentzsch P., Witten D., Cooper G.M., Shendure J., Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019;47:D886–D894. doi: 10.1093/nar/gky1016.
- 45. Quang D., Chen Y., Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31:761–763. doi: 10.1093/bioinformatics/btu703.
- 46. Jagadeesh K.A., Wenger A.M., Berger M.J., Guturu H., Stenson P.D., Cooper D.N., Bernstein J.A., Bejerano G. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat. Genet. 2016;48:1581–1586. doi: 10.1038/ng.3703.
- 47. BioMe™ BioBank Program. 2020. https://icahn.mssm.edu/research/ipm/programs/biome-biobank Accessed June 2020.
- 48. Grimm D.G., Azencott C.A., Aicheler F., Gieraths U., MacArthur D.G., Samocha K.E., Cooper D.N., Stenson P.D., Daly M.J., Smoller J.W., et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 2015;36:513–523. doi: 10.1002/humu.22768.
- 49. Brandes N., Goldman G., Wang C.H., Ye C.J., Ntranos V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat. Genet. 2023;55:1512–1522. doi: 10.1038/s41588-023-01465-0.
- 50. Gao H., Hamp T., Ede J., Schraiber J.G., McRae J., Singer-Berk M., Yang Y., Dietrich A.S.D., Fiziev P.P., Kuderna L.F.K., et al. The landscape of tolerated genetic variation in humans and primates. Science. 2023;380. doi: 10.1126/science.abn8197.
- 51. Chen E., Facio F.M., Aradhya K.W., Rojahn S., Hatchell K.E., Aguilar S., Ouyang K., Saitta S., Hanson-Kwan A.K., Capurro N.N., et al. Rates and Classification of Variants of Uncertain Significance in Hereditary Disease Genetic Testing. JAMA Netw. Open. 2023;6. doi: 10.1001/jamanetworkopen.2023.39571.
- 52. Fowler D.M., Rehm H.L. Will variants of uncertain significance still exist in 2030? Am. J. Hum. Genet. 2024;111:5–10. doi: 10.1016/j.ajhg.2023.11.005.
- 53. Pejaver V., Byrne A.B., Feng B.J., Pagel K.A., Mooney S.D., Karchin R., O'Donnell-Luria A., Harrison S.M., Tavtigian S.V., Greenblatt M.S., et al. Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria. Am. J. Hum. Genet. 2022;109:2163–2177. doi: 10.1016/j.ajhg.2022.10.013.
- 54. TogoVar. GEM Japan Whole Genome Aggregation (GEM-J WGA) Panel. 2020. https://grch37.togovar.org/doc/datasets/gem_j_wga
- 55. Ndugga-Kabuye M.K., Issaka R.B. Inequities in multi-gene hereditary cancer testing: lower diagnostic yield and higher VUS rate in individuals who identify as Hispanic, African or Asian and Pacific Islander as compared to European. Fam. Cancer. 2019;18:465–469. doi: 10.1007/s10689-019-00144-6.
- 56. Caswell-Jin J.L., Gupta T., Hall E., Petrovchich I.M., Mills M.A., Kingham K.E., Koff R., Chun N.M., Levonian P., Lebensohn A.P., et al. Racial/ethnic differences in multiple-gene sequencing results for hereditary cancer risk. Genet. Med. 2018;20:234–239. doi: 10.1038/gim.2017.96.
- 57. Chan S.H., Bylstra Y., Teo J.X., Kuan J.L., Bertin N., Gonzalez-Porta M., Hebrard M., Tirado-Magallanes R., Tan J.H.J., Jeyakani J., et al. Analysis of clinically relevant variants from ancestrally diverse Asian genomes. Nat. Commun. 2022;13:6694. doi: 10.1038/s41467-022-34116-9.
- 58. Fatkin D., Johnson R. Variants of Uncertain Significance and “Missing Pathogenicity”. J. Am. Heart Assoc. 2020;9. doi: 10.1161/JAHA.119.015588.
- 59. Hsu J.S., Kwan J.S.H., Pan Z., Garcia-Barcelo M.-M., Sham P.C., Li M. Inheritance-mode specific pathogenicity prioritization (ISPP) for human protein coding genes. Bioinformatics. 2016;32:3065–3071. doi: 10.1093/bioinformatics/btw381.
- 60. Pejaver V., Urresti J., Lugo-Martinez J., Pagel K.A., Lin G.N., Nam H.J., Mort M., Cooper D.N., Sebat J., Iakoucheva L.M., et al. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat. Commun. 2020;11:5918. doi: 10.1038/s41467-020-19669-x.
- 61. Shihab H.A., Gough J., Cooper D.N., Stenson P.D., Barker G.L.A., Edwards K.J., Day I.N.M., Gaunt T.R. Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. 2013;34:57–65. doi: 10.1002/humu.22225.
- 62. Carter H., Douville C., Stenson P.D., Cooper D.N., Karchin R. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genom. 2013;14:S3. doi: 10.1186/1471-2164-14-s3-s3.
- 63. Harrison S.M., Biesecker L.G., Rehm H.L. Overview of Specifications to the ACMG/AMP Variant Interpretation Guidelines. Curr. Protoc. Hum. Genet. 2019;103:e93. doi: 10.1002/cphg.93.
- 64. Ghosh R., Oak N., Plon S.E. Evaluation of in silico algorithms for use with ACMG/AMP clinical variant interpretation guidelines. Genome Biol. 2017;18:225. doi: 10.1186/s13059-017-1353-5.
- 65. Pollard K.S., Hubisz M.J., Rosenbloom K.R., Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–121. doi: 10.1101/gr.097857.109.
- 66. Balick D.J., Jordan D.M., Sunyaev S., Do R. Overcoming constraints on the detection of recessive selection in human genes from population frequency data. bioRxiv. 2021. doi: 10.1101/2021.05.06.443024.
- 67. Ziegler A., Colin E., Goudenège D., Bonneau D. A snapshot of some pLI score pitfalls. Hum. Mutat. 2019;40:839–841. doi: 10.1002/humu.23763.
- 68. Gunning A.C., Fryer V., Fasham J., Crosby A.H., Ellard S., Baple E.L., Wright C.F. Assessing performance of pathogenicity predictors using clinically relevant variant datasets. J. Med. Genet. 2021;58:547–555. doi: 10.1136/jmedgenet-2020-107003.
- 69. Li B., Krishnan V.G., Mort M.E., Xin F., Kamati K.K., Cooper D.N., Mooney S.D., Radivojac P. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009;25:2744–2750. doi: 10.1093/bioinformatics/btp528.
- 70. Shah N., Hou Y.-C.C., Yu H.-C., Sainger R., Caskey C.T., Venter J.C., Telenti A. Identification of Misclassified ClinVar Variants via Disease Population Prevalence. Am. J. Hum. Genet. 2018;102:609–619. doi: 10.1016/j.ajhg.2018.02.019.
- 71. Ghosh R., Harrison S.M., Rehm H.L., Plon S.E., Biesecker L.G., ClinGen Sequence Variant Interpretation Working Group. Updated recommendation for the benign stand-alone ACMG/AMP criterion. Hum. Mutat. 2018;39:1525–1530. doi: 10.1002/humu.23642.
- 72. Mahmood K., Jung C.H., Philip G., Georgeson P., Chung J., Pope B.J., Park D.J. Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics. Hum. Genomics. 2017;11:10. doi: 10.1186/s40246-017-0104-8.
- 73. Pejaver V., Byrne A.B., Feng B.-J., Pagel K.A., Mooney S.D., Karchin R., O’Donnell-Luria A., Harrison S.M., Tavtigian S.V., Greenblatt M.S., et al. Evidence-based calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for clinical use of PP3/BP4 criteria. bioRxiv. 2022. doi: 10.1101/2022.03.17.484479.
- 74. Wexler N.S., Young A.B., Tanzi R.E., Travers H., Starosta-Rubinstein S., Penney J.B., Snodgrass S.R., Shoulson I., Gomez F., Ramos Arroyo M.A., et al. Homozygotes for Huntington's disease. Nature. 1987;326:194–197. doi: 10.1038/326194a0.
- 75. Hughes A.L., Nei M. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature. 1988;335:167–170. doi: 10.1038/335167a0.
- 76. Schroeder S.A., Gaughan D.M., Swift M. Protection against bronchial asthma by CFTR ΔF508 mutation: A heterozygote advantage in cystic fibrosis. Nat. Med. 1995;1:703–705. doi: 10.1038/nm0795-703.
- 77. Zschocke J., Byers P.H., Wilkie A.O.M. Mendelian inheritance revisited: dominance and recessiveness in medical genetics. Nat. Rev. Genet. 2023;24:442–463. doi: 10.1038/s41576-023-00574-0.
- 78. Fabienne J.-H., Hugo V., Frederic T., Alexandre A., Jean-Philippe J. Rfpred: A Random Forest Approach for Prediction of Missense Variants in Human Exome. bioRxiv. 2016. doi: 10.1101/037127.
- 79. Zhen X., Lin G.N. PPSNV: A Novel Predictor for Pathogenicity of Nonsynonymous SNV Based on Ensemble Learning. In: Proceedings of the 2021 10th International Conference on Bioinformatics and Biomedical Science. Association for Computing Machinery; 2022.
- 80. Mayumi K., Atsuko T., Ryosuke K., Yoshihisa T., Masahiko N., Noriko T., Makoto H., Teruhiko Y., Yasushi O. Network-based pathogenicity prediction for variants of uncertain significance. bioRxiv. 2021. doi: 10.1101/2021.07.15.452566.
- 81. Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. doi: 10.1093/nar/gkq603.
- 82. Davydov E.V., Goode D.L., Sirota M., Cooper G.M., Sidow A., Batzoglou S. Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Comput. Biol. 2010;6. doi: 10.1371/journal.pcbi.1001025.
- 83. Choi Y., Sims G.E., Murphy S., Miller J.R., Chan A.P. Predicting the functional effect of amino acid substitutions and indels. PLoS One. 2012;7. doi: 10.1371/journal.pone.0046688.
- 84. Schwarz J.M., Cooper D.N., Schuelke M., Seelow D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat. Methods. 2014;11:361–362. doi: 10.1038/nmeth.2890.
- 85. Drmanac R., Sparks A.B., Callow M.J., Halpern A.L., Burns N.L., Kermani B.G., Carnevali P., Nazarenko I., Nilsen G.B., Yeung G., et al. Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science. 2010;327:78–81. doi: 10.1126/science.1181498.
- 86. Glusman G., Caballero J., Mauldin D.E., Hood L., Roach J.C. Kaviar: an accessible system for testing SNV novelty. Bioinformatics. 2011;27:3216–3217. doi: 10.1093/bioinformatics/btr540.
- 87. Scott E.M., Halees A., Itan Y., Spencer E.G., He Y., Azab M.A., Gabriel S.B., Belkadi A., Boisson B., Abel L., et al. Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery. Nat. Genet. 2016;48:1071–1076. doi: 10.1038/ng.3592.
- 88. Cassa C.A., Weghorn D., Balick D.J., Jordan D.M., Nusinow D., Samocha K.E., O'Donnell-Luria A., MacArthur D.G., Daly M.J., Beier D.R., Sunyaev S.R. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 2017;49:806–810. doi: 10.1038/ng.3831.
- 89. Han X., Chen S., Flynn E., Wu S., Wintner D., Shen Y. Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders. Nat. Commun. 2018;9:2138. doi: 10.1038/s41467-018-04552-7.
- 90. Szklarczyk D., Gable A.L., Lyon D., Junge A., Wyder S., Huerta-Cepas J., Simonovic M., Doncheva N.T., Morris J.H., Bork P., et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47:D607–D613. doi: 10.1093/nar/gky1131.
- 91. Huang N., Lee I., Marcotte E.M., Hurles M.E. Characterising and Predicting Haploinsufficiency in the Human Genome. PLoS Genet. 2010;6. doi: 10.1371/journal.pgen.1001154.
- 92. Stekhoven D.J., Bühlmann P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28:112–118. doi: 10.1093/bioinformatics/btr597.
- 93. Petrazzini B.O., Naya H., Lopez-Bello F., Vazquez G., Spangenberg L. Evaluation of different approaches for missing data imputation on features associated to genomic data. BioData Min. 2021;14:44. doi: 10.1186/s13040-021-00274-7.
- 94. Kuhn M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 2008;28:1. doi: 10.18637/jss.v028.i05.
- 95. Liaw A., Wiener M. Classification and Regression by randomForest. R News. 2002;2:18–22.
- 96. Grinsztajn L., Oyallon E., Varoquaux G. Why do tree-based models still outperform deep learning on tabular data? arXiv. 2022. doi: 10.48550/ARXIV.2207.08815.
- 97. Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J.-C., Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf. 2011;12:77. doi: 10.1186/1471-2105-12-77.
- 98. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2019.
- 99. Tayo B.O., Teil M., Tong L., Qin H., Khitrov G., Zhang W., Song Q., Gottesman O., Zhu X., Pereira A.C., et al. Genetic Background of Patients from a University Medical Center in Manhattan: Implications for Personalized Medicine. PLoS One. 2011;6. doi: 10.1371/journal.pone.0019166.
- 100. International Health Terminology Standards Development Organisation. SNOMED CT Starter Guide. 2020. https://confluence.ihtsdotools.org/display/DOCSTART/SNOMED+CT+Starter+Guide
- 101. Pavan S., Rommel K., Mateo Marquina M.E., Höhn S., Lanneau V., Rath A. Clinical Practice Guidelines for Rare Diseases: The Orphanet Database. PLoS One. 2017;12. doi: 10.1371/journal.pone.0170365.
- 102. Viechtbauer W. Conducting Meta-Analyses in R with the metafor Package. J. Stat. Softw. 2010;36:48. doi: 10.18637/jss.v036.i03.
- 103. Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4. doi: 10.1186/s13742-015-0047-8.
Data Availability Statement
- Pre-computed predictions of variant-level MOI using ConMOI and MOI-Pred are publicly available at https://doi.org/10.5281/zenodo.5565246. Variant sets used to benchmark the tools will be made available upon request.
- Computational programs and scripts to reproduce the results are publicly available at https://github.com/rondolab/mode-of-inheritance. An archival DOI is listed in the key resources table.
- Any additional information needed to re-analyze the data reported in this paper is available from the lead contact upon request.




