Abstract
Variants of uncertain significance (VUS) are variants that lack sufficient evidence to be confidently associated with a disease, posing a challenge in the interpretation of genetic testing results. Here we report an improved method for predicting the impact of VUS in the arylsulfatase A (ARSA) gene as part of the sixth Critical Assessment of Genome Interpretation challenge (CAGI6). Our method uses a transfer learning approach that leverages a pre-trained protein language model to predict the impact of mutations on the activity of the ARSA enzyme, whose deficiency is known to cause a rare genetic disorder, metachromatic leukodystrophy. Our framework combines zero-shot log-odds scores and embeddings from ESM, an evolutionary scale model, as features for training a supervised model on gene variants functionally related to the ARSA gene. The zero-shot log-odds score captures generic properties of proteins learned during pre-training on millions of sequences in the UniProt data, while the ESM embeddings for the proteins in the ARSA family capture features specific to the family. We also tested our approach on another enzyme, N-acetyl-glucosaminidase (NAGLU), which belongs to the same superfamily as ARSA. Our results demonstrate that the performance of our family models (augmented ESM models) is comparable to or better than that of the ESM models. The ARSA model compares favorably with the majority of state-of-the-art predictors on the area under the precision-recall curve (AUPRC) metric, while the NAGLU model outperforms all pathogenicity predictors evaluated in this study on that metric. The improved AUPRC is relevant in a diagnostic setting, where variant prioritization generally entails identifying a small number of pathogenic variants among a larger number of benign variants.
Our results also indicate that for genes with sparse or no experimental variant impact data, family variant data can serve as proxy training data for making accurate predictions. Attention analysis of active sites and binding sites in the ARSA and NAGLU proteins sheds light on probable mechanisms of pathogenicity for positions that are highly attended.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00439-025-02727-z.
Introduction
Next generation sequencing has enabled rapid and inexpensive genetic testing for various diseases. Genetic testing reveals the pool of variants in a patient, and variant interpretation methods aid in associating gene variants with a disease. However, not all variants can be classified as disease causing or benign: a plethora of variants lack sufficient evidence, or have conflicting evidence, for association with a particular disease and are thus classified as variants of uncertain significance (VUS). In the context of personalized medicine, robust and reliable methods to classify VUS are necessary for clinical diagnosis, genetic studies, and protein engineering. Experimental characterization of the association between a gene variant and a disease is valuable but not scalable due to cost and time. Computational predictions can potentially yield efficient and scalable preliminary assessments of how VUS may affect human health. Several computational methods have been developed to associate variants of a gene with a disease. Most methods use sequence-based features (Ng and Henikoff 2003; Pollard et al. 2010; Hecht et al. 2015; Hopf et al. 2017; Rentzsch et al. 2019; Pejaver et al. 2020); a few use three-dimensional structure (Dehouck et al. 2009; Laimer et al. 2015), while others use a combination of sequence and structural features to classify variants as pathogenic or benign (Adzhubei et al. 2013; Niroula et al. 2015; Savojardo et al. 2016). Current state-of-the-art variant predictors are genome-wide predictors trained on all gene variants available in the databases. While these predictors capture generic features from the training set very well, they fail to capture gene-specific features because those signals get diluted. Thus, to further improve variant classification and reduce the burden of VUS, gene-specific approaches that capture features specific to a gene or family are needed in addition to genome-wide approaches.
However, such gene-specific approaches are not as established as general pathogenicity prediction tools due to the lack of adequate training datasets for a specific gene of interest. Previous studies have generated customized predictions for a gene of interest (Adhikari 2019; Draelos et al. 2022) and found that the biological context of the individual gene can improve the predictive capability of computational models. Disease-specific predictors (Zhang et al. 2021) trained on variants curated by experts have also shown improved performance compared to genome-wide approaches. Current gene-specific tools for variant prediction employ highly specific features of the protein, such as distance to substrate binding sites, side-chain to side-chain distances for defined critical residues, and protein domain information (Adhikari 2019). The relevant features of a gene depend on its function, so identifying appropriate gene-specific features for each gene can be a very time-consuming process. Although this approach may still be desirable for a well-characterized gene, it poses a problem for less characterized genes or gene regions. Thus, a better alternative could be an approach that captures the biological context of a gene without feature engineering.
To address the low-data situation often encountered while developing a gene-specific predictor, to avoid cumbersome gene-specific feature engineering, and to incorporate both gene-specific and generic features in one model, we sought to develop a novel approach that addresses these limitations. In this work, we provide a solution to the low-data problem by showing that robust gene-specific predictions can be achieved by training a predictor on variants from all the genes in the gene family instead of relying only on variant data for the specific gene under investigation. Instead of gene-specific feature engineering, we use learned protein embeddings from protein language models as variant features for training our models. Recently, protein language models trained on large databases of evolutionarily diverse proteins in an unsupervised fashion have been used for prediction of protein structure (Lin et al. 2023) and function (Shashkova et al. 2022). We use ESM (evolutionary scale model) (Rives et al. 2021; Lin et al. 2023) embeddings as features that capture gene- or family-specific properties. We also use the zero-shot log-odds score from ESM, which captures robust generic features of the gene owing to its pre-training on a much larger dataset, i.e., UniRef. This study showcases a novel approach that captures both generic and gene-specific features, leveraging protein language models to predict VUS in the arylsulfatase A (ARSA, E.C. 3.1.6.8, ENST00000216124.5, also known as cerebroside sulfatase) gene as part of the CAGI6 challenge (Jain et al. 2024a, b). Deficiency of the ARSA enzyme causes metachromatic leukodystrophy (MLD, OMIM #250100), a rare recessive disorder in which sulfatide buildup in cells, particularly in the brain, spinal cord, and peripheral nerves, leads to progressive demyelination, resulting in a variety of neurological symptoms and ultimately death (Greene et al. 1967).
Post challenge, we also evaluated our algorithm for predicting variants in the NAGLU gene (MIM# 609701) that encodes human N-acetyl-glucosaminidase, an enzyme involved in the heparan sulfate degradation process. Mutations in the NAGLU gene cause a rare, neurological disorder, Mucopolysaccharidosis IIIB or Sanfilippo B disease (O’brien 1972; von Figura and Kresse 1972; Valstar et al. 2008). This disorder leads to mental deterioration in childhood and death in the second decade. We report here two gene specific models, one for the ARSA gene developed as part of the CAGI6 challenge and another for the NAGLU gene developed post challenge for evaluating the robustness of our approach.
Materials and methods
ARSA challenge data
BioMarin had functionally assessed the enzymatic activity of 277 missense VUS in the ARSA gene and provided these variants for the CAGI6 challenge. An evaluation set of 221 ARSA gene variants was obtained as a subset of this curated set by removing variants with unambiguous classification in the December 2022 release of ClinVar (Landrum et al. 2018). Challenge participants were asked to submit numerical predictions ranging from 0 (no activity) to 1 (wild-type level of activity) indicating how the missense variants affected the activity of the ARSA enzyme.
Training data for ARSA models
We first compiled gene lists for our family-specific predictor. The ARSA gene belongs to the sulfatase family, and searching UniProt for sulfatase family genes yielded 17 reviewed entries belonging to Homo sapiens. Thus, our sulfatase family comprised the following genes: IDS, SULF1, SGSH, STS, ARSG, ARSI, ARSF, ARSH, ARSD, ARSJ, GALNS, ARSA, ARSL, ARSK, GNS, SULF2, ARSB.
From the literature we also found that the ARSA enzyme belongs to a group of enzymes called carbohydrate hydrolases (Stütz and Wrodnigg 2016) that process carbohydrates in the lysosomes. The genes belonging to carbohydrate hydrolases from this report were: GALC, GUSB, GLA, IDUA, MAN2B1, GAA, FUCA1, NEU1, GLB1, HEXB, MANBA, HYAL1, HEXA, ARSA, ARSB and NAGLU.
We formed two datasets for training our models from the aforementioned gene lists, as shown in Table 1A. Dataset-1 consisted of gene variants from the sulfatase family, and Dataset-2 consisted of gene variants from the sulfatase family and carbohydrate hydrolases. When merging the two gene lists for Dataset-2, overlapping genes were counted only once. We next collected variants for Dataset-1 and Dataset-2 from various databases: humsavar (The UniProt Consortium 2023), ClinVar (Landrum et al. 2018), the Pompe disease mutation database (Kroos et al. 2012), and the Leiden Open Variation Database 3.0 (LOVD 3.0) (Fokkema et al. 2021). The following filters were used for selecting variants: (1) the mutation had to be a single nucleotide polymorphism, (2) the clinical significance had to be benign/likely benign or pathogenic/likely pathogenic, (3) VUS and mutations with conflicting annotations were removed, and (4) mutations with different annotations in different databases were removed. For the control experiment, a third training dataset (Dataset-3) was created by removing the variants of the ARSA gene from Dataset-1.
Table 1.
A: different training sets for predicting pathogenicity of ARSA variants. B: different training sets for predicting pathogenicity of NAGLU VUS
A.

| Training Dataset Identifier | Training Dataset | Collected: Pathogenic | Collected: Benign (Database) | Collected: Benign (Alignment) | Training: Pathogenic | Training: Benign (Database) | Training: Benign (Alignment) |
|---|---|---|---|---|---|---|---|
| Dataset-1 | Sulfatase family | 558 | 45 | 1529 | 558 | 45 | 513 |
| Dataset-2 | Carbohydrate hydrolases + Sulfatase family | 2013 | 296 | 4444 | 2013 | 296 | 1717 |
| Dataset-3 | Control | 490 | 43 | 1496 | 490 | 43 | 447 |
B.

| Training Dataset | Collected: Pathogenic | Collected: Benign (Database) | Collected: Benign (Alignment) | Training: Pathogenic | Training: Benign (Database) | Training: Benign (Alignment) |
|---|---|---|---|---|---|---|
| Carbohydrate hydrolases | 1638 | 164 | 3128 | 1638 | 164 | 1474 |
| Hydrolases | 1763 | 225 | 3128 | 1763 | 225 | 1538 |
| Control | 1670 | 223 | 3128 | 1670 | 223 | 1447 |
A. All pathogenic variants are from disease databases. Benign variants are from databases as well as generated by alignment of enzymes (human enzymes with enzymes from other organisms) belonging to the same EC number. (1) Dataset-1 gene variants are from the genes IDS, SULF1, SGSH, STS, ARSG, ARSI, ARSF, ARSH, ARSD, ARSJ, GALNS, ARSA, ARSL, ARSK, GNS, SULF2, and ARSB. (2) Dataset-2 gene variants are from Dataset-1 plus an additional 14 genes from the carbohydrate hydrolases: FUCA1, GAA, GALC, GLB1, GLA, GUSB, HEXA, HEXB, HYAL1, IDUA, MAN2B1, MANBA, NEU1, and NAGLU. (3) The control training set contains all variants from Dataset-1 genes except ARSA
B. All pathogenic variants are from disease databases. Benign variants are from databases as well as generated by alignment of enzymes (human enzymes with enzymes from other organisms) belonging to the same EC number. (1) Carbohydrate-processing enzyme variants are from the genes GALC, GUSB, GLA, IDUA, MAN2B1, GAA, FUCA1, NEU1, GLB1, HEXB, MANBA, HYAL1, HEXA, ARSA, ARSB, and NAGLU. (2) Hydrolases: the genes from (1) along with the additional genes ADA, AMPD2, ADA2, AGA, AMPD3, DPYS, AMPD1, PSMB9, PSMB8, FAAH, MACROD2, UCHL1, ADPRS, TATDN2, DPEP2, and PSMB11. (3) The control training set contains all variants from hydrolase genes except NAGLU
Since pathogenic variants outnumbered benign variants in the disease databases for these genes, we generated benign variants by aligning enzymes (human and from other organisms) belonging to the same EC (Enzyme Commission) number (Hecht et al. 2015), under the assumption that if these proteins are sequence-similar, most variants between them are likely benign. For each enzyme, we looked up the corresponding enzymes under the same EC number in other organisms. We then aligned the protein sequences of the enzymes from other organisms with that of the human enzyme. Each alignment yielded highly conserved, weakly conserved, and unconserved regions. Unconserved positions carried different residues in the aligned sequences, and these substitutions were considered benign. Regions with gaps, insertions, or deletions in the alignments were excluded from this analysis.
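As an illustration, extracting putative benign substitutions from a pairwise alignment can be sketched as follows. This is a minimal sketch: the function name and the two-sequence input format are hypothetical, and the actual pipeline classified positions by conservation level across full alignments.

```python
def candidate_benign_variants(human_aln: str, ortholog_aln: str):
    """Given two pre-aligned sequences of equal length (gaps as '-'),
    return putative benign substitutions as (position, human_aa, ortholog_aa).
    Positions are 1-based indices into the ungapped human sequence.
    Gap columns (insertions/deletions) are skipped entirely."""
    assert len(human_aln) == len(ortholog_aln)
    variants = []
    human_pos = 0  # position in the ungapped human sequence
    for h, o in zip(human_aln, ortholog_aln):
        if h != '-':
            human_pos += 1
        if h == '-' or o == '-':
            continue  # exclude indel regions from the analysis
        if h != o:
            variants.append((human_pos, h, o))
    return variants

# Mismatches at ungapped human positions 3 and 7; the gap column is skipped.
print(candidate_benign_variants("MKTA-YLQ", "MKSAGYLE"))  # → [(3, 'T', 'S'), (7, 'Q', 'E')]
```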
From each dataset, we created five balanced training sets; in each set, the pathogenic variants were common, but the benign variants were randomly sampled. The independent evaluation set provided by the CAGI6 competition consisted of 221 variants, with 74 pathogenic and 147 benign variants.
Training data for NAGLU models
Like ARSA, NAGLU is also a member of the carbohydrate hydrolases (Stütz and Wrodnigg 2016). We thus used carbohydrate hydrolase gene variants (as used for ARSA) for training the NAGLU model. NAGLU belongs to the hydrolases (EC 3.0), so we also collected all genes under this EC number from BRENDA and looked them up in UniProt. Of all the reviewed human hydrolase genes in UniProt, only 32 had variants reported in the variant databases (the databases used are mentioned in the ARSA data section): GALC, GUSB, GLA, IDUA, MAN2B1, GAA, FUCA1, NEU1, GLB1, HEXB, MANBA, HYAL1, HEXA, ARSA, ARSB, NAGLU, ADA, AMPD2, ADA2, AGA, AMPD3, DPYS, AMPD1, PSMB9, PSMB8, FAAH, MACROD2, UCHL1, ADPRS, TATDN2, DPEP2, and PSMB11. The variants in these genes formed our second dataset for training the NAGLU model. The same filters were used for selecting variants from the databases as described in the ARSA dataset section. Since pathogenic variants outnumbered benign variants in the disease databases for these genes, we generated benign variants as explained in the previous section. We trained our predictor on three different datasets, as shown in Table 1B. We also created a control dataset by removing NAGLU gene variants from the hydrolase family dataset. From each dataset, we created five balanced training sets; in each set, the pathogenic variants were common, but the benign variants were randomly sampled. The test set consisted of 38 pathogenic and 126 benign variants from the NAGLU gene (Clark et al. 2018), and this set did not overlap with the variants in the training set.
Supervised model features and training (ARSA)
In this work, we used ESM-1b, a publicly available 650-million-parameter protein language model trained on UniRef data (Rives et al. 2021). The model produces a 1280-dimensional vector representation for each residue in a given sequence context. We used ESM-1b as a static feature encoder to extract protein features for training neural network (NN) models on family data, without fine-tuning the ESM-1b layer weights. We extracted amino acid level features using ESM-1b only at the site of mutation for the wild-type and mutant protein and then calculated the difference. This difference formed the final representation, a 1280-dimensional feature vector, which was concatenated with the zero-shot log odds score (1 feature) and fed as input to the NN model. We also trained another residue-level NN model (Fig. 1) in which the dimension of the final representation (1280 features) was reduced by principal component analysis (PCA) to 269 features and concatenated with the zero-shot log odds score. The lower-dimensional feature space obtained using PCA corresponds to the number of principal components that preserve 95% of the variance in the data.
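A minimal sketch of this feature construction, assuming per-residue embeddings have already been extracted from the language model. Function names are hypothetical, and the SVD-based PCA stands in for whatever PCA tooling was actually used; small random arrays stand in for the real 1280-dimensional ESM-1b embeddings.

```python
import numpy as np

def fit_pca_components(X, var_kept=0.95):
    """Return the top principal axes (k, d) that keep `var_kept` of the
    variance, computed via SVD of the centered feature matrix X (n, d)."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    var = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(var), var_kept)) + 1
    return vt[:k]

def build_variant_features(wt_emb, mut_emb, mut_pos, zero_shot_score,
                           pca_components=None):
    """wt_emb, mut_emb: (L, d) per-residue embeddings. Take the embedding
    difference at the mutated site, optionally project it onto precomputed
    PCA components (k, d), and append the zero-shot log-odds score."""
    diff = wt_emb[mut_pos] - mut_emb[mut_pos]   # (d,)
    if pca_components is not None:
        diff = pca_components @ diff            # (k,)
    return np.concatenate([diff, [zero_shot_score]])
```

In the actual ARSA model, `d` is 1280 and the PCA projection keeps 269 components.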
Fig. 1.
A residue level embedding model with PCA for variant prediction
Residue level embeddings are extracted from ESM-1b as features for both wild-type (ew) and mutant positions (em) of a given protein and difference (er) is calculated. PCA is deployed to reduce dimensions of features. The reduced features (P) concatenated with zero-shot log-odds score (Z) from ESM-1b are given as input to a fully connected feed forward neural network (MLP) which outputs the probability of the input being pathogenic or benign.
These two feature types and the three types of training data, as explained in the dataset section, resulted in six experiments. For all these experiments, the input features were passed through an NN that outputs a probability score, where 0 (no enzyme activity) is pathogenic and 1 (100% enzyme activity) is benign. For binary classification of a variant, a probability score below the threshold of 0.5 was considered to indicate a pathogenic variant. To reduce overfitting, regularization techniques such as dropout and early stopping were used while training the model. The final prediction was based on an ensemble of five models trained on the five different training sets.
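The ensemble and thresholding step can be sketched as follows (the function name is hypothetical; each row of the input stands in for the probability outputs of one model trained on a differently sampled balanced set):

```python
import numpy as np

def ensemble_predict(member_probs, threshold=0.5):
    """member_probs: (n_models, n_variants) probability-of-benign scores.
    The ensemble score is the mean over models; scores below the threshold
    are called pathogenic (label 1), otherwise benign (label 0)."""
    scores = np.mean(member_probs, axis=0)
    labels = (scores < threshold).astype(int)
    return scores, labels

probs = np.array([[0.1, 0.9, 0.40],
                  [0.2, 0.8, 0.50],
                  [0.3, 0.7, 0.45],
                  [0.1, 0.9, 0.35],
                  [0.2, 0.8, 0.30]])
scores, labels = ensemble_predict(probs)
print(labels)  # → [1 0 1]: the first and third variants are called pathogenic
```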
Supervised model features and training (NAGLU)
ESM-2 was released while we were working on NAGLU, so we used ESM-2 (Lin et al. 2023) features for the NAGLU models, as they worked slightly better on the NAGLU test set than the ESM-1b and ESM-1v features (Table S1). ESM-2 produces a 1280-dimensional vector representation for each residue in a given sequence context. We used ESM-2 as a static feature encoder to extract protein features for training neural network (NN) models on family data, without fine-tuning the ESM-2 layer weights. We considered two types of protein representations as inputs to our model: residue level and sequence level. For the residue-level representation, two models differing in the number of features were trained on each dataset (as explained in the previous section). Models with PCA-reduced features used 482 features (481 from embeddings and 1 zero-shot log odds score), while models without feature reduction used 1281 features (1280 from embeddings and 1 zero-shot log odds score). For NAGLU, we also developed models using sequence-level representations of the proteins. We first extracted amino acid level features using ESM-2, then obtained a vector representation of the protein by averaging each feature over the L amino acids (where L is the length of the protein). The final representation was the difference between the wild-type and mutant protein feature vectors. The 1280 features from this representation were concatenated with the zero-shot log odds score (1 feature) and fed as input to the NN model. We also trained another sequence-level NN model in which the dimension of the final representation (1280 features) was reduced by PCA to 191 features and concatenated with the zero-shot log odds score.
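The sequence-level representation described above amounts to mean pooling over residues followed by a wild-type minus mutant difference, which can be sketched as (hypothetical function name; `d` is 1280 for ESM-2):

```python
import numpy as np

def sequence_level_features(wt_emb, mut_emb, zero_shot_score):
    """Mean-pool per-residue embeddings (L, d) over the sequence length to get
    one vector per protein, take the wild-type minus mutant difference, and
    append the zero-shot log-odds score -> (d + 1,) feature vector."""
    diff = wt_emb.mean(axis=0) - mut_emb.mean(axis=0)
    return np.concatenate([diff, [zero_shot_score]])
```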
These four different feature types, along with the three types of training data as explained in the dataset section, resulted in a total of 12 experiments. For all these experiments, the input features were passed through an NN that outputs a probability score where 0 (no enzyme activity) is pathogenic and 1 (100% enzyme activity) is benign. The final prediction was based on an ensemble model of five different models trained on five different training sets.
Evaluating evotuning
We fine-tuned ESM-2 with evolutionary sequences related to the NAGLU gene, generated using the JACKHMMER (Potter et al. 2018) tool. Evolutionary fine-tuning, or evotuning, was first introduced in the UniRep paper (Alley et al. 2019), where the pretrained mLSTM weights are fine-tuned through weight updates using sequences homologous to the protein of interest. The total number of unique UniRef50 (The UniProt Consortium 2023) evolutionary sequences related to the NAGLU gene was 41,695, with the maximum sequence length restricted to 1,000. We chose this cutoff because most sequences in the length distribution were within this range; it also reduced the computational burden of fine-tuning. The evotuning used the masked language modeling (MLM) objective: 15% of the amino acids in each input sequence were masked, and the model was trained to predict the missing tokens.
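The 15% masking step of the MLM objective can be sketched as below. This is a simplified illustration with a hypothetical function name: it masks every selected position outright, whereas the actual training pipeline uses the ESM tokenizer and BERT-style masking conventions.

```python
import random

def mask_for_mlm(tokens, mask_token="<mask>", mask_frac=0.15, rng=None):
    """Randomly mask ~15% of residues in a tokenized sequence for the masked
    language modeling objective. Returns the masked sequence and a map of
    masked positions to their original tokens (the prediction targets)."""
    rng = rng or random.Random(0)
    n_mask = max(1, int(len(tokens) * mask_frac))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = mask_token
    return masked, targets
```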
Hyperparameter tuning
Our model includes multiple hyperparameters, such as the number of hidden layers, hidden layer size, and dropout rate. The compiled variants from each of the three datasets (Table 1A) were divided into five training sets. In all five sets, the pathogenic variants were common, but the benign variants were randomly sampled to create a balanced training set. Each of the five training sets was further split into training and validation sets in an 80:20 ratio, and hyperparameters were tuned by monitoring validation accuracy. Thus, each model had five sets of optimal hyperparameters, one per training set. Hyperparameter tuning for the NAGLU models was done in a similar fashion. The hyperparameter information for the ARSA and NAGLU models is given in Supplementary Tables S2 and S3, respectively.
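The construction of the five balanced sets and the 80:20 split can be sketched as follows (hypothetical function name; the variant lists stand in for whatever feature records are used):

```python
import random

def make_balanced_splits(pathogenic, benign, n_sets=5, val_frac=0.2, seed=0):
    """Create `n_sets` balanced training sets: the pathogenic variants are
    shared across all sets, while an equally sized benign subset is freshly
    sampled for each; every set is then split 80:20 into train/validation."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_sets):
        sampled_benign = rng.sample(benign, len(pathogenic))
        pool = [(v, 1) for v in pathogenic] + [(v, 0) for v in sampled_benign]
        rng.shuffle(pool)
        n_val = int(len(pool) * val_frac)
        splits.append((pool[n_val:], pool[:n_val]))  # (train, validation)
    return splits
```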
Attention analysis
The 650M-parameter ESM models (ESM-1b, ESM-2) comprise 33 transformer encoder layers, each having 20 attention heads. The transformer encoder consists of multiple sequential attention-feedforward blocks. Within the attention layer, the input embedding for each token (residue) is converted into keys, queries, and values. Queries and keys are combined by matrix multiplication to form the attention score matrix, which is scaled by the square root of the dimensionality of the key vector and passed through a softmax function to generate a probability distribution. The multi-head attention structure splits the input across multiple heads, which enables each head to learn different characteristics and relationships between the residues.
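The attention-weight computation described above is standard scaled dot-product attention, which for a single head can be sketched as:

```python
import numpy as np

def attention_scores(Q, K):
    """Scaled dot-product attention weights: queries and keys are combined by
    matrix multiplication, scaled by sqrt(d_k), and passed through a softmax
    so that each row is a probability distribution over residues."""
    d_k = K.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)               # (L, L) raw attention scores
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)
```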
Attention scores are the numerical values representing the importance of each residue in a given sequence and capture the contextual relationships and inter-dependencies between the residues in a protein sequence.
We utilized the attention matrices of all the attention heads from the last layer of the ESM model for the wild-type protein sequence. For each functional (active-site, binding-site) position of the protein, we took the attention scores and selected the five most highly attended residues from each head. We also checked these top attended residue positions against the test set to find overlapping pathogenic and benign mutation positions. Using a similar approach, we obtained the five least attended residue positions corresponding to the functional positions and performed in silico saturation mutagenesis analysis for a few of the most and least attended residue positions.
For hard-to-predict variant positions, we obtained attention scores with respect to the functional sites of the protein from all the attention heads and averaged them. These average attention scores were sorted, and the top-scoring positions whose attention scores added up to 0.5 (the maximum being 1) were taken as the positions most highly attended by the functional sites. A similar calculation was done for the mutant protein sequence, and the percentage change of the mutant attention score with respect to the wild type was calculated to gauge the mutational effects on the residue environment.
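A sketch of this selection, assuming the last-layer attention is available as a NumPy array of shape (heads, L, L); the function names are hypothetical:

```python
import numpy as np

def top_attended_positions(attn, site, mass=0.5):
    """attn: (heads, L, L) last-layer attention. Average over heads the
    attention that functional position `site` pays to every position, then
    return the highest-scoring positions whose scores add up to `mass`
    (the maximum being 1, since each attention row sums to 1)."""
    avg = attn[:, site, :].mean(axis=0)   # (L,) head-averaged attention
    order = np.argsort(avg)[::-1]         # positions, highest score first
    cum = np.cumsum(avg[order])
    k = int(np.searchsorted(cum, mass)) + 1
    return order[:k], avg

def percent_change(avg_wt, avg_mut):
    """Percentage change of the mutant attention score vs. wild type."""
    return 100.0 * (avg_mut - avg_wt) / avg_wt
```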
Results
ARSA challenge
The fifteen best models from different teams were ranked against each other by the CAGI assessors.
The augmented ESM-1b model (trained on the sulfatase family), hereafter referred to as our "best model", ranked 7th in the overall metric that combined the rankings of four different metrics: Pearson's correlation coefficient, Kendall's tau correlation, area under the receiver operating characteristic curve (AUROC), and truncated AUROC (Jain et al. 2024a, b). We ranked first in the truncated AUROC category among the challenge participants, a metric relevant in clinical settings. In terms of AUROC, we were among the top five predictors; however, we lagged considerably on the Pearson's correlation and Kendall's tau metrics. Assessors also compared the performance of the challenge predictors with public tools such as AlphaMissense (Cheng et al. 2023), REVEL (Ioannidis et al. 2016), Polyphen-2 (Adzhubei et al. 2013), MutPred2 (Pejaver et al. 2020), and VEST4 (Carter et al. 2013). The best AUROC reported on the CAGI challenge data was 0.86 by AlphaMissense, while our model's AUROC was 0.81 (Fig. 2). Similarly, the best truncated AUROC reported was 0.53 by AlphaMissense, while our model's truncated AUROC was 0.50. Post challenge, we also compared the performance of our best model with other state-of-the-art pathogenicity predictors. On AUROC, ClinPred (Alirezaie et al. 2018) and MetaRNN (Li et al. 2022) performed better than our model, while REVEL, EVE (Frazer et al. 2021), and MutPred2 were comparable. On truncated AUROC, only AlphaMissense was better than our model, while ClinPred was comparable. On Pearson's correlation, all models except SIFT and SNAP2 (Hecht et al. 2015) performed better than ours. On Kendall's tau, Polyphen-2, ClinPred, MetaRNN, and AlphaMissense were better than our model. In addition to the metrics used by the CAGI assessors to rank the models, we also used the area under the precision-recall curve (AUPRC) as an additional metric.
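For reference, AUROC and a simple truncated variant can be computed directly from prediction scores as sketched below. This is an illustration only: one common definition truncates the ROC integral at a low false-positive-rate cutoff, and the assessors' exact cutoff and normalization may differ.

```python
import numpy as np

def roc_curve(scores, labels):
    """labels: 1 = pathogenic (positive class); higher score = more pathogenic.
    Returns FPR and TPR arrays swept over all score thresholds."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)        # true positives at each threshold
    fps = np.cumsum(1 - labels)    # false positives at each threshold
    tpr = tps / tps[-1]
    fpr = fps / fps[-1]
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auroc(scores, labels, max_fpr=1.0):
    """Trapezoidal area under the ROC curve, optionally truncated at a low
    false-positive rate (un-normalized here)."""
    fpr, tpr = roc_curve(scores, labels)
    keep = fpr <= max_fpr
    f, t = fpr[keep], tpr[keep]
    return float(np.sum((f[1:] - f[:-1]) * (t[1:] + t[:-1]) / 2.0))
```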
Metrics with confidence intervals for various predictors, our best model, other family models that we tried are provided in Table S4 and S5.
Fig. 2.
Comparison of our best family-based ARSA model with ESMs and other state-of-the-art pathogenicity predictors on various metrics: (A) AUROC (area under the receiver operating characteristic curve), (B) truncated AUROC, (C) Pearson's correlation coefficient, (D) Kendall's tau, (E) AUPRC (area under the precision-recall curve). For each performance metric, the Gaussian approximation-based 95% confidence interval was calculated as 1.96 × the standard deviation obtained from the bootstrap estimate.
To check whether the differences in performance were statistically significant, a one-sided binomial test was used with the number of wins over 1000 bootstrap samples as the test statistic (Jain et al. 2024a, b). Against MutPred2, our best model won 908, 821, and 655 times on the truncated AUROC, AUPRC, and AUROC metrics, giving p-values of 1.11e-169, 4.541e-99, and 3.388e-23, respectively. In comparison with Polyphen-2, our best model won 531, 560, and 965 times on AUROC, AUPRC, and truncated AUROC, achieving p-values of 0.0268, 8.252e-05, and 5.133e-237, respectively. Our best model won 591 times on truncated AUROC, with a p-value of 4.76e-09, in comparison with ClinPred. In comparison with MetaRNN, our model won 696 and 705 times on AUPRC and truncated AUROC, respectively. Results of the statistical analysis on the various metrics are provided in Table S6.
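This test can be reproduced with an exact one-sided binomial tail computation, sketched below with a hypothetical helper name (null hypothesis: each model is equally likely to win any given bootstrap sample):

```python
from math import comb

def binomial_pvalue(n_wins, n_trials=1000, p=0.5):
    """One-sided binomial test: probability of observing at least `n_wins`
    wins in `n_trials` bootstrap comparisons under the null that both
    models win each sample with probability 0.5."""
    return sum(comb(n_trials, k) for k in range(n_wins, n_trials + 1)) * p**n_trials

print(binomial_pvalue(531))  # ≈ 0.027, matching the Polyphen-2 AUROC comparison
```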
Post challenge, we compared our best model (ESM-1b augmented with sulfatase family variants) with ESM zero-shot predictions on the CAGI ARSA test set. From the plots in Figs. 2 and 3, we see that ESM-1b zero-shot predictions are comparable to our best model on AUROC (0.81 vs. 0.807), truncated AUROC (0.499 vs. 0.507), Kendall's tau (0.369 vs. 0.363), and AUPRC (0.734 vs. 0.725). However, our best model does slightly better on PCC (0.503 vs. 0.476). Testing performance on 1000 bootstrapped samples using a binomial test for statistical significance, we found that our model performed better than ESM-1b on all metrics except truncated AUROC (Table 2). Similarly, comparing our best model with ESM-2, we find that we do better than ESM-2 on all metrics. Though much of the success of our model is attributable to the zero-shot predictions from the ESM-1b model, family-based augmentation helps to improve predictions.
Fig. 3.
Full AUROC, truncated AUROC, and AUPRC curves for our best model along with other state-of-the-art predictors evaluated on the CAGI ARSA test set. For each performance metric, the Gaussian approximation-based 95% confidence interval was calculated as 1.96 × the standard deviation obtained from the bootstrap estimate
Table 2.
Statistical significance using one-sided binomial test for best model vs. zero-shot model on CAGI ARSA test set
| Binomial Test | | AUROC | AUPRC | Truncated AUROC | PCC | Kendall's Tau |
|---|---|---|---|---|---|---|
| Best model vs. ESM-1b | n_wins | 531 | 577 | 468 | 681 | 601 |
| | P-value | 0.027 | 6.255e-07 | 0.98 | 3.911e-31 | 9.008e-11 |
| Best model vs. ESM-2 | n_wins | 930 | 710 | 621 | 910 | 910 |
| | P-value | 7.104e-193 | 1.418e-41 | 9.434e-15 | 1.121e-171 | 1.21e-171 |
Bold entries indicate that the difference in performance is statistically significant for the particular metric
We also visualized t-distributed stochastic neighbor embedding (tSNE) plots to get an idea about the classifier’s decision boundary and separation of pathogenic and benign variants. As seen in Fig. 4, the model can classify variants with a clear decision boundary, as evident by the widely separated clusters of pathogenic and benign variants in the training set. We also see a good separation in the test set when comparing ESM-1b with our family model. However, there are overlaps indicating that there is still room to improve the model’s performance.
Fig. 4.
3D t-distributed stochastic neighbor embedding (tSNE) of variants from sulfatase family training set (upper row) and CAGI6 ARSA evaluation test set (lower row). The plots show classifier performance in separating pathogenic (blue color) and benign (brown color) variants in both training and test set. (a, d) before classification (b, e) after zero-shot classification (c, f) family specific classifier
Assessors also evaluated different predictors in the challenge as well as baseline predictors for their performance on difficult-to-predict variants (Jain et al. 2024a, b). The difficult-to-predict variants are annotated by the assessors based on poor performance on these variants by most of the top methods. Our best model attained minimum FPR for two difficult-to-predict pathogenic variants and minimum FNR for one difficult-to-predict benign variant. Other models, such as AlphaMissense and ESNPs&GO (Manfredi et al. 2022), attained minimum FPR for 2 and 3 difficult-to-predict variants respectively. REVEL, ESNPs&GO, Polyphen-2, and Evolutionary Index (Frazer et al. 2021) attained minimum FNR for 3 difficult-to-predict benign variants.
Note to Fig. 2: the Gaussian approximation-based 95% confidence interval was calculated as 1.96 × the standard deviation obtained from the bootstrap estimate. Best model: features (concatenation of ESM-1b embeddings obtained after PCA and the zero-shot log-odds score from ESM-1b), data (trained on sulfatase family variants). Control model: same as the best model except that ARSA variants are removed from the sulfatase family variant set. +For the EVE model, pathogenicity scores for 7 variants were not available, so they were removed from the evaluation set for metric calculation.
Post challenge analysis on NAGLU
To assess the performance of our approach on another clinically important gene, we used published NAGLU enzyme data (Clark et al. 2018) as our test set. As shown in Fig. 5, our best model, trained on carbohydrate-processing hydrolases of lysosomes using reduced features after PCA, has the highest AUPRC of 0.77, an AUROC of 0.89, and a truncated AUROC of 0.63. Our second-best model, trained on the hydrolase superfamily using reduced features after PCA, has an AUPRC of 0.76, an AUROC of 0.88, and a truncated AUROC of 0.62 (Tables S7 and S8). The ESM-2 zero-shot predictions are comparable to the best family model on all metrics except AUPRC (Fig. 5). Our best model, augmented with family data, has a better AUPRC than the zero-shot models (ESM-1b and ESM-2) and other state-of-the-art pathogenicity predictors. The plots of AUROC, truncated AUROC, and AUPRC are shown in Fig. 6. On testing the performance on 1000 bootstrapped samples using a binomial test for statistical significance, we found that our family model (augmented ESM-2) had 732 wins on the AUPRC metric as compared to ESM-2, with a p value of 1.146e-50. The ESM-2 model’s performance was better than AlphaMissense and statistically significant on all metrics (p value < 0.00001) except AUPRC, where the p value was 0.04 (Table S9). The control model (without NAGLU gene variants in the training set) performs on par with the best family model and shows that family variant data can serve as a proxy if the gene of interest has little or no data. Our attempts at evotuning the ESM-2 transformer model on evolutionarily related sequences did not result in improved performance, as shown in Table S10.
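The bootstrap "wins" comparison with a binomial test described above can be sketched as follows; the labels and score arrays are synthetic stand-ins, not the actual ARSA/NAGLU data, and the noise levels are chosen only to make one model visibly better:

```python
# Sketch: count AUPRC "wins" of model A over model B on bootstrap resamples,
# then test the win count against chance (p = 0.5) with a binomial test.
import numpy as np
from scipy.stats import binomtest
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)                 # synthetic pathogenic/benign labels
scores_a = y + rng.normal(0, 0.8, 200)      # stand-in "family model" scores
scores_b = y + rng.normal(0, 1.2, 200)      # stand-in "baseline" scores

trials = wins = 0
for _ in range(1000):
    idx = rng.integers(0, 200, 200)         # bootstrap resample with replacement
    if y[idx].min() == y[idx].max():        # skip degenerate single-class draws
        continue
    trials += 1
    ap_a = average_precision_score(y[idx], scores_a[idx])
    ap_b = average_precision_score(y[idx], scores_b[idx])
    wins += int(ap_a > ap_b)

# One-sided test: does model A win more often than a coin flip would?
p_value = binomtest(int(wins), trials, 0.5, alternative="greater").pvalue
print(wins, trials, p_value)
```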
Fig. 5.
Comparison of NAGLU family based models with ESMs and other state-of-the-art pathogenicity predictors on various metrics. (A) AUROC (area under the receiver operating characteristic curve) (B) Truncated ROC (C) Pearson’s correlation coefficient (D) Kendall’s Tau (E) AUPRC (area under the precision-recall curve). For each performance metric, the Gaussian approximation based 95% confidence interval was calculated as 1.96 * standard deviation obtained from the bootstrap estimate. Best Model: Features (concatenation of ESM-2 embeddings obtained after PCA and the zero-shot log-odds score from ESM-2), data (trained on carbohydrate hydrolase variants). Control Model: same as the best model except that NAGLU variants are removed from the family variant set. +For the EVE model, pathogenicity scores (EVE Score column) for 15 variants were not available, so they were removed from the test set for metric calculation. #For REVEL, pathogenicity scores for 52 variants were not available, so they were removed from the test set for metric calculation
Fig. 6.
Full AUROC, truncated AUROC, and AUPRC curves for our best model along with other state-of-the-art predictors evaluated on the NAGLU test set. For each performance metric, the Gaussian approximation based 95% confidence interval was calculated as 1.96 * standard deviation obtained from the bootstrap estimate
The 3D t-SNE plots of NAGLU training and test variants are shown in Supplementary Fig. S1. The best NAGLU model shows better separation of pathogenic and benign variants on the NAGLU test set than the ARSA model on the CAGI6 test set. The VUS test set (Clark et al. 2018) was provided by the dataset providers in the CAGI4 NAGLU challenge. This allowed us to also assess the performance of our model on difficult-to-predict variants (Clark et al. 2019) in the NAGLU VUS test set. Out of 10 hard-to-predict NAGLU variants, our model predicted 6 correctly (Table S11), while AlphaMissense and MutPred2 predicted 3 and 4 difficult variants correctly, respectively.
Mechanistic hypothesis for pathogenic variants based on attention analysis
Attention weights can be used to measure the importance of each amino acid site for protein fitness when mutated. A residue site that receives higher attention scores than other residues is inferred to be more important.
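A minimal sketch of this idea, assuming a per-layer, per-head attention tensor; here a random stand-in replaces what a transformer protein language model such as ESM would return:

```python
# Sketch: rank residue positions by how much attention they receive,
# averaged over layers, heads, and query positions.
import numpy as np

rng = np.random.default_rng(2)
L = 50                                       # sequence length
# (layers, heads, L, L) attention stand-in; each query row sums to 1
attn = rng.random((6, 8, L, L))
attn /= attn.sum(axis=-1, keepdims=True)

# Importance of position j = mean attention it receives across
# all layers, heads, and query positions.
importance = attn.mean(axis=(0, 1, 2))       # shape (L,)
top5 = np.argsort(importance)[::-1][:5]      # most-attended residue indices
print(top5)
```

With a real model, highly ranked positions can then be cross-checked against known functional sites, as done for ARSA and NAGLU below.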
Attention analysis for ARSA variants
We carried out attention analysis for the following functional sites in ARSA: the calcium binding site (D281, N282, D29, and D30), the substrate binding site (H229), the active site (K123), and C69. A few of the top attended residue positions in the ARSA protein that were part of the CAGI ARSA test set are 255 (wildtype D; mutants E, N, H), 146 (wildtype G; mutants R, S), 44 (wildtype S; mutant P), and 136 (wildtype P; mutants S, L, T). These positions are attended by multiple sites of functional importance in ARSA, as shown in Table 3, and all the variants at these positions mentioned above are annotated pathogenic based on experimental enzyme activity measurements. From conservation analysis, we see that residues D255 and P136 are 100% conserved across all human sulfatases, while G146 is partially conserved and S44 is not conserved. In silico saturation mutagenesis was done on D255, G146, and S44 to see how constrained these positions are. All D255 and G146 substitutions were predicted to be pathogenic by both our model and AlphaMissense, while S44 had substitutions that were either pathogenic or benign. A likely evolutionary relationship between the aforementioned residues and important functional sites in the protein, and their role in affecting the enzyme activity, is captured by the attention analysis.
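In silico saturation mutagenesis with a masked-LM log-odds rule can be sketched as below; the per-position probability table is a synthetic stand-in for the distributions an ESM model would emit, and the sequence is arbitrary:

```python
# Sketch: score every substitution at a position as
# log p(mutant | context) - log p(wildtype | context).
import numpy as np

AAS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(3)
seq = "DGSAPKLMNE"                             # arbitrary toy sequence
# (L, 20) stand-in for per-position amino acid probabilities
probs = rng.dirichlet(np.ones(20), size=len(seq))

def log_odds(pos, mut):
    """Log-odds of mutant vs wildtype at a 0-based position."""
    wt = seq[pos]
    return np.log(probs[pos, AAS.index(mut)]) - np.log(probs[pos, AAS.index(wt)])

# Saturation scan at position 0 (wildtype D): score all 19 substitutions.
scan = {aa: log_odds(0, aa) for aa in AAS if aa != seq[0]}
most_tolerated = max(scan, key=scan.get)       # least-negative substitution
print(len(scan), most_tolerated)
```

A low (very negative) log-odds score under this rule corresponds to a substitution the model considers unlikely, i.e., a candidate pathogenic variant.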
Table 3.
Attention analysis of ARSA variant positions for a few functional sites
| Functional Site | Top attended residues | Corresponding CA-CA Distance from functional site (Å) |
|---|---|---|
| H229 | D255, G146 | 18.5, 16.35 |
| D281 | D255, G146 | 20.98, 20.76 |
| N282 | D255 | 20.54 |
| D30 | P136, S44 | 31.05, 13.57 |
| D29 | S44 | 12.8 |
| K123 | P136, G146 | 14, 5.48 |
D281, N282, D29, and D30: calcium binding site; H229: substrate binding site; K123: active site
We also carried out attention analysis for one of the difficult-to-predict ARSA variants, A212D, which was predicted correctly and with high confidence by our model. This variant was among the top 30 difficult-to-predict variants in the ARSA test set. On mapping the top residues attended by A212 onto the ARSA structure, we saw that F219, L117, and the hydrophobic portion of R217 form a hydrophobic cluster with A212 (Fig. 7). This cluster is disrupted on mutation to D, and steric clashes are introduced with the carbonyl oxygens of the residues (R217, P218) that are part of the nearby turn. We also checked DDMut (Zhou et al. 2023), a state-of-the-art algorithm for protein stability prediction, and found that it predicted this mutation as destabilizing with a high score of -2.03 kcal/mol. Mutations to charged residues in the interior of proteins are usually among the most difficult variants for predicting protein stability changes. Burying a charge reduces protein stability, and for an apolar-to-charged mutation to be stabilizing, the neighboring environment would need to change to accommodate the charge favorably. We reason that, in addition to the biochemical features of the residues captured by the ESM embeddings (Rives et al. 2021), attention in the ESM model is able to capture the structural proximity of the residues, which helps the model predict the mutation correctly.
Fig. 7.
Top residues attended by A212 are shown to form a hydrophobic cluster. Mutation to D212 introduces multiple steric clashes
Attention analysis for NAGLU variants
We performed attention analysis for the following important functional sites in the NAGLU protein: N134, C136, Y140, W201, M204, W268, N315, E316 (catalytic residue), W352, L383, L407, F410, E446 (catalytic residue), H512, W649, I655, and Y658 (Meiyappan et al. 2015). A few of the NAGLU variants that were highly attended by multiple functional sites are listed in Table 4. We observed that A444 has two different variants in the test set, A444D (pathogenic) and A444V (benign), and on in silico saturation mutagenesis using our best model and AlphaMissense we see that variants at this position are either pathogenic or benign. We call such positions modulators of enzyme activity: depending on the substitution, the enzyme can retain normal healthy activity levels or drop to subnormal levels leading to disease. In silico saturation mutagenesis of W361 and E386 shows that all the substitutions are pathogenic (except E386D); we call these sites constrained. We also found two positions in NAGLU, G491 and P488, that were least attended by functional residues and were predicted benign for all possible substitutions by our model and AlphaMissense. These residues presumably have no role in the regulation of enzyme activity and can be modified to optimize other properties of the enzyme. We think that attention analysis can help pinpoint residues that should or should not be altered when engineering enzymes for property optimization. Thus, modulatory sites and least attended sites can be mutated for engineering, while the constrained positions should be kept unaltered to retain optimal enzyme activity.
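The three categories described above (constrained, modulator, and tolerant/least attended) can be expressed as a simple rule over saturation-scan calls; the prediction table here is illustrative, not actual model output:

```python
# Sketch: bucket positions by the labels their saturation-scan
# substitutions receive. Calls below are made-up examples.
calls = {
    "W361": {"A": "pathogenic", "D": "pathogenic", "K": "pathogenic"},
    "A444": {"D": "pathogenic", "V": "benign"},
    "G491": {"A": "benign", "S": "benign"},
}

def classify(position_calls):
    """Constrained if every substitution is pathogenic, tolerant if
    every substitution is benign, modulator otherwise."""
    labels = set(position_calls.values())
    if labels == {"pathogenic"}:
        return "constrained"
    if labels == {"benign"}:
        return "tolerant"
    return "modulator"

for pos, c in calls.items():
    print(pos, classify(c))
```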
Table 4.
Attention analysis of NAGLU variant positions for a few functional sites
| Functional Site | Top attended residues | Corresponding CA-CA Distance from functional site (Å) |
|---|---|---|
| E446 | E386, A444, R377 | 14.68, 6.49, 25.79 |
| L383 | E386, A444, W361, R377 | 8.61, 6.93, 15.65, 17.92 |
| W352 | E386, W361 | 12.66, 10.93 |
| L407 | E386, A444 | 7.45, 4.63 |
We carried out attention analysis for the difficult-to-predict NAGLU variant R377H. Structural analysis of the variant revealed that the residue is exposed to the solvent. In general, mutations on the surface of proteins are highly depleted in pathogenic variants, so many pathogenicity predictors do not perform well on surface mutations, and a pathogenic mutant on the protein surface is usually predicted benign by most predictors. However, our model predicts this mutation as disease-causing, so we investigated the probable reason for this prediction using attention analysis. From the attention analysis, we saw that the residues most strongly attended by R377 are T343, A345, Y335, D382, and L381 (Fig. 8). While T343, A345, and Y335 form polar interactions with R377, residues L381 and D382 are near the active site residue L383. These residues are highly attended, and their attention scores with respect to R377 change significantly on mutation to H. The R377 position is also among the top attended residues for the active site residues E446 and L383. These observations explain the importance of this position and suggest a probable reason for the pathogenicity of the mutant.
Fig. 8.
Top residues attended by R377 are shown to form polar interactions with T343, A345, and Y335. R377 is highly attended by active site residues E446 and L383 (possible long-range evolutionary relationship)
Discussion
Leveraging machine learning for accurate classification of pathogenic and benign variants requires appropriate selection of training variants. Previous work in this area has shown that gene-specific approaches are better at predicting variant pathogenicity than genome-wide approaches. However, gene-specific approaches are limited by the unavailability of sufficient training data for a given gene. Therefore, in this work, we incorporate biological context by drawing training variants not only from the gene of interest but also from genes that are functionally associated with it, i.e., the gene family, to obtain a good set of variants for training a family-specific classifier.
Our method combines the log-odds score from zero-shot predictions with embeddings from ESM as features to train a supervised model on variant data associated with the gene family of interest. The zero-shot model predicts variant effect based on its pre-training on the varied protein sequence data available in the UniRef sequence database and is thus a generic variant predictor; the supervised model, leveraging ESM embeddings as features, captures traits specific to the protein family and helps improve the predictions on certain metrics. Thus, our model’s novelty lies in capturing both generic and contextual properties of proteins from ESM itself. This approach results in improved performance without any gene-specific feature engineering, which can be a limiting factor in obtaining comprehensive gene-specific features for unannotated proteins or regions of proteins. Dimensionality reduction also helped the models improve by increasing their generalizability.
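A minimal sketch of this training recipe, with synthetic stand-ins for the ESM embeddings and zero-shot scores and illustrative hyperparameters (dimensionality, number of components, classifier choice):

```python
# Sketch: PCA-reduce per-variant embeddings, append the zero-shot
# log-odds score, and fit a supervised classifier on family variants.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 300
emb = rng.normal(size=(n, 1280))          # stand-in ESM-1b embeddings
log_odds = rng.normal(size=(n, 1))        # stand-in zero-shot scores
# Synthetic labels loosely tied to the zero-shot score (1 = pathogenic)
y = (log_odds.ravel() + rng.normal(0, 0.5, n) < 0).astype(int)

emb_reduced = PCA(n_components=50, random_state=0).fit_transform(emb)
X = np.hstack([emb_reduced, log_odds])    # concatenated feature vector

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape, clf.score(X, y))
```

The same fitted PCA and classifier would then be applied to held-out variants of the gene of interest.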
We found that our best ARSA model (augmented ESM-1b) performed better than ESM-1b on all metrics except truncated AUROC, as gauged by a binomial test for statistical significance. However, a substantial part of our family-specific model’s performance is attributable to the ESM models, as these are our generic predictors, which we augment with the family data. The combination works well, giving statistically significant results for metrics where our model performs better than the state-of-the-art pathogenicity predictors. The ESM-2 zero-shot model performs better than our best NAGLU model (augmented ESM-2) on all metrics except AUPRC, based on a binomial test for statistical significance (Table S9). In both the ARSA and NAGLU test sets, we observe that the family based models have increased AUPRC compared to the ESM models. This is crucial because, in diagnostic labs, variant prioritization generally entails identifying a small number of pathogenic variants from a larger number of benign variants (Anderson and Lassmann 2018). For this task, the AUPRC is a more informative measure of performance than the AUROC, because it better quantifies the number of false positives (FP). The AUROC plots the true positive rate (TPR) versus the false positive rate (FPR), and the FPR remains low even when there are many FPs, because the majority of benign variants are correctly classified. The AUPRC plots precision versus recall (the TPR), and precision gives a more accurate picture of the number of FPs than the FPR. Thus the AUPRC helps in clinical decision making by reducing the number of variants to follow up for further investigation.
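This AUROC-versus-AUPRC point can be illustrated on a synthetic, imbalanced prioritization set; the prevalence and score quality below are made up for illustration:

```python
# Illustration: on class-imbalanced data, AUROC can stay high while
# AUPRC exposes the false positives among the top-scoring variants.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(5)
y = np.array([1] * 20 + [0] * 980)            # 2% pathogenic, 98% benign
scores = y * 1.5 + rng.normal(0, 1, 1000)     # modestly informative score

auroc = roc_auc_score(y, scores)
auprc = average_precision_score(y, scores)
print(round(auroc, 2), round(auprc, 2))
# AUROC stays high because most benign variants are ranked correctly;
# AUPRC is much lower because benign variants mixed into the top ranks
# directly depress precision.
```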
Our work highlights that models trained on superfamily/family data can predict variants with good confidence for proteins that have little or no data. To predict for a gene with no or little variant data, family based models can help, provided there is enough variant data for the protein members of the family. This is what we show with the control models, where we have variants from all the members of the family except the variants from the gene of interest (ARSA/NAGLU). If variant data for the family members are still sparse, we can extend to the superfamily and still get good predictions for the gene of interest, as seen for the NAGLU models, where the hydrolase superfamily model performs on par with the best model (the carbohydrate hydrolase model).
This method is applicable to other protein families because it does not require any feature engineering and requires only the collection of family-specific variants for training. Our sequence based approach with no feature engineering compares favorably on a few metrics with other state-of-the-art methods such as MutPred2, which uses extensive feature engineering, or AlphaMissense, which uses MSA and structure information and is trained on a large amount of variant data. We think that for a model trained only on family-specific training data and relying only on pre-trained embeddings for features, these results are significant. However, a major limitation of this approach is that it will only work for well studied genes and gene families. For novel proteins (with no variant data and no family or superfamily gene variant data), only genome-wide predictors can predict the impact of variants on the protein.
Leveraging attention for the interpretation of protein structure and function is a promising feature of transformer-based protein language models. The ability to identify direct residue-residue co-evolutionary couplings is one of the main factors enabling transformer models to predict protein structure. Co-evolutionarily coupled residues, however, need not always be spatially close. According to earlier research, co-evolution at distal locations may result from other biological phenomena such as allosteric interactions, codon effects, and negative design (Karamanos 2023). The attention analysis for both NAGLU and ARSA variants revealed high attention scores between the pathogenic variant positions and the functional sites. In our analysis, highly attended pathogenic residues located at a distance greater than 10 Å from the protein’s functional sites suggest a long-range evolutionary link between them and a possible role in attenuating the enzyme activity. Though it is less intuitive that mutations at a distance of 20 Å or more from functional sites affect enzyme function, there are reports confirming long-range evolutionary coupling and attenuation (Romero et al. 2015; Abriata et al. 2015; Leferink et al. 2014). The evolutionary coupling gleaned from attention analysis of the proteins has broad implications for protein engineering and molecular evolution.
Conclusions
ESM zero-shot predictions outperform other pathogenicity tools on a few metrics, and our family models (ESM augmented with family data) boost the performance further, either on the same metric or on a different one. The combination thus works well, giving statistically significant results across different metrics. A thorough evaluation of this framework on other clinically important genes is warranted to establish it as a reliable approach for improving the prediction accuracy of genome-wide predictors. The current framework might also benefit from incorporating structural information as opposed to only the sequence embeddings used in this work. Overall, our work addresses the complex challenge of leveraging protein family variant data to improve gene-specific predictions of variant pathogenicity. The insights gained from the attention analysis also offer an intriguing avenue for further exploration.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
The CAGI experiment is supported by National Institutes of Health grant U24 HG007346. We thank the CAGI organisers for conducting such experiments regularly and providing a platform for researchers to develop and test their methods.
Author contributions
S.R. conceptualization and design of experiments; D.J. and S.P. implementation; D.J. carried out all experiments and analysis; S.P. AI expertise and support; R. Sajeed data acquisition, compilation, and analysis; S.R. wrote the original draft; S.R., S.P., D.J., and R. Srinivasan reviewed the manuscript; S.R. and R. Srinivasan supervision. All authors have read and approved the final manuscript.
Funding
This work is supported by Tata Consultancy Services.
Data availability
No datasets were generated or analysed during the current study.
Code availability
The codes supporting the current study are available from the corresponding author on request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
The original online version of this article was revised due to a retrospective Open Access order.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
8/12/2025
A Correction to this paper has been published: 10.1007/s00439-025-02767-5
References
- Abriata LA, Palzkill T, Dal Peraro M (2015) How structural and physicochemical determinants shape sequence constraints in a functional enzyme. PLoS ONE 10:e0118684. 10.1371/journal.pone.0118684
- Adhikari AN (2019) Gene-specific features enhance interpretation of mutational impact on acid α-glucosidase enzyme activity. Hum Mutat 40:1507–1518. 10.1002/humu.23846
- Adzhubei I, Jordan DM, Sunyaev SR (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet 76:7.20.1–7.20.41
- Alirezaie N, Kernohan KD, Hartley T et al (2018) ClinPred: prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants. Am J Hum Genet 103:474–483. 10.1016/j.ajhg.2018.08.005
- Alley EC, Khimulya G, Biswas S et al (2019) Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16:1315–1322. 10.1038/s41592-019-0598-1
- Anderson D, Lassmann T (2018) A phenotype centric benchmark of variant prioritisation tools. Npj Genom Med 3:5. 10.1038/s41525-018-0044-9
- Carter H, Douville C, Stenson PD et al (2013) Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics 14:S3. 10.1186/1471-2164-14-S3-S3
- Cheng J, Novati G, Pan J et al (2023) Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381:eadg7492. 10.1126/science.adg7492
- Clark WT, Yu GK, Aoyagi-Scharber M, LeBowitz JH (2018) Utilizing ExAC to assess the hidden contribution of variants of unknown significance to Sanfilippo type B incidence. PLoS ONE 13:e0200008. 10.1371/journal.pone.0200008
- Clark WT, Kasak L, Bakolitsa C et al (2019) Assessment of predicted enzymatic activity of α-N-acetylglucosaminidase variants of unknown significance for CAGI 2016. Hum Mutat 40:1519–1529. 10.1002/humu.23875
- Dehouck Y, Grosfils A, Folch B et al (2009) Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25:2537–2543. 10.1093/bioinformatics/btp445
- Draelos RL, Ezekian JE, Zhuang F et al (2022) GENESIS: gene-specific machine learning models for variants of uncertain significance found in catecholaminergic polymorphic ventricular tachycardia and long QT syndrome-associated genes. Circ Arrhythm Electrophysiol 15:e010326. 10.1161/CIRCEP.121.010326
- Fokkema IFAC, Kroon M, López Hernández JA et al (2021) The LOVD3 platform: efficient genome-wide sharing of genetic variants. Eur J Hum Genet 29:1796–1803. 10.1038/s41431-021-00959-x
- Frazer J, Notin P, Dias M et al (2021) Disease variant prediction with deep generative models of evolutionary data. Nature 599:91–95. 10.1038/s41586-021-04043-8
- Greene H, Hug G, Schubert WK (1967) Arylsulfatase A in the urine and metachromatic leukodystrophy. J Pediatr 71:709–711. 10.1016/S0022-3476(67)80207-5
- Hecht M, Bromberg Y, Rost B (2015) Better prediction of functional effects for sequence variants. BMC Genomics 16:S1. 10.1186/1471-2164-16-S8-S1
- Hopf TA, Ingraham JB, Poelwijk FJ et al (2017) Mutation effects predicted from sequence co-variation. Nat Biotechnol 35:128–135. 10.1038/nbt.3769
- Ioannidis NM, Rothstein JH, Pejaver V et al (2016) REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet 99:877–885. 10.1016/j.ajhg.2016.08.016
- Jain S, Bakolitsa C, Brenner SE et al (2024a) CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 25:53. 10.1186/s13059-023-03113-6
- Jain S, Trinidad M, Nguyen TB et al (2024b) Evaluation of enzyme activity predictions for variants of unknown significance in Arylsulfatase A. 10.1101/2024.05.16.594558
- Karamanos TK (2023) Chasing long-range evolutionary couplings in the AlphaFold era. Biopolymers 114:e23530. 10.1002/bip.23530
- Kroos M, Hoogeveen-Westerveld M, Michelakakis H et al (2012) Update of the Pompe disease mutation database with 60 novel GAA sequence variants and additional studies on the functional effect of 34 previously reported variants. Hum Mutat 33:1161–1165. 10.1002/humu.22108
- Laimer J, Hofer H, Fritz M et al (2015) MAESTRO - multi agent stability prediction upon point mutations. BMC Bioinformatics 16:116. 10.1186/s12859-015-0548-6
- Landrum MJ, Lee JM, Benson M et al (2018) ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46:D1062–D1067. 10.1093/nar/gkx1153
- Leferink NGH, Antonyuk SV, Houwman JA, Scrutton NS, Eady RR, Hasnain SS (2014) Impact of residues remote from the catalytic centre on enzyme catalysis of copper nitrite reductase. Nat Commun 5:4395. PMID 25022223
- Li C, Zhi D, Wang K, Liu X (2022) MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning. Genome Med 14:115. 10.1186/s13073-022-01120-z
- Lin Z, Akin H, Rao R et al (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379:1123–1130. 10.1126/science.ade2574
- Manfredi M, Savojardo C, Martelli PL, Casadio R (2022) E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants. Bioinformatics 38:5168–5174. 10.1093/bioinformatics/btac678
- Meiyappan M, Concino MF, Norton AW (2015) Crystal structure of human alpha-N-acetylglucosaminidase. US Patent 20150031112A1, 29 Jan 2015
- Ng PC, Henikoff S (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812–3814. 10.1093/nar/gkg509
- Niroula A, Urolagin S, Vihinen M (2015) PON-P2: prediction method for fast and reliable identification of harmful variants. PLoS ONE 10:e0117380. 10.1371/journal.pone.0117380
- O’Brien JS (1972) Sanfilippo syndrome: profound deficiency of alpha-acetylglucosaminidase activity in organs and skin fibroblasts from type-B patients. Proc Natl Acad Sci 69:1720–1722. 10.1073/pnas.69.7.1720
- Pejaver V, Urresti J, Lugo-Martinez J et al (2020) Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat Commun 11:5918. 10.1038/s41467-020-19669-x
- Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A (2010) Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20:110–121. 10.1101/gr.097857.109
- Potter SC, Luciani A, Eddy SR et al (2018) HMMER web server: 2018 update. Nucleic Acids Res 46:W200–W204. 10.1093/nar/gky448
- Rentzsch P, Witten D, Cooper GM et al (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 47:D886–D894. 10.1093/nar/gky1016
- Rives A, Meier J, Sercu T et al (2021) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci 118:e2016239118. 10.1073/pnas.2016239118
- Romero PA, Tran TM, Abate AR (2015) Dissecting enzyme function with microfluidic-based deep mutational scanning. Proc Natl Acad Sci USA 112:7159–7164. 10.1073/pnas.1422285112
- Savojardo C, Fariselli P, Martelli PL, Casadio R (2016) INPS-MD: a web server to predict stability of protein variants from sequence and structure. Bioinformatics 32:2542–2544. 10.1093/bioinformatics/btw192
- Shashkova TI, Umerenkov D, Salnikov M et al (2022) SEMA: antigen B-cell conformational epitope prediction using deep transfer learning. Front Immunol 13
- Stütz AE, Wrodnigg TM (2016) Chapter Four - Carbohydrate-processing enzymes of the lysosome: diseases caused by misfolded mutants and sugar mimetics as correcting pharmacological chaperones. In: Baker DC (ed) Advances in Carbohydrate Chemistry and Biochemistry. Academic, pp 225–302
- The UniProt Consortium (2023) UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res 51:D523–D531. 10.1093/nar/gkac1052
- Valstar MJ, Ruijter GJG, van Diggelen OP et al (2008) Sanfilippo syndrome: a mini-review. J Inherit Metab Dis 31:240–252. 10.1007/s10545-008-0838-5
- von Figura K, Kresse H (1972) The Sanfilippo B corrective factor: a N-acetyl-α-D-glucosaminidase. Biochem Biophys Res Commun 48:262–269. 10.1016/S0006-291X(72)80044-5
- Zhang X, Walsh R, Whiffin N et al (2021) Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions. Genet Med 23:69–79. 10.1038/s41436-020-00972-3
- Zhou Y, Pan Q, Pires DEV et al (2023) DDMut: predicting effects of mutations on protein stability using deep learning. Nucleic Acids Res 51:W122–W128. 10.1093/nar/gkad472