Author manuscript; available in PMC: 2020 Sep 1.
Published in final edited form as: Hum Mutat. 2019 Aug 7;40(9):1507–1518. doi: 10.1002/humu.23846

Gene-specific features enhance interpretation of mutational impact on acid alpha-glucosidase enzyme activity

Aashish N Adhikari 1,*
PMCID: PMC7329270  NIHMSID: NIHMS1037759  PMID: 31228295

Abstract

We present a computational model for predicting mutational impact on enzymatic activity of human acid alpha-glucosidase (GAA), an enzyme associated with Pompe disease. Using a model that combines features specific to GAA with other general evolutionary and physicochemical features, we made blind predictions of enzymatic activity relative to wildtype human GAA for >300 GAA mutants, as part of the CAGI5 GAA challenge. We found that gene-specific features can improve the performance of existing impact prediction tools that mostly rely on general features for pathogenicity prediction. The majority of the poorly predicted mutants that lower wildtype GAA enzyme activity occurred on the surface of the GAA protein. We also found that gene-specific features were uncorrelated with existing methods and provided orthogonal information for interpreting the origin of pathogenicity, particularly in variants that are poorly predicted by existing general methods. Specific variants in GAA, when investigated in the context of its protein structure, suggested gene-specific information such as the disruption of local backbone torsional geometry and the disruption of particular sidechain-sidechain hydrogen bonds as potential sources of pathogenicity.

Keywords: Variant interpretation, Pompe disease, enzyme activity prediction, gene-specific variant effect prediction, acid alpha-glucosidase

Introduction

Pompe disease is a rare autosomal-recessive lysosomal storage disorder caused by the deficiency of acid alpha-glucosidase (GAA) which leads to pathological accumulation of glycogen in lysosomes. The severity of Pompe disease and age of onset can be highly variable. The clinical manifestations are dominated by progressive weakness of skeletal muscle throughout the clinical spectrum. The classic infantile form is further characterized by hypertrophic cardiomyopathy. The diagnostic workup for Pompe disease involves a GAA enzymatic activity assay which is necessary because the activity of the GAA enzyme is correlated with the severity and onset of Pompe disease. The infantile, juvenile, and adult onset forms of the disease have an average of 0%, 5%, and 8% residual GAA activity, respectively (Reuser, Hirschhorn, & Kroos, 2014).

The 952 amino acid human GAA enzyme is encoded by the GAA gene. To date, more than 500 different human genetic variants in GAA including missense, nonsense, small or large insertions and deletions, frame-shifting, and splicing changes have been observed and implicated in Pompe disease (Pompe_Center_mutation_database, 2019). While most of these variants are associated with the disease, the functional impact and clinical severity of the variants vary, including several variants that are of uncertain clinical significance. Novel genetic variants will likely be discovered in the GAA gene as more individuals are sequenced. To fully understand the genotype-to-phenotype relationships in Pompe disease, it is therefore critical to characterize the impact of genetic variants in human GAA, particularly on how they affect its enzymatic activity.

With the progress in sequencing of human genomes, several computational methods have emerged within the last decade to predict the impact of genetic variants in human disease. Typically, these variant interpretation methods aim to distinguish disease-causing pathogenic variants from those that are benign. Some rely predominantly on evolutionary conservation information (Reva, Antipin, & Sander, 2011; Sim et al., 2012), whereas others additionally incorporate physicochemical properties and protein structural data (Adzhubei, Jordan, & Sunyaev, 2013; Carter, Douville, Stenson, Cooper, & Karchin, 2013). Most tools focus on predicting the impact of nonsynonymous changes, but some methods attempt to quantify the mutational effect across the whole genome (Kircher et al., 2014). Recently, meta-predictors have also emerged that combine scores from several prediction tools using machine learning methods and report an integrated pathogenicity score (Dong et al., 2015; Ioannidis et al., 2016). Yet other tools, instead of directly predicting clinical pathogenicity, focus on predicting various intermediate biochemical phenotypes and functional consequences resulting from the genetic variants, including protein stability (Quan, Lv, & Zhang, 2016), RNA splicing (Yeo & Burge, 2004), subcellular protein localization (Laurila & Vihinen, 2011), protein interactions (Betts et al., 2015), and mechanism of action (Li et al., 2009).

Most of the aforementioned methods train broadly in a supervised fashion across large databases of previously characterized human disease-associated and neutral variants. Even though the tools are numerous, the underlying training datasets, features and design principles used are largely homogeneous, and these often suffer from confounding issues of circularity and overfitting arising from overlapping training and evaluation datasets (Grimm et al., 2015).

An implicit assumption made when training classifiers of mutational impact using data from all human genes is that the properties that determine a variant’s deleteriousness are broadly generalizable across all genes. This is certainly reasonable to a degree. For example, amino acids that are highly conserved in evolution are generally intolerant to variants. Similarly, variants that introduce polar sidechains in a protein’s hydrophobic core can destabilize the protein and are generally deleterious. Yet, not all the damaging variants in a particular gene can be comprehensively identified using such general properties alone. Making higher resolution predictions of mutational impact for a particular gene of interest might require incorporating properties that capture the biological context specific to that gene. Some previous efforts have made customized predictions for a gene or gene family of interest (Masica, Sosnay, Cutting, & Karchin, 2012; Torkamani & Schork, 2007) and found that the context of the individual gene, biological pathway and disease can inform the computational models. Still, such gene-specific approaches are less prevalent compared to general pathogenicity prediction tools, in part because of lack of sizable datasets from experimental phenotypic assays targeting specific genes and diseases.

The Critical Assessment of Genome Interpretation (CAGI) experiment aims to evaluate and benchmark computational tools for interpreting the phenotypic impact of genetic variation (Hoskins et al., 2017). One of the challenges in CAGI5 was to make blind predictions of enzymatic activity for >350 different human GAA mutants with experimental data available to the CAGI organizers. In this work, we present a computational framework that integrates common general features with features specific to the human GAA gene for predicting the mutational impact on the human GAA enzyme. We applied the framework to the mutations from the CAGI5 GAA challenge to make blind predictions, which were subsequently assessed. We find that gene-specific features can supplement the predictive capabilities of existing pathogenicity prediction tools. Even though the predicted enzyme activities in GAA were poorer compared to similar enzyme activity challenges in previous CAGIs, we found that incorporating information on the full context of the gene can be informative in the interpretation of mutational impact. This was particularly true for mutations that were difficult to predict using existing computational pathogenicity prediction methods alone.

Methods

GAA CAGI challenge test dataset

CAGI5 organizers provided the challenge test dataset, which included cDNA changes of nonsynonymous variants in GAA from an enzyme activity assay performed by BioMarin. Briefly, BioMarin functionally assessed enzymatic activity for 357 nonsynonymous variants in GAA observed in ExAC, most of whose contribution to the disease incidence is unknown. Plasmids containing cDNAs encoding each of the mutant proteins were transfected into an immortalized Pompe patient fibroblast cell line with no GAA activity; the cells were subsequently lysed, and GAA activity in the lysate was assessed using a fluorogenic substrate. For each variant, the background subtracted mutant enzyme activity was reported as the percent of the background subtracted wildtype GAA enzyme activity. For each mutant, the average normalized % wildtype enzyme activity and standard deviation were calculated from three independent transfections. The GAA CAGI challenge required computational methods to make blind predictions of these % wildtype enzyme activity values, which could then be independently assessed. Details are provided on the CAGI website (GAA_CAGI5_Challenge, accessed Jan 2019).

Training set

Our approach involved integrating general evolutionary and physicochemical features with GAA-specific features to build a predictive computational model for mutational impact on GAA enzyme activity. For model training, all nonsynonymous variants in the GAA gene associated with Pompe disease observed in a locus-specific database (LSDB) were collected (Pompe_Center_mutation_database, 2019). Based on the annotated effect in the LSDB, the variants were sorted by severity. The “very severe” annotations were assigned a score of 0 and the “non-pathogenic” annotations were assigned a score of 100. Intermediate values were assigned to the variants with other annotation classes based on the class-wise average enzyme activity observed for those variants in cited publications from the Pompe center mutation database: very severe: 0, potentially less severe: 3.0, less severe: 5.0, potentially mild: 20.0, presumably non-pathogenic: 50.0, non-pathogenic: 100.0 (Kroos et al., 2012; Kroos et al., 2008). For some variants, specific enzyme activity values were available (gathered from Table 2 in (Deming et al., 2017) and the references therein) and these were used directly as activity scores. The full collected training dataset and the activity scores are provided in Supp. Table S1. Seventeen variants in the training set (where enzyme activity values were gathered from previous studies) were also present in the CAGI5 GAA challenge test data set and thus were removed from test set performance assessment.
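The label assignment described above can be illustrated with a short sketch. The class-wise scores are the ones quoted in the text; the function name, the record layout and the example variants are hypothetical and are not taken from the training set.

```python
# Class-wise activity scores (% wildtype) quoted in the text above.
SEVERITY_SCORE = {
    "very severe": 0.0,
    "potentially less severe": 3.0,
    "less severe": 5.0,
    "potentially mild": 20.0,
    "presumably non-pathogenic": 50.0,
    "non-pathogenic": 100.0,
}


def activity_score(annotation, measured_activity=None):
    """Return a training label in % wildtype activity.

    A measured value, when available, takes precedence over the
    class-wise score derived from the LSDB annotation.
    """
    if measured_activity is not None:
        return float(measured_activity)
    return SEVERITY_SCORE[annotation.strip().lower()]


# Hypothetical records: (variant label, LSDB annotation, measured activity or None)
records = [
    ("variant_A", "Very severe", None),
    ("variant_B", "Potentially mild", None),
    ("variant_C", "Non-pathogenic", 115.0),
]
labels = {name: activity_score(ann, act) for name, ann, act in records}
```

In this sketch a measured activity above 100 is kept as is; whether such values were capped in the actual training set is not stated in the text.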

Table 2:

Performance of the models in classification of putatively pathogenic (<10% wildtype GAA activity) vs. benign GAA mutations.

Binary class threshold = 10% WT enzyme activity (63 positives, 272 negatives)

Model  ROC_AUC  ACC   F1    TPR   FPR   TNR   PPV   NPV   MCC
AA_1   0.61     0.81  0.36  0.29  0.53  0.93  0.47  0.85  0.26
AA_2   0.59     0.80  0.32  0.25  0.56  0.93  0.44  0.84  0.23
AA_3   0.57     0.79  0.27  0.21  0.61  0.93  0.39  0.83  0.17
AA_4   0.63     0.82  0.41  0.33  0.48  0.93  0.52  0.86  0.32

Binary class threshold = 25% WT enzyme activity (125 positives, 210 negatives)

Model  ROC_AUC  ACC   F1    TPR   FPR   TNR   PPV   NPV   MCC
AA_1   0.62     0.69  0.47  0.38  0.36  0.87  0.64  0.70  0.29
AA_2   0.62     0.69  0.47  0.38  0.36  0.87  0.64  0.70  0.29
AA_3   0.60     0.66  0.45  0.38  0.44  0.82  0.56  0.69  0.22
AA_4   0.64     0.67  0.53  0.50  0.43  0.77  0.57  0.72  0.28

Features

Features for predicting the mutational impact of variants could be broadly classified into two categories: general features, which included some of the commonly used evolutionary conservation and functional impact scores; and gene-specific features, which captured features specific to the GAA gene. Most general features were obtained from existing sources whereas most gene-specific features had to be computed anew here.

First, all functional annotations for the GAA gene from dbNSFPv3.3a (Liu, Wu, Li, & Boerwinkle, 2016) were collected, and several evolutionary conservation, functional impact and pathogenicity prediction rankscores were selected as features. For each feature, missing values were replaced by the mean value of that particular feature across all variants for GAA in dbNSFP. Some gene-specific and other general features required protein structural information and were calculated using the structure of mature human GAA determined recently by X-ray crystallography (PDB ID: 5kzw (Berman et al., 2000; Deming et al., 2017)). Beyond the dbNSFP rankscores, the features incorporated into the model included: average distance from two glycogen binding active sites, carbohydrate binding sites, solvent accessibility, residue-wise conservation score, site-wise mainchain-sidechain and sidechain-sidechain hydrogen bonds to critical residues, changes in hydrophobicity/charge/volume upon variant, sequence distance from exon boundaries, protein domain, backbone torsional propensity change, and allele count in the ExAC database (Lek et al., 2016). A full list of the features is provided in Supp. Table S2. Calculations of features specific to this work are described below.
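The mean imputation of missing rankscores mentioned above amounts to the following minimal sketch. The function name and the example values are hypothetical; in practice the mean would be taken over all GAA variants in dbNSFP for the feature in question.

```python
import math


def impute_column_mean(values):
    """Replace missing entries (None or NaN) with the mean of the observed entries."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if (v is None or math.isnan(v)) else v for v in values]


# Hypothetical rankscores for one feature across five variants
rankscores = [0.5, None, 0.75, float("nan"), 0.25]
imputed = impute_column_mean(rankscores)
```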

For structural calculations, Euclidean distances were computed from the Cβ atom of the mutated residue to the Cβ atoms of residues participating in biologically relevant sites (glycogen-binding and carbohydrate-binding) in the protein structure. In particular, the average distances from the mutant site to the residues forming active site 1 (residues 282, 376, 404, 405, 441, 481, 516, 519, 600, 616, 618, 649, 650, 674), active site 2 (residues 91, 123, 126, 127) and the carbohydrate binding site (residues 140, 233, 390, 470, 652, 882, 925) were computed. Sidechain-sidechain and sidechain-mainchain hydrogen bonds were computed using PyMOL (DeLano, 2002).
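A minimal sketch of the average Cβ distance feature follows. The residue lists are those given above; the function name, the dictionary layout and the toy coordinates are assumptions for illustration and are not taken from PDB entry 5kzw.

```python
import math

# Residue numbers listed in the text above.
ACTIVE_SITE_1 = [282, 376, 404, 405, 441, 481, 516, 519, 600, 616, 618, 649, 650, 674]
ACTIVE_SITE_2 = [91, 123, 126, 127]
CARBOHYDRATE_SITE = [140, 233, 390, 470, 652, 882, 925]


def mean_cb_distance(cb_coords, mutant_residue, site_residues):
    """Average Euclidean distance (in Å) from the mutant Cβ to the Cβ atoms of a site.

    cb_coords maps residue number -> (x, y, z). Site residues without
    coordinates are skipped (an assumption of this sketch).
    """
    origin = cb_coords[mutant_residue]
    distances = [math.dist(origin, cb_coords[r]) for r in site_residues if r in cb_coords]
    if not distances:
        raise ValueError("no site residue has coordinates")
    return sum(distances) / len(distances)


# Toy coordinates, not real structural data
toy_coords = {
    91: (0.0, 0.0, 0.0),
    123: (3.0, 4.0, 0.0),
    126: (0.0, 6.0, 8.0),
    464: (0.0, 0.0, 0.0),
}
avg = mean_cb_distance(toy_coords, 464, ACTIVE_SITE_2)
```

Glycine has no Cβ atom; how such residues were handled is not stated in the text, so a real implementation would need an explicit choice (for example, falling back to Cα).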

Backbone torsional ϕ,ψ angles of an amino acid are influenced by the identity and geometry of its nearest-neighbor amino acids in the protein chain (Colubri et al., 2006; Jha et al., 2005). Another gene-specific feature, labeled tsp_rmsd, captured the perturbation in local backbone geometry by quantifying the root mean square deviation (RMSD) in the nearest-neighbor dependent backbone torsional angle (Ramachandran) distributions resulting from the variant. In previous work (Adhikari, Freed, & Sosnick, 2012), nearest-neighbor dependent Ramachandran (ϕ,ψ) distributions for every possible amino acid trimer were computed from a database of high resolution X-ray crystal PDB structures (resolution < 2.5 Å, homology below 90%) and discretized into 72 × 72 bins of 5° × 5°. Individual ϕ,ψ distributions, termed Ramachandran maps, were then generated for each amino acid position of interest in the protein chain, conditional on both the amino acid and its nearest neighbors.

The effect of the variant on the amino acid backbone torsional angles at the mutant site i was quantified as:

$$\mathrm{tsp\_rmsd}_i = \sum_{a=i-1}^{i+1} \mathrm{RMSD}\left(P^{(r)}_a, P^{(m)}_a\right)$$

where $P^{(r)}_a$ and $P^{(m)}_a$ refer to the nearest-neighbor dependent Ramachandran basin probability vectors (4 basins) at site a for the reference and mutant amino acids, respectively, and the sum runs over the mutant site i and its two neighbors. The 4-basin probability vectors were computed from the Ramachandran maps for the relevant trimer at the mutant site, by discretizing the maps and quantifying the occupancy frequency in the 4 relevant basins: confident α helix, remaining helical region, strand basin and the rest of the Ramachandran map.
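Because the Ramachandran maps are conditional on nearest neighbors, a substitution at site i changes the basin probabilities at sites i-1, i and i+1, which is why the score sums over three sites. The sketch below assumes that RMSD means the root mean square difference between the two 4-element basin vectors; the function names and the probability values are made up for illustration.

```python
import math


def rmsd(p, q):
    """Root mean square deviation between two equal-length probability vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / len(p))


def tsp_rmsd(ref_basins, mut_basins):
    """Sum the basin-probability RMSD over sites i-1, i and i+1.

    Each argument is a list of three 4-element vectors ordered as
    (confident alpha helix, remaining helical region, strand, rest).
    """
    return sum(rmsd(r, m) for r, m in zip(ref_basins, mut_basins))


# Made-up basin probabilities for the three sites, reference vs. mutant
reference = [[0.4, 0.1, 0.3, 0.2], [0.1, 0.1, 0.6, 0.2], [0.5, 0.1, 0.2, 0.2]]
mutant = [[0.5, 0.1, 0.2, 0.2], [0.5, 0.1, 0.2, 0.2], [0.5, 0.1, 0.2, 0.2]]
score = tsp_rmsd(reference, mutant)
```

A variant that leaves all three basin vectors unchanged scores zero, and larger shifts in basin occupancy give larger scores.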

From the wildtype human GAA protein structure, we also calculated residue-wise evolutionary conservation scores using the Consurf webserver that leverages phylogenetic relationships between homologous sequences (Ashkenazy et al., 2016; Landau et al., 2005). We used DSSP to compute the solvent accessibility of the reference amino acid in the protein structure (Kabsch & Sander, 1983). Finally, the location of the mutant site in one of the five GAA structural domains (1–136, 137–349, 350–726, 727–820, 821–952), as well as the sequence distance of the mutant to the nearest splice site in the GAA gene were calculated.
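The domain feature reduces to a lookup of the residue number against the five domain boundaries listed above. The following sketch uses those boundaries; the function name and the 1-based domain index are assumptions, since the encoding of this feature is not described in the text.

```python
from bisect import bisect_right

# First residue of each of the five structural domains quoted in the text above.
DOMAIN_STARTS = [1, 137, 350, 727, 821]
PROTEIN_LENGTH = 952


def gaa_domain(residue_number):
    """Return the 1-based domain index for a residue number in 1..952."""
    if not 1 <= residue_number <= PROTEIN_LENGTH:
        raise ValueError("residue number outside 1..952")
    return bisect_right(DOMAIN_STARTS, residue_number)
```

For instance, residues 464 and 690 both fall in the third domain (350–726) under this lookup.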

Models

Four different computational models were trained. Model AA_3 used predominantly gene-specific features, models AA_1 and AA_2 used subsets combining gene-specific and general features, and model AA_4 used all the available features. The features selected for each of the four models are provided in Supp. Table S2.

All models treated the prediction task as a regression problem and were thus designed to predict continuous % wildtype GAA enzyme activity values. AA_1, AA_2 and AA_3 used gradient boosted regression trees (GBRT) with 3-fold cross-validation for training the model. Random 90%, 70% and 90% subsets of the training data were used to train AA_1, AA_2 and AA_3, respectively. For each of the three models, GBRT was run 10 times on random subsets of the full training set and the resulting predictions on the CAGI5 test data were averaged to obtain the mean activity scores for that model. Model AA_4 used all the features and was trained using a single run of XGBoost (Chen & Guestrin, 2016) on the complete training set (full parameters for the XGBoost function are provided in Supp. Table S3). Additionally, Bayesian hyperparameter tuning was performed for AA_4 using 3-fold cross-validation, with BayesSearchCV from the scikit-optimize (skopt) package.
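The repeated-subset averaging described above could be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the submitted pipeline: using `GridSearchCV` with `cv=3` is one possible reading of "3-fold cross validation", and the parameter grid, sampling without replacement, the function name and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


def averaged_gbrt_predictions(X_train, y_train, X_test,
                              subset_fraction=0.9, n_runs=10, seed=0):
    """Average test-set predictions from GBRT models fit on random training subsets."""
    rng = np.random.default_rng(seed)
    n_subset = int(round(subset_fraction * len(X_train)))
    # Illustrative grid; the hyperparameters actually searched are not given in the text.
    param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
    run_predictions = []
    for run in range(n_runs):
        # Draw a random subset of the training data (without replacement, by assumption).
        idx = rng.choice(len(X_train), size=n_subset, replace=False)
        search = GridSearchCV(GradientBoostingRegressor(random_state=run),
                              param_grid, cv=3)
        search.fit(X_train[idx], y_train[idx])
        run_predictions.append(search.predict(X_test))
    # Mean predicted activity per test variant across runs.
    return np.mean(run_predictions, axis=0)


# Synthetic data standing in for the feature matrix and activity scores
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(80, 5))
y = np.clip(50 + 20 * X[:, 0] + data_rng.normal(scale=5, size=80), 0, 100)
mean_activity = averaged_gbrt_predictions(X[:60], y[:60], X[60:], n_runs=3)
```

Averaging over models fit on different subsets reduces the dependence of the final prediction on any single random draw of the training data.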

AA_1, AA_2 and AA_3 were submitted to the CAGI5 GAA challenge. The hyperparameter tuning for AA_4 did not complete before the submission deadline, so AA_4 was not part of the official submission. All models were implemented in Python using the scikit-learn package (http://scikit-learn.org).

Performance assessment

In total, predictions were made for 335 variants in the CAGI5 GAA test set. Since the challenge involved predicting the impact of the GAA variants on enzyme activity relative to the wild-type human GAA, and our prediction model was designed to report continuous values, we modeled it as a regression problem instead of binary classification. We calculated three different correlation measures between the continuous valued predictions and the experimental % wildtype enzyme activity values: Pearson correlation coefficient (R), Spearman rank correlation coefficient (ρ) and Kendall’s tau-b correlation coefficient (τ). Additionally, we computed the mean absolute error (MAE) of the predictions with respect to the experimental % wildtype enzyme activity values. For the various dbNSFP rankscores, higher values indicate greater predicted pathogenicity and were thus assumed to correspond to a worse impact on enzyme activity. Therefore, for these scores the absolute value of the correlation coefficient, rather than its sign, is the relevant metric.
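For reference, the four regression metrics named above can be written out in plain Python as in the sketch below. The function names and example values are hypothetical. In practice the equivalent SciPy routines (`scipy.stats.pearsonr`, `spearmanr` and `kendalltau`, whose default variant is tau-b) would normally be used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient R."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from all counts
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


def mean_absolute_error(pred, obs):
    """Mean absolute error, in the units of the inputs (% wildtype activity here)."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)


# Hypothetical predicted and experimental % wildtype activities
predicted = [10, 40, 35, 80, 95]
observed = [5, 30, 50, 70, 100]
r = pearson(predicted, observed)
rho = spearman(predicted, observed)
tau = kendall_tau_b(predicted, observed)
mae = mean_absolute_error(predicted, observed)
```

For a pathogenicity rankscore, where higher values are expected to track lower activity, `abs(pearson(scores, observed))` would be the quantity compared across tools.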

In clinical settings, computational methods are traditionally assessed on their ability to classify benign and pathogenic variants. Since the GAA enzyme activity strongly correlates with affected clinical status in Pompe disease, we also evaluated our predictions on their ability to classify the variants into putatively pathogenic and benign classes. Starting from continuous valued enzyme activity, binary classification requires choosing thresholds to define the two classes. In the severest forms, the residual GAA enzyme activity in patient fibroblasts can be virtually zero, but less severely affected infants can have residual activity in the range of 25–30%. Therefore, we chose two different % wildtype GAA activity thresholds for classification: 10% wild type enzyme activity and 25% wild type enzyme activity. The same corresponding thresholds were used on the predicted values from our models for binary classification assessment.
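Applying the same threshold to the experimental and predicted activities yields the confusion counts used in the binary assessment. The sketch below treats activity strictly below the threshold as the positive (putatively pathogenic) class; the function name and the example activities are hypothetical.

```python
def confusion_counts(observed, predicted, threshold):
    """Count TP, FN, TN, FP with 'positive' meaning activity below the threshold."""
    tp = fn = tn = fp = 0
    for obs, pred in zip(observed, predicted):
        actual_pos = obs < threshold
        predicted_pos = pred < threshold
        if actual_pos and predicted_pos:
            tp += 1
        elif actual_pos:
            fn += 1
        elif predicted_pos:
            fp += 1
        else:
            tn += 1
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}


# Hypothetical experimental and predicted % wildtype activities
observed = [0.5, 8.0, 23.0, 60.0, 99.0, 12.0]
predicted = [4.0, 30.0, 15.0, 70.0, 9.0, 40.0]
counts_10 = confusion_counts(observed, predicted, 10.0)
counts_25 = confusion_counts(observed, predicted, 25.0)
```

Raising the threshold from 10% to 25% moves variants with intermediate activity into the positive class, which changes both the class balance and the counts.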

For the chosen threshold for binary classes, true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP) were calculated. From these, the following metrics of binary assessment were computed as follows:

Accuracy (ACC) = (TP+TN)/(TP+TN+FP+FN)

True positive rate (TPR), or sensitivity = TP/(TP+FN)

True negative rate (TNR), or specificity = TN/(TN+FP)

Positive predictive value (PPV), or precision = TP/(TP+FP)

Negative predictive value (NPV) = TN/(TN+FN)

False positive rate (FPR) = FP/(FP+TN)

Matthews correlation coefficient (MCC) = ((TP×TN) − (FP×FN)) / ((TP+FN)×(TN+FP)×(TP+FP)×(TN+FN))^0.5

Two additional binary classification metrics are reported. The first is the balanced F score, also called the F1 score, which is the harmonic mean of precision and recall.

F1 = 2 * (precision * recall) / (precision + recall)

The second is the area under the receiver operating characteristic curve (ROC_AUC) derived from the predictions for the binary classes.
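The definitions above translate directly into code. The following sketch computes them from the four confusion counts and adds a rank-based ROC AUC; the function names and the example counts are hypothetical, and the AUC assumes that a higher score means "more likely positive", so predicted activities are negated before use.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Metrics as defined above; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "TPR": recall,
        "TNR": tn / (tn + fp),
        "PPV": precision,
        "NPV": tn / (tn + fn),
        "FPR": fp / (fp + tn),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
        "F1": 2 * precision * recall / (precision + recall),
    }


def roc_auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical confusion counts
m = binary_metrics(tp=20, fn=10, tn=60, fp=10)

# Negated predicted activities for hypothetical positives and negatives
auc = roc_auc([-4.0, -30.0], [-15.0, -70.0, -9.0, -40.0])
```

By these definitions FPR and TNR always sum to 1, which gives a quick consistency check on any computed set of metrics.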

Results

Overall performance (regression)

We first directly assessed the predicted continuous % wildtype GAA enzyme activity values for the GAA variants from the four models against the experimental % wildtype enzyme activity values made available by the challenge data providers. The overall performance metrics for the different models are shown in Table 1. Model AA_4, trained using all the features and XGBoost, had the best overall correlation coefficients (R=0.34, ρ=0.34, τ=0.24) among the four models. The model composed of primarily gene-specific features (AA_3) had correlation coefficients R, ρ, and τ of 0.25, 0.26 and 0.18, respectively.

Table 1:

Correlation of the continuous valued enzyme activity predictions with the experimental % wildtype GAA activity quantified using different metrics

R: Pearson correlation coefficient; ρ: Spearman rank correlation; τ: Kendall correlation coefficient; MAE: mean absolute error (% wildtype activity)

Model  R     ρ     τ     MAE
AA_1   0.32  0.34  0.24  45.67
AA_2   0.31  0.34  0.24  45.68
AA_3   0.25  0.26  0.18  45.89
AA_4   0.34  0.34  0.24  45.99

Direct comparisons to other existing methods may not be appropriate, as tools typically trained to separate clinically pathogenic variants from benign ones are not optimized for enzyme activity predictions. Yet mutants with lower enzyme activities should be clinically pathogenic. We therefore compared our models to two other recent meta-predictors that integrate several methods. Our model AA_4 had comparable or better correlation coefficients (R=0.34, ρ=0.34, τ=0.24) compared to REVEL (R=−0.31, ρ=−0.35, τ=−0.24) and MetaSVM (R=−0.23, ρ=−0.28, τ=−0.19). We also compared our predictions to MutPred, which had been successfully used for prediction of functional impact in previous CAGI challenges (Figure 1). The correlation scores for MutPred (R=−0.22, ρ=−0.26, τ=−0.18) with the experimental data were lower compared to our models and the meta-predictors. Removing GAA variants where MutPred scores were not available from dbNSFP did not significantly improve the correlations with the experimental data: for the 245 variants where MutPred scores were available, the R, ρ, τ coefficients improved slightly to −0.26, −0.30, and −0.21, respectively. An interesting observation was that AA_3, the model using primarily gene-specific features, was least correlated with all the existing tools, highlighting the potential novelty of information contained in the gene-specific features (Supp. Figure S1).

Figure 1:

Scatterplot matrix showing the correlation of one of our prediction models (AA_4) with some existing pathogenicity prediction tools (REVEL, MetaSVM, MutPred) as well as with the experimental % wildtype GAA mean activity (MeanActivity) from the CAGI5 GAA challenge.

Overall, the performance of our predictions as well as other existing tools on GAA was worse compared to another enzymatic assay challenge from the previous CAGI4. For N-acetyl-glucosaminidase (NAGLU) in CAGI4, the best performing method had a Pearson correlation coefficient of around 0.6, and several other methods had correlation coefficients above 0.4 (Katsonis & Lichtarge, 2017).

Overall performance (binary classification)

Next, we evaluated the ability of the models to distinguish putatively pathogenic GAA variants from benign ones. Based on the depletion of enzyme activity typically seen in severe and milder GAA patients, we chose two different enzyme activity thresholds (<10% of wildtype and <25% of wildtype) to define the putatively pathogenic mutants. When using the binary classification threshold of <10% wild type enzyme activity for the positive (putatively pathogenic) class, the ROC_AUC ranged from 0.57 to 0.63 and the overall accuracy ranged from 0.79 to 0.82 for the four models, with model AA_4 performing better than the rest. When choosing a higher threshold of <25% wildtype enzyme activity for putatively pathogenic variants, the accuracy decreased to a range of 0.66–0.69 across the models. At the same time, at this threshold, the sensitivity (true positive rate) increased from 0.21–0.33 to 0.38–0.5. Table 2 presents the other metrics of performance of the different models in their ability to classify putatively pathogenic GAA variants.

Characterizing well-predicted and poorly-predicted pathogenic variants

To better understand the correct and incorrect predictions of mutational effect, we investigated the variants in the context of the wildtype protein structure. In particular, we focused on putatively pathogenic variants (where the experimental enzymatic activity was <10% of the wild type), because the majority of the clinically severe cases of Pompe disease have enzyme activity below this threshold. Among these putatively pathogenic variants, those that were correctly predicted (i.e., <10% activity) by the primary model (AA_1) are shown in green and those that were incorrectly predicted (≥10% activity) are shown in red in the human GAA protein structure (Figure 2). Structural data were not available for the first 78 residues, so variants at those positions are not shown.

Figure 2:

Left: Location in the protein structure (PDB: 5kzw) of the human GAA variants from the CAGI5 GAA challenge with <10% experimental wildtype enzyme activity. Variants correctly predicted to have <10% wildtype activity by our model (AA_1) are shown in green and those that were not identified are shown in red. Right: For the different binary prediction outcome categories for classifying variants that result in <10% experimental wildtype enzyme activity, box plots showing the Cβ atom distances of the mutated amino acid site to the protein core.

We observed that the correctly predicted putatively pathogenic variants appeared more frequently in the core of the protein than those that were incorrectly predicted. For the putatively pathogenic variants, we computed the distance of the corresponding amino acid Cβ atoms from the protein center of mass. The true positive predictions (correctly predicted to have <10% wildtype activity) were closer to the core of the protein compared to the false negatives (incorrectly predicted to have ≥10% wildtype activity) (p<1.76e-03, two-sided Mann-Whitney-Wilcoxon test). That variants disruptive to the core of the protein are deleterious is a general property, which most existing methods that use protein structural features are likely trained to capture. This suggests that pathogenic variants on the surface of the protein, by contrast, may require additional context for accurate prediction.

Gene-specific features aid interpretation of some putatively pathogenic variants

We next explored if the gene-specific features in GAA could elucidate the origin of deleteriousness of some of the poorly predicted mutants. We first investigated what features best correlate with experimental data in mutants with low experimental wildtype enzyme activity (<10%) but where our computational model (AA_1) failed to classify them. For all the numeric features in our work, Figure 3 shows the absolute Pearson correlation coefficients of the different features with the experimental enzyme activity values for this subset of the mutants. Interestingly, in this subset, the features that best correlate with the experimental values are often gene-specific (shown in blue in Figure 3).

Figure 3:

Within variants resulting in <10% wildtype enzyme activity values in the experimental data, absolute Pearson correlation coefficient of the various features with the experimental values. Gene-specific features are shown in blue and general features are shown in green.

Discussion

Results from previous CAGI challenges had indicated that variant impact predictors trained to classify disease-causing variants could potentially be also directly applied for predicting impact of nonsynonymous variants in activity and function of proteins (e.g. NAGLU, CBS, p16), especially when the training of the predictors involved molecular and biochemical features (Pejaver, Mooney, & Radivojac, 2017). In the case of GAA enzymatic activity, it was surprising that most existing tools as well as our predictions performed poorly relative to previous CAGI enzyme functional impact prediction challenges.

Experimental variability

One potential issue could be the variability in experimental measurements of enzymatic activity. For 17 variants in our training set where we gathered enzyme activities from previous studies, Figure 4 shows the corresponding % wildtype enzyme activity values in the experimental data from the CAGI5 GAA challenge. While the activity values generally agree, the % wildtype enzyme activities in the CAGI5 GAA challenge were frequently lower for the same variants than values previously reported in the literature. Some experimental noise is always expected, but it is unclear what factors account for the variants where the disagreements are substantial. For example, previous studies reported a % wildtype enzyme activity of 115 for the NP_000143.2:p.R11Q variant in GAA, and therefore classified it as non-pathogenic (Kroos et al., 2008). The same variant has an average % wildtype activity of 23.42±4.41 in the CAGI5 challenge answer key. Conversely, another variant, D91N, was found to lower GAA’s affinity for glycogen in previous studies, yet in the CAGI5 experimental data the variant had near wildtype % enzyme activity (99.55±1.72). These differences could be important and should be taken into consideration when evaluating and benchmarking computational models of mutational impact.

Figure 4:

For variants where prior enzyme activity data were available (in our training set), the corresponding experimental % wildtype enzyme activity values in the CAGI5 GAA challenge dataset are plotted.

Poorly-predicted vs well-predicted variants

While experimental variability may be a contributing factor to the poor performance of the predictions relative to prior enzyme activity prediction challenges, it is more likely that the computational models failed to systematically capture some important biochemical aspects of the GAA enzyme. An interesting observation was that most of the poorly predicted enzyme activity lowering variants occurred near the surface of the protein. Therefore, one possible explanation could be that some of these surface variants participate in important interactions necessary for enzymatic activity. While general features used in existing tools could well capture information regarding variants that disrupt the core of proteins, pathogenicity arising from damaging variants on the surface of proteins is perhaps not as well characterized. Important gene- or biological pathway-specific signals may be lost when pathogenicity prediction methods train globally across all genes, an approach that often ignores the specific biological context of individual genes.

Gene-specific models for hypothesis generation regarding origin of pathogenicity

Discussions of computational tools typically focus on their predictive performance, which is clearly important. Computational models can also reveal novel aspects of disease etiology and help generate new hypotheses. We present two examples where a deeper exploration of poorly predicted variants revealed gene-specific properties that could potentially explain the origin of pathogenicity. One of the features that we incorporated in our models was the change in local protein backbone torsional angle propensities resulting from the variants. The geometry adopted by an amino acid in a protein chain is highly constrained by its immediately neighboring amino acids. Such neighbor-dependent torsional angle (Ramachandran) distributions have previously been used successfully for prediction of protein folding pathways and tertiary structures (Adhikari et al., 2012; Colubri et al., 2006; DeBartolo et al., 2009). One of the poorly predicted activity-lowering GAA variants was NP_000143.2:p.Pro690Leu (% experimental wildtype activity: 0.53 ± 0.37). Proline residues often disrupt secondary structures like α helices and are frequently found at the start of α helices in protein structures (Aurora & Rose, 1998; Richardson, 1981). Pro690 is one such residue, located at the start of an α helix (Supp. Figure S2, left). Supp. Figure S2 (right) shows the change in the nearest-neighbor dependent Ramachandran distributions of Pro690 and the neighboring Glu689 and Ala691 residues resulting from the NP_000143.2:p.Pro690Leu variant. As expected, replacing Proline with Leucine at the helix capping position leads to a dramatic increase in the helical propensity of all three amino acids. Thus, one hypothesis for the origin of pathogenicity of this variant is that mutating this rigidity-conferring Proline to Leucine severely disrupts the local geometry of the protein chain.
In this example, the specific context in which the Proline appears in the protein structure could be important information. Further computational modeling using molecular dynamics, as well as experimental validation, would be required to confirm the origin of pathogenicity. Nonetheless, such contextual information is generally lacking in existing pathogenicity prediction tools.
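
The idea of a change in helical propensity can be illustrated with a minimal sketch. It is not the feature calculation used in this study: the helical basin boundaries, the function names, and the toy (phi, psi) samples are all assumptions chosen for illustration. The sketch estimates the fraction of backbone conformations that fall in a rough helical region for a wildtype and a variant sequence context, and reports the difference.

```python
from typing import Iterable, List, Tuple

Angles = Tuple[float, float]  # (phi, psi) in degrees


def is_helical(phi: float, psi: float) -> bool:
    """Rough alpha-helical basin of the Ramachandran map; bounds are an assumption."""
    return -160.0 <= phi <= -20.0 and -120.0 <= psi <= 50.0


def helical_propensity(samples: Iterable[Angles]) -> float:
    """Fraction of (phi, psi) samples that fall inside the helical basin."""
    sample_list: List[Angles] = list(samples)
    if not sample_list:
        raise ValueError("need at least one (phi, psi) sample")
    helical = sum(1 for phi, psi in sample_list if is_helical(phi, psi))
    return helical / len(sample_list)


def propensity_shift(wildtype: Iterable[Angles], variant: Iterable[Angles]) -> float:
    """Variant minus wildtype helical propensity; positive means more helical."""
    return helical_propensity(variant) - helical_propensity(wildtype)


# Toy samples, NOT real data: conformations one might draw from a
# neighbor-dependent library for each sequence context.
toy_wildtype = [(-65.0, 145.0), (-70.0, 150.0), (-60.0, -40.0), (-75.0, 160.0)]
toy_variant = [(-60.0, -45.0), (-65.0, -40.0), (-70.0, 150.0), (-58.0, -47.0)]

print(f"helical propensity shift: {propensity_shift(toy_wildtype, toy_variant):+.2f}")
```

In practice the samples would come from a nearest-neighbor dependent coil library of the kind cited above, and the shift would be computed for the mutated residue and each flanking residue.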

Another example of gene-specific context providing clues to the origin of pathogenicity is the NP_000143.2:p.Arg464Ser variant (% experimental wildtype activity: 2.54 ± 1.52). In the wildtype GAA protein, the NH1 atom of the Arg464 sidechain forms a hydrogen bond with the OD2 atom of the Asp501 sidechain (Figure 5). The NP_000143.2:p.Arg464Ser variant results in the loss of this hydrogen bond because the Serine sidechain does not provide the hydrogen bonding partner for Asp501. Even though this variant is on the surface of the protein, this specific disruption of a sidechain-sidechain hydrogen bond could play a role in the pathogenicity of this particular variant.
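
A simple way to screen for such contacts is a distance test between the two heavy atoms. The sketch below is illustrative only: the 3.5 Å cutoff is a commonly used heuristic assumed here, the coordinates are hypothetical and not taken from the GAA structure, and a distance-only criterion ignores bond angles.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # Cartesian coordinates in angstroms


def is_hydrogen_bond(donor: Point, acceptor: Point, cutoff: float = 3.5) -> bool:
    """Distance-only test between two heavy atoms; the cutoff is an assumed heuristic."""
    return math.dist(donor, acceptor) <= cutoff


# Hypothetical coordinates, NOT taken from the GAA crystal structure.
arg464_nh1: Point = (0.0, 0.0, 0.0)   # wildtype Arg464 sidechain nitrogen
asp501_od2: Point = (2.9, 0.0, 0.0)   # Asp501 sidechain oxygen
ser464_og: Point = (-3.5, 1.0, 0.0)   # shorter Ser464 sidechain oxygen in the variant

wildtype_bond = is_hydrogen_bond(arg464_nh1, asp501_od2)
variant_bond = is_hydrogen_bond(ser464_og, asp501_od2)
print(f"wildtype contact: {wildtype_bond}, variant contact: {variant_bond}")
```

With real data, the coordinates would be read from the wildtype structure and from a modeled variant structure, and an angular criterion would normally be added.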

Figure 5:

A sidechain-sidechain hydrogen bond is disrupted by a GAA variant. In the wildtype GAA protein, the NH1 atom of the Arg464 sidechain forms a hydrogen bond with the OD2 atom of Asp501. The NP_000143.2:p.Arg464Ser variant results in the loss of the hydrogen bonding partner for the Asp501 sidechain.

Both are examples of severe pathogenic variants near the surface of the protein that cause a severe form of the disease, but whose interpretation likely requires specific information not captured by the general features used to train most variant interpretation tools.

Limitations of gene-specific models

Incorporating specific biological context in computational variant effect prediction tools is useful. However, one of the main challenges facing gene-specific computational approaches is the lack of large mutational datasets on individual genes. There are well-maintained locus-specific mutational databases for some genes, but most monogenic diseases are rare, and it is unclear when and if most genes will have the sizable disease-associated variant databases required for individually customized yet statistically robust prediction tools. One potential way forward is to perform high-throughput mutational scans in disease-relevant human genes and probe mutational consequences with a wide array of functional assays (Fowler & Fields, 2014; Starita et al., 2015). Recent CAGI 5 challenges (TPMT and PTEN) are starting to address this gap (Matreyek et al., 2018), but some limitations remain. As observed in this study, results from the same functional assay can differ across experiments. Furthermore, designing a functional assay that is most relevant for understanding disease etiology can be challenging. Finally, even when the functional assay is relevant, intermediate biochemical phenotypes may not always correlate fully with the clinical phenotype.

Other confounding factors

One important consideration for Pompe disease is that, because the disease is recessive, an affected individual’s overall in vivo GAA enzymatic activity is determined by a combination of two alleles (diplotype) rather than a single allele (haplotype). In a diplotypic context, the effect of a severe allele can be lessened by the presence of a milder allele, as may occur in compound heterozygous patients (Reuser et al., 2014). In fact, variants occurring on the same allele as the pathogenic GAA variant can also modulate the severity and onset of Pompe disease through various combinations (Yang et al., 2011). The wider genetic and epigenetic background could also modulate activity levels in vivo. Modifying genes, like ACE and ACN3, have been suggested as potentially linked with earlier onset (de Filippi et al., 2010). Multiple forms of the human GAA enzyme have also been observed in different populations. The GAA2 form, characterized by the NP_000143.2:p.Asp91Asn variant, has lower enzyme affinity for glycogen (Beratis, LaBadie, & Hirschhorn, 1980; Swallow et al., 1989). Another form of the GAA enzyme, termed GAA4 and characterized by the NP_000143.2:p.Glu689Lys variant, is especially prevalent in Asian populations (allele frequency of 0.29). The variable prevalence of such factors across different ancestral groups strongly motivates sampling across a range of populations in sequencing studies of the genetics of human diseases.

Lessons from the CAGI5 GAA challenge

It is known that, due to interdependent relationships among underlying features and training data, variant effect prediction tools frequently correlate with each other. An important lesson from the CAGI 5 GAA challenge was that concordance among computational tools does not necessarily translate to more accurate predictions. We observed that the computational tools performed poorly in predicting GAA enzyme activities compared to similar challenges from previous CAGI experiments. Therefore, another important lesson from this challenge is that one cannot assume that a tool that performs well for a given gene will apply readily to other genes. Large benchmarks will continue to be needed to validate predictions from tools across a variety of mutational assays and genes. The interdependence among existing tools also points to the need for novel approaches to variant interpretation.
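
The distinction between concordance and accuracy can be made concrete with a small sketch. Spearman rank correlation is used here as one reasonable measure and is an assumption, as are the function names and the toy scores, which are not data from the challenge. Two hypothetical tools rank five variants identically, yet neither ranking tracks the hypothetical measured activities.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values, NOT challenge data: predicted and measured fraction of wildtype activity.
tool_a = [0.9, 0.8, 0.3, 0.2, 0.1]
tool_b = [0.95, 0.7, 0.4, 0.25, 0.05]
measured = [0.2, 0.9, 0.1, 0.8, 0.5]

print(f"tool A vs tool B:   {spearman(tool_a, tool_b):+.2f}")
print(f"tool A vs measured: {spearman(tool_a, measured):+.2f}")
```

Comparing each tool against the assay, rather than against other tools, is what separates shared bias from genuine predictive signal.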

Gene-specific features may provide information orthogonal to that already contained in existing prediction tools. Gene-specific explorations can also help identify new features that may generalize beyond the particular gene (e.g. helix capping prolines are general features of protein structures, and could be incorporated into existing pathogenicity predictors that already use some protein structural information). The strength of existing global predictors of mutational effects is that they train on large mutational datasets and often use powerful machine learning techniques to capture general aspects of mutational pathogenicity mechanisms. Their overall improvement, however, likely comes at the cost of diluting important signals specific to certain genes, gene families and diseases. In contrast, single-gene and gene-family-specific models can incorporate the specific biological context into pathogenicity predictions. However, these methods alone are often data-limited and rarely leverage the large clinical and genetic datasets available across a broad range of diseases. A reasonable way forward may therefore be to explore ways to integrate gene-specific and general models so that both the accuracy and the resolution of the prediction for a disease in question can be improved. Here, we demonstrated one such approach focusing on the GAA enzyme in Pompe disease. Future efforts at pathogenicity prediction in specific diseases may benefit from such hybrid approaches that leverage both general and gene-specific features to predict mutational effects.

Supplementary Material

supp info

Acknowledgments

We thank the CAGI5 organizers, data providers and assessors. In particular, we thank BioMarin for providing the enzymatic assay data for the GAA challenge. We also thank the Brenner lab members for helpful discussions. The CAGI experiment coordination is supported by NIH U41 HG007446 and the CAGI conference by NIH R13 HG006650. ANA was a postdoctoral scholar at University of California, Berkeley supported by NIH U19 HD077627 when this work was performed.

Funding Information: NIH R13 HG006650, NIH U19 HD077627 Submission for CAGI5 special issue

References

  1. Adhikari AN, Freed KF, & Sosnick TR (2012). De novo prediction of protein folding pathways and structure using the principle of sequential stabilization. Proc Natl Acad Sci U S A, 109(43), 17442–17447. doi: 10.1073/pnas.1209000109
  2. Adzhubei I, Jordan DM, & Sunyaev SR (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet, Chapter 7, Unit7 20. doi: 10.1002/0471142905.hg0720s76
  3. Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, & Ben-Tal N. (2016). ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res, 44(W1), W344–350. doi: 10.1093/nar/gkw408
  4. Aurora R, & Rose GD (1998). Helix capping. Protein Sci, 7(1), 21–38. doi: 10.1002/pro.5560070103
  5. Beratis NG, LaBadie GU, & Hirschhorn K. (1980). An isozyme of acid alpha-glucosidase with reduced catalytic activity for glycogen. Am J Hum Genet, 32(2), 137–149.
  6. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, … Bourne PE (2000). The Protein Data Bank. Nucleic Acids Res, 28(1), 235–242.
  7. Betts MJ, Lu Q, Jiang Y, Drusko A, Wichmann O, Utz M, … Russell RB (2015). Mechismo: predicting the mechanistic impact of mutations and modifications on molecular interactions. Nucleic Acids Res, 43(2), e10. doi: 10.1093/nar/gku1094
  8. Carter H, Douville C, Stenson PD, Cooper DN, & Karchin R. (2013). Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics, 14 Suppl 3, S3. doi: 10.1186/1471-2164-14-S3-S3
  9. Chen Tianqi, & Guestrin Carlos. (2016). XGBoost. 785–794. doi: 10.1145/2939672.2939785
  10. Colubri A, Jha AK, Shen MY, Sali A, Berry RS, Sosnick TR, & Freed KF (2006). Minimalist representations and the importance of nearest neighbor effects in protein folding simulations. J Mol Biol, 363(4), 835–857. doi: 10.1016/j.jmb.2006.08.035
  11. de Filippi P, Ravaglia S, Bembi B, Costa A, Moglia A, Piccolo G, … Danesino C. (2010). The angiotensin-converting enzyme insertion/deletion polymorphism modifies the clinical outcome in patients with Pompe disease. Genet Med, 12(4), 206–211. doi: 10.1097/GIM.0b013e3181d2900e
  12. DeBartolo J, Colubri A, Jha AK, Fitzgerald JE, Freed KF, & Sosnick TR (2009). Mimicking the folding pathway to improve homology-free protein structure prediction. Proc Natl Acad Sci U S A, 106(10), 3734–3739. doi: 10.1073/pnas.0811363106
  13. DeLano Warren L. (2002). The PyMOL molecular graphics system. http://www.pymol.org.
  14. Deming Derrick, Lee Karen, McSherry Tracey, Wei Ronnie R., Edmunds Tim, & Garman Scott C. (2017). The molecular basis for Pompe disease revealed by the structure of human acid α-glucosidase. doi: 10.1101/212837
  15. Dong C, Wei P, Jian X, Gibbs R, Boerwinkle E, Wang K, & Liu X. (2015). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet, 24(8), 2125–2137. doi: 10.1093/hmg/ddu733
  16. Fowler DM, & Fields S. (2014). Deep mutational scanning: a new style of protein science. Nat Methods, 11(8), 801–807. doi: 10.1038/nmeth.3027
  17. GAA_CAGI5_Challenge. (accessed Jan 2019). https://genomeinterpretation.org/content/GAA.
  18. Grimm Dominik G., Azencott Chloé-Agathe, Aicheler Fabian, Gieraths Udo, MacArthur Daniel G., Samocha Kaitlin E., … Borgwardt Karsten M. (2015). The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity. Human Mutation, 36(5), 513–523. doi: 10.1002/humu.22768
  19. Hoskins RA, Repo S, Barsky D, Andreoletti G, Moult J, & Brenner SE (2017). Reports from CAGI: The Critical Assessment of Genome Interpretation. Hum Mutat, 38(9), 1039–1041. doi: 10.1002/humu.23290
  20. Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, … Sieh W. (2016). REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 99(4), 877–885. doi: 10.1016/j.ajhg.2016.08.016
  21. Jha AK, Colubri A, Zaman MH, Koide S, Sosnick TR, & Freed KF (2005). Helix, sheet, and polyproline II frequencies and strong nearest neighbor effects in a restricted coil library. Biochemistry, 44(28), 9691–9702. doi: 10.1021/bi0474822
  22. Kabsch W, & Sander C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12), 2577–2637. doi: 10.1002/bip.360221211
  23. Katsonis P, & Lichtarge O. (2017). Objective assessment of the evolutionary action equation for the fitness effect of missense mutations across CAGI-blinded contests. Hum Mutat, 38(9), 1072–1084. doi: 10.1002/humu.23266
  24. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, & Shendure J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet, 46(3), 310–315. doi: 10.1038/ng.2892
  25. Kroos M, Hoogeveen-Westerveld M, Michelakakis H, Pomponio R, Van der Ploeg A, Halley D, … Consortium, G. A. A. Database. (2012). Update of the pompe disease mutation database with 60 novel GAA sequence variants and additional studies on the functional effect of 34 previously reported variants. Hum Mutat, 33(8), 1161–1165. doi: 10.1002/humu.22108
  26. Kroos M, Pomponio RJ, van Vliet L, Palmer RE, Phipps M, Van der Helm R, … Consortium, G. A. A. Database. (2008). Update of the Pompe disease mutation database with 107 sequence variants and a format for severity rating. Hum Mutat, 29(6), E13–26. doi: 10.1002/humu.20745
  27. Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, & Ben-Tal N. (2005). ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res, 33(Web Server issue), W299–302. doi: 10.1093/nar/gki370
  28. Laurila K, & Vihinen M. (2011). PROlocalizer: integrated web service for protein subcellular localization prediction. Amino Acids, 40(3), 975–980. doi: 10.1007/s00726-010-0724-y
  29. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, … Exome Aggregation, Consortium. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285–291. doi: 10.1038/nature19057
  30. Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, … Radivojac P. (2009). Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics, 25(21), 2744–2750. doi: 10.1093/bioinformatics/btp528
  31. Liu X, Wu C, Li C, & Boerwinkle E. (2016). dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Hum Mutat, 37(3), 235–241. doi: 10.1002/humu.22932
  32. Masica DL, Sosnay PR, Cutting GR, & Karchin R. (2012). Phenotype-optimized sequence ensembles substantially improve prediction of disease-causing mutation in cystic fibrosis. Hum Mutat, 33(8), 1267–1274. doi: 10.1002/humu.22110
  33. Matreyek KA, Starita LM, Stephany JJ, Martin B, Chiasson MA, Gray VE, … Fowler DM (2018). Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat Genet, 50(6), 874–882. doi: 10.1038/s41588-018-0122-z
  34. Pejaver V, Mooney SD, & Radivojac P. (2017). Missense variant pathogenicity predictors generalize well across a range of function-specific prediction challenges. Hum Mutat, 38(9), 1092–1108. doi: 10.1002/humu.23258
  35. Pompe_Center_mutation_database. (2019). Erasmus MC Pompe Center mutation database (http://cluster15.erasmusmc.nl/klgn/pompe/mutations.html?lang=en). January 2019, from http://cluster15.erasmusmc.nl/klgn/pompe/mutations.html?lang=en
  36. Quan L, Lv Q, & Zhang Y. (2016). STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics, 32(19), 2936–2946. doi: 10.1093/bioinformatics/btw361
  37. Reuser Arnold J. J., Hirschhorn Rochelle, & Kroos Marian A. (2014). Pompe Disease: Glycogen Storage Disease Type II, Acid α-Glucosidase (Acid Maltase) Deficiency In Beaudet AL, Vogelstein B, Kinzler KW, Antonarakis SE, Ballabio A, Gibson KM & Mitchell G. (Eds.), The Online Metabolic and Molecular Bases of Inherited Disease. New York, NY: The McGraw-Hill Companies, Inc.
  38. Reva B, Antipin Y, & Sander C. (2011). Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res, 39(17), e118. doi: 10.1093/nar/gkr407
  39. Richardson JS (1981). The anatomy and taxonomy of protein structure. Adv Protein Chem, 34, 167–339.
  40. Sim NL, Kumar P, Hu J, Henikoff S, Schneider G, & Ng PC (2012). SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res, 40(Web Server issue), W452–457. doi: 10.1093/nar/gks539
  41. Starita LM, Young DL, Islam M, Kitzman JO, Gullingsrud J, Hause RJ, … Fields S. (2015). Massively Parallel Functional Analysis of BRCA1 RING Domain Variants. Genetics, 200(2), 413–422. doi: 10.1534/genetics.115.175802
  42. Swallow DM, Kroos M, Van der Ploeg AT, Griffiths B, Islam I, Marenah CB, & Reuser AJ (1989). An investigation of the properties and possible clinical significance of the lysosomal alpha-glucosidase GAA*2 allele. Ann Hum Genet, 53(2), 177–184.
  43. Torkamani A, & Schork NJ (2007). Accurate prediction of deleterious protein kinase polymorphisms. Bioinformatics, 23(21), 2918–2925. doi: 10.1093/bioinformatics/btm437
  44. Yang CC, Chien YH, Lee NC, Chiang SC, Lin SP, Kuo YT, … Hwu WL (2011). Rapid progressive course of later-onset Pompe disease in Chinese patients. Mol Genet Metab, 104(3), 284–288. doi: 10.1016/j.ymgme.2011.06.010
  45. Yeo Gene, & Burge Christopher B. (2004). Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals. Journal of Computational Biology, 11(2–3), 377–394. doi: 10.1089/1066527041410418
