PLOS One. 2024 May 17;19(5):e0303787. doi: 10.1371/journal.pone.0303787

WilsonGenAI a deep learning approach to classify pathogenic variants in Wilson Disease

Aastha Vatsyayan 1,2, Mukesh Kumar 1,2, Bhaskar Jyoti Saikia 1,2, Vinod Scaria 1,2,¤,*, Binukumar B K 1,2,*
Editor: Muhammad Salman Bashir 3
PMCID: PMC11101024  PMID: 38758754

Abstract

Background

Advances in Next Generation Sequencing have made rapid variant discovery and detection widely accessible. To facilitate a better understanding of the nature of these variants, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) have issued a set of guidelines for variant classification. However, given the vast number of variants associated with any disorder, it is impossible to manually apply these guidelines to all known variants. Machine learning methodologies offer a rapid way to classify large numbers of variants, including variants of uncertain significance, as either pathogenic or benign. Here we classify ATP7B genetic variants by employing ML and AI algorithms trained on our well-annotated WilsonGen dataset.

Methods

We have trained and validated two algorithms, TabNet and XGBoost, on a high-confidence dataset of manually annotated, ACMG & AMP classified variants of the ATP7B gene associated with Wilson’s Disease.

Results

Using an independent validation dataset of ACMG & AMP classified variants, as well as a patient set of functionally validated variants, we showed how both algorithms perform and can be used to classify large numbers of variants in clinical as well as research settings.

Conclusion

We have created a ready-to-deploy tool that can classify variants linked with Wilson’s Disease as pathogenic or benign. It can be used by both clinicians and researchers to better understand the disease through the nature of the genetic variants associated with it.

Introduction

Wilson’s Disease (WD) is a rare autosomal recessive disorder characterized by the presence of pathogenic mutations in the copper-transporting ATP7B gene. Located on chromosome 13q14.2, ATP7B spans 21 exons, encoding a 1465-amino-acid copper-transporting ATPase [1]. Altered gene function in WD results in copper accumulation in the liver and brain, leading to impaired functions and movement disorders. WD patients exhibit pathogenic mutations causing reduced serum holo-ceruloplasmin production. Excessive copper deposition induces oxidative stress, contributing to clinical problems like cirrhosis and fulminant hepatitis. Neurological complications arise from copper deposits in specific brain regions, leading to movement disorders and associated symptoms [2]. This complex interplay of genetic factors and copper metabolism underscores the multisystemic impact of WD.

WD is an underdiagnosed and treatable genetic condition with an estimated worldwide prevalence of around 13.9 per 100,000, derived from known pathogenic variants [3]. Several recent publications have highlighted an estimated carrier frequency of 1 in 90 individuals [4–6]. The known prevalence and carrier frequency of WD, however, are confined to a few specific populations [7, 8], while in large populations like India they remain unexplored. This opens up a unique opportunity to understand the genetic architecture of the disease in populations rich in genetic diversity, such as India.

The recent availability of a framework for interpreting the pathogenicity of genetic variants, put forward by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG & AMP), has opened up a unique opportunity to create a standardised system for the interpretation of genetic variants for clinical diagnosis and genetic counselling. To assist in a better understanding of variant pathogenicity, our group has recently put together one of the most comprehensive collections of genetic variants classified according to the ACMG & AMP guidelines [9], in the form of the WilsonGen database. WilsonGen is a robust compilation of all publicly reported ATP7B variants, exhaustively collected from the literature and across 9 large databases [10], making it the largest and most comprehensive database of its kind, to the best of our knowledge.

Classification according to the ACMG & AMP guidelines is time-consuming and at times limited by the literature and experimental evidence available to confirm pathogenicity. As a result, a number of variants remain classified as variants of uncertain significance (VUS). This significantly impacts the ability to classify variants, especially those from unique population groups and rare variants identified from patient cohorts.

The advent of machine learning approaches in clinical medicine has accelerated the ability to analyse and interpret medical data, and these approaches have been used extensively in a number of scenarios, including the rapid classification of large numbers of variants. The widespread application of such approaches in genomics, however, has been limited by the lack of gold-standard datasets for training. The availability of the WilsonGen database thus provides a unique opportunity in this respect.

Here, we describe a machine learning approach trained on a gold-standard ACMG-classified variant dataset for pathogenicity in the ATP7B gene for accurate classification of variants. We also use the approach for reclassification of VUS variants in public datasets so as to enable quick variant interpretation in clinical and research settings. To the best of our knowledge, ours is the only approach based on a manually ACMG-classified dataset, dedicated specifically to WD variants. A public implementation of the algorithm is available at: https://github.com/aastha-v/WilsonGenAI.

Materials and methods

Datasets

The variants and their pathogenicity ascertained according to the ACMG and AMP guidelines and available in the WilsonGen database were taken up for analysis. This dataset contained a total of 1458 genetic variants manually classified according to the ACMG & AMP guidelines. Non-exonic variants were removed due to lack of sufficient training data, as were VUS variants. This resulted in a variant dataset of 723 unique variants, out of which 410 were annotated as pathogenic, 167 as likely pathogenic, 9 as benign and 137 as likely benign. Fig 1 offers an overview of our entire workflow.

Fig 1. The overview of the workflow followed for model development and variant classification with TabNet and XGBoost models.

Variant parameters

The variant VCF was run through the ANNOVAR [11] tool, which annotated the variants with allele frequencies (AF) from three global population and subpopulation datasets: GnomAD [12], 1000Genomes [13] and GME [14]. We further added the position of the first amino acid change for each variant as “Start_pro”, which offers positional data alongside “Start”, which marks the position of the nucleotide change. We also added two categorical attributes: “Pfam_imp_domain”, which indicates whether the variant overlaps with an important protein domain based on the Pfam database, and “LoF_HC_Canonical”, which indicates whether the variant is a high-confidence loss-of-function variant present in a canonical transcript. The exonic function (e.g. frameshift insertion/deletion, stopgain/startloss, etc.) was encoded into numbers. Feature selection was performed manually: all features whose proportion of missing values exceeded 80% for each class were removed. Finally, all pathogenic/likely pathogenic variants were encoded as “1” and all benign/likely benign variants as “0”. A total of 73 attributes were thus obtained and are detailed in S1 Table.
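The label encoding and the missing-value filter described above can be sketched in plain Python. This is a minimal illustration, not the authors' preprocessing pipeline: the column names `ACMG_class` and `GME_AF`, the toy records, and the reading of "for each class" as "in every class" are all assumptions.

```python
from typing import Dict, List, Optional

# A record maps column names to values; None marks a missing annotation.
Record = Dict[str, Optional[object]]

# Binary encoding described in the text: P/LP -> 1, B/LB -> 0.
LABEL_MAP = {
    "pathogenic": 1,
    "likely pathogenic": 1,
    "benign": 0,
    "likely benign": 0,
}


def encode_label(acmg_class: str) -> int:
    """Map an ACMG class name to the binary training label."""
    return LABEL_MAP[acmg_class.strip().lower()]


def missing_fraction(rows: List[Record], feature: str) -> float:
    """Fraction of rows in which `feature` is missing (None or absent)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(feature) is None) / len(rows)


def features_to_drop(rows: List[Record], features: List[str],
                     label_key: str = "ACMG_class",  # hypothetical column name
                     threshold: float = 0.80) -> List[str]:
    """Features whose missing fraction exceeds `threshold` in every class
    (assumed interpretation of the selection rule)."""
    by_class: Dict[int, List[Record]] = {0: [], 1: []}
    for r in rows:
        by_class[encode_label(r[label_key])].append(r)
    return [
        f for f in features
        if all(missing_fraction(by_class[c], f) > threshold for c in by_class)
    ]


# Toy records (hypothetical values, for illustration only).
toy_rows: List[Record] = [
    {"ACMG_class": "Pathogenic", "REVEL": 0.91, "GME_AF": None},
    {"ACMG_class": "Likely pathogenic", "REVEL": 0.80, "GME_AF": None},
    {"ACMG_class": "Benign", "REVEL": 0.05, "GME_AF": None},
    {"ACMG_class": "Likely benign", "REVEL": None, "GME_AF": None},
]
dropped = features_to_drop(toy_rows, ["REVEL", "GME_AF"])
```

In this toy input `GME_AF` is missing everywhere and is dropped, while `REVEL` is missing in only half of the benign rows and is kept.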

AI models

For our analysis, we considered two state-of-the-art deep and machine learning models, namely TabNet and XGBoost, to train on the ACMG-classified gold standard dataset.

We had previously utilized the Weka suite [15] to test the performance of several algorithms, including NaiveBayes, SMO, J48, and RandomForest, on the dataset available to us then, comprising 725 variants split into 70% training and 30% test sets. Traditionally, tree ensemble models are recommended for classification and regression problems with tabular data [16]. Our results were in concordance, since RandomForest and J48 outperformed the others. We thus chose to work with the XGBoost classifier, one of the most widely used gradient-boosted decision tree implementations, especially for tabular datasets. XGBoost is reported to perform faster and better than other models such as RandomForest on missing data and on class-imbalanced datasets. It also has in-built regularization, which helps prevent the overfitting that models like RandomForest can be prone to. The XGBoost algorithm builds decision trees sequentially, with each new tree fitted to correct the errors made by the trees before it. The algorithm has been designed to handle sparse data effectively, which mirrors real-world situations where data is often missing or contains frequently repeating values.

Additionally, we chose to utilize the novel deep learning neural network TabNet [17], which was specially created for tabular datasets. TabNet has been reported to outperform tree methods, including XGBoost, on certain tabular datasets [18]. Unlike other deep learning models, TabNet mimics the learning of decision trees through its feature-transformer and attentive-transformer blocks, enabling the model to quickly decipher complex data patterns. TabNet uses sequential attention to choose features at each decision step. Feature selection is done instance-wise, i.e. it can be different for each variant in the training dataset. To the best of our knowledge, this is the first implementation of TabNet for the classification of variants based on their pathogenicity.
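To make the idea of instance-wise feature selection more concrete, the sketch below implements sparsemax, the sparse normalisation that TabNet's attention masks use by default. Unlike softmax, sparsemax can assign exactly zero weight to some features, which is what lets a decision step "choose" a subset of them. This is a stand-alone illustration with made-up scores, not code from the TabNet or WilsonGenAI implementations.

```python
from typing import List


def sparsemax(scores: List[float]) -> List[float]:
    """Project `scores` onto the probability simplex (sparsemax).

    The result is non-negative, sums to 1, and can contain exact zeros.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    z_sorted = sorted(scores, reverse=True)
    cumulative = 0.0
    k_support = 0       # number of entries that keep non-zero weight
    sum_support = 0.0   # sum of the k_support largest scores
    for k, z in enumerate(z_sorted, start=1):
        cumulative += z
        # Entry k stays in the support while 1 + k * z_(k) > sum_{j<=k} z_(j).
        if 1.0 + k * z > cumulative:
            k_support = k
            sum_support = cumulative
    tau = (sum_support - 1.0) / k_support  # threshold subtracted from each score
    return [max(z - tau, 0.0) for z in scores]


# Hypothetical attention scores for three features of one variant.
mask = sparsemax([1.0, 0.8, 0.1])  # the third feature receives zero weight
```

With these scores the first two features share all of the weight and the third is masked out entirely, whereas a softmax over the same scores would leave every feature with some non-zero weight.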

Since the performance of the two models with respect to each other seems to vary based on datasets used [16], we decided to include results from both models for assessment.

Hyperparameter selection and cross validation

Our models were run with different input parameters until convergence. The best performing model by accuracy was taken up. The PyTorch [19] implementation of Google’s TabNet was used for model creation, while Anaconda [20] was used to provide Scikit-learn, Pandas, Matplotlib and Seaborn for analysis and visualisation for both models.

For TabNet, SimpleImputer was used to replace missing data with a constant value. Further, the model’s mask_type parameter was set to ‘entmax’, which showed a better overall performance than the default ‘sparsemax’. The ‘weights’ parameter was set to ‘1’ to address class imbalance, while the batch size was set at the maximum recommended 10% of the total data size at 72. A maximum of 1000 epochs were allowed with a patience (i.e. the number of epochs to wait for improvement before terminating the training run) of 100.
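The patience-based stopping rule and the batch-size arithmetic above can be illustrated with a short sketch. It is a generic early-stopping loop over a sequence of validation scores, not the internals of the TabNet implementation; the function name and the toy score sequence are hypothetical.

```python
from typing import Iterable, Tuple


def early_stopping_epoch(val_scores: Iterable[float], patience: int,
                         max_epochs: int) -> Tuple[int, int, float]:
    """Return (stop_epoch, best_epoch, best_score); epochs are numbered from 1.

    Training stops once `patience` epochs have passed without improvement
    over the best validation score, or when `max_epochs` is reached.
    """
    best_score = float("-inf")
    best_epoch = 0
    stop_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if epoch > max_epochs:
            break
        stop_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return stop_epoch, best_epoch, best_score


# Batch size quoted in the text: 10% of the 723 variants, rounded down.
batch_size = int(0.10 * 723)

# Toy validation accuracies (hypothetical): improvement stalls after epoch 3.
toy_scores = [0.50, 0.60, 0.70] + [0.65] * 10
stop, best, score = early_stopping_epoch(toy_scores, patience=3, max_epochs=1000)
```

In the toy run the best score occurs at epoch 3 and training halts three epochs later. With the settings reported in the text (patience of 100, at most 1000 epochs) the same rule would halt a run well before the epoch limit once validation performance plateaus.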

For XGBoost, the hyper-parameters were selected and evaluated using a 5-fold cross validation (CV) approach. A randomized search on the hyperparameters was performed using RandomizedSearchCV (CV = 5). Class imbalance was corrected using the scale_pos_weight parameter set at 3.95. The following hyper-parameters were finally used (Table 1):

Table 1. Model hyperparameters used for the XGBoost model.

Hyperparameter           Value     Hyperparameter            Value      Hyperparameter         Value
base_score               0.5       gpu_id                    -1         min_child_weight       1
booster                  gbtree    grow_policy               depthwise  missing                nan
callbacks                None      importance_type           None       monotone_constraints   ()
colsample_bylevel        1         interaction_constraints   (blank)    n_estimators           50
colsample_bynode         1         learning_rate             0.25       n_jobs                 0
colsample_bytree         0.9       max_bin                   256        num_parallel_tree      1
early_stopping_rounds    None      max_cat_to_onehot         4          predictor              auto
enable_categorical       FALSE     max_delta_step            0          random_state           0
eval_metric              None      max_depth                 6          reg_alpha              0
gamma                    0.2       max_leaves                0          reg_lambda             1
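As a quick arithmetic check, the scale_pos_weight value of 3.95 quoted above is numerically consistent with the ratio of pooled pathogenic to pooled benign variants in the 723-variant dataset. Whether the authors derived the value this way, or from the training split only, is an assumption; the sketch below simply reproduces the ratio from the class counts given in the Datasets section.

```python
# Class counts reported for the 723-variant training dataset.
counts = {
    "pathogenic": 410,
    "likely pathogenic": 167,
    "benign": 9,
    "likely benign": 137,
}

# Pool the four ACMG classes into the two training labels.
n_pathogenic = counts["pathogenic"] + counts["likely pathogenic"]  # label 1
n_benign = counts["benign"] + counts["likely benign"]              # label 0

total = n_pathogenic + n_benign
class_ratio = n_pathogenic / n_benign  # pathogenic-to-benign ratio

print(f"pathogenic={n_pathogenic}, benign={n_benign}, total={total}")
print(f"pathogenic/benign ratio = {class_ratio:.2f}")
```

XGBoost's documentation suggests sum(negative instances) / sum(positive instances) as a typical starting value for scale_pos_weight, so the direction of the ratio is worth checking whenever the label-1 class is the majority, as it is here.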

The mean of the scores returned by the cross_val_score function (CV = 10) was used to test model performance for both models across multiple train/test splits. Several models, with and without the hyperparameters determined during tuning, were tested for performance using the accuracy metrics described below. The best performing model was then selected.

Independent validation dataset

An additional 420 variants, not included in the WilsonGen database, were curated from literature published up to 2022. The variants were classified according to the ACMG & AMP guidelines as described previously. The dataset comprised 31 variants annotated as pathogenic, 29 as likely pathogenic and 96 as likely benign; the remaining variants were classified as VUS. Thus we had a total of 156 variants in our independent test dataset.

Accuracy estimates

The following accuracy estimates were used to evaluate the performance of the models: a) Sensitivity, b) Specificity, c) Accuracy, d) Positive Predictive Value (PPV), e) Negative Predictive Value (NPV), and f) Matthews Correlation Coefficient (MCC). All data used in this study is freely accessible at: https://clingen.igib.res.in/WilsonGen/. The source code of both our models is available at https://github.com/aastha-v/WilsonGenAI. The models have been standardized on Ubuntu 18 LTS. The instructions and code for the preprocessing pipeline, for variant classification through our models, and for generating one’s own models are also freely included.
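For reference, the six estimates listed above can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definitions, with pathogenic as the positive class. The example counts are taken from the XGBoost result on the independent validation set reported in the Results (60 of 60 pathogenic and 95 of 96 benign variants classified correctly); the function name is hypothetical and this is not the authors' evaluation code.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""

    def safe_div(num: float, den: float) -> float:
        # Return NaN rather than raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": safe_div(tp, tp + fn),   # true positive rate (recall)
        "specificity": safe_div(tn, tn + fp),   # true negative rate
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "ppv": safe_div(tp, tp + fp),           # positive predictive value
        "npv": safe_div(tn, tn + fn),           # negative predictive value
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# XGBoost on the independent validation set: TP=60, FN=0, TN=95, FP=1.
metrics = classification_metrics(tp=60, fp=1, tn=95, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC is included alongside accuracy because it uses all four cells of the confusion matrix and so remains informative when the two classes are imbalanced.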

Patient data validation

Generating variants and functional validation

ATP7B plasmid (pLB1080; Addgene) was subjected to site-directed mutagenesis (SDM) according to the manufacturer’s instruction (Agilent, 200522) using the set of primers shown in Table 2.

Table 2. Primers used in site-directed mutagenesis.
Variant              Forward Primer (5’-3’)                Reverse Primer (5’-3’)
c.2564C>T (S855F)    CTCCTGTGATGAGGAACTCATCAGCCATGGTATT    AATACCATGGCTGATGAGTTCCTCATCACAGGAG
c.813C>A (C271X)     GCCTCCGCAGTCTCCACCACAGCCA             TGGCTGTGGTGGAGACTGCGGAGGC

To understand the impact of WT (wild type) ATP7B and its protein mutants, ATP7B knock-out HepG2 cells were cultured under standard conditions. The different plasmids were transfected using Lipofectamine 3000 (Thermo Scientific, L3000008). After 24 hours, cells were treated with 500 μM CuCl2 for 6 hours, after which the medium was replaced with fresh media. After 18 hours, the spent media was collected to estimate the exported copper using the manufacturer’s protocol (Sigma, MAK127). The colorimetric data from the assay was analyzed using an unpaired t-test, with a p-value < 0.05 considered statistically significant for all three sets of experiments.

Results

Both models performed best with a 70–30% train-test split. The TabNet model additionally split the 30% held-out set into 50% test and 50% validation subsets during training.
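The subset sizes implied by this split can be worked out with a few lines of arithmetic. The sketch below assumes scikit-learn-style rounding, in which the held-out size is the ceiling of the requested fraction; the paper does not state the rounding rule, so this is an assumption, although the resulting 109-variant TabNet test subset matches the figure reported in the TabNet results.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, held_out_fraction: float) -> Tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


# 70-30 split of the 723 training variants.
n_train, n_holdout = split_sizes(723, 0.30)

# TabNet: the 30% held-out part is halved into validation and test subsets.
n_valid, n_test = split_sizes(n_holdout, 0.50)

print(f"train={n_train}, held-out={n_holdout}")
print(f"TabNet validation={n_valid}, TabNet test={n_test}")
```

Under this assumption XGBoost is evaluated on all 217 held-out variants, while TabNet is evaluated on 109 of them and uses the other 108 for validation during training.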

Accuracy estimates

TabNet

Although the model was set to run for a maximum of 1000 epochs, training stopped at the 187th epoch, with a best accuracy of 99% on the validation set and 97.24% on the test set. The overall MCC was 0.92. The Precision, Recall and F1 scores are shown in S2 Table. S1 Fig shows the accuracy and Fig 2 the receiver operating characteristic (ROC) curve; the area under the curve (AUC) was 0.996. Further, S2 Fig shows the confusion matrix for our test data: out of 109 variants in the 50% test subset, the model correctly predicted 84 as pathogenic and 22 as benign. The Precision-Recall curve is shown in S3 Fig, with the overall area under the precision-recall curve (AUPRC) determined to be 1. The model learning rate and loss are plotted in S4 Fig. Additionally, the model Specificity and Negative Predictive Value (NPV) were both 1.

Fig 2. The receiver operating characteristic curve for (A) the TabNet and (B) the XGBoost model.

XGBoost

The XGBoost model had an overall accuracy on the test set of 0.986175, an AUC of 0.9926 and an MCC of 0.952773. Fig 2 shows the ROC curve, while S2 Fig depicts the confusion matrix. The Precision, Recall and F1 scores are shown in S2 Table. The Precision-Recall curve is shown in S3 Fig, with the overall AUPRC determined to be 1. Additionally, the model Specificity was 0.989, and its NPV was 1.

Validation in an independent set of variants classified according to the ACMG & AMP guidelines

After removing all non-exonic variants, we had a total of 96 benign/likely benign variants pooled together as “benign”, and 60 pathogenic/likely pathogenic variants pooled together as “pathogenic”. Upon running our models on the data, the TabNet model classified all variants correctly, while XGBoost correctly classified 60 variants as pathogenic and 95 as benign, as shown in the confusion matrices in S5 Fig. Scatterplots of class probability vs the actual ACMG class for each model across all 156 variants are shown in S6 and S7 Figs for TabNet and XGBoost respectively.

Comparison with CADD

Both our models performed better than CADD, which only had scores for 53 out of the 156 variants included in the independent ACMG-qualified test set. TabNet had an overall accuracy on the test set of 1, and XGBoost of 0.9935, while CADD only had an overall accuracy of 0.9811 on the limited number of variants it predicted. A complete comparison between the Accuracy, PPV, NPV and MCC is shown in S3 Table.

Comparison with other models

To the best of our knowledge, ours are the only models trained on an ACMG/AMP gold standard dataset specifically created for ATP7B variants linked with Wilson’s Disease. While other deep learning models based on ACMG/AMP guidelines such as RENOVO [21] and MLVar [22] exist, they are either not trained on manually classified variants/attributes, or do not follow a disease-specific approach. As each disease follows different genetic mechanisms, generalization across all diseases is difficult for a single model to achieve. We have, however, included scores generated by running RENOVO, as well as pre-determined scores obtained for 11 other models, namely AlphaMissense, REVEL, SIFT, Polyphen2, Eigen-PC, LRT, MutationTaster, FATHMM, PROVEAN, MetaLR, and MutationAssessor [23–33], in S3 Table and S8 Fig. Our models were able to outperform the others over the combined metrics of Accuracy, PPV, NPV, and MCC.

Reclassification of VUS variants

We collected all ATP7B variants of unknown significance and conflicting or missing classification from the ClinVar [34] database as well as our in-house data and used the model to reclassify them. Out of 977 exonic variants, TabNet reclassified 736 variants as pathogenic and 241 as benign. XGBoost on the other hand reclassified 800 as pathogenic and 177 as benign. Overall, a 91.4% concordance in predictions (726 pathogenic and 167 benign variants) was observed between the two models. The complete list of these variants and their reclassification can be accessed in S4 Table.
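The concordance figure can be reproduced from the counts quoted above, and the same counts also fix how the two models disagree. The sketch below is simple bookkeeping on the reported numbers; the variable names are illustrative.

```python
# Reported reclassification counts for the 977 exonic VUS variants.
total = 977
tabnet = {"pathogenic": 736, "benign": 241}
xgboost = {"pathogenic": 800, "benign": 177}
agree = {"pathogenic": 726, "benign": 167}  # same call from both models

concordant = agree["pathogenic"] + agree["benign"]
concordance_pct = 100.0 * concordant / total

# Discordant calls follow from the per-model totals.
tabnet_path_xgb_benign = tabnet["pathogenic"] - agree["pathogenic"]
tabnet_benign_xgb_path = xgboost["pathogenic"] - agree["pathogenic"]

print(f"concordant: {concordant}/{total} ({concordance_pct:.1f}%)")
print(f"TabNet pathogenic / XGBoost benign: {tabnet_path_xgb_benign}")
print(f"TabNet benign / XGBoost pathogenic: {tabnet_benign_xgb_path}")
```

The two discordant cells are unequal: most disagreements are variants that XGBoost calls pathogenic and TabNet calls benign, which matches XGBoost's higher overall pathogenic count.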

Scatterplots of class probability vs the predicted class for each model across 251 exonic VUS variants that were a part of our validation dataset are shown in S9 and S10 Figs for TabNet and XGBoost respectively.

Patient data validation

Impacts of WT ATP7B protein variants in cellular copper excretion

The copper assay data for the ATP7B variants S855F and C271X (the positive control for impaired ATP7B) showed reduced copper levels in the media in comparison to WT ATP7B (Fig 3). This implies that WT ATP7B promotes the excretion of excess copper into the media, while the S855F and C271X mutants show impaired protein function.

Fig 3. Copper exposure in ATP7B Knock-out HepG2 cells overexpressed with the plasmid containing wildtype and mutant ATP7B gene.

The copper transport activity of ATP7B mutants S855F and C271X is significantly impaired in comparison to the wild-type ATP7B, where N = 3, ** denotes p value less than 0.01 and *** denotes p value less than 0.001.

We ran both our models on each of the variants: both models accurately identified the control C271X variant as pathogenic, and also classified S855F as pathogenic. Thus, when tested on functionally validated data, both our models provide accurate classifications of the variants.

Feature importance and computational efficiency

The feature importances of the top 20 features are depicted in S11 Fig. The larger the score, the higher the impact of the feature on the model. Both models had 10 features in common. Both rely on loss-of-function (LoF) information, wherein a genetic lesion prevents the formation of a normal gene product, thereby leading to disease. They also take into account the genomic position of the mutation (Start: nucleotide), which could dictate a pathogenic effect. Additionally, they rely on the global prevalence of variants (1000Genomes AF -ALL): the number of high-frequency disease-causing variants is usually small, i.e. most pathogenic variants are rare. The remaining features common to both models consist of pathogenicity scores obtained from 7 prediction tools (MetaSVM, MCAP, MutPred, SIFT4G, REVEL, PolyPhen2 HDIV, and MutationTaster). Additional details of these features can be seen in S1 Table.

The XGBoost model additionally relies on the exonic function of the variant (Function), i.e. the nature of the effect the variant has (a stopgain/loss variant for example, would have a larger effect on the protein than a synonymous variant). It also takes into account the allele frequencies reported in the GnomAD database, which is a larger population dataset. Finally, it also considers conservation scores (Siphy 29way logOdds and MutationAssessor) that dictate how conserved a given site is among mammals, indicating a potentially important location, and thus a potentially more disruptive effect, as well as additional pathogenicity scores (DANN, MetaRNN, and BayesDel).

The TabNet model additionally considers variant prevalence in GnomAD (GnomadAF—Raw) and the Northeast African subset of the Greater Middle East populations (GME AF—NEA), as well as pathogenicity and conservation scores (LRT, integrated_fitCons, PrimateAI, Eigen-PC-raw coding, and LIST-S2).

Thus both models take a well-rounded approach, considering different aspects that determine variant pathogenicity, and are thus able to make reliable predictions. Further, the training dataset labels have been determined through ACMG classification, which takes into account all aspects of relevant biological data, including functional and segregation evidence. As such, the models capture patterns among the attributes that lead to these classifications.

The complete time taken to process a VCF file into suitable input and then train a model was plotted for each model separately, and is shown in S12 Fig.

Discussion and conclusion

In our work we have created two tools that can be used to classify variants of the ATP7B gene linked with Wilson’s Disease. While tree-based XGBoost is one of the most reliable algorithms for tabular data, our study shows that TabNet, a deep learning model designed specifically for tabular data, slightly outperforms it in the classification of ATP7B variants. We have trained these models on a dataset classified through the application of ACMG guidelines, the gold standard in variant classification. Additionally, the data is a robust compilation of all publicly available variants of the gene exhaustively collected from literature and across 9 large databases. Thus the models were trained on accurately classified variants that capture all currently known types of exonic variants associated with Wilson’s Disease, due to which we anticipate the models to be able to generalize to newly reported variants in the future. We have shown our models’ accuracy through functional validation as well as comparison with other models. Finally, to address the large numbers of already reported variants of uncertain significance, we have collected and classified 977 exonic variants through both models; the predictions achieved a 91.4% concordance across 726 pathogenic and 167 benign variants. We have made these predictions openly available, along with their class probabilities to facilitate a better understanding of variant pathogenicity for clinicians and researchers.

Clinical diagnosis of Wilson’s Disease is often challenging due to the heterogeneous nature of the symptoms it presents with. Genetic testing has thus been included in the diagnosis process as part of the Leipzig scoring system [23]. Additionally, testing can also rule out other genetic disorders, such as some congenital disorders of glycosylation, that mimic Wilson’s Disease but are not caused by ATP7B variants [35]. Since early diagnosis may prevent patients from ever becoming symptomatic, infant and newborn screening, as well as family screening, also become important. Accurate clinical interpretation of variants is therefore essential for diagnosis. Our models offer a means of rapidly applying patterns learned from ACMG-based classification to a large number of variants, which is otherwise a time-consuming and expertise-dependent process. Given the complex nature and varied mechanisms of genetic diseases, adopting a generalized approach to classifying causative variants is ill-advised. We have shown this through the superior performance of our models over other general ACMG-based models. To the best of our knowledge, no other models based on the ACMG classification of Wilson’s Disease variants currently exist.

We believe therefore, that our models can be utilized for the rapid classification of Wilson’s Disease variants for better understanding of their pathogenicity in both research and clinical settings.

Limitations: Even though the WilsonGen database is an exhaustive compendium of currently known and classified variants, the number of classified exonic variants still remains small. ACMG classification of variants is also a time-consuming process, and thus a newer dataset may take time to assemble. We have therefore only been able to test model generalization on a dataset of 156 test variants. Additionally, the functional classification of ATP7B variants is still ongoing. Upon its completion, a clearer picture of which of the two models has performed better can be obtained.

Supporting information

S1 Fig. Train and validation accuracies of the TabNet model across 187 epochs.

(TIF)

pone.0303787.s001.tif (575.1KB, tif)
S2 Fig. Confusion matrix depicting the models’ predictions on the 30% test data.

Fig A represents TabNet while B represents XGBoost.

(TIF)

pone.0303787.s002.tif (42KB, tif)
S3 Fig. The Precision-Recall curve of both the models.

(TIF)

pone.0303787.s003.tif (99.2KB, tif)
S4 Fig. The model learning rate and loss of the TabNet model across 187 epochs.

(TIF)

pone.0303787.s004.tif (377.5KB, tif)
S5 Fig. Confusion matrix of predictions made on the ACMG-qualified independent validation dataset comprising 156 variants.

Fig A represents TabNet while B represents XGBoost.

(TIF)

pone.0303787.s005.tif (39.1KB, tif)
S6 Fig. Scatterplot of class probability vs the actual ACMG class for the TabNet model across the validation set of 156 variants.

(TIF)

pone.0303787.s006.tif (49.7KB, tif)
S7 Fig. Scatterplot of class probability vs the actual ACMG class for the XGBoost model across the validation set of 156 variants.

(TIF)

pone.0303787.s007.tif (50.3KB, tif)
S8 Fig. Barplot comparing the accuracy, MCC, NPV and PPV of 13 models with TabNet and XGBoost.

Abbreviations: MAssessor—MutationAssessor; MTaster—MutationTaster.

(TIF)

pone.0303787.s008.tif (415.8KB, tif)
S9 Fig. Scatterplot of class probability vs the predicted class for the TabNet model across the 251 exonic VUS variants that were a part of the validation dataset.

(TIF)

pone.0303787.s009.tif (68.1KB, tif)
S10 Fig. Scatterplot of class probability vs the predicted class for the XGBoost model across the 251 exonic VUS variants that were a part of the validation dataset.

(TIF)

pone.0303787.s010.tif (51.5KB, tif)
S11 Fig. Plot depicting the feature importance of the top 15 features of each model.

The x-axis for XGBoost plots F-score, while that of TabNet plots scores for each feature.

(TIF)

pone.0303787.s011.tif (588.7KB, tif)
S12 Fig. Plot depicting the complete time taken to process a VCF file into suitable input and then train a model, for (A) TabNet and (B) XGBoost respectively.

(TIF)

pone.0303787.s012.tif (80.2KB, tif)
S1 Table. A complete list of the 73 features used in training the model, along with the ACMG attribute they provide information about, along with their description, as well as their datatype.

(XLSX)

pone.0303787.s013.xlsx (12.2KB, xlsx)
S2 Table. The classification report with the Precision, Recall and F1 scores for the TabNet model (A) and XGBoost model (B) respectively.

(XLSX)

pone.0303787.s014.xlsx (7.8KB, xlsx)
S3 Table. Comparison of the performance of both models against 13 other models on the independent test dataset.

(XLSX)

pone.0303787.s015.xlsx (8.4KB, xlsx)
S4 Table. A list of 977 exonic variants of uncertain significance reclassified by our models TabNet (A) and XGBoost (B). Variants highlighted in bold represent concordance between predictions from both algorithms. Table (C) describes the nucleotide and protein changes in HGVS nomenclature, and also describes each variant’s exonic function.

(XLSX)

pone.0303787.s016.xlsx (121.9KB, xlsx)

Acknowledgments

AV acknowledges a Senior Research Fellowship from ICMR. MK acknowledges a Senior Research Fellowship from CSIR.

Abbreviations

ACMG-AMP: American College of Medical Genetics and Genomics and the Association for Molecular Pathology

MCC: Matthews Correlation Coefficient

NPV: Negative Predictive Value

PPV: Positive Predictive Value

SDM: Site-Directed Mutagenesis

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

This work was supported by the Council of Scientific and Industrial Research (CSIR) [IndiGenApp Grant and OLP2301]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Wang J, Tang L, Xu A, Zhang S, Jiang H, Pei P, et al. Identification of mutations in the ATP7B gene in 14 Wilson disease children: Case series. Medicine. 2021;100: e25463. doi: 10.1097/MD.0000000000025463
2. Rodriguez-Castro KI, Hevia-Urrutia FJ, Sturniolo GC. Wilson’s disease: A review of what we have learned. World J Hepatol. 2015;7: 2859–2870. doi: 10.4254/wjh.v7.i29.2859
3. Gao J, Brackley S, Mann JP. The global prevalence of Wilson disease from next-generation sequencing data. Genet Med. 2019;21: 1155–1163. doi: 10.1038/s41436-018-0309-9
4. Kim G-H, Yang JY, Park J-Y, Lee JJ, Kim JH, Yoo H-W. Estimation of Wilson’s disease incidence and carrier frequency in the Korean population by screening ATP7B major mutations in newborn filter papers using the SYBR green intercalator method based on the amplification refractory mutation system. Genet Test. 2008;12: 395–399. doi: 10.1089/gte.2008.0016
5. Roberts EA. Update on the Diagnosis and Management of Wilson Disease. Curr Gastroenterol Rep. 2018;20: 56. doi: 10.1007/s11894-018-0660-7
6. Olivarez L, Caggana M, Pass KA, Ferguson P, Brewer GJ. Estimate of the frequency of Wilson’s disease in the US Caucasian population: a mutation analysis approach. Ann Hum Genet. 2001;65: 459–463. doi: 10.1017/S0003480001008764
7. Jang J-H, Lee T, Bang S, Kim Y-E, Cho E-H. Carrier frequency of Wilson’s disease in the Korean population: a DNA-based approach. J Hum Genet. 2017;62: 815–818. doi: 10.1038/jhg.2017.49
8. Collet C, Laplanche J-L, Page J, Morel H, Woimant F, Poujois A. High genetic carrier frequency of Wilson’s disease in France: discrepancies with clinical prevalence. BMC Med Genet. 2018;19: 143. doi: 10.1186/s12881-018-0660-3
9. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17: 405–424. doi: 10.1038/gim.2015.30
10. Kumar M, Gaharwar U, Paul S, Poojary M, Pandhare K, Scaria V, et al. WilsonGen a comprehensive clinically annotated genomic variant resource for Wilson’s Disease. Sci Rep. 2020;10: 1–6. doi: 10.1038/s41598-020-66099-2
11. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38: e164. doi: 10.1093/nar/gkq603
12. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581: 434–443. doi: 10.1038/s41586-020-2308-7
13. A global reference for human genetic variation. Nature. 2015;526: 68–74. doi: 10.1038/nature15393
14. Scott EM, Halees A, Itan Y, Spencer EG, He Y, Azab MA, et al. Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery. Nat Genet. 2016;48: 1071–1076. doi: 10.1038/ng.3592
15. Witten IH, Frank E, Hall MA. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier; 2011. https://play.google.com/store/books/details?id=bDtLM8CODsQC.
16. Shwartz-Ziv R, Armon A. Tabular data: Deep learning is not all you need. Information Fusion. 2022. pp. 84–90. doi: 10.1016/j.inffus.2021.11.011
17. Arik SO, Pfister T. TabNet: Attentive Interpretable Tabular Learning. 2019 [cited 25 Jul 2022].
  • 18.Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. 2016 [cited 25 Jul 2022].
  • 19.Ketkar N, Moolayil J. Automatic Differentiation in Deep Learning. Deep Learning with Python. 2021. pp. 133–145.
  • 20.Anaconda Documentation—Anaconda documentation. [cited 26 Jul 2022]. https://docs.anaconda.com/.
  • 21.Favalli V, Tini G, Bonetti E, Vozza G, Guida A, Gandini S, et al. Machine learning-based reclassification of germline variants of unknown significance: The RENOVO algorithm. Am J Hum Genet. 2021;108: 682–695. doi: 10.1016/j.ajhg.2021.03.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Nicora G, Zucca S, Limongelli I, Bellazzi R, Magni P. A machine learning approach based on ACMG/AMP guidelines for genomic variant classification and prioritization. Sci Rep. 2022;12: 1–12. doi: 10.1038/s41598-022-06547-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023. [cited 28 Feb 2024]. doi: 10.1126/science.adg7492 [DOI] [PubMed] [Google Scholar]
  • 24.Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet. 2016;99: 877. doi: 10.1016/j.ajhg.2016.08.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31: 3812. doi: 10.1093/nar/gkg509 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Adzhubei I, Jordan DM, Sunyaev SR. Predicting Functional Effect of Human Missense Mutations Using PolyPhen-2. Curr Protoc Hum Genet. 2013;0 7: Unit7.20. doi: 10.1002/0471142905.hg0720s76 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Ionita-Laza I, Mccallum K, Xu B, Buxbaum J. A SPECTRAL APPROACH INTEGRATING FUNCTIONAL GENOMIC ANNOTATIONS FOR CODING AND NONCODING VARIANTS. Nat Genet. 2016;48: 214. doi: 10.1038/ng.3477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res. 2009;19: 1553–1561. doi: 10.1101/gr.092619.109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods. 2010;7: 575–576. doi: 10.1038/nmeth0810-575 [DOI] [PubMed] [Google Scholar]
  • 30.Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, et al. Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum Mutat. 2013;34: 57. doi: 10.1002/humu.22225 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino acid substitutions and indels. PLoS One. 2012;7: e46688. doi: 10.1371/journal.pone.0046688 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Dong C, Wei P, Jian X, Gibbs R, Boerwinkle E, Wang K, et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet. 2014;24: 2125–2137. doi: 10.1093/hmg/ddu733 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Reva B, Antipin Y, Sander C. Determinants of protein function revealed by combinatorial entropy optimization. Genome Biol. 2007;8. doi: 10.1186/gb-2007-8-11-r232 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42: D980–5. doi: 10.1093/nar/gkt1113 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Espinós C, Ferenci P. Are the new genetic tools for diagnosis of Wilson disease helpful in clinical practice? JHEP Rep. 2020;2: 100114. doi: 10.1016/j.jhepr.2020.100114 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Jianhong Zhou

18 Jan 2024

PONE-D-23-35773

WilsonGenAI a deep learning approach to classify pathogenic variants in Wilson Disease

PLOS ONE

Dear Dr. BK,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 03 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jianhong Zhou

Staff Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections does not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. Thank you for stating the following financial disclosure:

“Funding from the Council of Scientific and Industrial Research (CSIR) through the IndiGenApp Grant and OLP2301”

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

5. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

6. We note that Figure 1 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

1. You may seek permission from the original copyright holder of Figure 1 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

2. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

7. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper presents genetic variant classification using machine learning techniques, specifically TabNet and XGBoost, to classify ATP7B gene variants associated with Wilson's Disease. The study's strength lies in its robust training and validation on a high-confidence dataset and its practical application, as evidenced by successful independent verification and potential utility in clinical and research settings. I have several comments that need to be addressed.

Major:

(1) Why were TabNet and XGBoost chosen as the primary models for this analysis over other deep learning or machine learning models? What specific advantages do they offer for this type of data and problem? Please provide the comparison with other relevant deep learning methods.

(2) The authors mentioned that TabNet uses sequential attention for feature selection, which is instance-wise. How does this impact the generalizability of the model across different datasets or variants? Is there a risk of overfitting to specific features in the training dataset?

(3) The authors note that XGBoost is effective in handling sparse data. However, it is not clear how this capability was specifically advantageous in the current study, given the characteristics of the dataset used.

(4) For the models, the authors have set specific hyperparameters. The manuscript needs more details about how these parameters were chosen.

(5) The authors adjusted the scale_pos_weight in XGBoost for class imbalance. How significant was the class imbalance in the dataset used, and how did this adjustment impact the model's performance, especially in terms of precision and recall?

(6) TabNet stopped training at the 187th epoch out of a possible 1000. Was this due to an early stopping criterion based on validation accuracy? A big epoch size does not necessarily increase the accuracy of the model. How was the risk of overfitting addressed given the excessive epoch size (>100)?

(7) The authors mentioned the top 20 features in feature importance plots for both models. Could the authors provide insights into what these features represent and how they contribute to the pathogenicity classification? How interpretable are these models in terms of understanding the biological significance of these features?

(8) The manuscript needs more details on how specificity, negative predictive value (NPV), or area under the precision-recall curve (AUPRC) were considered.

(9) The test sets' composition (number of benign vs. pathogenic variants) and their source (whether they were balanced or reflective of real-world distributions) are not detailed. How might this affect the models' generalizability to other datasets or real-world scenarios?

(10) The comparison with CADD and other models like RENOVO and MLVar suggests superior performance of your models. However, were these comparisons made under similar conditions (e.g., same datasets, metrics)? How do the models compare in terms of computational efficiency and scalability?

(11) When reclassifying variants of uncertain significance, how did the authors validate the accuracy of these reclassifications? Is there a risk of introducing bias or errors in this process, given the uncertain nature of these variants?

(12) The discussion section of the manuscript needs to be significantly expanded. These are a few points the authors may consider while revising the discussion section. In the discussion, the authors should interpret and explain the findings, placing them in the context of the broader field. Begin by summarizing the main findings of the study, highlighting how they address the research questions or hypotheses stated in the introduction. Then, contextualize these results within the existing literature, discussing how these findings align with or differ from previous research and the potential reasons for these similarities or differences. What are the significance and implications of the results, considering both their theoretical and practical applications? Acknowledge the limitations, discussing how they might affect your findings and suggesting areas for future research to address these gaps. This section should bridge the gap between the presented research and the larger scientific community, demonstrating how this work contributes to and advances the field.

Minor:

(1) It would help readers to introduce Wilson's disease in the introduction section.

(2) The relevance of choosing the ATP7B gene needs to be added in the introduction.

(3) "Non-exonic variants and VUS were removed from the analysis and this resulted in a variant dataset of 723 unique variants, ..." Explain why. What is VUS? Expand all the abbreviations at the first use.

(4) lines 106 – 113: The parameters could be presented in a table.

Reviewer #2: Summary:

Vatsyayan et al. applied two ML models (TabNet and XGBoost) to classify ATP7B genetic variants of Wilson disease based on highly engineered features of each variant. Both models show very high classification accuracy, indicating their potential to reduce manual evaluation efforts such as following the guidelines of the American College of Medical Genetics and Genomics and the Association of Molecular Pathologists. However, because of the lack of comparison with other variant classification methods, e.g. disease-agnostic models, it is hard to tell the novelty of WilsonGenAI and whether WilsonGenAI really adds value to Wilson's disease-specific variant classification. Due to the high storage requirement of WilsonGenAI, I have not evaluated the software itself. Please see the following comments for major revision:

Major:

1. In the introduction, please review and discuss related work. Lines 174-180 should be part of the introduction.

2. In the results, in addition to CADD, please compare the WilsonGenAI results with more state-of-the-art methods such as Eigen-PC, REVEL, AlphaMissense, etc. Moreover, the argument for not comparing the proposed methods with RENOVO and MLVar is not convincing. Please also include these results in Table S3. Without seeing these baseline results, it is difficult to conclude that TabNet and XGBoost are necessary as Wilson's Disease-specific models. The model comparison figure (e.g. a barplot of Table S3) might be the main figure highlighted by this paper.

3. Line 68: there are many more pathogenic variants than benign ones. Have the authors considered whether the imbalanced distribution will influence the results?

4. Lines 74-76: it seems the three populations used for annotation are different from the population of the WilsonGen dataset. Can the authors discuss the potential problems of this inconsistency?

5. Figure S1. Can the authors show both training and validation loss in order to easily see whether the model is overfitting or not?

6. Figure S2. It seems the important features identified by XGBoost and TabNet are quite different but their ROC curves are similar. Can the authors discuss this further?

7. Figure S3. It seems the accuracies are very unstable. Can the authors comment on this problem?

8. Figure S4. It is weird that the total number of variants is different between the methods.

9. Since the proposed models only consider two classes, in practice, for the variants with around 0.5 predicted probability, should the user regard them as VUS? Is there a recommended threshold? This is especially important as the authors claimed WilsonGenAI could be used for clinical diagnosis.

10. For the VUS of the independent datasets (line 124), do their predicted scores fall near the margin between the two classes? Is there a trend or correlation between the predicted score and the 5 ordinal classes?

11. Lines 189-200: is there a specific consideration behind choosing S855 and C271X for validation? Ideally, it would be very interesting to see if some VUS with very high predicted pathogenic probability can be validated to lead to low copper concentration.

12. It seems the independent dataset is not available.

Minor:

1. Please define the abbreviations WD and VUS before using them.

2. line 149. Please round the number.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 May 17;19(5):e0303787. doi: 10.1371/journal.pone.0303787.r002

Author response to Decision Letter 0


5 Mar 2024

A point-by-point response to reviewers

Reviewer #1: This paper presents genetic variant classification using machine learning techniques, specifically TabNet and XGBoost, to classify ATP7B gene variants associated with Wilson's Disease. The study's strength lies in its robust training and validation on a high-confidence dataset and its practical application, as evidenced by successful independent verification and potential utility in clinical and research settings. I have several comments that need to be addressed.

Major:

(Q1) Why were TabNet and XGBoost chosen as the primary models for this analysis over other deep learning or machine learning models? What specific advantages do they offer for this type of data and problem? Please provide the comparison with other relevant deep learning methods.

Response: We appreciate the inquiry from the reviewer. In our initial exploration of model selection, we conducted a comprehensive analysis using the Weka suite (Witten et al. 2011). The dataset under consideration at that time comprised 725 variants, split into 70% training and 30% testing datasets. The table below illustrates the train and test accuracies achieved by different algorithms:

Model          Train Accuracy (%)   Test Accuracy (%)
RandomForest   97.925               98.611
J48            97.41                98.61
SMO            96.89                97.92
NaiveBayes     83.59                76.39

Consistent with conventional wisdom, tree ensemble models demonstrated superior performance on tabular data. Notably, RandomForest and J48 outperformed other algorithms. Therefore, we opted for the XGBoost classifier, a widely-used gradient-boosted decision tree, known for its efficiency in handling tabular datasets. XGBoost offers advantages such as faster execution, robust performance with missing data, and effective handling of class-imbalanced datasets. Its built-in regularization helps mitigate overfitting, a concern associated with models like RandomForest.

Furthermore, recognizing the specialized nature of tabular data, we incorporated the TabNet deep learning model into our analysis. TabNet, designed specifically for tabulated datasets, leverages a transformer architecture to emulate the learning of decision trees. This design enables TabNet to rapidly discern intricate data patterns. While XGBoost excels in certain scenarios, TabNet has demonstrated superiority over tree methods for specific tabular datasets.

Given the variability in the comparative performance of these models depending on the dataset, we decided to include results from both XGBoost and TabNet for a comprehensive evaluation. This dual-model approach allows for a more nuanced understanding and robust assessment of the predictions made by each model.

(Q2) The authors mentioned that TabNet uses sequential attention for feature selection, which is instance-wise. How does this impact the generalizability of the model across different datasets or variants? Is there a risk of overfitting to specific features in the training dataset?

Response: We appreciate the reviewer's insightful consideration of the generalizability of TabNet across diverse datasets. While TabNet incorporates features like prior scales to mitigate overfitting, predicting the exact extent of model generalization remains challenging, particularly in the absence of ample accurately classified variant datasets. Our training utilized the most extensive dataset of Wilson's Disease variants reported in literature, encompassing nine large datasets with ACMG classifications. The model consistently demonstrated high classification accuracy across both classes, as assessed by Matthews Correlation Coefficient (MCC). This performance instills confidence in its potential to perform well on other real-world datasets. Regrettably, we were unable to conduct additional testing due to the scarcity of available ACMG-classified or functionally validated variant datasets. Despite this limitation, our rigorous training on a diverse and comprehensive dataset enhances our confidence in the model's ability to generalize to different datasets and variants. Future investigations and validations with additional variant datasets would certainly contribute to a more comprehensive understanding of TabNet's generalizability across a broader spectrum of genetic variations.

(Q3) The authors note that XGBoost is effective in handling sparse data. However, it is not clear how this capability was specifically advantageous in the current study, given the characteristics of the dataset used.

Response: We appreciate the reviewer's observation, and would like to clarify the specific advantage of XGBoost's capability to handle sparse data in our study. Real-world datasets often exhibit missing values, posing a challenge for deep learning models. In our dataset, certain features, such as pathogenicity and conservation scores, had missing values due to the inherent characteristics of their respective prediction algorithms. To address this issue with TabNet, we had to perform imputation by substituting missing data with a constant value far outside the scale of all scores. This substitution aimed to avoid introducing unintended bias.

In contrast, XGBoost has an inherent advantage in handling sparse data. Unlike TabNet, XGBoost required no data substitution for missing values, resulting in a more streamlined preprocessing step. Moreover, XGBoost's Sparsity-aware Split Finding algorithm learns, at each split, a default branch direction for data points with missing values, contributing to an improved overall performance. This capability proved advantageous in our study, both in streamlining the preprocessing phase and in enhancing the model's efficiency in handling sparse data patterns.
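The constant-value substitution described above for TabNet can be sketched as follows. This is a minimal illustration only: the sentinel value, the helper names, and the toy score columns are assumptions, since the response does not state the constant that was actually used.

```python
import math

# Hypothetical sentinel (an assumption): a constant far outside the range of any score.
SENTINEL = -999.0

def is_missing(value):
    """Treat None and NaN as missing annotation scores."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute_with_sentinel(rows, sentinel=SENTINEL):
    """Return a copy of `rows` with every missing entry replaced by `sentinel`."""
    return [[sentinel if is_missing(v) else v for v in row] for row in rows]

# Toy feature rows with two hypothetical score columns and some gaps.
rows = [
    [0.91, 5.2],
    [None, 4.8],
    [0.12, float("nan")],
]
filled = impute_with_sentinel(rows)
```

A gradient-boosted tree model that handles missing entries natively would skip this step and receive the gaps unchanged.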

(Q4) For the models, the authors have set specific hyperparameters. The manuscript needs more details about how these parameters were chosen.

Response: We appreciate the reviewer's request for more details on the hyperparameter selection process. In the TabNet model, we explored different values for the mask_type parameter during experimentation. The "entmax" setting demonstrated superior overall prediction accuracy compared to the default "sparsemax," leading us to choose it for model training. The “patience” parameter, governing the number of epochs to await improvement before terminating a training run, was set at 100, with a maximum of 1000 epochs allowed. Various dataset splits were tested, including 70% train and 30% test, as well as 80% train and 20% test, to ensure robust testing.
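The interaction between a patience of 100 and a cap of 1000 epochs can be illustrated with a small, library-independent sketch. It is not the TabNet implementation; the function name and the synthetic validation curve (which peaks at epoch 87 purely for illustration) are assumptions.

```python
def run_with_patience(val_scores, patience=100, max_epochs=1000):
    """Return (best_epoch, last_epoch_run) for a sequence of validation scores.

    Training halts once `patience` epochs have passed without a new best score,
    or when `max_epochs` is reached, whichever comes first.
    """
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, last_epoch

# Synthetic validation curve (an assumption): improves until epoch 87, then plateaus.
val_scores = [min(epoch, 87) / 100 for epoch in range(1000)]
best_epoch, stopped_at = run_with_patience(val_scores)
```

With this synthetic curve the loop halts at epoch 187, exactly 100 epochs after the last improvement, which shows how a run can end well before the 1000-epoch cap.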

For the XGBoost model, hyperparameters were selected through a randomized search using RandomizedSearchCV with 5-fold cross-validation. To address class imbalance, the scale_pos_weight parameter was determined by dividing the number of majority class entries by the number of minority class entries. Model performance was assessed as the mean of the scores returned by the cross_val_score function with 10-fold cross-validation. Multiple models, with and without the determined hyperparameters (including scale_pos_weight), were tested using accuracy, AUC, and MCC metrics. Additionally, various train/test splits were explored to identify the best-performing model.

These details have been incorporated into the revised manuscript to provide a comprehensive understanding of the hyperparameter selection process for both the TabNet and XGBoost models.

(Q5) The authors adjusted the scale_pos_weight in XGBoost for class imbalance. How significant was the class imbalance in the dataset used, and how did this adjustment impact the model's performance, especially in terms of precision and recall?

Response: Our train set had 577 pathogenic/likely pathogenic and 146 benign/likely benign variants. Given this imbalance, we adjusted the scale_pos_weight parameter to the recommended value of 3.95. To evaluate the impact of this adjustment, we employed the Matthews Correlation Coefficient (MCC) metric, which comprehensively considers all components of the confusion matrix, namely true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It thus enabled us to determine if the classifier was doing well on both positive and negative classes. This was pertinent due to the potential clinical implications associated with misclassifying benign variants as pathogenic.
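As a worked check of the two quantities mentioned above, the sketch below computes the majority-to-minority class ratio from the reported counts (577 and 146) and defines MCC directly from the four confusion-matrix cells. The function names are illustrative, and the confusion-matrix counts in the example calls are toy values, not results from the study.

```python
import math


def class_ratio(n_majority, n_minority):
    """Majority-to-minority ratio, as described for scale_pos_weight."""
    return n_majority / n_minority


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


ratio = class_ratio(577, 146)          # counts reported for the train set
perfect = matthews_corrcoef(5, 5, 0, 0)       # toy counts
all_positive = matthews_corrcoef(5, 0, 5, 0)  # toy counts
inverted = matthews_corrcoef(0, 0, 5, 5)      # toy counts
```

Rounded to two decimals, the ratio is 3.95, matching the value quoted in the response. The toy cases show why MCC suits imbalanced data: a classifier that labels everything as positive scores 0.0 even though its accuracy on a skewed set could look high.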

Comparing models with and without the adjusted scale_pos_weight, we observed an improvement in performance with the weighted model achieving an MCC of 0.95 compared to 0.90 without the weights. While precision remained consistent at 0.98 for both models, the weighted model exhibited an enhanced recall of 1.00 as opposed to 0.98 without the adjustment. Moreover, the F1-score demonstrated improvement with the weighted model, reaching 0.99 compared to 0.98 without the adjustment.
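The reported F1-scores can be checked against the reported precision and recall with the standard harmonic-mean formula. This is only a consistency check on the rounded figures quoted above; it assumes the three values describe the same class, which the response does not state explicitly.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the response
f1_weighted = f1_score(0.98, 1.00)    # with scale_pos_weight
f1_unweighted = f1_score(0.98, 0.98)  # without scale_pos_weight
```

To two decimals these evaluate to 0.99 and 0.98, in line with the figures given in the response.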

It is noteworthy that our pursuit of optimal hyperparameter configurations involved experimenting with various combinations. Throughout this process, models consistently performed better when the scale_pos_weight parameter was appropriately adjusted. This underscores the significance of addressing class imbalance, as reflected in the superior performance and robustness of models that incorporated the weighted approach.

(Q6) TabNet stopped training at the 187th epoch out of a possible 1000. Was this due to an early stopping criterion based on validation accuracy? A big epoch size does not necessarily increase the accuracy of the model. How was the risk of overfitting addressed given the excessive epoch size (>100)?

Response: Indeed, TabNet implemented an early stopping mechanism based on the validation accuracy metric during training. The early stopping criterion was defined by setting patience at 100, meaning that if the accuracy did not improve for 100 consecutive epochs, the training process would halt. Subsequently, TabNet automatically selected the epoch with the best accuracy score for making predictions on the evaluation set.
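The patience rule described above can be sketched in a few lines. This is a generic illustration of patience-based early stopping, not the code of the library used in the study; the function name, the returned dictionary keys, and the toy validation scores are assumptions.

```python
def run_early_stopping(val_scores, patience=100, max_epochs=1000):
    """Scan per-epoch validation scores and stop once `patience`
    consecutive epochs pass without a new best score."""
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            # New best: remember it and reset the patience window.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return {
        "stopped_epoch": last_epoch,
        "best_epoch": best_epoch,
        "best_score": best_score,
    }


# Toy run with a short patience so the behaviour is easy to follow.
toy_scores = [0.5, 0.6, 0.7] + [0.65] * 10
result = run_early_stopping(toy_scores, patience=3)
```

In the toy run the best score occurs at epoch 2 and training halts at epoch 5, after three epochs without improvement; predictions would then use the epoch-2 weights rather than the final ones.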

To mitigate the risk of overfitting, the model width, representing the number of nodes in a layer, was set to 8. Additionally, the parameter n_steps was configured to 3. These decisions aimed to strike a balance between model complexity and generalization capacity, reducing the likelihood of overfitting.

Furthermore, to validate the robustness of the model, its performance was rigorously assessed on an independent validation set, where it was able to correctly classify all variants across both classes. We thus anticipate the model to be able to generalize well on data beyond the training set.

(Q7) The authors mentioned the top 20 features in feature importance plots for both models. Could the authors provide insights into what these features represent and how they contribute to the pathogenicity classification? How interpretable are these models in terms of understanding the biological significance of these features?

Response: Certainly, the feature importance plots for both models shed light on the factors influencing pathogenicity classification, derived from a comprehensive training set of 73 attributes. These attributes include variant positional information, global population prevalence, pathogenicity prediction scores from various tools, and evolutionary conservation scores.

The top features identified by both models highlight critical determinants of pathogenicity. Loss of function (LoF) information emerges as a key contributor, emphasizing the significance of genetic lesions that impede normal gene product formation, a hallmark of disease causation. The genomic position of the mutation (Start: nucleotide) could also be important in predicting pathogenic effect. Further, the global prevalence of variants, as indicated by the 1000Genomes allele frequency (1000Genomes AF - ALL), underscores the observation that the number of high-frequency disease-causing variants is usually small, i.e. most pathogenic variants are rare across a population. The remaining features common to both models are pathogenicity scores from seven prediction tools (MetaSVM, MCAP, MutPred, SIFT4G, REVEL, PolyPhen2 HDIV, and MutationTaster), reflecting the amalgamation of diverse computational predictions.

The XGBoost model additionally considers exonic function (Function), which describes the nature of the effect the variant has (a stopgain/loss variant, for example, would have a larger effect on the protein than a synonymous variant). Allele frequencies from the GnomAD database, representing a larger population dataset, are also considered. It also takes into account conservation scores (Siphy 29way logOdds and MutationAssessor) that indicate how conserved a given site is among mammals, pointing to a potentially important location and thus a potentially more disruptive effect. The model further incorporates pathogenicity scores from DANN, MetaRNN, and BayesDel.

The TabNet model additionally considers variant prevalence across Gnomad (GnomadAF - Raw) and the Northeast African subset of the Greater Middle East populations (GME AF - NEA). Pathogenicity and conservation scores, including LRT, integrated_fitCons, PrimateAI, Eigen-PC-raw coding, and LIST-S2, enhance the model's ability to capture nuances in variant significance.

Thus both models take a well-rounded approach, considering different aspects that determine variant pathogenicity, and are thus able to make reliable predictions. Further, the train dataset labels have been determined through ACMG classification, which takes into account all aspects of relevant biological data, including functional and segregation evidence. As such, the models capture patterns among the attributes that lead to these classifications.

The table below is a subset of Supplementary Table 1, and offers greater detail on each of the top 20 important features:

| Feature Name | ACMG2015 | Description | Dtype |
|---|---|---|---|
| Function | PVS1, BP7 | Exonic function of the variant (e.g.: nonsynonymous SNV, stopgain/loss, frameshift insertion/deletion etc.) | category |
| DANN Scores | PP3, BP4 | Deleterious Annotation of genetic variants using Neural Networks. Score range: 0-1. | float64 |
| Start: Nucleotide | | Genomic location of nucleotide | int64 |
| MetaSVM | PP3, BP4 | A radial SVM model to predict pathogenicity, trained on whole exome variants. Score range: -2 to 3. | float64 |
| Siphy 29way logOdds Scores | PP3, BP4 | SiPhy score based on 29 mammals genomes. The larger the score, the more conserved the site. Score range: 0 to 37.9718. | float64 |
| Gnomad AF - ALL | BA1, BS1, BS2, PM2 | Alt allele Frequency in the GnomAD database | float64 |
| LoF | PVS1, BP7 | Whether a variant is High Confidence LoF | category |
| MCAP Scores | PP3, BP4 | Pathogenicity classifier for rare missense variants in the human genome. Score range: 0-1. | float64 |
| MutPred Scores | PP3, BP4 | Automates the inference of molecular mechanisms of disease from amino acid substitutions. Models changes of structural features and functional sites between wild-type and mutant protein sequences | float64 |
| SIFT4G Score | PP3, BP4 | Faster implementation of SIFT for wider range of organisms. Score range: 0-1. | float64 |
| 1000Genomes AF - ALL | BA1, BS1, BS2, PM2 | Allele frequency in the 1000 Genomes database | float64 |
| MetaRNN Scores | PP3, BP4 | Pathogenicity prediction scores for human nonsynonymous SNVs (nsSNVs) and non-frameshift (NF) indels. | float64 |
| BayesDel with AF Scores | PP3, BP4 | Deleteriousness meta-score for coding and non-coding variants, SNVs and small insertion / deletions from database with integrated MaxAF. Score range: -1.29334 to 0.75731. | float64 |
| REVEL Scores | PP3, BP4 | Predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. Score range: 0-1. | float64 |
| MutationAssessor Scores | PP3, BP4 | Predicts the functional impact of amino-acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms. The functional impact is assessed based on evolutionary conservation of the affected amino acid in protein homologs. Score range: -5.545 to 5.975. | float64 |
| Polyphen2 HDIV Scores | PP3, BP4 | Polyphen2 prediction based on HumDiv; The PolyPhen-2 score predicts the possible impact of an amino acid substitution on the structure and function of a human protein. Score range: 0-1. | |

Attachment

Submitted filename: ResponsetoReviewers.docx

pone.0303787.s017.docx (1.1MB, docx)

Decision Letter 1

Muhammad Salman Bashir

1 May 2024

WilsonGenAI a deep learning approach to classify pathogenic variants in Wilson Disease

PONE-D-23-35773R1

Dear Dr. BK,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Muhammad Salman Bashir, M.S.C

Academic Editor

PLOS ONE

Acceptance letter

Muhammad Salman Bashir

7 May 2024

PONE-D-23-35773R1

PLOS ONE

Dear Dr. BK,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Muhammad Salman Bashir

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Train and validation accuracies of the TabNet model across 187 epochs.

    (TIF)

    pone.0303787.s001.tif (575.1KB, tif)
    S2 Fig. Confusion matrix depicting the models’ predictions on the 30% test data.

    Fig A represents TabNet while B represents XGBoost.

    (TIF)

    pone.0303787.s002.tif (42KB, tif)
    S3 Fig. The Precision-Recall curve of both the models.

    (TIF)

    pone.0303787.s003.tif (99.2KB, tif)
    S4 Fig. The model learning rate and loss of the TabNet model across 187 epochs.

    (TIF)

    pone.0303787.s004.tif (377.5KB, tif)
    S5 Fig. Confusion matrix of predictions made on the ACMG-qualified independent validation dataset comprising 156 variants.

    Fig A represents TabNet while B represents XGBoost.

    (TIF)

    pone.0303787.s005.tif (39.1KB, tif)
    S6 Fig. Scatterplot of class probability vs the actual ACMG class for the TabNet model across the validation set of 156 variants.

    (TIF)

    pone.0303787.s006.tif (49.7KB, tif)
    S7 Fig. Scatterplot of class probability vs the actual ACMG class for the XGBoost model across the validation set of 156 variants.

    (TIF)

    pone.0303787.s007.tif (50.3KB, tif)
    S8 Fig. Barplot comparing the accuracy, MCC, NPV and PPV of 13 models with TabNet and XGBoost.

    Abbreviations: MAssessor—MutationAssessor; MTaster—MutationTaster.

    (TIF)

    pone.0303787.s008.tif (415.8KB, tif)
    S9 Fig. Scatterplot of class probability vs the predicted class for the TabNet model across all 251 exonic VUS variants that were a part of the validation dataset.

    (TIF)

    pone.0303787.s009.tif (68.1KB, tif)
    S10 Fig. Scatterplot of class probability vs the predicted class for the XGBoost model across all 251 exonic VUS variants that were a part of the validation dataset.

    (TIF)

    pone.0303787.s010.tif (51.5KB, tif)
    S11 Fig. Plot depicting the feature importance of the top 15 features of each model.

    The x-axis for XGBoost plots F-score, while that of TabNet plots scores for each feature.

    (TIF)

    pone.0303787.s011.tif (588.7KB, tif)
    S12 Fig

    Plot depicting the total time taken to process a VCF file into suitable input and then train a model, for (A) TabNet and (B) XGBoost respectively.

    (TIF)

    pone.0303787.s012.tif (80.2KB, tif)
    S1 Table. A complete list of the 73 features used in training the model, with the ACMG attribute each provides information about, its description, and its datatype.

    (XLSX)

    pone.0303787.s013.xlsx (12.2KB, xlsx)
    S2 Table

    The classification report with the Precision, Recall and F1 scores for the TabNet model (A) and XGBoost model (B) respectively.

    (XLSX)

    pone.0303787.s014.xlsx (7.8KB, xlsx)
    S3 Table. Comparison of the performance of both models against 13 other models on the independent test dataset.

    (XLSX)

    pone.0303787.s015.xlsx (8.4KB, xlsx)
    S4 Table

    A list of 977 exonic variants of uncertain significance reclassified by our models TabNet (A) and XGBoost (B). Variants highlighted in bold represent concordance between predictions from both algorithms. Table (C) describes the nucleotide and protein changes in HGVS nomenclature, and also describes each variant’s exonic function.

    (XLSX)

    pone.0303787.s016.xlsx (121.9KB, xlsx)

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files.


    Articles from PLOS ONE are provided here courtesy of PLOS
