Abstract
Background and objective
Accurate identification and prioritization of driver-mutations in cancer is critical for effective patient management. Despite the presence of numerous bioinformatic algorithms for estimating mutation pathogenicity, there is significant variation in their assessments. This inconsistency is evident even for well-established cancer driver mutations. This study aims to develop an ensemble machine learning approach to evaluate the performance (rank) of pathogenic and conservation scoring algorithms (PCSAs) based on their ability to distinguish pathogenic driver mutations from benign passenger (non-driver) mutations in head and neck squamous cell carcinoma (HNSC).
Methods
The study used a dataset from 502 HNSC patients, classifying mutations based on 299 known high-confidence cancer driver genes. Missense somatic mutations in driver genes were treated as driver mutations, while non-driver mutations were randomly selected from other genes. Each mutation was annotated with 41 PCSAs. Three machine learning algorithms—logistic regression, random forest, and support vector machine—along with recursive feature elimination, were used to rank these PCSAs. The final ranking of the PCSAs was determined using rank-average-sort and rank-sum-sort methods.
Results
The random forest algorithm emerged as the top performer among the three tested ML algorithms, with an AUC-ROC of 0.89, compared to 0.83 for the other two, in distinguishing pathogenic driver mutations from benign passenger mutations using all 41 PCSAs. The top 11 PCSAs were selected based on the first quintile cut-off from the final rank-sum distribution. Classifiers built using these top 11 PCSAs (DEOGEN2, Integrated_fitCons, MVP, etc.) demonstrated significantly higher performance (p-value < 2.22e-16) than those using the remaining 30 PCSAs across all three ML algorithms in separating pathogenic driver from benign passenger mutations. The top PCSAs also performed strongly on validation cohorts comprising an independent HNSC cohort and other cancer types (breast, lung, and colorectal), reflecting their consistency, robustness and generalizability.
Conclusions
The ensemble machine learning approach effectively evaluates the performance of PCSAs based on their ability to differentiate pathogenic drivers from benign passenger mutations in HNSC and other cancer types. Notably, some well-known PCSAs performed poorly, underscoring the importance of data-driven selection over relying solely on popularity.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13040-024-00420-x.
Keywords: Driver mutation, Machine learning, Pathogenicity prediction algorithm
Introduction
Over the past decade, large-scale genome sequencing projects have revolutionized cancer research, revealing the underlying mechanisms that drive these diseases [1, 2]. The initiation and progression of cancer are primarily driven by somatic alterations that confer a selective growth advantage, allowing normal cells to become tumor cells and grow faster [3, 4]. Cancer cells harbor a large number of somatic mutations that vary across tumor types [5]. Only a small fraction of these mutations, known as driver mutations, are believed to have phenotypic consequences for disease development in specific tissue contexts. The remaining alterations, which evolved randomly and have no apparent impact on the disease, are called passenger mutations [6]. An important aim of cancer research is to prioritize somatic mutations identified from the sequenced data of tumor cells and subsequently detect driver genes that promote carcinogenesis [7]. Accurate prioritization of driver mutations is crucial for improving patient prognosis because it directly affects patient stratification, therapeutic strategy selection, and clinical management [8].
A wide range of bioinformatic tools have been developed to assess the pathogenic potential of coding mutations, particularly for single nucleotide variants (SNVs) in cancer and other diseases [7, 9, 10]. Examples of such algorithms include SIFT [11], Polyphen2 [12], MutationAssessor [13], MPC [14], PrimateAI [15], DEOGEN2 [16], CADD [17], integrated_fitCons [18], phyloP17way_primate [19], and phastCons17way_primate [20]. Variant types like nonsense (introducing a premature stop codon), frameshift or in-frame insertions/deletions (altering the reading frame), and splice site variants (affecting RNA splicing) often have more predictable consequences [21]. Most of the existing algorithms tackle the challenge of estimating pathogenicity only for missense variants, which may or may not impact the protein function of genes. These algorithms estimate and present the pathogenicity of SNVs (primarily for missense mutations) through numerical scores [9]. The scientific community extensively uses these scores to prioritize mutations of interest as pathogenic or not. In addition, in a clinical context, in-silico predictions from computational algorithms are recognized as a valuable tool for variant interpretation, and are included as one among the eight evidence criteria recommended by the American College of Medical Genetics and Genomics (ACMG) and Association of Molecular Pathologists (AMP) guidelines [21]. The ACMG/AMP guidelines recommend that in-silico predictions be considered supporting evidence for variant classification only when there is consensus among multiple prediction tools; otherwise, the evidence should not be used to classify the variant.
While some scoring methods rely solely on conservation across and between species (e.g., MutationAssessor, SIFT, and bStatistic [22]), other algorithms include diverse genomic properties like protein structure information and local depletion of variation (e.g., AlphaMissense [23], MPC, Polyphen2). Certain algorithms combine evolutionary conservation with other features for a more robust assessment of variant pathogenicity (e.g., DEOGEN2, FATHMM [24]) [7, 25]. All these algorithms primarily rely on either statistical methods or machine learning (ML) approaches to estimate pathogenicity. Due to the good performance of ML in resolving complex biological problems, recent tools increasingly utilize diverse genomic features together with ML algorithms to estimate variant pathogenicity (e.g., gMVP [26], AlphaMissense, PrimateAI). However, some established tools continue to employ effective statistical methods (e.g. SIFT4G [27], PROVEAN [28]) [29].
Despite the availability of many scoring algorithms, only a few have gained popularity among researchers. These established algorithms often exhibit considerable discordance in their pathogenicity assessments [30, 31]. For example, variants classified as pathogenic by one algorithm may be predicted as benign by another algorithm [32, 33]. This inconsistency, particularly evident for a large number of exonic mutations and even for cancer driver mutations (Fig. 1), highlights the need for a systematic approach that can rank these algorithms based on their classification accuracy (pathogenic vs. non-pathogenic). Motivated by these challenges and observations, our study proposes an ensemble machine learning approach to evaluate the performance and rank the existing pathogenic scoring algorithms based on their ability to distinguish between pathogenic driver and benign passenger mutations in cancer. Some other studies [31, 34, 35] addressed similar research questions but were restricted to germline variants and data from the ClinVar database [36], and included only a few PCSAs.
Fig. 1.
Variability of pathogenic scores across different scoring algorithms. A higher score in the figure indicates greater pathogenicity of the mutation. For five different known cancer driver mutations (HNSC-TCGA), the standard deviation of pathogenic scores ranges from 0.245 to 0.298
We consider Head and Neck Squamous Carcinoma data from TCGA (HNSC-TCGA) for our analysis. HNSC is a prevalent cancer in Southeast Asia (GLOBOCAN, 2023) due to high tobacco usage. HNSC ranked as the third most prevalent cancer worldwide, with 1,464,550 (7.6% of all cancers) new cases and 487,993 deaths (4.8% of all cancer-related deaths) [37]. HNSC encompasses various sub-sites, including buccal mucosa, lip, and tongue, which exhibit considerable heterogeneity in genomic alterations among them [38]. HNSC progresses from a dysplastic, precancerous state to full-fledged cancer by acquiring stepwise somatic alterations [39]. Lymph node metastasis is an important prognostic marker for HNSC (observed in ~ 40% of patients and leading to poor survival) [40], and ~ 50% of HNSC patients demonstrate locoregional recurrence within 2 years of therapeutic intervention [41]. HNSC is primarily driven through somatic inactivation of tumor suppressor genes [42], which often reduces the opportunity for targeted therapy. Additionally, the relatively low mutation burden of HNSC makes it less responsive to traditional immunotherapy [43].
We utilized three different ML algorithms, Logistic Regression (LR) [44], Random Forest (RF) [45], and Support Vector Machine (SVM) [46], together with the Recursive Feature Elimination (RFE) [47] technique to develop our analytical approach. All three chosen ML algorithms are popular and widely used in various bioinformatic tools and genomic studies. An average sort method was applied to the algorithm-wise ranks from all three ML models to aggregate and deduce the final rank of the scoring algorithms. A quintile-based cut-off applied to the final rank-sum distribution identified the top 11 scoring algorithms. These top-performing algorithms demonstrated a significantly superior ability to distinguish pathogenic driver mutations from benign passenger mutations in the study cancer type (HNSC-TCGA) compared to the remaining algorithms. An orthogonal evaluation of these top 11 PCSAs on an independent HNSC study and other cancer cohorts, namely Head and Neck Squamous Carcinoma from the Clinical Proteomic Tumor Analysis Consortium (HNSC-CPTAC), Breast Invasive Carcinoma (BRCA-TCGA), Colorectal Adenocarcinoma (COADREAD-TCGA), and Lung Adenocarcinoma and Lung Squamous Cell Carcinoma (NSCLC-TCGA), showed excellent performance in separating pathogenic driver from benign passenger mutations.
Materials and methods
Datasets
Somatic mutation data
We selected the HNSC-TCGA cohort as the primary dataset for our study. This study cohort comprises 502 HNSC patients. For all the patients, the somatic Mutation Annotation Format (MAF) file generated from exome sequencing data was downloaded from cBioPortal [48] on 23rd November 2023. The obtained exonic MAF is a tab-delimited text file that stores information about somatic mutations across multiple patients along with essential annotations for downstream analysis. For each mutation, the essential annotations include chromosomal coordinates (GRCh37), reference and alternate alleles, location information (whether the mutation occurs in the exon, intron, or UTR of a gene), variant type (whether the mutation is a single nucleotide change, insertion, or deletion), variant classification (missense, nonsense, nonstop, splice, frameshift, inframe, etc.), codon change, protein change, and other relevant information (transcript IDs, etc.). To evaluate the robustness and generalizability of our findings, we also included additional somatic mutations (MAF files) from other studies, namely HNSC-CPTAC, BRCA-TCGA, COADREAD-TCGA and NSCLC-TCGA. These additional cancer datasets were downloaded from cBioPortal on 28th October 2024.
Pathogenic and conservation scoring algorithms (PCSAs)
We utilized the dbNSFP database for the annotation of exonic variants with various existing PCSAs. The academic version of the dbNSFP database [30] (version: dbNSFP4.7a) was downloaded on 28th October 2024. It collates pathogenic scores from different PCSAs and other pertinent information such as chromosomal coordinates, reference and alternate alleles, gene names, and dbSNP IDs for all non-silent single-nucleotide variants present in the exonic regions of the human genome. Although the database contains genomic coordinates for GRCh37 and GRCh38, we primarily used GRCh37 coordinates in our analysis. For each PCSA, the database contains a corresponding normalized score (derived from the original score as reported by the developer). Since the range and scale of the original scores differ, we used the normalized scores (rank scores) provided by the dbNSFP database, which we believe are more appropriate for comparative analysis across these PCSAs. The rank score in dbNSFP ranges between 0 and 1, indicating the proportion of scores considered less damaging. For instance, a score of 0.9 suggests that the variant belongs to the top 10% of the most damaging variants. For this study, we considered the following 41 PCSAs: SIFT, SIFT4G, Polyphen2_HDIV, Polyphen2_HVAR, LRT [49], MutationTaster [50], MutationAssessor, FATHMM, PROVEAN, VEST4 [51], MetaSVM [52], MetaLR [52], MetaRNN [53], M-CAP [54], REVEL [55], MutPred [56], MVP [57], gMVP, MPC, PrimateAI, DEOGEN2, ClinPred [58], LIST-S2 [59], VARITY_R [60], VARITY_ER [60], AlphaMissense, CADD, DANN [61], fathmm-MKL [62], fathmm-XF [63], Eigen-raw [64], Eigen-PC-raw [64], integrated_fitCons, GERP++_RS [65], phyloP17way_primate, phyloP100way_vertebrate [66], phyloP470way_mammalian [66], phastCons17way_primate, phastCons100way_vertebrate, phastCons470way_mammalian, and bStatistic (Supplementary Table S1).
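The rank-score interpretation described above can be illustrated with a toy calculation. This is only a sketch of the idea, not dbNSFP's own procedure: the function name `rank_score`, the example values, and the assumption that a higher raw score means a more damaging variant are all illustrative.

```python
import numpy as np

def rank_score(raw_scores):
    """Fraction of scores that are less than or equal to each score.

    Assumption: a higher raw score means a more damaging prediction.
    """
    raw = np.asarray(raw_scores, dtype=float)
    # Compare every score against every other score and count how many are <= it.
    counts = (raw[None, :] <= raw[:, None]).sum(axis=0 + 1)
    return counts / raw.size

# Ten made-up raw scores on an arbitrary scale.
raw = [0.1, 5.0, 2.5, 30.0, 12.0, 7.5, 0.9, 18.0, 3.3, 22.0]
scores = rank_score(raw)
print(scores[9])  # rank score of the raw value 22.0
```

Under these assumptions, a variant with a rank score of 0.9 has a raw score at or above 90% of all scores, which matches the "top 10% most damaging" reading given in the text.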
Driver genes in cancer
We conducted a detailed review of the cancer literature to compile a comprehensive list of cancer driver genes based on large-scale cancer genomics studies. We considered the 299 cancer driver genes (Supplementary Table S2) reported by a pan-cancer study [67] that is highly cited in the community. The pan-cancer study analyzed exome sequencing data of 9,423 tumors spanning 33 cancer types from the TCGA projects, utilizing twenty-six computational tools for the detection of 299 cancer driver genes.
Data pre-processing
Data cleaning
The downloaded somatic mutation annotation file (MAF) contains mostly exonic and a few non-exonic variants such as UTRs and intronic variants. We excluded all the non-exonic variants along with small insertions and deletions from the dataset, considering only the exonic single nucleotide variants (SNVs). Since most of the existing scoring algorithms provide scores only for missense variants, all other types of variants, such as nonsense, non-stop, and splice site variants, were eliminated. Only the missense SNVs were included in this study.
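The filtering step above can be sketched with the standard library alone. The column names (`Variant_Type`, `Variant_Classification`, `Hugo_Symbol`, `HGVSp_Short`) and values (`SNP`, `Missense_Mutation`) follow the common MAF convention, but whether the downloaded file uses exactly these columns is an assumption; the gene names and protein changes are placeholders, not real data.

```python
import csv
import io

# Made-up, tab-delimited MAF excerpt; gene names and protein changes are placeholders.
TOY_MAF = "\n".join([
    "Hugo_Symbol\tVariant_Classification\tVariant_Type\tHGVSp_Short",
    "GENE_A\tMissense_Mutation\tSNP\tp.A10V",
    "GENE_A\tNonsense_Mutation\tSNP\tp.Q20*",
    "GENE_B\tFrame_Shift_Del\tDEL\tp.L30fs",
    "GENE_C\tMissense_Mutation\tSNP\tp.G40D",
    "GENE_C\tSilent\tSNP\tp.P50=",
    "GENE_D\tIntron\tSNP\t",
])

def keep_missense_snvs(maf_text):
    """Return only the rows that are single-nucleotide missense variants."""
    reader = csv.DictReader(io.StringIO(maf_text), delimiter="\t")
    return [
        row for row in reader
        if row["Variant_Type"] == "SNP"
        and row["Variant_Classification"] == "Missense_Mutation"
    ]

missense = keep_missense_snvs(TOY_MAF)
print([(row["Hugo_Symbol"], row["HGVSp_Short"]) for row in missense])
```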
Annotate with features
All the missense single nucleotide somatic variants across all the patients were then annotated with scores from 41 PCSAs, out of which 33 are pathogenicity scoring algorithms and 8 are conservation scoring algorithms.
Label with target variable
Missense variants present in any of the 299 driver genes (as mentioned above) in the HNSC-TCGA and other cohorts were classified as driver mutations, while the non-driver set was selected randomly from the other genes, excluding genes present in the COSMIC Cancer Gene Census [68] and IntOGen [7]. All other missense mutations were considered non-driver mutations (referred to as “ndmm”). Since the non-driver set is much larger than the driver set, we performed bootstrapping to address the class imbalance. To create balanced datasets for analysis, we randomly sampled the ndmm set with replacement from HNSC-TCGA data to generate 100 non-driver sets (Fig. 2). Each of these non-driver datasets contains the same number of missense variants as the driver set. This process resulted in 100 balanced datasets, each comprising the same driver set combined with a unique non-driver set (Fig. 2 and algorithm section) for the primary HNSC-TCGA cancer cohort.
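The balancing procedure described above can be sketched as follows. The function name `make_balanced_sets`, the seed, and the toy index ranges are hypothetical; the sketch also assumes that sampling with replacement applies within each draw, which the text does not state explicitly.

```python
import numpy as np

def make_balanced_sets(driver_idx, ndmm_idx, n_sets=100, seed=0):
    """Pair the fixed driver set with n_sets resampled non-driver sets."""
    rng = np.random.default_rng(seed)
    driver_idx = np.asarray(driver_idx)
    ndmm_idx = np.asarray(ndmm_idx)
    balanced = []
    for _ in range(n_sets):
        # Draw as many non-driver mutations as there are driver mutations.
        sampled = rng.choice(ndmm_idx, size=driver_idx.size, replace=True)
        rows = np.concatenate([driver_idx, sampled])
        # Label 1 = driver, 0 = non-driver.
        labels = np.concatenate([
            np.ones(driver_idx.size, dtype=int),
            np.zeros(sampled.size, dtype=int),
        ])
        balanced.append((rows, labels))
    return balanced

# Toy example: 30 driver rows and 600 non-driver rows (row indices only).
driver = np.arange(30)
ndmm = np.arange(30, 630)
datasets = make_balanced_sets(driver, ndmm, n_sets=100)
print(len(datasets), datasets[0][0].size, datasets[0][1].sum())
```

Each element of `datasets` holds row indices and labels, so the same driver rows are reused in every set while the non-driver rows change from set to set.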
Fig. 2.
An overall workflow of the study
Data structure of labelled data
The labelled data was organized as a matrix. Each row represented a mutation, and each of the 41 columns corresponded to a specific PCSA, acting as a feature. The values within the matrix cells represented the rank score assigned to each mutation by the corresponding PCSA. The final column indicated the class label (driver or non-driver) for each mutation, serving as the target variable (Fig. 2 and algorithm section).
Missing values imputation
We observed missing values in the labelled datasets. The number of missing values varied across the 41 PCSAs (features). To address this, we used scikit-learn’s “IterativeImputer”, which is similar to the MICE (Multiple Imputation by Chained Equations) [69] method. The IterativeImputer works by iteratively modeling each feature with missing values as a function of the other features. For each feature, a regression model is trained on the known values and then used to predict the missing values. This process is repeated multiple times to improve the accuracy of the imputations.
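A minimal sketch of this imputation step is shown below on synthetic data. The matrix size, the 10% missingness rate, and the `max_iter` and `random_state` settings are illustrative assumptions; the study's actual settings are not given in the text. Note that scikit-learn requires the experimental import before `IterativeImputer` can be imported.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic score matrix: 200 mutations x 5 features with values in [0, 1].
X = rng.uniform(0, 1, size=(200, 5))
X[:, 4] = 0.5 * X[:, 0] + 0.5 * X[:, 1]  # make one feature depend on two others

# Hide roughly 10% of the entries to mimic missing scores.
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# Each feature with gaps is regressed on the other features, round after round.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

print(np.isnan(X_filled).sum())  # no missing values remain
```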
Feature ranking methods
Recursive Feature Elimination (RFE): RFE is a technique used in machine learning to select the most important features from a dataset. RFE works by iteratively fitting a machine learning model to identify the least important feature among the feature set. It starts with all features, trains a model, removes the least important feature, trains again, and repeats this process until a terminating condition is met. RFE can also be used to rank features based on their importance to the prediction task. This ranking is typically obtained from attributes such as “coef_” for linear models or “feature_importances_” for models with built-in feature importance. In this study, three different machine learning algorithms were used in RFE to rank the features: (i) Logistic Regression (LR), (ii) Random Forest (RF), and (iii) Support Vector Machine (SVM). More details about LR, RF and SVM can be found in Supplementary methods.
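The RFE-based ranking can be sketched with scikit-learn as below. The data are synthetic, and the hyperparameters are illustrative assumptions; in particular, a linear kernel is assumed for the SVM so that a `coef_` attribute is available to RFE, which the text does not specify. Setting `n_features_to_select=1` makes `ranking_` a full ordering of all features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for one balanced dataset: 300 mutations x 8 scores.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

estimators = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(kernel="linear"),  # assumption: linear kernel, so coef_ exists
}

rfe_ranks = {}
for name, estimator in estimators.items():
    # Remove one feature per round until a single feature is left.
    selector = RFE(estimator, n_features_to_select=1, step=1).fit(X, y)
    rfe_ranks[name] = selector.ranking_  # 1 = eliminated last (most important)

for name, ranks in rfe_ranks.items():
    print(name, ranks.tolist())
```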
Data split and model performance
The input data was split in an 80:20 ratio for training and testing, with 80% of the total data used to train the classifier and the remaining 20% used for model testing. The performance of the different ML algorithms was assessed by comparing the Area Under the Curve (AUC) score, obtained by evaluating the model on the test data. The AUC is an important metric for evaluating the performance of binary classification models. Both the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and the Area Under the Precision-Recall Curve (AUC-PR) were obtained to check the model performance. The ROC curve plots the true positive rate against the false positive rate at various classification thresholds, whereas the precision-recall curve captures the trade-off between precision and recall at various threshold settings and is more suitable for imbalanced datasets. Both metrics serve as a single scalar value summarizing the overall performance of the model. AUC values range from 0 to 1, with a value of 0.5 indicating no discriminative power (equivalent to random guessing) and a value of 1.0 representing perfect classification. Higher AUC values indicate better model performance. We also considered other popular metrics for evaluating classification models, namely accuracy, precision, recall and F1-score.
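The split-and-evaluate step can be sketched as follows on synthetic data. The stratified split, the random seeds, and the use of `average_precision_score` as the AUC-PR estimate are assumptions made for illustration; the text does not state how AUC-PR was computed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one balanced dataset: 500 mutations x 10 scores.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# 80% of the rows train the classifier, 20% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Both AUCs are computed from the predicted probability of the positive class.
proba = clf.predict_proba(X_test)[:, 1]
auc_roc = roc_auc_score(y_test, proba)
auc_pr = average_precision_score(y_test, proba)

print(f"AUC-ROC = {auc_roc:.3f}, AUC-PR = {auc_pr:.3f}")
```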
Aggregation of ranks
After applying three different ML algorithms to the 100 sets of labelled data, we obtained 300 ranks for each PCSA. The aggregation of these ranks occurs in two steps. In the first step, we calculate the average of the 100 ranks for each PCSA and sort them in ascending order to determine the ML algorithm-wise final rank. In the second step, we sum up the ML algorithm-wise final ranks and sort them in ascending order to determine the final rank of each feature (Fig. 2 and algorithm section for more details). A lower rank indicates better performance of the feature in our study.
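The two-step aggregation above can be sketched with a small worked example. The function names and the toy ranks (four hypothetical features, three runs per ML algorithm) are illustrative, and the tie handling (tied rank sums share a rank, with the next rank following immediately) is an assumption.

```python
import numpy as np

def dense_rank(values):
    """Rank 1 = smallest value; tied values share the same rank."""
    _, inverse = np.unique(np.asarray(values), return_inverse=True)
    return inverse + 1

def aggregate_ranks(rfe_ranks):
    """rfe_ranks maps an ML algorithm name to an (n_runs, n_features) rank array."""
    # Step 1: average over runs, then sort to get the algorithm-wise final rank.
    per_algo = {name: dense_rank(np.mean(r, axis=0)) for name, r in rfe_ranks.items()}
    # Step 2: sum the algorithm-wise ranks, then sort to get the final rank.
    rank_sum = np.sum(list(per_algo.values()), axis=0)
    return per_algo, rank_sum, dense_rank(rank_sum)

# Toy RFE ranks for four hypothetical features (columns) over three runs (rows).
toy = {
    "RF": np.array([[1, 2, 3, 4], [1, 3, 2, 4], [1, 2, 3, 4]]),
    "LR": np.array([[1, 3, 2, 4], [2, 3, 1, 4], [1, 3, 2, 4]]),
    "SVM": np.array([[2, 4, 3, 1], [2, 4, 3, 1], [2, 3, 4, 1]]),
}

per_algo, rank_sum, final_rank = aggregate_ranks(toy)
print(rank_sum.tolist())    # [4, 9, 8, 9]
print(final_rank.tolist())  # [1, 3, 2, 3]
```

In this toy case the second and fourth features have the same rank sum (9) and therefore share the same final rank.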
Algorithm of the implemented method
The objective of this study is to evaluate the performance of the PCSAs with respect to their ability to separate driver from non-driver missense variants in the chosen HNSC-TCGA dataset.

Implementation
We extensively used the Python programming language for various aspects of our study, including evaluating algorithm performance and obtaining feature and ML algorithm-wise RFE ranks. We employed standard Python - v3.8.10 libraries, such as Scikit-Learn - v1.2.2 [70], Pandas - v1.4.3 [71], and Numpy - v1.22.2 [72], for our analysis.
Results
Datasets
The combined somatic mutation file (i.e., MAF) for 502 HNSC-TCGA patients was obtained from cBioPortal. The MAF contained 124,374 somatic mutations, with an average of 148 mutations per patient. The genome build of the mutations is GRCh37. Over 96% of the mutations were single nucleotide variants (SNVs), while insertions and deletions comprised the remaining 3.2% (Fig. 3A and Supplementary Table S3 for more details). Among all the mutations, approximately 11.95% were non-coding (falling outside of the coding region), while the remaining approximately 88.05% were located in the coding regions of genes (Fig. 3B). A total of 16,726 protein-coding genes harbored somatic mutations, with missense mutations being the most frequent type (64.3%), followed by silent mutations (25.3%) and nonsense mutations (5.3%) (Fig. 3C and Supplementary Table S3 for more details). The number of missense mutations per patient ranged from 2 to 2,277, with an average of approximately 140. Since most existing algorithms provide pathogenicity scores only for missense variants, only these mutations were selected for further analysis. All missense mutations were further annotated with all 41 PCSAs.
Fig. 3.
A Distribution of Single Nucleotide Variants (SNV), Insertions, and Deletions. B Number of exonic and non-exonic sites. C Distribution of exonic variant classes. D Number of missense sites with missing values (for non-driver sets, an average was taken across all 100 datasets) across all 41 scoring algorithms. E Distribution of standard deviation in scores across driver and non-driver mutations
For this study, we considered 299 cancer driver genes, as mentioned in the methods section. Missense mutations located in the coding regions of any of these 299 genes are classified as driver mutations. Out of 70,446 missense mutations found to be present in 502 HNSC-TCGA patients (Supplementary Table S4), 2,936 are identified as driver mutations, belonging to 299 pan-cancer driver genes. All other missense mutations are classified as non-driver mutations (n = 67,510). However, from the non-driver set, we excluded mutations belonging to COSMIC (CGC) or IntOGen driver genes resulting in 62,986 missense mutations. Further, to address the class imbalance, we randomly selected 2,936 mutations from the 62,986 non-driver mutations. To better represent the non-driver set, we repeated this random selection process 100 times with replacement, resulting in one driver mutation set and 100 non-driver mutation sets.
Next, we examined the missing values in the datasets. Across the 100 datasets, approximately 663 (driver − 356 (12.1%) and non-driver − 307 (10.5%)) data points (on average) exhibited missing values for MutationAssessor, the highest among all PCSAs, followed by MutPred (~ 525) (driver − 213 (7.3%) and non-driver − 312 (10.6%)) and MPC (~ 518) (driver − 277 (9.4%) and non-driver − 241 (8.2%)). Conversely, CADD and DANN scores contained no missing values (Fig. 3D) (see Supplementary Table S5). We further investigated the variability within the data by calculating the standard deviation across all 41 PCSAs for both driver and non-driver mutations. Both sets displayed similar standard deviations (average: 0.196 for driver mutations and 0.199 for non-driver mutations) across the 41 PCSAs (Fig. 3E). We addressed the missing value issue using the imputation method described in the Methods section.
Assessing the predictive performance of ML models built with 41 PCSAs
All three algorithms were provided with 100 sets of data, as described in the previous section, to assess their performance with respect to all 41 PCSAs. The Random Forest (RF) algorithm proved to be the best performer, with an average AUC-ROC of 0.89 (range: 0.86 to 0.90) and AUC-PR of 0.89 (range: 0.87 to 0.91) (Supplementary Table S6). In contrast, Logistic Regression [AUC-ROC (average): 0.83 (range: 0.80 to 0.86)] and Support Vector Machine [AUC-ROC (average): 0.83 (range: 0.80 to 0.86)] showed similar performance to each other (Supplementary Table S6). The AUC-ROC and AUC-PR of RF are significantly higher than those of LR and SVM (p-value < 2.22e-16, Wilcoxon test of means), indicating its superior performance (Fig. 4A, B and Supplementary Table S6). This superiority is further reflected in the AUC-ROC and AUC-PR curves (combining all 100 runs), as shown in Fig. 4C, D. The AUC-ROC and AUC-PR curves for LR and SVM largely overlap, reflecting similar performance across different classification thresholds, whereas the curves for RF consistently lie above those of the other two algorithms. Other metrics like accuracy, F1-score, precision and recall reflect the same pattern (Fig. 4E and Supplementary Table S6).
Fig. 4.
A, B The box-plot represents the area under the curve (AUC-ROC, AUC-PR) of all 100 runs across all three ML algorithms. The performance of random forest is significantly better than logistic regression and support vector machine algorithms. C, D The performance of each algorithm is assessed by the ROC curve and the average AUC is written in the parenthesis. A higher AUC score means better performance. E Distribution of other popular performance evaluation metrics (average across all 100 runs) of all three ML algorithms - accuracy, F1-score, precision, recall etc
Identification of top ranking PCSAs by evaluating the performance in separating pathogenic driver from benign passenger mutations
This study aims to evaluate the performance (rank) of the PCSAs that best differentiate pathogenic driver mutations from benign passenger mutations in HNSC-TCGA. Each driver and passenger mutation is annotated with scores from 41 different PCSAs, which were treated as features for analysis. We employed three different ML algorithms along with recursive feature elimination (RFE) techniques to rank these 41 PCSAs/features. RFE generally requires labelled data and the chosen ML algorithm for feature ranking. RFE was provided with all 100 sets of data and each of the three ML algorithms (as mentioned above) individually to obtain feature rankings.
Each ML algorithm generated 100 ranks for each of the 41 features. For a few PCSAs, we observed considerable variation in ranks across 100 runs of logistic regression and support vector machine models (Fig. 5A). These ML algorithm-specific ranks were aggregated by taking the average of all 100 individual ranks and then sorting them in ascending order. Figure 5A presents a summarized boxplot depicting the aggregated feature rankings for all 41 PCSAs based on each ML algorithm. The final rank was calculated by summing the individual aggregated ranks obtained from each ML algorithm and sorting them in ascending order. A lower rank signifies a feature with better performance in distinguishing pathogenic driver from benign passenger mutations.
Fig. 5.

A The box plot summarises the rank of each of the 41 features/scores across all 100 runs for the three ML algorithms, namely random forest, logistic regression and support vector machine. B Distribution of the final aggregated rank sum of all three ML algorithms across all scoring algorithms. Green dotted arrowheads indicate the quintile-based cut-off point used to select the top-performing PCSAs from the final rank-sum distribution
Our analysis identified DEOGEN2 as the top-ranked PCSA, followed by integrated_fitCons and MVP, which shared the second rank because they had the same rank-sum value (Supplementary Table S7). The top PCSA was consistent across all three ML algorithm-specific rankings. However, the second and third positions varied between ML algorithms and were occupied by MVP, integrated_fitCons, PROVEAN, or M-CAP. MVP ranked second, fifth and third, and integrated_fitCons ranked fourth, second and fourth, in RF, LR, and SVM, respectively. Since M-CAP ranked third, fourth and fifth in RF, LR and SVM, respectively, it secured the third position in the final ranking (Fig. 5A, Supplementary Table S7). Figure 5B and Supplementary Table S7 summarize the individual ML algorithm rankings and the final combined ranking. At the other end, the five lowest-ranked PCSAs were phyloP470way_mammalian, phastCons17way_primate, phastCons470way_mammalian, SIFT4G, and phyloP17way_primate. Further, we applied a quintile-based cut-off to select the top-performing PCSAs. Eleven PCSAs emerged as the top performers, differentiating pathogenic driver from benign passenger mutations more effectively than the remaining 30 PCSAs.
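The quintile-based selection can be sketched as below. The feature names and rank sums are made up, and both the use of `np.quantile` with its default linear interpolation and the inclusive comparison (`<=`) are assumptions, since the text does not specify how the cut-off was computed.

```python
import numpy as np

# Hypothetical final rank sums for ten features (lower = better).
rank_sum = {
    "pcsa_a": 3, "pcsa_b": 10, "pcsa_c": 10, "pcsa_d": 12, "pcsa_e": 20,
    "pcsa_f": 25, "pcsa_g": 31, "pcsa_h": 40, "pcsa_i": 44, "pcsa_j": 50,
}

values = np.array(list(rank_sum.values()), dtype=float)

# First quintile (20th percentile) of the rank-sum distribution.
cutoff = np.quantile(values, 0.2)

# Keep every feature whose rank sum does not exceed the cut-off.
top = [name for name, value in rank_sum.items() if value <= cutoff]

print(cutoff, top)
```

Because tied rank sums fall on the same side of the cut-off, the number of selected features need not equal exactly one fifth of the total.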
Next, to assess the performance of the top 11 PCSAs compared to the remaining 30 PCSAs, we divided the original data into two distinct sets. One set contained scores from the top 11 PCSAs for each driver and non-driver mutation, while the other set included scores from the remaining 30 PCSAs for the same mutations. We applied all three ML algorithms to both sets of data.
All three algorithms showed a significant difference in performance between classifiers built using the top 11 PCSAs (classifier_11) and those built with the remaining 30 PCSAs (classifier_30) regarding their ability to distinguish pathogenic driver mutations from benign non-driver mutations. The AUC-ROC and AUC-PR for all three ML algorithms were significantly higher (p-value < 2.22e-16, Wilcoxon test of means) for classifier_11 than for classifier_30 (Fig. 6A, B). This pattern was further confirmed by other metrics like accuracy and F1-score and by the AUC-ROC and AUC-PR curves (Table 1; Fig. 6C, D), where classifier_11 consistently outperformed classifier_30 in separating pathogenic driver mutations for all three ML algorithms. These results support the reliability of the ranking identified using RFE followed by the aggregation of rankings, as described earlier.
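A comparison of this kind can be sketched with SciPy as follows. The per-run AUC values here are simulated, using the random forest AUC-ROC means and standard deviations from Table 1 only as simulation parameters; they are not the study's per-run results. The choice of the unpaired Wilcoxon rank-sum (Mann-Whitney U) variant is also an assumption, since the text does not say which Wilcoxon test was applied.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Simulated AUC-ROC values over 100 runs for each classifier (synthetic data).
auc_11 = rng.normal(loc=0.882, scale=0.011, size=100)  # classifier_11
auc_30 = rng.normal(loc=0.831, scale=0.012, size=100)  # classifier_30

# Two-sided rank-sum test of whether the two AUC distributions differ.
result = mannwhitneyu(auc_11, auc_30, alternative="two-sided")

print(f"U = {result.statistic:.0f}, p = {result.pvalue:.3g}")
```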
Fig. 6.

A, B Box plots of the area under the curve (AUC-ROC, AUC-PR) over all 100 runs for the classifier built using the top 11 scores and the classifier built using the remaining 30 scores, across all three ML algorithms. For all three ML algorithms, the performance of the classifier built using 11 scores is significantly better than that of the classifier built using 30 scores. C, D The performance of each algorithm assessed by the ROC curve, comparing the classifier built using 11 scores with the classifier built using 30 scores. The average AUC of all 100 runs is written in parentheses. E A head-to-head comparison of performance metrics between top 6, top 11, non-top 35 and non-top 30 PCSAs among all three ML algorithms
Table 1.
Performance metrics of all three ML algorithms on HNSC-TCGA data
| Model | Statistic | Accuracy (top 11) | Precision (top 11) | Recall (top 11) | F1-score (top 11) | AUC-ROC (top 11) | AUC-PR (top 11) | Accuracy (remaining 30) | Precision (remaining 30) | Recall (remaining 30) | F1-score (remaining 30) | AUC-ROC (remaining 30) | AUC-PR (remaining 30) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Forest | Average | 0.795 | 0.795 | 0.795 | 0.795 | 0.882 | 0.889 | 0.746 | 0.748 | 0.746 | 0.746 | 0.831 | 0.850 |
| Random Forest | SD | 0.013 | 0.013 | 0.013 | 0.013 | 0.011 | 0.011 | 0.013 | 0.013 | 0.013 | 0.013 | 0.012 | 0.012 |
| Random Forest | Median | 0.795 | 0.795 | 0.795 | 0.795 | 0.883 | 0.890 | 0.748 | 0.750 | 0.748 | 0.747 | 0.831 | 0.848 |
| Random Forest | Max | 0.826 | 0.825 | 0.826 | 0.825 | 0.910 | 0.914 | 0.774 | 0.777 | 0.774 | 0.773 | 0.859 | 0.878 |
| Random Forest | Min | 0.766 | 0.766 | 0.766 | 0.766 | 0.855 | 0.864 | 0.714 | 0.714 | 0.714 | 0.714 | 0.803 | 0.815 |
| Logistic Regression | Average | 0.742 | 0.743 | 0.742 | 0.742 | 0.826 | 0.818 | 0.672 | 0.673 | 0.672 | 0.672 | 0.738 | 0.737 |
| Logistic Regression | SD | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.018 | 0.014 | 0.014 | 0.014 | 0.014 | 0.015 | 0.020 |
| Logistic Regression | Median | 0.741 | 0.743 | 0.741 | 0.741 | 0.827 | 0.818 | 0.673 | 0.674 | 0.673 | 0.673 | 0.740 | 0.738 |
| Logistic Regression | Max | 0.775 | 0.776 | 0.775 | 0.775 | 0.858 | 0.858 | 0.704 | 0.707 | 0.704 | 0.704 | 0.770 | 0.782 |
| Logistic Regression | Min | 0.710 | 0.710 | 0.710 | 0.710 | 0.791 | 0.780 | 0.628 | 0.628 | 0.628 | 0.627 | 0.703 | 0.698 |
| Support Vector Machine | Average | 0.742 | 0.743 | 0.742 | 0.742 | 0.826 | 0.817 | 0.673 | 0.674 | 0.673 | 0.672 | 0.738 | 0.735 |
| Support Vector Machine | SD | 0.012 | 0.012 | 0.012 | 0.012 | 0.013 | 0.018 | 0.014 | 0.015 | 0.014 | 0.015 | 0.015 | 0.020 |
| Support Vector Machine | Median | 0.743 | 0.745 | 0.743 | 0.742 | 0.828 | 0.819 | 0.672 | 0.674 | 0.672 | 0.672 | 0.737 | 0.734 |
| Support Vector Machine | Max | 0.774 | 0.774 | 0.774 | 0.773 | 0.857 | 0.857 | 0.704 | 0.704 | 0.704 | 0.704 | 0.767 | 0.781 |
| Support Vector Machine | Min | 0.709 | 0.709 | 0.709 | 0.709 | 0.789 | 0.777 | 0.631 | 0.631 | 0.631 | 0.630 | 0.702 | 0.690 |
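
Each block of Table 1 condenses the 100 runs of one metric into five statistics. The following is a minimal sketch of that summary using only the Python standard library. The per-run values are made-up placeholders, not results from the study, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not state which was reported.

```python
import statistics


def summarise_runs(values):
    """Return the five summary statistics shown in Table 1 for one metric."""
    return {
        "Average": statistics.mean(values),
        "SD": statistics.stdev(values),  # sample SD; an assumption
        "Median": statistics.median(values),
        "Max": max(values),
        "Min": min(values),
    }


# Illustrative per-run AUC-ROC values (placeholders, not data from the study).
example_auc_roc = [0.88, 0.89, 0.87, 0.90, 0.86]

summary = summarise_runs(example_auc_roc)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With 100 real per-run values per metric and per classifier, the same function would produce one column of the table.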
Within the top 11 PCSAs, we noticed a sharp increase in the rank-sum between position 6 (MetaLR) and position 7 (AlphaMissense). We therefore checked whether the top 6 PCSAs would perform better than the top 11 derived from the quantile-based cut-off. We repeated the above-mentioned procedure: we split the original data into two buckets, one containing scores from the top 6 PCSAs for each mutation and the other containing scores from the remaining 35 PCSAs, and applied all three ML algorithms. Based on evaluation metrics such as accuracy, F1-score, and AUC-ROC, the top 11 PCSAs performed better than the top 6 PCSAs in separating the two classes (pathogenic vs. benign) (Fig. 6E).
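The bucket comparison described above can be sketched with scikit-learn on synthetic data. This is an illustration under stated assumptions, not the authors' pipeline: the feature matrix comes from `make_classification`, the column order stands in for a hypothetical PCSA ranking (with `shuffle=False` the informative columns come first), the helper `bucket_auc` is invented for this example, and AUC-PR is approximated by average precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def bucket_auc(X, y, columns, n_runs=5, seed=0):
    """Mean AUC-ROC and AUC-PR of a random forest restricted to `columns`."""
    roc, pr = [], []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, columns], y, test_size=0.3, stratify=y, random_state=seed + run
        )
        model = RandomForestClassifier(n_estimators=50, random_state=seed + run)
        model.fit(X_tr, y_tr)
        prob = model.predict_proba(X_te)[:, 1]
        roc.append(roc_auc_score(y_te, prob))
        pr.append(average_precision_score(y_te, prob))  # approximates AUC-PR
    return float(np.mean(roc)), float(np.mean(pr))


# Synthetic stand-in for the mutation-by-score matrix (not study data).
# With shuffle=False the 4 informative columns are columns 0-3.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
ranked = list(range(X.shape[1]))  # hypothetical ranking, best column first

results = {}
for k in (2, 4):  # two candidate cut-offs, analogous to top 6 vs. top 11
    top, rest = ranked[:k], ranked[k:]
    results[k] = {"top": bucket_auc(X, y, top), "rest": bucket_auc(X, y, rest)}

for k, res in results.items():
    print(k, "top:", res["top"], "rest:", res["rest"])
```

Comparing `results[2]` with `results[4]` mirrors the question asked in the text: whether a shorter list of top-ranked scores separates the classes as well as a longer one.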
Next, to assess the relative performance of classifier_11, we compared its predictions to those of individual PCSAs and found that it significantly outperforms them (p-value < 2.22e-16, Wilcoxon test) across all three ML algorithms (Fig. 7A, Supplementary Table S8).
Fig. 7.
A Box plot comparing the performance of the model built using the top 11 PCSAs with that of models built using individual PCSAs in HNSC-TCGA data. B AUC-ROC and AUC-PR curves showing the performance of the model built using the top 11 PCSAs in additional cohorts (HNSC-CPTAC, BRCA-TCGA, COADREAD-TCGA, and NSCLC-TCGA) and with an additional ML algorithm (XGBoost)
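The Wilcoxon comparisons reported above operate on per-run AUC values from two classifiers. A minimal standard-library sketch of a two-sided rank-sum (Mann-Whitney U) test with a normal approximation is shown below. It assumes an unpaired test, applies no tie or continuity correction, and uses placeholder AUC values rather than the study's runs. The reported bound of 2.22e-16 coincides with double-precision machine epsilon, which some statistical software uses as a reporting floor, so the comparison against `sys.float_info.epsilon` is only a plausible reading of that bound.

```python
import math
import sys


def rank_sum_test(a, b):
    """Two-sided rank-sum test via the normal approximation.

    Simplifications: no tie correction and no continuity correction.
    """
    n1, n2 = len(a), len(b)
    # U counts pairs where the value from `a` exceeds the one from `b`;
    # tied pairs contribute 0.5.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, z, p


# Placeholder per-run AUCs for 100 runs each (not data from the study).
auc_11 = [0.86 + 0.0005 * i for i in range(100)]
auc_30 = [0.80 + 0.0005 * i for i in range(100)]

u, z, p = rank_sum_test(auc_11, auc_30)
print(f"U = {u:.0f}, z = {z:.2f}, p below machine epsilon: {p < sys.float_info.epsilon}")
```

When the two sets of 100 runs do not overlap at all, as in this toy input, the p-value falls far below machine epsilon, which is consistent with a result being reported only as an upper bound.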
Performance assessment of top 11 PCSAs in independent HNSC and other cancer cohorts
We analyzed data from four additional cohorts, HNSC-CPTAC, BRCA-TCGA, COADREAD-TCGA, and NSCLC-TCGA, which contained 23,376, 130,495, 332,610, and 393,371 somatic mutations from 85, 1009, 528, and 1144 patients, respectively. The majority of these mutations were single nucleotide variants (SNVs), ranging from 90.1 to 96.7% across cohorts. Among the exonic mutations, missense mutations were the most prevalent, accounting for 60.8–66.6% across cohorts. Detailed cohort-wise statistics are provided in Supplementary Table S3. All missense mutations were further annotated with the top 11 PCSAs. The genome build of the mutations is GRCh37 for all cohorts except HNSC-CPTAC, for which it is GRCh38.
To select the driver and non-driver datasets, we applied the same procedure (as described earlier) to the additional cohorts (HNSC-CPTAC, BRCA-TCGA, COADREAD-TCGA, and NSCLC-TCGA), except that the non-driver set was randomly sampled only once. We identified 468, 3098, 6176, and 8860 driver missense mutations in 299 genes across these cohorts, respectively. On evaluating missing values in the dataset, MutationAssessor and MutPred exhibited the highest missing-value rates, averaging approximately 10% and 9%, respectively, while fathmm-MKL_coding had the lowest rate (see Supplementary Table S5 for details).
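The missing-value evaluation mentioned above amounts to counting, for each PCSA, the fraction of mutations without a score. A minimal sketch follows. The record layout (one dictionary per mutation, `None` for a missing score) and the numeric values are assumptions made for illustration; only the three score names are taken from the text.

```python
def missing_rates(records, score_names):
    """Fraction of mutations with no value (None or absent) for each score."""
    total = len(records)
    return {
        name: sum(1 for rec in records if rec.get(name) is None) / total
        for name in score_names
    }


# Toy annotation table; the numeric values are placeholders.
records = [
    {"MutationAssessor": 0.8, "MutPred": None, "fathmm-MKL_coding": 0.9},
    {"MutationAssessor": None, "MutPred": 0.4, "fathmm-MKL_coding": 0.7},
    {"MutationAssessor": 0.2, "MutPred": 0.6, "fathmm-MKL_coding": 0.1},
    {"MutationAssessor": None, "MutPred": 0.5, "fathmm-MKL_coding": 0.3},
]

rates = missing_rates(records, ["MutationAssessor", "MutPred", "fathmm-MKL_coding"])
# Sort so the score with the most missing values comes first.
worst_first = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
for name, rate in worst_first:
    print(f"{name}: {rate:.0%} missing")
```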
Next, to check the generalizability of the top 11 PCSAs, we applied classifier_11 (trained and built using HNSC-TCGA data by considering PANCAN driver genes) to the additional cohorts. Notably, the classifier maintained consistent performance across all four cohorts, demonstrating the robustness of these top 11 PCSAs in separating pathogenic drivers from benign non-drivers in HNSC cohorts and in other cancer types. Random Forest consistently outperformed the other models, achieving AUC-ROC scores between 0.81 and 0.85 (AUC-PR: 0.81–0.86) (Fig. 7B; Table 2), which is also reflected in the other metrics (Table 2). Logistic regression and SVM exhibited similar performance, with AUC-ROC scores ranging from 0.78 to 0.83 (AUC-PR: 0.77–0.83) (Fig. 7B; Table 2).
Table 2.
Performance metrics of ML algorithms in additional cohorts with top 11 PCSAs
| Model | Metric | HNSC-CPTAC | BRCA-TCGA | COADREAD-TCGA | NSCLC-TCGA |
|---|---|---|---|---|---|
| Random Forest | Accuracy | 0.754 | 0.765 | 0.755 | 0.725 |
| Random Forest | Precision | 0.787 | 0.775 | 0.789 | 0.774 |
| Random Forest | Recall | 0.697 | 0.745 | 0.696 | 0.634 |
| Random Forest | F1 Score | 0.739 | 0.760 | 0.740 | 0.697 |
| Random Forest | AUC-ROC | 0.837 | 0.855 | 0.834 | 0.819 |
| Random Forest | AUC-PR | 0.841 | 0.860 | 0.835 | 0.815 |
| Logistic Regression | Accuracy | 0.718 | 0.746 | 0.724 | 0.705 |
| Logistic Regression | Precision | 0.719 | 0.724 | 0.719 | 0.713 |
| Logistic Regression | Recall | 0.716 | 0.796 | 0.736 | 0.685 |
| Logistic Regression | F1 Score | 0.717 | 0.758 | 0.728 | 0.699 |
| Logistic Regression | AUC-ROC | 0.804 | 0.835 | 0.806 | 0.789 |
| Logistic Regression | AUC-PR | 0.778 | 0.836 | 0.803 | 0.775 |
| Support Vector Machine | Accuracy | 0.716 | 0.742 | 0.724 | 0.707 |
| Support Vector Machine | Precision | 0.712 | 0.722 | 0.718 | 0.713 |
| Support Vector Machine | Recall | 0.724 | 0.789 | 0.739 | 0.693 |
| Support Vector Machine | F1 Score | 0.718 | 0.754 | 0.728 | 0.703 |
| Support Vector Machine | AUC-ROC | 0.803 | 0.835 | 0.806 | 0.789 |
| Support Vector Machine | AUC-PR | 0.778 | 0.836 | 0.803 | 0.774 |
| XGBoost | Accuracy | 0.760 | 0.770 | 0.740 | 0.730 |
| XGBoost | Precision | 0.776 | 0.771 | 0.760 | 0.761 |
| XGBoost | Recall | 0.731 | 0.767 | 0.703 | 0.672 |
| XGBoost | F1 Score | 0.753 | 0.769 | 0.730 | 0.714 |
| XGBoost | AUC-ROC | 0.830 | 0.850 | 0.824 | 0.817 |
| XGBoost | AUC-PR | 0.841 | 0.854 | 0.826 | 0.814 |
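
As a quick consistency check on Table 2, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below copies the precision, recall, and F1 values from Table 2 (cohort order: HNSC-CPTAC, BRCA-TCGA, COADREAD-TCGA, NSCLC-TCGA). The tolerance of 0.0015 is an assumption chosen to absorb rounding of the inputs to three decimals.

```python
# (precision, recall, reported F1) per cohort, copied from Table 2.
TABLE_2 = {
    "Random Forest": [
        (0.787, 0.697, 0.739), (0.775, 0.745, 0.760),
        (0.789, 0.696, 0.740), (0.774, 0.634, 0.697),
    ],
    "Logistic Regression": [
        (0.719, 0.716, 0.717), (0.724, 0.796, 0.758),
        (0.719, 0.736, 0.728), (0.713, 0.685, 0.699),
    ],
    "Support Vector Machine": [
        (0.712, 0.724, 0.718), (0.722, 0.789, 0.754),
        (0.718, 0.739, 0.728), (0.713, 0.693, 0.703),
    ],
    "XGBoost": [
        (0.776, 0.731, 0.753), (0.771, 0.767, 0.769),
        (0.760, 0.703, 0.730), (0.761, 0.672, 0.714),
    ],
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest absolute gap between recomputed and reported F1 across the table.
max_gap = max(
    abs(f1_score(p, r) - reported)
    for rows in TABLE_2.values()
    for p, r, reported in rows
)
print(f"largest gap between recomputed and reported F1: {max_gap:.4f}")
```

All sixteen recomputed values agree with the reported F1 scores to within rounding, which indicates that the precision, recall, and F1 rows of Table 2 are internally consistent.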
To further explore the potential of these 11 PCSAs, we trained an XGBoost model on the HNSC-TCGA data. This model also showed strong performance across the additional cohorts, with AUC-ROC scores ranging from 0.81 to 0.85 (AUC-PR: 0.81–0.85) (Fig. 7B; Table 2). These results, combined with the performance of Random Forest, Logistic Regression, and SVM, highlight the consistency of these 11 PCSAs in distinguishing between pathogenic and benign mutations across various cancer types.
Further, we assessed the performance of classifier_11, built with the top 11 PCSAs, on known cancer driver hotspot mutations. We randomly selected 20 known somatic hotspot mutations from the Cancer Hotspots database (https://www.cancerhotspots.org) spanning 15 well-established cancer genes, including NRAS, KRAS, BRAF, and ERBB2. None of these mutations was present in the training set. We annotated the sites with the top 11 PCSAs and applied classifier_11 for all ML algorithms. All three ML algorithms (Random Forest, Logistic Regression, and Support Vector Machine) classified all 20 mutations as pathogenic drivers, whereas XGBoost classified 19 mutations as pathogenic and 1 as benign. Next, we checked the ACMG and ClinVar status of these mutations. Of the 20 mutations, 3 were classified as pathogenic, 12 as likely pathogenic, 4 as uncertain significance, and 1 as likely benign according to ACMG guidelines. ClinVar provided slightly different classifications: 5 pathogenic, 6 pathogenic/likely pathogenic, 3 likely pathogenic, 2 drug response, 2 conflicting classifications, 1 uncertain significance, and 1 not present. Details are provided in Supplementary Table S9.
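The counts in this paragraph can be tallied in a few lines to confirm that each breakdown covers all 20 hotspot mutations and to express the classifier calls as a detection rate. The numbers below are copied from the text; treating every hotspot as a true driver when computing the rate is an assumption of this sketch.

```python
from collections import Counter

N_HOTSPOTS = 20

# Classification counts as reported in the text.
acmg = Counter({
    "pathogenic": 3,
    "likely pathogenic": 12,
    "uncertain significance": 4,
    "likely benign": 1,
})
clinvar = Counter({
    "pathogenic": 5,
    "pathogenic/likely pathogenic": 6,
    "likely pathogenic": 3,
    "drug response": 2,
    "conflicting classifications": 2,
    "uncertain significance": 1,
    "not present": 1,
})

# Number of hotspots each model called a pathogenic driver.
called_driver = {
    "Random Forest": 20,
    "Logistic Regression": 20,
    "Support Vector Machine": 20,
    "XGBoost": 19,
}
# Assumes all 20 hotspots are true drivers, so this is a detection rate.
detection_rate = {model: n / N_HOTSPOTS for model, n in called_driver.items()}

print(sum(acmg.values()), sum(clinvar.values()))
print(detection_rate)
```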
Discussion
The methodology presented in this study allows for the performance evaluation of existing pathogenic and conservation scoring algorithms (PCSAs) based on their ability to distinguish pathogenic driver mutations from benign passenger mutations in HNSC. We included 41 different PCSAs in our study. Three different ML algorithms (logistic regression, random forest, and support vector machine) were employed alongside the recursive feature elimination technique for ML algorithm-wise ranking of the PCSAs. The individual rankings were then aggregated using an average sorting method. These ML algorithm-specific ranks were summed to generate a rank sum for each PCSA, followed by a sorting step to determine the final ranking of all PCSAs. While previous studies [31, 34, 35] have explored similar questions, our approach differs in algorithmic methodology. We focused on somatic mutations from multiple cancer cohorts, employed an ensemble machine learning model, and incorporated a larger number of PCSAs, whereas previous studies relied only on germline or ClinVar data and a limited set of PCSAs.
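The two aggregation steps described above (averaging ranks within each ML algorithm, then summing the algorithm-specific ranks) can be sketched as follows. The score names, the per-run ranks, and the alphabetical tie-breaking rule are hypothetical; the text does not specify how ties were resolved.

```python
from statistics import mean


def rank_average_sort(run_rankings):
    """Average each score's rank over runs, then re-rank by average (1 = best).

    Ties in the average are broken alphabetically (an assumption).
    """
    names = list(run_rankings[0].keys())
    avg = {name: mean(run[name] for run in run_rankings) for name in names}
    ordered = sorted(names, key=lambda name: (avg[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_sum_sort(per_algorithm_ranks):
    """Sum algorithm-specific ranks per score and sort ascending by the sum."""
    names = list(next(iter(per_algorithm_ranks.values())).keys())
    totals = {
        name: sum(ranks[name] for ranks in per_algorithm_ranks.values())
        for name in names
    }
    return sorted(totals.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical feature-elimination ranks for four scores over a few runs.
runs = {
    "logistic_regression": [
        {"score_A": 1, "score_B": 2, "score_C": 3, "score_D": 4},
        {"score_A": 2, "score_B": 1, "score_C": 3, "score_D": 4},
        {"score_A": 1, "score_B": 3, "score_C": 2, "score_D": 4},
    ],
    "random_forest": [
        {"score_A": 1, "score_B": 3, "score_C": 2, "score_D": 4},
        {"score_A": 1, "score_B": 4, "score_C": 2, "score_D": 3},
    ],
    "svm": [
        {"score_A": 2, "score_B": 1, "score_C": 3, "score_D": 4},
        {"score_A": 1, "score_B": 2, "score_C": 3, "score_D": 4},
    ],
}

per_algorithm = {algo: rank_average_sort(r) for algo, r in runs.items()}
final_ranking = rank_sum_sort(per_algorithm)
print(final_ranking)
```

In this toy input the random forest prefers `score_C` over `score_B`, but the rank sum across the three algorithms still places `score_B` second, which illustrates how the aggregation dampens the preference of any single algorithm.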
We applied a quintile-based cut-off to select the top 11 PCSAs. Subsequently, we noted a clear performance difference between classifiers built using these top 11 PCSAs and those built with the remaining 30 PCSAs. The classifier utilizing the top 11 PCSAs demonstrated a significantly better ability to differentiate pathogenic driver mutations from benign non-driver mutations. We also observed excellent performance of these top 11 PCSAs in an independent HNSC validation cohort (HNSC-CPTAC) and in other cancers (BRCA-TCGA, COADREAD-TCGA, and NSCLC-TCGA), indicating their consistency, robustness and generalizability. Further, the performance persisted when a different ML algorithm (XGBoost) was trained with the top 11 PCSAs from HNSC-TCGA and applied to the additional cohorts, suggesting that the top-ranked PCSAs are not biased toward the ML algorithms used for their selection. The models exhibited excellent predictive performance for somatic driver hotspot mutations (n = 20) in well-established cancer genes, such as KRAS – G12D, NRAS – G61A, EGFR – L813R, etc. As per ClinVar annotation, two of the assessed EGFR hotspot mutations are druggable. Although repurposing of cross-cancer drugs is an emerging area, it requires careful consideration of factors like tumor biology, drug sensitivity, and potential side effects in relevant tissues [73, 74].
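A quintile-based cut-off on the rank-sum distribution can be sketched with `statistics.quantiles` (Python 3.8 or later). The rank sums and score names below are invented for illustration, and the "inclusive" interpolation method is an assumption, since the text does not state how the quintile boundary was computed or how ties at the boundary were handled.

```python
import statistics


def top_by_first_quintile(rank_sums):
    """Return the first-quintile cut-off and the scores at or below it."""
    # quantiles(..., n=5) returns the four boundaries between the five quintiles;
    # the first boundary is the cut-off for the lowest (best) rank sums.
    cutoff = statistics.quantiles(rank_sums.values(), n=5, method="inclusive")[0]
    ordered = sorted(rank_sums.items(), key=lambda kv: kv[1])
    selected = [name for name, total in ordered if total <= cutoff]
    return cutoff, selected


# Hypothetical rank sums for ten scores (lower is better).
rank_sums = {
    "s01": 3, "s02": 7, "s03": 8, "s04": 12, "s05": 15,
    "s06": 18, "s07": 21, "s08": 24, "s09": 27, "s10": 30,
}

cutoff, selected = top_by_first_quintile(rank_sums)
print(f"cut-off = {cutoff}, selected = {selected}")
```

With ten scores, the first quintile keeps two of them. The number retained in practice depends on the interpolation method and on ties in the rank-sum distribution.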
Among the 41 PCSAs, DEOGEN2 is the top performer, followed by Integrated_fitCons and MVP, which share the second position with the same rank-sum value. The top PCSA consistently maintained its position across all three ML algorithms. DEOGEN2 is a supervised ML algorithm that utilizes a random forest approach. Trained and tested on the Humsavar16 dataset, DEOGEN2 integrates 11 features, including information on protein folding, conservation, gene essentiality, etc. The fitCons algorithm employs a statistical method called INSIGHT to estimate the functional impact score by contrasting patterns of polymorphism and divergence between dispersed genomic sites and nearby neutrally evolving sites. It utilizes public data, such as DNase-seq, RNA-seq, and histone modification data from ENCODE, to estimate the score. MVP employs a deep residual neural network (ResNet) model to estimate the pathogenic score using features such as amino-acid constraint score, conservation, protein structure and modification, gene mutation intolerance, sub-genic regional depletion of missense variants, and a few existing pathogenic scores. For training and testing, MVP uses a positive set comprising pathogenic sites from HGMD and UniProt and a negative set comprising rare missense variants from population data. Among the remaining eight top PCSAs, two utilize statistical approaches: PROVEAN (average delta score) and MutationAssessor (entropy based). The other six rely on machine learning or deep learning: M-CAP (gradient boosting tree classifier), MetaLR (logistic regression classifier), AlphaMissense (neural network based classifier), VARITY_R (XGBoost classifier), MutPred (support vector machine and random forest classifier) and fathmm-MKL_coding (multiple kernel learning).
The conservation scoring algorithms performed poorly and were ranked at the bottom in the final rankings. The lowest-ranking PCSAs are: phastCons17way_primate, phastCons470way_mammalian, and phyloP17way_primate.
Interestingly, two out of the top three identified PCSAs utilize machine learning approaches to estimate pathogenic scores. This suggests that the effective application of AI/ML algorithms, combined with well-curated feature sets encompassing relevant protein information, conservation scores, gene essentiality, mutation intolerance, and other factors, could play a crucial role in enhancing PCSA performance. In future, our approach could be extended by integrating a weighted rank-sum scheme driven by a priori biological knowledge for better performance.
Finally, our analysis highlights that even well-known PCSAs can exhibit variable performance. Therefore, one should not rely solely on the popularity of a single PCSA when assessing the pathogenicity of mutations, especially in cancer. A data-driven approach that integrates multiple PCSAs might provide a more reliable assessment of pathogenicity. Our methodology is one such approach to rank and select existing PCSAs to determine the pathogenicity of mutations.
Conclusions
In this study, we describe a methodology for evaluating the performance of PCSAs using an ensemble machine learning approach. Three different ML algorithms (logistic regression, random forest and support vector machine) were employed together with the recursive feature elimination technique, followed by an average sort method, to evaluate the performance (rank) of the PCSAs. The PCSAs were evaluated based on their ability to separate pathogenic driver mutations from benign passenger mutations in HNSC-TCGA data. The top 11 PCSAs were selected based on a quantile-based cut-off identified from the final rank-sum distribution. The identified top 11 PCSAs also show excellent performance in other cancers and on known hotspot mutations, indicating their consistency and usefulness. These findings highlight that our method performs better than individual PCSAs and that some popular PCSAs performed poorly on HNSC data. Therefore, we recommend that rather than relying on the popularity of existing scores, a data-driven, integrated approach that considers multiple PCSAs is likely a more robust method for accurate pathogenicity assessment.
Supplementary Information
Authors’ contributions
Conceptualization: SD, AM, NKB. Data curation: SD. Formal analysis: SD, VP, SC, AG. Funding acquisition: AM, NKB. Methodology: SD, VP, AM, NKB. Writing - original draft: SD. Writing - review & editing: AM, NKB. The authors read and approved the final manuscript.
Funding
This work was supported by the Indian Council of Medical Research (ICMR), Government of India (Project ID: BMI/12(06)/2022 and IRIS ID No. 2021–14057) grant to N K Biswas. A. Mukhopadhyay also acknowledges the support received from core research grant CRG/2022/007730 from SERB, DST, Government of India. VP was supported by NSM project grant provided by MeiTy, GoI. SC acknowledges Department of Biotechnology, Govt of India for PhD fellowship (DBT/2019/NIBMG/1225) and BRIC-RCB (RCB/NIBMG-PhD/2019/1011). AG acknowledges ICMR-SRF fellowship (GENOMICS-BMS/2021–10619) and BRIC-RCB (RCB/NIBMG-PhD/2022-23/A/394/1002).
Data availability
Codes and datasets used in this study are publicly available on GitHub under https://github.com/subrata-codeons/RPCS.
Declarations
Consent for publication
All the authors approved the manuscript for publication.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Anirban Mukhopadhyay, Email: anirban@klyuniv.ac.in.
Nidhan K. Biswas, Email: nkb1@nibmg.ac.in
References
- 1. Hudson TJ, Anderson W, Aretz A, Barker AD, Bell C, Bernabé RR, et al. International network of cancer genome projects. Nature. 2010;464:993–8. 10.1038/nature08987.
- 2. Chang K, Creighton CJ, Davis C, Donehower L, Drummond J, Wheeler D, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45:1113–20. 10.1038/ng.2764.
- 3. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458:719–24. 10.1038/nature07943.
- 4. Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, et al. Accumulation of driver and passenger mutations during tumor progression. Proc Natl Acad Sci U S A. 2010;107:18545–50.
- 5. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499:214–8. 10.1038/nature12213.
- 6. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LAJ, Kinzler KW. Cancer genome landscapes. Science. 2013;339:1546–58.
- 7. Martínez-Jiménez F, Muiños F, Sentís I, Deu-Pons J, Reyes-Salazar I, Arnedo-Pac C, et al. A compendium of mutational cancer driver genes. Nat Rev Cancer. 2020;20:555–72. 10.1038/s41568-020-0290-x.
- 8. Tamborero D, Rubio-Perez C, Deu-Pons J, Schroeder MP, Vivancos A, Rovira A, et al. Cancer genome interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med. 2018;10:25. 10.1186/s13073-018-0531-8.
- 9. Liu X, Wu C, Li C, Boerwinkle E. dbNSFP v3.0: a one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum Mutat. 2016;37:235–41.
- 10. Nourbakhsh M, Degn K, Saksager A, Tiberti M, Papaleo E. Prediction of cancer driver genes and mutations: the potential of integrative computational frameworks. Brief Bioinform. 2024;25:bbad519. 10.1093/bib/bbad519.
- 11. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–4.
- 12. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7:248–9.
- 13. Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 2011;39:e118.
- 14. Samocha KE, Kosmicki JA, Karczewski KJ, O'Donnell-Luria AH, Pierce-Hoffman E, MacArthur DG, et al. Regional missense constraint improves variant deleteriousness prediction. bioRxiv. Cold Spring Harbor Laboratory; 2017. Available from: https://www.biorxiv.org/content/early/2017/06/12/148353.
- 15. Sundaram L, Gao H, Padigepati SR, McRae JF, Li Y, Kosmicki JA, et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet. 2018;50:1161–70.
- 16. Raimondi D, Tanyalcin I, Ferté J, Gazzo A, Orlando G, Lenaerts T, et al. DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Res. 2017;45:W201–206.
- 17. Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019;47:D886–894.
- 18. Gulko B, Hubisz MJ, Gronau I, Siepel A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat Genet. 2015;47:276–83.
- 19. Siepel A, Pollard KS, Haussler D. New methods for detecting lineage-specific selection. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Res Comput Mol Biol. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006. p. 190–205.
- 20. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–50.
- 21. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17:405–23. 10.1038/gim.2015.30.
- 22. McVicker G, Gordon D, Davis C, Green P. Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet. 2009;5:e1000471.
- 23. Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381:eadg7492.
- 24. Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum Mutat. 2013;34:57–65.
- 25. Cheng F, Zhao J, Zhao Z. Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes. Brief Bioinform. 2016;17:642–56. 10.1093/bib/bbv068.
- 26. Zhang H, Xu MS, Fan X, Chung WK, Shen Y. Predicting functional effect of missense variants using graph attention neural networks. Nat Mach Intell. 2022;4:1017–28.
- 27. Vaser R, Adusumalli S, Leng SN, Sikic M, Ng PC. SIFT missense predictions for genomes. Nat Protoc. 2016;11:1–9.
- 28. Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino acid substitutions and indels. PLoS One. 2012;7:e46688.
- 29. Andrades R, Recamonde-Mendoza M. Machine learning methods for prediction of cancer driver genes: a survey paper. Brief Bioinform. 2022;23:bbac062. 10.1093/bib/bbac062.
- 30. Liu X, Li C, Mou C, Dong Y, Tu Y. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med. 2020;12:103. 10.1186/s13073-020-00803-9.
- 31. Thusberg J, Olatubosun A, Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011;32:358–68.
- 32. Sun H, Yu G. New insights into the pathogenicity of non-synonymous variants through multi-level analysis. Sci Rep. 2019;9:1667. 10.1038/s41598-018-38189-9.
- 33. Montenegro LR, Lerário AM, Nishi MY, Jorge AAL, Mendonca BB. Performance of mutation pathogenicity prediction tools on missense variants associated with 46, XY differences of sex development. Clinics. 2021;76:e2052. Available from: https://www.sciencedirect.com/science/article/pii/S180759322200059X.
- 34. Ghosh R, Oak N, Plon SE. Evaluation of in silico algorithms for use with ACMG/AMP clinical variant interpretation guidelines. Genome Biol. 2017;18:225. 10.1186/s13059-017-1353-5.
- 35. Favalli V, Tini G, Bonetti E, Vozza G, Guida A, Gandini S, et al. Machine learning-based reclassification of germline variants of unknown significance: the RENOVO algorithm. Am J Hum Genet. 2021;108:682–95.
- 36. Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 2016;44:D862–8. 10.1093/nar/gkv1222.
- 37. Zhou T, Huang W, Wang X, Zhang J, Zhou E, Tu Y, et al. Global burden of head and neck cancers from 1990 to 2019. iScience. 2024;27:109282.
- 38. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517:576–82.
- 39. Warnakulasuriya S, Johnson NW, van der Waal I. Nomenclature and classification of potentially malignant disorders of the oral mucosa. J Oral Pathol Med. 2007;36:575–80.
- 40. Biswas NK, Das C, Das S, Maitra A, Nair S, Gupta T, et al. Lymph node metastasis in oral cancer is strongly associated with chromosomal instability and DNA repair defects. Int J Cancer. 2019;145:2568–79.
- 41. Chatterjee A, Chaudhary A, Ghosh A, Arun P, Mukherjee G, Arun I, et al. Overexpression of CD73 is associated with recurrence and poor prognosis of gingivobuccal oral cancer as revealed by transcriptome and deep immune profiling of paired tumor and margin tissues. Cancer Med. 2023;12:16774–87.
- 42. Johnson DE, Burtness B, Leemans CR, Lui VWY, Bauman JE, Grandis JR. Head and neck squamous cell carcinoma. Nat Rev Dis Prim. 2020;6:92. 10.1038/s41572-020-00224-3.
- 43. Jardim DL, Goodman A, de Melo Gagliato D, Kurzrock R. The challenges of tumor mutational burden as an immunotherapy biomarker. Cancer Cell. 2021;39:154–73.
- 44. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction, second edition (Springer series in statistics). 2009.
- 45. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
- 46. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97. 10.1007/BF00994018.
- 47. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46:389–422. 10.1023/A:1012487302797.
- 48. Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6:pl1.
- 49. Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res. 2009;19:1553–61.
- 50. Schwarz JM, Cooper DN, Schuelke M, Seelow D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat Methods. 2014;11:361–2.
- 51. Carter H, Douville C, Stenson PD, Cooper DN, Karchin R. Identifying mendelian disease genes with the variant effect scoring tool. BMC Genomics. 2013;14(Suppl 3):S3.
- 52. Dong C, Wei P, Jian X, Gibbs R, Boerwinkle E, Wang K, et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet. 2015;24:2125–37.
- 53. Li C, Zhi D, Wang K, Liu X. MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning. Genome Med. 2022;14:115.
- 54. Jagadeesh KA, Wenger AM, Berger MJ, Guturu H, Stenson PD, Cooper DN, et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet. 2016;48:1581–6.
- 55. Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet. 2016;99:877–85.
- 56. Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, et al. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009;25:2744–50.
- 57. Qi H, Zhang H, Zhao Y, Chen C, Long JJ, Chung WK, et al. MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun. 2021;12:510.
- 58. Alirezaie N, Kernohan KD, Hartley T, Majewski J, Hocking TD. ClinPred: prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants. Am J Hum Genet. 2018;103:474–83.
- 59. Malhis N, Jacobson M, Jones SJM, Gsponer J. LIST-S2: taxonomy based sorting of deleterious missense mutations across species. Nucleic Acids Res. 2020;48:W154–161.
- 60. Wu Y, Li R, Sun S, Weile J, Roth FP. Improved pathogenicity prediction for rare human missense variants. Am J Hum Genet. 2021;108:1891–906.
- 61. Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31:761–3.
- 62. Shihab HA, Rogers MF, Gough J, Mort M, Cooper DN, Day INM, et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics. 2015;31:1536–43.
- 63. Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, Campbell C. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics. 2018;34:511–3.
- 64. Ionita-Laza I, McCallum K, Xu B, Buxbaum JD. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet. 2016;48:214–20.
- 65. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010;6:e1001025.
- 66. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–21.
- 67. Bailey MH, Tokheim C, Porta-Pardo E, Sengupta S, Bertrand D, Weerasinghe A, et al. Comprehensive characterization of cancer driver genes and mutations. Cell. 2018;173:371–385.e18.
- 68. Sondka Z, Bamford S, Cole CG, Ward SA, Dunham I, Forbes SA. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat Rev Cancer. 2018;18:696–705. 10.1038/s41568-018-0060-1.
- 69. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67. Available from: https://www.jstatsoft.org/index.php/jss/article/view/v045i03.
- 70. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
- 71. McKinney W. Data structures for statistical computing in Python. 2010.
- 72. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585:357–62. 10.1038/s41586-020-2649-2.
- 73. Xia Y, Sun M, Huang H, Jin WL. Drug repurposing for cancer therapy. Signal Transduct Target Ther. 2024;9:92. 10.1038/s41392-024-01808-1.
- 74. Weth FR, Hoggarth GB, Weth AF, Paterson E, White MPJ, Tan ST, et al. Unlocking hidden potential: advancements, approaches, and obstacles in repurposing drugs for cancer therapy. Br J Cancer. 2024;130:703–15. 10.1038/s41416-023-02502-9.