Summary
Small insertions and deletions (indels) are critical yet challenging genetic variations with significant clinical implications. However, the identification of pathogenic indels from neutral variants in clinical contexts remains an understudied problem. Here, we developed INDELpred, a machine-learning-based predictive model for discerning pathogenic from benign indels. INDELpred was established based on key features, including allele frequency, indel length, function-based features, and gene-based features. A set of comprehensive evaluation analyses demonstrated that INDELpred exhibited superior performance over competing methods in terms of computational efficiency and prediction accuracy. Importantly, INDELpred highlighted the crucial role of function-based features in identifying pathogenic indels, with a clear interpretability of the features in understanding the disease-causing variants. We envisage INDELpred as a desirable tool for the detection of pathogenic indels within large-scale genomic datasets, thereby enhancing the precision of genetic diagnoses in clinical settings.
Keywords: InDel, machine learning, pathogenicity prediction, clinical genomics, whole genome sequencing
INDELpred is a machine-learning-based predictive model for discerning pathogenic from benign indels. It outperforms competing methods in computational efficiency and prediction accuracy, emphasizing the importance of function-based features in identifying pathogenic indels. We envisage INDELpred as a desirable tool for the detection of pathogenic indels within large-scale genomic datasets.
Introduction
Advancements in next-generation sequencing technologies have spurred the discovery of numerous genetic variations within the human genome.1 Among these, insertions and deletions (indels), ranging from a single base to hundreds of bases, are the second most prevalent type of genetic variation, accounting for 15%–21% of all variants.2 Although single-nucleotide variants (SNVs) have been the focus of extensive research, indels have not yet received the same level of scrutiny despite their proven link to a wide spectrum of diseases, including epilepsy3 (MIM: 254770), central nervous system anomalies4 (MIM: 608281), rare diseases,5 and various cancers.6 Indels in coding regions can cause frameshift variants or in-frame alterations that may drastically affect protein function and stability, leading to a cascade of pathological effects.7 Additionally, the role of indels in non-coding regions is increasingly recognized as the pivotal etiology in gene regulation and mRNA processing,8,9 which may have far-reaching consequences for cellular function and disease pathogenesis.10,11,12 However, due to insufficient supporting evidence, these non-coding indels are often overlooked and regarded as variants of uncertain significance (VUSs).13 Therefore, interpreting the functional impact of a myriad of indels identified through clinical genomic sequencing is of great importance for personalized medicine yet remains an arduous challenge, underscoring the need for predictive methods to precisely evaluate their pathogenic potential.
Current predictive frameworks for variant pathogenicity predominantly cater to SNVs, with comparatively fewer tools available for assessing indels specifically.14,15,16,17,18,19 Existing methods for indel pathogenicity mainly concentrate on the protein-coding genomic regions and are categorically divided into direct and indirect prediction methods. Direct prediction methods are primarily centered on analyzing the structural and functional impacts of indel variants on proteins.20,21,22,23,24,25,26 Computational tools such as SIFT15 and PolyPhen26,27 harness sequence homology and physicochemical property analyses to assess the potential deleterious consequences of such mutations. Despite their efficacy, these methods are susceptible to yielding false positive results, especially in cases where frameshift variants do not exert a pronounced effect on protein function.28 Conversely, indirect prediction methods augment the assessment of genetic variants by incorporating diverse features, including demographic characteristics,29 meta-predictive factors,30 and additional pathogenicity-associated statistics. Methods like combined annotation dependent depletion (CADD)18 and consequence-agnostic pathogenicity interpretation of clinical exome variations (CAPICE)17 employ an expansive array of over 60 genome annotation features to estimate the pathogenic potential of both SNVs and indels. Moreover, meta-predictors such as MetaRNN16 leverage existing predictors as meta-predictive factors, enhancing the prediction accuracy by integrating outputs from multiple prediction algorithms, although this may escalate the computational time and complexity of the method.
Though some tools, such as CADD, exhibit notable prediction capabilities, their utility in practice may be compromised by excessive processing times. Moreover, a recent review observed that although meta-predictors have demonstrated effectiveness, potential circular reasoning within these models may undermine the reliability of their evaluation.30 Furthermore, a high degree of correlation among features16 increases redundancy and consequently reduces the interpretability of the model. Additionally, validation efforts using clinical whole-genome sequencing (WGS) data are scarce, suggesting that the efficacy of these tools in clinical settings may not be as strong as indicated by evaluations performed on curated datasets. Beyond quantifying predictive accuracy, elucidating the biological relevance of these predictions is of paramount importance. Although tools like CAPICE demonstrate commendable predictive accuracy in silico, there remains a significant gap in understanding the biological implications of their prediction outputs.
To address these challenges, we have developed INDELpred, a predictive model characterized by its elevated accuracy, rapidity, and interpretability in the pathogenicity assessment of indels. INDELpred is remarkable for its superior performance, as evidenced by its proficiency across seven distinct evaluation metrics. Notably, INDELpred outpaces existing software in terms of computational speed, delivering swift predictions while necessitating minimal data storage, thereby enabling effective large-scale genomic investigation. Furthermore, upon application on clinical WGS datasets, INDELpred exhibits remarkable robustness by prioritizing indels according to their pathogenicity scores and maintaining a markedly low rate of false positives for the variants deemed pathogenic. Additionally, through anchoring feature importance, INDELpred identifies function-based features as the foremost contributors to its predictive prowess, suggesting the substantial impact of functional regions on the determination of pathogenicity. Overall, we anticipate that INDELpred will yield clinical insights into genomic variants of indels and offer reliable diagnostic utilities for evaluating their pathogenicity, thus having potential implications for helping develop new therapeutic strategies toward personalized precision medicine.
Material and methods
Data collection
The entire data preprocessing process is illustrated in Figure S1. We downloaded indels from the ClinVar,31 VKGL,32 HGMD,33 and gnomAD29 databases, all aligned to the hg19 reference genome. Indels from the ClinVar database dated October 9, 2021, were compiled for INDELpred training (termed the ClinVarTrain dataset), while those as of August 6, 2023, not included in the ClinVarTrain, were used for internal testing (termed the ClinVarTest). To further assess our model, we created two datasets for external testing. The first dataset, collected from the VKGL as of July 2023, was denoted the VKGLTest. The VKGL dataset is a curated database grounded in real-world clinical inquiries and professional oversight. The second, sourced from the HGMD-2017 and gnomAD (v.2.1.1) databases, was referred to as the HGMDTest. Additionally, clinical WGS data aligned to hg38 reference genome were collected from our previous study.3 This dataset encompassed 26 clinically significant indels identified among 30 pediatric patients with epilepsy.
The ClinVar database and clinical WGS datasets were provided in VCF format, whereas the VKGL, HGMD, and gnomAD databases were in tabular format. For the tabular data, we additionally extracted chromosome locations and variant information (reference allele [REF] and alternate allele [ALT]) for indels and converted them into VCF format. We then split multi-allelic sites into separate entries, if existing, and extracted and normalized all indels using bcftools (v.1.9).34 All VCF files use a 1-based coordinate system. For the clinical WGS data, we merged all VCF files from the 30 patients into a single file. To prevent data leakage during model training and evaluation, we excluded sites present in the training dataset from all the testing datasets except the clinical WGS data.
Data annotation
We used the VCF files as input to ANNOVAR,35 leveraging the hg19 version of the refGene (from UCSC RefSeq,36 v.2020-08-17) and gnomAD databases (v.2.1.1) to annotate indels from ClinVar, VKGL, and HGMD databases. For annotating the clinical WGS data, we used the hg38 version. The annotation information provided by the refGene included "Func.refGene," "ExonicFunc.refGene," and "Gene.refGene," while gnomAD offered "controls_AF_popmax." These annotation results were used to derive features for the selected indels and then develop the INDELpred model.
Data preprocessing
The inclusion criteria for an indel were (1) a length between 1 and 100 base pairs (bp) and (2) an annotation result other than “nonframeshift substitution,” “frameshift substitution,” “nonsynonymous SNV,” or “synonymous SNV.” These criteria were applied consistently across all datasets. Additionally, indels initially categorized as benign or likely benign and pathogenic or likely pathogenic were relabeled as benign and pathogenic for this study.
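The two shared inclusion criteria and the relabeling step can be sketched as a small filter. This is a minimal illustration, not the authors' code: the function names, the raw label strings in `LABEL_MAP`, and the use of the absolute ALT/REF length difference as the indel length are all assumptions.

```python
# Annotation labels excluded by criterion (2) in the text.
EXCLUDED_ANNOTATIONS = {
    "nonframeshift substitution",
    "frameshift substitution",
    "nonsynonymous SNV",
    "synonymous SNV",
}

# Hypothetical raw labels mapped to the two classes used in the study.
LABEL_MAP = {
    "benign": "benign",
    "likely benign": "benign",
    "pathogenic": "pathogenic",
    "likely pathogenic": "pathogenic",
}


def indel_length(ref, alt):
    """Assumed definition: absolute difference between ALT and REF lengths."""
    return abs(len(alt) - len(ref))


def passes_inclusion(ref, alt, annotation):
    """Apply criteria (1) and (2) to a single variant."""
    return (1 <= indel_length(ref, alt) <= 100
            and annotation not in EXCLUDED_ANNOTATIONS)


def relabel(raw_label):
    """Return 'benign' or 'pathogenic', or None for any other label."""
    return LABEL_MAP.get(raw_label.strip().lower())
```

Variants for which `relabel` returns `None` (for example, VUSs) would simply not enter the labeled benign/pathogenic sets in this sketch.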
Furthermore, two extra inclusion criteria were applied to the ClinVarTrain dataset: (3) the site being located on chromosomes 1–22, X, or Y and (4) the site having the CLNREVSTAT status of “criteria_provided,_single_submitter,” “criteria_provided,_multiple_submitters,_no_conflicts,” or “reviewed_by_expert_panel.” The latter criterion ensured that the indel variants used for model training were of high confidence. For clinical WGS data, the selected variant sites also had to be located on chromosomes 1–22, X, or Y.
The HGMDTest dataset was built by integrating two sources. Pathogenic variants were taken from disease-causing mutation sites in the HGMD database, while benign variants were derived from the gnomAD and VariSNP37 databases. VariSNP is dedicated to benign variants, while gnomAD provides population allele frequency (AF) information. We randomly selected a roughly equal number of benign variants from gnomAD with AFs <0.01. To assess the impact of type II circularity,38 which may arise when variants in a gene region are exclusively labeled as either pathogenic or benign, we created an additional validation dataset called HGMDSharedGene. This dataset included only gene regions harboring both benign and pathogenic variants.
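The construction of a shared-gene subset such as HGMDSharedGene can be illustrated as follows. This is only a sketch under assumed inputs: each record is taken to be a dictionary with hypothetical `gene` and `label` keys, which is not a format stated in the text.

```python
from collections import defaultdict


def shared_gene_subset(records):
    """Keep only records whose gene carries both benign and pathogenic indels.

    `records` is a list of dicts with hypothetical keys 'gene' and 'label'.
    """
    labels_by_gene = defaultdict(set)
    for rec in records:
        labels_by_gene[rec["gene"]].add(rec["label"])

    # A gene is "shared" when both classes were observed in it.
    shared = {
        gene for gene, labels in labels_by_gene.items()
        if {"benign", "pathogenic"} <= labels
    }
    return [rec for rec in records if rec["gene"] in shared]
```

Because every retained gene contributes both classes, a model cannot score well on this subset merely by memorizing which genes are pathogenic.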
After preprocessing all datasets, we cross-checked for conflicting pathogenicity annotations. Because the number of conflicting annotations was small (n = 24), we excluded these entries from the data. We then finalized the counts of benign and pathogenic indels across the datasets used for developing and validating the INDELpred model (Tables 1 and S1).
Table 1.
Number of benign and pathogenic indels in different datasets
| Dataset category | Dataset name | Benign | Pathogenic | Total |
|---|---|---|---|---|
| Training dataset | ClinVarTrain | 28,104 | 41,836 | 69,940 |
| Internal testing dataset | ClinVarTest | 16,091 | 33,294 | 49,385 |
| External testing dataset | VKGLTest | 3,989 | 3,996 | 7,985 |
| External testing dataset | HGMDTest | 23,667 | 23,062 | 46,729 |
| Type II circularity check | HGMDSharedGene | 15,855 | 19,336 | 35,191 |
| Clinical testing dataset | clinical WGS | 3,496,179 | 26 | 3,496,205 |
Feature design
To establish INDELpred, we crafted four categories of features, leading to a set of 16 features used for the model development (Table S2).
AF
The variant AF was sourced from the gnomAD database (v.2.1.1), where the feature labeled “controls_AF_popmax” referred to the maximum AF observed within the outbred control population from the gnomAD genomes.
Indel length
The length of the indel was calculated as follows:
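$$\mathrm{Length}_{\mathrm{indel}} = \left| N_{\mathrm{ALT}} - N_{\mathrm{REF}} \right|$$
where $N_{\mathrm{ALT}}$ and $N_{\mathrm{REF}}$ denote the number of bp of ALT and REF recorded in the VCF file, respectively. The indel length measured the disparity in the bp counts between the ALT and REF.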
Gene-based scores
Gene-based features correlated genes with a series of genetic alterations. The impacted genes encompassed genomic functional regions of functional significance, such as exonic, upstream, intergenic, intronic, non-coding RNA associated, splicing related, and UTRs. Additionally, we examined the typology of mutational events, including frameshift deletions, frameshift insertions, non-frameshift deletions, non-frameshift insertions, start loss, stop gain, and stop loss. To quantify the relevance of these features (i.e., functional regions and mutational events) to pathogenicity, we utilized our training dataset to determine the proportion of pathogenic indels within each functional region of each gene, as defined by the “Gene.refGene" annotation:
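$$S_{i}^{f,g} = \frac{P^{f,g}}{A^{f,g}} \cdot \mathbb{I}_{i}^{f,g}$$
where $S_{i}^{f,g}$ denoted the gene-based score for indel $i$ situated within the region of feature $f$ in gene $g$, $f \in \{1, \ldots, F\}$, and $g \in \{1, \ldots, G\}$. $F$ and $G$ are the numbers of unique features and genes annotated for all indels in the training dataset, respectively. $P^{f,g}$ and $A^{f,g}$ represent the number of pathogenic indels and all indels within the region of feature $f$ in gene $g$, respectively. $\mathbb{I}_{i}^{f,g}$ is an indicator function, and $\mathbb{I}_{i}^{f,g} \in \{0, 1\}$.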
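The proportion underlying the gene-based score can be computed with two counters. The sketch below is illustrative only: the tuple layout of the training records, the function names, and the fallback of 0.0 for unseen gene/feature pairs (chosen to mirror zero imputation of missing features) are assumptions.

```python
from collections import Counter


def fit_gene_scores(training):
    """Proportion of pathogenic indels per (gene, feature) pair.

    `training` is an iterable of (gene, feature, is_pathogenic) tuples,
    a hypothetical layout for the annotated training indels.
    """
    total = Counter()
    pathogenic = Counter()
    for gene, feature, is_pathogenic in training:
        total[(gene, feature)] += 1
        if is_pathogenic:
            pathogenic[(gene, feature)] += 1
    return {key: pathogenic[key] / total[key] for key in total}


def gene_score(scores, gene, feature, default=0.0):
    """Look up a score; pairs unseen in training fall back to `default`."""
    return scores.get((gene, feature), default)
```

For example, a gene with two pathogenic and one benign exonic indel in training would yield an exonic score of 2/3 for any new exonic indel in that gene.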
Function-based scores
Genomic features indicative of tolerance and intolerance to genetic variations provided insights into the potential functional impact of variants on gene expression and protein function. We therefore defined two function-based features for tolerance to the variations (denoted as “general tolerance” and “exon tolerance”) and another two for intolerance (“general intolerance” and “exon intolerance”), based on the “Func.refGene” (e.g., exonic, upstream, etc.) and “ExonicFunc.refGene” (e.g., frameshift insertion, stop loss, etc.) annotations.
To quantify the tolerance or intolerance of each function,39 we first fitted ordinary least squares regression models (statsmodels library, v.0.14.0) on the training dataset to predict the number of benign and pathogenic indels, respectively, from the number of indel VUSs (neither pathogenic nor benign) for each function type annotated in “Func.refGene” or “ExonicFunc.refGene”; using VUSs as the predictor could reduce the impact of background noise. We then performed an outlier test on each fitted model and took the studentized residuals as the function-based tolerance and intolerance scores (Tables S3 and S4). Finally, each indel was assigned the four function-based scores corresponding to its annotated function type, which offered a measure of the variant's propensity for pathogenicity.
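To make the residual-based scoring concrete, the sketch below fits a one-predictor least squares line and returns externally studentized residuals in pure Python. It is a stand-in under stated assumptions rather than the authors' pipeline: the study used statsmodels, whose outlier test is assumed here to report the same kind of residual, and the choice of one VUS count per function type as the single predictor is an interpretation of the text.

```python
import math


def studentized_residuals(x, y):
    """Externally studentized residuals of the OLS fit y ~ a + b * x.

    x: VUS counts per function type (assumed predictor).
    y: benign (or pathogenic) counts per function type.
    """
    n = len(x)
    p = 2  # intercept and slope
    if n < p + 2:
        raise ValueError("need at least 4 observations")

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)

    scores = []
    for xi, e in zip(x, resid):
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        # Residual variance estimated with observation i left out.
        s2_loo = (sse - e * e / (1.0 - leverage)) / (n - p - 1)
        scores.append(e / math.sqrt(s2_loo * (1.0 - leverage)))
    return scores
```

A function type with far more benign indels than its VUS count predicts receives a large positive residual in the benign model, which is how a tolerance score would arise in this reading; the pathogenic model yields the intolerance score in the same way.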
Development of INDELpred model
Features with missing values were simply imputed with zeros. To establish the INDELpred model, a gradient boosting classifier40 (GBDT) was developed. We applied the grid search approach with the stratified 5-fold cross-validation to optimize the hyperparameters of the INDELpred model, which were determined based on the maximum value of the weighted F1 score. INDELpred was implemented upon scikit-learn (v.1.3.0) and Python (v.3.10.12). Consequently, the optimal values of hyperparameters for the GBDT-based INDELpred model were a learning_rate of 0.15 and a min_samples_leaf of 0.01. All remaining parameters were retained at their default settings. After that, we retrained the model configured with the optimal hyperparameters using the whole training dataset to obtain the final INDELpred model.
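The tuning procedure described above maps onto standard scikit-learn components. The following is a minimal sketch on synthetic data, not the published configuration: the candidate grid, the random seed, the shuffling of folds, and the synthetic 16-feature matrix are assumptions, and only the reported optimum (learning_rate 0.15, min_samples_leaf 0.01) comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def tune_gbdt(X, y, param_grid, seed=0):
    """Grid search with stratified 5-fold CV, scored by weighted F1."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        GradientBoostingClassifier(random_state=seed),
        param_grid=param_grid,
        scoring="f1_weighted",
        cv=cv,
        refit=True,  # retrain the best configuration on all training data
    )
    search.fit(X, y)
    return search


# Synthetic stand-in for the 16-feature training matrix.
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# Hypothetical candidate grid that contains the reported optimum.
grid = {"learning_rate": [0.05, 0.15], "min_samples_leaf": [0.01, 0.05]}
search = tune_gbdt(X, y, grid)
pathogenic_probability = search.best_estimator_.predict_proba(X)[:, 1]
```

With `refit=True`, the estimator exposed as `best_estimator_` has already been retrained on the full training set, which corresponds to the final retraining step in the text.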
Feature ranking
We adopted the SHAP41 (Shapley additive explanations) value (v.0.42.1) to rank the importance of individual features toward predicting the pathogenicity of indels. A SHAP value quantified the marginal contribution of a feature that influenced the prediction probability of pathogenicity. Positive SHAP values were indicative of the increasing risk of pathogenicity, while negative ones were indicative of the decreasing risk. We bootstrap sampled 20% of the indels from the ClinVarTest dataset and calculated the mean absolute SHAP value per feature across these indels to obtain feature importance. This process was repeated 100 times to ensure a robust estimation of feature contributions.
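The bootstrap aggregation of feature importance can be sketched independently of how the SHAP values are produced. The example assumes the per-indel SHAP values have already been computed and are supplied as a list of rows (one value per feature); the function name, the seed, and sampling with replacement are assumptions.

```python
import random


def bootstrap_feature_importance(shap_rows, frac=0.2, n_rounds=100, seed=0):
    """Mean absolute SHAP value per feature for each bootstrap round.

    shap_rows: list of per-indel SHAP value lists (precomputed, assumed input).
    Returns a list of n_rounds importance vectors.
    """
    rng = random.Random(seed)
    n_features = len(shap_rows[0])
    k = max(1, int(len(shap_rows) * frac))

    rounds = []
    for _ in range(n_rounds):
        sample = rng.choices(shap_rows, k=k)  # sampling with replacement
        rounds.append([
            sum(abs(row[j]) for row in sample) / k
            for j in range(n_features)
        ])
    return rounds
```

Ranking features by the average of these per-round vectors, and inspecting their spread across rounds, gives an importance ordering together with a sense of its stability.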
Competing methods
We compared INDELpred against three predictive models: CADD,42 CAPICE,17 and VEST.19 CADD integrated multiple annotations to compute a single score to predict the deleteriousness of both SNVs and indels. CAPICE employed machine learning to predict the pathogenicity of variants unearthed through clinical exome sequencing. Variant effect scoring tool (VEST) leveraged a random forest to score the functional impact of amino acid substitutions and indels. These methods took VCF files as input. CADD and CAPICE were downloaded from their respective GitHub repositories, and we adhered to their recommended parameter settings for our comparative experiments. For VEST, we used its online web service, uploading the VCF file for analysis and choosing the “VEST-4” option. We then retrieved the resultant scores once the prediction process was completed.
Prediction evaluation on genomic factors
To analyze the impact of various genomic factors on their predictive accuracy and robustness, we partitioned the ClinVarTest dataset into different subgroups following four parameters.
- Indel length: we segmented the dataset into four groups based on the length of indels: “0–10bp,” “10bp ∼ 20bp,” “20bp ∼ 50bp,” and “>50bp.”
- AF: we stratified the dataset into four groups based on the AF: “0” (where the variant AF is zero in the gnomAD database or the variant is unrecorded in the database), “0%–0.1%,” “0.1%–1%,” and “>1%.”
- Gain-of-function versus loss-of-function: to explore the differential performance of the model in predicting the functional consequences of indels in gene products between gain of function and loss of function, we classified genes into oncogenes (ONCs) and tumor suppressor genes (TSGs) according to COSMIC’s Cancer Gene Census.43
- SharedGenes group: to evaluate the risk of inflated model performance due to type II circularity, we identified a subset of genes within the ClinVarTest dataset that harbored both benign and pathogenic indels. This subset, termed the SharedGenes group, was used to guarantee that the predictive capabilities were not solely a reflection of the models' ability to recognize gene-specific variant patterns.
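The length and AF stratification described in the list above can be sketched as two binning helpers. The group labels follow the text, but the treatment of boundary values (upper bounds inclusive) is an assumption, since the text does not say which group a value such as exactly 10 bp or exactly 1% belongs to.

```python
def length_group(length_bp):
    """Assign an indel length (bp) to one of the four length groups."""
    if length_bp <= 10:
        return "0-10bp"
    if length_bp <= 20:
        return "10-20bp"
    if length_bp <= 50:
        return "20-50bp"
    return ">50bp"


def af_group(af):
    """Assign an allele frequency to one of the four AF groups.

    `None` stands for a variant that is not recorded in gnomAD.
    """
    if af is None or af == 0:
        return "0"
    if af <= 0.001:
        return "0%-0.1%"
    if af <= 0.01:
        return "0.1%-1%"
    return ">1%"
```

Computing AUROC separately within each returned group then gives the per-stratum comparison between methods.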
To investigate the correlation between the data confidence and the predictive accuracy of the models, we segmented the ClinVarTest and VKGLTest datasets according to distinct confidence levels. For ClinVarTest, we applied the CLINSTAT criteria to assign confidence ratings spanning from 0 to 4 stars based on the number of gold stars associated with the review status. For VKGLTest, data stratification was performed based on the “support” column, which indicates the number of laboratories corroborating the pathogenicity of a specific indel. A higher number indicated a more reliable interpretation of the variant.
Evaluation metrics
The performance of the final model was evaluated using the following metrics.
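1. The area under the receiver operating characteristic curve (AUROC).
2. The area under the precision-recall curve (AUPR).
3. Matthews correlation coefficient (MCC):
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP represents true positive, FP represents false positive, TN represents true negative, and FN represents false negative.
4. Cohen’s kappa:
$$\kappa = \frac{2\,(TP \times TN - FN \times FP)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$$
5. Balanced accuracy (BA):
$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
6. F1 score:
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$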
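As a worked illustration of the four threshold-based metrics, the sketch below computes them from confusion-matrix counts using their standard binary definitions. The function name and the convention of returning 0.0 when a denominator is zero are assumptions, and the F1 shown is the plain positive-class F1 rather than a weighted variant.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """MCC, Cohen's kappa, balanced accuracy, and F1 from confusion counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    kappa_den = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    kappa = 2 * (tp * tn - fn * fp) / kappa_den if kappa_den else 0.0

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2

    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    return {"mcc": mcc, "kappa": kappa, "ba": balanced_accuracy, "f1": f1}
```

For instance, with TP = 40, FP = 10, TN = 45, and FN = 5, the observed agreement is 0.85 and the chance agreement is 0.5, so Cohen's kappa is (0.85 − 0.5) / (1 − 0.5) = 0.7.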
Experimental setting
All experiments were conducted on a workstation with 12 Intel Xeon Gold 5118 CPUs and 188 GB random access memory. We performed experiments within the CPU-based environment and deliberately ran each method on a single thread with a virtual memory allocation of 1 GB for performance comparison.
Results
Overview of INDELpred
We built INDELpred to predict the pathogenicity of indels by leveraging a set of features. Specifically, function-based features (n = 4) and gene-based features (n = 10) were first derived for indel variants (Figure 1A). By integrating the two individual features of indel AF and indel length, we created a compact yet comprehensive list of 16 features to establish our INDELpred model (Table S2). The GBDT-based INDELpred model was trained with the ClinVarTrain dataset using stratified 5-fold cross-validation in conjunction with grid search to ascertain the optimal hyperparameter settings (Figure 1B). Using the optimal hyperparameters, we retrained the INDELpred model on the entire ClinVarTrain dataset to finalize the model. The performance of the final INDELpred model was then assessed using a suite of evaluation metrics. To thoroughly evaluate the robustness and reliability in predicting the pathogenicity of indels across different genomic contexts, we tested the model on a variety of datasets, including internal and external testing datasets, as well as a clinical WGS dataset (Figure 1C).
Figure 1.
Overview of INDELpred development and evaluation
(A) After annotation by ANNOVAR, the function-based features for indel variants were represented by the studentized residuals that were derived from the fitted ordinary least squares regression models, while the gene-based features were calculated by the ratios of pathogenic to all indels for each function in the specific gene regions.
(B) With the four distinct feature categories, the GBDT-based INDELpred model was trained upon the ClinVarTrain dataset using the stratified 5-fold cross-validation. The final INDELpred model was subsequently applied to four testing datasets for the evaluation of prediction accuracy, runtime, and data storage requirements using a set of metrics.
(C) INDELpred model for indel pathogenicity prediction was further validated across various genomic contexts, including stability analysis with different genomic factors, reliability analysis based on the varying support laboratories (confidence levels) of indels, shared gene analysis between HGMD and gnomAD, and the adoption of a clinical WGS dataset with 30 pediatric individuals for practical applicability assessment.
INDELpred enhances the identification of pathogenic indels
We exploited multiple datasets to evaluate the performance of INDELpred in identifying pathogenic indels. These datasets included an internal testing dataset, ClinVarTest, and two external testing datasets, VKGLTest and HGMDTest. For benchmarking purposes, we chose the well-established methods CADD, CAPICE, and the coding region-specific variant scoring tool VEST for comparison. CADD employed a threshold score of 20 to delineate benign from pathogenic variants, whereas the other methods conventionally utilized a threshold of 0.5 for this distinction.
Results revealed that INDELpred achieved comparable or superior prediction accuracy relative to CADD and CAPICE (Figure 2A). In contrast, VEST was limited by a substantial number of missing predictions, failing to score 16,076, 3,003, and 24,550 variants in the ClinVarTest, VKGLTest, and HGMDTest datasets, respectively (Table S5). Furthermore, we randomly selected 80% of the samples from each testing dataset and assessed prediction performance in terms of the six evaluation metrics. This process was repeated 20 times to investigate the prediction robustness of each method. Results demonstrated that INDELpred was comparable to or better than competing methods, as evidenced by its high values of F1 score (0.979 ± 0.000 for ClinVarTest, 0.928 ± 0.001 for VKGLTest, and 0.985 ± 0.000 for HGMDTest; median ± standard deviation); Cohen’s kappa (0.936 ± 0.001 for ClinVarTest, 0.848 ± 0.002 for VKGLTest, and 0.972 ± 0.001 for HGMDTest); MCC (0.936 ± 0.001 for ClinVarTest, 0.852 ± 0.002 for VKGLTest, and 0.972 ± 0.001 for HGMDTest); BA (0.968 ± 0.000 for ClinVarTest, 0.924 ± 0.001 for VKGLTest, and 0.986 ± 0.000 for HGMDTest); AUROC (0.993 ± 0.000 for ClinVarTest, 0.974 ± 0.001 for VKGLTest, and 0.998 ± 0.000 for HGMDTest); and AUPR (0.996 ± 0.000 for ClinVarTest, 0.966 ± 0.002 for VKGLTest, and 0.998 ± 0.000 for HGMDTest) (Figures 2B, 2C, and S2; Table S6). Even for genes in the testing datasets that were not used in training, our model maintained high accuracy (Table S7). Additionally, we visualized the distribution of prediction scores, which represented the likelihood of a variant being pathogenic. INDELpred scores were consistently concentrated near 1 for pathogenic variants and near 0 for benign ones across the three datasets (Figures 2D and S3). This comparative analysis underscored the efficacy of INDELpred in distinguishing between benign and pathogenic indels compared to other competing methods.
Figure 2.
Indel pathogenicity prediction performance comparison between INDELpred and the competing methods on the datasets of ClinVarTest, VKGLTest, and HGMDTest
(A) The number of predicted loci by different models for the three testing datasets. The gray color indicates the loci not predicted by the model.
(B) Radar chart of six evaluation metrics. Values closer to the periphery represent a score approaching 1, indicating high prediction performance.
(C) ROC curves with AUROC values yielded by different methods.
(D) Distribution of prediction scores by different models. The horizontal axis ranges from 0 to 1. CADD scores were normalized as , where we chose the value of 20 as the CADD cutoff value.
(E) Time consumption of the three models across the three datasets. Each method was independently executed five times on each dataset to ensure the consistency of the results. The data storage requirement for each method is shown on the x axis. Statistical test: two-sided Mann-Whitney U test with the Benjamini-Hochberg correction. ∗∗p < 0.01 and ∗∗∗∗p < 0.0001. The center line indicates the median, and whiskers represent 1.5 × IQR.
We further assessed the computational efficiency and algorithmic complexity for predicting pathogenic indels. VEST was excluded from this evaluation because it is a tool specifically designed for coding region variants, which limits its detection range compared to the other three tools. INDELpred exhibited high computational efficiency coupled with minimal data storage requirements (11.7 ± 19.16 min, 30 GB storage), running approximately 6- and 28-fold faster than CAPICE (71.18 ± 76.38 min, 84 GB; p < 0.01) and CADD (322.37 ± 197.17 min, 129 GB; p < 0.0001), respectively (Figure 2E; Table S8). Collectively, although CAPICE also showed promising prediction accuracy in terms of some evaluation metrics for the VKGLTest dataset (Table S6), the combination of robustness, superior performance, exceptional computational efficiency, and low storage requirements establishes INDELpred as a preferred tool for the accurate prediction of pathogenic indels within genomic data.
INDELpred shows stability for the impact of different genomic factors
Next, to ascertain the predictive capability of INDELpred with respect to various genomic factors, we partitioned the ClinVarTest dataset based on four parameters: indel length, AF, the functional nature of variants (gain of function or loss of function), and the classification within the SharedGenes group. Typically, gain-of-function and loss-of-function variants were linked to the roles of oncogenes (ONCs) and tumor suppressor genes (TSGs), respectively. The SharedGenes group contained the genes that harbored both pathogenic and benign variants. Considering that indels with lengths that are divisible by three represent non-frameshift variants, we also categorized indel length into two subsets based on whether or not it is divisible by three. Results showed that INDELpred maintained a consistently high AUROC in both subsets (Figure 3A). Moreover, AUROC values for all methods tended to decrease with the increase in indel length. However, the decrease in AUROC values for INDELpred was more gradual compared to the other methods (Figure 3B). A similar trend was observed for AF, where AUROC values across all methods declined as AF rose. Nevertheless, INDELpred still maintained the highest prediction accuracy (AUROC = 0.90) even for indels with high AF (>0.01) (Figure 3C). These findings suggested that INDELpred was more robust and stable in handling long and common indels, which can be challenging for other prediction algorithms. In addition, all methods exhibited better performance in TSGs than in ONCs (Figure 3D), indicating that these methods were more adept at identifying loss-of-function variants. Additionally, the performance of INDELpred, along with those of CAPICE and CADD, was not affected by type II circularity, as the AUROC values remained consistent (Figure 3E). This was further supported by consistent results across various evaluation metrics in the HGMDSharedGene dataset, which contains only genes harboring both pathogenic and benign variants and on which the metrics remained comparable to those observed in the HGMDTest dataset (Figure 3F). VEST exhibited the lowest prediction performance across all four genomic factor parameters.
Figure 3.
Evaluation of INDELpred for the impact of different genomic factors
(A–E) AUROC results assessed for each partition subset of the ClinVarTest dataset based on genomic factors of whether or not the indel length was divisible by three (A), indel length (B), AF (C), groups of TSGs and ONCs (D), and groups of shared genes (E). The number of indels in each subset is indicated on the top accordingly.
(F) Radar chart of the six metrics valued using the HGMDSharedGene dataset.
(G) AUROC results with respect to different levels of CLINSTAT credibility in ClinVarTest (left) as well as different support laboratories in VKGLTest (right).
Furthermore, we partitioned the ClinVarTest and VKGLTest datasets to examine the prediction performance with respect to varying levels of confidence. As expected, for the ClinVarTest dataset, a general upward trend in AUROC values was observed across all methods as data credibility increased. Remarkably, INDELpred outperformed the other methods at all levels of confidence, achieving high yet stable AUROC values of 0.993, 0.995, and 0.996 for confidence stars ≥0, ≥1, and ≥2, respectively (Figure 3G, left; Tables S9 and S10). In the VKGLTest dataset, INDELpred and CAPICE demonstrated impressive stability. INDELpred, in particular, maintained comparably high AUROC values of 0.975, 0.994, and 1.000 for support levels ≥1, ≥2, and ≥3, respectively (Figure 3G, right; Tables S11 and S12). These results suggested the high accuracy and stability of INDELpred in predicting pathogenic indels under different genomic factors and confidence levels.
INDELpred improves efficiency in identifying pathogenic indels within clinical WGS data
Next, we applied INDELpred and the competing methods to the clinical WGS data collected from a cohort of 30 individuals, which harbored approximately 3 million unique indel variants. Among these, 26 distinct variants were strongly corroborated as etiological indel variants (Table 1). INDELpred and CAPICE successfully processed all variants in the analysis, while CADD failed to complete the analysis for three individuals due to exceeding the predefined computational resource constraints. Notably, INDELpred displayed exceptional efficiency with a median processing time of 52 ± 45 min, significantly faster than CAPICE (228 ± 204 min, p < 0.0001) and CADD (1,888 ± 1,314 min, p < 0.0001) (Figure 4A; Table S13). Owing to the profoundly imbalanced dataset, in which the benign variants far exceed the pathogenic ones, we calculated the AUPR to measure the prediction performance in conjunction with the AUROC. Compared to the competing methods, INDELpred reached the highest AUPR of 0.506 and AUROC of 0.990 (Figures 4B and S4).
Figure 4.
Comparison of indel pathogenicity prediction performance between INDELpred and the competing methods on the clinical WGS dataset
(A) Computational time taken by the three methods for the clinical WGS dataset collected from 30 pediatric individuals. Statistical test: two-sided Mann-Whitney U test with the Benjamini-Hochberg correction. ∗∗∗∗p < 0.0001. Each box in the boxplots corresponds to the interval between the 25th and 75th percentile (interquartile range, IQR) with the center line indicating the median and whiskers representing 1.5 × IQR.
(B) Precision-recall curves with AUPR values for each predictive model.
(C) Curve plot showing the sensitivity of pathogenic indels in the top-k variants ranked by different prediction methods for varying k values.
(D) Bar plot showing the sensitivity of individuals in the top 30, top 100, and top 150 variants ranked by different prediction methods.
(E) Prediction scores yielded by different predictive models for the 26 disease-causative indels. CADD scores were normalized as , where we chose the value of 20 as the CADD cutoff value. Each box in the boxplots corresponds to the interval between the 25th and 75th percentile (interquartile range, IQR) with the center line indicating the median and whiskers representing 1.5 × IQR.
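The Benjamini-Hochberg correction named in panel (A) can be sketched in a few lines of Python. This is only an illustration of the general procedure: the three p values are invented stand-ins for the pairwise runtime comparisons, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw p values for three pairwise method comparisons.
raw_p = [0.01, 0.04, 0.03]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```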
To assess the predictive power of INDELpred in pinpointing pathogenic indels, we prioritized and ranked the variants based on their predicted pathogenicity scores. We calculated the sensitivity of pathogenic indels, defined as the proportion of true positive pathogenic indels recovered in the top-k variants as ranked by the prediction method. Compared to the other methods, INDELpred achieved markedly higher sensitivity of pathogenic indels across the varying k values (Figure 4C). For example, it reached 92.3% (24/26) within the smallest number of top variants (k = 136), surpassing the performance of the other methods. In addition, we calculated the sensitivity of individuals, that is, the proportion of individuals with true positive pathogenic indels in the ranked top-k variants. Results showed that INDELpred outperformed the other methods (Figure 4D). It enabled us to identify pathogenic indels in 56.7% (17/30), 90% (27/30), and 93.3% (28/30) of individuals within the top 30, top 100, and top 150 variants, respectively. Additionally, among the methods, INDELpred achieved the highest median predicted pathogenicity score (0.997) for the 26 disease-causing indels (Figure 4E). Considering the high accuracy and computational efficiency demonstrated by INDELpred on clinical whole-genome sequencing data, our tool showed strong potential for sifting through large-scale genomic data to accurately identify pathogenic indels within clinical contexts.
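The two top-k sensitivity measures described above can be sketched as follows. This is a minimal illustration under stated assumptions: variants are ranked per individual for the individual-level measure (the text does not specify whether ranking was pooled or per individual), and all identifiers, scores, and function names are invented.

```python
from typing import Dict, List, Set, Tuple


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Return variant IDs ordered from the highest to the lowest predicted score."""
    return sorted(scores, key=lambda vid: scores[vid], reverse=True)


def variant_sensitivity_at_k(ranked: List[str], causal: Set[str], k: int) -> float:
    """Share of known causal variants that appear among the top-k ranked variants."""
    return len(set(ranked[:k]) & causal) / len(causal)


def individual_sensitivity_at_k(
    cohort: Dict[str, Tuple[Dict[str, float], Set[str]]], k: int
) -> float:
    """Share of individuals with at least one causal variant in their own top-k list."""
    solved = 0
    for scores, causal in cohort.values():
        if set(rank_by_score(scores)[:k]) & causal:
            solved += 1
    return solved / len(cohort)


# Invented pooled scores and causal variants.
pooled_scores = {"v1": 0.99, "v2": 0.10, "v3": 0.95, "v4": 0.40, "v5": 0.85}
causal_variants = {"v1", "v5"}
pooled_ranking = rank_by_score(pooled_scores)

# Invented cohort: individual ID -> (variant scores, causal variant IDs).
cohort = {
    "P1": ({"a": 0.9, "b": 0.2}, {"a"}),
    "P2": ({"c": 0.3, "d": 0.8, "e": 0.7}, {"c"}),
    "P3": ({"f": 0.6, "g": 0.5}, {"g"}),
}

for k in (1, 2, 3):
    print(k, variant_sensitivity_at_k(pooled_ranking, causal_variants, k),
          individual_sensitivity_at_k(cohort, k))
```

Ranking rather than thresholding mirrors how a clinical analyst would review a short candidate list per individual.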
INDELpred illustrates the impact of features on pathogenicity prediction
We further explored the relevance of the different features used in the INDELpred model, as measured by SHAP values. We first examined the correlation between features, and the results showed that most features exhibited minimal interdependence, with the exception of function-based features (Figure 5A). The relatively high correlation within function-based features was likely due to the presence of both benign and deleterious indicators. With SHAP values, we identified the top five most important features: exon intolerance, AF, general tolerance, general intolerance, and exonic (Figure 5B). It is worth noting that the exonic feature was the most important among the gene-based features, suggesting its critical role in predicting pathogenic indels. To further quantify the contribution of each feature type toward INDELpred's ability to predict pathogenic indels, we compared the relative importance of the four feature types. Results showed that function-based features and AF made the two largest contributions to the prediction of pathogenicity (Figure 5C).
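One common way to turn per-sample SHAP values into feature-level and category-level importance is to average absolute SHAP values per feature and then sum them within each feature type. The sketch below illustrates that aggregation only; it is not necessarily the procedure behind Figure 5. The SHAP matrix is invented, the function names are hypothetical, and the assignment of "exon_intolerance" to the function-based category is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_abs_shap(shap_rows: Sequence[Sequence[float]], features: List[str]) -> Dict[str, float]:
    """Average absolute SHAP value per feature across all samples."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(features)
    }


def category_share(importance: Dict[str, float], category_of: Dict[str, str]) -> Dict[str, float]:
    """Sum feature importance within each category and normalize to shares of the total."""
    totals: Dict[str, float] = defaultdict(float)
    for name, value in importance.items():
        totals[category_of[name]] += value
    grand_total = sum(totals.values())
    return {cat: value / grand_total for cat, value in totals.items()}


# Invented SHAP values: rows are variants, columns follow `features`.
features = ["AF", "indel_length", "exon_intolerance", "exonic"]
shap_rows = [
    [-0.4, 0.1, 0.8, 0.1],
    [0.2, -0.1, -0.4, 0.3],
]
# Category assignment is an illustrative assumption.
category_of = {
    "AF": "AF",
    "indel_length": "indel length",
    "exon_intolerance": "function-based",
    "exonic": "gene-based",
}

importance = mean_abs_shap(shap_rows, features)
shares = category_share(importance, category_of)
print(importance)
print(shares)
```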
Figure 5.
Analysis of pathogenicity-related factors through the lens of features utilized by INDELpred
(A) Spearman correlation coefficient matrix of the 16 features used by INDELpred.
(B) The contribution of each individual feature, represented by feature importance, toward the prediction of indel pathogenicity by the INDELpred model.
(C) Relative contribution of each of the four feature categories toward the prediction of indel pathogenicity by the INDELpred model. Each box in the boxplots corresponds to the interval between the 25th and 75th percentile (interquartile range, IQR) with the center line indicating the median and whiskers representing 1.5 × IQR.
(D) Comparison of AUROC results across three datasets for AF with threshold-free and threshold-based values.
(E) Hierarchical clustering analysis of 16 individual features. All values of individual features were Z scored.
The AF quantified the prevalence of an indel within a population. A lower AF was generally indicative of a more deleterious variant, which was consequently less likely to persist through generations. The American College of Medical Genetics and Genomics (ACMG) guideline suggested that AF can serve as a threshold to enrich for pathogenic variants.44 However, recent studies advocated for the adoption of a precise AF value rather than a broad threshold.45 In light of these findings, we evaluated the impact on INDELpred's performance of incorporating AF as a precise value versus as a thresholded value. Our analysis revealed an enhanced model performance when AF was treated as a specific value (Figure 5D), confirming its significance in the accurate determination of variant pathogenicity. Furthermore, a hierarchical clustering analysis using all 16 features also demonstrated their collective efficacy in discriminating between benign and pathogenic indels (Figure 5E).
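A univariate toy example helps show why a precise AF value can carry more ranking information than a threshold flag. The sketch below is a simplification, not the model comparison reported in Figure 5D: it scores variants by AF alone, the allele frequencies are invented, and the 5% cutoff is used only as an example threshold.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented allele frequencies; label 1 = pathogenic, 0 = benign.
af = [0.0, 0.00001, 0.001, 0.0005, 0.003, 0.02, 0.12]
labels = [1, 1, 1, 0, 0, 0, 0]

AF_THRESHOLD = 0.05  # example cutoff; variants below it are flagged as rare

# Lower AF is treated as more suspicious, so the continuous score is the negated AF.
continuous_score = [-x for x in af]
# Thresholding collapses all rare variants onto the same score, creating ties.
binary_score = [1.0 if x < AF_THRESHOLD else 0.0 for x in af]

auroc_continuous = auroc(labels, continuous_score)
auroc_binary = auroc(labels, binary_score)
print(f"continuous AF: {auroc_continuous:.3f}, thresholded AF: {auroc_binary:.3f}")
```

Here the thresholded score cannot separate rare benign from rare pathogenic variants, so its AUROC (0.625) is lower than that of the continuous score (about 0.917) on the same toy data.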
Discussion
We have developed INDELpred, a machine-learning-based model characterized by its computational efficiency and remarkable accuracy that surpasses other established methods in the prediction of pathogenic indels across a variety of evaluation datasets. INDELpred distinguished itself primarily in two aspects. Firstly, it outpaced the competing methods without compromising performance accuracy. Importantly, INDELpred has demonstrated its potential for clinical utility in large-scale genomic studies, which is essential in the context of precision medicine. Secondly, it incorporated a range of compact yet informative features, including AF, indel length, function-based features, and gene-based features that facilitated the elucidation of their relevance in the pathogenic potential of indels. Through comprehensive analysis, INDELpred showed its capacity to rapidly unravel the genetic underpinnings of disease-causing variants, thereby improving the interpretation of genomic data for accurate diagnoses.
Variant interpretation tools rarely prioritize runtime optimization as much as the initial variant calling steps, partly because SNV prediction tools rely on precomputed databases to expedite the process.45,46,47 However, when tools encounter more complex variants such as indels, SVs, and copy-number variations (e.g., AnnotSV48 and X-CNV49), the computational burden can be overwhelming, and thus, runtime becomes a more pressing concern. INDELpred addressed this concern by leveraging a streamlined set of 16 features highly pertinent to indels, including function-based features that significantly contribute to the prediction outcomes. The post-annotation was based only on essential databases such as gnomAD for AFs and refGene for gene information. Compared to existing tools that can predict both SNVs and indels, such as CADD and CAPICE, INDELpred saved disk space and reduced the computational load significantly, making it highly suitable for genome-wide analyses in clinical settings as well as for processing large public databases. Another challenge that constrains the utility of computing tools in genomics is the accuracy of their prediction.50 The highest AUROC value does not necessarily equate to optimal performance, particularly in the context of imbalanced datasets used for establishing predictive models.51 Here, we have systematically assessed INDELpred along with other competing methods by employing an array of evaluation metrics such as AUROC, MCC, and Cohen's kappa. These metrics gauged the discriminative ability and overall accuracy of the model. Additionally, metrics like the F1 score and average precision score were essential for balancing precision and recall, which is particularly critical in the prediction of rare variants.
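The threshold-dependent metrics mentioned above (F1 score, MCC, and Cohen's kappa) all follow from the four cells of a confusion matrix. The sketch below uses their standard textbook definitions on invented predictions; the function names are hypothetical and the numbers do not come from this study.

```python
import math
from typing import Sequence, Tuple


def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 = pathogenic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(tp: int, fp: int, tn: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp: int, fp: int, tn: int, fn: int) -> float:
    """Agreement between prediction and truth, corrected for chance agreement."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented example: 10 pathogenic and 90 benign variants.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 88

counts = confusion(y_true, y_pred)
accuracy = (counts[0] + counts[2]) / len(y_true)
print(counts, accuracy, f1_score(*counts), mcc(*counts), cohen_kappa(*counts))
```

In this toy example the accuracy is 0.96, whereas the F1 score is 0.8 and both MCC and kappa are about 0.78, showing how class imbalance can flatter accuracy-like summaries.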
In data selection, we have considered the potential for type I and type II loops, that is, data leakage or the presence of individual genes being exclusively either pathogenic or benign, which can artificially inflate model performance.38,52 Since we did not utilize the results from the existing tools, data leakage can be largely mitigated through the strict separation of training and testing datasets, with the testing dataset remaining entirely undisclosed during the feature preprocessing of the training dataset.45,53 Other tools such as CAPICE may encounter the challenge of type I loops due to their reliance on previously developed tools.17 Furthermore, INDELpred alleviated the risk of type II loops using the SharedGene subsets from ClinVarTest,54,55 which comprised only 1,712 loci. Given the potential bias introduced by this relatively small number of loci, we further incorporated the HGMDSharedGene dataset, which includes a total of 35,190 loci, all covered by SharedGene. This inclusion broadened our dataset and diminished the likelihood of falsely elevated accuracy resulting from type II loops.
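A gene-level filter of the kind used to guard against type II loops can be sketched as follows. The sketch assumes that a SharedGene-style subset keeps only variants in genes that carry both pathogenic and benign labels; that reading, the record layout, the gene names, and the function name are all illustrative assumptions rather than the study's actual pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_gene_subset(variants: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep variants whose gene has at least one pathogenic and one benign variant."""
    labels_by_gene: Dict[str, Set[str]] = defaultdict(set)
    for variant in variants:
        labels_by_gene[variant["gene"]].add(variant["label"])
    shared = {
        gene for gene, labels in labels_by_gene.items()
        if {"pathogenic", "benign"} <= labels
    }
    return [variant for variant in variants if variant["gene"] in shared]


# Invented records; GENE_A is mixed, GENE_B is purely pathogenic, GENE_C purely benign.
variants = [
    {"id": "v1", "gene": "GENE_A", "label": "pathogenic"},
    {"id": "v2", "gene": "GENE_A", "label": "benign"},
    {"id": "v3", "gene": "GENE_B", "label": "pathogenic"},
    {"id": "v4", "gene": "GENE_B", "label": "pathogenic"},
    {"id": "v5", "gene": "GENE_C", "label": "benign"},
]

subset = shared_gene_subset(variants)
print([v["id"] for v in subset])
```

Evaluating on such a subset prevents a model from scoring well merely by memorizing which genes are labeled exclusively pathogenic or exclusively benign.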
The features used in INDELpred were primarily calculated based on the pathogenic and benign nature of indels in the training dataset, thus offering clear interpretability. Among them, the function-based features contributed the most toward predicting pathogenic indels, surpassing the influence of gene-based features. This was largely attributed to the use of the ClinVar database for training, where the primary determinant of a variant's classification was its association with disease causation. Genetic variations, encompassing both coding and non-coding regions, can have a profound influence on protein levels and function.56,57 Variants that alter amino acids may disrupt protein stability or specific functions such as molecular interactions or enzymatic activities. Similarly, variations in regulatory elements can affect gene expression, while changes in splicing motifs can result in substantial modifications to protein sequence and structure.58 These alterations can lead to either a loss or gain of protein function, which may ultimately cause disease. In contrast, the impact of gene-specific changes on pathogenicity is often related to particular protein functions or types of disease.59 This aspect was more challenging for INDELpred to capture, as the current training dataset does not explicitly account for the diversity and complexity of gene-specific effects. Therefore, a database that incorporates disease types or severity as quantitative measures is needed for a more nuanced representation of the impact of genes on indel pathogenicity.
The prevalence of polymorphisms within a population was a key indicator for evaluating genetic variants. In the INDELpred model, AF emerged as a single yet highly influential feature, ranking second in its contribution toward the assessment of indel pathogenicity. This underscored its significance in the determination of a variant's potential impact on health. Nevertheless, the utility of this metric was somewhat diminished when analyzing rare variants, necessitating the integration of additional indicators to achieve a more robust analysis. Moreover, guidelines from ACMG and the Association for Molecular Pathology (AMP) posited that a minor AF greater than 5% typically denoted a benign nature of the variant.44 Despite this, previous studies have suggested that the exact value of AF had a more profound influence on model accuracy than a binary categorization based on a threshold.45 Our study supported this view, confirming that the specific AF value markedly affects INDELpred's predictive performance. However, our dataset currently exhibited a paucity of information regarding AF in the East Asian population.29 With the acquisition of more comprehensive data, we anticipate the development of a model that is finely tuned to the genetic characteristics of different regions, thereby enhancing the precision of pathogenicity predictions for diverse populations.
One primary aim of variant deleteriousness prediction methods was to improve the interpretation of variants called by clinical sequencing.44 These prediction methods have been widely applied to the analysis of WGS or whole-exome sequencing data in clinical settings.60,61 However, many methods tended to produce low specificity, resulting in the misclassification of a large number of neutral variants as deleterious. In the clinical context, the imbalance between the number of pathogenic and non-pathogenic variants suggested that setting a stringent threshold for classification is not always practical or informative. Instead, methods that can stratify variants based on their likelihood of being pathogenic can be more useful. For example, the INDELpred method demonstrates a promising approach by identifying pathogenic loci for 17 out of 30 individuals (56.7%) among the top 30 loci. Comparatively, INDELpred shows superior sensitivity and specificity to other methods, suggesting that it is a more effective tool for discerning the most likely pathogenic loci in a clinical setting.
Several limitations should be acknowledged in the present study. First, the lack of databases with quantitative indicators for disease severity and individual disease likelihood hampers the ability to make finely tuned predictions about health outcomes.58 Furthermore, INDELpred is therefore limited in interpreting the cumulative effects of small changes across multiple genes, known as polygenic contributions, and how these changes collectively influence disease development. In addition to the need for quantitative data, incorporating findings from genome-wide association studies62 and protein structural features63 can provide valuable insights into the genetic basis of diseases and, thus, improve the assessment of variant pathogenicity. Additionally, it should be noted that our current clinical sample size is relatively small (30 patients). A larger clinical cohort could offer more robust and dependable evidence of the prediction capabilities in a clinical setting. Future studies with expanded clinical sample sizes are necessary to further validate and enhance the clinical applicability of INDELpred.
In summary, we provided the INDELpred model, a robust tool for the prediction of indel pathogenicity that exhibits high efficiency and notable performance by capitalizing on interpretable features. Our analysis reveals that function-based features are the most significant contributors to the accurate identification of pathogenic indels. INDELpred offers critical genomic insights into indels, enhancing our ability to not only elucidate indels implicated in disease causation but also detect deleterious indels with precision and accuracy. Consequently, INDELpred holds promise for refining diagnostic processes and facilitating the development of personalized therapeutic approaches.
Data and code availability
The code for INDELpred model development and validation is available at https://github.com/yilin-wei98/INDELpred. The raw clinical WGS data, which belong to a previous article,3 were deposited in the CNSA with accession code CNP0000788. The corresponding VCF files may be available from the authors upon reasonable request.
Acknowledgments
The authors wish to express their profound gratitude to Jianbiao Li of BGI Research for his invaluable contribution to this study. This research was generously supported by several key grants: the National Key Research and Development Program of China (No. 2022YFC2502402), the Open Project of State Key Laboratory of Respiratory Disease (No. SKLRD-OP-202309), the specific research fund of The Innovation Platform for Academicians of Hainan Province (No. YSPTZX202118), the National Key Research and Development Program of China (No. 2023YFC2605400), the Key-Area Research and Development Program of Guangdong Province (No. 2023B0303040001), the Guangzhou Basic and Applied Basic Research Foundation (No. 202201010189), and the National Natural Science Foundation of China (No. 32171441). We would also like to thank the China National GeneBank for the support. The authors are deeply thankful for this support, which was pivotal to the success of this research.
Author contributions
Conceptualization, Y.B. and T.Z.; methodology, Y.B. and T.Z.; software, Y.W. and T.Z.; formal analysis, Y.W. and T.Z.; validation, Y.W., X.J., and B.W.; writing – original draft, Y.B., Y.W., and T.Z.; writing – review & editing, Y.B., M.F., and F.L.; supervision, Y.B. and X.J.
Declaration of interests
The authors declare no competing interests.
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xhgg.2024.100325.
Contributor Information
Xin Jin, Email: jinxin@genomics.cn.
Yong Bai, Email: baiyong@genomics.cn.
Web resources
ANNOVAR, https://annovar.openbioinformatics.org/en/latest/
CADD 1.6 post-release 1, https://cadd.gs.washington.edu/
CAPICE v.5.1.1, https://capice.molgeniscloud.org/
ClinVar, https://www.ncbi.nlm.nih.gov/clinvar/
gnomAD, https://gnomad.broadinstitute.org/
HGMD, https://www.hgmd.cf.ac.uk/ac/index.php
INDELpred, https://github.com/yilin-wei98/INDELpred/
OMIM, https://www.omim.org/
VariSNP, http://structure.bmc.lu.se/VariSNP/
VEST-4, https://cravat.us/CRAVAT/
References
- 1. Satam H., Joshi K., Mangrolia U., Waghoo S., Zaidi G., Rawool S., Thakare R.P., Banday S., Mishra A.K., Das G., Malonia S.K. Next-Generation Sequencing Technology: Current Trends and Advancements. Biology. 2023;12:997. doi: 10.3390/biology12070997.
- 2. Sayers E.W., Agarwala R., Bolton E.E., Brister J.R., Canese K., Clark K., Connor R., Fiorini N., Funk K., Hefferon T., et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2019;47:D23–D28. doi: 10.1093/nar/gky1069.
- 3. Zou D., Wang L., Liao J., Xiao H., Duan J., Zhang T., Li J., Yin Z., Zhou J., Yan H., et al. Genome sequencing of 320 Chinese children with epilepsy: a clinical and molecular study. Brain. 2021;144:3623–3634. doi: 10.1093/brain/awab233.
- 4. Yang Y., Zhao S., Sun G., Chen F., Zhang T., Song J., Yang W., Wang L., Zhan N., Yang X., et al. Genomic architecture of fetal central nervous system anomalies using whole-genome sequencing. NPJ Genom. Med. 2022;7:31. doi: 10.1038/s41525-022-00301-4.
- 5. The 100,000 Genomes Project Pilot Investigators. Smedley D., Smith K.R., Martin A., Thomas E.A., McDonagh E.M., Cipriani V., Ellingford J.M., Arno G., Tucci A., et al. 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care — Preliminary Report. N. Engl. J. Med. 2021;385:1868–1880. doi: 10.1056/NEJMoa2035790.
- 6. Turajlic S., Litchfield K., Xu H., Rosenthal R., McGranahan N., Reading J.L., Wong Y.N.S., Rowan A., Kanu N., Al Bakir M., et al. Insertion-and-deletion-derived tumour-specific neoantigens and the immunogenic phenotype: a pan-cancer analysis. Lancet Oncol. 2017;18:1009–1021. doi: 10.1016/S1470-2045(17)30516-8.
- 7. Stenson P.D., Mort M., Ball E.V., Howells K., Phillips A.D., Thomas N.S., Cooper D.N. The Human Gene Mutation Database: 2008 update. Genome Med. 2009;1:13. doi: 10.1186/gm13.
- 8. Sample P.J., Wang B., Reid D.W., Presnyak V., McFadyen I.J., Morris D.R., Seelig G. Human 5′ UTR design and variant effect prediction from a massively parallel translation assay. Nat. Biotechnol. 2019;37:803–809. doi: 10.1038/s41587-019-0164-5.
- 9. Baeza-Centurion P., Miñana B., Valcárcel J., Lehner B. Mutations primarily alter the inclusion of alternatively spliced exons. Elife. 2020;9. doi: 10.7554/eLife.59959.
- 10. Whiffin N., Karczewski K.J., Zhang X., Chothani S., Smith M.J., Evans D.G., Roberts A.M., Quaife N.M., Schafer S., Rackham O., et al. Characterising the loss-of-function impact of 5’ untranslated region variants in 15,708 individuals. Nat. Commun. 2020;11:2523. doi: 10.1038/s41467-019-10717-9.
- 11. Borck G., Zarhrate M., Cluzeau C., Bal E., Bonnefont J.-P., Munnich A., Cormier-Daire V., Colleaux L. Father-to-daughter transmission of Cornelia de Lange syndrome caused by a mutation in the 5′ untranslated region of the NIPBL Gene. Hum. Mutat. 2006;27:731–735. doi: 10.1002/humu.20380.
- 12. Johnston J.J., Williamson K.A., Chou C.M., Sapp J.C., Ansari M., Chapman H.M., Cooper D.N., Dabir T., Dudley J.N., Holt R.J., et al. NAA10 polyadenylation signal variants cause syndromic microphthalmia. J. Med. Genet. 2019;56:444–452. doi: 10.1136/jmedgenet-2018-105836.
- 13. Ellingford J.M., Ahn J.W., Bagnall R.D., Baralle D., Barton S., Campbell C., Downes K., Ellard S., Duff-Farrier C., FitzPatrick D.R., et al. Recommendations for clinical interpretation of variants found in non-coding regions of the genome. Genome Med. 2022;14:73. doi: 10.1186/s13073-022-01073-3.
- 14. Folkman L., Yang Y., Li Z., Stantic B., Sattar A., Mort M., Cooper D.N., Liu Y., Zhou Y. DDIG-in: detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels. Bioinformatics. 2015;31:1599–1606. doi: 10.1093/bioinformatics/btu862.
- 15. Hu J., Ng P.C. SIFT Indel: Predictions for the Functional Effects of Amino Acid Insertions/Deletions in Proteins. PLoS One. 2013;8. doi: 10.1371/journal.pone.0077940.
- 16. Li C., Zhi D., Wang K., Liu X. MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning. Genome Med. 2022;14:115. doi: 10.1186/s13073-022-01120-z.
- 17. Li S., Van Der Velde K.J., De Ridder D., Van Dijk A.D.J., Soudis D., Zwerwer L.R., Deelen P., Hendriksen D., Charbon B., Van Gijn M.E., et al. CAPICE: a computational method for Consequence-Agnostic Pathogenicity Interpretation of Clinical Exome variations. Genome Med. 2020;12:75. doi: 10.1186/s13073-020-00775-w.
- 18. Rentzsch P., Schubach M., Shendure J., Kircher M. CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 2021;13:31. doi: 10.1186/s13073-021-00835-9.
- 19. Douville C., Masica D.L., Stenson P.D., Cooper D.N., Gygax D.M., Kim R., Ryan M., Karchin R. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel). Hum. Mutat. 2016;37:28–35. doi: 10.1002/humu.22911.
- 20. Gulko B., Hubisz M.J., Gronau I., Siepel A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 2015;47:276–283. doi: 10.1038/ng.3196.
- 21. Davydov E.V., Goode D.L., Sirota M., Cooper G.M., Sidow A., Batzoglou S. Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Comput. Biol. 2010;6. doi: 10.1371/journal.pcbi.1001025.
- 22. Schwarz J.M., Rödelsperger C., Schuelke M., Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods. 2010;7:575–576. doi: 10.1038/nmeth0810-575.
- 23. Carter H., Douville C., Stenson P.D., Cooper D.N., Karchin R. Identifying Mendelian disease genes with the Variant Effect Scoring Tool. BMC Genom. 2013;14:S3. doi: 10.1186/1471-2164-14-S3-S3.
- 24. Choi Y., Sims G.E., Murphy S., Miller J.R., Chan A.P. Predicting the Functional Effect of Amino Acid Substitutions and Indels. PLoS One. 2012;7. doi: 10.1371/journal.pone.0046688.
- 25. Shihab H.A., Gough J., Cooper D.N., Stenson P.D., Barker G.L.A., Edwards K.J., Day I.N.M., Gaunt T.R. Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. 2013;34:57–65. doi: 10.1002/humu.22225.
- 26. Adzhubei I., Jordan D.M., Sunyaev S.R. Predicting Functional Effect of Human Missense Mutations Using PolyPhen-2. CP Hum. Genet. 2013;7:Unit7.20. doi: 10.1002/0471142905.hg0720s76.
- 27. Flanagan S.E., Patch A.-M., Ellard S. Using SIFT and PolyPhen to Predict Loss-of-Function and Gain-of-Function Mutations. Genet. Test. Mol. Biomarkers. 2010;14:533–537. doi: 10.1089/gtmb.2010.0036.
- 28. Wang X., Dong Q., Chen G., Zhang J., Liu Y., Cai Y. Frameshift and wild-type proteins are often highly similar because the genetic code and genomes were optimized for frameshift tolerance. BMC Genom. 2022;23:416. doi: 10.1186/s12864-022-08435-6.
- 29. Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
- 30. Liu Y., Yeung W.S.B., Chiu P.C.N., Cao D. Computational approaches for predicting variant impact: An overview from resources, principles to applications. Front. Genet. 2022;13. doi: 10.3389/fgene.2022.981005.
- 31. Landrum M.J., Chitipiralla S., Brown G.R., Chen C., Gu B., Hart J., Hoffman D., Jang W., Kaur K., Liu C., et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 2020;48:D835–D844. doi: 10.1093/nar/gkz972.
- 32. Fokkema I.F.A.C., Velde K.J., Slofstra M.K., Ruivenkamp C.A.L., Vogel M.J., Pfundt R., Blok M.J., Lekanne Deprez R.H., Waisfisz Q., Abbott K.M., et al. Dutch genome diagnostic laboratories accelerated and improved variant interpretation and increased accuracy by sharing data. Hum. Mutat. 2019;40:2230–2238. doi: 10.1002/humu.23896.
- 33. Stenson P.D., Mort M., Ball E.V., Evans K., Hayden M., Heywood S., Hussain M., Phillips A.D., Cooper D.N. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum. Genet. 2017;136:665–677. doi: 10.1007/s00439-017-1779-6.
- 34. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M., Li H. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. doi: 10.1093/gigascience/giab008.
- 35. Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. doi: 10.1093/nar/gkq603.
- 36. Pruitt K.D., Tatusova T., Maglott D.R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. doi: 10.1093/nar/gkl842.
- 37. Schaafsma G.C.P., Vihinen M. VariSNP, A Benchmark Database for Variations From dbSNP. Hum. Mutat. 2015;36:161–166. doi: 10.1002/humu.22727.
- 38. Grimm D.G., Azencott C., Aicheler F., Gieraths U., MacArthur D.G., Samocha K.E., Cooper D.N., Stenson P.D., Daly M.J., Smoller J.W., et al. The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity. Hum. Mutat. 2015;36:513–523. doi: 10.1002/humu.22768.
- 39. Gussow A.B., Petrovski S., Wang Q., Allen A.S., Goldstein D.B. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol. 2016;17:9. doi: 10.1186/s13059-016-0869-4.
- 40. Friedman J.H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 2001;29. doi: 10.1214/aos/1013203451.
- 41. Lundberg S.M., Lee S.-I. In: Advances in Neural Information Processing Systems 30. Guyon I., Luxburg U.V., Bengio S., Wallach H., Fergus R., Vishwanathan S., Garnett R., editors. Curran Associates, Inc.; 2017. A Unified Approach to Interpreting Model Predictions; pp. 4765–4774.
- 42. Rentzsch P., Witten D., Cooper G.M., Shendure J., Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019;47:D886–D894. doi: 10.1093/nar/gky1016.
- 43. Sondka Z., Bamford S., Cole C.G., Ward S.A., Dunham I., Forbes S.A. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat. Rev. Cancer. 2018;18:696–705. doi: 10.1038/s41568-018-0060-1.
- 44. Richards S., Aziz N., Bale S., Bick D., Das S., Gastier-Foster J., Grody W.W., Hegde M., Lyon E., Spector E., et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 2015;17:405–424. doi: 10.1038/gim.2015.30.
- 45.Alirezaie N., Kernohan K.D., Hartley T., Majewski J., Hocking T.D. ClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants. Am. J. Hum. Genet. 2018;103:474–483. doi: 10.1016/j.ajhg.2018.08.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Ioannidis N.M., Rothstein J.H., Pejaver V., Middha S., McDonnell S.K., Baheti S., Musolf A., Li Q., Holzinger E., Karyadi D., et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 2016;99:877–885. doi: 10.1016/j.ajhg.2016.08.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Jagadeesh K.A., Wenger A.M., Berger M.J., Guturu H., Stenson P.D., Cooper D.N., Bernstein J.A., Bejerano G. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat. Genet. 2016;48:1581–1586. doi: 10.1038/ng.3703. [DOI] [PubMed] [Google Scholar]
- 48.Geoffroy V., Herenger Y., Kress A., Stoetzel C., Piton A., Dollfus H., Muller J. AnnotSV: an integrated tool for structural variations annotation. Bioinformatics. 2018;34:3572–3574. doi: 10.1093/bioinformatics/bty304. [DOI] [PubMed] [Google Scholar]
- 49.Zhang L., Shi J., Ouyang J., Zhang R., Tao Y., Yuan D., Lv C., Wang R., Ning B., Roberts R., et al. X-CNV: genome-wide prediction of the pathogenicity of copy number variations. Genome Med. 2021;13:132. doi: 10.1186/s13073-021-00945-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Katsonis P., Wilhelm K., Williams A., Lichtarge O. Genome interpretation using in silico predictors of variant impact. Hum. Genet. 2022;141:1549–1577. doi: 10.1007/s00439-022-02457-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Saito T., Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS One. 2015;10:e0118432. doi: 10.1371/journal.pone.0118432.
- 52. Qorri E., Takács B., Gráf A., Enyedi M.Z., Pintér L., Kiss E., Haracska L. A Comprehensive Evaluation of the Performance of Prediction Algorithms on Clinically Relevant Missense Variants. Int. J. Mol. Sci. 2022;23:7946. doi: 10.3390/ijms23147946.
- 53. Won D.-G., Kim D.-W., Woo J., Lee K. 3Cnet: pathogenicity prediction of human variants using multitask learning with evolutionary constraints. Bioinformatics. 2021;37:4626–4634. doi: 10.1093/bioinformatics/btab529.
- 54. Bu F., Zhong M., Chen Q., Wang Y., Zhao X., Zhang Q., Li X., Booth K.T., Azaiez H., Lu Y., et al. DVPred: a disease-specific prediction tool for variant pathogenicity classification for hearing loss. Hum. Genet. 2022;141:401–411. doi: 10.1007/s00439-022-02440-1.
- 55. Quinodoz M., Peter V.G., Cisarova K., Royer-Bertrand B., Stenson P.D., Cooper D.N., Unger S., Superti-Furga A., Rivolta C. Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity. Am. J. Hum. Genet. 2022;109:457–470. doi: 10.1016/j.ajhg.2022.01.006.
- 56. Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., Reynolds A.P., Sandstrom R., Qu H., Brody J., et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
- 57. Ward L.D., Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nat. Biotechnol. 2012;30:1095–1106. doi: 10.1038/nbt.2422.
- 58. Tabet D., Parikh V., Mali P., Roth F.P., Claussnitzer M. Scalable Functional Assays for the Interpretation of Human Genetic Variation. Annu. Rev. Genet. 2022;56:441–465. doi: 10.1146/annurev-genet-072920-032107.
- 59. Sun B.B., Kurki M.I., Foley C.N., Mechakra A., Chen C.-Y., Marshall E., Wilk J.B., Biogen Biobank Team, et al. Genetic associations of protein-coding variants in human disease. Nature. 2022;603:95–102. doi: 10.1038/s41586-022-04394-w.
- 60. Richter F., Morton S.U., Kim S.W., Kitaygorodsky A., Wasson L.K., Chen K.M., Zhou J., Qi H., Patel N., DePalma S.R., et al. Genomic analyses implicate noncoding de novo variants in congenital heart disease. Nat. Genet. 2020;52:769–777. doi: 10.1038/s41588-020-0652-z.
- 61. Liu H.-Y., Zhou L., Zheng M.-Y., Huang J., Wan S., Zhu A., Zhang M., Dong A., Hou L., Li J., et al. Diagnostic and clinical utility of whole genome sequencing in a cohort of undiagnosed Chinese families with rare diseases. Sci. Rep. 2019;9 doi: 10.1038/s41598-019-55832-1.
- 62. Tam V., Patel N., Turcotte M., Bossé Y., Paré G., Meyre D. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 2019;20:467–484. doi: 10.1038/s41576-019-0127-1.
- 63. Wang M., Sun Z., Akutsu T., Song J. Recent Advances in Predicting Functional Impact of Single Amino Acid Polymorphisms: A Review of Useful Features, Computational Methods and Available Tools. Curr. Bioinf. 2013;8:161–176. doi: 10.2174/1574893611308020004.
Data Availability Statement
The code for INDELpred model development and validation is available at https://github.com/yilin-wei98/INDELpred. The raw clinical WGS data, which were generated in a previous study,3 have been deposited in the CNSA under accession code CNP0000788. The corresponding VCF files are available from the authors upon reasonable request.





