Abstract
Accurate tumor mutation burden (TMB) quantification is critical for immunotherapy stratification, yet remains challenging due to variability across sequencing platforms, tumor heterogeneity, and variant calling pipelines. Here, we introduce TMBquant, an explainable AI-powered caller designed to optimize TMB estimation through dynamic feature selection, ensemble learning, and automated strategy adaptation. Built upon the H2O AutoML framework, TMBquant integrates variant features, minimizes classification errors, and enhances both accuracy and stability across diverse datasets. We benchmarked TMBquant against nine widely used variant callers, including traditional tools (e.g. Mutect2, VarScan2, Strelka2) and recent AI-based methods (DeepSomatic, Octopus), using 706 whole-exome sequencing tumor–control pairs. To evaluate clinical relevance, we further assessed TMBquant through survival analyses across immunotherapy-treated cohorts of non–small cell lung cancer (NSCLC), nasopharyngeal carcinoma (NPC), and the two NSCLC subtypes: lung adenocarcinoma and lung squamous cell carcinoma. In each cohort, TMBquant consistently achieved the highest hazard ratios, demonstrating superior patient stratification compared to all other methods. Importantly, TMBquant maintained robust predictive performance across both high-TMB (NSCLC) and low-TMB (NPC) settings, highlighting its generalizability across cancer types with distinct biological characteristics. These findings establish TMBquant as a reliable, reproducible, and clinically actionable tool for precision oncology. The software is open source and freely available at https://github.com/SomaticCaller/SomaticCaller. To enhance reproducibility, we provide detailed usage instructions and representative code snippets for TMBquant in the Methods section (see Code Availability).
Keywords: variant calling, counting-based biomarker, tumor mutation burden, immunotherapy, heterogeneous samples
Introduction
Tumor mutation burden (TMB) has emerged as a pivotal biomarker in cancer research and immunotherapy, offering critical insights into the genomic landscape of tumors and their potential response to immune checkpoint inhibitors (ICIs) [1–3]. As a quantitative measure of the total number of somatic mutations per megabase of coding regions, TMB has been extensively studied for its ability to predict clinical outcomes in various cancers, including non–small cell lung cancer (NSCLC) and melanoma [4]. Despite its promise, accurately quantifying TMB remains challenging due to the inherent heterogeneity of sequencing data, differences in variant calling algorithms, and variability in tumor purity and sequencing depth across samples [5, 6].
Existing somatic variant callers, such as Mutect2, VarScan2 [7], Strelka2 [8], and VarDict [9], have been widely adopted for TMB calculation. However, their performance varies significantly depending on the sample characteristics, leading to inconsistencies in TMB estimation across diverse datasets [10–13]. This variability complicates the establishment of universal thresholds for clinical decision-making, potentially impacting patient stratification for immunotherapy [14].
Although TMB holds promise, it is important to recognize that PD-L1 expression remains the primary clinically validated biomarker for immunotherapy decision-making in NSCLC. Clinical trials such as KEYNOTE-010, KEYNOTE-024, and KEYNOTE-042 have demonstrated that patients with PD-L1 expression levels ≥50% derive substantial clinical benefit from pembrolizumab treatment [15–17]. However, PD-L1 expression alone does not fully predict response outcomes, as some patients with low or negative PD-L1 levels also benefit from ICIs. This underscores the need for additional or complementary biomarkers, such as TMB, to refine patient stratification and improve predictive accuracy.
Moreover, while ensemble approaches that combine results from multiple callers can enhance overall accuracy, they often require manual intervention and lack the scalability needed for large-scale datasets [18].
In recent years, explainable artificial intelligence (AI) has demonstrated transformative potential in addressing challenges across biomedical research [19–22]. Explainable AI not only enhances predictive accuracy but also provides transparency into the decision-making process, a crucial feature for regulatory approval and clinical adoption [23]. By integrating machine learning algorithms with domain knowledge, explainable AI systems can dynamically adapt to heterogeneous data characteristics, making them particularly well suited for applications like TMB quantification [24–26].
Here, we introduce TMBquant, a novel explainable AI-powered caller designed to address the limitations of existing TMB quantification methods. Built on the H2O AutoML framework, TMBquant automates the selection of optimal variant calling strategies based on sample-specific characteristics [27]. Unlike traditional approaches, TMBquant dynamically adjusts to sequencing depth, tumor purity, and other key features, ensuring high precision and consistency across diverse datasets [28]. Its robust explainability features further enable users to gain actionable insights into the factors influencing model predictions, fostering trust and reproducibility in biomarker evaluation [3].
In this study, we systematically evaluate TMBquant’s performance against seven leading somatic variant callers using a benchmark dataset of 706 tumor–control whole-exome sequencing pairs. We demonstrate TMBquant’s superior accuracy, stability, and adaptability in TMB quantification across heterogeneous samples, paving the way for its adoption in precision oncology workflows. Furthermore, we discuss the potential of explainable AI tools like TMBquant to set new standards for variant calling and biomarker discovery, emphasizing their role in advancing personalized cancer care.
Method
Data preparation
We utilized a dataset comprising 706 tumor–control pairs derived from NSCLC whole-exome sequencing (WES) using the Integrated DNA Technologies (IDT) panel. This dataset was obtained from patients treated with anti-PD-(L)1 monotherapy agents at Sun Yat-sen University Cancer Center as part of clinical trials conducted between December 2015 and August 2017, with data cutoff in January 2019. All patients included in this study met the following eligibility criteria: (i) age >18 years; (ii) Eastern Cooperative Oncology Group performance status of 0–1; (iii) diagnosed with advanced or recurrent NSCLC; (iv) prior failure of first-line platinum-based doublet chemotherapy; and (v) radiologically evaluable disease based on RECIST (Response Evaluation Criteria in Solid Tumors) version 1.1 [18].
Among the 95 Chinese NSCLC patients who received anti-PD-(L)1 monotherapy at Sun Yat-sen University Cancer Center, 75 patients were included in the final analysis. Tumor samples were obtained from formalin-fixed, paraffin-embedded sections of resected tumors or biopsies, with matched normal blood controls. To ensure robust somatic mutation detection, tumor samples had an average sequencing depth of ≥150×, while matched control samples had a depth of ≥80×.
In addition to the NSCLC cohort, we analyzed an independent nasopharyngeal carcinoma (NPC) cohort. This cohort included patients with recurrent or metastatic NPC (R/M NPC), enrolled consecutively between March 2016 and January 2018 across two phase I trials investigating PD-1 inhibitors (camrelizumab and nivolumab; ClinicalTrials.gov: NCT02721589 and NCT02593786). In these studies, camrelizumab was administered at a fixed dose of 200 mg every 2 weeks to 93 patients across dose escalation and expansion phases, while 33 patients received nivolumab during dose escalation. Cohort sizes were determined primarily based on safety endpoints, with expansion phases designed to benchmark objective response rates against historical controls. Tumor samples and paired blood specimens were collected before treatment initiation. All participants provided written informed consent. Incorporating this virus-driven, low-TMB cohort enabled stringent evaluation of TMBquant’s sensitivity and robustness in biologically distinct settings [29].
Sequencing reads underwent quality control using fastp, which included adapter trimming, removal of low-quality bases, and filtering of unpaired reads. The cleaned reads were aligned to the hs37d5 reference genome using BWA-MEM and further processed with Picard MarkDuplicates to remove PCR artifacts. Tumor purity and sequencing depth for each sample were estimated using Facets and bamdst, respectively.
To evaluate TMBquant’s performance across a wide range of sequencing depths and tumor purity levels, we generated an expanded dataset by resampling the original tumor and control samples. Specifically, tumor and control samples were subsampled at proportions of 100%, 80%, 60%, and 40%, and subsequently paired in all possible combinations. This approach resulted in a dataset comprising 11 296 tumor–control pairs, allowing us to rigorously assess TMBquant’s accuracy, robustness, and adaptability under different sample heterogeneity conditions. This comprehensive dataset enabled a thorough evaluation of TMBquant’s ability to stratify patients based on TMB, providing critical insights into its clinical applicability in immunotherapy response prediction.
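As a concrete illustration of this expansion, the short Python sketch below enumerates every tumor and control subsampling combination for the stated proportions and recovers the 11 296 total; the pair identifiers are placeholders rather than actual sample names.

from itertools import product

# Subsampling proportions applied to tumor and matched control samples (see above).
PROPORTIONS = [1.0, 0.8, 0.6, 0.4]
N_ORIGINAL_PAIRS = 706

def expand_pairs(pair_ids):
    """Pair every tumor subsample with every control subsample for each original pair."""
    expanded = []
    for pair_id in pair_ids:
        for tumor_frac, control_frac in product(PROPORTIONS, PROPORTIONS):
            expanded.append((pair_id, tumor_frac, control_frac))
    return expanded

pairs = expand_pairs(range(N_ORIGINAL_PAIRS))
print(len(pairs))  # 706 * 4 * 4 = 11296 simulated tumor-control pairs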
Overview of TMBquant methodology
TMBquant is an explainable AI-powered caller designed to address the dual challenges of enhancing accuracy and ensuring stability in TMB quantification across heterogeneous datasets. By leveraging machine learning algorithms and automated optimization within the H2O AutoML framework, TMBquant dynamically adapts to sample-specific and batch-level variations, achieving high precision and recall even in datasets with significant heterogeneity. The H2O AutoML framework further enhances TMBquant’s explainability by supporting SHAP (Shapley Additive Explanations) analysis, enabling users to quantify the contribution of individual features to model predictions both at the sample level and across batches.
At the batch level, TMBquant minimizes inter-sample variability by identifying an optimal variant calling strategy that ensures consistency across varying sequencing depths and tumor purity levels. At the single-sample level, TMBquant tailors its approach by selecting the most suitable somatic variant caller for each sample, utilizing its diverse feature set to maximize accuracy. Furthermore, the tool integrates robust explainability features, providing actionable insights into how sample characteristics influence model predictions. Together, these strategies establish TMBquant as a reproducible, adaptable, and transparent solution for TMB evaluation in precision oncology workflows. To illustrate this methodology, Fig. 1 presents an overview of the TMBquant framework, highlighting its core components, AI-driven optimization, and explainability mechanisms.
Figure 1.
Overview of the TMBquant framework. The left panel illustrates the model development process based on 706 tumor/normal matched samples subjected to whole-exome sequencing. Samples were systematically downsampled at 80%, 60%, and 40% sequencing depth levels to generate 11 296 simulated datasets. Mutations were detected using seven somatic variant callers (freebayes, lofreq, mutect2, sniper, strelka2, VarDict, and VarScan). Consensus mutations, defined as variants detected by at least three callers simultaneously, were labeled as true mutations. Somatic mutation results were used to construct predictive models with five base learners (GLM, GBM, Deep Learning, Random Forest, and XGBoost), which were combined into a stacked ensemble model. Model explainability was performed using SHAP analysis. Performance evaluation led to the determination of single-sample and batch-sample optimal callers. The right panel outlines the application phase. Independent validation datasets undergo whole-exome sequencing, and TMBquant first determines whether samples should be processed in batch or individually. For batch mode: Mutations are initially called by the batch-sample optimal caller, batch-level predictions are made, and simulated annealing refines the selection of the best caller for each sample. For single-sample mode: Mutations are called directly by the single-sample optimal caller. The final somatic mutation profiles produced after TMBquant optimization are used for downstream TMB quantification and biomarker evaluation. This framework ensures robust, adaptive mutation calling across diverse sequencing conditions and sample heterogeneities.
Enhancing stability in TMB quantification across batches
To ensure stable TMB quantification across heterogeneous datasets, TMBquant employs a batch-level optimization approach aimed at minimizing variance in TMB calculations. The method begins by utilizing mutation calling results from seven widely used somatic variant callers—Mutect2, VarScan2, Strelka2, VarDict, SomaticSniper, FreeBayes, and LoFreq. Variants are initially called using a single default caller, and the resulting mutation data serve as the input feature matrix. For each caller, deviation values (the deviation metric, denoted $E_i$ below) are extracted as response variables to represent caller-specific performance.
At the batch level, the optimization target is to minimize both the total error and the variance of errors across the batch. Let there be $n$ samples in a batch, each represented by a feature vector $\mathbf{x}_i$, and let the error (deviation) metric for sample $i$ be denoted $E_i$. The total error and variance of errors for the batch are given by

$$E_{\text{total}} = \sum_{i=1}^{n} E_i, \qquad \mathrm{Var}(E) = \frac{1}{n}\sum_{i=1}^{n}\left(E_i - \bar{E}\right)^{2}, \quad \text{where } \bar{E} = \frac{1}{n}\sum_{i=1}^{n} E_i.$$
The joint optimization objective can be expressed as

$$\mathcal{L}(\theta) = E_{\text{total}} + \lambda \cdot \mathrm{Var}(E),$$

where $\mathcal{L}(\theta)$ represents the overall loss function, $\lambda$ is a hyperparameter that balances total error and variance, and $\theta$ represents the parameters of the predictive model.
To achieve this, a regression model is trained using the H2O AutoML framework to predict the error metric for each sample. With the predicted error for sample $i$ denoted $\hat{E}_i$, the optimization objective becomes

$$\min_{\theta}\; \sum_{i=1}^{n} \hat{E}_i(\theta) + \lambda \cdot \mathrm{Var}\big(\hat{E}(\theta)\big).$$

H2O AutoML uses a stacked ensemble model combining multiple base learners (GLM, GBM, Random Forest, Deep Learning, and XGBoost) to optimize $\mathcal{L}(\theta)$. The training process leverages 10-fold cross-validation to minimize both total error and variance, ensuring robustness and predictive accuracy.
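To make the batch objective concrete, the minimal Python sketch below evaluates the combined loss for a vector of per-sample error values; the function name, the value of lambda, and the example deviations are hypothetical and only illustrate how the loss combines total error with a stability penalty.

import numpy as np

def batch_loss(errors, lam=0.5):
    """Combined batch objective: total error plus lambda-weighted variance of the errors."""
    e = np.asarray(errors, dtype=float)
    return e.sum() + lam * e.var()  # e.var() is the population variance (1/n) * sum((E_i - mean)^2)

# Two hypothetical caller configurations for the same four-sample batch.
config_a = [0.10, 0.12, 0.11, 0.40]  # lower total error but one unstable sample
config_b = [0.18, 0.19, 0.18, 0.20]  # slightly higher total error, far more stable
print(batch_loss(config_a), batch_loss(config_b))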
The predicted deviation values $\hat{E}_i$ are then passed to a simulated annealing algorithm, which further optimizes the mean and variance of the batch deviations to identify the optimal configuration of variant callers for the batch. This combined approach ensures that TMBquant achieves both accurate and stable TMB quantification at the batch level.
To predict the optimal caller configuration for a batch of samples, a regression model is trained using the H2O AutoML framework. The features (X) include sequencing depth, tumor purity, total mutation count, SNV/InDel ratios, six single-nucleotide variant (SNV) types, and 96 trinucleotide contexts. A stacked ensemble model combines five base learners (GLM, GBM, random forest, deep learning, and XGBoost) and undergoes 10-fold cross-validation to maximize predictive performance. Predicted deviation values for each caller are then used as input to a simulated annealing algorithm, which identifies the optimal caller combination by minimizing the mean deviation across the batch.
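The sketch below outlines this simulated annealing refinement in simplified Python form, assuming a precomputed matrix of predicted deviation values (samples by callers) and an objective that combines the mean and variance of the selected deviations as described above; the annealing schedule echoes the defaults listed later under Code availability, but the implementation is illustrative rather than the released SAOpt.py.

import math
import random

import numpy as np

def anneal_caller_assignment(pred_dev, init_temp=1.0, final_temp=1e-7,
                             cooling_rate=0.8, iters_per_temp=200, lam=0.5, seed=1):
    """Pick one caller per sample so that mean + lam * variance of deviations is minimized.

    pred_dev: numpy array of shape (n_samples, n_callers) holding predicted deviations.
    """
    rng = random.Random(seed)
    n_samples, n_callers = pred_dev.shape

    def cost(assignment):
        d = pred_dev[np.arange(n_samples), assignment]
        return d.mean() + lam * d.var()

    # Greedy initialization: start from each sample's individually best caller.
    current = pred_dev.argmin(axis=1)
    best, best_cost = current.copy(), cost(current)
    temp = init_temp
    while temp > final_temp:
        for _ in range(iters_per_temp):
            candidate = current.copy()
            candidate[rng.randrange(n_samples)] = rng.randrange(n_callers)  # perturb one sample
            delta = cost(candidate) - cost(current)
            if delta < 0 or rng.random() < math.exp(-delta / temp):  # accept better or, occasionally, worse moves
                current = candidate
                if cost(current) < best_cost:
                    best, best_cost = current.copy(), cost(current)
        temp *= cooling_rate  # geometric cooling
    return best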
By leveraging this approach, TMBquant ensures consistent performance across samples with varying sequencing depths and tumor purities. This batch-level strategy is particularly effective in reducing inter-sample variability, thereby enhancing the reliability of TMB quantification in large-scale studies.
Ensuring high accuracy in single-sample TMB quantification
For precise TMB quantification at the single-sample level, TMBquant dynamically selects the optimal somatic variant caller for each sample based on its unique characteristics. This process involves the extraction of a comprehensive feature set from each sample, including sequencing depth, tumor purity, MSI scores (calculated using MSIsensor2), mutation counts, SNV/InDel ratios, and additional mutation-specific metrics. For each sample, the deviation value for all seven callers is calculated, and the caller with the lowest deviation is designated as the optimal one.
To ensure the deviation metric is minimized, we define the optimization target mathematically. Let $\mathbf{x} \in \mathbb{R}^{d}$ represent the feature vector for a given sample, where $d$ is the number of features. For each variant caller $c_j$, the model predicts an output score, and the goal is to minimize the deviation metric $D_j(\mathbf{x})$.
The optimization can be expressed as

$$j^{*} = \arg\min_{j} D_j(\mathbf{x}),$$

where the terms entering $D_j$ are the quantities predicted for caller $c_j$ from the feature vector $\mathbf{x}$. Using H2O AutoML, the stacked ensemble model is trained to approximate these functions as $\hat{f}_{j,1}(\mathbf{x}), \hat{f}_{j,2}(\mathbf{x}), \ldots$, and the deviation metric is then approximated as

$$\hat{D}_j(\mathbf{x}) = D_j\big(\hat{f}_{j,1}(\mathbf{x}),\, \hat{f}_{j,2}(\mathbf{x}),\, \ldots\big).$$
During training, H2O AutoML optimizes the ensemble model by minimizing the expected loss over all samples in the training set:

$$\min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(\hat{D}(\mathbf{x}_i;\theta),\, D(\mathbf{x}_i)\big),$$

where $\theta$ represents the parameters of the base learners (GLM, GBM, Random Forest, etc.) in the ensemble model, and $N$ is the number of training samples.
By leveraging cross-validation and feature importance analysis, H2O AutoML ensures the ensemble model selects the most critical features and optimizes the deviation metric $\hat{D}_j(\mathbf{x})$. As a result, the caller $c_{j^{*}}$ with the lowest deviation metric is chosen for each sample.
To automate and generalize this process, a multi-class classification model is constructed using the H2O AutoML framework. The model uses the extracted features as predictors (X) and the optimal caller (with the minimum deviation) as the response variable (Y). A stacked ensemble model is built by integrating predictions from five base learners (GLM, GBM, Random Forest, Deep Learning, and XGBoost), ensuring robust and accurate predictions. The dataset is split into training (60%) and validation (40%) sets to evaluate model performance. For new samples, TMBquant applies the trained model to recommend the optimal caller, which is then used to perform mutation calling and generate the final TMB quantification.
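A condensed sketch of this classification step using the H2O Python interface is given below; the released TMBquant pipeline is distributed as R scripts (see Code availability), so the file name, column names, and runtime settings here are placeholders rather than the project's actual configuration.

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# One row per sample: extracted features plus the caller with the lowest deviation as response.
data = h2o.import_file("tmbquant_features.csv")             # placeholder file name
data["optimal_caller"] = data["optimal_caller"].asfactor()  # multi-class response

train, valid = data.split_frame(ratios=[0.6], seed=1)       # 60% training / 40% validation
predictors = [c for c in data.columns if c != "optimal_caller"]

aml = H2OAutoML(max_runtime_secs=3600, nfolds=10, seed=1,
                include_algos=["GLM", "GBM", "DRF", "DeepLearning", "XGBoost", "StackedEnsemble"])
aml.train(x=predictors, y="optimal_caller", training_frame=train, validation_frame=valid)

print(aml.leaderboard.head())
recommended = aml.leader.predict(valid)  # predicted optimal caller plus per-class probabilities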
This single-sample strategy ensures that each sample’s unique characteristics are accounted for, resulting in highly accurate and personalized TMB quantification. The approach outperforms traditional one-size-fits-all pipelines by adapting dynamically to sample-specific attributes, making it particularly effective for datasets with high heterogeneity.
By combining batch-level stability optimization with single-sample accuracy enhancement, TMBquant offers a comprehensive framework for robust and precise TMB quantification. These dual strategies ensure that the tool adapts dynamically to sample variability while maintaining accuracy, setting a new standard for TMB evaluation in precision oncology workflows.
To promote transparency and reproducibility, we provide a detailed exposition of the mathematical formulations underlying TMBquant’s optimization strategies, accompanied by visual workflow diagrams for both batch-level and single-sample quantification, in Supplementary Material S3.
Code availability
To ensure full reproducibility and transparency, we have publicly released the complete source code for TMBquant at https://github.com/SomaticCaller/SomaticCaller. The package is designed with flexibility and practical utility in mind. TMBquant supports both single-sample and batch-sample analysis modes, and provides an ensemble learning framework to select the most appropriate variant caller under diverse data conditions. The repository includes detailed installation instructions, example datasets, and execution workflows.
A minimal example for running TMBquant in batch-sample mode and simulated annealing optimization is shown below:
# Step 1: Train batch model
Rscript BatchSample.H2O.R \
    --FeatureType=lofreq \
    --FeatureFile=Feature/lofreq.csv \
    --algorithms=gbm,rf \
    --output=Results \
    --list=InfoList/info.list \
    --ratio=0.7 \
    --addPCA=TRUE \
    --nthreads=16 \
    --memory=32G

# Step 2: Predict on test batch
Rscript BatchSample.predict.R \
    --model=Results/h2o_seed1_model_list.Rdata \
    --ResultDir=Predict \
    --FeatureType=lofreq \
    --NewFile=Feature/lofreq.csv \
    --list=InfoList/info.list \
    --addPCA=TRUE \
    --datatype=lofreq

# Step 3: Optimize caller usage with simulated annealing
python SAOpt.py \
    --input Predict/h2o_seed1_lofreq_Predict.csv \
    --output Predict/h2o_seed1_lofreq_Predict_Opt.tsv \
    --init-temp 1.0 \
    --final-temp 1e-7 \
    --cooling-rate 0.8 \
    --iterations 200 \
    --greedy-init \
    --graphical
Users may refer to the README for additional workflows including single-sample mode, supported variant callers, feature formatting requirements, and parameter tuning strategies. This accessibility facilitates reproducibility and encourages broader community adoption and validation.
Results
Accuracy analysis in single-sample detection
To evaluate the accuracy of TMBquant in detecting the optimal somatic variant caller for single samples through AI-driven selection, we analyzed confusion matrices for both training and testing datasets (Fig. 2). In the training dataset, Top1Call predictions achieved near-perfect accuracy, with VarScan2 and Mutect2 achieving 100% and 99.6% accuracy, respectively. For the testing dataset, Top1Call accuracy remained consistently high, with LoFreq and Mutect2 demonstrating excellent precision. When a secondary prediction (Top2Call) was allowed, the accuracy improved further across all categories, with Mutect2 reaching 99.6% accuracy in the test dataset. These results highlight TMBquant’s ability to adapt to single-sample characteristics through AI-driven optimization, achieving highly accurate detection results. The consistency between training and testing datasets demonstrates the model’s excellent generalization capability. Furthermore, reliable Top2Call predictions suggest that TMBquant effectively identifies robust alternatives when the primary choice is incorrect, making it particularly valuable for complex, heterogeneous datasets.
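For reference, Top1Call and Top2Call accuracies of this kind can be derived from per-class prediction probabilities as in the short sketch below; the probability matrix is a toy example rather than study data.

import numpy as np

def topk_accuracy(prob_matrix, true_labels, k=1):
    """Fraction of samples whose true caller is among the k highest-probability predictions."""
    prob = np.asarray(prob_matrix)
    topk = np.argsort(prob, axis=1)[:, ::-1][:, :k]  # indices of the k most likely callers per sample
    return float(np.mean([truth in row for truth, row in zip(true_labels, topk)]))

# Toy example: four samples (rows) scored against three candidate callers (columns).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.1, 0.4, 0.5],
                  [0.6, 0.3, 0.1]])
truth = [0, 1, 1, 2]
print(topk_accuracy(probs, truth, k=1), topk_accuracy(probs, truth, k=2))  # 0.5 0.75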
Figure 2.
Confusion matrices illustrating the accuracy of TMBquant’s variant caller recommendations at the single-sample level across training and testing datasets. Each matrix compares the actual optimal variant caller (ground truth, x-axis) with the variant caller predicted by TMBquant (y-axis). The upper panels represent Top1Call predictions (highest confidence caller), while the lower panels represent Top2Call predictions (second highest confidence caller). Color intensity indicates the proportion of correct predictions, with darker diagonal elements reflecting higher classification accuracy. Across both training and testing datasets, TMBquant achieves high Top1Call and Top2Call accuracy, especially for dominant variant callers such as LoFreq and Mutect2. These results demonstrate the effectiveness of TMBquant’s stacked ensemble learning strategy in generalizing across diverse data batches.
While these findings demonstrate TMBquant’s accuracy in single-sample detection, the next step was to explore whether its ensemble modeling approach enhances predictive performance compared to individual models.
To validate the effectiveness of ensemble learning in TMBquant, we compared the performance of the stacked ensemble model with individual base models (Fig. 3). The stacked ensemble model significantly outperformed its individual counterparts, achieving a Top1Call accuracy of 79% and a Top2Call accuracy of 91% on the validation dataset. In contrast, base models such as GBM and deep learning achieved lower Top1Call accuracies of 68% and 65%, respectively. These results demonstrate that ensemble learning effectively combines the strengths of diverse algorithms, resulting in superior predictive performance. By leveraging complementary features across models, the ensemble approach enhances both accuracy and robustness, reinforcing its utility for single-sample detection.
Figure 3.
Comparative performance of TMBquant’s stacked ensemble model versus individual base learners in variant caller prediction. The upper panel shows the Top1Call and Top2Call prediction accuracies across different algorithmic models (stacked ensemble, XGBoost, Random Forest, Deep Learning, GBM, GLM) on training (left) and testing (right) datasets. The lower panel displays the Top1Call and Top2Call prediction accuracies stratified by different variant callers selected by TMBquant, highlighting method-specific robustness. Color bars differentiate Top1 and Top2 prediction accuracies. TMBquant’s stacked ensemble consistently outperforms individual models in both training and testing, achieving the highest overall accuracy and demonstrating particular strength in predicting optimal callers for LoFreq and Mutect2 samples. These findings underscore the ensemble model’s robustness and adaptability to complex feature spaces.
Building on the strong overall performance of the stacked ensemble model, we next evaluated the accuracy and stability of its core components, focusing specifically on LoFreq as a critical caller within TMBquant.
The accuracy and stability of LoFreq were evaluated under varying conditions of tumor purity and sequencing depth (Fig. 4). Among all evaluated callers, LoFreq consistently demonstrated the highest accuracy and the lowest deviation, with minimal variance in predictions even in challenging scenarios. These findings validate LoFreq as a critical component of TMBquant, ensuring consistent and precise results across heterogeneous datasets. Its ability to deliver accurate and stable predictions further underscores its importance in supporting TMBquant’s decision-making framework for single-sample analyses.
Figure 4.
Prediction score heatmaps visualizing TMBquant’s confidence levels for Top1Class and Top2Class recommendations across training and testing datasets. Rows correspond to individual samples analyzed by TMBquant, while columns represent predicted variant caller classes. Color intensity reflects prediction confidence scores, with darker shades indicating higher model certainty. Side annotations indicate ground truth caller labels (response), Top1Class, and Top2Class assignments. In both training and testing datasets, TMBquant consistently exhibits high confidence scores for the correct callers, particularly for LoFreq, reinforcing its key role as a stable and reliable component within the optimization framework. This heatmap analysis provides additional evidence supporting TMBquant’s predictive consistency across heterogeneous sample cohorts.
Stability analysis in batch-sample detection
To assess the stability of TMBquant in batch analysis, deviation metrics for all seven callers were calculated across multiple datasets (Fig. 5). The results revealed that TMBquant minimized deviation consistently across batches, with LoFreq and Mutect2 showing the smallest variance among all evaluated callers. This demonstrates TMBquant’s ability to ensure stable performance across heterogeneous batch datasets, a critical factor for large-scale applications.
Figure 5.
Batch-level deviation analysis and optimization results across variant callers evaluated by TMBquant. Left panel: boxplots and scatter plots comparing overall accuracy distributions across six widely used variant callers (FreeBayes, LoFreq, Mutect2, SomaticSniper, Strelka2, and VarDict) for both training (n = 20 batches, blue) and testing datasets (n = 20 batches, red), with each batch derived from systematically downsampled subsets of 706 original whole-exome sequencing samples (totaling 11 296 simulated tumor–control pairs). Each dot represents batch-level accuracy, and gray lines connect corresponding batches across training and testing phases, illustrating significant batch-to-batch variability. Blue and red boxes denote interquartile ranges, with medians indicated by horizontal lines. Across all callers, a substantial decrease in overall accuracy from training to testing phases is observed, emphasizing challenges posed by batch heterogeneity and cross-batch generalization. Right panel: overall accuracy trends for LoFreq caller optimized by TMBquant across batches. Circles represent training batches (n = 20), and triangles represent testing batches (n = 20). Compared to conventional callers, TMBquant optimization consistently achieves higher accuracy and significantly reduces batch-level variability, as evidenced by tighter accuracy distributions across diverse datasets. This figure collectively highlights the critical role of dynamic AI-powered optimization by TMBquant in stabilizing TMB quantification performance across heterogeneous batches.
Building on this, we next evaluated the regression model’s performance using different features to determine the best predictors of batch stability.
Regression models were constructed using features derived from different callers to predict deviation metrics (Fig. 6). Among all tested feature sets, those derived from LoFreq yielded the highest overall prediction accuracy in both training and testing datasets. This indicates that LoFreq’s feature set is particularly well suited for capturing batch-level variance and reinforces the reliability of TMBquant’s regression framework.
Figure 6.
Accuracy matrix across variant caller pairs evaluated and predicted by TMBquant. This heatmap presents the overall accuracy of TMBquant’s variant caller selection across pairwise evaluated (rows) and predicted (columns) variant calling methods. The left panel corresponds to the training dataset, and the right panel corresponds to the testing dataset. Each cell denotes the accuracy score for predicting the correct variant caller given the input features evaluated by a specific caller. Higher values (red) indicate better prediction performance, whereas lower values (blue) highlight misclassification tendencies. The “overall” column summarizes the average accuracy of predicting the correct caller regardless of the evaluated caller. Compared to conventional methods, TMBquant maintains relatively high prediction consistency between training and testing datasets, with LoFreq showing the highest prediction robustness across batches. This figure demonstrates the generalizability of TMBquant’s multi-caller modeling strategy across heterogeneous data contexts.
Having established LoFreq as a strong predictor, we evaluated the generalization capability of the batch-level regression model.
Confusion matrices for the batch-level prediction model were compared between training and testing datasets, revealing consistent performance with minimal prediction errors (Fig. 7). This confirms TMBquant’s ability to generalize across batches while maintaining stable deviation predictions. Such consistency is essential for ensuring reproducible results in large-scale bioinformatics studies.
Figure 7.
Confusion matrices of variant caller predictions by TMBquant across training and testing datasets. The left panel shows the confusion matrix for the training dataset, and the right panel shows the confusion matrix for the testing dataset. Each cell displays the number of instances where a variant caller (row, ground truth) was predicted as another caller (column, predicted label), along with the prediction proportion. Diagonal cells represent correct classifications, while off-diagonal cells indicate misclassifications. Prediction proportions and sample sizes are annotated within each cell to facilitate detailed interpretation. Overall, TMBquant achieves high classification accuracy for dominant callers such as LoFreq and Mutect2 across both training and testing datasets. Performance degradation is more noticeable for less frequently occurring callers (e.g. SomaticSniper and Strelka2) in the testing set, reflecting the expected impact of sample size imbalance. This figure provides detailed insight into the predictive precision and misclassification patterns of TMBquant’s caller optimization framework.
To further enhance the transparency of TMBquant’s predictive framework and bridge computational outputs with clinical interpretation, we conducted a comprehensive SHAP analysis to assess feature contributions to variant caller selection. Detailed results are provided in Supplementary Material S4. Through SHAP-based evaluation, we demonstrate that sample-specific attributes—such as tumor purity, sequencing depth, mutation burden, and trinucleotide contexts—significantly influence model predictions. This analysis provides interpretable, biologically grounded insights into TMBquant’s decision-making process, ultimately reinforcing its clinical applicability and fostering greater confidence in its outputs.
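As an illustration of how such SHAP values can be queried within the H2O ecosystem, the sketch below requests per-feature contributions from a tree-based base learner; the file and model paths are placeholders, only tree-based H2O models expose predict_contributions, and the exact workflow behind Supplementary Material S4 may differ.

import h2o

h2o.init()
frame = h2o.import_file("tmbquant_features.csv")    # placeholder feature table
model = h2o.load_model("Results/gbm_base_learner")  # placeholder path to a tree-based base learner

# Per-sample, per-feature SHAP contributions; the final column is the bias term.
contributions = model.predict_contributions(frame)
print(contributions.head())

# Global view of feature impact (e.g. tumor purity, sequencing depth, trinucleotide contexts).
model.shap_summary_plot(frame)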
Survival analysis based on TMB classification
To further evaluate the clinical utility of TMBquant across diverse tumor types and subtypes, we conducted comprehensive survival analyses using immunotherapy-treated cohorts of NSCLC, NPC, and the two major NSCLC histological subtypes, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). In each cohort, progression-free survival (PFS) was compared between TMB-High and TMB-Low groups as defined by median TMB values calculated independently for each of the nine tested variant callers, including traditional tools (Mutect2, Strelka2, VarScan2, etc.) and recent AI-powered methods (DeepSomatic, Octopus).
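The per-caller comparison follows a standard median-split survival workflow, sketched below with the lifelines package; the column names and helper are illustrative, the published analysis may use different software, and whether the reported hazard ratio refers to the high- or low-TMB group depends on how the group indicator is coded.

from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

def median_split_hr(df, tmb_col="tmb", time_col="pfs_days", event_col="progressed"):
    """Median-split a cohort by TMB (pandas DataFrame, one row per patient); return HR and log-rank P."""
    df = df.copy()
    df["tmb_high"] = (df[tmb_col] > df[tmb_col].median()).astype(int)

    cph = CoxPHFitter()
    cph.fit(df[[time_col, event_col, "tmb_high"]], duration_col=time_col, event_col=event_col)
    hazard_ratio = float(cph.hazard_ratios_["tmb_high"])

    high, low = df[df["tmb_high"] == 1], df[df["tmb_high"] == 0]
    lr = logrank_test(high[time_col], low[time_col],
                      event_observed_A=high[event_col], event_observed_B=low[event_col])
    return hazard_ratio, lr.p_value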
Across all four datasets, TMBquant consistently achieved the highest hazard ratio (HR) among all tested callers, indicating the strongest stratification ability in identifying patients who would benefit from immune checkpoint blockade therapy.
In the NSCLC cohort (Fig. 8), TMBquant exhibited the most significant separation between TMB-High and TMB-Low survival curves, with an HR of 1.917 (95% confidence interval (CI) = 1.171–3.136, P = .005). This outperformed all other tools, whose HRs ranged from 1.060 to 1.833. Notably, even compared with the latest AI-based tools such as Octopus (HR = 1.833) and DeepSomatic (HR = 1.060), TMBquant retained superior discriminative power.
Figure 8.
Kaplan–Meier survival analysis in NSCLC cohort stratified by TMB quantification across different variant callers. Kaplan–Meier plots illustrate progression-free survival (PFS) among 75 NSCLC patients stratified into high-TMB (n = 37) and low-TMB (n = 38) groups, according to median TMB thresholds computed by each indicated variant caller. Each subplot represents survival curves generated by a specific caller, annotated with hazard ratio (HR), 95% confidence interval (CI), and log-rank P-value. Among the tested methods, TMBquant exhibited the strongest prognostic stratification capability (HR = 1.917, 95% CI = [1.171–3.136], log-rank P = .005), followed by octopus (HR = 1.833, 95% CI = [1.119–3.002], P = .0094) and sniper (HR = 1.769, 95% CI = [1.083–2.891], P = .0142). The comparative performance of other callers was variable: Lofreq (HR = 1.703, 95% CI = [1.045–2.776], P = .0228), VarScan (HR = 1.605, 95% CI = [0.991–2.602], P = .0445), Mutect2 (HR = 1.378, 95% CI = [0.851–2.229], P = .1781), VarDict (HR = 1.323, 95% CI = [0.820–2.135], P = .2387), strelka2 (HR = 1.264, 95% CI = [0.782–2.044], P = .3273), freebayes (HR = 1.288, 95% CI = [0.796–2.083], P = .2898), and deepsomatic (HR = 1.060, 95% CI = [0.658–1.707], P = .8081). Time is shown in days on the x-axis, and progression-free survival probability is plotted on the y-axis. These results underscore the robustness and superior clinical utility of TMBquant in accurately distinguishing high-risk and low-risk NSCLC patient groups compared to other conventional and AI-based variant callers.
To evaluate the cross-cancer generalizability of TMBquant, we further applied the same analytical framework to an independent immunotherapy-treated NPC cohort (Fig. 9). NPC, a virally driven malignancy predominantly associated with Epstein–Barr virus infection, is characterized by a generally low TMB and an immunologically active but distinct tumor microenvironment. These features make NPC an ideal model to assess the robustness and sensitivity of TMB-based stratification methods in low-TMB contexts. Remarkably, TMBquant maintained its superior performance in this setting, achieving the highest hazard ratio (HR = 2.460; 95% CI = 1.348–4.489) among all evaluated methods, outperforming both conventional tools such as Mutect2 (HR = 2.246) and FreeBayes (HR = 1.734), as well as AI-driven approaches like DeepSomatic (HR = 2.226).
Figure 9.
Kaplan–Meier survival analysis in NPC cohort stratified by TMB quantification across different variant callers. Kaplan–Meier plots illustrate progression-free survival (PFS) among 51 NPC patients, stratified into high-TMB (n = 25) and low-TMB (n = 26) groups, based on cohort-specific median TMB thresholds computed using various variant callers. Each subplot presents survival curves corresponding to a specific caller, annotated with hazard ratio (HR), 95% confidence interval (CI), and log-rank P-value. Among all evaluated methods, TMBquant shows the strongest prognostic stratification (HR = 2.460, 95% CI = [1.348–4.489], log-rank P = .0001). Notably strong stratification performance is also observed for octopus (HR = 2.382, 95% CI = [1.310–4.331], P = .0002), Mutect2 (HR = 2.246, 95% CI = [1.243–4.058], P = .0007), deepsomatic (HR = 2.226, 95% CI = [1.233–4.018], P = .0008), strelka2 (HR = 2.169, 95% CI = [1.205–3.906], P = .0012), and lofreq (HR = 2.071, 95% CI = [1.155–3.713], P = .0032). Moderate or non-significant stratification was observed with freebayes (HR = 1.734, 95% CI = [0.981–3.065], P = .0316), sniper (HR = 1.764, 95% CI = [0.997–3.122], P = .0267), VarScan (HR = 1.543, 95% CI = [0.879–2.709], P = .0956), and VarDict (HR = 0.950, 95% CI = [0.546–1.655], P = .8493). Time (days) is represented on the x-axis, and progression-free survival probability is on the y-axis. These findings highlight the robustness and broad clinical utility of TMBquant, demonstrating its superior performance even in virus-driven, low-TMB tumor types such as NPC, compared to conventional and other AI-based variant callers.
To further assess TMBquant’s performance in NSCLC subtypes, we conducted separate analyses on LUAD (Fig. 10) and LUSC (Fig. 11) samples. In LUAD, TMBquant achieved the highest HR of 2.236 (95% CI = 1.175–4.254), clearly surpassing other methods such as Octopus (HR = 1.789) and Mutect2 (HR = 1.731). In LUSC, although survival stratification was less pronounced overall owing to the smaller sample size, TMBquant still yielded the strongest HR of 1.732 (95% CI = 0.759–3.956), showing better separation than any of the other nine tools tested.
Figure 10.
Kaplan–Meier survival analysis in LUAD subtype of NSCLC stratified by TMB quantification across different variant callers. Kaplan–Meier plots depict progression-free survival (PFS) among 45 LUAD patients classified into high-TMB (n = 22, except VarScan: n = 21) and low-TMB (n = 23, VarScan: n = 24) groups, based on cohort-specific median TMB thresholds derived from each indicated variant caller. Each subplot illustrates survival curves annotated with hazard ratio (HR), 95% confidence interval (CI), and log-rank P-value. TMBquant demonstrates the strongest prognostic capability (HR = 2.236, 95% CI = [1.175–4.254], P = .0054), followed by sniper (HR = 1.863, 95% CI = [0.988–3.513], P = .0333), octopus (HR = 1.789, 95% CI = [0.952–3.360], P = .0483), Mutect2 (HR = 1.731, 95% CI = [0.924–3.243], P = .0622), lofreq (HR = 1.713, 95% CI = [0.915–3.207], P = .0678), VarScan (HR = 1.634, 95% CI = [0.886–3.012], P = .1000), VarDict (HR = 1.602, 95% CI = [0.866–2.964], P = .1120), freebayes (HR = 1.272, 95% CI = [0.689–2.349], P = .4246), strelka2 (HR = 1.220, 95% CI = [0.662–2.248], P = .5115), and deepsomatic (HR = 1.029, 95% CI = [0.561–1.886], P = .9254). Survival time (days) is represented on the x-axis and progression-free survival probability on the y-axis. This figure emphasizes the robustness and superior predictive performance of TMBquant in stratifying molecularly heterogeneous LUAD patients, further underscoring its utility compared to conventional and other AI-based callers.
Figure 11.
Kaplan–Meier survival analysis in LUSC subtype of NSCLC stratified by TMB quantification across different variant callers. Kaplan–Meier plots illustrate progression-free survival (PFS) among 27 LUSC patients, categorized into high-TMB (n = 13) and low-TMB (n = 14) groups based on cohort-specific median TMB thresholds derived from various variant callers. Each subplot depicts survival curves annotated with hazard ratio (HR), 95% confidence interval (CI), and log-rank P-value. In this cohort, TMB-based stratification generally showed modest performance. Nevertheless, TMBquant consistently demonstrated relatively superior stratification (HR = 1.732, 95% CI = [0.759–3.956], P = .1717) compared to other variant callers, including octopus (HR = 1.215, 95% CI = [0.536–2.754], P = .6365), sniper and VarScan (both HR = 1.204, 95% CI = [0.531–2.729], P = .6523), VarDict (HR = 1.194, 95% CI = [0.527–2.707], P = .6661), strelka2 (HR = 1.189, 95% CI = [0.525–2.695], P = .6756), lofreq (HR = 1.156, 95% CI = [0.510–2.620], P = .7248), freebayes (HR = 1.132, 95% CI = [0.500–2.565], P = .7639), deepsomatic (HR = 1.115, 95% CI = [0.492–2.524], P = .7932), and Mutect2 (HR = 1.047, 95% CI = [0.462–2.371], P = .9120). Time in days is plotted on the x-axis and PFS probability on the y-axis. These findings emphasize the adaptability and relative strength of TMBquant for tumors with more complex and heterogeneous mutational profiles, such as lung squamous cell carcinoma (LUSC), despite inherently limited TMB stratification power across this subtype.
These findings collectively support that TMBquant provides robust and generalizable predictive power across both high-TMB (e.g. NSCLC) and virus-driven low-TMB tumors (e.g. NPC). The superior performance of TMBquant stems from its explainable AI foundation and dynamic optimization framework, which adaptively adjusts variant calling strategies to minimize false positives and false negatives in TMB computation.
To facilitate a comprehensive and transparent comparison, detailed survival analysis results—including HRs, 95% CIs, log-rank P-values, and stratified sample counts—for each variant calling tool across the NSCLC, NPC, LUAD, and LUSC cohorts are provided in Supplementary_Table_Survival_Analysis_Details.xlsx.
By consistently outperforming both conventional and state-of-the-art AI variant callers in immunotherapy stratification tasks, TMBquant emerges as a promising and clinically relevant tool for precision oncology, capable of guiding treatment decisions across a wide spectrum of cancer types.
While our survival analyses systematically evaluated TMBquant’s performance across different cancer types and histological subtypes, we acknowledge that the current dataset did not allow further stratification based on the specific ICIs administered (e.g. nivolumab, pembrolizumab, or atezolizumab). This limitation, stemming from treatment annotation granularity and cohort sample sizes, is discussed in the Discussion. Future studies leveraging larger and uniformly treated cohorts will be critical to fully elucidate TMBquant’s predictive consistency across different immunotherapy regimens.
Discussion
Robustness and generalizability of TMBquant
Accurate and stable TMB quantification is critical for its clinical utility in biomarker-driven patient stratification and immunotherapy response prediction. A major challenge in this domain is batch-level variability, which arises due to differences in sequencing depth, tumor purity, and platform-specific biases. Addressing this issue requires not only precision in variant calling but also robust adaptation to dataset heterogeneity.
To rigorously assess the robustness and reliability of TMBquant, we performed correlation analysis between predicted and true deviation values, providing a quantitative measure of its ability to capture batch-level variance. The strong correlation observed (Fig. 12) suggests that TMBquant effectively minimizes deviations, maintaining consistent performance across diverse datasets. This stability is paramount in ensuring reproducible biomarker evaluation, as variability in mutation burden estimation can significantly impact downstream clinical interpretations, including patient eligibility for ICI therapies. By controlling these fluctuations, TMBquant enhances the translatability of genomic insights into clinical decision-making.
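This check reduces to a Pearson correlation between predicted and observed deviation values for each caller-specific feature set, as in the brief sketch below with illustrative numbers.

import numpy as np
from scipy.stats import pearsonr

# Illustrative values: predicted versus observed deviations for one caller's feature set.
predicted = np.array([0.12, 0.30, 0.25, 0.08, 0.40, 0.22])
observed = np.array([0.10, 0.28, 0.27, 0.11, 0.37, 0.20])

r, p_value = pearsonr(predicted, observed)
print(f"Pearson r = {r:.3f}, P = {p_value:.3g}")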
Figure 12.
Correlation heatmaps between predicted and observed deviation values across variant caller features in TMBquant. Left and right panels represent the training and testing datasets, respectively. Each row corresponds to an individual sample analyzed by TMBquant, and each column corresponds to features extracted from a specific variant caller. Heatmap color intensity reflects the Pearson correlation coefficient between the predicted deviation and the true observed deviation for each feature–caller pair, ranging from −1 (strong negative correlation, blue) to +1 (strong positive correlation, red). Dendrograms above the heatmaps illustrate hierarchical clustering based on feature similarity, revealing structured patterns across different callers. Consistent strong positive correlations (reddish color) across both training and testing datasets confirm that TMBquant accurately captures batch-level variability and maintains generalizability when applied to unseen data. This result supports the robustness of the feature extraction and model training strategies used in TMBquant for optimizing variant caller performance prediction across heterogeneous sample cohorts.
In addition to batch-level evaluation, we investigated the role of ensemble learning in mitigating overfitting within single-sample variant calling models. Overfitting is a well-documented issue in machine learning–driven bioinformatics, where models may learn dataset-specific noise rather than generalizable patterns. Our analysis demonstrated that TMBquant’s stacked ensemble approach effectively reduces overfitting in base models, ensuring robust generalization to unseen datasets. This is particularly important given that existing variant callers often suffer from dataset-dependent performance biases, which can undermine TMB consistency when applied to real-world patient cohorts.
Moreover, the ability of TMBquant to adapt dynamically to sequencing heterogeneity distinguishes it from conventional approaches that employ static variant calling strategies. Traditional methods typically apply a one-size-fits-all threshold for TMB calculation, which may lead to misclassification of tumors with intermediate mutation burdens. By leveraging explainable AI-powered optimization, TMBquant adjusts variant selection strategies at both the single-sample and batch levels, ensuring that TMB estimates remain clinically interpretable and reproducible across diverse tumor types and sequencing conditions.
Taken together, our findings establish TMBquant as a reliable and adaptive framework for TMB quantification, bridging the gap between bioinformatics-driven variant analysis and clinical biomarker evaluation. By integrating robust statistical modeling, AI-driven optimization, and batch-level error correction, TMBquant provides a generalizable, high-fidelity solution for TMB estimation. Further details on stability assessments and the role of ensemble learning in overfitting mitigation are provided in Supplementary Materials S1 and S2, respectively. Future studies will explore the extension of TMBquant to additional cancer types and multi-omics integration, further enhancing its applicability in precision oncology workflows.
Limitations and clinical interpretation of TMB
Despite the enhanced accuracy and reproducibility of TMB quantification achieved by TMBquant, several important limitations inherent to the clinical use of TMB as a biomarker warrant careful consideration.
First, while multiple clinical studies such as CheckMate 568 and CheckMate 227 have demonstrated that patients with high TMB may derive greater benefit from ICIs, other large-scale trials—including IMpower150 and KEYNOTE-189—have failed to establish a consistent correlation between TMB levels and clinical outcomes. These discrepancies may be attributed to multiple factors, including differences in sequencing platforms, gene panel sizes, variant calling pipelines, tumor histology, patient selection criteria, and treatment heterogeneity across studies. Such variability complicates the generalization of TMB as a standalone biomarker and highlights the need for caution when interpreting TMB levels in clinical decision-making.
Second, the lack of a standardized threshold to define “TMB-high” status further complicates its clinical application. Different studies have employed diverse cutoff values—e.g. ≥10 mutations/Mb in CheckMate 227 and ≥13.5 mutations/Mb in the exploratory analysis of KEYNOTE-158—thus limiting cross-study comparability and hindering the development of unified clinical guidelines. While TMBquant provides highly consistent TMB measurements across heterogeneous datasets, it remains agnostic to the absolute threshold chosen for clinical stratification. This design ensures flexibility, allowing users to adopt study-specific or institution-specific cutoff values without compromising the accuracy and reproducibility of TMB assessment.
To further address the impact of variable TMB thresholds on model performance, we conducted a sensitivity analysis by systematically varying the TMB cutoff values from 8 to 25 mutations/Mb, including thresholds of 10 and 13.5 mutations/Mb used in prior clinical studies (CheckMate 227, KEYNOTE-158). The detailed results are provided in Supplementary_Table_TMB_Cutoff_Sensitivity_Analysis.xlsx.
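Schematically, the sweep recomputes the hazard ratio for each absolute cutoff, analogous to the median-split workflow sketched earlier; the helper, column names, and the cohort data frame are hypothetical.

import numpy as np
from lifelines import CoxPHFitter

def hr_at_cutoff(df, cutoff, tmb_col="tmb", time_col="pfs_days", event_col="progressed"):
    """Hazard ratio for a TMB-high indicator defined by an absolute cutoff in mutations/Mb."""
    d = df.copy()
    d["tmb_high"] = (d[tmb_col] >= cutoff).astype(int)
    if d["tmb_high"].nunique() < 2:  # cutoff leaves one group empty
        return np.nan
    cph = CoxPHFitter()
    cph.fit(d[[time_col, event_col, "tmb_high"]], duration_col=time_col, event_col=event_col)
    return float(cph.hazard_ratios_["tmb_high"])

# Cutoffs from 8 to 25 mutations/Mb, explicitly including the trial-derived 10 and 13.5.
cutoffs = sorted(set(np.arange(8.0, 26.0, 1.0)) | {10.0, 13.5})
print(cutoffs)
# results = {c: hr_at_cutoff(cohort_df, c) for c in cutoffs}  # cohort_df: per-patient TMB and PFS data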
Our findings show that TMBquant maintained consistently strong predictive performance across a wide range of TMB thresholds. Notably, when the threshold exceeded 18 mutations/Mb, TMBquant achieved the best stratification performance compared to other variant calling methods. At lower thresholds (<18 mutations/Mb), TMBquant’s performance remained close to the best, demonstrating its robustness to varying clinical definitions of TMB-high status.
These results support the flexibility and generalizability of TMBquant under diverse clinical settings, further reinforcing its applicability in biomarker-driven immunotherapy stratification.
Third, TMB alone does not capture the full complexity of tumor-immune interactions. Factors such as PD-L1 expression, the composition of the tumor microenvironment, interferon signaling, and specific genomic alterations (e.g. STK11 or KEAP1 mutations) significantly modulate immune responsiveness. Consequently, integrating TMBquant-based TMB measurements with additional biomarkers may offer a more comprehensive framework for guiding immunotherapy decisions.
Future directions include validating TMBquant predictions in larger, prospective ICI-treated cohorts and evaluating its utility within multi-biomarker frameworks that incorporate TMB, PD-L1 expression, immune gene signatures, and circulating tumor DNA-based assays.
Impact of immunotherapy regimen heterogeneity
In our survival analyses, TMBquant demonstrated consistently strong performance across multiple cancer types and histological subtypes. However, we acknowledge that the validation cohorts included patients treated with different ICIs, such as nivolumab, pembrolizumab, and atezolizumab.
Due to limitations in the available treatment annotation and cohort sample sizes, we were unable to perform a detailed subgroup analysis stratified by the specific ICI administered.
The heterogeneity of immunotherapy regimens may introduce variability in treatment response, even among patients with similar TMB levels, as different PD-1/PD-L1 inhibitors and combination regimens can have distinct clinical efficacies. This potential confounding factor is an important limitation to consider when interpreting the survival analyses presented.
Future research using prospectively collected datasets with uniform treatment protocols and detailed drug-specific annotations will be essential to further evaluate TMBquant’s predictive performance across distinct immunotherapy modalities. Such analyses will help refine its clinical applicability and inform optimal integration strategies for TMBquant into precision oncology frameworks.
Impact of training data composition and cancer type generalizability
TMBquant was designed with the primary goal of improving the technical accuracy of TMB quantification by enhancing the precision and recall of somatic variant calling. The model was trained on a cohort of non–immunotherapy-treated samples, focusing specifically on optimizing mutation detection to better reflect the true mutational landscape, which underpins TMB calculation. This training design aims to minimize technical biases and produce a reproducible, high-fidelity TMB metric that can subsequently serve as a more reliable biomarker for clinical interpretation.
Nonetheless, we acknowledge that training exclusively on non–ICI-treated samples may not fully capture the biological and immunogenic nuances associated with tumors responding to ICIs. Although the fundamental processes governing somatic mutation generation are shared across tumors irrespective of treatment exposure, the tumor-immune interplay may differ between treated and untreated contexts.
Future efforts incorporating ICI-treated patients into training cohorts could refine TMBquant by embedding immunotherapy-specific mutational patterns, thereby further enhancing its predictive relevance for ICI outcomes.
Regarding cancer type generalizability, the current study primarily focused on NSCLC due to the extensive availability of high-quality WES datasets and the clinical importance of TMB in NSCLC immunotherapy.
To preliminarily evaluate TMBquant’s robustness beyond NSCLC, we further validated its performance in an independent NPC cohort—characterized by low TMB and viral oncogenesis—as well as in two NSCLC histological subtypes: LUAD and LUSC. Across these diverse cohorts, TMBquant consistently achieved superior stratification performance compared to conventional and AI-based variant callers.
These results collectively support the broader applicability of TMBquant across different tumor types; however, future prospective studies across a wider range of malignancies, such as melanoma, gastric cancer, and bladder cancer, are necessary to further confirm its generalizability and optimize its clinical deployment.
Comparative context: TMB versus established and emerging biomarkers
Although TMB has garnered increasing attention as a predictive biomarker for immunotherapy, it is important to recognize that PD-L1 expression remains the primary clinically validated biomarker in NSCLC. Clinical trials such as KEYNOTE-010, KEYNOTE-024, and KEYNOTE-042 have consistently demonstrated that PD-L1 expression levels ≥50% are associated with improved outcomes with pembrolizumab treatment. However, PD-L1 alone does not fully account for all responders and non-responders to ICIs, highlighting the need for additional or complementary biomarkers.
Beyond PD-L1, several other biomarkers have shown promise in predicting immunotherapy responses, including T-cell inflammation signatures (e.g. KEYNOTE-042), microsatellite instability-high (MSI-H) status (e.g. CheckMate 142), and circulating tumor DNA-based TMB (bTMB) assessed in trials such as B-F1RST and MYSTIC. Each of these biomarkers captures distinct aspects of tumor-intrinsic or immune-related biology. TMBquant, by improving the technical accuracy and stability of TMB measurement through AI-optimized variant calling, serves as a complementary approach that can enhance the predictive landscape when integrated with these established markers.
Although direct comparisons between TMB and PD-L1 expression, MSI-H status, or bTMB were not feasible in the current study due to dataset limitations, we have discussed their relative advantages and limitations. We envision that future prospective studies combining TMBquant outputs with PD-L1 levels, immune gene signatures, and liquid biopsy–based assays will establish more comprehensive multi-biomarker strategies to optimize patient stratification and therapeutic decision-making in precision oncology.
Future directions for TMBquant development and clinical translation
Despite the promising performance demonstrated by TMBquant in technical validation and survival analyses, several avenues for future enhancement remain.
First, given the growing recognition that TMB alone may not be sufficient to fully capture tumor immunogenicity, future work will focus on integrating TMBquant outputs with other complementary biomarkers. Specifically, combining TMB with PD-L1 expression levels, interferon-γ gene signatures, tumor microenvironment profiling, and circulating tumor DNA–based analyses may enhance predictive accuracy and better inform immunotherapy decision-making. Such a multi-biomarker framework could address the inherent heterogeneity of tumor-immune interactions and improve patient stratification.
Second, while TMBquant currently optimizes variant calling based on general tumor genomic features, its performance could be further refined by incorporating immunotherapy-treated cohorts into model training. This would enable the model to learn immunotherapy-specific mutational patterns, potentially improving its ability to predict treatment outcomes.
Finally, to support clinical translation, prospective validation of TMBquant in real-world immunotherapy cohorts is essential. Similar to how circulating tumor DNA-based TMB (bTMB) was validated in trials such as B-F1RST and MYSTIC, dedicated clinical studies are necessary to rigorously evaluate TMBquant’s predictive performance in prospective settings. These efforts will be critical steps toward establishing TMBquant as a robust component of precision oncology pipelines.
Conclusion
In this study, we present TMBquant, an explainable AI-driven variant caller designed to advance the accuracy, stability, and clinical interpretability of TMB quantification across heterogeneous cancer samples. By integrating dynamic feature selection, ensemble learning, and batch-level optimization within the H2O AutoML framework, TMBquant effectively addresses key challenges such as variability in sequencing depth, tumor purity, and mutation calling pipelines.
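For orientation, the following minimal sketch shows the kind of H2O AutoML ensemble training that this framework builds on. The feature names, label column, input file, and configuration are illustrative assumptions; the released TMBquant repository should be consulted for the actual training code.

```python
# Minimal H2O AutoML sketch: train an ensemble over per-variant/per-sample
# features to classify true somatic calls. File and column names are
# hypothetical placeholders, not TMBquant's actual configuration.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("variant_features.csv")
target = "is_true_somatic"                       # hypothetical binary label
features = [c for c in frame.columns if c != target]
frame[target] = frame[target].asfactor()         # treat as a classification task

aml = H2OAutoML(max_models=20, seed=42, sort_metric="AUC")
aml.train(x=features, y=target, training_frame=frame)
print(aml.leaderboard.head())                    # ranked base learners and stacked ensembles
```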
Comprehensive benchmarking against nine widely used somatic variant callers—including both traditional methods and AI-powered tools like DeepSomatic and Octopus—demonstrated that TMBquant achieves superior accuracy, lower variability, and enhanced predictive robustness. Beyond the initial validation on 706 tumor–normal whole-exome pairs, we extended survival analyses to multiple independent immunotherapy-treated cohorts, including NSCLC, NPC, and the NSCLC subtypes LUAD and LUSC. In all settings, TMBquant consistently achieved the highest hazard ratios (HRs), confirming that it stratifies patients more effectively than existing methods, even across tumors with diverse mutational burdens and biological contexts.
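As an illustration of the stratification underlying these survival analyses, the sketch below dichotomizes patients at an assumed threshold of 10 mutations/Mb and estimates the corresponding hazard ratio for progression-free survival. The threshold, column names, and input file are placeholders for this example rather than the study's actual settings.

```python
# Illustrative TMB-High vs TMB-Low stratification with a log-rank test and a
# univariate Cox hazard ratio. Threshold and column names are assumptions.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

df = pd.read_csv("cohort_tmb_pfs.csv")            # columns: pfs_months, progressed, tmb
df["tmb_high"] = (df["tmb"] >= 10).astype(int)    # assumed cutoff of 10 mut/Mb

high, low = df[df["tmb_high"] == 1], df[df["tmb_high"] == 0]
result = logrank_test(
    high["pfs_months"], low["pfs_months"],
    event_observed_A=high["progressed"], event_observed_B=low["progressed"],
)
print("log-rank p-value:", result.p_value)

# Univariate Cox model: hazard ratio for TMB-High relative to TMB-Low.
cph = CoxPHFitter().fit(
    df[["pfs_months", "progressed", "tmb_high"]],
    duration_col="pfs_months", event_col="progressed",
)
print(cph.hazard_ratios_)
```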
Importantly, TMBquant demonstrated robust performance not only in high-TMB cancers such as NSCLC but also in virus-driven, low-TMB malignancies like NPC, highlighting its unique generalizability and clinical applicability across a broad spectrum of cancer types. This capability addresses a critical need in precision oncology for biomarkers that remain reliable across varying tumor ecosystems.
Beyond quantitative performance, TMBquant’s explainability features—via SHAP-based interpretation—offer actionable insights into the contribution of sample-specific characteristics to TMB estimation. This transparency bridges the gap between computational predictions and clinical decision-making, fostering greater trust and facilitating adoption in translational research and clinical settings.
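To show what such SHAP-based interpretation looks like in practice, the snippet below applies the shap library to a generic gradient-boosted classifier over hypothetical sample-level features; TMBquant's own explainability module may differ in model type and feature set.

```python
# Generic SHAP illustration: attribute a model's predictions to sample-level
# characteristics (e.g. depth, purity, caller agreement). Feature names and
# the input file are hypothetical; this is not TMBquant's internal code.
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("sample_features.csv")           # hypothetical sample-level table
X, y = df.drop(columns=["label"]), df["label"]

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summary plot: per-feature contribution of sample characteristics to predictions.
shap.summary_plot(shap_values, X)
```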
As an open-source tool, TMBquant provides a robust, scalable, and interpretable solution for TMB evaluation across heterogeneous datasets. Future development efforts will focus on expanding its utility through integration with multi-omics data, exploration of additional mutation classes, and prospective validation in clinical trials to further solidify its role in improving immunotherapy outcomes.
In conclusion, TMBquant establishes a new benchmark for TMB quantification by combining cutting-edge machine learning with explainable modeling principles. Its superior predictive performance, generalizability across cancers, and clinical transparency position TMBquant as a next-generation tool for biomarker discovery and personalized oncology.
Key Points
TMBquant utilizes an explainable AI-powered framework to dynamically adapt to sample-specific characteristics, ensuring highly accurate and consistent tumor mutation burden (TMB) quantification across heterogeneous datasets.
By employing H2O AutoML for batch-level error and variance minimization, TMBquant ensures stability and reproducibility in TMB evaluation, even in datasets with varying sequencing depth and tumor purity.
TMBquant’s machine learning–based ensemble modeling dynamically selects the optimal variant caller for each sample, significantly reducing false-positive and false-negative rates and achieving superior precision in single-sample analysis.
Integrated explainability features, including SHAP analysis, provide actionable insights into the relationship between sample characteristics and model predictions, enhancing transparency and fostering trust among researchers and clinicians.
TMBquant demonstrated the strongest performance in survival analyses, outperforming other tools in stratifying patients into TMB-High and TMB-Low groups; the robust association between TMB levels and progression-free survival underscores its potential for clinical application in precision oncology.
Supplementary Material
Acknowledgements
We thank all of the faculty members and graduate students who discussed the mathematical and statistical issues with us in seminars.
Contributor Information
Shenjie Wang, Department of Respiratory Medicine, The Second Affiliated Hospital of Xi'an Jiaotong University, No. 157, Xiwu Road, Xincheng District, Xi'an 710004, China; School of Computer Science and Technology, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China.
Xiaonan Wang, School of Computer Science and Technology, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China; Nanjing Geneseeq Technology Inc., 128 Huakang Road, Pukou, Nanjing 211800, China.
Xiaoyan Zhu, School of Computer Science and Technology, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China.
Xuwen Wang, Department of Respiratory Medicine, The Second Affiliated Hospital of Xi'an Jiaotong University, No. 157, Xiwu Road, Xincheng District, Xi'an 710004, China; School of Computer Science and Technology, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China.
Yuqian Liu, School of Computer Science and Technology, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China.
Minchao Zhao, Nanjing Geneseeq Technology Inc., 128 Huakang Road, Pukou, Nanjing 211800, China.
Zhili Chang, Nanjing Geneseeq Technology Inc., 128 Huakang Road, Pukou, Nanjing 211800, China.
Yang Shao, Nanjing Geneseeq Technology Inc., 128 Huakang Road, Pukou, Nanjing 211800, China.
Haitao Zhang, Department of Respiratory Medicine, Tangdu Hospital, The Fourth Military Medical University, 569 Xinsi Road, Baqiao District, Xi'an, Shaanxi 710038, China.
Shuanying Yang, Department of Respiratory Medicine, The Second Affiliated Hospital of Xi'an Jiaotong University, No. 157, Xiwu Road, Xincheng District, Xi'an 710004, China.
Jiayin Wang, Department of Respiratory Medicine, The Second Affiliated Hospital of Xi'an Jiaotong University, No. 157, Xiwu Road, Xincheng District, Xi'an 710004, China; School of Computer Science and Technology, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, 28 Xianning West Road, Beilin, Xi’an 710049, China.
Author contributions
J.W., S.Y., and H.Z. conceived this research; S.W., X.W., X.Z., and Y.L. designed the algorithm and the method; S.W. implemented the code and designed the software; M.Z., Z.C., X.W., and Y.S. collected the sequencing data; S.W. and J.W. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.
Conflict of interest: The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. This research utilized data provided by Minchao Zhao, Zhili Chang, Xiaonan Wang, and Yang Shao from Nanjing Geneseeq Technology Inc. The authors acknowledge the contribution of these individuals in the provision of data, but affirm that they had no influence over the study design, data analysis, interpretation of the results, or the writing of the manuscript.
Funding
This work was funded by the National Natural Science Foundation of China, grant numbers 72293581, 72274152, and 62402376. This work was also supported by the Natural Science Basic Research Program of Shaanxi, grant number 2020JC-01. The article processing charge was funded by the Natural Science Basic Research Program of Shaanxi, grant number 2020JC-01.
References
1. Chan TA, Yarchoan M, Jaffee E, et al. Development of tumor mutation burden as an immunotherapy biomarker: utility for the oncology clinic. JCO Precis Oncol 2019;3:1–12.
2. Samstein RM, Lee CH, Shoushtari AN, et al. Tumor mutational load predicts survival after immunotherapy across multiple cancer types. Nat Genet 2019;51:202–6. 10.1038/s41588-018-0312-8.
3. Goodman AM, Kato S, Bazhenova L, et al. Tumor mutational burden as an independent predictor of response to immunotherapy in diverse cancers. Mol Cancer Ther 2017;16:2598–608. 10.1158/1535-7163.MCT-17-0386.
4. Rizvi H, Sanchez-Vega F, La K, et al. Molecular determinants of response to anti-programmed cell death (PD-1) and anti-programmed death-ligand 1 (PD-L1) blockade in patients with non-small-cell lung cancer profiled with targeted next-generation sequencing. J Clin Oncol 2018;36:633–41. 10.1200/JCO.2017.75.3384.
5. Chalmers ZR, Connelly CF, Fabrizio D, et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Med 2017;9:34. 10.1186/s13073-017-0424-2.
6. Koboldt DC. Best practices for variant calling in clinical sequencing. Genome Med 2020;12:91. 10.1186/s13073-020-00791-w.
7. Koboldt DC, Zhang Q, Larson DE, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22:568–76. 10.1101/gr.129684.111.
8. Kim S, Scheffler K, Halpern AL, et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods 2018;15:591–4. 10.1038/s41592-018-0051-x.
9. Lai Z, Markovets A, Ahdesmaki M, et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res 2016;44:e108. 10.1093/nar/gkw227.
10. Qian ZY, Pan YQ, Li XX, et al. Modulator of TMB-associated immune infiltration (MOTIF) predicts immunotherapy response and guides combination therapy. Sci Bull 2024;69:803–22. 10.1016/j.scib.2024.01.025.
11. Hellmann MD, Callahan MK, Awad MM, et al. Tumor mutational burden and efficacy of nivolumab monotherapy and in combination with ipilimumab in small-cell lung cancer. Cancer Cell 2018;33:853–861.e4. 10.1016/j.ccell.2018.04.001.
12. The Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 2014;511:543–50. 10.1038/nature13385.
13. Le DT, Uram JN, Wang H, et al. PD-1 blockade in tumors with mismatch-repair deficiency. N Engl J Med 2015;372:2509–20. 10.1056/NEJMoa1500596.
14. Brahmer JR, Abu-Sbeih H, Ascierto PA, et al. Society for Immunotherapy of Cancer (SITC) clinical practice guideline on immune checkpoint inhibitor-related adverse events. J Immunother Cancer 2021;9:e002435. 10.1136/jitc-2021-002435.
15. Herbst RS, Baas P, Kim DW, et al. Pembrolizumab versus docetaxel for previously treated, PD-L1-positive, advanced non-small-cell lung cancer (KEYNOTE-010): a randomised controlled trial. Lancet 2016;387:1540–50. 10.1016/S0140-6736(15)01281-7.
16. Reck M, Rodríguez-Abreu D, Robinson AG, et al. Pembrolizumab versus chemotherapy for PD-L1-positive non-small-cell lung cancer. N Engl J Med 2016;375:1823–33. 10.1056/NEJMoa1606774.
17. Mok T, Wu YL, Kudaba I, et al. Pembrolizumab versus chemotherapy for previously untreated, PD-L1-expressing, locally advanced or metastatic non-small-cell lung cancer (KEYNOTE-042): a randomised, open-label, controlled, phase 3 trial. Lancet 2019;393:1819–30. 10.1016/S0140-6736(18)32409-7.
18. Wang S, Zhu X, Wang X, et al. TMBstable: a variant caller controls performance variation across heterogeneous sequencing samples. Brief Bioinform 2024;25:bbae159. 10.1093/bib/bbae159.
19. Gao S, Fang A, Huang Y, et al. Empowering biomedical discovery with AI agents. Cell 2024;187:6125–51. 10.1016/j.cell.2024.09.022.
20. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 2017;30:4765–74.
21. Molnar C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Victoria, BC, Canada: Leanpub; 2019.
22. Nazir S, Dickson DM, Akram MU. Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Comput Biol Med 2023;156:106668. 10.1016/j.compbiomed.2023.106668.
23. Overman MJ, Lonardi S, Wong KYM, et al. Durable clinical benefit with nivolumab plus ipilimumab in DNA mismatch repair-deficient/microsatellite instability-high metastatic colorectal cancer. J Clin Oncol 2017;36:773–9. 10.1200/JCO.2017.76.9901.
24. Snyder A, Makarov V, Merghoub T, et al. Genetic basis for clinical response to CTLA-4 blockade in melanoma. N Engl J Med 2014;371:2189–99. 10.1056/NEJMoa1406498.
25. Schrock AB, Ouyang C, Sandhu J, et al. Tumor mutational burden is predictive of response to immune checkpoint inhibitors in MSI-high endometrial cancer. Gynecol Oncol 2019;152:612–8.
26. Rizvi NA, Hellmann MD, Snyder A, et al. Mutational landscape determines sensitivity to PD-1 blockade in non-small cell lung cancer. Science 2015;348:124–8. 10.1126/science.aaa1348.
27. Hugo W, Zaretsky JM, Sun L, et al. Genomic and transcriptomic features of response to anti-PD-1 therapy in metastatic melanoma. Cell 2016;165:35–44. 10.1016/j.cell.2016.02.065.
28. Forde PM, Chaft JE, Smith KN, et al. Neoadjuvant PD-1 blockade in resectable lung cancer. N Engl J Med 2018;378:1976–86. 10.1056/NEJMoa1716078.
29. Ma Y, Chen X, Wang A, et al. Copy number loss in granzyme genes confers resistance to immune checkpoint inhibitor in nasopharyngeal carcinoma. J Immunother Cancer 2021;9:e002014. 10.1136/jitc-2020-002014.