Nature Communications. 2025 Jan 2;16:136. doi: 10.1038/s41467-024-54970-z

A deep multiple instance learning framework improves microsatellite instability detection from tumor next generation sequencing

John Ziegler 1,4, Jaclyn F Hechtman 1,5, Satshil Rana 1, Ryan N Ptashkin 1,6, Gowtham Jayakumaran 1,7, Sumit Middha 1,8, Shweta S Chavan 2,9, Chad Vanderbilt 1, Deborah DeLair 1,10, Jacklyn Casanova 1, Jinru Shia 1, Nicole DeGroat 1,11, Ryma Benayed 1,12, Marc Ladanyi 1, Michael F Berger 1,2, Thomas J Fuchs 1,3,13, A Rose Brannon 1,✉,#, Ahmet Zehir 1,6,#
PMCID: PMC11696176  PMID: 39746944

Abstract

Microsatellite instability (MSI) is a critical phenotype of cancer genomes and an FDA-recognized biomarker that can guide treatment with immune checkpoint inhibitors. Previous work has demonstrated that next-generation sequencing data can be used to identify samples with MSI-high phenotype. However, low tumor purity, as frequently observed in routine clinical samples, poses a challenge to the sensitivity of existing algorithms. To overcome this critical issue, we developed MiMSI, an MSI classifier based on deep neural networks and trained using a dataset that included low tumor purity MSI cases in a multiple instance learning framework. On a challenging yet representative set of cases, MiMSI showed higher sensitivity (0.895) and auROC (0.971) than MSISensor (sensitivity: 0.67; auROC: 0.907), an open-source software previously validated for clinical use at our institution using MSK-IMPACT large panel targeted NGS data. In a separate, prospective cohort, MiMSI confirmed that it outperforms MSISensor in low purity cases (P = 8.244e-07).

Subject terms: Cancer genomics, Machine learning


Identifying microsatellite instability (MSI) from routine next generation sequencing assays is an important part of clinical patient care. Here, the authors develop a deep-learning based algorithm, highlighting its performance in a large validation cohort.

Introduction

Microsatellite instability (MSI) is the phenotypic measure of deficiencies in the DNA mismatch repair (MMR) machinery, resulting in varying lengths of deletions and insertions in microsatellites. Microsatellites are short, tandemly repeated DNA sequences, and in patients with an MSI-high (MSI-H) phenotype, these regions exhibit significant errors due to replication slippage. The Food and Drug Administration (FDA) has approved pembrolizumab, an immune checkpoint inhibitor, for patients with MSI-H and/or MMR-deficient (MMR-D) cancers of any histology1,2. Thus, reliable and robust testing strategies for MSI/MMR status are critical for the clinical management of patients with metastatic cancer. Traditional tools for screening MSI status include MSI polymerase chain reaction (PCR) and/or MMR immunohistochemistry (IHC) testing, but their application to pan-cancer screening, including cancer types with a much lower prevalence of MSI than colorectal and endometrial cancer, has raised concerns regarding cost-effectiveness and optimal resource and tissue utilization. Given that clinical comprehensive genomic profiling using targeted NGS panels is now used more widely to inform treatment decisions in patients with advanced solid cancers, the advantages of also extracting MSI status from these data are apparent. We validated and implemented MSISensor as a way of identifying MSI status in patients undergoing next-generation sequencing (NGS) testing at Memorial Sloan Kettering Cancer Center (MSKCC) using MSK-IMPACT, an FDA-authorized targeted tumor sequencing panel3–5. While this enabled comprehensive, prospective MSI analysis across a wide array of tumor types, certain features of the clinical samples and the algorithm prevented us from reliably identifying all MSI-H patients.
Because MSISensor calculates a distribution of the number of deletions in a given microsatellite region and compares the tumor and matched normal distributions, samples with low tumor purity may lead to false negatives. Further, we observed that samples with low sequence coverage also suffer from false-negative results. Finally, the presence of an ‘indeterminate’ category (MSISensor scores between 3 and 10; 3.8% of all samples tested) leads to complexities in patient management and the need for orthogonal testing, with the use of additional tissue resources.

The utility of supervised deep learning in classifying genomic data has been demonstrated by a number of somatic and germline variant callers based on deep learning methods6–8. These prior genomic applications of deep learning have relied on labeled training data and are thus fully supervised. The existence of a true label for every data point makes learning in a fully supervised manner computationally feasible. For MSI classification, however, the ground truth label is not at the level of the individual genomic location, as it is for individual variants; rather, the label applies to the entire sample and is based on a variety of testing modalities, as previously mentioned. Here, we describe a new computational tool for accurately classifying MSI status using NGS data, called MiMSI (Multiple-instance MSI, pronounced “My-MSI”). Our method utilizes a deep multiple instance learning model rather than traditional statistical modeling methods to achieve greater sensitivity while retaining specificity9.

Results

Model development

We utilized multiple instance learning (MIL), a weakly supervised machine learning methodology, to develop an MSI classifier. With MIL, a broad label is applied to a collection of multiple individual data points, called instances, rather than to each instance individually10. This problem formalization is evident in many machine learning applications in medicine and has since been combined with deep convolutional methods to form highly performant models11–13. In formulating MSI classification as a MIL problem, we consider each microsatellite region in a patient sample to be an instance and each patient sample to be our ‘bag’, or collection of instances. Ground truth MSI status exists at the bag level due to the orthogonal PCR and MMR-IHC testing performed for each case in our training and validation datasets. Therefore, we can compute the loss function and training accuracy for each sample without requiring the accurate classification of each individual microsatellite region. Optimizing the loss function across all the samples in our training dataset allows us to build a final model to predict the MSI status of subsequent test samples.
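The bag-level formulation above can be illustrated with a minimal sketch (ours, not the authors’ implementation): instance embeddings are pooled into one bag embedding, and the loss is computed only against the bag label, so no per-locus labels are ever required. The weights `w`, `b` and the toy embeddings stand in for the features a trained CNN would produce.

```python
import numpy as np

def bag_probability(instance_embeddings, w, b):
    """Mean-pool instance embeddings into a bag-level embedding,
    then apply a logistic classifier to the pooled embedding."""
    bag_embedding = instance_embeddings.mean(axis=0)  # shape (d,)
    logit = float(bag_embedding @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid

def bag_loss(p, label, eps=1e-12):
    """Binary cross-entropy computed on the bag label only --
    no per-instance (per-microsatellite) labels are needed."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Toy bag: 5 "microsatellite instances", each a 3-dim embedding
rng = np.random.default_rng(0)
bag = rng.normal(size=(5, 3))
w, b = np.ones(3), 0.0
p = bag_probability(bag, w, b)
loss = bag_loss(p, label=1)
```

In training, the gradient of this bag-level loss flows back through the pooling step into every instance embedding, which is what makes the weak (sample-level) supervision sufficient.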

The end-to-end classification pipeline (Fig. 1a) starts with converting the aligned NGS reads at each microsatellite site for both the tumor and matched normal sample into a vector representation (i). The model predicts a single probability that the sample is MSI by (ii) calculating a feature representation for each microsatellite vector for a sample, (iii) pooling all microsatellite representations into a sample-level representation and (iv) utilizing a classification layer on the preceding sample representation to determine the final probability of MSI status (Fig. 1b). The first stage of the model is a deep convolutional neural network (CNN) designed to extract a feature embedding for each microsatellite vector in a given sample. The network (Supplementary Fig. 1) is based on a ResNet architecture, employing residual connections after each pair of convolutional layers. These residual connections allow the network to combine both high-level and low-level features learned at varying levels in the network into one cohesive feature embedding at the final stages14. The feature embeddings established by the network for each of the microsatellite loci are then pooled into a sample-level embedding. We tested two different MIL-pooling methods for our embedding-based models. The first is a traditional mean operation where we average the instance-level feature embeddings into a sample-level embedding. For the second method, we implemented an attention mechanism as proposed by Ilse et al.13. The reasons for adding an attention mechanism to our model are twofold. First, it allows the model to learn a weighted average rather than pooling every microsatellite instance equivalently. Second, the attention scores produced provide an interpretability metric for identifying the microsatellite instances that contribute the most to the model’s classification. 
In both pooling methods, the pooled sample-level embedding is passed through the last stage of the network where we utilize a sigmoid classifier to predict the final likelihood of MSI occurrence. Ninety-five percent confidence intervals (95% CI) are generated by repeating this procedure 10 times for a given sample.
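A compact NumPy sketch may make the two pooling options concrete. The attention weighting follows the general form proposed by Ilse et al. (weights proportional to exp(wᵀ tanh(V hₖ))); the matrices `V` and `w`, and the CI construction (mean ± 1.96 standard errors over the ten repeated runs), are illustrative assumptions on our part, since the text does not specify the exact formulas.

```python
import numpy as np

def mean_pool(H):
    """MIL mean pooling: average instance embeddings uniformly."""
    return H.mean(axis=0)

def attention_pool(H, V, w):
    """Attention pooling in the spirit of Ilse et al.: instance k
    receives weight a_k = softmax_k(w^T tanh(V h_k))."""
    scores = np.tanh(H @ V.T) @ w      # (N,) unnormalized scores
    a = np.exp(scores - scores.max())  # numerically stable softmax
    a /= a.sum()
    return a @ H, a                    # pooled embedding, attention weights

def ci_from_repeats(scores):
    """Illustrative 95% CI from repeated stochastic predictions:
    mean +/- 1.96 standard errors across the runs."""
    s = np.asarray(scores, dtype=float)
    half = 1.96 * s.std(ddof=1) / np.sqrt(len(s))
    return s.mean() - half, s.mean() + half

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 4))            # 6 instances, 4-dim embeddings
V = rng.normal(size=(8, 4))
w = rng.normal(size=8)
pooled, weights = attention_pool(H, V, w)
lo, hi = ci_from_repeats(rng.uniform(0.8, 0.9, size=10))
```

The returned `weights` are exactly the per-locus interpretability scores mentioned above: loci with large weights contributed most to the sample-level call.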

Fig. 1. MiMSI model design and performance metrics.

Fig. 1

a Schematic representation of converting sequencing reads in a genomic region into a vector representation. Reference sequence, along with mapping qualities and CIGAR strings for each read is used in the vectorization after downsampling. The set of vectors for a given sample is passed through the model (see eFigure 1). b Study cohort used for both training the model and testing the performance. c Distribution of MSISensor scores for samples with orthogonal testing performed. d Area under the receiver operator curve (auROC) analysis of the test cohort analyzed with MSISensor and MiMSI at 4 different downsampled coverage levels (100X, 200X, 300X, and 400X). e MSISensor scores and MiMSI probabilities for the test cohort. Colors indicate the orthogonal test status. Source data are provided as a Source Data file.

MiMSI Performance compared to orthogonal testing

To train and validate MiMSI, we created a dataset that deliberately included cases where MSISensor had failed to identify MSI status due to low tumor purity or coverage and for which orthogonal MSI-PCR and MMR IHC data are available (n = 1058, Fig. 1c). We trained the model using 741 samples with four different down-sampled coverage values when creating instance vectors: 100X, 200X, 300X, and 400X, to normalize the coverage inequality between the tumor and the matched blood sample (Fig. 1d, see “Methods” and Source Data). Next, 317 previously unseen cases were used as a test set to compute accuracy, sensitivity, specificity, and area under the receiver operator curve (auROC) metrics (Fig. 1e and Table 1). MSISensor demonstrated a low auROC (0.907), as we specifically selected cases that were challenging for MSISensor. Performance of MiMSI was highest with the 100X and 200X models (auROC: 0.97 for both) and degraded with increased coverage; at higher coverages, down-sampling reduced the average number of loci used in training and testing (Supplementary Table 1). Because both 100X and 200X showed better results, we used these downsampling values to train the model with an attention-pooling mechanism as well. The 100X and 200X models demonstrated similar performance metrics; however, confidence intervals (CIs) derived from performing random downsampling ten times showed that 200X has less variation due to down-sampling (Supplementary Fig. 2); therefore, we used the 200X model with attention pooling for the rest of the analyses in this manuscript. Further, using the optimal cut-off approach that maximizes the sum of sensitivity and specificity, we calculated a score of 0.4 as the cut-off threshold to distinguish MSS from MSI-H cases with MiMSI and classified any case for which the 95% CI crossed the 0.4 boundary as MSI-indeterminate (MSI-ind).
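The cut-off selection and the indeterminate rule can be sketched as follows. This is a generic Youden’s-J threshold scan plus the CI-crossing rule described above, not the authors’ exact code.

```python
import numpy as np

def optimal_cutoff(scores, labels):
    """Scan observed scores as candidate thresholds and keep the one
    maximizing sensitivity + specificity (Youden's J statistic)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    best_t, best_j = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        sens = (pred & (labels == 1)).sum() / pos
        spec = (~pred & (labels == 0)).sum() / neg
        if sens + spec - 1.0 > best_j:
            best_j, best_t = sens + spec - 1.0, float(t)
    return best_t

def classify(ci_low, ci_high, cutoff=0.4):
    """A case whose 95% CI crosses the cutoff is called indeterminate."""
    if ci_low > cutoff:
        return "MSI-H"
    if ci_high < cutoff:
        return "MSS"
    return "MSI-Indeterminate"
```

For example, a sample with CI (0.35, 0.62) straddles 0.4 and would be reported as MSI-indeterminate rather than forced into MSS or MSI-H.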

Table 1.

Performance metrics for MSISensor and MiMSI

Model Accuracy Sensitivity Specificity auROC
MSISensor 0.835 0.67 1 0.907
100X with average 0.938 (0.933, 0.943) 0.885 (0.875, 0.895) 0.990 (0.986, 0.994) 0.970 (0.967, 0.972)
100X with attention 0.940 (0.928, 0.951) 0.887 (0.863, 0.912) 0.991 (0.987, 0.996) 0.963 (0.956, 0.970)
200X with average 0.942 (0.939, 0.944) 0.896 (0.890, 0.901) 0.988 (0.988, 0.988) 0.972 (0.970, 0.975)
200X with attention 0.942 (0.936, 0.948) 0.895 (0.884, 0.905) 0.988 (0.985, 0.992) 0.971 (0.968, 0.973)
300X 0.929 (0.912, 0.947) 0.867 (0.826, 0.908) 0.991 (0.985, 0.996) 0.970 (0.966, 0.975)
400X 0.916 (0.908, 0.923) 0.843 (0.828, 0.858) 0.988 (0.988, 0.988) 0.964 (0.962, 0.965)

Of the 317 samples in the test cohort, 17 samples (5%) were discordant with orthogonal testing results, and one sample was indeterminate by MiMSI. Fifteen cases were classified as MSS by MiMSI despite being orthogonally MSI-H. Mutational signature analysis, where it could be applied, indicated low contributions from signatures indicative of MMR deficiency (Fig. 2), suggesting these could represent cases where MMR proteins were lost but the genomic phenotype of this loss had not yet appeared. Two of these cases exhibited a DNA polymerase-ε (POLE) deficient mutational signature and the associated high tumor mutation burden in addition to an MMR-deficient signature. One case was orthogonally MSI-H and was classified as MSI-indeterminate by MiMSI: Sample_18831 was an endometrial cancer case with 27 mutations identified and a median VAF of 5%, suggesting very low tumor purity. Mutational signature analysis of this sample revealed a high fraction of MMR-deficiency signatures, suggesting that signature analysis of mutation contexts could help clarify MiMSI-indeterminate cases (Fig. 2). Two samples were orthogonally MSS but were classified as MSI-H by MiMSI. Sample_54409 was a colon adenocarcinoma with 303 somatic mutations detected, classified as MSI-H by MiMSI but MSS by orthogonal IHC/PCR testing. Mutational signature analysis attributed 72% of the mutations to POLE deficiency resulting from an exonuclease domain mutation (V411L). Another 15% of the mutations were attributable to MMR deficiency, and there was a nonsense mutation in MSH2 (E580*), suggesting the possibility of a smaller clone with an MMR phenotype that either was not apparent during IHC review or retained MSH2 expression15. Sample_34903 was a UEC with over 500 mutations identified, of which only 5 were indels. Mutation signature analysis showed that 52% of the mutations were attributable to concurrent deficiencies of POLE proofreading and the MMR system (Fig. 2)16.

Fig. 2. Mutational signature analysis for discrepant cases.

Fig. 2

Bar chart showing the fraction of mutations explained by a given mutation signature: mismatch-repair deficiency (MMR: Red); error-prone DNA Polymerase ε (POLE: gray) and concurrent MMR and POLE (blue). All other signatures are shown in light gray. Samples shown had a minimum of 10 mutations for signature analysis and were discrepant between orthogonal testing (MMR IHC or MSI PCR) and MiMSI. Labels shown are the MiMSI categorization. Source data are provided as a Source Data file.

We further performed sample dilution experiments to determine the sensitivity of MiMSI to changes in tumor purity and to compare its performance at low purity with MSISensor (Supplementary Table 2). We diluted tumor DNA from an MSI-H case validated by MSI PCR with successive amounts of normal DNA from matched FFPE normal tissue and evaluated the sequencing data with both MSISensor and MiMSI. The MSISensor score decreased from its original value of 36.7 as tumor purity decreased; however, it remained indicative of an MSI-H phenotype until the final dilution of 6%. At that point, MSISensor lacked sufficient signal to score the case above our MSI-H threshold of 10. MiMSI, however, classified the sample as highly likely to be MSI-H even at the lowest dilution point.

Importance of the number of microsatellite sites processed

Since vector data generated over the microsatellite sites are used for the prediction of MSI-H phenotype, we investigated how the number of loci used in analysis affects the outcomes. We randomly downsampled the microsatellite loci used in the test cohort and asked how MiMSI classifications change (Fig. 3). As the number of loci used decreased, the confidence intervals of the samples analyzed increased, which led to an increase in the number of MSI-Ind cases. The rate of MSI-Ind cases increased from 1.2% (900 sites used) to 12% (100 sites used) demonstrating that the number of sites used is an important factor in classifying cases properly.

Fig. 3. Downsampling microsatellite loci used in classification.

Fig. 3

MiMSI classification results after randomly downsampling microsatellite loci were used for the classifier. The average length of confidence intervals (CI) across the 317 samples for a given number of sites used is shown above each figure. Data are presented as MiMSI score +/− 95% CI error bars; MSI-H in red, MSI-Indeterminate in teal, MSS in blue. Source data are provided as a Source Data file.

Comprehensive features of MMR-D cancers

Next, we collected a separate, prospective cohort of 5037 samples from 42 different cancer types with available orthogonal IHC data (see Source Data). Of the 5037 samples, 842 were MMR-D (580 with MLH1 loss, 166 with MSH2 loss, 60 with MSH6 loss, and 36 with PMS2 loss, Fig. 4). To our knowledge, this is the largest cohort of samples with concurrent NGS and IHC results performed on the same tissue and represents an opportunity to investigate molecular features of MMR-D and loss of different MMR proteins, in addition to confirming how MSISensor and MiMSI perform in a large, clinical cohort.

Fig. 4. Molecular and genomic features of 5037 prospective clinical cancer samples.

Fig. 4

a Distribution plots of the tumor mutation burden (TMB) for MMR-proficient (MMR-P) versus MMR-deficient (MMR-D) samples as determined by IHC; the MMR-D cases were further broken down by the specific MMR protein lost. Pairwise comparisons were performed using a two-sided Mann-Whitney Wilcoxon test, demonstrating a significant difference in TMB between MMR-P (n = 4195) and MMR-D (n = 842) cases (p < 2.2 × 10−16), MLH1 loss (n = 580) vs MSH2 loss (n = 166) (p = 8.4 × 10−7), and MSH2 loss vs MSH6 loss (n = 60) (p = 0.034). In contrast, there was no difference in TMB between MLH1 loss and MSH6 loss (p = 0.87), MLH1 loss and PMS2 loss (n = 36) (p = 0.96), and MSH6 loss and PMS2 loss (p = 0.62). In the box plots for a–c, the central line represents the median; the box corresponds to the 25–75% quartiles; the upper whisker extends to the largest value no farther than 1.5 × IQR; and the lower whisker extends from the 25% quartile to the smallest value no farther than 1.5 × IQR. b Distribution plots of indel-to-SNV ratios based on MMR and/or protein status (n same as in a). Higher indel-to-SNV ratios are common in tumors with inactivated MMR proteins. Pairwise two-sided Mann-Whitney Wilcoxon comparisons upheld significant differences between all groups (MMR-P (n = 2699) vs MMR-D (n = 789) p < 2.2 × 10−16, MLH1 (n = 557) vs MSH2 (n = 152) p < 2.2 × 10−16, MLH1 vs MSH6 (n = 46) p < 2.2 × 10−16, MLH1 vs PMS2 (n = 36) p = 0.026, MSH2 vs MSH6 p = 3.6 × 10−12, MSH6 vs PMS2 p = 8.1 × 10−8). c Distribution plots of the fraction of the signature analysis driven by the MMR-related signatures in the subset of tumors with at least 15 mutations detected (MMR-P n = 167, MMR-D n = 719, MLH1 loss n = 506, MSH2 loss n = 143, MSH6 loss n = 37, PMS2 loss n = 35).
Pairwise two-sided Mann-Whitney Wilcoxon comparisons of the MMR signature fractions were significant between MMR-P vs MMR-D (p < 2.2 × 10−16), MLH1 vs MSH2 (p = 1 × 10−11), MLH1 vs MSH6 (p = 4.8 × 10−5), and MLH1 vs PMS2 (p = 0.0028), but not between either MSH2 or MSH6 and PMS2 (p = 0.44 and p = 0.2, respectively). d MSISensor scores for all tumors in each IHC category, grouped by algorithmic classification (MSI-H red, MSI-Indeterminate teal, MSS blue, Not reported gray). e MiMSI score for all tumors by IHC category, grouped by updated algorithmic score (see Supplementary Fig. 3 for MiMSI scores for each IHC category with 95% confidence intervals). f Sensitivity of MSISensor and MiMSI based on tumor purity as compared to the IHC standard. g Sensitivity of MSISensor (red triangle) and MiMSI (blue square) by cancer type. Source data are provided as a Source Data file.

First, we investigated the genomic features correlating with the MMR-D phenotype and loss of different MMR proteins. Tumor mutation burden (TMB) was higher in MMR-D samples compared to the MMR-P group (median TMB in MMR-D = 39, median TMB in MMR-P = 5.27, Mann-Whitney Wilcoxon p-value < 2.2 × 10−16, Fig. 4a). Pairwise comparisons across the different sub-groups of MMR protein loss showed that MSH2 loss led to higher TMB compared to MLH1 loss (median TMB in MSH2 loss = 46.5, median TMB in MLH1 loss = 37.7, Mann-Whitney Wilcoxon p-value = 0.0013). Loss of MMR machinery leads to the formation of indels as opposed to single nucleotide variants (SNVs); therefore, we quantified the indel burden in each sample as the indel-to-SNV ratio. As expected, we found a higher indel/SNV ratio in MMR-D samples overall compared to MMR-P samples (median indel/SNV in MMR-P = 0.18, median indel/SNV in MMR-D = 0.5, Mann-Whitney Wilcoxon p-value < 2.2 × 10−16, Fig. 4b). Amongst the MMR-D samples, loss of MLH1 was associated with the highest indel/SNV ratio (median = 0.57) and MSH6 loss with the lowest (median = 0.09). Finally, we investigated the contribution of the MMR mutation signature in samples with at least 15 mutations. We observed a much higher contribution of MMR-related mutation signatures in the MMR-D cohort (n = 730) compared to the MMR-P cohort (n = 168) (median signature contribution in MMR-P = 0.036, median signature contribution in MMR-D = 0.62, Mann-Whitney Wilcoxon p-value < 2.2 × 10−16, Fig. 4c). Loss of MSH6 was associated with the lowest contribution of the MMR signature (median = 0.42), while MLH1 loss led to the highest (median = 0.67). Overall, these results demonstrate the molecular features associated with MMR-D tumors and help further delineate the phenotypic differences between the four proteins of the MMR machinery.
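The indel-to-SNV ratio used above is straightforward to compute from a sample’s variant calls; here is a hypothetical sketch, where the `(ref, alt)` tuple representation of each variant is our assumption rather than a format the authors specify.

```python
def indel_snv_ratio(variants):
    """Indel-to-SNV ratio from a list of (ref, alt) allele strings;
    a ref/alt length difference marks an indel, equal single bases
    an SNV (multi-base substitutions are counted as neither)."""
    indels = sum(1 for ref, alt in variants if len(ref) != len(alt))
    snvs = sum(1 for ref, alt in variants if len(ref) == len(alt) == 1)
    return indels / snvs if snvs else float("inf")
```

For a toy call set with two indels and two SNVs the ratio is 1.0, well above the MMR-P median of 0.18 reported here.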

We then used this cohort to investigate the performance of MSISensor and MiMSI. Overall, the sensitivity of MiMSI (91.6%, 95% CI = 89.5%–93.4%) was greater than that of MSISensor (86.1%, 95% CI = 83.3%–88.6%, Fig. 4d, e). Results for MSISensor were not reported for 2.3% of the cases (n = 118) due to low coverage or low tumor purity. Moreover, amongst the MSISensor-indeterminate cases (n = 247), MiMSI was able to correctly call 91% (n = 226) of the cases (Fig. 4d). When stratified by estimated tumor purity, both methods showed similar sensitivity for samples with tumor content equal to or greater than 30%. However, MiMSI sensitivity was higher (85.1%, 95% CI = 81.0%–88.5%) than MSISensor (72.8%, 95% CI = 67.3%–77.8%) for samples with less than 30% purity (McNemar’s chi-squared P = 8.244e-07, Fig. 4f). We also observed differences in sensitivities across cancer types. Sensitivities for Bladder Cancer (n = 73), Colorectal Cancer (n = 2448), and Esophagogastric Carcinoma (n = 475) were comparable between the two callers. For Cancer of Unknown Primary tumors, the sensitivity of MSISensor (92.3%, 95% CI = 64.0%–99.8%) was higher than that of MiMSI (87.5%, 95% CI = 61.7%–98.4%). However, the sensitivity of MiMSI was higher in multiple cancer types, including Small Bowel Cancer (100%, 95% CI = 82.4%–100% vs 94.1%, 95% CI = 71.3%–99.9%), Prostate Cancer (93.8%, 95% CI = 79.2%–99.2% vs 85.7%, 95% CI = 63.7%–97.0%), and Endometrial Cancer (89.7%, 95% CI = 85.7%–92.9% vs 79.%, 95% CI = 73.9%–94.9%).
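The paired-sensitivity comparison above uses McNemar’s test, which depends only on the discordant cells of the paired 2×2 table (cases one caller detected and the other missed). A self-contained sketch, where the example counts are hypothetical rather than the study’s actual discordant counts:

```python
import math

def mcnemar(b, c, correction=True):
    """McNemar's chi-squared test on the discordant cells of a paired
    2x2 table: b = cases caller A detected and caller B missed,
    c = the reverse. Returns (statistic, p-value) with 1 degree of
    freedom, using the continuity correction by default."""
    num = (abs(b - c) - 1) ** 2 if correction else (b - c) ** 2
    stat = num / (b + c)
    # chi-squared survival function with 1 df: p = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Hypothetical discordant counts: one caller catches 30 MSI-H cases
# the other misses, versus 5 in the other direction.
stat, p = mcnemar(30, 5)
```

Because only the discordant pairs enter the statistic, the test is well suited to comparing two callers run on the same set of samples, as done here.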

Global comparison of MiMSI with MSISensor

After clinically validating the results of MiMSI, we set out to investigate MMR-D phenotype across a large variety of cancer types (n = 101) by analyzing 45,112 tumor samples sequenced with the MSK-IMPACT assay (see Source Data). Comparison of MiMSI results with MSISensor showed an overall 96% concordance for cases identified as MSS and MSI-H with both methods (Fig. 5 and Table 2). Moreover, MiMSI reduced the number of samples in the MSI-Ind category identified by MSISensor from 3.8% (n = 1724) to 0.47% (n = 210).

Fig. 5. Global comparison of MiMSI results with MSISensor.

Fig. 5

Comparison of MSISensor scores and MiMSI scores +/− 95% CI error bars for a cohort of 45,112 tumor samples. Source data are provided as a Source Data file.

Table 2.

Comparison of MiMSI and MSISensor results in the 45,112 MSK-IMPACT cohort

MSISensor
MSS MSI-Ind MSI-H
MiMSI MSS 42,059 1,459 63
MSI-Ind 149 50 11
MSI-H 79 215 1,027

Analysis of tumors without a matched control

While MiMSI was trained using a patient-matched normal control sample, there are instances where a matched control may not be available for a given sample. To assess MiMSI’s ability to classify cases in a tumor-only mode, we reanalyzed a subset of the samples in the test cohort with MiMSI using an unrelated normal sample (Fig. 6a). We observed that while all the orthogonally MSI-H cases had scores > 0.4, a majority of orthogonally MSS cases also had scores > 0.4, leading to false positive (FP) calls. We hypothesized that ancestral differences between the tumors and the normal comparator might lead to FPs. Therefore, we also tested a pooled blood control (an equimolar mixture of 10 blood samples) as a comparator (Fig. 6b). We observed a general shift of MiMSI scores towards zero with the pooled control, but more so for the orthogonally MSS cases. Because the attention mechanism identifies sites that differ between the tumor and the comparator and increases their weights, MiMSI with attention appears to yield higher MSI-H probability scores overall in the context of unmatched samples. Consequently, using MiMSI without the attention mechanism, where differences between the tumor and comparator are averaged uniformly, should yield more balanced probability scores. Analysis of the tumor and unrelated comparator using MiMSI without attention (Fig. 6c) showed a score distribution similar to analysis with the pooled normal using MiMSI with attention. The most concordant results were achieved with the pooled normal as a comparator using MiMSI without attention (Fig. 6d). The presence of FNs amongst the orthogonally MSI-H cases suggests that a threshold other than 0.4 would account for the variability observed between the tumor and the pooled normal and could increase sensitivity and specificity. We suggest users be cautious in interpreting unmatched results and perform their own validation with their NGS dataset.

Fig. 6. Tumor-only analysis of test cohort.

Fig. 6

A subset of the test cohort (142 orthogonally MSS tumors and 132 orthogonally MSI tumors) was analyzed using the MiMSI model with attention mechanism with an unrelated normal comparator (a) and a pooled normal comparator (b). c, d show the same comparison using the MiMSI model without an attention mechanism. Data are presented as MiMSI score +/− 95% CI error bars; MiMSI class MSI-H in red, MSI-Indeterminate in teal; MSS in blue. Source data are provided as a Source Data file.

Analysis of whole exome sequencing (WES) results

Finally, we wanted to assess the ability of MiMSI to classify NGS modalities other than the one it was trained on, without any re-training. We recaptured NGS libraries from 582 samples using WES probes and analyzed the data using MiMSI. WES results were highly concordant with MSK-IMPACT classifications (overall concordance: 98.6%), with a total of 5 discordant cases (Fig. 7), indicating that MiMSI can be used out of the box on new datasets without re-training. However, as with the unmatched analysis of tumors, we caution users to validate the model on their own datasets against a truth set before undertaking large-scale analyses.

Fig. 7. Analysis of the WES dataset.

Fig. 7

Scatterplot showing the MiMSI scores +/− 95% CI error bars from 581 MSK-IMPACT and WES captured from the same DNA library. Source data are provided as a Source Data file.

Discussion

Here, we describe a novel MSI classification algorithm that uses vectorized NGS reads at microsatellite loci and is extensible to multiple DNA capture types. Given the importance of correctly identifying MSI status in clinical cohorts and the fact that low tumor purity is common in formalin-fixed paraffin-embedded specimens, we believe MiMSI will be a valuable addition to clinical analysis pipelines for comprehensive genomic profiling.

Detecting MSI-H status is crucial for the management of cancer patients: it is predictive of Lynch syndrome regardless of the primary tumor, as well as predictive of response to immune checkpoint inhibition17. However, current DNA-based methods often have limited technical sensitivity in many MSI-H samples, as these tumor samples typically show increased levels of tumor-infiltrating lymphocytes or a ‘Crohn’s-like’ lymphocytic reaction, which can sharply reduce tumor purity (i.e., the proportion of all cells in the biopsy that are tumor cells). To date, MMR IHC is the only tumor purity-agnostic method to screen for MSI-H status, but it requires separate material that cannot be used for other predictive tests (such as sequencing). We show that MiMSI can be used concurrently with NGS assays performed on low tumor purity cases, decreasing the need for the additional follow-up IHC tests required with MSISensor. Furthermore, while the clinical utility has yet to be fully explored, we show MiMSI can detect an MSI-H phenotype that occurs concurrently with other genomic lesions, such as exonuclease domain mutations in POLE, where the MSI phenotype may not be clearly apparent. Finally, MMR IHC has a sensitivity of approximately 94%, and results can show false-retained MMR protein patterns with pathogenic missense mutations, removing an indication for treatment with immune checkpoint inhibition and pointing to the need for sensitive testing methods15.

The application of a deep learning-based classifier to NGS data allows us to infer MSI status with greater sensitivity than prior statistical methods, especially when the sample has low tumor content or sequencing produces many low-coverage regions. Because these traditional statistical methods rely on the difference in distributions between the tumor and normal sample, they are sensitive to sequencing or sample quality issues. Our model is trained on a dataset composed of samples with varied levels of coverage, purity, and sequencing quality, resulting in a more resilient classifier. This pre-trained model may potentially suffer from single-institution bias, as many machine learning models do; however, the MiMSI package includes a tool to train a model on new data. In addition, the deep nature of our model means that it can classify against multiple levels of features derived from the aligned sequencing results, rather than just a set of hand-chosen features such as the number of deletions observed. Finally, by using WES data generated from the same DNA specimen, we show that MiMSI can be used with data generated with different capture methods without re-training.

These results demonstrate that a deep multiple-instance learning approach can be used to infer abnormalities in NGS data, even when those data are weakly labeled. Microsatellite instability is one such application; however, this method is applicable to other classification tasks as well. As NGS assays become more prevalent in clinical diagnostics, methods like MiMSI allow us to identify MMR-deficient and MSI-H cases during sequencing rather than through a separate IHC or PCR test. Furthermore, MiMSI can be utilized across many cancer types, raising the probability that additional MSI cases can be identified in tumor types where MSI testing is not typically performed or MMR deficiency is not commonly observed.

Methods

Clinical data generation

This study complies with all relevant ethical regulations. Prospective clinical sequencing data from 45,112 samples (40,414 consented patients) sequenced with the MSK-IMPACT panel4 that passed standard clinical quality metrics between January 2014 and April 2020 were used with MSKCC Institutional Review Board approval. MSISensor (v0.2) was used in the clinical analysis pipeline. MSISensor scores between 0 and 3 were considered MSS, scores between 3 and 10 were considered MSI-Ind, and scores greater than 10 were considered MSI-H3. Based on our analytical validation, for samples with tumor content less than 20%, MSI classification based on MSISensor is censored and not reported.
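The score-to-category mapping described here can be written as a small helper; the handling of the exact boundary values (whether 3 and 10 fall in the indeterminate bin) is our assumption, as the text does not specify inclusivity.

```python
def msisensor_class(score, tumor_purity):
    """Map an MSISensor score to the clinical category described in
    the text. Samples below 20% tumor content are censored; boundary
    inclusivity at 3 and 10 is an assumption on our part."""
    if tumor_purity < 0.20:
        return "Not reported"   # censored per the analytical validation
    if score < 3:
        return "MSS"
    if score <= 10:
        return "MSI-Indeterminate"
    return "MSI-H"
```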

Converting NGS data to vectors

The intermediate file from which variant analysis is typically performed in NGS is the alignment file (BAM/SAM). The format of the alignment file is not directly usable by deep learning algorithms such as convolutional neural networks (CNNs). Thus, we developed a program that efficiently converts the known microsatellite regions of the genome into fixed-size three-dimensional vector representations.

Each microsatellite region in the alignment file for both the tumor and the normal sample is read using pysam, a Python wrapper around SAMtools. An individual vector for every site above a coverage threshold is generated by compiling a list of the aligned reads that completely span the given microsatellite region. Each aligned read is then converted into a two-dimensional vector of size L x 3, where L is the length of the microsatellite region on the reference sequence and three is the number of data points extracted from the alignment file at each base. These three channels are the CIGAR string value, the mapping quality, and a binary integer indicating whether the read maps to the reverse strand.

The generated single-read vectors are concatenated to form a three-dimensional region-level representation with dimensions C x L x 3, where C is the required coverage at the region of interest. Because the CNN requires a standard input size, we downsample the coverage to the desired cutoff threshold for the first dimension; in our experiments, we used coverage cutoffs of 100, 200, 300, and 400. The downsampling is random so that the sort order of the alignment file does not bias the model. Because the microsatellite regions vary in size and the model requires fixed dimensions, we zero-pad the edges of the vectors for microsatellite regions smaller than 40 bp. This process results in C x 40 x 3 vectors for both the matched tumor and normal samples at each microsatellite locus. Because a comparison between tumor and normal is necessary, we stack the two vectors together, yielding a 2 C x 40 x 3 vector for every microsatellite region in the patient's genome. Since coverage varies across regions, the number of microsatellite vectors varies by patient. Our final input for training and testing the model is an N x 2 C x 40 x 3 vector for each patient sample, where N is the number of microsatellite vectors generated for that sample.
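The downsampling, padding, and tumor/normal stacking described above can be sketched as follows. The helper names are hypothetical, and the centred placement of the zero padding is an assumption; the text only states that zeroes are introduced at the edges of short regions:

```python
import numpy as np

rng = np.random.default_rng(0)

def site_vector(read_vectors, coverage=100, target_len=40):
    """Build the C x 40 x 3 representation of one microsatellite site
    from a list of L x 3 single-read vectors (assumes L <= target_len).

    Reads are randomly downsampled to `coverage`; shorter regions are
    zero-padded to `target_len` (centred padding is our assumption).
    """
    idx = rng.choice(len(read_vectors), size=coverage, replace=False)
    padded = []
    for i in idx:
        r = read_vectors[i]
        pad = target_len - r.shape[0]
        left = pad // 2
        padded.append(np.pad(r, ((left, pad - left), (0, 0))))
    return np.stack(padded)  # C x 40 x 3

def locus_vector(tumor_reads, normal_reads, coverage=100):
    """Stack the tumor site vector over the matched normal site
    vector to obtain the 2C x 40 x 3 locus-level input."""
    return np.concatenate([site_vector(tumor_reads, coverage),
                           site_vector(normal_reads, coverage)])
```

For a site with 150 spanning reads of length 18, `locus_vector(tumor_reads, normal_reads, coverage=100)` returns a 200 x 40 x 3 array with zeroes at the padded edges.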

MiMSI Model architecture

MiMSI’s architecture (Supplementary Fig. 1) can be separated into two main components. The first portion of the model creates an instance-level feature embedding for every microsatellite site, and the second calculates a bag-level embedding on which we calculate our final MSI prediction.

The instance-level model is based on ResNet-18, an 18-layer deep convolutional neural network that has performed well in various image-based classification tasks18. The original ResNet-18 architecture was adapted to accommodate the small size of our microsatellite vectors and to support a multiple-instance learning approach. Compared to the original 18 layers, MiMSI was built with 12 layers, and we downsample the input twice over the course of the forward pass rather than at every residual connection. Each standard convolution layer uses a 3 x 3 kernel with zero padding and a stride of 1. In the downsampling convolutions, the kernel size, padding, and stride are 3 x 3, 0, and 2, respectively; the stride of 2 downsamples the vector by a factor of two.
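The spatial dimensions implied above follow from the standard convolution output-size formula, which can be checked directly. The helper is for illustration only, and interpreting "zero padding" in the stride-1 layers as a size-preserving padding width of 1 is our assumption:

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Standard convolution output-size formula:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1
```

With kernel 3, padding 1, and stride 1, `conv_out(40)` returns 40 (the spatial size is preserved), while the downsampling configuration `conv_out(40, padding=0, stride=2)` returns 19, roughly halving the dimension as described.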

The output of the final residual block in the instance-level model is fed into a fully connected linear layer to create an N x 512 vector representation of the sample, where N is the number of microsatellite loci. This collection of instance-level feature embeddings is further processed in the second portion of the model, which combines the instances into a bag-level score. In our embedding-based models, we utilized two different combination methods: average pooling and attention pooling. In the average pooling models, we calculate the mean over the 512 features of the N microsatellite sites to form a bag-level embedding of 1 x 512. In the attention models, we utilize a combination of two linear layers with 128 features and the tanh activation function to compute an attention score for each microsatellite instance. These attention scores are passed through a softmax function to ensure they sum to one and then applied to the microsatellite instance vectors via a dot product. The result is a weighted average of the instance vectors, which forms the pooled 1 x 512 bag-level embedding. The final prediction of MSI status is calculated via a sigmoid layer on this bag-level embedding. In the 300x and 400x coverage experiments, we used an instance-based model rather than average or attention pooling because of the GPU memory required by the larger microsatellite instance vectors. The instance-based model calculates a prediction of MSI status for each microsatellite instance in the bag; these instance scores are then aggregated by taking the average of the top k instance scores. In our testing, we found that k = 25 performed best and used that as our aggregation threshold for both the 300x and 400x evaluations. The final output of the model is a probability of MSI status for the sample, which we threshold at 0.4 to obtain our final binary classification of MSS vs. MSI-H.
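The attention pooling and top-k instance aggregation described above can be sketched in NumPy. The weight shapes follow the 512- and 128-dimensional layers described in the text, but the function names, the omission of bias terms, and the random initialization in the example are simplifying assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, V, w):
    """Attention-based MIL pooling over N instance embeddings.

    H is N x 512; V (512 x 128) and w (128,) play the roles of the
    two linear attention layers with a tanh between them. Returns the
    512-dimensional bag-level embedding: a weighted average of the
    instances, with weights summing to one.
    """
    scores = np.tanh(H @ V) @ w  # one attention logit per instance
    a = softmax(scores)          # attention weights, sum to 1
    return a @ H

def topk_aggregate(instance_probs, k=25):
    """Instance-based aggregation used for the 300x/400x models:
    average of the top-k per-site MSI probabilities."""
    return np.sort(np.asarray(instance_probs))[-k:].mean()
```

A bag of 30 instances pooled this way yields a single 512-dimensional embedding, on which the final sigmoid prediction would be computed.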

Microsatellite selection

We generated a list of microsatellite loci (n = 1755) covered by MSK-IMPACT using the msisensor scan tool bundled with MSISensor (v0.2); a detailed list of these loci can be found in the tests folder distributed with MiMSI.

Dataset

Our training and validation datasets consisted of 1058 cases for which we had orthogonal MMR IHC or MSI PCR testing. These cases were a combination of cases from internal validation cohorts and cases labeled exceptionally difficult based on purity, coverage, or MMR status. The ground truth labels were determined via IHC or PCR testing for MMR-d and/or MSI status; MSI cases were given a ground truth label of one, while MSS cases were labeled zero. Our training dataset was a subset of 741 of the full 1058 labeled cases and contained 396 MSS cases and 345 MSI-H cases. The remaining 317 cases were held out from the model and used as our test dataset. Orthogonal MMR IHC was collected and confirmed by a board-certified pathologist to curate a secondary prospective dataset of 5037 cases for additional testing of the MiMSI algorithm.

Training and validating the model

The model is trained by minimizing the binary cross entropy loss function across the training dataset of 741 cases. The loss function is given by:

Loss = −[y log(ŷ) + (1 − y) log(1 − ŷ)]  (1)

The model was built using PyTorch and optimized using the Adam algorithm19,20. Multiple learning rates and weight decay values were tested, and the model with the highest accuracy on the test set was chosen18. In our experiments, the best-performing learning rate and weight decay were 0.0001 and 0.0005, respectively. The model was trained for 50 epochs and initialized with random weights. The batch size was set to one, so each gradient update was computed from a single sample and each epoch was one pass through every sample in the training dataset.
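Equation (1) is the standard binary cross-entropy; a minimal NumPy version is below. The clipping for numerical stability is our addition, not part of the published method:

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-7):
    """Binary cross-entropy of Eq. (1), for label y in {0, 1} and
    predicted probability y_hat; clipping avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

A maximally uncertain prediction of 0.5 gives a loss of ln 2 ≈ 0.693 regardless of the label, while a confident correct prediction drives the loss toward zero.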

To validate our model, we tested an unseen dataset of 317 cases, of which 160 were MSS and 157 were MSI-H, and we report our metrics against those held out samples. To compare performance against MSISensor we ran the tool against the same 317 cases utilizing the default parameters.

Downsampling effects on accuracy

Evaluation of our held-out test set indicated that the 100x and 200x coverage models were the best performing. However, given that MSK-IMPACT is a targeted, deep sequencing assay with average coverage close to 600x, we conducted further analysis to quantify the effects of randomly downsampling reads to these lower coverage levels before inputting the generated vectors into our model. Each case in the test set was randomly downsampled to 100 and 200 reads 10 times each, giving us 10 replicates of each tumor/normal sample to input into the trained 100x and 200x models, respectively. Each replicate was evaluated using its corresponding model, resulting in 10 predictions, which we used to create 95% confidence intervals for each of the 137 samples at both coverage levels. Comparing these confidence intervals, the 200x model empirically demonstrated more stable predictions (Supplementary Fig. 2): the average length of a prediction's confidence interval was 0.044 for the 200x model, compared to 0.06 for the 100x model, and fewer confidence intervals overlapped our prediction threshold of 0.4 when using the 200x model. This is particularly important for our application, since a confidence interval that overlaps our threshold effectively creates an Indeterminate case.
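The text does not specify how the 95% confidence intervals were constructed; one plausible sketch, using a normal approximation over the 10 replicate predictions (an assumption, along with the hypothetical helper names), is:

```python
import numpy as np

def replicate_ci(preds, z=1.96):
    """95% confidence interval over replicate predictions using a
    normal approximation (mean +/- 1.96 standard errors); the exact
    CI construction used in the study is an assumption here."""
    preds = np.asarray(preds, dtype=float)
    m = preds.mean()
    se = preds.std(ddof=1) / np.sqrt(len(preds))
    return m - z * se, m + z * se

def is_effectively_indeterminate(preds, threshold=0.4):
    """True if the CI straddles the 0.4 decision threshold, i.e. the
    replicates do not agree on a single MSS/MSI-H call."""
    lo, hi = replicate_ci(preds)
    return lo < threshold < hi
```

A sample whose 10 replicate predictions hover around 0.4 is flagged, whereas one whose replicates all sit near 0.9 is confidently MSI-H.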

Error, sensitivity, and specificity

A case with an MSISensor score of more than 10 was considered MSI-H, while a case with a score between 3 and 10 was considered Indeterminate, and a score less than 3 was considered MSS. A case with a probability of MSI-H less than 0.4 by MiMSI was considered an MSS classification, while a score greater than or equal to 0.4 was considered MSI-H.

Error Rate (E), Sensitivity (TPR), and specificity (TNR) were calculated according to the following formulas:

E = (FN + FP) / (TP + FP + TN + FN)  (2)
TPR = TP / (TP + FN)  (3)
TNR = TN / (TN + FP)  (4)

For calculating MSISensor accuracy, we considered a case a true negative if MSISensor returned a score less than 3 for a case confirmed MSS by PCR/IHC, or a true positive if MSISensor returned a score greater than 10 for a case confirmed MSI-H by PCR/IHC. A case was considered a false negative if MSISensor classified a case Indeterminate or MSS and it was determined to be MSI-H by orthogonal testing, and a false positive if MSISensor classified a case MSI-H that was classified MSS by PCR/IHC. The Receiver Operating Characteristic (ROC) and Precision-Recall plots were built using the open-source scikit-learn package21, and we calculated the Area Under the Receiver Operating Characteristic (auROC) with this package as well.
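The counting rules above can be made concrete in a short sketch. The helper names are hypothetical, and the treatment of an Indeterminate MSISensor call on a PCR/IHC-confirmed MSS case is not specified in the text; this sketch counts it as a true negative (an assumption):

```python
def confusion_counts(truth, calls):
    """Tally TP/FP/TN/FN for three-way MSISensor calls against
    PCR/IHC ground truth: an Indeterminate or MSS call on a
    confirmed MSI-H case is a false negative; an Indeterminate call
    on a confirmed MSS case is counted as a true negative here
    (an assumption not stated in the text)."""
    tp = fp = tn = fn = 0
    for t, c in zip(truth, calls):
        if t == "MSI-H":
            if c == "MSI-H":
                tp += 1
            else:
                fn += 1
        else:
            if c == "MSI-H":
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Error rate (Eq. 2), sensitivity (Eq. 3), specificity (Eq. 4)."""
    e = (fn + fp) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return e, tpr, tnr
```

For five cases with one Indeterminate call on an MSI-H case, the tallies are (TP, FP, TN, FN) = (1, 1, 1, 2), giving an error rate of 0.6.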

Concordance with MSISensor on the MSK-IMPACT cohort

The 45,112 MSK-IMPACT cases were analyzed with the default parameters of MSISensor and subsequently classified by the MiMSI model after training and validating on the dataset described above. We retained the same cutoffs for MSS, Indeterminate, and MSI-H for MSISensor cases as utilized in training and validating the model above. We defined a concordant case as one where (1) both MSISensor and MiMSI classified the sample as MSS or (2) both MiMSI and MSISensor classified the sample as MSI-H. Any other combination of classifications from MSISensor and MiMSI was treated as a discordant result.
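The concordance definition reduces to a simple predicate (a hypothetical helper for illustration):

```python
def concordant(mimsi_call, msisensor_call):
    """Concordance rule above: concordant only when both tools call
    MSS or both call MSI-H; any other combination (including an
    MSISensor Indeterminate call) is discordant."""
    return (mimsi_call, msisensor_call) in {("MSS", "MSS"),
                                            ("MSI-H", "MSI-H")}
```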

Mutational signature analysis

Mutational signatures were determined as described before6. In brief, the contribution of various mutation signatures was assessed for each sample by analyzing the trinucleotide context of the mutations, resulting in a 96 mutational‐class matrix. These mutations were then resampled 1000 times and subjected to a decomposition analysis, which aimed to minimize the Kullback–Leibler divergence between the observed sample signature and an approximation constructed from the 30 Stratton signatures. Each signature was assigned a weight corresponding to the percentage of contributing mutations. A signature was considered dominant for the sample if it accounted for more than 40% of the observed mutations.
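The decomposition step can be sketched with multiplicative updates that minimize the Kullback-Leibler divergence for a fixed signature matrix. This optimizer is an illustrative choice; the published method's exact procedure, including the 1000x mutation resampling, is not reproduced here, and the function names are hypothetical:

```python
import numpy as np

def kl_decompose(v, S, n_iter=2000):
    """Decompose an observed 96-bin mutation spectrum v into
    nonnegative weights over the columns of signature matrix S
    (96 x k), minimizing KL divergence between v and S @ w via
    multiplicative updates (illustrative optimizer, not the
    study's exact procedure)."""
    S = S / S.sum(axis=0)      # normalize each signature column
    v = v / v.sum()            # observed spectrum as frequencies
    w = np.full(S.shape[1], 1.0 / S.shape[1])
    for _ in range(n_iter):
        recon = np.maximum(S @ w, 1e-12)
        w = w * (S.T @ (v / recon))  # multiplicative KL update
        w = w / w.sum()
    return w                   # per-signature mutation fractions

def dominant_signature(w, cutoff=0.4):
    """Index of the dominant signature, or None if no signature
    accounts for more than the 40% cutoff described above."""
    i = int(np.argmax(w))
    return i if w[i] > cutoff else None
```

On a synthetic spectrum built from known weights, the recovered weights sum to one and the signature contributing 60% of mutations is flagged as dominant.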

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Reporting Summary (320.5KB, pdf)
Peer Review File (3.3MB, pdf)

Source data

Source Data (5.6MB, xlsx)

Acknowledgements

We thank the members of Molecular Diagnostics Service and Fuchs Laboratory for their input. This study was funded by the National Cancer Institute under the MSK Cancer Center Support Grant/Core Grant (P30 CA008748).

Author contributions

Conception and design: J.Z., J.F.H., T.P., G.J., M.F.B., T.J.F., A.R.B., and A.Z. Administrative support: N.D. and M.L. Provision of study material or patients: J.F.H., S.M., S.S.C., C.V., D.D., J.C., J.S., and R.B. Collection and assembly of data: J.Z., J.F.H., S.R., R.N.P., G.J., and A.Z. Data analysis and interpretation: J.Z., J.F.H., S.R., R.N.P., G.J., T.J.F., A.R.B., and A.Z. Manuscript writing: All authors.

Peer review

Peer review information

Nature Communications thanks Lee Albacker, Ruibang Luo, and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Data availability

The raw sequencing data for the MSK-IMPACT analysis are protected and cannot be made broadly available due to privacy laws; patient consent to deposit raw sequencing data was not obtained. Three-dimensional microsatellite vectors for training and test data are available at https://mskcc.box.com/s/x1b4y5lkc49epbw8zujvziwuzuu3fnot. Clinical and somatic alteration data for the secondary cohort are publicly available on cBioPortal: https://www.cbioportal.org/study/summary?id=pancan_mimsi_msk_2024. Raw data may be requested from brannona@mskcc.org with appropriate institutional approvals. Data will be shared for a span of 2 years within 2 weeks of execution of a data transfer agreement with MSK, which will retain all title and rights to the data and results from their use. Source data are provided in this paper.

Code availability

The code developed and used in these experiments has been made freely available under version 3 of the GNU General Public License (GPLv3). The software, along with the fully trained model is hosted at https://github.com/mskcc/mimsi (10.5281/zenodo.13357948).

Competing interests

John Ziegler is an employee of MongoDB, New York. Jaclyn F. Hechtman is an employee of Caris Life Sciences and has received consulting fees from Pfizer. Ryan N. Ptashkin is an employee of Natera. Gowtham Jayakumaran is an employee of Guardant Health. Sumit Middha is an employee of Adaptimmune. Shweta S. Chavan is an employee of Repertoire Immune Medicines, Cambridge, MA. Chad Vanderbilt holds equity and intellectual property rights in Paige.AI and has engaged in uncompensated professional services and activities for Paige.AI. Deborah DeLair is an employee of Northwell Health, Greenvale, NY. Jinru Shia has been engaged in uncompensated professional services and activities for Paige.AI. Nicole DeGroat is an employee of Regeneron Pharmaceuticals, Tarrytown, NY. Ryma Benayed is an employee of AstraZeneca, New York. Marc Ladanyi received advisory board compensation from Merck, Bristol-Myers Squibb, Takeda, Bayer, Lilly Oncology, and Paige.AI, and research support from LOXO Oncology and Helsinn Healthcare. Michael F. Berger received consulting fees from Eli Lilly, AstraZeneca, and Paige.AI, grant support from Boundless Bio, and has intellectual property rights in SOPHiA Genetics. Thomas J. Fuchs is the founder, chief scientist, and a shareholder of Paige.AI and is an employee of Eli Lilly and Company. A. Rose Brannon has intellectual property rights in SOPHiA Genetics. Ahmet Zehir is an employee of Natera and received honoraria from Illumina. The remaining authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: A. Rose Brannon, Ahmet Zehir.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-024-54970-z.

References

1. Le, D. T. et al. Mismatch repair deficiency predicts response of solid tumors to PD-1 blockade. Science 357, 409–413 (2017).
2. Center for Drug Evaluation & Research. FDA approves pembrolizumab for adults and children with TMB-H solid tumors. https://www.fda.gov/drugs/drug-approvals-and-databases/fda-approves-pembrolizumab-adults-and-children-tmb-h-solid-tumors (2020).
3. Middha, S. et al. Reliable pan-cancer microsatellite instability assessment by using targeted next-generation sequencing data. JCO Precis. Oncol. 2017, PO.17.00084 (2017).
4. Cheng, D. T. et al. Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. J. Mol. Diagn. 17, 251–264 (2015).
5. Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 23, 703–713 (2017).
6. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
7. Torracinta, R. & Campagne, F. Training genotype callers with neural networks. Preprint at 10.1101/097469 (2016).
8. Luo, R., Sedlazeck, F. J., Lam, T.-W. & Schatz, M. C. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nat. Commun. 10, 998 (2019).
9. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
10. Dietterich, T. G., Lathrop, R. H. & Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89, 31–71 (1997).
11. Quellec, G., Cazuguel, G., Cochener, B. & Lamard, M. Multiple-instance learning for medical image and video analysis. IEEE Rev. Biomed. Eng. 10, 213–234 (2017).
12. Wang, X., Yan, Y., Tang, P., Bai, X. & Liu, W. Revisiting multiple instance neural networks. Pattern Recognit. 74, 15–24 (2018).
13. Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In Proc. 35th International Conference on Machine Learning 2127–2136 (PMLR, 2018).
14. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
15. Hechtman, J. F. et al. Retained mismatch repair protein expression occurs in approximately 6% of microsatellite instability-high cancers and is associated with missense mutations in mismatch repair genes. Mod. Pathol. 33, 871–879 (2020).
16. Haradhvala, N. J. et al. Distinct mutational signatures characterize concurrent loss of polymerase proofreading and mismatch repair. Nat. Commun. 9, 1746 (2018).
17. Latham, A. et al. Microsatellite instability is associated with the presence of Lynch syndrome pan-cancer. J. Clin. Oncol. 37, 286–295 (2019).
18. Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).
19. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR, 2015).
20. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd International Conference on Neural Information Processing Systems 8026–8037 (Curran Associates Inc., 2019).
21. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
