Abstract
Clinical decision-making is driven by multimodal data, including clinical notes and pathologic characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise to advance clinical care1,2. However, the scarcity of well-annotated multimodal datasets in the clinical setting hinders the development of useful models. Here, we develop the Multimodal transformer with Unified maSKed modeling (MUSK), a vision-language foundation model designed to leverage large-scale, unlabeled, unpaired image and text data. MUSK is pre-trained on 50 million pathology images from 11,577 patients and 1 billion pathology-related text tokens using unified masked modeling. It is further pre-trained on 1 million pathology image-text pairs to align vision and language features efficiently. With minimal or no further training, MUSK is tested in a wide range of applications and demonstrates superior performance across 23 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, image classification, and molecular biomarker prediction. Furthermore, MUSK shows strong performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction, and immunotherapy response prediction in lung and gastro-esophageal cancers. MUSK effectively combines complementary information from pathology images and clinical reports and can potentially improve diagnosis and precision cancer therapy.
Introduction
Clinical decision-making is a complex process that involves information obtained from multiple data modalities. In clinical practice, physicians do not rely on a single data source for making diagnosis and treatment decisions. Instead, they incorporate information from multiple sources, including patient demographics, medical history, imaging findings, and pathologic characteristics of the disease. Therefore, making accurate diagnosis and treatment decisions requires the synthesis of information from multi-modal data. Given the complexity of these tasks, artificial intelligence (AI) approaches that can effectively integrate multi-modal data hold significant promise to advance clinical care1–5.
Foundation models represent a new frontier of medical AI research and development6,7. These models are pretrained on massive, diverse datasets and can be applied to numerous downstream tasks with minimal or no further training8–13. This has significant advantages over the traditional approach that requires training a new model for every new task. However, a major hurdle for the development of multi-modal AI models has been the scarcity of well-annotated datasets, especially in the clinical setting.
Recent efforts have been made to develop vision-language foundation models for medicine14, particularly in the field of pathology15–17. Although initial results are promising, several important considerations could limit their potential clinical impact. First, these studies use off-the-shelf foundation models based on contrastive learning18, which requires paired image-text data for pretraining. While the scale of data is impressive at ~0.2–1.2 million image-text pairs, it is still far below the billions of data points used for training natural vision-language models19, and it remains unclear whether this scale is sufficient to fully capture the diversity of the entire disease spectrum. Second, previous studies focused on relatively simple tasks such as image classification or image and text retrieval, with the intended applications for cancer detection and diagnosis. However, the prediction of treatment response and outcomes using multi-modal foundation models has not been demonstrated. This is a much more challenging problem but has significant implications for guiding treatment decisions in precision medicine20.
Here, we present a new vision-language foundation model based on Multi-modal transformer with Unified maSKed modeling (MUSK) for pretraining. Motivated by the success of multi-modal learning of natural image-text data21, we introduce pathology-specific and general methodology adaptations to the MUSK approach for building high-performance foundation models for precision oncology. MUSK pretraining leverages large-scale, unlabeled, unpaired data with 50 million pathology images and 1 billion text tokens (Figure 1a). The pathology images used for masked pretraining originated from 11,577 patients representing 33 tumor types. We performed extensive evaluation on a wide range of downstream tasks including image and text retrieval, visual question answering, image classification, and molecular biomarker prediction. MUSK achieves superior performance over state-of-the-art foundation models on 23 patch-level and slide-level benchmarks (Figure 1b). Furthermore, MUSK is evaluated on multi-modal clinical report and image data from more than 8,000 patients and shows strong performance for predicting clinical outcomes, including melanoma relapse prediction, pan-cancer prognosis prediction, and immunotherapy response prediction.
Fig. 1. Data curation, model development and evaluation.

a. MUSK model pretraining. We develop a vision-language foundation model built upon a multimodal transformer architecture as the network backbone. The model pretraining consists of two sequential phases. First, MUSK is pretrained on a total of 50 million pathology images and 1 billion pathology-related text tokens. The images originated from nearly 33,000 whole-slide histopathology scans from 11,577 patients representing 33 tumor types. Adapted from the BEiT321 architecture, the MUSK model consists of shared self-attention blocks and two independent experts for vision and language inputs; pretraining is achieved by masked modeling. Second, MUSK is pretrained on 1 million image-text pairs from Quilt1M using contrastive learning for multimodal alignment. b. General-purpose clinical applications. Once the pretraining is complete, MUSK can be used for various downstream tasks with minimal or no further training. Importantly, we evaluate MUSK using whole-slide images and clinical reports for outcome prediction, including relapse, prognosis, and immunotherapy response prediction. MUSK substantially improves upon state-of-the-art vision-language foundation models, including CLIP15, Quilt1M46, BiomedCLIP47, and CONCH16. The graphics of reports, melanoma, prognosis, lung cancer, and gastro-esophageal cancer in b are created with Biorender.com.
Results
Zero-shot Cross-modal Retrieval
A key feature of a foundation model is the ability to perform downstream tasks without further training, i.e., zero-shot learning capability15,16. By learning an aligned latent embedding space for visual and language representations, MUSK can retrieve relevant texts based on an image query, and vice versa. We evaluated MUSK for zero-shot cross-modal retrieval on two benchmark datasets: BookSet22 and PathMMU23 with 4,265 and 7,774 image-text pairs, respectively.
MUSK achieved superior performance over seven other foundation models in both image-to-text and text-to-image retrieval tasks (Figure 2, Supplementary Tables 1 and 2). On the PathMMU dataset, MUSK outperformed the second-best model, CONCH, for image-to-text retrieval at Recall@50 (the recall rate among the top 50 retrieval candidates): 34.4% (95% confidence interval (CI): 33.4%–35.5%) vs. 27.3% (95% CI: 26.4%–28.3%). Similarly, on the BookSet dataset, MUSK outperformed CONCH with 74.8% (95% CI: 73.6%–75.9%) vs. 71.3% (95% CI: 70.0%–72.6%) at Recall@50. We observed similar patterns for text-to-image retrieval, with MUSK surpassing the second-best model by 4.0% and 7.5% in absolute terms at Recall@50. These results demonstrate MUSK’s strong zero-shot learning capability.
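For readers who wish to reproduce the retrieval metric, the following is a minimal sketch of how Recall@K can be computed from paired, L2-normalized image and text embeddings; the encoder interface and embedding dimension are illustrative assumptions rather than the released MUSK evaluation code.

```python
# Minimal sketch of zero-shot image-to-text retrieval with Recall@K, assuming
# paired, L2-normalized embeddings from hypothetical MUSK encoders.
import torch

@torch.no_grad()
def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 50) -> float:
    """image_emb, text_emb: (N, D) paired, L2-normalized embeddings."""
    sim = image_emb @ text_emb.t()                     # cosine similarities (N, N)
    topk = sim.topk(k, dim=1).indices                  # top-k text candidates per image
    targets = torch.arange(sim.size(0)).unsqueeze(1)   # correct pair index i for image i
    hits = (topk == targets).any(dim=1).float()        # 1 if the paired text is retrieved
    return hits.mean().item()

# Example usage with random placeholder embeddings (768-dim assumed):
# img = torch.nn.functional.normalize(torch.randn(1000, 768), dim=-1)
# txt = torch.nn.functional.normalize(torch.randn(1000, 768), dim=-1)
# print(recall_at_k(img, txt, k=50))
```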
Fig. 2. Cross-modal retrieval and visual question answering.

a. Zero-shot image-to-text and text-to-image retrieval. MUSK consistently outperforms existing foundation models across different recall levels on the BookSet and PathMMU datasets. The two-sided Wilcoxon signed-rank test is used to assess the statistical differences between MUSK and the second-best model (CONCH). Visual examples are shown in Supplementary Fig. 4. b. Visual question answering. MUSK substantially outperforms existing foundation models on the PathVQA benchmark dataset. Notably, MUSK improved accuracy by 7% over the best-performing model (K-PathVQA) specifically trained for visual question answering. Examples of the results are shown for the MUSK and PLIP models. The two-sided Mann-Whitney U test is used to evaluate statistical significance. For VQA task-specific models, no confidence intervals were reported in the original papers. In a-b, data for foundation models are represented as means with 95% confidence intervals estimated using the bootstrap method (n=1,000 replicates).
Visual Question Answering
In addition to cross-modal retrieval, another common vision-language task is visual question answering (VQA), which uses the input of pathology images and accompanying textual questions to generate an answer. Existing approaches require the design of sophisticated network models specifically for this task24–27. By contrast, MUSK is a general-purpose vision-language foundation model that can perform VQA with minimal training (Figure 1b and Supplementary Fig. 3). We evaluated the performance on the PathVQA28 dataset that comprises 32,799 questions derived from 4,998 pathology images. MUSK achieved an accuracy of 73.2% (95% CI: 72.1%–74.4%), significantly outperforming other vision-language foundation models including PLIP, Quilt1M, BiomedCLIP and CONCH (Figure 2b). Remarkably, MUSK surpassed the best-performing model K-PathVQA27 specifically designed for VQA purposes (accuracy: 68.9%), highlighting the advantage of building powerful foundation models.
Image Retrieval and Classification
While developed as a multimodal foundation model, MUSK can also serve as a standalone image encoder. Here, we demonstrate MUSK’s vision capability across various image-based tasks, including image retrieval and classification.
We evaluated the performance of zero-shot image retrieval on the UniToPatho and BRACS datasets. In both datasets, MUSK outperformed other foundation models across all evaluation metrics (Extended Data Fig. 1a–b and Supplementary Table 3). For example, MUSK exceeded CLIP by 22.3%, PLIP by 8.6%, and CONCH by 2.5% in terms of mMV@5 (accuracy of the top 5 majority votes) on the BRACS dataset.
For image classification, we first evaluated the model’s performance for zero-shot learning on four benchmark datasets: PatchCamelyon29, SkinCancer30, PanNuke31, and UniToPatho32. Despite the challenging task of distinguishing multiple classes and the lack of any training data, MUSK still achieved promising performance for zero-shot image classification (Supplementary Table 4 and Figure 3a), surpassing the second-best model (CONCH, BiomedCLIP, or Quilt1M depending on the dataset) by 10.5%, 27.5%, 7.3%, and 10.1%, respectively.
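Zero-shot classification compares each image embedding against the embeddings of textual class prompts. The sketch below illustrates this procedure; the prompt template, class names, and `text_encoder` interface are illustrative assumptions, not the exact prompts used in the paper.

```python
# Minimal sketch of prompt-based zero-shot image classification with a frozen
# vision-language model; the template and encoder calls are placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, text_encoder, class_names,
                       template="an H&E image of {}."):
    prompts = [template.format(c) for c in class_names]
    class_emb = F.normalize(text_encoder(prompts), dim=-1)   # (C, D) prompt embeddings
    image_emb = F.normalize(image_emb, dim=-1)               # (N, D) image embeddings
    logits = image_emb @ class_emb.t()                       # cosine similarity (N, C)
    return logits.argmax(dim=-1)                             # predicted class index per image
```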
Fig. 3. Patch-level image classification.

a. Zero-shot image classification. MUSK consistently outperforms seven alternative foundation models when evaluated on the UniToPatho, SkinCancer, PatchCamelyon, and PanNuke benchmark datasets, with p-values < 0.0001. b. 10-shot image classification. MUSK consistently outperforms other foundation models across 12 benchmark datasets. Two-sided Wilcoxon signed-rank tests are used to calculate the statistical differences between MUSK and the top-performing alternative model. The data are presented as means with 95% confidence intervals (error bars), estimated using the bootstrap method (n=1,000 replicates) in a, or calculated from n=10 independent experiments in b.
We then assessed MUSK’s capability for few-shot image classification, i.e., using only a few samples to fine-tune the pre-trained model. This is useful when the amount of training data is small, either because it is practically difficult to annotate enough samples or because the prevalence of disease is low. We performed a comprehensive evaluation of the model for few-shot image classification using 12 benchmark datasets. These datasets contain expert-annotated histopathology images of normal samples and malignancies from various tissues and organs, including skin, lung, colon, breast, prostate, kidney, and lymph nodes.
Across all 12 datasets29–39, MUSK showed the highest accuracy for 10-shot image classification compared with other foundation models (Figure 3b, Extended Data Fig. 2a, and Supplementary Table 5). Notably, the increase in classification accuracy was highest in the most challenging tasks, for which other models did not perform well. For instance, in the UniToPatho dataset, MUSK achieved an increase of 9.8% over the second-best model. Furthermore, we evaluated the model for [1,2,4,8]-shot image classification with even fewer training samples and obtained similar results (Extended Data Fig. 1c). This indicates that MUSK is a robust and label-efficient vision encoder for image classification.
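The few-shot protocol can be approximated by fitting a lightweight classifier on frozen image embeddings from k labeled examples per class. The following sketch uses scikit-learn logistic regression for illustration; the exact classifier head and sampling protocol used in the paper may differ.

```python
# Minimal sketch of k-shot evaluation with a linear probe on frozen embeddings;
# the classifier choice and sampling scheme are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def few_shot_eval(train_emb, train_y, test_emb, test_y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    # Sample k labeled examples per class from the training pool
    idx = np.concatenate([
        rng.choice(np.where(train_y == c)[0], size=k, replace=False)
        for c in np.unique(train_y)
    ])
    clf = LogisticRegression(max_iter=1000).fit(train_emb[idx], train_y[idx])
    return balanced_accuracy_score(test_y, clf.predict(test_emb))
```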
We finally evaluated the model for supervised image classification using all available training data in each of the 12 benchmark datasets. MUSK achieved an average accuracy of 88.2%, outperforming other foundation models including CLIP, PLIP, Quilt1M, BiomedCLIP, and CONCH by a margin of 17.5%, 9.1%, 11.7%, 11%, and 2.2% respectively (Extended Data Fig. 2b and Supplementary Table 6). Visualizations of image embeddings produced by different models further highlight the robustness of MUSK’s feature representation capabilities (Supplementary Fig. 5). These results demonstrate that MUSK provides a powerful approach to learn more effective image representations for pathology classification.
Molecular Biomarker Prediction
Molecular biomarkers such as protein expression and gene mutation are critical components of precision oncology that can directly inform targeted therapies40. Here, we evaluated the performance of MUSK against five state-of-the-art pathology foundation models for predicting molecular biomarkers from histopathology images. Specifically, we assessed the models for predicting receptor status in breast cancer and IDH mutation status in brain tumors using the publicly available BCNB41 and MUV-IDH42 datasets, respectively. MUSK achieved significantly higher performance than other pathology foundation models, including PLIP15, UNI11, GigaPath10, Virchow12, and CONCH16, for predicting ER, PR, and HER2 status and IDH mutation status (Mann-Whitney U test P<0.05; Extended Data Fig. 3 and Supplementary Table 7). For instance, our model achieved an AUC of 0.826 (95% CI: 0.813–0.839) for predicting HER2 status, a significant improvement over the leading comparison methods GigaPath (0.786, 95% CI: 0.756–0.817) and CONCH (0.771, 95% CI: 0.745–0.796), P=0.008.
Melanoma Relapse Prediction
Melanoma is the most serious form of skin cancer and has a relatively high likelihood of relapse that can lead to death. Accurate prediction of relapse after curative-intent surgery may inform personalized treatment strategies43. For instance, patients at high risk of relapse should receive adjuvant systemic therapy, while those at low risk of relapse may avoid the toxicity associated with the drugs. Traditional risk factors such as tumor thickness and presence of ulceration do not accurately predict an individual patient’s relapse44.
To address this unmet need, we developed a multimodal approach based on MUSK to predict the risk of relapse in melanoma. We used the VisioMel dataset45, which includes clinical reports and whole-slide images (WSIs) of diagnostic H&E slides as well as 5-year follow-up data for 1,342 patients with melanoma. Compared with existing vision-language foundation models, MUSK achieved the highest AUC of 0.833 (95% CI: 0.818–0.847) for predicting 5-year relapse, significantly outperforming PLIP, Quilt1M, BiomedCLIP, and CONCH (Extended Data Fig. 4a, b). We then conducted ablation experiments using single-modal inputs on the MUSK model (Extended Data Fig. 4d). The results show that a model based on clinical reports or images alone has a lower accuracy for predicting relapse. By combining complementary information obtained from two different data modalities, MUSK further improved the accuracy of relapse prediction, highlighting the power of our multimodal approach.
To be clinically useful, a prognostic model should be highly sensitive for predicting relapse in order to minimize the risk of under-treatment. We thus evaluated the model’s performance at a predetermined sensitivity threshold of 90% (Extended Data Fig. 4c). The MUSK model achieved a substantially higher specificity than other foundation models, with an improvement of about 12% (P=0.0079). The clinical implication is that our model may spare more patients from toxic but unnecessary adjuvant therapy. Finally, visualization of the model’s prediction reveals that MUSK could automatically uncover relevant pathological features for predicting relapse (Extended Data Fig. 4e and Supplementary Fig. 6).
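The operating-point analysis can be reproduced from predicted relapse probabilities by locating the threshold that first reaches the target sensitivity on the ROC curve, as sketched below; this is a generic illustration rather than the authors' analysis script.

```python
# Minimal sketch of computing specificity at a fixed 90% sensitivity operating
# point from predicted relapse probabilities.
import numpy as np
from sklearn.metrics import roc_curve

def specificity_at_sensitivity(y_true, y_score, target_sensitivity=0.90):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # First ROC point whose sensitivity (TPR) meets or exceeds the target
    idx = np.argmax(tpr >= target_sensitivity)
    return 1.0 - fpr[idx]                      # specificity = 1 - false positive rate
```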
Pan-cancer Prognosis Prediction
Having demonstrated the effectiveness of our approach for predicting relapse, specifically in melanoma, we next evaluated the model for its ability to predict prognosis broadly in a pan-cancer setting. To do this, we collected diagnostic H&E WSIs, associated pathology reports, and follow-up data from The Cancer Genome Atlas (TCGA), encompassing a total of 7,927 WSIs from 6,602 patients across 16 major cancer types. We trained a multimodal prognostic model in each cancer type and then evaluated the performance for predicting disease-specific survival.
Across all 16 cancer types, MUSK consistently outperformed clinical risk factors and state-of-the-art foundation models for prognosis prediction. On average, MUSK achieved a c-index of 0.746, significantly above the c-index of 0.645 (P<0.0001) for overall stage, and 0.667 (P<0.0001), 0.672 (P<0.0001), 0.668 (P<0.0001), and 0.684 (P<0.0001) for multimodal foundation models PLIP15, Quilt1M46, BiomedCLIP47, and CONCH16, and 0.649 (P<0.0001), 0.684 (P<0.0001), and 0.672 (P<0.0001) for pathology foundation models UNI11, GigaPath10, and Virchow12, respectively (two-sided Mann-Whitney U test, Figure 4a, Extended Data Fig. 3, Supplementary Fig. 7, and Supplementary Table 7). The best prediction was achieved for renal cell carcinoma, with a c-index of 0.887 (95% CI: 0.854–0.920). In addition, MUSK achieved high performance for prognosis prediction in breast invasive carcinoma, colorectal adenocarcinoma, low-grade glioma, and endometrial carcinoma with a c-index above 0.8.
Fig. 4. Prognosis prediction across 16 cancer types.

a. Kaplan-Meier analyses show that MUSK can significantly stratify patients for disease-specific survival across 16 cancer types, with hazard ratios ranging from 1.59 for glioblastoma multiforme to 36.83 for renal cell carcinoma. The two-sided log-rank test is used to compare the survival differences between the high-risk and low-risk groups (cutoff: median). HR: hazard ratio. b. The multimodal MUSK model significantly improves prognosis prediction over models based on clinical reports or whole-slide images alone as shown in the overall bars (p-value < 0.0001). The overall bars represent the average performance across 16 projects. Bladder Urothelial Carcinoma (BLCA), Breast Invasive Carcinoma (BRCA), Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC), Colon Adenocarcinoma and Rectal Adenocarcinoma (COADREAD), Esophageal Carcinoma (ESCA), Glioblastoma Multiforme (GBM), Head and Neck Squamous Cell Carcinoma (HNSC), Low-Grade Glioma (LGG), Liver Hepatocellular Carcinoma (LIHC), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), Pancreatic Adenocarcinoma (PAAD), Renal Cell Carcinoma (RCC), Skin Cutaneous Melanoma (SKCM), Stomach Adenocarcinoma (STAD), and Uterine Corpus Endometrial Carcinoma (UCEC). In b, data are represented as means with standard deviations calculated from 5-fold cross-validation experiments. The two-sided Mann-Whitney U test is used to assess the statistical significance between MUSK and the comparison methods.
We assessed the ability of the MUSK model to stratify patients for disease-specific survival using Kaplan-Meier analysis. Our results demonstrate a significant stratification of low- and high-risk patients with distinct survival outcomes (log-rank test P<0.001) across 16 cancer types (Figure 4a). Strikingly, the model achieved a hazard ratio of greater than 30 in renal cell carcinoma, with 10-year survival rates of 95.3% vs. 48.3% in the low- and high-risk groups, respectively. We further conducted multivariable Cox regression analysis and confirmed that the MUSK-based risk score was a significant prognostic factor across all 16 cancer types, independent of clinical risk variables including age, sex, stage, and tumor grade (Supplementary Fig. 8).
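A minimal sketch of the median-cutoff stratification and log-rank comparison, using the lifelines package; the inputs and variable names are placeholders rather than the authors' analysis code.

```python
# Minimal sketch of Kaplan-Meier stratification by median predicted risk score
# and the two-sided log-rank test, using lifelines.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def stratify_and_test(risk, time, event):
    """risk, time, event: 1-D numpy arrays (risk score, follow-up time, event indicator)."""
    high = risk >= np.median(risk)                       # median risk score as cutoff
    result = logrank_test(time[high], time[~high], event[high], event[~high])
    km_high, km_low = KaplanMeierFitter(), KaplanMeierFitter()
    km_high.fit(time[high], event[high], label="high risk")
    km_low.fit(time[~high], event[~high], label="low risk")
    return result.p_value, km_high, km_low               # curves can be plotted with .plot()
```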
We conducted ablation experiments on the MUSK model by training image- and text-only models for prognosis prediction. These models showed reasonable performance with an average c-index of 0.654 (P<0.0001) and 0.673 (P<0.0001). Of note, the multimodal MUSK model consistently outperforms prognostic models with single-modal inputs across all 16 cancer types (Figure 4b), with a significantly higher c-index of 0.746. These results demonstrate that MUSK can effectively integrate the complementary information of multimodal image and text data for prognosis prediction across cancer types.
Immunotherapy Response Prediction
Immunotherapy, specifically immune checkpoint inhibitors (ICIs), has transformed the treatment landscape in oncology and offers the potential for long-term durable benefits. However, only ~20% of patients will respond and benefit from ICIs48,49. It is critical to identify which patients will benefit from ICIs given the toxicity and financial burden of these treatments. Existing biomarkers such as tumor PD-L1 expression and tumor mutation burden (TMB) have limited predictive power in distinguishing responders from non-responders50,51. There is an unmet need for more accurate prediction of immunotherapy response.
Here, we collected multi-modal datasets, including pre-treatment H&E slides, associated pathology reports, therapy response and follow-up data for 118 advanced non-small cell lung cancer (NSCLC) patients treated with ICIs. We evaluated the MUSK model for predicting two clinical endpoints: objective response and progression-free survival (PFS). Patients were classified as responders (complete or partial response) or non-responders (stable or progressive disease).
For predicting response, MUSK achieved an AUC of 0.768 (95% CI: 0.724–0.812), significantly higher than existing biomarkers such as tumor PD-L1 expression with an AUC of 0.606 (95% CI: 0.492–0.699, P<0.0001). MUSK also outperformed models trained using other multimodal foundation methods such as PLIP, Quilt1M, BiomedCLIP, and CONCH, with AUCs ranging from 0.636 to 0.692 (Figure 5a). Similarly, for predicting PFS, MUSK demonstrated a significant improvement over existing biomarkers, with a c-index of 0.705 (95% CI: 0.677–0.732) compared to 0.574 (95% CI: 0.447–0.691, P<0.0001) for tumor PD-L1 expression (Supplementary Fig. 10a). MUSK achieved significantly higher performance for PFS prediction than existing pathology foundation models such as UNI, GigaPath, and Virchow, with c-indices between 0.580 and 0.599 (Extended Data Fig. 3 and Supplementary Table 7). Compared with alternative multimodal approaches, MUSK also showed superior performance for PFS prediction over PLIP, Quilt1M, BiomedCLIP, and CONCH, with c-indices between 0.601 and 0.640 (Figure 5a).
Fig. 5. Lung cancer immunotherapy response prediction.

a. MUSK substantially outperforms other foundation models for predicting objective response and progression-free survival in non-small cell lung cancer patients treated with immunotherapy. b. The multimodal MUSK model significantly improves upon models based on clinical reports or whole-slide images alone for predicting immunotherapy response and outcomes. c. Kaplan-Meier analysis demonstrates that MUSK significantly stratifies patients into high-risk and low-risk groups for progression-free survival in the entire cohort and in clinically relevant subgroups defined by PD-L1 expression and EGFR mutation status. The two-sided log-rank test is used to compare the survival differences between the high-risk and low-risk groups. HR: hazard ratio. d. Two examples of lung cancer cases with and without objective response to immunotherapy. In each panel, the left image shows the original WSI, while the middle image displays the corresponding heatmap, which highlights the regions the model focused on within the WSI. The right images provide zoomed-in views of the regions receiving the most model attention. The case with response shows abundant infiltration of lymphocytes and minimal stroma. In contrast, the case without response shows minimal lymphocyte infiltration and abundant stroma. TPS: tumor proportion score. In a and b, the error bars represent the mean with standard deviation computed from 5-fold cross-validation experiments, and the two-sided Mann-Whitney U test is used to measure the statistical significance between MUSK and the compared methods.
We compared the performance of the MUSK model with text- and image-only models trained based on clinical reports and WSIs alone. MUSK significantly improved upon single-modal methods, demonstrating the effectiveness of our multimodal approach for predicting immunotherapy response and outcomes (Figure 5b).
To assess the ability of MUSK to stratify patients for progression-free survival, we performed Kaplan–Meier analysis (Figure 5c). In the entire cohort, MUSK separated patients into two risk groups with a hazard ratio (HR) of 2.54 (1.66–3.90), P<0.0001. The median progression-free survival was 4.3 and 13.4 months for the high- and low-risk groups, respectively. In comparison, tumor PD-L1 expression did not significantly stratify patients for PFS (Supplementary Fig. 10a). Our analysis reveals that MUSK can further stratify patients for PFS regardless of PD-L1 expression, EGFR mutation status, as well as treatment regimens with single-agent ICI or chemo-ICI combination therapy (Figure 5c and Extended Data Fig. 6a). The results are particularly striking for patients with PD-L1 negative (TPS = 0) tumors, with an HR of 7.38 (2.15–25.38), P=0.0002. These findings are clinically significant since patients with PD-L1 negative and EGFR mutated tumors typically do not receive immunotherapy due to low response rates, but MUSK can identify a subset of these patients who may benefit from immunotherapy.
We further performed multivariate Cox regression analyses to evaluate the independent value of MUSK for predicting progression-free survival. We incorporated all relevant clinical variables in the analysis, including age, sex, histology, central nervous system metastases, smoking, EGFR mutation, and tumor PD-L1 expression. Our results indicate that MUSK is the most significant predictor of progression-free survival with P=0.0012 (Supplementary Fig. 9). Overall, these findings demonstrate that by integrating multimodal data, MUSK can provide valuable additional information for a patient’s likelihood of response to immunotherapy and, therefore, may potentially inform treatment decision-making.
To facilitate the interpretation of the model prediction, we generated attention heatmaps and overlaid them on the WSIs (Figure 5d). For patients predicted to have a high likelihood of response, the high-attention areas show abundant infiltration of lymphocytes and minimal intratumoral stroma. On the other hand, for those with a low likelihood of response, the high-attention areas show minimal intratumoral lymphocyte infiltration and abundant stroma with dense collagenous fibers. These findings suggest the model could uncover pathological features previously implicated in immunotherapy response52.
Finally, we evaluated the multimodal MUSK model for predicting response and outcomes in 101 patients with advanced gastro-esophageal cancer treated with ICI-based immunotherapy. The only established predictive biomarker in gastro-esophageal cancer is microsatellite instability (MSI). In this cohort, MSI-H status had a modest accuracy for predicting immunotherapy response, with an AUC of 0.616 (95% CI: 0.550–0.682, P<0.0001). In comparison, MUSK achieved a much higher AUC of 0.762 (95% CI: 0.718–0.805), significantly outperforming other multimodal foundation models such as PLIP, Quilt1M, BiomedCLIP, and CONCH, with AUCs between 0.652 and 0.693 (Extended Data Fig. 5a). MUSK was also superior to pathology foundation models including UNI, GigaPath, and Virchow, with AUCs ranging from 0.644 to 0.651. Similar results were obtained for predicting progression-free survival, with MUSK outperforming other foundation models (Extended Data Fig. 5a). Consistent with the results for lung cancer, the multimodal MUSK model significantly improved upon text- and image-only models for predicting immunotherapy response and outcomes in gastro-esophageal cancer (Extended Data Fig. 5b).
We performed Kaplan–Meier analysis to assess MUSK for stratifying patient outcomes (Extended Data Fig. 5c). While PD-L1 expression did not significantly stratify patients for progression-free survival (Supplementary Fig. 10b), MUSK separated patients into two risk groups with a hazard ratio of 3.49 (2.02–6.01), P<0.0001. The median PFS was 3.6 and 14.1 months for the high- and low-risk groups, respectively. MUSK further stratified patients within biomarker-defined subgroups, such as PD-L1 negative (CPS = 0) and positive (CPS ≥ 1) tumors as well as MSS/MSI-L tumors. In addition, MUSK stratified patients regardless of treatment regimens with either single-agent ICI or chemo-ICI combination therapy (Extended Data Fig. 6b). Finally, we performed multivariate Cox regression analyses and our results show that MUSK is the only significant predictor of progression-free survival (P=0.0013) besides MSI status (Extended Data Fig. 5d). Visualization of attention heatmaps reveals differential patterns of lymphocyte infiltration and stroma abundance in responders vs non-responders (Extended Data Fig. 5e).
Discussion
In this study, we present MUSK, a new vision-language foundation model for general-purpose oncology applications. Through extensive benchmark evaluation on 23 downstream tasks, we show that MUSK achieves superior performance over existing foundation models, with minimal or no further training, for applications in cancer detection, diagnosis, and grading. Importantly, distinct from prior studies that rely on the similarity between different data modalities, we leverage the complementary information between clinical reports and images and demonstrate that the multi-modal approach achieves superior outcome prediction over either modality alone. Specifically, MUSK shows strong performance for melanoma relapse prediction, prognosis prediction across 16 cancer types, and immunotherapy response prediction in two real-world cohorts of lung and gastro-esophageal cancers.
The performance gain achieved by MUSK is largely attributable to its ability to incorporate unpaired image and text data for pretraining, which is far more common than paired data. Existing studies employ off-the-shelf foundation models with contrastive learning, which requires paired image-text data for pretraining. By contrast, MUSK is a custom-designed foundation model pretrained with unified masked modeling. This allows us to leverage substantially larger and more diverse unpaired data (50 million images and 1 billion text tokens), an increase of several orders of magnitude over the ~1 million image-text pairs used in prior studies15,16.
The scarcity of complete multimodal data represents a major challenge to training reliable AI models4. Our approach provides an effective solution to this problem by using more readily available unimodal data for unified masked learning followed by using multimodal data for fine-tuning and alignment. This training paradigm can be extended and applied to building multi-modal foundation AI models in other domains beyond pathology, such as radiology and dermatology images/reports7, as well as structured data such as genomics53.
Accurate prediction of treatment response and outcomes has significant clinical implications for precision oncology20. There are important conceptual and practical distinctions from cancer detection and diagnosis, the primary focus of existing pathology foundation models10–12,16. Given that pathologists have excellent performance in diagnosing cancer (which is the current gold standard), the impact of AI models in such scenarios would be limited to an assistive role. However, outcome prediction is a much more challenging problem due to the inherent uncertainty associated with forecasting the future. Current approaches based on clinical risk factors such as cancer stage and tumor grade have limited accuracy, typically with a c-index around 0.65, leaving substantial room for improvement. By combining routine clinical reports and histopathology images, the multimodal MUSK model significantly improved upon traditional risk factors for prognosis prediction across 16 cancer types, with an average c-index of 0.75 and exceeding 0.8 for several cancers. The model may be used to complement the current staging system and refine risk stratification, paving the way for personalized treatment strategies. For instance, in early-stage cancers, adjuvant therapy can be given to patients at high risk of relapse after curative-intent surgery, while low-risk patients may avoid the toxicity associated with systemic drugs.
Immunotherapy, particularly immune checkpoint inhibitors (ICIs), has prolonged survival for many patients with metastatic cancers and is the standard of care for the treatment of most tumor types. However, only a small fraction (10–20%) of patients will respond to immunotherapy and experience durable clinical benefits54. Given the financial burden and potential for immune-related toxicities of these treatments55, it is crucial to identify patients who are most likely to respond and benefit from ICIs. Here, we fine-tuned our pretrained multi-modal foundation model to predict immunotherapy response based on routine clinical reports and histopathology images. The model significantly improved upon existing clinical biomarkers such as PD-L1 expression and MSI status in lung and gastro-esophageal cancers, which are among the most common and lethal cancers worldwide56. To aid in model interpretation, we applied visualization techniques based on attention heatmaps, which revealed pathologic features of the tumor microenvironment that are consistent with known mechanisms of response and resistance to immunotherapy52,57. Importantly, the model identified a subset of patients with PD-L1 negative or EGFR mutated tumors who could benefit from ICIs. Since these patients typically do not receive ICIs due to overall low response rates54,58, our finding has significant clinical implications, with the potential to broaden the population of patients who may benefit from ICIs.
While the results on immunotherapy response prediction are promising, it is worth noting that they are obtained based on relatively small cohorts of about 220 patients from one academic medical center. Before the model can be considered for clinical implementation and adoption, several steps are needed to ensure that it is rigorously evaluated for safety, efficacy, and clinical utility. First, these findings should be validated and confirmed in future studies using larger, multi-institution cohorts. Second, for high-stakes applications such as treatment decision-making, regulatory approval will be required that includes validation in prospective clinical trials of immunotherapy-treated patients from diverse populations. The generation of high-level evidence through rigorous prospective validation can ultimately lead to changes in clinical guidelines and clinical practice.
In conclusion, we propose a new vision-language foundation model by leveraging unpaired image-text data. The model provides an effective approach to the integration of pathology images and clinical reports and can potentially improve diagnosis and precision cancer therapy.
Online Methods
Model Design and Pretraining
The pretraining of MUSK, inspired by BEiT-321, consists of two main steps. The first step employs masked data modeling to leverage large-scale unpaired images and text. The second step utilizes around 1 million image-text pairs with contrastive learning to align the two modalities and establish connections between images and texts. The network backbone is a general-purpose multimodal Transformer inspired by mixture-of-experts networks in large language models59, multimodal pretraining60,21, and image generation61. The model configurations are specified in Extended Data Fig. 7 and Supplementary Table 16.
Multimodal Data Curation for Pretraining
For pretraining of the multimodal MUSK foundation model, we incorporated unpaired pathology images and texts for masked learning, as well as paired image-text data for contrastive learning. The masked pretraining dataset comprised 1 billion text tokens extracted from 1,001,800 pathology-related articles from the PubMed Central Open Access Dataset and 50 million pathology image patches from The Cancer Genome Atlas (TCGA). The image patches were obtained from nearly 33,000 digitized hematoxylin and eosin (H&E)-stained WSIs from 11,577 patients representing 33 tumor types. We used the Quilt1M46 dataset (802k image-text pairs) in addition to the PathAsst62 dataset (207k image-text pairs) for the second pretraining phase via contrastive learning.
Noisy image-text pairs collected from the web present challenges for training and may degrade model performance. Thus, instead of training directly on these datasets, we adopt a bootstrap approach similar to BLIP63 during contrastive learning. We initially train on Quilt1M to obtain a baseline model, then filter out low-similarity image-text pairs based on this model. The final model is trained on the refined image-text dataset with improved data quality (Supplementary Fig. 2).
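The filtering step can be sketched as scoring each image-text pair with the baseline model and discarding pairs whose image-text similarity falls below a threshold; the threshold value below is an illustrative assumption.

```python
# Minimal sketch of the bootstrap-style data filtering: keep only image-text
# pairs whose baseline-model similarity exceeds a threshold (placeholder value).
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_pairs(image_embs, text_embs, threshold=0.2):
    """image_embs, text_embs: (N, D) embeddings of paired data from the baseline model."""
    sims = F.cosine_similarity(image_embs, text_embs, dim=-1)   # per-pair similarity
    keep = sims >= threshold                                    # drop noisy, low-similarity pairs
    return keep                                                 # boolean mask for the refined dataset
```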
Unified Masked Pretraining
We employ a unified masked data modeling approach for pre-training. We sample a batch of training images and texts during training to apply masked loss and optimize the model. We utilize the Masked Language Modeling (MLM) loss for texts and the Masked Image Modeling (MIM) loss for images.
MLM.
Similar to BERT64, we randomly selected 15% of the tokens within a text sequence and replaced them with a special [MASK] token. The model is then trained to predict these masked tokens using the context provided by all other unmasked text tokens. The positions of the masked tokens are denoted as $\mathcal{M} \subset \{1, \dots, N\}$, where $N$ is the total number of input tokens. The input sequence with masked tokens is denoted $x^{\mathcal{M}}$. The output vectors $\{h_i\}_{i \in \mathcal{M}}$, corresponding to the masked token positions, are fed into a classifier. The classifier predicts the most probable words from the vocabulary for each masked position, using cross-entropy loss as the objective function:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p\left(x_i \mid x^{\mathcal{M}}\right) \tag{1}$$
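A minimal PyTorch sketch of this MLM objective, assuming a generic transformer `model` that maps token IDs to per-token vocabulary logits; it is not the MUSK implementation, and special tokens are not excluded from masking for simplicity.

```python
# Minimal sketch of masked language modeling: mask 15% of tokens and compute
# cross-entropy only at the masked positions.
import torch
import torch.nn.functional as F

def mlm_loss(model, input_ids, mask_token_id, mask_prob=0.15):
    labels = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob   # positions to mask
    corrupted = input_ids.masked_fill(mask, mask_token_id)             # replace with [MASK]
    logits = model(corrupted)                                          # (B, L, vocab_size)
    labels[~mask] = -100                                               # ignore unmasked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```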
MIM.
The input image is split into image patches $\{x_i\}_{i=1}^{N}$ and then tokenized into discrete visual tokens $\{z_i\}_{i=1}^{N}$ by an image tokenizer; these tokens serve as the output labels of MIM. The vocabulary contains discrete token indices. At the input layer, 40% of the image patches are randomly masked, and the model then predicts the visual tokens of the masked patches. The masked positions are denoted as $\mathcal{M}$. Next, we replace the masked patches with a learnable embedding $e_{[\mathrm{M}]}$, yielding the corrupted image patches $x^{\mathcal{M}}$ that are fed into the transformer. The pre-training objective is to maximize the log-likelihood of the correct visual tokens given the corrupted image:

$$\mathcal{L}_{\mathrm{MIM}} = -\sum_{i \in \mathcal{M}} \log p\left(z_i \mid x^{\mathcal{M}}\right) \tag{2}$$
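Analogously, a minimal sketch of the MIM objective under similar assumptions: a placeholder encoder, prediction head, and learnable mask embedding, with the tokenizer's visual tokens supplied as labels.

```python
# Minimal sketch of masked image modeling: mask ~40% of patch embeddings with a
# learnable [MASK] embedding and predict the tokenizer's discrete visual tokens
# at those positions. All modules here are placeholders.
import torch
import torch.nn.functional as F

def mim_loss(encoder, head, mask_embed, patch_embs, visual_tokens, mask_ratio=0.40):
    """patch_embs: (B, N, D) patch embeddings; visual_tokens: (B, N) tokenizer labels;
    mask_embed: learnable (D,) parameter."""
    B, N, D = patch_embs.shape
    mask = torch.rand(B, N, device=patch_embs.device) < mask_ratio       # ~40% of patches
    corrupted = torch.where(mask.unsqueeze(-1), mask_embed.expand(B, N, D), patch_embs)
    logits = head(encoder(corrupted))                                    # (B, N, vocab_size)
    return F.cross_entropy(logits[mask], visual_tokens[mask])            # loss on masked patches only
```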
An image tokenizer is required to obtain semantically meaningful visual tokens. However, existing tokenizers, such as DALL-E65 and BEiT-v266, are primarily trained on natural images. Since the image tokenizer defines the learning targets for MIM, employing a non-pathology-specific tokenizer could result in suboptimal image representations. To address this, we trained a pathology-specific image tokenizer for MUSK following the BEiT-v2 methodology66, utilizing 5 million pathology images from the TCGA dataset. For training, we adopted CTransPath67 as the teacher model, providing semantic-aware targets to enhance the tokenizer’s performance.
Masked training settings.
Image augmentations include random vertical flipping, color dropping (converting images to grayscale), and weak color jittering with adjustments to brightness, contrast, saturation, and hue. Additionally, RandStainNA68 and FoVs69, which involve random magnifications at 10×, 20×, and 40×, are incorporated into the training pipeline. We pre-train MUSK for 1M steps using the masked pretraining losses $\mathcal{L}_{\mathrm{MIM}}$ for images and $\mathcal{L}_{\mathrm{MLM}}$ for texts. The batch size is 2048 for images and 2048 for texts. MUSK takes input images of 384 × 384 pixels, which are split into 16 × 16-pixel patches. Text data is tokenized using a SentencePiece tokenizer with a vocabulary size of 64k. We use the AdamW70 optimizer and a cosine learning rate decay scheduler with linear warmup. The weight decay is set to 0.05, and stochastic depth with a rate of 0.1 is used.
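For illustration, a torchvision-style pipeline corresponding to the augmentations listed above is sketched below; the probabilities and jitter strengths are placeholders (the exact values are not reproduced here), and RandStainNA and multi-FoV sampling are omitted.

```python
# Minimal sketch of a pathology-oriented augmentation pipeline; all probability
# and jitter values are illustrative placeholders, not the MUSK settings.
from torchvision import transforms

pretrain_augment = transforms.Compose([
    transforms.RandomVerticalFlip(p=0.5),                      # random vertical flip
    transforms.RandomGrayscale(p=0.2),                         # "color dropping" to grayscale
    transforms.ColorJitter(brightness=0.1, contrast=0.1,
                           saturation=0.1, hue=0.05),          # weak color jittering
    transforms.Resize((384, 384)),                             # 384 x 384 input resolution
    transforms.ToTensor(),
])
```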
Contrastive Pretraining
The second pretraining step utilizes contrastive learning to further train MUSK on image-text pairs for modality alignment. The image embeddings and text embeddings are used to compute the contrastive loss18, which aims to align the global representations of images and texts. We further design an auxiliary loss for fine-grained modality alignment. Specifically, we construct a lightweight cross-attention decoder module, utilizing the images as side information for masked language modeling. Image embeddings are employed as the key and value in the cross-attention, while the language embeddings serve as queries. This approach encourages the language embeddings to explore more detailed interactions with images, ultimately enhancing modality alignment. We empirically mask 30% of input text tokens and predict the ground-truth labels. We build a prediction layer at the output hidden states of the cross-modal decoder and optimize the model through a cross-entropy loss. The training loss for modality alignment is the combination of the contrastive loss and the auxiliary MLM loss from the decoder (Extended Data Fig. 7b):
$$\mathcal{L}_{\mathrm{align}} = \mathcal{L}_{\mathrm{contrastive}} + \mathcal{L}_{\mathrm{MLM}}^{\mathrm{dec}} \tag{3}$$
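A minimal sketch of this alignment objective, combining a symmetric InfoNCE contrastive term with the auxiliary decoder MLM term (passed in here as a precomputed scalar); the temperature and the unweighted sum are illustrative assumptions.

```python
# Minimal sketch of the image-text contrastive (InfoNCE) loss combined with an
# auxiliary masked-language-modeling term from a cross-attention decoder.
import torch
import torch.nn.functional as F

def alignment_loss(image_emb, text_emb, aux_mlm_loss, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +   # image-to-text direction
                         F.cross_entropy(logits.t(), targets))  # text-to-image direction
    return contrastive + aux_mlm_loss                         # total alignment objective
```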
Contrastive training settings.
We pre-train MUSK with contrastive learning using the alignment loss in Eq. (3) for 20 epochs. The batch size is 3072 image-text samples. MUSK takes input images of 384 × 384 pixels, which are split into 16 × 16-pixel patches. We apply image augmentations such as random resized cropping, horizontal flipping, and color jittering to enhance the training data. Text data is tokenized using a SentencePiece tokenizer with a vocabulary size of 64k. We use the AdamW optimizer and a cosine learning rate decay scheduler with a linear warmup of 2 epochs. The weight decay is set to 0.05, and the stochastic depth rate is 0.1.
Ablation Study
MUSK enhances traditional masked pretraining and contrastive learning by introducing four key adaptations: pathology-specific augmentations (stain augmentations and multiple fields of views), a pathology-specific tokenizer for MIM, a fine-grained image-text decoder for better local alignment, and bootstrapped contrastive learning to filter noisy data. We performed a series of ablation studies which demonstrate that these adaptations are essential for optimizing model performance and significantly improving image representation, cross-modal learning, and data quality for precision oncology applications (Extended Data Fig. 8 and Supplementary Results).
Benchmark Datasets
We evaluated the MUSK model for multimodal retrieval, visual question answering, and histopathology image classification using various publicly available benchmark datasets. BookSet22 contains 4,265 image-text pairs for cross-modal retrieval, while PathMMU23 includes 7,774 annotated image-caption pairs, emphasizing retrieval with expert-reviewed cases. PathVQA28 provides 32,799 open-ended questions linked to 4,998 pathology images for visual question answering. For histopathology classification, PatchCamelyon29 contains 327,680 binary-labeled images, and NCT-CRC-HE-100K37 spans 107,180 images across nine colorectal tissue classes. SICAPv234 focuses on prostate histology with 12,081 patches in four classes, while Osteo38 contains 1,144 osteosarcoma images across three classes. RenalCell36 features 35,458 images for renal carcinoma classification with six tissue types, and SkinCancer30 includes 129,364 patches representing 16 skin conditions. LC2500035 contains 25,000 images split into lung and colon cancer subsets, with 5 classes in total. PanNuke31 includes over 200,000 nuclei annotations across 19 tissue types for binary classification. UniToPatho32 supports colorectal polyp grading with 9,536 patches in six classes, and WSSS4LUAD39 targets LUAD tissue classification with 10,091 images in three categories. BRACS33 offers 4,539 breast image patches with six lesion types, while BCNB41 and MUV-IDH42 include 1,057 and 872 slides, respectively, with multimodal clinical information for breast cancer and brain tumor classification. Detailed descriptions for each dataset are provided in Supplementary Methods.
Melanoma Relapse Prediction
We evaluated the MUSK model for predicting the risk of relapse in melanoma by combining information from histopathology images and clinical reports. For this purpose, we used the VisioMel Challenge dataset1 curated from the French national database on melanoma. This dataset contains clinical reports, diagnostic H&E WSIs, and follow-up data for 1,342 melanoma patients. For comparison, we trained three models based on MUSK: (i) an image-based model; (ii) a text-based model; and (iii) a multimodal classifier that integrates image and text data. To evaluate these models, we conducted five-fold cross-validation with stratified sampling based on relapse status.
Since MUSK-V (the vision part of MUSK) is a patch-level encoder, we performed patch aggregation using attention-based multiple instance learning (AbMIL)71. The aggregation module consists of a fully connected layer and a rectified linear unit (ReLU) nonlinearity, which maps the inputs to an embedding dimension of 512, followed by a two-layer gated attention network with a hidden dimension of 384. The network uses a fully connected classifier head that projects the attention-pooled, slide-level image embeddings to the desired dimension. Meanwhile, clinical reports (providing details such as the patient’s age at initial diagnosis, sex, primary site, and medical history) were encoded using MUSK-L, the language component of MUSK. Finally, we combined the slide-level image embeddings and text embeddings as input to a lightweight multilayer perceptron (MLP) classifier, which generates the prediction outputs.
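A minimal sketch of the gated attention-based MIL aggregator described above; the 512 embedding and 384 hidden dimensions follow the text, while the input feature dimension and two-class head are illustrative assumptions.

```python
# Minimal sketch of gated attention-based MIL (AbMIL-style) pooling over
# patch-level embeddings for one slide; layer sizes partly follow the text.
import torch
import torch.nn as nn

class GatedAbMIL(nn.Module):
    def __init__(self, in_dim=1024, embed_dim=512, hidden_dim=384, n_classes=2):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU())
        self.attn_v = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, patches):                                # patches: (N, in_dim)
        h = self.proj(patches)                                 # (N, embed_dim)
        a = self.attn_w(self.attn_v(h) * self.attn_u(h))       # gated attention scores (N, 1)
        a = torch.softmax(a, dim=0)
        slide_emb = (a * h).sum(dim=0)                         # attention-pooled slide embedding
        return self.head(slide_emb), slide_emb, a              # logits, embedding, attention
```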
We trained each model for 100 epochs on the training set, using an AdamW optimizer and a cosine learning rate scheduler with an initial learning rate of 0.001. We used a weighted data sampler that balances each outcome label’s sampling probability during each epoch. We set up early stopping criteria if the evaluation loss does not decrease for 10 consecutive epochs. We employed dropout at 0.25 after intermediate layers in the network for regularization.
Pan-cancer Prognosis Prediction
We evaluated the MUSK model for predicting survival outcomes across various cancer types by combining histopathology images and clinical reports. For this purpose, we used a total of 7,927 whole-slide diagnostic H&E images from 6,602 patients across 16 cancers (BLCA, BRCA, CESC, COADREAD, ESCA, GBM, HNSC, LGG, LIHC, LUAD, LUSC, PAAD, RCC, SKCM, STAD, UCEC) in TCGA72. These datasets were selected based on the size of the cohort and the ratio of uncensored to censored patients. We excluded cancer types with fewer than 100 patients or with fewer than 5% of patients having an outcome event. Each diagnostic H&E slide is associated with a detailed pathology report73. The clinical outcome endpoint is disease-specific survival. We trained a prognostic model separately for each cancer type and evaluated the performance using five-fold cross-validation with stratified sampling based on survival status. We compared the multimodal MUSK model with the unimodal approach, which utilizes only histopathology images or clinical reports as input.
To obtain a slide-level prediction, we used AbMIL to aggregate image features. Preprocessing is necessary for clinical reports since their text length exceeds the 100-token limit for MUSK-L. Here, we leveraged a large language model (GPT-474) to generate structured reports with more succinct and informative summaries based on the full clinical reports. This is achieved by applying expert-designed prompts, as shown in Supplementary Table 8. This procedure allows us to capture the most relevant information pertinent to survival outcomes, such as patient characteristics, tumor size, differentiation, invasion, and lymph node metastases. Training details are kept the same as the melanoma relapse prediction.
Immunotherapy Response Prediction
We evaluated the MUSK model for predicting immunotherapy response and outcomes in two real-world non-small cell lung cancer (NSCLC) and gastro-esophageal cancer cohorts. Both cohorts were obtained from Stanford University Medical Center with approval from the institutional review board. Informed consent was waived for this retrospective analysis. The inclusion criteria were advanced (metastatic or recurrent) NSCLC or gastroesophageal cancer (originating from the esophagus, gastroesophageal junction, or stomach) treated with anti-PD1 or anti-PD-L1 immune checkpoint blockade between 2018 and 2023, with an available H&E-stained tissue section from a pre-treatment core needle or surgical tumor biopsy. Patients were identified using the STAnford Research Repository (STARR)75. In both cohorts, patients were treated in various lines of immunotherapy with or without concurrent chemotherapy. The best overall response was evaluated through manual curation of radiology reports, and patients were divided into two groups: responders (complete or partial response) and non-responders (stable or progressive disease). Progression-free survival (PFS) was determined from the date of treatment initiation until the date of progression or death; patients who did not progress were censored at the date of the last follow-up.
Our study included 118 NSCLC patients and 101 gastro-esophageal cancer patients treated with immunotherapy (see Supplementary Tables 11 and 12). We trained multimodal MUSK models using the diagnostic H&E WSIs and the associated pathology reports to predict objective response and PFS. We used the AbMIL method to aggregate histopathology image features and obtain a slide-level prediction. We used GPT-4 to standardize the clinical reports by applying expert-designed prompts, as shown in Supplementary Table 9 for lung cancer and Supplementary Table 10 for gastro-esophageal cancer. We then concatenated the image and report embeddings to form a multimodal embedding for predicting immunotherapy response and outcomes. Models were evaluated through five-fold cross-validation with stratified sampling of the relevant outcomes for each cohort. Training details are the same as for melanoma relapse prediction.
Model Visualization
To enhance model interpretability, we generated heatmaps76 to display the regions related to model predictions on the WSIs. We cropped WSIs into tiles with 85% overlap and calculated the attention scores for each tile to provide more detailed heatmaps. These attention scores were normalized to a 0–1 scale, with warmer colors on the heatmap indicating higher attention scores, thus highlighting areas more relevant to the model’s predictions. The heatmaps are superimposed onto the original WSIs with a semitransparent overlay.
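A minimal sketch of converting tile-level attention scores into a normalized, semitransparent heatmap overlay on a slide thumbnail; the tile coordinates, scores, and colormap choice are illustrative placeholders.

```python
# Minimal sketch of an attention heatmap overlay: normalize tile scores to 0-1,
# paint them onto the thumbnail grid, and blend with a semitransparent colormap.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

def overlay_heatmap(thumbnail, tile_coords, scores, tile_size, alpha=0.4):
    """thumbnail: (H, W, 3) RGB array; tile_coords: (N, 2) top-left (x, y) at thumbnail scale."""
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)  # normalize to 0-1
    heat = np.zeros(thumbnail.shape[:2])
    for (x, y), s in zip(tile_coords, scores):
        region = heat[y:y + tile_size, x:x + tile_size]
        heat[y:y + tile_size, x:x + tile_size] = np.maximum(region, s)       # handle overlapping tiles
    plt.imshow(thumbnail)
    plt.imshow(cm.jet(heat)[..., :3], alpha=alpha)       # warmer colors = higher attention
    plt.axis("off")
    plt.show()
```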
Statistical Analysis
For zero-shot tasks or finetuned tasks with independent test sets, we evaluated performance variations using a non-parametric bootstrapping method, deriving 95% CIs from 1,000 bootstrap samples. For 5-fold cross-validation tasks, 95% CIs were estimated based on the results from the five folds. The two-sided Mann-Whitney U test or two-sided Wilcoxon signed-rank test (as indicated in the figure captions) was used to assess the statistical significance. We used the area under the Receiver Operating Characteristic curve (AUC) to evaluate the performance of melanoma relapse prediction and immunotherapy response prediction. We used the concordance index (c-index) to evaluate the performance of prognostic models for survival endpoints. Kaplan-Meier curves were generated to assess patient stratification, using median predicted risk score as the cutoff value. The statistical significance of low-risk and high-risk patient groups was assessed using the log-rank test.
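A minimal sketch of the non-parametric bootstrap confidence interval for AUC described above (n=1,000 resamples); it is a generic illustration rather than the authors' analysis code.

```python
# Minimal sketch of a bootstrap 95% confidence interval for AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, seed=0):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs, n = [], len(y_true)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)                      # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # need both classes in the resample
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return float(np.mean(aucs)), (float(lo), float(hi))  # point estimate and 95% CI
```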
Extended Data
Extended Data Fig. 1. MUSK for image-to-image retrieval and image classification.

a. We perform zero-shot image retrieval on the UniToPatho dataset, and MUSK outperforms other vision-language foundation models. Data are represented as means with 95% confidence intervals; error bars represent the 95% confidence intervals, estimated using the bootstrap method with 1,000 replicates. The two-sided Wilcoxon signed-rank test is used to calculate the statistical differences between MUSK and the compared methods (p<0.0001 for Recall@1, Recall@3, Recall@5, and mMV@5). b. Zero-shot image retrieval on the BRACS dataset. MUSK significantly outperforms other foundation models across various recall levels with p-values of 0.02, 0.07, 0.04, and 0.03 for the Recall@1, Recall@3, Recall@5, and mMV@5 metrics, respectively. Two examples of image retrieval results with the top 3 candidates are shown. DCIS: ductal carcinoma in situ; IBC: invasive breast carcinoma. c. We evaluate the labeling efficiency of various models under a few-shot learning scenario by varying the number of training labels per class. We present results for [1, 2, 4, 8, 10]-shot classification across multiple datasets: LC2500031, UniToPatho28, NCT-CRC33, and BRACS (6 cls)29. The average accuracy shows that MUSK consistently outperforms existing models across these benchmarks. In these box plots, the central lines indicate the median, box bounds are the 25th and 75th percentiles, and whiskers extend to 1.5 times the interquartile range.
Extended Data Fig. 2. MUSK for supervised image classification.

a. 10-shot classification performance across 12 benchmarks compared with seven alternative vision-language models in terms of balanced classification accuracy. The two-sided Wilcoxon signed-rank test is used to assess the statistical differences between MUSK and the compared methods in the 12 benchmark datasets: BRACS (3-cls) (p=0.43), UniToPatho (p=0.002), BRACS (6-cls) (p=0.002), SICAPv2 (p=0.01), PatchCamelyon (p=0.006), LC25000 (p=0.002), PanNuke (p=0.23), RenalCell (p=0.002), SkinCancer (p=0.01), NCT-CRC-HE-100K (p=0.55), Osteo (p=0.04), and WSSS4LUAD (p=0.006). b. Linear probe classification results on 12 benchmark datasets compared with seven alternative models. The two-sided Wilcoxon signed-rank test is used to calculate the statistical differences between MUSK and the compared methods in the 12 benchmark datasets. P-values are as follows: BRACS (3-cls) (p=0.002), UniToPatho (p=0.002), BRACS (6-cls) (p=0.01), SICAPv2 (p=0.13), PatchCamelyon (p=0.002), LC25000 (p=0.002), PanNuke (p=0.002), RenalCell (p=0.002), SkinCancer (p=0.002), NCT-CRC-HE-100K (p=0.55), Osteo (p=0.002), and WSSS4LUAD (p=0.002). In a and b, error bars represent the 95% confidence intervals, which are computed from 10 experiments using different seeds.
Extended Data Fig. 3. Comparison of MUSK with state-of-the-art pathology foundation models on slide-level benchmark tasks.

The comparison methods include both unimodal pathology foundation models (UNI, GigaPath, and Virchow) and multimodal pathology foundation models (PLIP and CONCH). a. Biomarker prediction. AUC results for predicting ER, PR, and HER2 status in the BCNB test set, as well as IDH status in the MUV-IDH dataset. b. Immunotherapy response prediction. Performance in terms of AUC and c-index for lung and gastro-esophageal cancers, respectively. c. Prognosis prediction. C-index results for prognosis predictions across 16 TCGA cohorts. MUSK significantly outperforms the compared methods as shown in the overall bars (p-value < 0.0001), representing the average performance across 16 projects. In a-c, data are represented as mean with standard deviations, based on 5-fold cross-validation experiments. The two-sided Mann-Whitney U test is used to assess the statistical significance between MUSK and the comparison methods. ****p < 0.0001.
Extended Data Fig. 4. Melanoma relapse prediction.

a-b. MUSK achieves superior performance for predicting the 5-year risk of relapse in a cohort of 1,342 melanoma patients compared with existing multimodal pathology foundation models. c. At 90% sensitivity for relapse prediction, MUSK improves specificity by approximately 15% over other foundation models. d. The multimodal MUSK model significantly improves relapse prediction over models based on clinical reports or whole-slide images alone. e. Two examples of melanoma cases with and without relapse. In each panel, the left image shows the original WSI, the middle image displays the corresponding heatmap highlighting the regions the model focused on within the WSI, and the right image provides a zoomed-in view of the regions receiving the most model attention. The case with relapse shows skin ulceration with abundant intratumoral macrophages accompanied by fibrosis, fewer intratumoral and peritumoral lymphocytes, and brisk mitotic activity. In contrast, the case without relapse shows an intact epidermis without ulceration, abundant intratumoral and peritumoral lymphocytes, and inconspicuous mitotic activity. In a, c, and d, error bars represent the standard deviations computed from 5-fold cross-validation experiments, and the two-sided Mann-Whitney U test is used to assess the statistical significance between MUSK and the compared methods.
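The operating-point analysis in panel c can be reproduced in spirit as follows; this is a hedged sketch using scikit-learn's ROC utilities, with toy relapse labels and scores standing in for the actual cohort data.

```python
# Assumed illustration (not the authors' code): specificity at a fixed 90% sensitivity
# operating point, read off the ROC curve for relapse prediction.
import numpy as np
from sklearn.metrics import roc_curve

def specificity_at_sensitivity(y_true, y_score, target_sensitivity=0.90):
    """Highest specificity achievable while keeping sensitivity >= the target."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    eligible = tpr >= target_sensitivity          # operating points meeting the sensitivity floor
    return float((1.0 - fpr[eligible]).max())     # specificity = 1 - false positive rate

# Toy example: relapse labels (1 = relapse) and predicted relapse risk.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9])
print(specificity_at_sensitivity(y_true, y_score))
```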
Extended Data Fig. 5. Gastro-esophageal cancer immunotherapy response prediction.

a. MUSK outperforms other foundation models in predicting objective response and progression-free survival in gastro-esophageal cancer patients treated with immunotherapy. b. The multimodal MUSK model improves upon models based on clinical reports or whole-slide images alone. c. Kaplan-Meier analysis demonstrates that MUSK significantly stratifies patients into high-risk and low-risk groups for progression-free survival, both in the entire cohort and in clinically relevant subgroups. The two-sided log-rank test is used to compare the survival differences between the high-risk and low-risk groups. HR: hazard ratio. d. Multivariate Cox regression analysis shows that MUSK is the only significant predictor of progression-free survival besides MSI status. P-values are computed using the two-sided Wald test, and hazard ratios (HR) are presented with 95% confidence intervals. e. Two examples of gastro-esophageal cancer cases with and without objective response to immunotherapy. In each panel, the left image shows the original WSI, the middle image displays the corresponding heatmap highlighting the regions the model focused on within the WSI, and the right image provides a zoomed-in view of the regions receiving the most model attention. The case with response shows abundant infiltration of lymphocytes within and around the tumor; the stroma is less fibrotic and displays more edema. In contrast, the case without response shows minimal lymphocyte infiltration and increased intratumoral and peritumoral fibrotic stroma. CPS: combined positive score; MSI/MSS: microsatellite instable/stable; ADC: adenocarcinoma; SCC: squamous cell carcinoma. In a and b, data are presented as mean with standard deviation computed from 5-fold cross-validation experiments, and the two-sided Mann-Whitney U test is used to assess the statistical significance between MUSK and the compared methods.
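A possible implementation of the survival analyses named here (median-split Kaplan-Meier curves with a two-sided log-rank test, and a multivariate Cox model reporting hazard ratios with 95% confidence intervals) is sketched below with the `lifelines` package; the column names (`pfs_months`, `progressed`, `musk_risk`, `msi`, `cps`) are hypothetical placeholders, not the study's variables.

```python
# Hedged sketch of the Kaplan-Meier, log-rank, and Cox analyses; column names are assumed.
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

def km_and_logrank(df: pd.DataFrame) -> float:
    """Split patients at the median MUSK risk score and compare PFS curves."""
    high = df["musk_risk"] >= df["musk_risk"].median()
    kmf_high, kmf_low = KaplanMeierFitter(), KaplanMeierFitter()
    kmf_high.fit(df.loc[high, "pfs_months"], df.loc[high, "progressed"], label="high risk")
    kmf_low.fit(df.loc[~high, "pfs_months"], df.loc[~high, "progressed"], label="low risk")
    result = logrank_test(df.loc[high, "pfs_months"], df.loc[~high, "pfs_months"],
                          df.loc[high, "progressed"], df.loc[~high, "progressed"])
    return result.p_value                      # two-sided log-rank p-value

def multivariate_cox(df: pd.DataFrame) -> pd.DataFrame:
    """Cox regression of PFS on the MUSK score and clinical covariates."""
    cph = CoxPHFitter()
    cph.fit(df[["pfs_months", "progressed", "musk_risk", "msi", "cps"]],
            duration_col="pfs_months", event_col="progressed")
    # Hazard ratios with 95% confidence intervals and Wald-test p-values.
    return cph.summary[["exp(coef)", "exp(coef) lower 95%", "exp(coef) upper 95%", "p"]]
```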
Extended Data Fig. 6. Kaplan-Meier analysis of the MUSK model for stratifying patients under different treatment regimens.

The results demonstrate that MUSK significantly stratifies patients into low- and high-risk groups for progression-free survival in (a) lung cancer and (b) gastro-esophageal cancer treated with immunotherapy, with or without concurrent chemotherapy.
Extended Data Fig. 7. MUSK model configuration.

a. MUSK integrates two independent Transformers for the image and text modalities. This architecture processes sequences within each modality independently and fuses them in the attention modules, allowing cross-modal interactions while ensuring robustness across data types. b. During the second phase of pretraining (Figure 1), MUSK performs modality alignment using a contrastive loss as the training objective, augmented with an auxiliary masked language modeling (MLM) loss. The MLM component uses a lightweight cross-attention decoder in which text embeddings serve as queries that attend to image embeddings, capturing fine-grained cross-modal interactions.
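To make the auxiliary MLM decoder concrete, the PyTorch sketch below reflects our reading of the description above rather than the released implementation: masked text token embeddings act as queries that cross-attend to image patch embeddings before the masked tokens are predicted. Dimensions, layer count, and vocabulary size are illustrative assumptions.

```python
# Simplified sketch of a cross-attention MLM decoder (our interpretation, not MUSK's code).
import torch
import torch.nn as nn

class CrossAttentionMLMDecoder(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, vocab_size: int = 30522):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlm_head = nn.Linear(dim, vocab_size)   # predicts the identity of masked tokens

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:  (batch, text_len, dim)     - text token embeddings, some positions masked
        # image_emb: (batch, num_patches, dim)  - patch embeddings from the vision branch
        attended, _ = self.cross_attn(query=text_emb, key=image_emb, value=image_emb)
        fused = self.norm(text_emb + attended)        # residual connection over the text stream
        return self.mlm_head(fused)                   # (batch, text_len, vocab_size) logits

# The MLM loss would then be a cross-entropy over masked positions only, added to the
# image-text contrastive loss during the second pretraining phase.
```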
Extended Data Fig. 8. Ablation study on training configuration.

We conduct ablation studies to evaluate the impact of various training configurations (see the Supplementary Materials for detailed descriptions): a. The effect of masked pretraining. b. The effect of data distribution, comparing natural images/text with pathology images/text. c. The effect of data scale for masked pretraining, evaluated using Quilt1M, 15M images with 500M text tokens, and 50M images with 1B text tokens. d. The effect of model capacity. In a-d, the error bars represent the average of the standard deviations within the datasets. Error bars are not provided for the I2T retrieval, T2I retrieval, and VQA tasks because they are evaluated on a single dataset. The evaluation metrics are balanced accuracy for the linear probe, 10-shot, and zero-shot classification tasks; accuracy for the VQA task; Recall@50 for the I2T and T2I retrieval tasks; and mMV@5 for the I2I retrieval task. T2I: text-to-image; I2T: image-to-text; I2I: image-to-image; VQA: visual question answering; cls: classification.
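For clarity on the image-to-image retrieval metric, the sketch below encodes one plausible reading of mMV@5 (majority vote among the five nearest gallery images, averaged per class); the exact definition should be taken from the Supplementary Materials, and the function and variable names here are our own.

```python
# Assumed interpretation of mMV@5 for image-to-image retrieval; illustrative only.
import numpy as np

def mmv_at_k(query_emb, query_y, gallery_emb, gallery_y, k=5):
    """Per-class average of majority-vote correctness among the k nearest gallery images."""
    sims = query_emb @ gallery_emb.T                    # cosine similarity (unit-norm embeddings assumed)
    topk = np.argsort(-sims, axis=1)[:, :k]             # indices of the k nearest gallery images
    correct = []
    for q, neighbors in enumerate(topk):
        labels, counts = np.unique(gallery_y[neighbors], return_counts=True)
        correct.append(labels[counts.argmax()] == query_y[q])   # majority-vote label vs. query label
    correct = np.asarray(correct)
    # Average per-class accuracy, analogous to balanced accuracy for classification.
    return np.mean([correct[query_y == c].mean() for c in np.unique(query_y)])
```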
Supplementary Material
Supplementary information is available for this paper.
Acknowledgements
This work was supported in part by the National Institutes of Health under research grants R01CA222512, R01CA233578, R01CA269559, R01CA285456, and R01CA290715 from the National Cancer Institute, grant R01DE030894 from the National Institute of Dental and Craniofacial Research, and a research grant from the Stanford Institute for Human-Centered Artificial Intelligence. K.-H.Y. is supported in part by the National Institute of General Medical Sciences grant R35GM142879, the National Heart, Lung, and Blood Institute grant R01HL174679, the Department of Defense Peer Reviewed Cancer Research Program Career Development Award HT9425-231-0523, and the Research Scholar Grant RSG-24-1253761-01-ESED.
Footnotes
Competing interests
The authors declare no competing interests.
Code Availability
Our code is publicly available at https://github.com/lilab-stanford/MUSK, including installation and pretraining instructions, model weights, and example data.
Inclusion & Ethics Statement
The study was approved by the Institutional Review Board of Stanford University (protocol #67432). Informed consent was waived for this retrospective analysis.
Data Availability
The histopathology and clinical data for TCGA used in this study are publicly available through the following resources: National Cancer Institute Genomic Data Commons Portal (https://portal.gdc.cancer.gov/), cBioPortal (https://www.cbioportal.org/), and TCGA Pathology Reports (https://github.com/tatonetti-lab/tcga-path-reports). Additional datasets used include Quilt1M (https://github.com/wisdomikezogwo/quilt1m), PathAsst (https://huggingface.co/datasets/jamessyx/PathCap), PathVQA (https://huggingface.co/datasets/flaviagiammarino/path-vqa), BookSet and PubmedSet (https://warwick.ac.uk/fac/cross_fac/tia/data/arch), PatchCamelyon (https://patchcamelyon.grand-challenge.org/), NCT-CRC-HE-100K (https://zenodo.org/record/1214456), SICAPv2 (https://data.mendeley.com/datasets/9xxm58dvs3/1), Osteo (https://www.cancerimagingarchive.net/collection/osteosarcoma-tumor-assessment/), RenalCell (https://zenodo.org/records/6528599), SkinCancer (https://www.isic-archive.com/), LC25000 (https://github.com/tampapath/lung_colon_image_set), PanNuke (https://warwick.ac.uk/fac/cross_fac/tia/data/pannuke), UniToPatho (https://ieee-dataport.org/open-access/unitopatho), WSSS4LUAD (https://wsss4luad.grand-challenge.org/WSSS4LUAD/), BRACS (https://www.bracs.icar.cnr.it/), Visiomel (https://www.drivendata.org/competitions/148/visiomel-melanoma/), PathMMU (https://huggingface.co/datasets/jamessyx/PathMMU), BCNB (https://bcnb.grand-challenge.org/), and MUV-IDH (https://doi.org/10.25493/WQ48-ZGX). The data for patients in the immunotherapy cohorts are subject to controlled access since they contain sensitive information about patient privacy. Requests for access may be submitted to the corresponding author (rli2@stanford.edu) with a brief research plan and require consent to a data use agreement. We aim to respond to all data access requests within 4 weeks. Data usage is restricted to non-commercial academic research purposes.
Main References
- 1.Sammut S-J et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 601, 623–629 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Vanguri RS et al. Multimodal integration of radiology, pathology and genomics for prediction of response to pd-(l) 1 blockade in patients with non-small cell lung cancer. Nature cancer 3, 1151–1164 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Acosta JN, Falcone GJ, Rajpurkar P & Topol EJ Multimodal biomedical AI. Nature Medicine 28, 1773–1784 (2022). [DOI] [PubMed] [Google Scholar]
- 4.Boehm KM, Khosravi P, Vanguri R, Gao J & Shah SP Harnessing multimodal data integration to advance precision oncology. Nature Reviews Cancer 22, 114–126 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Lipkova J et al. Artificial intelligence for multimodal data integration in oncology. Cancer cell 40, 1095–1110 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Moor M et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023). [DOI] [PubMed] [Google Scholar]
- 7.Kim C et al. Transparent medical image AI via an image–text foundation model grounded in medical literature. Nature Medicine 30, 1154–1165 (2024). [DOI] [PubMed] [Google Scholar]
- 8.Singhal K et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Zhou Y et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Xu H et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Chen RJ et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine 30, 850–862 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Vorontsov E et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine 30, 2924–2935 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Wang X et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature 634, 970–978 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Christensen M, Vukadinovic M, Yuan N & Ouyang D Vision–language foundation model for echocardiogram interpretation. Nature Medicine 30, 1481–1488 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Huang Z, Bianchi F, Yuksekgonul M, Montine TJ & Zou J A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine 29, 2307–2316 (2023). [DOI] [PubMed] [Google Scholar]
- 16.Lu MY et al. A visual-language foundation model for computational pathology. Nature Medicine 30, 863–874 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Lu MY et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Radford A et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763 (PMLR, 2021). [Google Scholar]
- 19.Schuhmann C et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022). [Google Scholar]
- 20.Bhinder B, Gilvary C, Madhukar NS & Elemento O Artificial intelligence in cancer research and precision medicine. Cancer discovery 11, 900–915 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Wang W et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19175–19186 (2023). [Google Scholar]
- 22.Gamper J & Rajpoot N Multiple instance captioning: Learning representations from histopathology textbooks and articles. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16549–16559 (2021). [Google Scholar]
- 23.Sun Y et al. PathMMU: A massive multimodal expert-level benchmark for understanding and reasoning in pathology. arXiv preprint arXiv:2401.16355; (2024). [Google Scholar]
- 24.Kim J-H, Jun J & Zhang B-T Bilinear attention networks. Advances in neural information processing systems 31 (2018). [Google Scholar]
- 25.Nguyen BD et al. Overcoming data limitation in medical visual question answering. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part IV 22, 522–530 (Springer, 2019). [Google Scholar]
- 26.Li LH, Yatskar M, Yin D, Hsieh C-J & Chang K-W VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557; (2019). [Google Scholar]
- 27.Naseem U, Khushi M, Dunn AG & Kim J K-PathVQA: Knowledge-aware multimodal representation for pathology visual question answering. IEEE Journal of Biomedical and Health Informatics 28, 1886–1895 (2024). [DOI] [PubMed] [Google Scholar]
- 28.He X, Zhang Y, Mou L, Xing E & Xie P PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286; (2020). [Google Scholar]
- 29.Veeling BS, Linmans J, Winkens J, Cohen T & Welling M Rotation equivariant CNNs for digital pathology. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16–20, 2018, Proceedings, Part II 11, 210–218 (Springer, 2018). [Google Scholar]
- 30.Kriegsmann K et al. Deep learning for the detection of anatomical tissue structures and neoplasms of the skin on scanned histopathological tissue sections. Frontiers in Oncology 12, 1022967 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Kumar N et al. A multi-organ nucleus segmentation challenge. IEEE transactions on medical imaging 39, 1380–1391 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Barbano CA et al. Unitopatho, a labeled histopathological dataset for colorectal polyps classification and adenoma dysplasia grading. In 2021 IEEE International Conference on Image Processing (ICIP), 76–80 (IEEE, 2021). [Google Scholar]
- 33.Brancati N et al. BRACS: A dataset for breast carcinoma subtyping in H&E histology images. Database 2022, baac093 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Silva-Rodríguez J, Colomer A, Sales MA, Molina R & Naranjo V Going deeper through the Gleason scoring scale: An automatic end-to-end system for histology prostate grading and cribriform pattern detection. Computer methods and programs in biomedicine 195, 105637 (2020). [DOI] [PubMed] [Google Scholar]
- 35.Borkowski AA et al. Lung and colon cancer histopathological image dataset (lc25000). arXiv preprint arXiv:1912.12142; (2019). [Google Scholar]
- 36.Brummer O, Pölönen P, Mustjoki S & Brück O Integrative analysis of histological textures and lymphocyte infiltration in renal cell carcinoma using deep learning. bioRxiv 2022–08 (2022). [Google Scholar]
- 37.Kather JN et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine 16, e1002730 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Arunachalam HB et al. Viable and necrotic tumor assessment from whole slide images of osteosarcoma using machine-learning and deep-learning models. PloS one 14, e0210706 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Han C et al. Multi-layer pseudo-supervision for histopathology tissue semantic segmentation using patch-level classification labels. Medical Image Analysis 80, 102487 (2022). [DOI] [PubMed] [Google Scholar]
- 40.Kather JN et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Nature cancer 1, 789–799 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Xu F et al. Predicting axillary lymph node metastasis in early breast cancer using deep learning on primary tumor biopsy slides. Frontiers in oncology 11, 759007 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Roetzer-Pejrimovsky T et al. The digital brain tumour atlas, an open histopathology resource. Scientific Data 9, 55 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Atkins MB et al. The state of melanoma: emergent challenges and opportunities. Clinical Cancer Research 27, 2678–2697 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Thompson AK, Kelley BF, Prokop LJ, Murad MH & Baum CL Risk factors for cutaneous squamous cell carcinoma recurrence, metastasis, and disease-specific death: a systematic review and metaanalysis. JAMA dermatology 152, 419–428 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.VisioMel. Visiomel challenge: Predicting melanoma relapse (2023). URL https://www.drivendata.org/competitions/148/visiomel-melanoma/page/674/. Accessed: 2024-02-01.
- 46.Ikezogwo W et al. Quilt-1m: One million image-text pairs for histopathology. Advances in Neural Information Processing Systems 36 (2024). [PMC free article] [PubMed] [Google Scholar]
- 47.Zhang S et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915; (2023). [Google Scholar]
- 48.Hellmann MD et al. Nivolumab plus ipilimumab in advanced non–small-cell lung cancer. New England Journal of Medicine 381, 2020–2031 (2019). [DOI] [PubMed] [Google Scholar]
- 49.Gandhi L et al. Pembrolizumab plus chemotherapy in metastatic non–small-cell lung cancer. New England journal of medicine 378, 2078–2092 (2018). [DOI] [PubMed] [Google Scholar]
- 50.Samstein RM et al. Tumor mutational load predicts survival after immunotherapy across multiple cancer types. Nature genetics 51, 202–206 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Cristescu R et al. Pan-tumor genomic biomarkers for PD-1 checkpoint blockade–based immunotherapy. Science 362, eaar3593 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Bagaev A et al. Conserved pan-cancer microenvironment subtypes predict response to immunotherapy. Cancer cell 39, 845–865 (2021). [DOI] [PubMed] [Google Scholar]
- 53.Cui H et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods 21, 1470–1480 (2024). [DOI] [PubMed] [Google Scholar]
- 54.Mok TS et al. Pembrolizumab versus chemotherapy for previously untreated, PD-L1-expressing, locally advanced or metastatic non-small-cell lung cancer (KEYNOTE-042): a randomised, open-label, controlled, phase 3 trial. The Lancet 393, 1819–1830 (2019). [DOI] [PubMed] [Google Scholar]
- 55.Johnson DB, Nebhan CA, Moslehi JJ & Balko JM Immune-checkpoint inhibitors: long-term implications of toxicity. Nature Reviews Clinical Oncology 19, 254–267 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Bray F et al. Global cancer statistics 2022: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians 74, 229–263 (2024). [DOI] [PubMed] [Google Scholar]
- 57.Bruni D, Angell HK & Galon J The immune contexture and immunoscore in cancer prognosis and therapeutic efficacy. Nature Reviews Cancer 20, 662–680 (2020). [DOI] [PubMed] [Google Scholar]
- 58.Herbst RS et al. Atezolizumab for first-line treatment of PD-L1–selected patients with NSCLC. New England Journal of Medicine 383, 1328–1339 (2020). [DOI] [PubMed] [Google Scholar]
Method References
- 59.Shazeer N et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538; (2017). [Google Scholar]
- 60.Bao H et al. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35, 32897–32912 (2022). [Google Scholar]
- 61.Esser P et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning (2024). [Google Scholar]
- 62.Sun Y et al. Pathasst: A generative foundation AI assistant towards artificial general intelligence of pathology. In AAAI Conference on Artificial Intelligence (2023). [Google Scholar]
- 63.Li J, Li D, Xiong C & Hoi SCH BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (2022). [Google Scholar]
- 64.Devlin J, Chang M-W, Lee K & Toutanova K BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics; (2019). [Google Scholar]
- 65.Ramesh A et al. Zero-shot text-to-image generation. In International conference on machine learning, 8821–8831 (PMLR, 2021). [Google Scholar]
- 66.Peng Z, Dong L, Bao H, Ye Q & Wei F BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366; (2022). [Google Scholar]
- 67.Wang X et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Medical image analysis 81, 102559 (2022). [DOI] [PubMed] [Google Scholar]
- 68.Shen Y, Luo Y, Shen D & Ke J RandStainNA: Learning stain-agnostic features from histology slides by bridging stain augmentation and normalization. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 212–221 (Springer, 2022). [Google Scholar]
- 69.Kang M, Song H, Park S, Yoo D & Pereira S Benchmarking self-supervised learning on diverse pathology datasets. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3344–3354 (2023). [Google Scholar]
- 70.Loshchilov I & Hutter F Decoupled weight decay regularization. In International Conference on Learning Representations, 1–18 (2019). [Google Scholar]
- 71.Ilse M, Tomczak J & Welling M Attention-based deep multiple instance learning. In International conference on machine learning, 2127–2136 (PMLR, 2018). [Google Scholar]
- 72.Weinstein JN et al. The cancer genome atlas pan-cancer analysis project. Nature genetics 45, 1113–1120 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Kefeli J & Tatonetti N TCGA-reports: A machine-readable pathology report resource for benchmarking text-based ai models. Patterns 5, 100933 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Achiam J et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774; (2023). [Google Scholar]
- 75.Callahan A et al. The stanford medicine data science ecosystem for clinical and translational research. JAMIA open 6, ooad054 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Lu MY et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical engineering 5, 555–570 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]