Abstract
Advances in artificial intelligence have paved the way for leveraging hematoxylin and eosin-stained tumor slides for precision oncology. We present ENLIGHT–DeepPT, an indirect two-step approach consisting of (1) DeepPT, a deep-learning framework that predicts genome-wide tumor mRNA expression from slides, and (2) ENLIGHT, which predicts response to targeted and immune therapies from the inferred expression values. We show that DeepPT successfully predicts transcriptomics in all 16 The Cancer Genome Atlas cohorts tested and generalizes well to two independent datasets. ENLIGHT–DeepPT successfully predicts true responders in five independent patient cohorts involving four different treatments spanning six cancer types, with an overall odds ratio of 2.28 and a 39.5% increased response rate among predicted responders versus the baseline rate. Notably, its prediction accuracy, obtained without any training on the treatment data, is comparable to that achieved by directly predicting the response from the images, which requires specific training on the treatment evaluation cohorts.
Histopathology has long been considered the gold standard of clinical diagnosis and prognosis in cancer. In recent years, the use of tumor molecular profiling within the clinic has allowed for more accurate cancer diagnostics, as well as the delivery of precision oncology1–3. Rapid advances in digital histopathology have also allowed the extraction of clinically relevant information embedded in tumor slides by applying machine learning and artificial intelligence methods, capitalizing on recent advancements in image analysis through deep learning4. Key advances are already underway, as whole-slide images (WSIs) of tissue stained with hematoxylin and eosin (H&E) have been used to diagnose tumors computationally5–8, classify cancer types7,9–13, distinguish tumors with low or high mutation burden14, identify genetic mutations6,15–23, predict patient survival24–30, detect DNA methylation patterns31 and mitosis32, and quantify tumor immune infiltration33, tumor-infiltrating lymphocytes (TILs)34 and spatial immune cell infiltration35.
Previous work has already impressively unraveled the potential of harnessing next-generation digital pathology to predict response to therapies directly from images36–40. In these direct supervised learning approaches, predicting response to therapy directly from the WSI requires large datasets consisting of matched imaging and response data. Therefore, they require a specific cohort for each drug or treatment that is to be predicted. However, the availability of such data on a large scale is still fairly limited, restricting the applicability of this approach and raising concerns about the generalizability of supervised predictors to other cohorts.
To overcome this challenge, we turned to developing and studying a generic method for generating WSI-based predictors of patient response for a broad range of cancer types and therapies, which does not require matched WSI and response datasets for training. To accomplish this, we have taken an indirect two-step approach. First, we developed DeepPT (Deep Pathology for Transcriptomics), a deep-learning framework for imputing (predicting) gene expression from H&E slides, which extends upon previous valuable work on this topic41–46. The DeepPT models are specific to cancer types and are built by training on matched WSI and expression data from The Cancer Genome Atlas (TCGA). Second, given the gene expression values predicted by these models for a new patient, we apply our previously published approach, ENLIGHT47, originally developed to predict patient response from measured tumor transcriptomics, to predict response from the DeepPT-imputed transcriptomics.
We proceed to provide an overview of DeepPT’s architecture and a brief recap of ENLIGHT’s workings, the study design and the cohorts analyzed. We then describe the results obtained in each of the two steps of ENLIGHT–DeepPT. First, we study the ability to predict tumor expression, showing the performance of the trained DeepPT models in predicting the bulk gene expression in 16 TCGA cohorts and 2 independent, unseen cohorts. Second, we analyze five independent clinical trial datasets of patients with different cancer types treated with various targeted and immune therapies. We show that ENLIGHT, adhering to the parameters used in its original publication47 without any adaptation, can successfully predict the true responders based on the expression values imputed by DeepPT.
Results
Computational pipeline and study design
Our computational framework consists of DeepPT (Fig. 1a) and ENLIGHT (Fig. 1b). Specifically, (1) to construct an RNA gene expression predictor from H&E slides, we trained a deep-learning model called DeepPT using formalin-fixed, paraffin-embedded (FFPE) WSIs and their corresponding bulk gene expression profiles for 16 cancer types constituting ten broad classes from TCGA. DeepPT performance was first evaluated through fivefold cross-validation within TCGA and then on two external datasets: the TransNEO breast cancer cohort (TransNEO-breast) consisting of 160 fresh-frozen (FF) slides48 and an unpublished brain cancer cohort (National Cancer Institute (NCI)-brain) consisting of 226 FFPE slides (Methods, Supplementary Table 1 and Extended Data Fig. 1a–d). (2) ENLIGHT is an unsupervised approach that predicts individual responses to targeted and immune therapies based on gene expression measured from the tumor tissue47 (Fig. 1b and Methods). Our final goal is to use the inferred gene expression from DeepPT as input to ENLIGHT to predict patient response (Fig. 1c). To this end, we further applied the DeepPT models that were trained on TCGA data to predict gene expression in five clinical trial datasets (see Methods and Supplementary Table 2 for full details).
Fig. 1 |. Study overview.
a, Three main components of the DeepPT architecture: the pretrained ResNet50 convolutional neural network (CNN) unit (left) extracts histopathology features from tile images; the autoencoder (middle) compresses the 2,048 features to a lower dimension of 512 features; and the multilayer perceptron (MLP; right) integrates these histopathology features to predict the sample’s gene expression. b, Overview of the ENLIGHT pipeline (illustration taken from ref. 47): ENLIGHT starts by inferring the genetic interaction (GI) partners of a given drug from various cancer in vitro and clinical data sources. Given the synthetic lethality (SL) and synthetic rescue (SR) partners and the transcriptomics for a given patient sample, ENLIGHT computes a drug-matching score that is used to predict the patient response. Here, ENLIGHT uses the DeepPT-predicted expression to produce drug-matching scores for each patient studied. c, Overview of the analysis using DeepPT and ENLIGHT. Top row, DeepPT was trained with FFPE slide images and matched transcriptomics for an array of different cancer types from TCGA. Middle row, after the training phase, the models were applied to predict gene expression on the internal (held-out) TCGA datasets and on two external datasets on which they were not trained. Bottom row, the predicted tumor transcriptomics in each of the five independent test clinical datasets serves as input to ENLIGHT for predicting the patients’ response to treatment and assessing the overall prediction accuracy.
Prediction of gene expression from histopathology images
As illustrated in Fig. 1, we constructed models predicting normalized gene expression profiles from their corresponding histopathology images for each of the ten broad TCGA categories. Model performance was evaluated through the Pearson correlation (R) between the predicted and actual expression values of each gene across the test dataset samples. For each cancer type, a total of approximately 18,000 genes were analyzed. Among these, most genes (>75% for all cancer types) exhibited a positive correlation. Most cancer types demonstrated a median correlation of approximately 0.2 (Fig. 2a, empty violin plots). When focusing on the top 1,000 genes with the highest correlation, the mean of medians across the 16 cancer types was 0.43, ranging from 0.34 (for lung squamous cell carcinoma) to 0.55 (for rectum adenocarcinoma) (Fig. 2a, light blue violin plots).
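The evaluation described above can be sketched in a few lines. This is a minimal illustration rather than the DeepPT evaluation code: the function names, the dictionary layout (gene to per-sample values) and the toy numbers are assumptions made for the example, and only the metric itself (per-gene Pearson correlation across test samples, summarized by the median over all genes and over the best-predicted genes) comes from the text.

```python
import math
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def per_gene_correlations(predicted, measured):
    """Both arguments map gene -> list of values, one per test sample, in the same sample order."""
    return {gene: pearson(predicted[gene], measured[gene]) for gene in measured}

def summarize(corrs, top_k=1000):
    """Median over all genes, median over the top_k best-predicted genes, and fraction with R > 0."""
    vals = sorted((r for r in corrs.values() if not math.isnan(r)), reverse=True)
    return {
        "median_all": median(vals),
        "median_top_k": median(vals[:top_k]),
        "fraction_positive": sum(r > 0 for r in vals) / len(vals),
    }

# Toy data (hypothetical): three genes measured in four samples.
measured = {"G1": [1.0, 2.0, 3.0, 4.0], "G2": [2.0, 1.0, 4.0, 3.0], "G3": [1.0, 3.0, 2.0, 4.0]}
predicted = {"G1": [1.1, 1.9, 3.2, 3.8], "G2": [3.0, 4.0, 1.0, 2.0], "G3": [1.0, 2.0, 3.0, 4.0]}
corrs = per_gene_correlations(predicted, measured)
summary = summarize(corrs, top_k=2)
```

With roughly 18,000 genes per cohort, `top_k=1000` would correspond to the light blue violin plots in Fig. 2a; the toy call uses `top_k=2` only because it has three genes.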
Fig. 2 |. DeepPT prediction of gene expression from H&E slides.
a, Violin plots depicting the distribution of Pearson correlations between predicted and measured expression values across the cohort samples for all (approximately 18,000) genes (empty) and the top 1,000 genes with the highest correlations (light blue). In violin plots, the central mark is the median. The number of patients in each cohort is shown in parentheses. b, Median correlation between the predicted and measured expression values across the cohort samples obtained by HE2RNA (gray), SEQUOIA (pink) and DeepPT (light blue) for the top 1,000 best-predicted genes (independently selected for each model—those with the highest correlation). The performance of HE2RNA and SEQUOIA is taken as reported in the original publication49. c, Mean correlation between the predicted expression values of all genes and their measured values across the samples, obtained by HE2RNA (gray), tRNAsformer (purple) and DeepPT (light blue) for kidney cancer. The performance of HE2RNA and tRNAsformer has been reported in ref. 46. P values in b and c were calculated using a one-sided permutation test, and their values were zero in every case (*P < 0.001). d, Correlation distribution of the top 1,000 genes (left) and the number of genes with a correlation of >0.4 (right) achieved by DeepPT in two external unseen test cohorts. In violin plots, the central mark is the median. e, Venn diagrams illustrating the overlap between the well-predicted genes (R > 0.4) in TCGA-breast and TransNEO-breast (left) and in TCGA-brain and NCI-brain (right). Both have hypergeometric P values equal to zero. f, Pathway enrichment analysis on the well-predicted genes (R > 0.4). Each row represents a different cancer hallmark, and each column represents a different cohort (the two rightmost columns correspond to the two external cohorts). Values denote the FDR-corrected P values for pathway enrichment among the well-predicted genes (hypergeometric test). 
BRCA, breast invasive carcinoma; KIRC, kidney renal clear cell carcinoma; LGG, brain lower-grade glioma; LUSC, lung squamous cell carcinoma; LUAD, lung adenocarcinoma; HNSC, head and neck squamous cell carcinoma; PRAD, prostate adenocarcinoma; COAD, colon adenocarcinoma; STAD, stomach adenocarcinoma; KIRP, kidney renal papillary cell carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; PAAD, pancreatic adenocarcinoma; READ, rectum adenocarcinoma; ESCA, esophageal carcinoma; GBM, glioblastoma multiforme; KICH, kidney chromophobe.
Benchmarking against the existing expression prediction approaches, HE2RNA44 and SEQUOIA49, for the nine cohorts reported in ref. 49, we found that the mean of median correlations between the predicted gene expression and their measured values across the samples of the top 1,000 genes obtained by DeepPT was 0.41. This value is considerably higher than those for the top 1,000 genes obtained by HE2RNA (0.16) and SEQUOIA (0.26 for both versions, SEQUOIA_scratch and SEQUOIA_pretrain) (Fig. 2b). Of note, these lists of 1,000 genes are composed of the top predicted genes in each model. To compare the performance of DeepPT to that of tRNAsformer on the kidney cohort, we followed the performance measured in ref. 46; DeepPT obtained a mean correlation of 0.40 across all considered genes, significantly outperforming tRNAsformer (0.34) and HE2RNA (0.32) (Fig. 2c). Additional comprehensive comparisons are provided in Extended Data Fig. 2a–d.
For external validation, we tested the prediction ability of DeepPT on two unseen independent datasets available to us, which contained matched tumor WSIs and gene expression data. We first applied the DeepPT model constructed using the TCGA-breast cancer dataset to predict gene expression from corresponding H&E slides of the TransNEO-breast cancer cohort (n = 160). Notably, the two datasets were generated independently at different facilities, with two different preparation methods (TCGA slides are FFPE, whereas TransNEO slides are FF). Therefore, the histological features extracted from these two datasets are rather distinct (Extended Data Fig. 3). Despite these differences, without any further training or tuning, we achieved a median correlation of 0.49 for the top 1,000 well-predicted genes. Remarkably, 1,408 genes had a correlation of >0.4 (Fig. 2d). Similarly, when applying the DeepPT model trained on TCGA-brain samples to predict gene expression from new unpublished NCI-brain slides (n = 226), we obtained a median correlation of 0.48 for the top 1,000 genes and 1,753 genes had a correlation of >0.4 (Fig. 2d). When identifying the well-predicted genes (having a Pearson correlation exceeding 0.4) in each cohort, we observed a significant overlap of well-predicted genes between the training and corresponding external test cohorts of the same cancer type (hypergeometric P < 0.01). Specifically, of the 1,408 well-predicted genes in the TransNEO-breast cohort, approximately 53% (742 genes) overlapped with those in the TCGA-breast cohort (Fig. 2e, left). Similarly, among the 1,753 well-predicted genes in the NCI-brain cohort, approximately 74% (1,291 genes) overlapped with those well predicted in the TCGA-brain cohort (Fig. 2e, right).
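The overlap test used here can be illustrated with a short sketch. It assumes the usual formulation of a hypergeometric enrichment test (the probability of seeing at least the observed overlap when one gene list is drawn at random from the gene universe); the function name and the toy numbers are hypothetical, and the number of well-predicted genes in the TCGA cohorts is not given in this section, so no attempt is made to reproduce the reported P values.

```python
from math import comb

def hypergeom_overlap_pvalue(universe, n_a, n_b, overlap):
    """P(X >= overlap), where X is the number of 'marked' genes among n_b genes
    drawn without replacement from a universe containing n_a marked genes."""
    total = comb(universe, n_b)
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(universe - n_a, n_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Toy example (hypothetical numbers): 20 genes in total, 5 well predicted in
# each cohort, 3 shared between the two lists.
p_toy = hypergeom_overlap_pvalue(universe=20, n_a=5, n_b=5, overlap=3)

# Overlap fractions quoted in the text.
frac_breast = 742 / 1408
frac_brain = 1291 / 1753
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for very small tail probabilities, at the cost of speed for universes of about 18,000 genes; a log-space implementation would be a reasonable alternative.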
Genes reliably predicted by DeepPT are enriched for cancer hallmarks
We next explored whether genes that are reliably predicted by DeepPT (that is, those that are predicted with a Pearson correlation of R > 0.4 between predicted and measured expression across cohort samples) have known biological relevance to cancer. To this end, we carried out a pathway enrichment analysis focused on the cancer hallmarks. Specifically, we looked for enrichment among ten cancer hallmarks described in ref. 50 and for which detailed gene sets were given in ref. 51. Figure 2f summarizes the pathway enrichment analysis results for all TCGA subtypes and the two external cohorts. Interestingly, we observed a strong enrichment for immune processes across most cancer types (bottom row, ‘tumor-promoting inflammation’). Other enriched hallmarks include the cell cycle (‘sustaining proliferative signaling’), ‘avoiding immune destruction’ and ‘activating invasion and metastasis’. These results further attest that DeepPT can faithfully reconstruct key elements in cell expression related to cancer.
The expression of immune pathways is strongly associated with immune cell infiltration
To elucidate some of the slide morphological features that may be associated with gene expression predictions, we computed the correlation between the predicted gene expression and the abundance of TILs within the slides, using estimated TIL abundance data from ref. 34. Subsequently, we performed a gene set enrichment analysis52 to identify pathways enriched with genes whose predicted expression levels are highly correlated with TIL abundance. Intriguingly, in seven of the nine cancer types we examined, the most pronounced positive enrichment (indicated by the normalized enrichment score) was observed for two immune-related hallmarks: ‘avoiding immune destruction’ and ‘tumor-promoting inflammation’ (Fig. 3). Thus, DeepPT appears to consistently predict higher expression levels for immune-related genes in slides with higher TIL counts. In other words, among the hallmarks examined, the predicted expression of immune-related genes is the most strongly correlated with TIL abundance.
Fig. 3 |. Gene set enrichment analysis identifying pathways whose predicted gene expression correlates with TIL abundance.
Only cancer cohorts with available data for both predicted gene expression and estimated TILs are included in the analysis. P values were calculated using a one-sided permutation test for gene set enrichment analysis (*P < 0.05). The precise P values for each cancer hallmark, following the same order presented in each panel, were as follows: 0.806, 0.440, 0, 0, 0, 0.510, 0, 0, 0, 0 (BRCA); 0, 0.963, 1, 1, 0.981, 0, 0.390, 0, 0, 0 (LUSC); 0.522, 0.426, 0.637, 0.138, 0.121, 0.345, 0.083, 0, 0, 0 (LUAD); 0.931, 0, 0, 0, 0, 0.871, 0.023, 0.369, 0.030, 0.704 (PRAD); 0.147, 0.452, 0.177, 0, 0.048, 0.016, 0, 0.067, 0, 0 (COAD); 0.968, 0.986, 0.007, 0.001, 0.026, 0.068, 0, 0, 0, 0 (STAD); 0.008, 0.110, 0.079, 0, 0, 0.130, 0, 0, 0, 0 (CESC); 0, 0.409, 0.559, 0.496, 0.109, 0, 0.302, 0, 0.241, 0 (PAAD); 0, 0.167, 0.661, 0, 0.076, 0, 0, 0, 0, 0 (READ).
DeepPT-inferred prognostic signature levels are associated with patient survival
The observation that genes that promote proliferation and metastasis, which are well-known prognostic markers, are specifically well predicted by DeepPT led us to explore how well such prognostic markers predict patient survival when calculated over the gene expression predicted by DeepPT. To this end, we studied three proliferation signatures known to be linked to cancer progression and poor prognosis using patient tumor data from TCGA: (1) the expression of the gene encoding Ki-67 (MKI67), a well-known marker for cell proliferation that was shown to have prognostic value in breast cancer, colorectal cancer, sarcoma, squamous cell carcinoma and prostate cancer53; (2) a proliferation index derived in ref. 54 (proliferation signatures were shown to be prognostic in various cancers, such as breast, kidney and lung cancers); and (3) an epithelial-to-mesenchymal transition (EMT) signature from MSigDB (Molecular Signatures Database)55, which is associated with the formation and progression of metastasis and was shown to be prognostic, for example, in pancreatic, gastric and ovarian cancers.
For each patient in the TCGA cohort, we calculated a signature score for each of the three signatures; the signature score is the mean gene-wise ranked expression across the genes of the signature. We then tested the correlation between signature scores derived from the predicted and the actual gene expression and the association of each to patient survival using Cox proportional hazards tests at the cohort level. All three signature scores exhibited a significant correlation between their actual and DeepPT-predicted expression: 0.367 for EMT, 0.321 for proliferation and 0.328 for the Ki-67 signature. DeepPT reconstructs the prognostic value of the signatures rather faithfully: the correlation of the hazard ratios (HRs) of each signature between those computed based on the actual and predicted expression was very high (0.75–0.88; Fig. 4). These results provide evidence that DeepPT well retains the prognostic value of these signatures.
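One reading of the signature score defined above is sketched below. It assumes that "gene-wise ranked expression" means ranking each gene across patients (ties sharing the average rank, ranks scaled by the number of patients) and then averaging those ranks over the signature genes for each patient. That interpretation, the function names and the toy values are assumptions; MKI67 is taken from the text, whereas the other gene symbols are placeholders.

```python
def rank_across_samples(values):
    """Average ranks (1..n, ties share the mean rank), divided by n."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of samples tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks

def signature_scores(expression, signature_genes):
    """expression maps gene -> list of values (one per patient).
    Returns one score per patient: the mean scaled rank over the signature genes."""
    genes = [g for g in signature_genes if g in expression]
    ranked = [rank_across_samples(expression[g]) for g in genes]
    n_patients = len(ranked[0])
    return [sum(r[p] for r in ranked) / len(ranked) for p in range(n_patients)]

# Toy expression matrix (hypothetical values) for three patients.
expression = {
    "MKI67": [5.0, 1.0, 3.0],
    "GENE_B": [2.0, 2.0, 9.0],  # placeholder gene with a tie
    "GENE_C": [7.0, 8.0, 6.0],  # not part of the signature
}
scores = signature_scores(expression, ["MKI67", "GENE_B"])
```

The same function would be applied once to the measured and once to the DeepPT-predicted expression, and the two resulting score vectors compared by correlation and by their association with survival.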
Fig. 4 |. Comparison of the correlation of survival association in terms of log(HR) for three proliferation signatures based on actual and predicted expression.
Top, Ki-67; middle, proliferation index; bottom, EMT pathway. X axis, log(HR) of signature score based on actual expression; Y axis, log(HR) of signature score based on predicted expression. Each point represents a different TCGA cohort, and points are color-coded according to the significance of the survival association (two-sided Cox proportional hazards test) using a corrected P < 0.05 cutoff: green denotes that the survival association was significant by both the actual and predicted signatures; red and black denote that the survival association was significant by the actual or predicted signature only, respectively. Pearson correlation R and corresponding P values are denoted in each panel. The regression line and 95% confidence intervals are shown.
As mentioned above, Ki-67 was shown to have prognostic value in many different cancer types. Of note, most of those show higher hazard with higher Ki-67 levels based on actual and predicted expression values (Fig. 4). Interestingly, the associations in breast and colorectal cancers are more positive using the DeepPT-predicted values than the measured ones. The actual and predicted expression showed similar results in prostate cancer. As shown in Fig. 4, all cancer types showed a positive hazard association with the proliferation signature; in the case of kidney cancer (kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, kidney chromophobe) and lung adenocarcinoma, those associations were indeed significant based on both predicted and actual expression. Finally, while the association between EMT and survival in gastric cancer (stomach adenocarcinoma) was significant based on both actual and predicted expression, the association in pancreatic cancer (pancreatic adenocarcinoma) was significant only when using the DeepPT-predicted values.
Predicting treatment response from DeepPT-imputed gene expression
After establishing the predictive ability of DeepPT and demonstrating its ability to infer prognostic signature levels that maintain their survival predictability, we turn to our key goal, that is, to study the ability of ENLIGHT–DeepPT to predict patient response accurately in five clinical cohorts. Those include two human epidermal growth factor receptor 2 (HER2)+ breast cancer cohorts treated with chemotherapy plus trastuzumab48,56; a BRCA+ pancreatic cancer cohort treated with poly(ADP-ribose) polymerase (PARP) inhibitors (olaparib or veliparib); a mixed-indication cohort of patients with lung, cervical, and head and neck cancer treated with bintrafusp alfa57, a bispecific antibody that targets transforming growth factor-β and programmed death ligand 1; and, finally, an anaplastic lymphoma kinase (ALK)+ non-small cell lung cancer cohort treated with ALK inhibitors (alectinib or crizotinib). For each dataset, response was defined as determined by the clinicians running the respective trial (see Methods and Supplementary Table 2 for more details).
As the ENLIGHT–DeepPT workflow was designed with future potential clinical application in mind, we focused our assessment of its predictive power on measures that have direct clinical importance, including both the odds ratio (OR) of response and the average precision (AP). The OR, which is of major translational interest, denotes the ratio of the odds of responding among patients receiving ENLIGHT-matched treatments versus the odds of responding among patients whose treatments were not ENLIGHT-matched. Patients were considered ENLIGHT-matched if their ENLIGHT matching score (EMS) was greater than or equal to a threshold value of 0.54. This threshold was previously determined in the original ENLIGHT publication47 on independent data and was kept fixed here. Using this predefined threshold, we observed that the OR of ENLIGHT–DeepPT was higher than 1 for all datasets, although this was not statistically significant for the PARPi and ALKi datasets, possibly owing to their small sample sizes (Fig. 5a). This indicates that patients receiving ENLIGHT-matched treatments (EMS ≥0.54) had a higher chance to respond. Figure 5b further depicts the prediction performance of ENLIGHT–DeepPT through a complementary measure, the AP. Reassuringly, the AP of ENLIGHT–DeepPT exceeds the overall response rate (ORR) for all five datasets, supporting its broader predictive power beyond that quantified by the OR.
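To make the two measures concrete, the sketch below applies the fixed EMS threshold of 0.54 and computes AP from a list of scores and binary response labels. It uses the standard definition of AP (the mean of the precision values at the rank of each responder); the text does not spell out the exact AP implementation, so this is an assumption, and the scores and labels are invented for illustration.

```python
EMS_THRESHOLD = 0.54  # fixed threshold quoted in the text

def is_matched(ems, threshold=EMS_THRESHOLD):
    """A patient is ENLIGHT-matched when the score is at or above the threshold."""
    return ems >= threshold

def average_precision(scores, responded):
    """Mean precision at the rank of each responder, ranking by descending score.
    Ties keep their input order (Python's sort is stable)."""
    ranked = sorted(zip(scores, responded), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_responder) in enumerate(ranked, start=1):
        if is_responder:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")

# Hypothetical scores and outcomes for six patients.
ems = [0.81, 0.62, 0.55, 0.40, 0.33, 0.20]
responded = [True, False, True, False, True, False]

matched = [is_matched(s) for s in ems]
ap = average_precision(ems, responded)
orr = sum(responded) / len(responded)  # baseline that AP is compared against
```

Because AP depends only on the ranking of patients, it complements the OR, which is evaluated at a single decision threshold.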
Fig. 5 |. Predicting treatment response from H&E slides.
a, OR (y axis) for the five datasets tested and the aggregate cohort of all patients together (x axis). The drugs and sample sizes are denoted in the x-axis labels. The black horizontal dashed line denotes an OR of 1, which is expected by chance. Asterisks denote the significance of the OR being larger than 1 according to Fisher’s exact test. All P values were FDR corrected, P = 0.16 (PARPi), 0.18 (ALKi), 0.01 (bintrafusp alfa), 0.03 (trastuzumab1), 0.08 (trastuzumab2), 0.007 (all). *P < 0.1, **P < 0.05. b, AP (y axis) for the five datasets and the aggregate cohort, as in a. The black horizontal dashed lines denote the ORR for each dataset. An AP higher than the ORR demonstrates better accuracy than expected by chance. Asterisks denote the significance of the AP being higher than the response rate using a one-sided proportion test. All P values were FDR corrected, P = 0.22 (PARPi), 0.04 (ALKi), 0.04 (bintrafusp alfa), 0.003 (trastuzumab1), 0.11 (trastuzumab2), 0.003 (all). **P < 0.05. c, OR of the direct supervised method (y axis) for all 234 patients as a function of the fraction of patients above a given threshold (coverage, x axis). We present coverage between 10% and 90% only to avoid the measurement noise of extreme coverage values, where data are too small. The blue line denotes the OR of ENLIGHT–DeepPT for all 234 patients at its original clinical decision threshold. The red diamond denotes the threshold on the direct supervised method that yields the same coverage as ENLIGHT–DeepPT at its original, fixed threshold. d, Comparison of the OR of ENLIGHT–DeepPT and that of the direct supervised method (y axis) at thresholds that yield the same coverage (x axis). e, AP of ENLIGHT–DeepPT (light blue) and the direct supervised method (purple) for each dataset and on aggregate as in b. The dashed lines denote the ORR for each case as in b. f, OR for ENLIGHT-actual and ENLIGHT–DeepPT when predicting response to trastuzumab (for the trastuzumab1 cohort). 
g, Comparison of the AP (y axis) for both ENLIGHT-based models and the Sammut-ML predictor48. All methods were applied to the same patient group. The black horizontal dashed line denotes the ORR. The number of patients in each cohort is shown in parentheses.
Turning to an aggregate analysis of the performance of ENLIGHT–DeepPT, by analyzing all patients together in a simulated ‘basket’ trial in which each patient receives a different treatment (n = 234), we observed that the OR of ENLIGHT–DeepPT was 2.28 (95% confidence interval: 1.27, 4.06; rightmost bar in Fig. 5a), which is significantly higher than 1 (P = 0.007, Fisher’s exact test), and its precision was 46.5%, a 39.6% increase compared with the ORR of 33.3% (P = 0.024, one-sided proportion test). The AP was 0.47 (rightmost bar in Fig. 5b), which is significantly higher than the baseline response rate (P = 0.004, one-sided permutation test). Other performance metrics, such as area under the curve and accuracy, are given in Supplementary Table 3.
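The relationship between these aggregate quantities can be followed in a worked sketch. The 2×2 counts below are not taken from the study data; they are one illustrative table that sums to n = 234 and yields summary values close to those quoted above. The confidence interval uses the Woolf (log) method, which is an assumption, as the text does not name the method used.

```python
import math

def odds_ratio(tp, fp, fn, tn):
    """Odds of response among matched patients divided by odds among unmatched patients."""
    return (tp * tn) / (fp * fn)

def woolf_ci(tp, fp, fn, tn, z=1.96):
    """Approximate 95% confidence interval for the OR on the log scale (Woolf method)."""
    log_or = math.log(odds_ratio(tp, fp, fn, tn))
    se = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

def precision_and_lift(tp, fp, fn, tn):
    """Precision among matched patients, overall response rate, and relative increase."""
    precision = tp / (tp + fp)
    orr = (tp + fn) / (tp + fp + fn + tn)
    return precision, orr, precision / orr - 1

# Illustrative counts only (hypothetical): matched responders, matched
# non-responders, unmatched responders, unmatched non-responders.
tp, fp, fn, tn = 33, 38, 45, 118

or_value = odds_ratio(tp, fp, fn, tn)
ci_low, ci_high = woolf_ci(tp, fp, fn, tn)
precision, orr, lift = precision_and_lift(tp, fp, fn, tn)
```

The relative increase is simply precision divided by ORR, minus 1, so it is sensitive to rounding of either input.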
One of the advantages of ENLIGHT’s unsupervised approach is that it does not rely on data labeled with response to treatment data, which are usually scarce. To compare the performance of our two-step indirect model to that of a direct supervised model, we trained the same computational deep-learning pipeline as the one used in DeepPT on H&E slides and their corresponding response data on each of the five evaluation datasets described above (we replaced the regression component with a classification component). We used the training strategy that has been widely applied previously in the literature for direct H&E slide-based classification6,9,20,23,45 (see Methods for details) and termed this the ‘direct supervised’ model. Owing to the lack of independent treatment data for training, we applied leave-one-out cross-validation to evaluate the performance of the direct supervised model. As no tuning data were available to calibrate a single threshold for the direct supervised method, as was done in ref. 47, we calculated the OR of the direct supervised method for all patients (n = 234) on all possible thresholds. Figure 5c presents the OR as a function of the coverage (the fraction of patients with scores above a specific threshold). We compared these values to the OR of ENLIGHT–DeepPT at the clinical decision threshold established in ref. 47 based on measured RNA levels (0.54, blue line). Surprisingly, no threshold on the direct supervised values yielded an OR that surpassed the OR of ENLIGHT–DeepPT at its predetermined clinical decision threshold. In addition, we calculated ENLIGHT–DeepPT’s OR at various possible EMS thresholds and compared the results at thresholds that yielded the same coverage in both ENLIGHT–DeepPT and the direct supervised method (Fig. 5d). Finally, Fig. 5e compares the AP (which is threshold independent) of the two models. 
Remarkably, the overall performance of ENLIGHT–DeepPT (an unsupervised method), as measured by the OR and AP, was comparable to that of the supervised classifiers trained and predictive only for specific treatments; in some cases, ENLIGHT–DeepPT even outperformed them.
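The threshold sweep behind Fig. 5c,d can be sketched as follows. For every distinct score used as a threshold, patients at or above it are called predicted responders, and the coverage and OR are recorded. The function name and toy data are hypothetical, and returning `None` when the OR is undefined (a zero cell in the denominator) is a choice made for this example.

```python
def or_by_coverage(scores, responded):
    """Return a list of (coverage, OR) pairs, one per distinct threshold,
    ordered from the strictest threshold to the most permissive.
    OR is None when it is undefined (no false positives or no false negatives)."""
    n = len(scores)
    curve = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, r in zip(scores, responded) if s >= thr and r)
        fp = sum(1 for s, r in zip(scores, responded) if s >= thr and not r)
        fn = sum(1 for s, r in zip(scores, responded) if s < thr and r)
        tn = n - tp - fp - fn
        coverage = (tp + fp) / n
        odds = (tp * tn) / (fp * fn) if fp and fn else None
        curve.append((coverage, odds))
    return curve

# Hypothetical classifier scores and outcomes for eight patients.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
responded = [True, True, False, True, False, False, True, False]

curve = or_by_coverage(scores, responded)
# Keep only coverage between 10% and 90%, as in Fig. 5c.
trimmed = [(c, o) for c, o in curve if 0.1 <= c <= 0.9]
```

Comparing two methods at equal coverage, as in Fig. 5d, amounts to looking up the OR of each method at the threshold that flags the same fraction of patients.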
For one of the datasets (trastuzumab1), RNA sequencing data of tumor gene expression were also available and previously analyzed using ENLIGHT47. Of note, no measured mRNA data were available for any other dataset. Figure 5f,g compares the predictive performance using ENLIGHT–DeepPT scores to that using ENLIGHT scores calculated based on the measured expression values (denoted ‘ENLIGHT-actual’). Using the previously established threshold of 0.54, the OR of ENLIGHT–DeepPT was 3.36, which is lower than the OR of 6.95 obtained by ENLIGHT-actual but is still significantly higher than expected by chance (P = 0.03, test for OR > 1). The positive predictive value (also known as precision) of ENLIGHT–DeepPT was 50%, which is slightly lower than but not significantly different from the positive predictive value of 52% when using ENLIGHT-actual and 68.4% higher than the ORR of 29.7% observed in this study. However, the sensitivity (the fraction of responders correctly identified) of ENLIGHT–DeepPT was markedly lower than that of ENLIGHT-actual (42.1% versus 68%).
Finally, we sought to compare ENLIGHT–DeepPT to other predictive models for drug response. For the drugs analyzed in this study, the only available mRNA-based model for response is the multiomic machine learning predictor that uses DNA, RNA and clinical data, published in ref. 48 and denoted here as Sammut-ML. This model was based on in-cohort supervised learning to predict response to chemotherapy with or without trastuzumab among patients with HER2+ breast cancer. Figure 5g compares the performance of ENLIGHT-actual and ENLIGHT–DeepPT to that of Sammut-ML. In both analyses, we applied all methods to the same group of 56 patients for whom all relevant data were available (RNA sequencing, H&E slides, DNA sequencing and clinical features). To compare the predictors systematically across a wide range of decision thresholds, and as Sammut-ML did not derive a binary classification threshold, we used AP as the measure of comparison here. As can be seen, all methods have rather comparable predictive power, with ENLIGHT–DeepPT having the highest AP (difference not statistically significant). Importantly, using only H&E slides, without the need for RNA or DNA data or other clinical features, is a considerable practical advantage. To complement this analysis, we show that ENLIGHT–DeepPT outperforms a model that uses only the predicted expression of the drug targets as predictors of response (Extended Data Fig. 4a). Of note, we applied the original version of ENLIGHT as described in ref. 47. The only modification made to the original ENLIGHT version was the exclusion of the component that, in the case of monoclonal antibodies, considers the expression of the drug target itself. This component does not increase the prediction accuracy (Extended Data Fig. 4b), and we excluded it because it heavily weights a single gene, making the score more susceptible to perturbation resulting from noisy prediction of that one gene.
For completeness, we note that limiting the genetic interaction (GI) networks to include only highly predicted genes resulted in worse performance (Extended Data Fig. 4c).
Discussion
This study shows that combining DeepPT, a deep-learning framework for predicting gene expression from H&E slides, with ENLIGHT, a published unsupervised computational approach for predicting patient response from pretreated tumor transcriptomics, could be used to form an ENLIGHT–DeepPT approach for H&E-based prediction of clinical response to a host of targeted and immune therapies. We began by showing that DeepPT outperforms current state-of-the-art methods in predicting mRNA expression profiles from H&E slides. Then, we showed that ENLIGHT–DeepPT successfully predicts the true responders in several clinical datasets from different indications, treated with a variety of targeted drugs, directly from H&E images.
Combining DeepPT with ENLIGHT is a promising approach for predicting response directly from H&E slides because it does not require response data on which to train. This is a crucial advantage compared with the more common practice of using response data to train classifiers in an end-to-end manner. Indeed, applying ENLIGHT to the predicted expression enabled us to predict the response to four different treatments in five datasets spanning six cancer types, with considerable accuracy and without the need for any treatment data for training. While supervised models can only be obtained for drugs with available H&E and response data, ENLIGHT–DeepPT can produce predictions for virtually any targeted treatment, importantly including those in early stages of development, for which such training data are still absent.
A notable finding of this study is the robustness of response predictions based on H&E slides when combining DeepPT and ENLIGHT. First, despite the inevitable noise introduced by the prediction of gene expression, the original ENLIGHT GI networks, designed to predict response from measured RNA expression, worked well as is in predicting response based on the DeepPT-predicted expression. Second, although DeepPT was trained using FFPE slides, it generalized well and could be used as is to predict expression values from FF slides. This demonstrates the applicability of DeepPT for predicting RNA expression either from FF or FFPE slides. Nevertheless, as promising as the results presented here are, they should, of course, be further tested and expanded upon by applying the generic pipeline presented here to many more cancer types and treatments.
Developing a response prediction pipeline from H&E slides, if reasonably accurate and further carefully tested and validated in clinical settings, could be of utmost benefit, as next-generation sequencing often takes several weeks to return a result. Many patients with advanced cancers require immediate treatment, and this method can potentially offer treatment options within a shorter time frame. However, one should cautiously note that, while promising, the results presented in this study await further testing and validation in carefully designed prospective studies before they may be applied in the clinic. We are hopeful that the results presented here will both expedite such efforts going forward and provide further impetus for making many more such cohorts publicly available to facilitate further developments of such generic prediction approaches.
Methods
Data collection
The datasets used in this study come from both publicly available and internal resources. Details are described below.
TCGA histological images and their corresponding gene expression profiles were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov). Only diagnostic slides from primary tumors were selected, making a total of 6,269 FFPE slides from 5,528 patients with breast (1,106 slides, 1,043 patients), lung (1,018 slides, 927 patients), brain (1,015 slides, 574 patients), kidney (859 slides, 836 patients), colorectal (514 slides, 510 patients), prostate (438 slides, 392 patients), gastric (433 slides, 410 patients), head and neck (430 slides, 409 patients), cervical (261 slides, 252 patients) and pancreatic (195 slides, 175 patients) cancer.
The TransNEO-breast dataset consists of FF slides and their corresponding gene expression profiles from 160 patients with breast cancer. Full details of the RNA library preparation and sequencing protocols, as well as digitization of slides, have been previously described48.
The NCI-brain histological images and their corresponding gene expression profiles were obtained from the archives of the Laboratory of Pathology at the NCI. They consist of 226 cases comprising a variety of central nervous system tumors, including both common and rare tumor types. All cases were subject to methylation profiling (to evaluate the diagnosis) and RNA sequencing.
The bintrafusp alfa-treated cohort consists of 58 patients with lung cancer (9 patients), cervical cancer (16 patients), and head and neck cancer (33 patients). FFPE slides were made available from the NCI.
The trastuzumab1 cohort is a subset of the TransNEO-breast dataset mentioned above, consisting of 64 patients who had received a combination of chemotherapy and trastuzumab.
Trastuzumab2, a HER2+ breast cancer cohort treated with a combination of trastuzumab and chemotherapy, consists of 85 patients and their FFPE slides40,56. FFPE slides were downloaded from The Cancer Imaging Archive database (https://www.cancerimagingarchive.net).
The ALKi dataset consists of 14 patients with ALK-mutated non-small cell lung cancer treated with crizotinib or alectinib. Corresponding FFPE slides were made available from the University of Colorado.
The PARPi dataset consists of 13 patients with germline BRCA-mutated pancreatic cancer treated with PARPis. FFPE slides were made available from Sheba Medical Center.
For each dataset, the classification of patients into responders and non-responders was based on the criterion used by the clinicians running the respective trial. For trastuzumab1 and trastuzumab2, response was defined as pathological complete response, whereas non-responders were defined as having residual disease. For bintrafusp alfa, responders were defined as patients with partial response or complete response. For ALKi, response was defined as a progression-free survival of more than 18 months. For PARPi, response was defined as an overall survival of more than 36 months. Full details can be found in Supplementary Table 2.
Histopathology image processing
We first used Sobel edge detection58 to identify areas containing tissue within each slide. Because the WSIs are too large (from 10,000 to 100,000 pixels in each dimension) to feed directly into the deep neural networks, we partitioned the WSIs at 20× magnification into non-overlapping tiles of 512 × 512 RGB pixels. Tiles containing more than half of the pixels with a weighted gradient magnitude smaller than a certain threshold (varying from 10 to 20, depending on image quality) were removed. Depending on the size of the slides, the number of tiles per slide in the TCGA cohort varied from 100 to 8,000 (Extended Data Fig. 5a–e). In contrast, TransNEO slides, for example, are much smaller, resulting in 100 to 1,000 tiles per slide (Extended Data Fig. 5f). To minimize staining variation (heterogeneity and batch effects), we applied Macenko’s method for color normalization to the selected tiles59.
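The tile-filtering rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain (unweighted) Sobel gradient magnitude on a grayscale tile with replicated borders, and the function names `sobel_magnitude` and `keep_tile` and the default threshold of 15 are hypothetical choices within the reported 10 to 20 range.

```python
import math


def sobel_magnitude(gray):
    """Per-pixel Sobel gradient magnitude of a 2D list of intensities."""
    h, w = len(gray), len(gray[0])

    def px(r, c):
        # Replicate border pixels (assumption; the paper does not specify this).
        return gray[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    mag = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = (px(r - 1, c + 1) + 2 * px(r, c + 1) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r, c - 1) - px(r + 1, c - 1))
            gy = (px(r + 1, c - 1) + 2 * px(r + 1, c) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r - 1, c) - px(r - 1, c + 1))
            mag[r][c] = math.hypot(gx, gy)
    return mag


def keep_tile(gray, threshold=15.0):
    """Discard a tile when more than half of its pixels have low gradient."""
    mag = sobel_magnitude(gray)
    total = len(gray) * len(gray[0])
    low = sum(1 for row in mag for m in row if m < threshold)
    return low <= total / 2


# A uniform (background-like) tile is removed; a textured tile is kept.
flat = [[200] * 8 for _ in range(8)]
stripes = [[255 if (c // 2) % 2 == 0 else 0 for c in range(8)] for _ in range(8)]
print(keep_tile(flat), keep_tile(stripes))
```

In practice the same rule would be applied to each 512 × 512 tile using an optimized image library rather than pure Python loops.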
Gene expression processing
Gene expression profiles were obtained as read counts for 60,483 gene identifiers. Genes considered expressed were identified using edgeR, resulting in approximately 18,000 genes for each cancer type. The median expression across samples of each gene varied from 10 to 10,000 reads for each dataset (Extended Data Fig. 6a–f). To reduce the range of gene expression values and to minimize discrepancies in library size between experiments and batches, we performed normalization as described in our previous work47.
DeepPT architecture
Our model architecture was composed of three main units (Extended Data Fig. 1a,b).
Feature extraction.
The pretrained ResNet50 convolutional neural network model60, trained on the ImageNet database (approximately 14 million natural images across 1,000 labels), was used to extract features from image tiles. Before feeding these tiles into the ResNet50 unit, the image tiles were resized to 224 × 224 pixels to match the standard input size for the convolutional neural network. Through the feature extraction process, each input tile is represented by a vector of 2,048 derived features. The pretrained model serves as a fundamental element enabling transfer learning, which has been widely used in digital pathology. In the context of DeepPT, we have omitted the last layers of ResNet50 to ensure that the extracted features are more generic and better fit for digital pathology tasks rather than being constrained by the original ImageNet classification task.
Feature compression.
We applied an autoencoder, which consists of a bottleneck of 512 neurons, to reduce the number of features from 2,048 to 512. This helps to exclude noise, avoid overfitting and reduce the computational demands. As shown in Extended Data Fig. 7a,b, a large number of ResNet features are constantly zero. This data sparsity is considerably reduced in the autoencoder features (Extended Data Fig. 7c,d). To demonstrate the benefit of the autoencoder module, we conducted an experiment by excluding the autoencoder module from our framework, trained a model on the TCGA-breast cohort while retaining the ResNet50 features and tested it on the TransNEO-breast cohort. The prediction accuracy of this model (ResNet) is notably lower than that of the model in which the autoencoder component is retained, as done in DeepPT (ResNet + autoencoder) (Extended Data Fig. 7e). This trend was also consistently observed in the NCI-brain cohort (Extended Data Fig. 7f), underscoring the beneficial impact of the autoencoder module.
Multilayer perceptron regression.
The purpose of this component is to build a predictive model linking the aforementioned autoencoded features to whole-genome gene expression. The model consists of three layers: (1) an input layer with 512 nodes, reflecting the size of the autoencoded vector; (2) a hidden layer whose size depends on the number of genes under shared consideration; and (3) an output layer with one node per gene. Because the training data consist of gene expression at the slide level (that is, bulk gene expression, as opposed to at spatial resolution), we averaged our per-tile predictions to obtain a mean value at the slide level.
DeepPT training and evaluation
We trained and evaluated each cancer type independently. To evaluate the performance of our model, we applied 5 × 5 nested cross-validation. For each outer loop, we split the entire patient population (of each cohort) into training (80%) and held-out test (20%) sets. We further split the training set into internal training and validation sets according to fivefold cross-validation. The models were trained and evaluated independently with each pair of training/validation sets. The average of the predictions from the five resulting models is our final prediction for each gene on each held-out test set. We repeated this procedure five times across the five held-out test sets, making a total of 25 trained models. These models trained with the TCGA cohorts were used to predict the expression of each gene in a given external cohort by computing the mean over the predicted values of all models (Extended Data Figs. 1c,d and 8). Because each patient can have more than one slide, we averaged the slide-level predictions to obtain patient-level predictions.
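The 5 × 5 splitting scheme and the averaging steps can be sketched as below. This is an illustrative outline under stated assumptions, not the released DeepPT code: the function names `nested_cv_splits` and `average_predictions`, the random seed and the interleaved fold assignment are all hypothetical.

```python
import random
from statistics import mean


def nested_cv_splits(patient_ids, n_outer=5, n_inner=5, seed=0):
    """Yield (outer_fold, inner_fold, train, val, test) lists of patient IDs."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    outer = [ids[i::n_outer] for i in range(n_outer)]
    for o, test in enumerate(outer):
        # Patients outside the held-out test fold form the outer training set.
        rest = [p for j, fold in enumerate(outer) if j != o for p in fold]
        inner = [rest[i::n_inner] for i in range(n_inner)]
        for k, val in enumerate(inner):
            train = [p for j, fold in enumerate(inner) if j != k for p in fold]
            yield o, k, train, val, test


def average_predictions(predictions):
    """Gene-wise mean over a list of per-gene prediction vectors.

    The same helper averages over models (ensemble), over tiles (slide level)
    or over slides (patient level).
    """
    return [mean(values) for values in zip(*predictions)]


splits = list(nested_cv_splits(range(100)))
print(len(splits))  # 5 outer folds x 5 inner folds
print(average_predictions([[1.0, 2.0], [3.0, 4.0]]))
```

With 100 patients, each of the 25 splits has 64 training, 16 validation and 20 held-out test patients, and every patient is held out in exactly one outer fold.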
Each training round was stopped at a maximum of 500 epochs, or sooner if the average correlation per gene between the actual and predicted values of gene expression on the validation set did not improve for 50 consecutive epochs. The Adam optimizer with mean-squared error loss function was used in both the autoencoder and multilayer perceptron (MLP) models. A learning rate of 0.0001 and a minibatch size of 32 image tiles per step were used for both the autoencoder model and the MLP regression model. To reduce overfitting, we used a dropout of 0.2.
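The stopping rule can be expressed independently of the deep-learning framework. The sketch below assumes a hypothetical callback `run_epoch(epoch)` that trains for one epoch and returns the validation score (here, the mean per-gene correlation, where higher is better); the function name and return values are illustrative only.

```python
def train_with_early_stopping(run_epoch, max_epochs=500, patience=50):
    """Run up to max_epochs; stop after `patience` epochs without improvement.

    Returns (best_epoch, best_score, epochs_run).
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        score = run_epoch(epoch)
        epochs_run += 1
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, epochs_run


# Toy validation curve that improves until epoch 9 and then plateaus.
result = train_with_early_stopping(lambda epoch: min(epoch, 9) / 10)
print(result)
```

In this toy run, the score peaks at epoch 9, the following 50 epochs bring no improvement, and training stops after 60 epochs in total.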
Data augmentation
Because the number of samples in the TCGA-pancreatic adenocarcinoma cohort is relatively small, we further reduced overfitting by artificially increasing the amount of data for this cohort, rotating the WSIs by 90°, 180° and 270°. At test time, the average over the four orientations (the original and its three rotations) represents our prediction for each slide. No data augmentation was performed for other cohorts owing to its high computational demand.
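A minimal sketch of the rotation-based averaging at test time is given below. It operates on a small 2D grid standing in for an image, and the `predict` argument is a hypothetical stand-in for the trained model; none of these names come from the original code.

```python
def rotate90(grid):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def four_rotations(grid):
    """Return the original grid and its 90, 180 and 270 degree rotations."""
    views = [grid]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views


def predict_with_rotations(predict, grid):
    """Average a scalar prediction over the four orientations."""
    preds = [predict(view) for view in four_rotations(grid)]
    return sum(preds) / len(preds)


image = [[1, 2], [3, 4]]
# Toy "model" that reads the top-left pixel, so its output depends on orientation.
print(predict_with_rotations(lambda g: g[0][0], image))
```

For a model that returns one value per gene, the same averaging would be applied gene by gene.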
Direct supervised model
The direct (end-to-end) supervised model was designed to classify responders and non-responders directly from their slides without the intermediate step of gene expression prediction. To this end, we applied the same computational deep-learning framework that was used for the prediction of gene expression, except that the MLP regression component was replaced by an MLP classification component. Each of the five evaluation datasets was processed independently. Following previous approaches6,9,20,23,45, all tiles from a given slide inherit the slide label. Because the number of samples for each dataset is relatively small, we evaluated the direct supervised model using leave-one-out cross-validation. For each held-out patient, we applied a bootstrap sampling technique to randomly split the remaining patients into training (80%) and validation (20%) sets 30 times, resulting in 30 models. For each model, slide-level prediction was computed by averaging tile-level predictions within that slide. The final prediction of each held-out slide was computed by averaging its predictions over the 30 models.
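The evaluation loop for the direct supervised model can be outlined as follows. This sketch assumes that the "bootstrap" step is a repeated random 80/20 split without replacement of the remaining patients, which is one reading of the description above; the function name `loo_random_splits` and the seed are hypothetical.

```python
import random


def loo_random_splits(patient_ids, n_repeats=30, train_frac=0.8, seed=0):
    """Yield (held_out, repeat, train, val) for leave-one-out evaluation.

    For each held-out patient, the remaining patients are randomly split
    into training and validation sets n_repeats times.
    """
    rng = random.Random(seed)
    ids = list(patient_ids)
    for held_out in ids:
        rest = [p for p in ids if p != held_out]
        for rep in range(n_repeats):
            shuffled = rest[:]
            rng.shuffle(shuffled)
            n_train = round(train_frac * len(shuffled))
            yield held_out, rep, shuffled[:n_train], shuffled[n_train:]


# Example with 14 patients (the size of the ALKi dataset).
loo_splits = list(loo_random_splits([f"p{i}" for i in range(14)]))
print(len(loo_splits))  # 14 held-out patients x 30 random splits
```

One model would be trained per yielded split, and the 30 predictions for each held-out patient would then be averaged.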
Implementation details
All analyses in this study were performed in Python 3.9.7 and R 4.1.0 with libraries including Numpy 1.20.3, Pandas 1.3.4, Scikit-learn 1.1.1, Matplotlib 3.4.3 and edgeR 3.28.0. Image processing, including tile partitioning and color normalization, was conducted with OpenSlide 1.1.2, OpenCV 4.5.4 and PIL 8.4.0. The feature extraction, feature compression (autoencoder unit) and MLP regression parts were implemented using PyTorch 1.12.0. Pearson correlation was calculated using Scipy 1.5.0.
ENLIGHT
ENLIGHT’s drug response prediction comprises two steps. (1) Given a drug, the GI engine identifies the clinically relevant GI partners of the drug’s target gene(s). The GI engine first identifies a list of initial candidate synthetic lethal (SL)/synthetic rescue (SR) interactions by analyzing cancer cell line dependencies based on the principle that SL/SR interactions should decrease/increase tumor cell viability, respectively, when ‘activated’ (for example, in the SL case, viability is decreased when both genes are underexpressed). It then selects those pairs that are more likely to be clinically relevant by analyzing a database of tumor samples with associated transcriptomics and survival data, requiring a significant association between the joint inactivation of target and partner genes and better patient survival for SL interactions and analogously for SR interactions. (2) The drug-specific GI partners are then used to predict a patient’s response to each drug based on the gene expression profile of the patient’s tumor. The EMS, which evaluates the match between patient and treatment, is based on the overall activation state of the set of GI partner genes of the drug targets, deduced from gene expression, reflecting the notion that a tumor would be more susceptible to a drug that induces more active SL interactions and fewer active SR interactions.
Statistics and reproducibility
In TCGA, between 7 and 100 patients were excluded from the analysis of each cohort owing to the low quality of H&E slides. In all other cases, no data were excluded from the analyses. No statistical methods were used to predetermine sample sizes, but our sample sizes are similar to those reported in previous publications44,49. Statistical tests, including the method used and the sample sizes, are specified throughout the article. All data met the assumption of the statistical tests used. P values were false discovery rate (FDR) corrected for multiple hypotheses throughout the article, as individually specified. The statistical methods for pathway enrichment analysis, survival analysis, OR and AP are detailed below. Data collection and analysis were not performed blind to the conditions of the experiments.
Pathway enrichment analysis
To test whether genes that are well predicted by DeepPT (R > 0.4) are enriched for hallmarks of cancer, we performed a hypergeometric test for ten cancer hallmarks described in ref. 50, for which detailed gene sets were given in ref. 51. P values were then corrected using Benjamini–Hochberg FDR. Results are presented in Fig. 2f.
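For illustration, the enrichment test and the multiple-testing correction can be written with the standard library alone. This is a generic sketch of a one-sided hypergeometric test and the Benjamini–Hochberg procedure, not the authors' script; the function names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N genes, K in the gene set, n selected)."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 10 genes, 3 in a hallmark set, 4 well predicted, all 3 set genes among them.
p_enrich = hypergeom_sf(3, 10, 3, 4)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p_enrich, fdr)
```

Here `N` would be the number of expressed genes, `K` the size of a hallmark gene set, `n` the number of genes with R > 0.4 and `k` their overlap.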
Survival analysis of prognostic signatures
We used the Cox proportional hazards method to calculate HRs for the three prognostic signatures based on actual and predicted expression, as presented in Fig. 4.
Odds ratio
In the clinical setting, the OR is the ratio between the odds of responding to a treatment (P(response)/P(no response)) in two disjoint groups. Specifically, given
A = P(response to treatment among ENLIGHT-matched patients),
B = P(no response to treatment among ENLIGHT-matched patients),
C = P(response to treatment among ENLIGHT-unmatched patients) and
D = P(no response to treatment among ENLIGHT-unmatched patients),
the OR is (A/B)/(C/D).
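As a worked example, the OR can be computed directly from a 2 × 2 table of patient counts, because the group sizes cancel within each odds term. The function name `odds_ratio` and the counts below are hypothetical and are not taken from the study cohorts.

```python
def odds_ratio(matched_resp, matched_nonresp, unmatched_resp, unmatched_nonresp):
    """Odds ratio (A/B)/(C/D) from counts of responders and non-responders.

    A/B equals matched_resp / matched_nonresp and C/D equals
    unmatched_resp / unmatched_nonresp, since the group sizes cancel.
    """
    if matched_nonresp == 0 or unmatched_resp == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    return (matched_resp * unmatched_nonresp) / (matched_nonresp * unmatched_resp)


# Hypothetical cohort: 8 of 12 matched patients respond, 5 of 15 unmatched patients respond.
print(odds_ratio(8, 4, 5, 10))  # odds of 2.0 versus 0.5
```

An OR above 1 indicates that ENLIGHT-matched patients have higher odds of response than unmatched patients.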
Average precision
The AP is defined as $\mathrm{AP} = \frac{1}{|\mathrm{thresholds}|} \sum_{t \in \mathrm{thresholds}} P(\mathrm{response} \mid X > t)$, where X represents all the predictions of a predictive model, ‘thresholds’ is the set of unique values of X and P (response ∣ X > t) is the probability of response among patients with a predictive value greater than t.
Extended Data
Extended Data Fig. 1 |. Model architecture in detail and training strategies.
(a) The feature compression subnetwork consists of an input layer of 2,048 neurons, a bottleneck of 512 neurons, and an output layer of 2,048 neurons. (b) The MLP regression subnetwork consists of an input layer of 512 neurons, a hidden layer of 512 neurons, and an output layer with the number of neurons reflecting the number of genes. (c) In the ensemble learning strategy (bagging), five models were trained independently with five internal training-validation splits; these five model predictions were averaged to make the final prediction. (d) In the model selection strategy, the ‘best’ model with the highest performance on the validation set was chosen to make predictions on the test set. Of note, DeepPT uses ensemble learning.
Extended Data Fig. 2 |. The distribution of correlations between the predicted and actual gene expression values across the cohort samples.
The violin plots depict the correlations between the predicted and measured expression values across the cohort samples obtained by HE2RNA (light pink) and DeepPT (light blue) for all genes (a), the top 1,000 genes (b), the top 2,000 genes (c), and the top 3,000 genes (d) with the highest correlations. The results presented in this figure were measured by the mean of 5 folds, as reported in ref. 44. Except for this figure, all other results presented in this study were measured across the entire test samples, consistent with the approach used in ref. 49. P-values were calculated using the one-sided Mann-Whitney U test. In violin plots, the central mark is the median. The number of patients in each cohort is shown in parentheses.
Extended Data Fig. 3 |. Difference between histopathological features extracted from TCGA-Breast tiles and TransNEO-Breast tiles.
UMAP visualization of the 2,048 histopathological features extracted using the pre-trained ResNet50 CNN. For illustration, 4,000 image tiles were randomly selected from each dataset. Each point represents the feature vector of one image tile.
Extended Data Fig. 4 |. Comparisons of ENLIGHT-DeepPT with other methods.
(a) Performance of ENLIGHT-DeepPT (light blue bars) and the respective drug target(s) expression (gray bars). (b) Performance of ENLIGHT-DeepPT when using the same methodology described in ref. 47 (light blue bars) and a version of ENLIGHT-DeepPT that incorporates the target expression in the scoring method for antibodies (gray bars). (c) Performance of ENLIGHT-DeepPT when using the same methodology described in ref. 47 to generate the genetic interaction networks that constitute ENLIGHT’s predictive biomarkers (light blue bars) and a revised methodology (gray bars) where we restricted ENLIGHT’s biomarkers to only include genes that showed high positive correlation (R > 0.4) between actual and DeepPT-predicted values among the respective TCGA cohort (that is, according to the cancer type of each of the five drug response datasets). Results are shown for each of the three datasets where antibody drugs were used and for their aggregate. The odds ratio (OR) for each dataset was obtained by using the same clinical decision threshold that has been previously established in ref. 47. The number of patients in each cohort is shown in parentheses.
Extended Data Fig. 5 |. Histograms of the number of tiles per slide by cohort.
The number of tiles in each slide image from TCGA and NCI-Brain datasets ranges from 100 to 8,000 (a-e), while the number of tiles in each TransNEO-Breast slide image is much smaller, ranging from 100 to 1,000 (f).
Extended Data Fig. 6 |. Histogram of median expression over slides.
The median expression over samples of each gene commonly varies from 10 to 10,000 for every dataset considered in this study (a to f).
Extended Data Fig. 7 |. The benefit of the autoencoder module.
(a-d) Difference between ResNet features and AutoEncoder features. Histograms of median and standard deviation of ResNet features (a-b) and AutoEncoder features (c-d). The TCGA-BRCA cohort was selected as an example. (e-f) Model performance on external validation datasets. The violin plots depict the distribution of Pearson correlations between the predicted and experimentally measured expression values across the cohort samples for the top 1,000 genes with the highest correlation. The bar charts indicate the number of genes exhibiting Pearson correlations between the predicted and measured expression values across the cohort samples that are above 0.4. The results are presented separately for each external validation set, TransNEO-Breast (e) (n = 160 patients) and NCI-Brain (f) (n = 226 patients). P-values were calculated using the one-sided Mann-Whitney U test. In violin plots, the central mark is the median.
Extended Data Fig. 8 |. The benefit of ensemble learning.
The violin plots depict the distribution of Pearson correlation for the top 1,000 genes, and the bar charts indicate the number of genes exhibiting Pearson correlations between the predicted and measured expression values across the cohort samples above 0.4. The results were obtained from model selection strategy (gray) and ensemble learning strategy (light blue). P-values were calculated using the one-sided Mann-Whitney U test. In violin plots, the central mark is the median. The number of patients in each cohort is shown in parentheses.
Supplementary Material
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s43018-024-00793-2.
Acknowledgements
This work was partially supported by grant DP190103402 from the Australian Research Council (D.-T.H., E.A.S.) and by the Intramural Research Program of the National Institutes of Health (NIH), National Cancer Institute (NCI), Center for Cancer Research (CCR) (E.D.S., S.S., N.S., E.R.). This work used the supercomputational resources of the Australian National Computational Infrastructure (AUNCI) and the Australian National University Merit Allocation Scheme (ANUMAS). D.-T.H. would like to thank T. Bui for the helpful discussion.
Footnotes
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Competing interests
D.-T.H., E.A.S., E.R., G.D., R.A. and T.B. are listed as inventors on a patent (application no. 63/349,829, United States, 2022) filed based on the methodology outlined in this study. G.D., D.S.B., E.E., T.B. and R.A. are employees of Pangea Biomed. E.R. is a cofounder of Medaware, Metabomed and Pangea Biomed (divested from the latter). E.R. serves as a non-paid scientific consultant to Pangea Biomed under a collaboration agreement between Pangea Biomed and the NCI. The other authors declare no competing interests.
Additional information
Extended data is available for this paper at https://doi.org/10.1038/s43018-024-00793-2.
Data availability
TCGA histological images and their corresponding gene expression profiles were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov). The TransNEO-breast dataset is available at https://ega-archive.org/studies/EGAS00001004582. The NCI-brain and bintrafusp alfa cohorts were obtained from the NCI. The trastuzumab1 cohort is a subset of the TransNEO-breast dataset. Trastuzumab2 cohort data were downloaded from The Cancer Imaging Archive database (https://www.cancerimagingarchive.net). The ALKi dataset was made available by the University of Colorado, and the PARPi dataset was made available by Sheba Medical Center. Restrictions apply to the distribution of these data, which were used under license for the current study. Access to the ALKi dataset may be requested from S.P. (sharon.pine@cuanschutz.edu). Access to the PARPi dataset may be requested from T.G. (Talia.Golan@sheba.health.gov.il). All DeepPT-predicted expression and relevant response data, along with the code to calculate performance measures, are available via GitHub at https://github.com/PangeaResearch/enlight-deeppt-data. Source data are provided with this paper. All other data supporting the findings of this study are available from the corresponding authors upon reasonable request.
Code availability
The DeepPT framework is available for academic research purposes via Zenodo at https://doi.org/10.5281/zenodo.11125591 (ref. 61). ENLIGHT scores, given expression profiles (either measured directly from the tumor or predicted from slides), can be calculated using a web service at https://ems.pangeabiomed.com/. Any additional information is available from the corresponding authors upon request.
References
1. Golub TR et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999).
2. Curtis C et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
3. Doroshow DB & Doroshow JH. Genomics and the history of precision oncology. Surg. Oncol. Clin. N. Am. 29, 35–49 (2020).
4. Rosenthal J et al. Building tools for machine learning and artificial intelligence in cancer research: best practices and a case study with the PathML toolkit for computational pathology. Mol. Cancer Res. 20, 202–206 (2022).
5. Ström P et al. Artificial intelligence for diagnosis and grading of prostate cancer in biopsies: a population-based, diagnostic study. Lancet Oncol. 21, 222–232 (2020).
6. Fu Y et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).
7. Noorbakhsh J et al. Deep learning-based cross-classifications reveal conserved spatial behaviors within tumor histological images. Nat. Commun. 11, 6367 (2020).
8. Kim J et al. Unsupervised discovery of tissue architecture in multiplexed imaging. Nat. Methods 19, 1653–1661 (2022).
9. Coudray N et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
10. Swiderska-Chadaj Z et al. Learning to detect lymphocytes in immunohistochemistry with deep learning. Med. Image Anal. 58, 101547 (2019).
11. Yu K-H et al. Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks. J. Am. Med. Inform. Assoc. 27, 757–769 (2020).
12. Couture HD et al. Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype. NPJ Breast Cancer 4, 30 (2018).
13. Campanella G et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).
14. Xu H et al. Spatial heterogeneity and organization of tumor mutation burden with immune infiltrates within tumors based on whole slide images correlated with patient survival in bladder cancer. J. Pathol. Inform. 13, 100105 (2022).
15. Kather JN et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Nat. Cancer 1, 789–799 (2020).
16. Qu H et al. Genetic mutation and biological pathway prediction based on whole slide images in breast carcinoma using deep learning. NPJ Precis. Oncol. 5, 87 (2021).
17. Schaumberg AJ, Rubin MA & Fuchs TJ. H&E-stained whole slide image deep learning predicts SPOP mutation state in prostate cancer. Preprint at bioRxiv 10.1101/064279 (2018).
18. Tsou P & Wu C-J. Mapping driver mutations to histopathological subtypes in papillary thyroid carcinoma: applying a deep convolutional neural network. J. Clin. Med. 8, 1675 (2019).
19. Chang P et al. Deep-learning convolutional neural networks accurately classify genetic mutations in gliomas. AJNR Am. J. Neuroradiol. 39, 1201–1207 (2018).
20. Kather JN et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
21. Kim RH et al. A deep learning approach for rapid mutational screening in melanoma. Preprint at bioRxiv 10.1101/610311 (2019).
22. Chen M et al. Classification and mutation prediction based on histopathology H&E images in liver cancer using deep learning. NPJ Precis. Oncol. 4, 14 (2020).
23. Ghaffari Laleh N et al. Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology. Med. Image Anal. 79, 102474 (2022).
24. Mobadersany P et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl Acad. Sci. USA 115, E2970–E2979 (2018).
25. Cheng J et al. Integrative analysis of histopathological images and genomic data predicts clear cell renal cell carcinoma prognosis. Cancer Res. 77, e91–e100 (2017).
26. Beck AH et al. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci. Transl. Med. 3, 108ra113 (2011).
27. Bulten W et al. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. Lancet Oncol. 21, 233–241 (2020).
28. Courtiol P et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25, 1519–1525 (2019).
29. Boehm KM et al. Multimodal data integration using machine learning improves risk stratification of high-grade serous ovarian cancer. Nat. Cancer 3, 723–733 (2022).
30. Chen Y et al. Computational pathology improves risk stratification of a multi-gene assay for early stage ER+ breast cancer. NPJ Breast Cancer 9, 40 (2023).
31. Zheng H, Momeni A, Cedoz P-L, Vogel H & Gevaert O. Whole slide images reflect DNA methylation patterns of human tumors. NPJ Genom. Med. 5, 11 (2020).
32. Wang H et al. Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features. J. Med. Imaging (Bellingham) 1, 034003 (2014).
33. Turkki R, Linder N, Kovanen PE, Pellinen T & Lundin J. Antibody-supervised deep learning for quantification of tumor-infiltrating immune cells in hematoxylin and eosin stained breast cancer samples. J. Pathol. Inform. 7, 38 (2016).
34. Saltz J et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Rep. 23, 181–193 (2018).
35. Page DB et al. Spatial analyses of immune cell infiltration in cancer: current methods and future directions: a report of the International Immuno-Oncology Biomarker Working Group on Breast Cancer. J. Pathol. 260, 514–532 (2023).
36. Wang X et al. Spatial interplay patterns of cancer nuclei and tumor-infiltrating lymphocytes (TILs) predict clinical benefit for immune checkpoint inhibitors. Sci. Adv. 8, eabn3966 (2022).
37. Hu J et al. Using deep learning to predict anti-PD-1 response in melanoma and lung cancer patients from histopathology images. Transl. Oncol. 14, 100921 (2021).
38. Johannet P et al. Using machine learning algorithms to predict immunotherapy response in patients with advanced melanoma. Clin. Cancer Res. 27, 131–140 (2021).
39. Zhang F et al. Predicting treatment response to neoadjuvant chemoradiotherapy in local advanced rectal cancer by biopsy digital pathology image features. Clin. Transl. Med. 10, e110 (2020).
40. Honomichl N. HER2 and trastuzumab treatment response H&E slides with tumor ROI annotations (HER2 tumor ROIs). The Cancer Imaging Archive 10.7937/E65C-AM96 (2022).
- 41.He B et al. Integrating spatial gene expression and breast tumour morphology via deep learning. Nat. Biomed. Eng 4, 827–834 (2020). [DOI] [PubMed] [Google Scholar]
- 42.Wang Y et al. Predicting molecular phenotypes from histopathology images: a transcriptome-wide expression–morphology analysis in breast cancer. Cancer Res. 81, 5115–5126 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Monjo T, Koido M, Nagasawa S, Suzuki Y & Kamatani Y Efficient prediction of a spatial transcriptomics profile better characterizes breast cancer tissue sections without costly experimentation. Sci. Rep 12, 4133 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Schmauch B et al. A deep learning model to predict RNA-Seq expression of tumours from whole slide images. Nat. Commun 11, 3877 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Levy-Jurgenson A, Tekpli X, Kristensen VN & Yakhini Z Spatial transcriptomics inferred from pathology whole-slide images links tumor heterogeneity to survival in breast and lung cancer. Sci. Rep 10, 18802 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Alsaafin A, Safarpoor A, Sikaroudi M, Hipp JD & Tizhoosh HR Learning to predict RNA sequence expressions from whole slide images with applications for search and classification. Commun. Biol 6, 304 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Dinstag G et al. Clinically oriented prediction of patient response to targeted and immunotherapies from the tumor transcriptome. Med 4, 15–30 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Sammut S-J et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 601, 623–629 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Zheng Y et al. Digital profiling of cancer transcriptomes from histology images with grouped vision attention. Preprint at bioRxiv 10.1101/2023.09.28.560068 (2023). [DOI] [Google Scholar]
- 50.Hanahan D & Weinberg RA Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011). [DOI] [PubMed] [Google Scholar]
- 51.Iorio F et al. Pathway-based dissection of the genomic heterogeneity of cancer hallmarks’ acquisition with SLAPenrich. Sci. Rep 8, 6713 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Subramanian A et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Richardsen E et al. Evaluation of the proliferation marker Ki-67 in a large prostatectomy cohort. PLoS ONE 12, e0186852 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Whitfield ML, George LK, Grant GD & Perou CM Common markers of proliferation. Nat. Rev. Cancer 6, 99–106 (2006). [DOI] [PubMed] [Google Scholar]
- 55.Liberzon A et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Farahmand S et al. Deep learning trained on hematoxylin and eosin tumor region of interest predicts HER2 status and trastuzumab treatment response in HER2+ breast cancer. Mod. Pathol 35, 44–51 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Strauss J et al. Bintrafusp alfa, a bifunctional fusion protein targeting TGF-β and PD-L1, in patients with human papillomavirus-associated malignancies. J. Immunother. Cancer 8, e001395 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Kanopoulos N, Vasanthavada N & Baker RL Design of an image edge detection filter using the Sobel operator. IEEE J. Solid-State Circuits 23, 358–367 (1988). [Google Scholar]
- 59.Macenko M et al. A method for normalizing histology slides for quantitative analysis. in 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro 1107–1110 (IEEE, 2009); 10.1109/ISBI.2009.5193250 [DOI] [Google Scholar]
- 60.He K, Zhang X, Ren S & Sun J Deep residual learning for image recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016); 10.1109/cvpr.2016.90 [DOI] [Google Scholar]
- 61.Hoang D-T et al. DeepPT: a deep learning model for predicting transcriptomics from histopathology images. Zenodo 10.5281/zenodo.11125591 (2024). [DOI] [Google Scholar]
Data Availability Statement
TCGA histological images and their corresponding gene expression profiles were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov). The TransNEO-breast dataset is available at https://ega-archive.org/studies/EGAS00001004582. The NCI-brain and bintrafusp alfa cohorts were obtained from the NCI. The trastuzumab1 cohort is a subset of the TransNEO-breast dataset. Trastuzumab2 cohort data were downloaded from The Cancer Imaging Archive database (https://www.cancerimagingarchive.net). The ALKi dataset was made available by the University of Colorado, and the PARPi dataset was made available by Sheba Medical Center. Restrictions apply to the distribution of these data, which were used under license for the current study. Access to the ALKi dataset may be requested from S.P. (sharon.pine@cuanschutz.edu). Access to the PARPi dataset may be requested from T.G. (Talia.Golan@sheba.health.gov.il). All DeepPT-predicted expression and relevant response data, along with the code to calculate performance measures, are available via GitHub at https://github.com/PangeaResearch/enlight-deeppt-data. Source data are provided with this paper. All other data supporting the findings of this study are available from the corresponding authors upon reasonable request.