Highlights
• Supervision Bias Correction: The proposed method, KD (Knowledge Distillation), effectively addresses the supervision bias inherent in censored data for survival prediction. By distilling knowledge from uncensored data, KD rectifies inaccurate hazards and enhances the accuracy of survival predictions.
• Enhanced Prediction Accuracy: KD leverages both rectified censored data and uncensored data, leading to improved survival prediction accuracy. This approach not only harnesses the power of censored data but also aligns better with clinical reality, making it a valuable tool for survival analysis.
• Superior Performance: The application of the KD method to target cancer sites using The Cancer Genome Atlas (TCGA) dataset consistently outperforms traditional machine learning and deep learning-based methods. This superiority is observed across both target cancer sites and independent cancer cohorts.
• Clinical Relevance and Hidden Information: The KD method reveals hidden information from censored data, resulting in conclusions that align more closely with clinical knowledge and the true clinical scenario. This underscores the method’s ability to effectively utilize censored data and emphasizes its substantial value in both cancer research and clinical decision-making.
• Open Access Resources: All data and codes related to the study are freely accessible, providing transparency and facilitating further research. They can be accessed at: https://datatellstruth.github.io/.
Keywords: Censored data, Machine learning, Survival analysis, Knowledge distillation, Knowledge abduction
Abstract
Survival analysis is a critical tool for cancer research, yet handling censored data remains challenging due to supervision bias and inaccurate hazard estimates. To address these issues, we propose a simple but effective method termed KD, which employs knowledge distillation using uncensored data to rectify the supervision bias in censored data. This approach leverages the combined power of both rectified censored data and uncensored data to improve survival prediction accuracy. Remarkably, our KD method not only effectively harnesses censored data but also better reflects clinical reality, demonstrating its immense value in survival analysis. We applied our KD method to 19 target cancer sites using The Cancer Genome Atlas (TCGA) dataset. Our results consistently outperform traditional machine learning and deep learning-based methods across both target cancer sites and independent cancer cohorts. More importantly, our data-driven approach enables the model to extract hidden information from censored data, leading to conclusions that align more closely with clinical knowledge and scenarios. This validation of our KD method’s effectiveness highlights the substantial value of rational censored data usage, providing valuable insights for cancer research and clinical decisions. All data and codes are freely available at: https://datatellstruth.github.io/.
1. Introduction
Survival analysis [1], [2] plays a pivotal role in cancer research and clinical decision-making, providing valuable insights into patients’ prognoses and treatment outcomes. By analyzing time-to-event data, such as the time until recurrence or death, survival analysis enables researchers and clinicians to understand the underlying dynamics of cancer progression and assess the effectiveness of therapeutic interventions. However, the accurate analysis of survival data is challenging, especially in the presence of censored data, where the exact event time is unknown for some individuals. If not properly addressed, the impact of censored data on survival analysis can lead to biased estimations and unreliable predictions [3], [4], [5], [6], potentially causing significant harm to patients by influencing treatment decisions based on flawed conclusions. Therefore, it is of utmost importance to develop robust and accurate methods that can handle censored data effectively, ensuring the reliability and validity of survival analysis in cancer research.
In the pursuit of handling censored data in survival analysis, previous methods have explored various approaches, mainly focusing on four aspects [6], including the complete-data analysis methods, the imputation methods, dichotomizing data, and likelihood-based methods. The first category comprises complete-data analysis methods, which involve the direct removal of censored data. However, this approach leads to a significant loss of valuable information and can result in decreased prediction accuracy. Another approach is the utilization of imputation methods, which rely on specific model assumptions. Unfortunately, such assumptions may not be suitable for censored data, leading to potential issues of underestimation or overestimation. Dichotomizing data is another strategy employed in previous methods, where the event occurrence is compared against non-occurrence within a fixed time period, while disregarding the actual survival time. Nevertheless, this method works under strict assumptions, such as low censoring rates and long risk periods [7]. Lastly, likelihood-based methods, particularly the Accelerated Failure Time (AFT) model [8], have shown promise in addressing censoring problems effectively. However, these methods often require assumptions about specific models or the underlying censoring mechanism. Despite the progress made by existing methods, one critical challenge that remains insufficiently addressed is the presence of supervision bias in censored data and its impact on the performance of survival analysis methods. The supervision bias poses inherent challenges to machine learning approaches, which may not fully capture the complexities of censored data and could lead to biased estimations. Therefore, there is a pressing need to develop an effective method that can mitigate the supervision bias and leverage censored data to improve survival prediction accuracy while ensuring robustness and generalizability in cancer research.
To overcome these challenges, we present a deep learning method, KD (Knowledge Distillation), which offers a data-driven approach to effectively rectify the supervision bias inherent in censored data, thereby enhancing the accuracy of genomic survival analysis. Our approach capitalizes on the synergistic combination of original uncensored data and rectified censored data. The core principle of our KD method lies in knowledge distillation, where we distill accurate survival hazards, free from supervision bias, directly from the uncensored data itself into a teacher model. This teacher model serves as a repository of unbiased knowledge, encapsulating the underlying distribution and true survival hazards present in the data. Subsequently, we facilitate knowledge transfer by allowing a student model to learn from the teacher model’s outputs. By doing so, the student model gains the ability to capture essential characteristics present in the uncensored data, equipping it with generalization capabilities crucial for accurate survival analysis. Through this distillation process, the student model becomes adept at rectifying the supervision bias in censored data. By effectively leveraging the knowledge acquired from the teacher model, the student model can compensate for the information gaps created by censored data, resulting in improved predictions and a more comprehensive understanding of survival patterns in cancer research. Our approach not only harnesses the potential of censored data but also enhances the overall performance of the survival analysis model by integrating it with uncensored data. The workflow of our KD method is presented in Fig. 1a.
Fig. 1.
Workflow of our proposed method for survival analysis. a, Overall framework of the KD (Knowledge Distillation) method. b, The distillation loss used in the KD method, defined between the hazard predicted by the student model and the unbiased hazard returned by the teacher model. c, Pipeline of the knowledge abduction process.
In the following sections, we present a comprehensive evaluation of our KD method on a diverse set of 19 cancer sites using The Cancer Genome Atlas (TCGA) dataset. We compare our method with traditional machine learning and deep learning-based approaches, demonstrating its superiority in predicting survival outcomes. Furthermore, we provide an in-depth analysis of the hidden information discovered through our KD method, showing its alignment with clinical knowledge and its ability to reflect the true clinical scenario effectively. By addressing the challenges associated with censored data and highlighting the value of rational censored data usage, our KD method offers promising insights for cancer research and clinical decision-making.
2. Results
In this section, we present the findings and outcomes of our study, highlighting the impact of our KD method on survival analysis, particularly in handling censored data. We provide a comprehensive analysis of the results obtained through rigorous experimentation and statistical evaluation.
2.1. Enhanced survival analysis prediction accuracy through effective utilization of censored data
We demonstrate a significant improvement in survival analysis prediction accuracy by effectively incorporating censored data with our KD method. Specifically, we present quantitative measures and evaluation metrics to support the efficacy of our method in distilling unbiased knowledge from uncensored data and rectifying the supervision bias of censored data.
For comparison, we investigated six methods from two groups: the first two are traditional machine learning approaches, and the remaining four are deep learning-based approaches:
• AFT model [8]: The accelerated failure time (AFT) model is a parametric model that provides an alternative to the commonly used proportional hazards models [9].
• Survival trees [10]: Survival trees are a tree-structured algorithm that uses log-rank scores to maximize survival differences and applies them as the criterion for splitting tree nodes.
• Cox-Net [11]: Cox-Net is built upon an artificial neural network framework to predict patient prognosis from high-throughput transcriptomics data.
• DeepSurv [12]: DeepSurv is a feed-forward neural network method based on the Cox proportional hazards model, which is utilized to model non-linear relationships between risk factors and survival time.
• DeepHit [13]: The DeepHit model was originally designed for analyzing the competing risks of multiple events. In survival analysis, DeepHit is simplified to consider a single event, i.e., patient survival.
• Meta-learning [14]: The meta-learning approach for genomic survival analysis is built upon a meta-learning paradigm for handling a new task with few samples.
We evaluate survival prediction performance with the most widely used evaluation metric, i.e., the concordance index (C-index) [15]. Concretely, the C-index is the number of comparable pairs of subjects whose predicted survival times are correctly ordered, divided by the total number of comparable pairs. A pair is not comparable, and is therefore excluded, when the subject with the earlier observed time is censored, because the true order of the two event times is then unknown. A perfect prediction results in a C-index value of 1.0, while a value of 0.5 indicates random prediction. The hyper-parameters of these methods are selected by cross-validation on the training set.
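As an illustration of the pair-counting definition, the following is a minimal sketch of a Harrell-style C-index in plain Python. It is not the evaluation code used in the study: the function name is hypothetical, the model output is assumed to be a risk score (higher risk meaning shorter expected survival), ties in predicted risk are assumed to count as half-concordant, and pairs with tied observed times are ignored.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell-style C-index for right-censored data (illustrative sketch).

    times  -- observed time (event time or censoring time) per subject
    events -- 1 if the event was observed, 0 if the subject was censored
    risks  -- predicted risk; a higher risk should mean shorter survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a pair,
            # because its true event time is unknown.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # correctly ordered pair
                elif risks[i] == risks[j]:
                    concordant += 0.5   # assumption: tied risks count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the second subject is censored at time 4.
print(concordance_index([2, 4, 6, 8], [1, 0, 1, 1], [0.9, 0.1, 0.5, 0.7]))  # 0.75
```

In the toy call, four pairs are comparable and three of them are ordered correctly, which gives 0.75.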
We select 19 target cancer sites from TCGA, namely ACC, CESC, COAD, GBM, HNSC, KIRC, LIHC, LUAD, LUSC, PAAD, PCPG, PRAD, SARC, SKCM, STAD, TGCT, THCA, THYM, and UCEC, based on the sample size of the cohorts, the proportion of censored data, and clinical interest. For fair comparison, the deep learning-based survival prediction methods, i.e., Cox-Net, DeepSurv, DeepHit, and Meta-learning, are restricted to the same network architecture as our proposed method; this restriction does not apply to the AFT model and survival trees. The technical details of our KD method are described in the Supplementary Materials. For each cancer site, we conducted training and testing specifically for that individual cancer site. All methods are evaluated on the same testing data of the target task, and we conduct 25 experiment trials for each method and report the averaged values (with the variance).
As shown in Fig. 2a, our KD method significantly outperforms the comparison methods in terms of C-index across all 19 target cancer sites. Specifically, our method achieved the highest C-index values compared with both the traditional machine learning approaches (the AFT model and survival trees) and the deep learning-based methods (Cox-Net, DeepSurv, DeepHit, and Meta-learning). Over the 19 cancer sites, our KD method outperforms the state-of-the-art Meta-learning method [14] by 6.1% C-index, which is a significant improvement in survival analysis. Furthermore, across the 25 random trials conducted for each method, KD exhibited lower variance than the other evaluated methods in most cases, indicating its robustness and stability in predicting survival outcomes across different cancer sites. These findings highlight the effectiveness of our KD method in enhancing the prediction accuracy of survival analysis compared to the existing approaches. The comparison results for each individual cancer site among the 19 cancer sites are reported in the Supplementary Materials, cf. Fig. S1.
Fig. 2.
Comparisons of C-index with 95% confidence intervals (25 trials) for survival prediction.
To further validate the effectiveness and generalization ability of our KD method, we selected additional independent cancer cohorts apart from TCGA for C-index prediction comparisons. The details of the independent cancer cohorts are described in the Supplementary Materials. We use the same models trained with TCGA data for these cancer sites and report the average results across 6 target cancer sites in Fig. 2b. Compared to other deep learning approaches, our KD method achieved superior averaged prediction results on these independent cohorts (over 2.5% C-index improvement). Importantly, our method consistently demonstrated the lowest variance among all the evaluated methods, which indicates its stability and robustness in predicting survival outcomes across different datasets and cancer cohorts. Detailed comparison results of each individual cancer site among the independent cancer cohorts are reported in Fig. S2 of the Supplementary Materials. These results highlight the strong generalization ability of our KD method and further support its effectiveness and reliability in enhancing the significance and applicability of survival analysis findings.
2.2. Variable impact of human knowledge on survival analysis
In the field of cancer studies, prior evidence in the form of research conclusions (also known as human knowledge) exists, specifically relating to certain genes implicated in causing cancer. In order to incorporate this prior evidence into survival analysis, we introduce the KDKA (Knowledge Distillation and Knowledge Abduction) method. Unlike the KD method, KDKA operates at two levels of knowledge: knowledge distillation, which extracts unbiased knowledge from raw uncensored data in the same way as KD, and knowledge abduction, which integrates human knowledge in the literature to rectify the supervision bias of censored data.
Concretely, KDKA finds censored data that is inconsistent with prior evidence, and then uses logical abduction to perform minimal inconsistent revisions [16] for the supervision bias rectification. Intuitively, knowledge abduction and knowledge distillation together couple human knowledge (i.e., prior evidence in the literature) with data knowledge (i.e., distilled unbiased information from uncensored data), so that the rectified censored data can be used in a reasonable way alongside the uncensored data for survival analysis prediction. Beyond KD, the workflow of the knowledge abduction process is illustrated in Fig. 1c.
To evaluate the impact of human knowledge on survival analysis, we also conducted experiments on 19 target cancer sites and compared them with the results of KD. The comparisons are presented in Fig. 3, and the detailed C-index results of these two methods are presented in Table S5 of the Supplementary Materials. Our findings reveal that the incorporation of human knowledge yields varying effects on the analysis. While in some cases it enhances the accuracy and significance of the results, in others, the integration of prior evidence does not lead to improved performance.
Fig. 3.
Difference in C-index of KDKA (both data-driven and incorporating human knowledge) and KD (only data-driven) methods on 19 target cancer sites. The red bars in the bar chart indicate that incorporating human knowledge reduces the prediction accuracy in terms of C-index. On the other hand, the blue bars represent an improvement in C-index when human knowledge is incorporated. The majority of improvements and decreases are within a range of 5%. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The observed variations in the impact of human knowledge on survival analysis underscore the complex nature of incorporating prior evidence into data-driven algorithms. The differing effects across cancer sites suggest that the relevance and quality of the existing research conclusions about specific genes causing cancer may vary significantly from one context to another. In some cases, the integration of human knowledge might align well with the underlying data distribution, leading to improved prediction accuracy and meaningful insights. This alignment may be due to the presence of well-established gene-cancer associations supported by substantial evidence.
On the other hand, instances where the incorporation of prior evidence did not result in improved performance indicate the need for cautious evaluation of the relevance and reliability of existing research conclusions. It is crucial to recognize that the field of cancer research is dynamic and ever-evolving, with ongoing discoveries and revisions in gene-cancer associations. As such, not all research conclusions may accurately represent the true underlying relationships between genes and cancer survival.
Furthermore, regarding the differences in performance between KDKA and KD, our findings suggest that in cases of cancers with favorable prognosis or low morbidity (rare), the C-index of the KDKA method did not outperform that of the KD method, as illustrated in Fig. 3 and Table S6 of the Supplementary Materials. Only in instances involving cancers with poor prognosis and high morbidity did the C-index of KDKA show potential for improvement compared to KD. However, it is worth noting that this improvement did not appear to be statistically significant.
These observations highlight the importance of considering the variable impact of human knowledge on survival analysis. They also suggest that, while data-driven approaches provide a reliable foundation, human understanding of cancer is an ongoing and evolving process. It is crucial to recognize that existing research conclusions may not always represent the ultimate truth. Therefore, we should place greater emphasis on the conclusions derived from the original data source: “data tells the truth”. Respecting the insights obtained directly from the data helps to ensure a more robust and reliable analysis, independent of potential biases or limitations associated with prior knowledge.
2.3. Attaining greater significance of findings through multi-cancer collaborative training
In this section, we examine whether multi-cancer collaborative training with our KD method yields findings of greater significance. We trained a model on data from 18 out of the 19 target cancer sites and utilized this model to perform survival analysis predictions on the remaining target cancer site. To assess the performance and generalization ability of our method, we conducted Spearman correlation analyses between the predicted survival risks and clinical factors, including age, cancer stage, and gender, for the target cancer site. Additionally, we computed the correlations between the survival times and the same clinical factors in the original dataset for the same target cancer site. This comparison enabled us to evaluate the generalization ability of our KD method across different cancer sites and verify its ability to accurately reflect clinical situations after handling censored data.
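The protocol described here can be sketched in two small pieces: a leave-one-site-out split and a Spearman rank correlation. The code below is an illustrative sketch in plain Python, not the study's pipeline; the function names are hypothetical, and it returns only the correlation coefficient. Obtaining the p-values reported in the paper would require a statistics library, for example `scipy.stats.spearmanr`.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_site_out(sites: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training sites, held-out site) for every cancer site."""
    for held_out in sites:
        yield [s for s in sites if s != held_out], held_out


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical usage: three sites, and toy risk/age values for one held-out site.
for train_sites, test_site in leave_one_site_out(["ACC", "GBM", "LUAD"]):
    print(test_site, "<-", train_sites)
print(spearman_rho([0.1, 0.4, 0.35, 0.8], [40, 55, 60, 72]))
```

For each held-out site, the predicted risks would be correlated with each clinical factor, and the same correlation would be computed for the original survival times to obtain the comparison described above.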
Specifically, we present the p-values of the Spearman correlations in Table 1. We further evaluated the significance of the correlations presented in the table by seeking expert opinions from authoritative medical professionals in the field. They assessed the associations, each classified as either significant (p-value < 0.05) or non-significant (p-value > 0.05). The results revealed that out of the 46 correlation judgments, our method achieved an accuracy of , in correctly identifying significant correlations. In contrast, the accuracy of the original survival time data in determining significant correlations was only . These findings not only demonstrate that survival time in the original data is biased by censored data, which affects the clinical significance judgments based on survival time, but also highlight the effectiveness of our method in overcoming the supervision bias caused by censored data. Consequently, our method enables more accurate and clinically relevant assessments of significance. This further confirms the essence of our work, emphasizing the fundamental notion that “data tells the truth”.
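The accuracy discussed here can be read as the fraction of correlation judgments for which the p-value-based call (significant if below 0.05) agrees with the expert assessment. The following is a minimal sketch under that reading; the function name and the toy values are hypothetical and are not taken from Table 1.

```python
from typing import Sequence


def significance_accuracy(p_values: Sequence[float],
                          expert_significant: Sequence[bool],
                          alpha: float = 0.05) -> float:
    """Fraction of judgments where (p < alpha) matches the expert label."""
    if len(p_values) != len(expert_significant) or not p_values:
        raise ValueError("inputs must be non-empty and of equal length")
    calls = [p < alpha for p in p_values]
    agree = sum(1 for c, e in zip(calls, expert_significant) if c == e)
    return agree / len(calls)


# Toy example (hypothetical numbers): three of the four calls match the experts.
print(significance_accuracy([0.01, 0.20, 0.04, 0.50],
                            [True, False, False, False]))  # 0.75
```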
Table 1.
Correlation analysis results (p-value of Spearman correlation) of predicted survival risks / survival time vs. clinical factors in multi-cancer collaborative training. Note that a p-value less than 0.05 indicates a significant correlation, while values greater than 0.05 indicate a non-significant correlation. Green cells highlight correlations aligned with clinical observations, while red markers indicate non-conforming correlations.
2.4. Lack of gender significant association with cancer survival risk
Our study also investigated sex as a risk factor in cancer and suggests that there is no significant link between gender and cancer survival risk. In detail, by analyzing a large dataset comprising diverse cancer cases, we provide a comprehensive evaluation of gender’s impact on survival outcomes. Contrary to common assumptions or previous research, our findings revealed a lack of significant gender association with cancer survival risk across the 19 target cancer sites, cf. the “prediction vs. gender” column of Table 1. As a result, in the context of the studied cancer types, gender alone may not be a direct factor of survival risk. This finding does not challenge previous notions that suggested a gender disparity in cancer survival rates; rather, gender-based disparities may exist in other aspects of cancer, such as incidence rates or treatment response. Our findings contribute to a more comprehensive understanding of cancer survival and encourage further investigation into the multifaceted factors that influence disease outcomes. However, our findings suggest only the lack of a unilateral association between gender and survival risk. Future work is needed to reach a more nuanced understanding of the factors influencing cancer outcomes, which highlights the importance of considering individual characteristics and tumor biology beyond gender. By elucidating the complex interplay between gender and cancer survival, we can refine risk stratification models and develop tailored interventions that address the specific needs of diverse patient populations.
2.5. Enrichment of cancer-associated pathways in genes identified by our method
Our method elucidates the mechanisms underlying tumor progression by calculating the risk score which reflects the hazard ratios of genes. A positive risk score with a high absolute value suggests that the associated gene significantly increases the likelihood of poor survival prediction (high risk), implying its potential role in promoting cancer progression. Conversely, a negative risk score with a high absolute value suggests that the associated gene significantly increases the likelihood of good survival prediction (low risk), indicating its potential role in inhibiting cancer progression.
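To make the sign convention concrete, the sketch below orders genes by their signed risk score and separates strongly positive (high-risk) from strongly negative (low-risk) genes. A ranked list of this kind is what enrichment tools typically consume. The gene names, scores, and threshold are hypothetical placeholders, and how the risk scores themselves are computed is not shown here.

```python
from typing import Dict, List, Tuple


def rank_genes_by_risk(risk_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order genes from the most positive to the most negative risk score."""
    return sorted(risk_scores.items(), key=lambda item: item[1], reverse=True)


def split_by_direction(risk_scores: Dict[str, float],
                       threshold: float) -> Tuple[List[str], List[str]]:
    """Return (high-risk genes, low-risk genes) whose |score| >= threshold.

    High-risk genes have large positive scores (associated with poor
    predicted survival); low-risk genes have large negative scores.
    """
    ranked = rank_genes_by_risk(risk_scores)
    high_risk = [gene for gene, score in ranked if score >= threshold]
    low_risk = [gene for gene, score in reversed(ranked) if score <= -threshold]
    return high_risk, low_risk


# Hypothetical scores for four placeholder genes.
scores = {"GENE_A": 1.8, "GENE_B": -2.1, "GENE_C": 0.1, "GENE_D": -0.4}
print(rank_genes_by_risk(scores))
print(split_by_direction(scores, threshold=1.0))
```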
We investigate the pathways associated with cancer development by applying the risk scores and their corresponding genes in gene set enrichment analysis (GSEA) for each cancer site. The results of GSEA are presented in Fig. 4 and Fig. S3 of the Supplementary Materials, and the raw data of risk scores is attached as an additional file in the manuscript tracking system. Our method reveals that certain signaling pathways may exhibit divergent functions in different cancer types. In ACC and PAAD, high-risk genes are associated with the cell cycle signaling pathway. Conversely, in CESC, TGCT, THCA, and THYM, low-risk genes are associated with the cell cycle signaling pathway. Additionally, in COAD, LUAD, PAAD, THYM and UCEC, high-risk genes are associated with the ribosome signaling pathway. Conversely, in GBM, PRAD and THCA, the low-risk genes are associated with the ribosome signaling pathway. Signaling pathways exhibit divergent associations with high- and low-risk genes across various cancer types, suggesting the opposite functional roles of these pathways in cancer progression.
Fig. 4.
Gene set enrichment analysis (GSEA) of the KD genes in 6 cancer types. The gene set databases included Kyoto Encyclopedia of Genes and Genomes (KEGG) [17], Reactome [18] and WikiPathways [19]. “NES” denotes the normalized enrichment score. GSEA of the KD genes in other cancer types can be found in Fig. S3 of the Supplementary Materials.
Furthermore, we observed associations between immune signaling pathways and cancer progression in specific cancer types. In PCPG and TGCT, high-risk genes are associated with the IL-10 signaling pathway and the IL-4/IL-13 signaling pathway. In CESC, low-risk genes are associated with the T cell receptor signaling pathway. In COAD, low-risk genes are associated with the IL-17 signaling pathway. In SKCM, low-risk genes are associated with the IL-3 signaling pathway. In LIHC and SKCM, low-risk genes are associated with the PD-1 pathway. Immune signaling pathways exhibit diverse associations with high- and low-risk genes in different cancer sites, suggesting their pivotal roles in cancer progression.
3. Discussion
The main contribution of this work lies in the development of an effective Knowledge Distillation (KD) method for survival analysis, which rectifies the supervision bias of censored data and then leverages it with uncensored data to significantly improve prediction accuracy, while also exploring the variable impact of human knowledge and providing insights into cancer-associated pathways. All methods, datasets, and results are publicly available (https://datatellstruth.github.io/), which we hope will be a useful starting point for further extensions and a benchmark to evaluate future approaches in a comparable manner.
Censored data is inevitable in survival analysis, and in most cases, it accounts for a large proportion (e.g., in our experiments, the ratio of censored data is over 70%). Following the data-driven process, our proposed KD method utilizes knowledge distillation to distill unbiased knowledge from uncensored data itself to rectify the hazard values of censored data caused by supervision bias. As validated by experimental results, the rectified censored data would significantly benefit survival predictions when compared with state-of-the-art methods, cf. Fig. 2.
Specifically, knowledge distillation is the process of transferring the knowledge acquired by a teacher model to a student model [20]. The mechanism involves training the student model to reproduce the outputs of the teacher model for a given input. To accomplish this, the student model is trained on both the original training data and the soft targets generated by the teacher model. In this study, the soft targets consist of survival predictions derived from uncensored data. The use of soft targets allows the student model to learn from the teacher model’s outputs, which contain information about the underlying data distribution and unbiased survival hazards. This process helps the student model acquire generalizable abilities that capture the essential characteristics of the uncensored data. Our method distills unbiased survival prediction knowledge from uncensored data using the teacher model and then provides it as richer guidance to the student network. This enables the student network to effectively rectify the supervision bias of censored data. In this way, on the one hand, the impact of the supervision bias of censored data on survival prediction can be corrected, and on the other hand, the relatively large proportion of censored data can be better utilized to further improve the performance of survival prediction.
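The training signal described above, combining the original data with the teacher’s soft targets, can be sketched as a weighted sum of two terms. The following is only a minimal illustration under stated assumptions and not the loss used in the paper, whose exact form is given in the Supplementary Materials: both terms are assumed to be mean squared errors on hazard values, the supervised term is assumed to use uncensored samples only, and the weight `alpha` and all function names are hypothetical.

```python
from typing import Sequence


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared difference between two equally long sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def student_loss(student_hazard: Sequence[float],
                 observed_hazard: Sequence[float],
                 teacher_hazard: Sequence[float],
                 censored: Sequence[bool],
                 alpha: float = 0.5) -> float:
    """Weighted sum of a supervised term and a distillation term (sketch).

    Assumptions (not from the paper): MSE for both terms, the supervised
    term is restricted to uncensored samples whose labels are trusted,
    and the distillation term compares the student with the teacher's
    soft targets on every sample, including the censored ones.
    """
    uncensored = [i for i, is_censored in enumerate(censored) if not is_censored]
    supervised = 0.0
    if uncensored:
        supervised = mean_squared_error(
            [student_hazard[i] for i in uncensored],
            [observed_hazard[i] for i in uncensored],
        )
    distillation = mean_squared_error(student_hazard, teacher_hazard)
    return (1.0 - alpha) * supervised + alpha * distillation


# Toy values (hypothetical): the third sample is censored, so its observed
# label is ignored and only the teacher's soft target guides the student.
print(student_loss([0.2, 0.4, 0.6], [0.0, 0.4, 0.9], [0.2, 0.5, 0.6],
                   [False, False, True], alpha=0.5))
```

With this split, the biased label of a censored sample never enters the loss directly; the sample still contributes through the teacher’s output, which is one simple way to express the rectification idea.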
Beyond that, our findings reveal that the incorporation of human knowledge yields varying effects on the analysis, cf. Fig. 3. While in some cases it enhances the accuracy and significance of the results, in others, the integration of prior evidence does not lead to improved performance. These varying effects might be related to the clinical prognosis of a cancer and its morbidity. Currently, the understanding of oncogenes, particularly in relation to rare cancers, remains nascent, and the supporting evidence is insufficient [21], [22]. Consequently, the incorporation of human knowledge may diminish the precision of the KD method in rare cancers. In addition, for cancer patients with a poor prognosis, the main cause of death is tumor-related conditions, whereas for those with a relatively better prognosis, deaths might be attributed to other accompanying diseases rather than the tumor itself [23], [24]. Thus, the integration of oncogene-related human knowledge has the potential to enhance the precision of the KD method in cancers with a poor prognosis. This motivates us to expand the available tumor data in order to better explore the genetic truth that affects the survival of cancer patients. While acknowledging the value of human knowledge, it is essential to prioritize the conclusions derived from the data itself. This approach promotes a more objective and reliable analysis of cancer survival, empowering us to uncover meaningful insights and advance our understanding of the disease.
Driven by the observed commonalities among certain types of cancers, our study aims to explore the feasibility of using multi-cancer collaborative training models [25]. Specifically, we aim to investigate whether a model developed for one specific type of cancer can be utilized to predict patient survival in other types of cancer. To assess this, we employ data from all 19 different types of cancers as both training and test datasets. Our results indicate that multi-cancer collaborative training models demonstrate effectiveness in predicting outcomes for diverse types of cancers, cf. Table 1. Consequently, we speculate that multi-cancer collaborative training models can be harnessed in future research to construct more resilient analyses, thereby facilitating the development of risk stratification through the utilization of patient profiles and cancer similarities.
Furthermore, we compared the correlations obtained from our predictions with the survival times in the original dataset for the same target cancer site, cf. Table 1. This analysis allowed us to evaluate the performance of our method in handling censored data and its ability to provide accurate survival predictions that align with the clinical reality. By comparing the correlations between the predicted risks and the clinical factors with those observed in the original dataset, including censoring information, we sought to verify whether our method effectively incorporated the information from censored data and provided reliable predictions that reflected the actual clinical outcomes. Through these correlations and comparisons, we aimed to validate the effectiveness of our KD method in both generalizing across different cancer sites and accurately handling censored data. These evaluations are crucial for establishing the significance and clinical relevance of our method in genomic survival analysis.
Our study also highlights the contentious relationships between signaling pathways and high- or low-risk genes observed across various cancer types, emphasizing the context-specific character of these pathways, cf. Fig. 4. This variability could be attributed to distinct molecular modifications, the tumor microenvironment, or interactions with other specific cellular processes within each cancer type. When unraveling the intricate interplay between signaling pathways and cancer progression, it becomes imperative to account for these distinctions, as they carry significant implications for the design of targeted therapeutic strategies. Furthermore, our findings underscore the significant impact of various immune signaling pathways on survival outcomes across diverse cancer types, highlighting the pivotal role of the immune system in tumor development. Specifically, our results unveil a pro-tumor role associated with IL-4, IL-10, and IL-13. This aligns with prior research indicating that inhibiting IL-4 can expand the population of tumor-infiltrating effector T cells and reduce tumor burden [26], thus substantiating the credibility of our discoveries. Conversely, our analysis suggests that IL-3 and IL-7 may exert anti-tumor effects in cancer progression. Previous studies have underscored the crucial role of IL-7 in T cell development [[27], [28]], implying its inhibitory potential in cancer progression. Finally, our results reveal an anti-tumor effect associated with the PD-1 signaling pathway in LIHC and SKCM, suggesting the prospective utility of PD-1/PD-L1 inhibitors in treating these cancers. These revelations provide valuable insights into the intricate interplay between immune signaling pathways and cancer advancement, offering promising avenues for the development of targeted immunotherapeutic strategies tailored to specific cancer types.
Regarding censored data, we also found that progressively increasing the amount of censored data gradually improves the accuracy and generalization of survival predictions. Fig. 5 shows the impact of progressively incorporating censored data on the prediction accuracy for four specific cancer sites: CESC, LIHC, LUAD, and SKCM. As the proportion of included censored data increases, the C-index improves notably for each of these cancer sites. For CESC, the C-index increases from 0.611 to 0.781 as censored data is included. Similarly, for LIHC, the C-index improves from 0.655 to 0.788. LUAD shows an enhancement from 0.628 to 0.714, and SKCM’s C-index rises from 0.731 to 0.801. These improvements demonstrate that our KD method effectively harnesses the valuable information present in censored data, and they highlight the potential of censored data to contribute significantly to the precision of survival predictions. This finding is particularly relevant in cancer studies, where censored data is prevalent due to factors such as patients being lost to follow-up or the study’s finite duration. By demonstrating that our KD method can successfully utilize censored data, we underscore the importance of carefully considering and appropriately incorporating such data in survival analysis models: when used judiciously, censored data can provide valuable insights and substantially improve the accuracy and significance of survival predictions.
Fig. 5.
Effect of increasing censored data on the C-index of our KD method. The plot shows the C-index obtained as the amount of included censored data is progressively increased from 0% to 100%, illustrating the impact of censored data on survival prediction accuracy. The proportions of censored data at these four cancer sites are 76.7% for CESC, 64.6% for LIHC, 63.7% for LUAD, and 71.8% for SKCM.
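To make the C-index values reported for Fig. 5 easier to interpret, the following is a minimal sketch of how a concordance index can be computed on data that contains censored samples. It assumes the metric is Harrell's concordance index and that a higher predicted risk should correspond to a shorter survival time. The function name `concordance_index` and the toy inputs are illustrative assumptions and are not taken from the released code.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index (illustrative sketch, not the authors' implementation).

    times  : observed follow-up time for each patient
    events : 1 if death was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        # Only a patient with an observed event can anchor a comparable pair.
        if events[i] != 1:
            continue
        for j in range(n):
            # Patient j must be known to outlive patient i, whether j is
            # censored or not, for the pair to be comparable.
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # correctly ordered pair
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


# Toy cohort (hypothetical values): patients 1 and 3 are censored.
times = [2, 4, 6, 8]
events = [1, 0, 1, 0]
perfect = concordance_index(times, events, [0.9, 0.3, 0.6, 0.1])
one_swap = concordance_index(times, events, [0.9, 0.3, 0.1, 0.6])
print(perfect, one_swap)
```

In this toy cohort the censored patients still take part in the comparison as the longer-surviving member of a pair, which is one way in which censored samples carry usable ordering information. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.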
4. Conclusion
To conclude, our work demonstrates that rational utilization of censored data through our KD method enhances survival analysis predictions. By harnessing the inherent value of data and integrating rectified censored data with uncensored data, we achieve a more accurate reflection of clinical situations, emphasizing the core principle of “data tells the truth”. This will significantly enhance the advancement of cancer research, particularly for rare or hard-to-track cancers, thereby facilitating the provision of more comprehensive and precise clinical evidence to inform cancer treatment decision-making. These contributions hold promising implications for advancing the field of survival analysis and improving cancer treatments in the medical domain. Furthermore, our method can also be applied to tasks involving censored data in other scenarios, such as student dropout in education [29], project success in crowdfunding [30], and more.
CRediT authorship contribution statement
Xiu-Shen Wei: Conceptualization, Formal analysis, Funding acquisition, Methodology, Writing – original draft, Writing – review & editing. He-Yang Xu: Software, Validation. Ye Wu: Data curation, Writing – review & editing. Xiaoming Liu: Data curation, Writing – review & editing. Ruru Gao: Funding acquisition, Investigation, Project administration, Writing – review & editing. Jiacheng Liu: Data curation, Investigation, Validation, Writing – review & editing. Bowen Du: Data curation, Validation, Writing – review & editing.
Declaration of competing interest
The authors declare that they have no conflicts of interest in this work.
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grant (62272231), the National Key R&D Program of China (2021YFA1001100), the Natural Science Foundation of Jiangsu Province of China under Grant (BK20210340), and the Fundamental Research Funds for the Central Universities (4009002401).
Biographies
Xiu-Shen Wei (BRID: 07913.00.65116) is a professor at the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University (SEU). He has published more than sixty academic papers in top-tier international journals and conferences, such as IEEE TPAMI, NeurIPS, CVPR, ICCV, and ECCV. He has won eight world championships in authoritative international computer vision competitions. He was the Program Co-Chair of workshops in association with ICCV, IJCAI, ACM Multimedia, and ACCV, and he was the primary organizer of fine-grained visual analysis tutorials at CVPR, ICME and PRICAI. He was also selected as one of the World’s Top 2% Scientists (2024, 2023), and received the WuWenJun AI Excellent Young Scientist Award (2022), the Young Elite Scientist Sponsorship Program by CAST (2021), the Computer Federation Excellent Young Scientist Award (2021), and the Best PC Member Award at CVPR 2017. His research interests are computer vision and machine learning. He has served as a Guest Editor of Pattern Recognition Journal, a Tutorial Co-Chair of ACCV 2022, and a Senior PC member/Area Chair of CVPR, AAAI, IJCAI, ICME and BMVC.
Ruru Gao (BRID: 06915.00.93393) is an associate professor and master’s supervisor. She obtained both her bachelor’s and doctoral degrees from Tongji University. She earned her Ph.D. in 2018 and joined the School of Chemistry and Chemical Engineering at Nanjing University of Science and Technology in the same year. Her primary research focuses on molecular logic and fluorescence sensing. As the first author, she has published over ten papers in top international journals in related fields, such as ACS Nano and Chemical Science. She has led a youth project funded by the Natural Science Foundation of Jiangsu Province, and her achievements include accolades like Outstanding Graduate of Shanghai, National Scholarship for PhD Candidates, First Prize in the Micro-Lecture Competition of Jiangsu Province, Second Prize in the Young Teachers’ Lecture Competition of Jiangsu Province, First Prize in the Teaching Innovation Competition of Nanjing University of Science and Technology, and being recognized as an exemplary Party member at the university.
Jiacheng Liu, PhD, MD, Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China. H-index: 10.
Bowen Du (BRID: 08138.00.65911) received his BS and PhD degrees from the University of Science and Technology of China. He is currently a postdoctoral fellow at the Department of Urology, Jinling Hospital. His current research interests focus on understanding the biogeochemical processes of genitourinary cancer and developing bioinformatic tools for cancer genomics.
Supplementary material associated with this article can be found, in the online version, at 10.1016/j.fmre.2024.06.016
Please refer to Table S4 of the Supplementary Materials for the full names of these 19 target cancer sites.
Following the latest clinical practice guidelines for malignancies by NCCN: https://www.nccn.org/.
Contributor Information
Ruru Gao, Email: gaorr@njust.edu.cn.
Jiacheng Liu, Email: jiacheng6jc@163.com.
Bowen Du, Email: dubwac@gmail.com.
Appendix A. Supplementary materials
Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/
References
- 1. Klein J.P., Moeschberger M.L. Springer Science & Business Media; 2006. Survival Analysis: Techniques for Censored and Truncated Data.
- 2. Hosmer D.W., Lemeshow S., May S. John Wiley & Sons; 2008. Applied Survival Analysis: Regression Modeling of Time to Event Data. Wiley Series in Probability and Statistics.
- 3. Dey T., Lipsitz S.R., Cooper Z., et al. Survival analysis–time-to-event data and censoring. Nat. Methods. 2022;19:906–908. doi: 10.1038/s41592-022-01563-7.
- 4. Wang P., Li Y., Reddy C.K. Machine learning for survival analysis: A survey. ACM Comput Surv. 2019;51(110):1–36.
- 5. Dey T., Lipsitz S.R., Cooper Z., et al. Regression modeling of time-to-event data with censoring. Nat. Methods. 2022;19:1513–1515. doi: 10.1038/s41592-022-01689-8.
- 6. Turkson A.J., Ayiah-Mensah F., Nimoh V. Handling censoring and censored data in survival analysis: A standalone systematic literature review. Int. J. Math. Math. Sci. 2021;2021
- 7. Okoli C., Schabram K. A guide to conducting a systematic literature review of information system research. Sprout. 2010;10:10–26.
- 8. Wei L.-J. The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Stat Med. 1992;11(14–15):1871–1879. doi: 10.1002/sim.4780111409.
- 9. Cox D.R. Regression models and life-tables. J. R. Stat. Soc. Ser. B. 1972;34(2):187–220.
- 10. LeBlanc M., Crowley J. Survival trees by goodness of split. J. Am. Stat. Assoc. 1993;88(422):457–467.
- 11. Ching T., Zhu X., Garmire L.X. Cox-nnet: An artificial neural network method for prognosis prediction of high-throughput omics data. PLoS Comput. Biol. 2018;14(4) doi: 10.1371/journal.pcbi.1006076.
- 12. Katzman J.L., Shaham U., Cloninger A., et al. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. 2018;18(24) doi: 10.1186/s12874-018-0482-1.
- 13. Lee C., Zame W., Yoon J., et al. Proceedings of the AAAI Conference on Artificial Intelligence. 2018. DeepHit: A deep learning approach to survival analysis with competing risks; pp. 2314–2321.
- 14. Liu Y.L., Zheng H., Devos A., et al. A meta-learning approach for genomic survival analysis. Nat. Commun. 2020;11:6350. doi: 10.1038/s41467-020-20167-3.
- 15. Harrell F.E., Califf R.M., Pryor D.B., et al. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543–2546.
- 16. Zhou Z.-H. Abductive learning: Towards bridging machine learning and logical reasoning. Sci. China Inf. Sci. 2019;62:076101:1–076101:3.
- 17. Kanehisa M., Sato Y., Kawashima M., et al. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016;44(D1):D457–D462. doi: 10.1093/nar/gkv1070.
- 18. Fabregat A., Jupe S., Matthews L., et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2020;48(D1):D498–D503. doi: 10.1093/nar/gkz1031.
- 19. Pico A.R., Kelder T., van Iersel M.P., et al. WikiPathways: Pathway editing for the people. PLoS Biol. 2008;6(7) doi: 10.1371/journal.pbio.0060184.
- 20. Hinton G., Vinyals O., Dean J. NIPS Deep Learning and Representation Learning Workshop. 2014. Distilling the knowledge in a neural network; pp. 1–9.
- 21. Huntley C., Torr B., Sud A., et al. Utility of polygenic risk scores in UK cancer screening: A modelling analysis. Lancet Oncol. 2023;24(6):658–668. doi: 10.1016/S1470-2045(23)00156-0.
- 22. Pashayan N., Easton D.F., Michailidou K. Polygenic risk scores in cancer screening: A glass half full or half empty? Lancet Oncol. 2023;24(6):579–581. doi: 10.1016/S1470-2045(23)00217-6.
- 23. Gerstung M., Jolly C., Leshchiner I., et al. The evolutionary history of 2658 cancers. Nature. 2020;578(7793):122–128. doi: 10.1038/s41586-019-1907-7.
- 24. Poirion O.B., Jing Z., Chaudhary K., et al. DeepProg: An ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data. Genome Med. 2021;13(1):112. doi: 10.1186/s13073-021-00930-x.
- 25. Chen H., Li C., Peng X., et al. A pan-cancer analysis of enhancer expression in nearly 9000 patient samples. Cell. 2018;173(2):386–399.e12. doi: 10.1016/j.cell.2018.03.027.
- 26. Maier B., Leader A.M., Chen S.T., et al. A conserved dendritic-cell regulatory program limits antitumour immunity. Nature. 2020;580:257–262. doi: 10.1038/s41586-020-2134-y.
- 27. Schluns K.S., Kieper W.C., Jameson S.C., et al. Interleukin-7 mediates the homeostasis of naïve and memory CD8 T cells in vivo. Nat. Immunol. 2000;1:426–432. doi: 10.1038/80868.
- 28. Kondrack R.M., Harbertson J., Tan J.T., et al. Interleukin 7 regulates the survival and generation of memory CD4 cells. J. Exp. Med. 2003;198(12):1797–1806. doi: 10.1084/jem.20030735.
- 29. Ameri S., Fard M.J., Chinnam R.B., et al. Proceedings of the ACM International on Conference on Information and Knowledge Management. 2016. Survival analysis based framework for early prediction of student dropouts; pp. 903–912.
- 30. Li Y., Rakesh V., Reddy C.K. Proceedings of the ACM International Conference on Web Search and Data Mining. 2016. Project success prediction in crowdfunding environments; pp. 247–256.