Abstract
Stage III non-small cell lung cancer (NSCLC) presents complex challenges in treatment decisions due to extensive disease characteristics and patient heterogeneity. Multidisciplinary teams (MDTs) offer personalized treatments by integrating diverse expertise. However, resource and time constraints limit the practicality of MDTs in healthcare. To address this, we propose an interpretable approach for intelligent MDT treatment recommendations. Our approach focuses on enhancing clinical text representation and the interpretability of recommendations. A dual-level embedding technique captures both local and global textual information. By examining factors at the word, phrase, and sentence levels, we explain treatment recommendations for heterogeneous patients. Specifically, attention flow is employed to consider the relationship between attentions across multiple layers, and attention mechanisms screen out important words and sentences. Our method achieves over 85% in accuracy, precision, recall, and F1 score for stage III NSCLC treatment recommendations. Furthermore, we assess the association between our proposed model and patient survival outcomes by analyzing patients who did not undergo MDT consultation. The results demonstrate that patients receiving model-concordant treatments exhibited significantly higher survival rates at 1, 3, and 5 years compared to those receiving model-nonconcordant treatments. Moreover, Kaplan-Meier survival curves confirm the improvement in survival outcomes associated with model-concordant treatments. Additionally, ablation analysis validates the rationality of the model structure.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-026-39658-2.
Keywords: Intelligent treatment decision-making, Multidisciplinary team, Dual-level embedding, Three-level explanation, Stage III non-small cell lung cancer, Survival outcomes
Subject terms: Cancer, Diseases, Medical research
Introduction
Lung cancer has a high incidence and mortality rate on a global scale, with China particularly experiencing a significant impact, ranking at the forefront1,2. Non-small cell lung cancer (NSCLC) is a common type of lung cancer, and nearly 30% of NSCLC patients are already in stage III at the time of initial diagnosis3. However, stage III NSCLC is a heterogeneous disease with poor prognosis, for which treatment options are difficult to define4,5. Therefore, this paper contributes to treatment decision-making for stage III NSCLC.
Multidisciplinary teams (MDTs) provide the most appropriate and personalized treatment for stage III NSCLC patients by integrating treatment opinions from multidisciplinary experts6,7. Moreover, a growing number of studies have shown that MDTs can significantly improve patient prognosis, such as improving quality of life and prolonging survival7,8. Unfortunately, medical resources worldwide are commonly scarce and unevenly distributed, leading to a limited number of hospitals equipped with MDTs9. In particular, when faced with a large patient population, it is more difficult to ensure that the vast majority of patients have access to treatment recommendations provided by MDTs10. Although some studies provide treatment recommendations based on clinical practice guidelines11–14 and research literature15,16, Lin, et al. 17 pointed out that clinical practice guidelines do not respond well to patient heterogeneity and cannot provide the same personalized treatment plans as those provided by MDTs. Treatment recommendations based on research literature also do not take into account individual patient differences17; some information may be out of date18, and there may be conflicting evidence affecting treatment choice19. Therefore, personalized treatment plans that consider individual patient characteristics are necessary, and there is a need to explore intelligent treatment decision-making methods that provide stage III NSCLC patients with a treatment plan comparable to the one developed by an MDT.
Electronic medical records (EMRs) detailing individual patient characteristics20 have been used for treatment recommendations to provide personalized care4,11,21–29. However, the information available in EMRs has not been fully utilized in these studies. Hendriks, et al. 30, Lin, et al. 17, Sesen, et al. 4 and Yan, et al. 31 only selected a few factors based on experience and knowledge; Najafabadipour, et al. 28, Solarte-Pabón, et al. 29, Zeng, et al. 23 and Zhang, et al. 32 only considered certain medical entities; and Min, et al. 33 and Zeng, et al. 22 only analyzed word vector information in clinical text. Furthermore, performance evaluation in current studies focuses primarily on prediction performance, without analyzing the association between the recommended regimen and patient survival outcomes22,30,31. Finally, for practical clinical use, it is crucial to improve the interpretability of the model so that clinicians can understand the reasoning behind the recommended treatment and make informed decisions31.
To overcome the shortcomings mentioned above and to provide stage III NSCLC patients with the treatment options that closely approximate those developed by MDT, this paper puts forward an interpretable intelligent treatment decision-making method. The contributions of this paper are summarized as follows.
Dual-level representation of EMRs using word and sentence embeddings. The dual-level representation considers both local information representation through word embedding and global information representation through sentence embedding32, which ensures adequate representation of clinical text.
Treatment decision of each highly heterogeneous patient is deeply explained at three levels with the help of attention mechanisms and attention flows: word, phrase, and sentence. This three-level explanation can achieve global interpretation by considering the relationship between attentions across multiple layers, and ensure that interpretation is complete and reliable by complementing and supporting each other with explanations at three levels.
The proposed model is evaluated in terms of prediction performance and survival analysis. The prediction performance is assessed by analyzing the accuracy of treatment recommendation. Associations between recommended treatment and patient survival are assessed not only through 1-year, 3-year, and 5-year survival rate comparisons but also using Kaplan–Meier curves and multivariable Cox regression33.
An interpretable intelligent MDT treatment decision-making method based on EMRs is proposed, which can approximate MDT recommendations with high accuracy for cancer patients in the studied dataset.
The rest of the paper is organized as follows. In Sect. 2, we review previous work on treatment recommendation and MDT treatment recommendation. In Sect. 3, we describe the four main components of the MDT treatment decision-making model: data collection and preprocessing, dual-level embedding representation, model performance evaluation and three-level explanation. Subsequently, the prediction performance and the association between the recommended regimen and patient survival outcomes are analyzed using data from two tertiary hospitals, the treatment recommendations are interpreted at three levels, and the rationality and superiority of the model are verified by ablation experiments. In the final section, we summarize the findings and limitations of the study, as well as future research directions.
Prior work
MDT treatment recommendation
MDTs can effectively improve patient prognosis34 and enhance patient satisfaction8,35, but at the same time they consume substantial medical resources and are very time-consuming36. Therefore, some scholars have turned to exploring intelligent MDT treatment recommendation.
Table 1 presents information about representative MDT treatment recommendation studies. The studies were identified through searches in Web of Science, PubMed, and Google Scholar using keywords such as “multidisciplinary team”, “treatment recommendation” and “machine learning”. Several characteristics of the current studies can be found in Table 1. (1) Current MDT treatment recommendation studies concentrate on cancer, particularly breast cancer21,37–39. There is a research gap in MDT treatment recommendations for lung cancer at a specific stage. (2) Most studies determine treatment recommendation variables based on clinical practice guidelines, experience, or knowledge4,21,37–39. It is questionable whether these variables truly and adequately reflect the individual characteristics of the patient. (3) Common interpretable machine learning methods such as decision trees, logistic regression and Bayesian networks are widely used for MDT treatment recommendations4,21,37,39–41. This is mainly because clinicians need to understand the underlying reasoning behind a model’s predictions in order to make informed decisions about patient care. (4) Classification model assessment metrics are widely used in MDT treatment recommendation21,37–41, but less research has been conducted to assess the association between the recommended regimen and patient prognosis4. Thus, there is still much room for development in MDT treatment recommendation research.
Table 1.
MDT treatment recommendation studies.
| Study | Sample size | Disease | Factors | Method | Metric | Interpretability |
|---|---|---|---|---|---|---|
| 30 | 504 | Breast cancer | The factors in the Dutch breast cancer guideline | Clinical decision trees (CDT) | Concordance of recommendations generated by the MDT versus the CDTs | Explainable decision trees |
| 31 | 1535 | Breast cancer | 12 categorical attributes | K-nearest neighbor classification (k-NN) | Accumulative accuracy | Not consider explanation |
| 40 | 304 | Basal cell carcinoma | 6 variables after screening based on the variance-inflation factor | Classification tree | Accuracy | Explainable classification trees |
| 4 | 4020 | Lung cancer | 13 patient and disease specific variables | Rule-based decision support approach, Bayesian networks | Confusion matrix, AUC, 1-year survival | Explainable methods |
| 17 | 1924 | Breast cancer | Dozens of variables | 10 supervised machine learning classifiers | AUC, sensitivity, specificity, and positive likelihood ratio | Some methods are interpretable |
| 41 | 66 | Laryngeal cancer | 303 variables | Inference algorithms and probabilistic graphical models on Bayesian networks | Accuracy, confusion matrix, ROC curve, AUC, and calibration curve | Explainable method |
| 39 | 3340 | Breast cancer | 48 variables | Combine k-NN classifiers and rule-based methods | Hit rate (HR) and the binary classification error rate (BER) | Not consider explanation |
Note: Receiver operating characteristic (ROC) curve, area under the ROC curve (AUC).
Treatment recommendation
There is a wealth of research on treatment recommendations. Table 2 summarizes representative studies on treatment recommendations published in recent years. They were selected from Web of Science, PubMed, and Google Scholar using keywords such as “treatment recommendation” and “machine learning”. The studies reflect variation in disease domains, data sources, methodological strategies, evaluation metrics and interpretability. (1) Scholars have analyzed treatment recommendation for chronic diseases as well as some common diseases24–27, in addition to the rich research on the treatment of cancer12–16,22,23,42. (2) Not only are clinical practice guidelines and research literature used to determine treatment decision options22–27,42, but information in the EMRs, such as clinical notes, demographics, and laboratory test results, is also explored to adequately represent patient characteristics in order to provide more appropriate treatment options12–16. Moreover, information in clinical notes is often represented through medical entities or word embeddings22,23. (3) With the development of machine learning techniques, scholars are increasingly resorting to reinforcement learning to enable treatment recommendations24,25,27. Nevertheless, the computational complexity of reinforcement learning is high and the definition of reward functions is very difficult in the medical field43. (4) In addition to classification metrics12–16,22,26, scholars are beginning to focus on the association between patient survival outcomes and the recommended regimen provided by treatment recommendation models23–25,27,42. (5) More and more models are considering interpretability given the needs of clinical practice12–15,22–24,26.
Table 2.
Treatment recommendation studies.
| Study | Disease | Factors | Method | Metric | Interpretability |
|---|---|---|---|---|---|
| 22 | Prostate Cancer, Oropharynx Cancer, Esophagus Cancer | Clinical notes represented by ‘structured + bow’ or ‘structured + doc2vec’ or ‘structured + fasttext’ | Five machine learning methods | Sensitive, Precision, F1-score | t-SNE |
| 23 | Prostate cancer, Stage I NSCLC | Clinical text represented by the TF-IDF of biomedical entities | Lasso | Hazard ratios | Explainable method |
| 15 | Lung cancer | Nine key clinical concepts | Iterative algorithms for concept recognition and recommended guidelines for treatment plans | F1-score, recall, precision | Each recommendation is supported by literature or consensus guide |
| 42 | NSCLC | 127 features, including patient characteristics, tumor stage, and treatment strategies | 3-layer neural network | 3-year and 5-year survival | Not consider explanation |
| 26 | Low back pain | A multivariate sequence of health measurements over time (i.e., symptoms such as pain or activity limitation) | Variable-duration copula hidden Markov model | F1-score, balanced accuracy, cost, effectiveness | Three states with clinically relevant interpretations |
| 27 | Type 2 Diabetes | Demographic information, physical measurements, medical history, and laboratory data | Reinforcement learning | Short-term and long-term outcome | Not consider explanation |
| 14 | Colorectal cancer | 11 concept groups | Guideline-based treatment evaluation | Precision, recall, accuracy, and F1-score | Explainable method |
| 25 | Top 2,000 diseases in Multiparameter intelligent monitoring in intensive care database | Demographics, lab values, vital signs, and output events | Supervised reinforcement learning with recurrent neural network | The mortality rates, Jaccard coefficient | Not consider explanation |
| 24 | Comorbidity and Sepsis | Demographics, lab values, vital signs | Hierarchical imitation learning | Mortality rate, AUC and Jaccard coefficient | Providing explainable recommendation with hierarchical structure |
| 16 | Various cancers | COSMIC Cancer Gene Census, COSMIC Mutation Data, Genomic Data Commons and 26 million Medline abstracts | Generative adversarial neural network | Precision, recall, infNDCG, R-prec, P@5, P@10, P@30, S@10 | Not consider explanation |
| 13 | Breast cancer | Guideline, domain knowledge | A care plan builder based on rules | Accuracy | Explainable method |
| 12 | Breast cancer | Three Clinical practice guidelines (NCCN, AP-HP and ONK) | The guideline-based decision support module | The compliance rate of guidelines with the recommended treatment | Explainable method |
A review of the literature listed in Tables 1 and 2 reveals the following four points for improvement. First, most studies rarely consider, or even ignore, MDT treatment decisions. In other words, they focus on using clinical practice guidelines, research literature, or expert experience to develop treatment regimens with strong generalizability and weak personalization. Unfortunately, patients with stage III NSCLC are highly heterogeneous and require adequate analysis of individual patient characteristics to provide personalized treatment plans. Second, the existing literature mainly assesses prediction outcomes through indicators such as accuracy and F1-score, while the association between the recommended regimen and patient survival outcomes has received relatively little attention, despite its relevance for assessing the clinical utility of treatment recommendations. A model should not only predict well but also bring practical application value. Third, most existing studies tend to use word embeddings or medical entities to represent clinical texts, which reflects only local information in the text and causes information loss to a certain extent. Finally, from the perspective of model interpretability, previous studies have increasingly emphasized the transparency and interpretability of models. However, there is still a need to continue improving the interpretability of models to promote their practical application.
Methodology
In this paper, we built an interpretable MDT treatment decision-making model consisting of four components. (1) Data collection and preprocessing: collect EMRs of patients with stage III NSCLC, determine the treatments for recommendation and divide the patients into MDT and non-MDT groups. (2) Dual-level embedding representation: word and sentence embedding representation. (3) Model performance evaluation: assessment of prediction performance and evaluation of associations with survival outcomes. (4) Three-level explanation: model interpretation at the word, phrase, and sentence levels. The framework of the proposed model is shown in Fig. 1.
Fig. 1.
The framework of the proposed model.
Data collection and preprocessing
We collected EMRs of patients with stage III NSCLC between January 1, 2017 and October 30, 2022 from two tertiary hospitals in Hunan Province, China. Both patients with a first diagnosis of stage III NSCLC and patients diagnosed with stage III NSCLC after recurrence or progression were included in the research sample. Samples for which no treatment plan was given were excluded. Ultimately, a total of 2,876 stage III NSCLC patients were used for this study. The present study was approved by the Ethics Committee of The Second Hospital of Central South University and Hunan Cancer Hospital, and conducted in accordance with the principles of the Declaration of Helsinki. All patients signed written informed consent.
There are various types of notes in the EMRs. From the 36,490 collected notes for these patients, progress notes are selected as the basis for recommending treatments and treatment plan sheets are chosen to outline the specific treatment plans for these patients. To prevent potential information leakage, only progress notes documented from the time of diagnosis up to the date of the MDT discussion or treatment decision were included. Notes written after treatment initiation were excluded. In addition, explicit future-oriented or order-set expressions were manually screened and removed to ensure that the input text did not directly reveal the target treatment. This manual screening was independently performed by three trained researchers using a predefined set of treatment-related keywords. Representative keywords included “start concurrent chemoradiation”, “initiate cisplatin”, “schedule radiotherapy”, “begin adjuvant chemotherapy”, “prescribe paclitaxel”, and “administer immunotherapy” among others related to treatment initiation, drug orders, and procedural plans. The retained progress notes mainly contained clinical and diagnostic information, such as positron emission tomography-computed tomography (PET-CT), lung-enhanced computed tomography (CT), biopsy, immunohistochemistry, magnetic resonance imaging of the head, abdominal CT, etc., as well as demographic information of the patients, such as age, gender, clinical stage, etc. This information gives a full picture of the patient’s basic profile and provides the basis for the doctor’s decisions. A representative example of such a progress note, highlighting the clinical and diagnostic information used for treatment recommendation, is provided in Supplementary Material.
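The keyword-based screening step described above could be supported by a small helper that flags candidate sentences for the human reviewers. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, the keyword list contains only the representative examples quoted in the text, and the sentence splitter is a simple assumption.

```python
import re

# Only the representative keywords quoted in the text; the full
# predefined keyword set used in the study is not given.
TREATMENT_KEYWORDS = [
    "start concurrent chemoradiation",
    "initiate cisplatin",
    "schedule radiotherapy",
    "begin adjuvant chemotherapy",
    "prescribe paclitaxel",
    "administer immunotherapy",
]


def split_sentences(note):
    """Split after English punctuation followed by whitespace, or after
    Chinese sentence-ending punctuation (hypothetical splitter)."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", note)
    return [p.strip() for p in parts if p.strip()]


def flag_leaky_sentences(note, keywords=TREATMENT_KEYWORDS):
    """Return (kept, flagged): flagged sentences mention a treatment
    keyword and would be passed to reviewers for manual removal."""
    kept, flagged = [], []
    for sentence in split_sentences(note):
        lowered = sentence.lower()
        if any(keyword in lowered for keyword in keywords):
            flagged.append(sentence)
        else:
            kept.append(sentence)
    return kept, flagged
```

A helper like this only narrows down candidates; the final decision to remove a sentence remains with the reviewers, as in the manual screening described in the text.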
On the basis of the treatment plan sheets in the EMRs, we consulted clinical experts and studied clinical practice guidelines and a large body of literature4,44. Following this, the treatments of the collected samples were divided into six groups according to their specific circumstances. The six treatment plans are surgery, surgery after neoadjuvant therapy, radiotherapy, chemotherapy, chemoradiotherapy (sequential/concurrent) and chemotherapy + immunotherapy/targeted therapy. Treatments that involve only a very small number of patients are not considered. Finally, a total of 2,521 patients who received one of these six treatment options were included in the investigation. Among them, 22.8% of patients received a consultation from the MDT. The proportion of patients who did and did not receive MDT consultation (non-MDT) under each treatment option is shown in Table 3.
Table 3.
Treatments received by patients.
| Group | SUR | SURANT | CHE | RAD | CHE (S/C) | CHE + IMM/TAR |
|---|---|---|---|---|---|---|
| MDT | 23.69% | 21.25% | 8.01% | 5.92% | 19.86% | 22.12% |
| non-MDT | 17.57% | 3.65% | 36.67% | 1.59% | 8.58% | 31.84% |
Note: SUR refers to ‘Surgery’, SURANT refers to ‘Surgery after neoadjuvant therapy’, CHE refers to ‘Chemotherapy’, RAD refers to ‘Radiotherapy’, CHE (S/C) refers to ‘Chemoradiotherapy (Sequential/Concurrent)’, CHE + IMM/TAR refers to ‘Chemotherapy + immunotherapy/targeted therapy’.
As can be seen in Table 3, the two patient groups (MDT vs. non-MDT) used different treatment protocols. The MDT group drew on a wider range of treatment options, whereas patients in the non-MDT group were more inclined to use chemotherapy, alone or combined with immunotherapy/targeted therapy. Compared to the non-MDT group, the MDT group had a higher proportion of surgery after neoadjuvant therapy, a lower proportion of chemotherapy and a higher proportion of chemoradiotherapy.
The observed differences in treatment patterns between the MDT and non-MDT groups likely reflect the advantages of multidisciplinary review in NSCLC management. Multidisciplinary experts can better cope with the heterogeneity of NSCLC patients, comprehensively analyze the specific conditions of the patients, and provide them with targeted treatment options. While MDTs may indeed manage a higher proportion of complex cases, these differences also suggest that MDTs are able to expand therapeutic options beyond those typically considered by individual clinicians. By contrast, individual doctors are confined to their specific field of expertise and can only propose treatment options within their own discipline.
Therefore, the aim of this study is to learn from the MDT decision-making process to develop a treatment recommendation method capable of providing optimal, patient-specific therapeutic strategies, thereby supporting clinical decision-making.
Treatment decision-making method based on dual-level embedding representation
This subsection introduces the treatment decision-making method based on dual-level embedding representation. The word-level embedding takes into account the semantic relationships and similarities between words, and the sentence-level embedding considers the semantic and contextual information of the sentence44. Together, the word and sentence levels adequately represent the intra- and inter-sentence information in the clinical text.
Word-level embedding representation
SMedBERT45 was used for word-level embedding representation in this study. The model improves upon BERT by leveraging knowledge graphs to inject knowledge into BERT, thereby enhancing the language model’s understanding capabilities. Specifically, during pre-training, the types of the entities present in medical texts are incorporated along with their related neighboring entities. This deepens the model’s understanding of complex entities and their associated relations in the medical text, thereby enhancing the model’s representation capability.
Given that the authors of SMedBERT45 have not publicly released a knowledge graph, this research utilized the Chinese Medical Knowledge Graph (CMeKG)46 to acquire entities, entity types, and relations from the collected medical record texts. The framework of SMedBERT based on CMeKG is illustrated in Fig. 2. CMeKG is a comprehensive knowledge graph in the medical domain, encompassing 11,076 diseases, 18,471 drugs, 14,794 symptoms, and 3,546 clinical techniques. With the help of CMeKG, 8,832 triplets, 5,160 entities, 16 kinds of relations and 9 kinds of entity types were extracted from the collected medical texts. Table 4 lists the extracted relations and entity types. These abundant sources of knowledge contribute to the comprehensive and accurate representation of words, especially complicated entities.
Fig. 2.
The framework of the SMedBERT based on CMeKG45.
Table 4.
Relations and entity types inventory.
| Category | Content |
|---|---|
| Relation | Clinical manifestations, Adverse reactions, Tests, Complications, Specifications, Therapeutics, Causes, Dosage, Ingredients, Indications, Treatments, Efficacy, Precautions, Associated Diseases, Associated Symptoms, Drug interactions, Functions, Diseases, Symptoms, Drug interactions, and Functions |
| Entity type | Disease, Body, Symptom, Medical Procedure, Medical Device, Drug, Department, Microbiological, Medical test item |
The embedding representation of entities and relations differs from the embedding representation of words obtained through tokenization of text. The words in the text are represented by token embedding, segment embedding, and position embedding. In contrast, the embedding representations of entities and relations are produced by the TransR algorithm47, which analyzes their semantic correlations within the knowledge graph. The embedding of each entity type is obtained from the embeddings of entities: in this study, the average of the embeddings of the entities within each entity type is used as the embedding representation for that entity type. The incorporation of entity type embedding greatly enhances the learning of neighboring entities in subsequent steps45.
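The averaging rule for entity type embeddings can be illustrated with a short sketch. The entity names, the two-dimensional vectors, and the function name below are illustrative assumptions; in the study the entity vectors come from TransR and are much higher-dimensional.

```python
from collections import defaultdict


def entity_type_embeddings(entity_vectors, entity_types):
    """Average the embeddings of all entities that share an entity type.

    entity_vectors: {entity_name: [float, ...]} (e.g. TransR outputs)
    entity_types:   {entity_name: type_name}
    """
    sums = {}
    counts = defaultdict(int)
    for entity, vector in entity_vectors.items():
        type_name = entity_types[entity]
        if type_name not in sums:
            sums[type_name] = [0.0] * len(vector)
        sums[type_name] = [s + v for s, v in zip(sums[type_name], vector)]
        counts[type_name] += 1
    return {
        type_name: [s / counts[type_name] for s in total]
        for type_name, total in sums.items()
    }


# Toy example with hypothetical 2-dimensional entity vectors.
toy_vectors = {
    "lung cancer": [1.0, 3.0],
    "pneumonia": [3.0, 5.0],
    "cisplatin": [0.0, 2.0],
}
toy_types = {
    "lung cancer": "Disease",
    "pneumonia": "Disease",
    "cisplatin": "Drug",
}
type_vectors = entity_type_embeddings(toy_vectors, toy_types)
```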
Considering the substantial interconnectedness of entities in a knowledge graph, it is crucial to carefully choose relevant neighboring entities for text analysis. This selection process serves to reduce computational burden and minimize the injection of knowledge noise. To accomplish this, we employ PageRank48 to compute the weights of neighboring entities for each entity, enabling us to identify the most influential one.
In order to take advantage of the various embeddings mentioned above, SMedBERT incorporates neighbor entity type attention and neighbor entity node attention to learn the structural semantic knowledge. Moreover, the masked neighbor module and masked mention module are added to facilitate the information interaction between mentions and neighbors45. Therefore, SMedBERT can represent medical text with knowledge.
Based on the original configuration of the SMedBERT model parameters, this study performed 50 epochs of fine-tuning on the collected medical record data to generate word vector representations for the medical text. Figure 3 illustrates the change in loss during the fine-tuning process. As shown in Fig. 3, the loss decreases as the number of epochs increases and finally levels off around 0.10, which suggests that the fine-tuned model represents the medical text well. In addition, Figs. 4 and 5 show the number of words and entities in the samples, respectively.
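Neighbor selection by PageRank can be sketched as follows. This is a plain power-iteration implementation on a toy graph, not the study's code: the damping factor, iteration count, graph contents, and the `top_k_neighbors` helper are assumptions for illustration.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank on a graph given as {node: [neighbors]}.
    Every neighbor must also appear as a key of `adjacency`."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            neighbors = adjacency[node]
            if not neighbors:
                # Dangling node: spread its rank uniformly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
                continue
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                new_rank[neighbor] += share
        rank = new_rank
    return rank


def top_k_neighbors(adjacency, entity, k=2):
    """Rank the neighbors of `entity` by PageRank score and keep the top k."""
    scores = pagerank(adjacency)
    ranked = sorted(adjacency[entity], key=lambda nb: scores[nb], reverse=True)
    return ranked[:k]


# Hypothetical toy knowledge graph (undirected, stored symmetrically).
toy_graph = {
    "NSCLC": ["cisplatin", "cough", "PET-CT"],
    "cisplatin": ["NSCLC", "nausea", "cough"],
    "cough": ["NSCLC", "cisplatin"],
    "PET-CT": ["NSCLC"],
    "nausea": ["cisplatin"],
}
selected = top_k_neighbors(toy_graph, "NSCLC", k=2)
```

Keeping only the highest-ranked neighbors limits how much graph context is injected per entity, which matches the stated goals of lower computational cost and less knowledge noise.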
Fig. 3.
Loss during fine-tuning process.
Fig. 4.
Frequency distribution of words in MDT and non-MDT samples (Note: 512 is the max sequence length.).
Fig. 5.
Frequency distribution of entities in MDT and non-MDT samples.
According to Figs. 4 and 5, the following points are found. (1) The number of words in most of the MDT and non-MDT samples does not exceed 512. Exceeding 512 means that some information cannot be captured in the word vector representation of the text because of the limit on text length. (2) Fig. 5 reveals that a substantial number of entities are identified from the samples using CMeKG, indicating good performance in entity recognition. The integration of specialized medical knowledge enhances the model’s comprehension capabilities. (3) The distribution of words and entities shows a similar pattern between the MDT and non-MDT groups. This provides a basis for recommending treatment patterns for the non-MDT group based on insights gained from the MDT group.
Sentence-level embedding representation
The co-training drove retrieval oriented masking (coROM)49 model, which was initially developed for paragraph retrieval, has been extended to support a wide range of applications, including sentence vector representation, dual-sentence text similarity, and similarity ranking between a query and multiple candidate documents. coROM was trained on 959,526 passages in the medical field of the multi-domain Chinese dataset for passage retrieval (Multi-CPR)50. This model demonstrates strong generalization capabilities as a result of being trained on a large-scale corpus. Furthermore, the model benefits from high-consistency annotations (annotations with more than 80% consistency from at least 5 independent annotators), which support its reliability. In general, coROM performs well in capturing semantic relevance between sentences and provides a powerful representation of contextual information at the sentence embedding level.
The distribution of the number of sentences contained in the samples of MDT and non-MDT group is shown in Fig. 6. From the figure, it can be observed that the majority of samples have fewer than 25 sentences, with a maximum number of sentences not exceeding 50. In this study, a maximum limit of 50 sentences was set. Compared to the word level embedding representation, the sentence embedding representation ensures that there is no information loss in the text representation.
Fig. 6.
Frequency distribution of sentences in MDT and non-MDT samples.
A splicing method based on an attention mechanism
To achieve a comprehensive representation of text in this study, an attention mechanism51 is employed to combine word embedding and sentence embedding. The splicing method not only efficiently integrates the two levels of embedding, but also considers the importance of each level of embedding.
First, we employ the attention mechanism to concatenate the dual-level embeddings, which addresses a limitation of direct concatenation: it disregards the importance of individual embedded sentences and words. Subsequently, the static convolutional neural network (CNN-static) proposed by Kim52, which can handle high-dimensional data, processes the concatenated text representation to obtain the optimal treatment recommendation. The detailed computations are summarized as follows.
The attention weights at the word and sentence levels and the splicing of the dual-level embeddings are calculated as follows:

$$\alpha_w = \sigma(E_w u_w), \qquad \alpha_s = \sigma(E_s u_s), \qquad H = [\alpha_w \odot E_w ;\ \alpha_s \odot E_s] \tag{1}$$

where $\alpha_w$ and $\alpha_s$ are the attention weight vectors for words and sentences respectively, $E_w$ and $E_s$ are the word-level and sentence-level embeddings respectively, and $\sigma$ is the activation function. Words or sentences with large attention weights are important words or sentences in the text and have a significant impact on classification decisions. $u_w$ and $u_s$ are learnable vector parameters. H is the textual representation that integrates word and sentence embeddings. It will be fed into the CNN-static to obtain the text classification, i.e., the treatment recommendation.
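The attention-based splicing can be illustrated with a small pure-Python sketch. Several choices here are assumptions rather than details from the source: the activation is taken to be a softmax over positions, word and sentence embeddings are given the same dimension, the weighted rows are stacked along the sequence axis, and the learnable vectors are replaced by fixed toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(embeddings, query):
    """Score each row by its dot product with the (learnable) query vector,
    then normalize the scores into attention weights."""
    scores = [sum(x * q for x, q in zip(row, query)) for row in embeddings]
    return softmax(scores)


def fuse(word_emb, sent_emb, word_query, sent_query):
    """Weight each word and sentence embedding by its attention weight and
    stack the two weighted sequences into one representation."""
    a_w = attention_weights(word_emb, word_query)
    a_s = attention_weights(sent_emb, sent_query)
    weighted_words = [[a * x for x in row] for a, row in zip(a_w, word_emb)]
    weighted_sents = [[a * x for x in row] for a, row in zip(a_s, sent_emb)]
    return weighted_words + weighted_sents, a_w, a_s


# Toy inputs: 3 word vectors and 2 sentence vectors of dimension 2.
word_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sent_emb = [[2.0, 0.0], [0.0, 0.0]]
H, a_w, a_s = fuse(word_emb, sent_emb, [1.0, 1.0], [1.0, 0.0])
```

In a trained model the query vectors would be learned jointly with the classifier, and the resulting weights are the quantities later inspected to identify important words and sentences.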
Model performance evaluation
We assess model performance from two perspectives, namely prediction performance and association with survival outcomes. The MDT samples are utilized to train the treatment recommendation model. The 5-fold cross-validation is employed to evaluate the prediction performance of the model. Since treatment recommendation is essentially a classification problem, the confusion matrix, accuracy, precision, recall, and F1-score are used to assess prediction performance4,22.
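The classification metrics named above can all be derived from the confusion matrix. The sketch below assumes macro-averaging across treatment classes, which the text does not specify; the function names and the toy labels are illustrative only.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def macro_metrics(matrix):
    """Accuracy plus macro-averaged precision, recall and F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three of the six treatment classes.
labels = ["SUR", "CHE", "RAD"]
y_true = ["SUR", "SUR", "CHE", "CHE", "RAD", "RAD"]
y_pred = ["SUR", "CHE", "CHE", "CHE", "RAD", "SUR"]
matrix = confusion_matrix(y_true, y_pred, labels)
metrics = macro_metrics(matrix)
```

Under 5-fold cross-validation, these metrics would be computed on each held-out fold and then averaged.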
To assess the association between model-recommended treatment and patient survival, model concordance is introduced as an exposure variable27 to analyze prognosis in the non-MDT patient group. For each patient in the non-MDT group, we apply the trained model to their EMR data to generate a recommended treatment. We then compare this recommendation with the treatment the patient actually received. Patients whose real-world treatment matches the model recommendation are classified as ‘model-concordant’, whereas those whose treatment differs are classified as ‘model-nonconcordant’. Survival outcomes are subsequently compared between these two groups. Considering that only 6 years of EMRs are available, we select 1-year, 3-year, and 5-year survival as assessment indicators4. Furthermore, the Kaplan-Meier method and multivariable Cox proportional hazards regression are employed to construct survival curves and to assess statistical associations between model concordance and survival outcomes.
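The concordance stratification described above can be sketched as follows. The `Patient` fields, the toy cohort, and the crude survival-rate helper (which simply excludes patients censored before the horizon) are hypothetical simplifications for illustration, not the study's actual data model or estimator.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    recommended: str    # treatment generated by the trained model (hypothetical field)
    actual: str         # treatment actually received (hypothetical field)
    followup_days: int  # days from diagnosis to death or last contact
    died: bool          # True if death was observed

def stratify(patients):
    """Split a cohort into model-concordant and model-nonconcordant groups."""
    concordant = [p for p in patients if p.actual == p.recommended]
    nonconcordant = [p for p in patients if p.actual != p.recommended]
    return concordant, nonconcordant

def crude_survival_rate(group, horizon_days):
    """Share of evaluable patients alive at the horizon.

    Patients censored before the horizon are excluded from the denominator.
    """
    evaluable = [p for p in group if p.died or p.followup_days >= horizon_days]
    if not evaluable:
        return None
    alive = [p for p in evaluable if p.followup_days >= horizon_days]
    return len(alive) / len(evaluable)

# Hypothetical toy cohort.
cohort = [
    Patient("Chemotherapy", "Chemotherapy", 400, True),
    Patient("Surgery", "Surgery", 200, True),
    Patient("Radiotherapy", "Chemotherapy", 500, False),
    Patient("Surgery", "Radiotherapy", 100, True),
    Patient("Surgery", "Chemotherapy", 300, True),
    Patient("Chemotherapy", "Surgery", 90, False),
]
concordant, nonconcordant = stratify(cohort)
print(crude_survival_rate(concordant, 365), crude_survival_rate(nonconcordant, 365))
```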
Three-level explanation
Interpreting models in the medical domain is vital for both facilitating understanding among non-experts and improving the reliability and trustworthiness of the models. This section focuses on providing comprehensive explanations at three levels: word, phrase, and sentence. These explanations are achieved by leveraging attention mechanisms and attention flows53. Important words and sentences are rapidly acquired through attention mechanisms, while crucial phrases are obtained by analyzing the relationships of attentions between different layers by attention flow.
First, the words and sentences that are important in the text representation are identified according to the attention values in Eq. (1): the larger the attention value, the more important the word or sentence. Then, based on the attention flows proposed by DeRose et al.53, the important words are tracked backward layer by layer. Specifically, the important words for each layer are selected based on the multi-head attention from the subsequent layer to the current one. The important words in the last layer have already been obtained in the first step using the attention mechanism. Each earlier layer then takes as its important words those that, according to the multi-head attention, have the greatest impact on the important words of the layer above it. The relevant calculation is given in Eq. (2).
[Eq. (2): equation image not recovered in this version.]
Eq. (2) defines the j-th attention head, which is used to show the word dependencies between layers, in terms of the embedding vectors of the n tokens at layer l and the projection matrices that produce the queries and keys. We only consider words at layer l that have a significant influence (an attention value exceeding a threshold) on important words at layer l + 1, and count, for each such word, the number of attention heads in which its influence is significant. The greater the number of attention heads with significant impact, the stronger the role of the layer-l word in determining the importance of the important words at layer l + 1, and the more likely it is to be identified as important.
Finally, the important words from each layer are selected to form phrases based on the attention flows, which provides a phrase-level interpretation of the model results. Thus, the three levels of word, phrase and sentence interpretation are completed. The three levels of interpretation corroborate and complement each other.
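The layer-by-layer tracking described above can be sketched as a head-counting procedure over precomputed attention matrices. The input layout (`attn[h][i][j]` as the attention that head h assigns from token i at layer l + 1 to token j at layer l), the `top_k` cut-off, and the toy values are assumptions for illustration; the sketch does not reproduce the attention computation of Eq. (2) itself.

```python
def count_influential_heads(attn, important_next, threshold):
    """For each token j at layer l, count heads whose attention from any
    important token at layer l + 1 to j exceeds the threshold."""
    n_tokens = len(attn[0][0])
    counts = [0] * n_tokens
    for head in attn:
        for j in range(n_tokens):
            if any(head[i][j] > threshold for i in important_next):
                counts[j] += 1
    return counts

def select_important(counts, top_k):
    """Keep the top_k tokens with the most influential heads (ties by index)."""
    order = sorted(range(len(counts)), key=lambda j: (-counts[j], j))
    return [j for j in order[:top_k] if counts[j] > 0]

def trace_back(layers_attn, important_last, threshold, top_k):
    """Follow important tokens from the last layer down through earlier layers.

    layers_attn is ordered from the last layer transition to the earliest.
    """
    path = [sorted(important_last)]
    current = list(important_last)
    for attn in layers_attn:
        counts = count_influential_heads(attn, current, threshold)
        current = select_important(counts, top_k)
        path.append(current)
    return path

# Toy example: one layer transition, 2 heads, 3 tokens.
head0 = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
head1 = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]
counts = count_influential_heads([head0, head1], important_next=[0], threshold=0.25)
path = trace_back([[head0, head1]], important_last=[0], threshold=0.25, top_k=2)
print(counts, path)
```

The important tokens collected along `path` are the candidates from which phrase-level explanations would be assembled.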
Results
The treatment recommendation model was trained on the MDT samples, and its prediction performance was evaluated. The trained model was then applied to non-MDT samples to examine the association between model concordance and patient survival outcomes. For each recommended treatment plan, we leveraged the attention mechanism and attention flows to provide explanations at three levels: word, phrase, and sentence. Finally, we conducted ablation experiments to demonstrate the rationality and superiority of the model’s structure.
Performance of the model
The treatment recommendation model proposed in this study underwent a comprehensive evaluation from the perspectives of both prediction performance and associations with survival outcomes. During the evaluation process, we first assessed the model’s predictive capabilities, and measured its accuracy and reliability in predicting suitable treatment options based on the input data. Subsequently, we focused on evaluating the association between model concordance and patient survival.
Prediction performance
Based on the word embedding and sentence embedding representations obtained with the methods in Sects. 3.2.1 and 3.2.2, CNN-static with an attention mechanism was used to predict the treatment plan. The CNN-static52 was equipped with 200 filters of sizes [10, 15, 20], a dropout rate of 0.4, and a learning rate of 0.01, and it was trained for 500 epochs. To assess the model’s performance, we conducted 5-fold cross-validation. Figure 7 shows the normalized confusion matrices obtained during cross-validation, which allow a fair comparison of the model’s performance across categories despite differences in sample sizes. Table 5 summarizes the prediction results for each fold. To further account for variability across categories, Table 6 reports the performance metrics for each category along with 95% confidence intervals (CIs).
Fig. 7.
Results of 5-fold cross-validation. (Note: All values in the figure are percentages. ‘Statistics’ denotes the aggregated confusion matrix obtained from 5-fold cross-validation. Label meanings: 0 refers to ‘Surgery’, 1 refers to ‘Surgery after neoadjuvant therapy’, 2 refers to ‘Chemotherapy’, 3 refers to ‘Radiotherapy’, 4 refers to ‘Chemoradiotherapy (Sequential/Concurrent)’, and 5 refers to ‘Chemotherapy + immunotherapy/targeted therapy’.)
Table 5.
Prediction performance of the proposed model through 5-fold cross-validation.
| Fold | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Fold 1 | 0.8783 | 0.8479 | 0.8720 | 0.8554 |
| Fold 2 | 0.8609 | 0.8286 | 0.8600 | 0.8391 |
| Fold 3 | 0.8889 | 0.8422 | 0.8797 | 0.8567 |
| Fold 4 | 0.9217 | 0.8955 | 0.9102 | 0.9007 |
| Fold 5 | 0.9123 | 0.8798 | 0.8976 | 0.8873 |
| Overall | 0.8924 (0.8616–0.9232) | 0.8588 (0.8243–0.8934) | 0.8839 (0.8590–0.9088) | 0.8679 (0.8365–0.8993) |
Note: In each fold, the evaluation metrics are computed using macro-average, and “Overall” represents the mean value of each metric across all folds, along with its 95% CI.
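As a worked check of the "Overall" row, the sketch below computes the mean and a 95% CI from the five fold-level accuracies in Table 5. The use of a Student's t interval with 4 degrees of freedom is an assumption (the text does not state how the CI was constructed); under that assumption the result agrees with the reported accuracy interval. The critical value 2.776 is hard-coded to keep the example within the standard library.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 95% critical value of Student's t, 4 degrees of freedom

def mean_ci(values, t_crit):
    """Mean and t-based confidence interval using the sample standard deviation."""
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - t_crit * std_err, mean + t_crit * std_err

# Fold-level accuracies copied from Table 5.
accuracy = [0.8783, 0.8609, 0.8889, 0.9217, 0.9123]
mean, low, high = mean_ci(accuracy, T_CRIT_DF4)
print(f"{mean:.4f} ({low:.4f}-{high:.4f})")  # 0.8924 (0.8616-0.9232)

# Population standard deviation across folds, for comparison with a "mean ± SD" summary.
print(f"{statistics.pstdev(accuracy):.4f}")  # 0.0222
```

With only five folds, the t critical value is much larger than the normal value of 1.96, which is why the interval is wider than the spread of the fold accuracies might suggest.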
Table 6.
Prediction performance of the proposed model across treatment categories (5-fold cross-validation).
| Treatment category | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| SUR | 0.9230 (0.8686–0.9774) | 0.8857 (0.8333–0.9381) | 0.9032 (0.8632–0.9433) | 0.9549 (0.9342–0.9755) |
| SURANT | 0.9022 (0.8332–0.9712) | 0.8636 (0.8347–0.8925) | 0.8819 (0.8429–0.9208) | 0.9497 (0.9242–0.9753) |
| CHE | 0.7011 (0.5735–0.8287) | 0.8455 (0.7791–0.9118) | 0.7635 (0.6721–0.8550) | 0.9601 (0.9507–0.9695) |
| RAD | 0.7262 (0.6400–0.8124) | 0.8511 (0.7989–0.9033) | 0.7820 (0.7282–0.8358) | 0.9705 (0.9581–0.9828) |
| CHE (S/C) | 0.9235 (0.8834–0.9636) | 0.9541 (0.9405–0.9677) | 0.9383 (0.9159–0.9607) | 0.9757 (0.9667–0.9847) |
| CHE + IMM/TAR | 0.9769 (0.9350–1.0000) | 0.9034 (0.8444–0.9624) | 0.9383 (0.8943–0.9822) | 0.9739 (0.9498–0.9981) |
From the results presented in Fig. 7 and Tables 5 and 6, several conclusions can be drawn.
(1) The model achieves a high recall of 0.9541 for ‘Chemoradiotherapy (Sequential/Concurrent)’ and 0.9034 for ‘Chemotherapy + immunotherapy/targeted therapy’, indicating that it successfully identifies most patients who should receive these treatments. Moreover, the precision for ‘Chemotherapy + immunotherapy/targeted therapy’ (0.9769) and ‘Chemoradiotherapy (Sequential/Concurrent)’ (0.9235) suggests that the model’s recommendations in these categories are highly accurate. The CIs in Table 6 further support the reliability and robustness of these results.
(2) The model shows relatively lower precision for ‘Chemotherapy’ (0.7011) and ‘Radiotherapy’ (0.7262), though their recall values remain high (0.8455 and 0.8511, respectively). This indicates that while the model can identify most patients needing these treatments, it sometimes produces false positives. This performance may be attributed to the smaller sample sizes in these categories, which limit the model’s learning. Nevertheless, Table 6 shows that the model still maintains high accuracy for these categories (0.9601 for Chemotherapy and 0.9705 for Radiotherapy), demonstrating its potential for practical use.
(3) Overall, across the five folds, the model achieves accuracies ranging from 0.8609 to 0.9217 (Table 5), with a mean of 0.8924 ± 0.0222. Similarly, the mean macro-average precision, recall, and F1-score are all above 0.85, with modest variance between folds. Table 6 further confirms that the model’s accuracy remains consistently high across all treatment categories, with narrow 95% CIs, suggesting strong generalization ability and robustness.
Survival analysis
We stratified non-MDT patients into model-concordant and model-nonconcordant groups to compare their survival outcomes. To determine model concordance, we applied the trained model to each patient’s EMR data to generate a recommended treatment and compared it with the treatment actually received. Patients whose treatments matched the model recommendation were classified as ‘model-concordant’, while those with differing treatments were classified as ‘model-nonconcordant’. This evaluation approach, which considers model concordance as the exposure factor, has been widely utilized for outcome assessments27. Time zero was defined as the date of diagnosis, and patients were followed until the occurrence of the event (death) or censoring.
Among the non-MDT patient population, 23.37% of the patients received treatment plans concordant with the model recommendations. Table 7 summarizes the baseline characteristics of non-MDT patients stratified by model concordance. The baseline characteristics were comparable between the model-concordant and model-nonconcordant groups in terms of age, sex, disease stage, and disease origin. Table 8 presents the one-year, three-year, and five-year survival outcomes of the two groups, and the chi-square test was conducted to analyse the differences in survival between them. The results indicate that the model-concordant group exhibits higher observed survival rates, with statistically significant differences at a 99% confidence level for one-year, three-year, and five-year survival. These findings highlight a statistically significant association between model concordance and patient survival in the studied dataset.
Table 7.
Baseline characteristics of non-MDT patients stratified by model concordance.
| Characteristic | Model-concordant (n = 455) | Model-nonconcordant (n = 1492) | p-value |
|---|---|---|---|
| Age, years (mean ± standard deviation) | 63.9 ± 9.4 | 64.3 ± 10.1 | 0.4353 |
| Sex, n (%) | | | 0.7326 |
| Male | 284 (62.4%) | 918 (61.5%) | |
| Female | 171 (37.6%) | 574 (38.5%) | |
| Disease stage, n (%) | | | 0.7266 |
| Stage IIIA | 183 (40.2%) | 569 (38.1%) | |
| Stage IIIB | 147 (32.3%) | 498 (33.4%) | |
| Stage IIIC | 125 (27.5%) | 425 (28.5%) | |
| Disease origin, n (%) | | | 0.4558 |
| De novo | 394 (86.6%) | 1271 (85.2%) | |
| Recurrent/progression | 61 (13.4%) | 221 (14.8%) | |
Note: Continuous variables are presented as mean ± standard deviation and compared using the Welch t-test. Categorical variables are presented as number (percentage) and compared using the chi-square test.
Table 8.
One-year, three-year, and five-year survival outcomes of the model-concordant group and the model-nonconcordant group.
| Group | One-year survival | Three-year survival | Five-year survival |
|---|---|---|---|
| The model-concordant group | 49.01% | 23.96% | 11.43% |
| The model-nonconcordant group | 32.51% | 8.71% | 2.82% |
| χ² (p-value) | 24.87 (6.12e-07) | 73.82 (8.55e-18) | 54.44 (1.61e-13) |
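To make the chi-square comparison concrete, the sketch below computes a 2 × 2 chi-square statistic for the three-year and five-year rows. Two assumptions are made here and are not stated in the text: the survivor counts are back-calculated from the reported percentages and the group sizes (455 and 1492), and a Yates continuity correction is applied. Under these assumptions the statistics come out at approximately 73.82 and 54.44.

```python
import math

def yates_chi_square(a, b, c, d):
    """Chi-square with Yates correction for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

N_CONCORDANT, N_NONCONCORDANT = 455, 1492

# Survivor counts back-calculated from the percentages (assumption):
# 23.96% of 455 -> 109, 8.71% of 1492 -> 130, 11.43% of 455 -> 52, 2.82% of 1492 -> 42.
three_year = yates_chi_square(109, N_CONCORDANT - 109, 130, N_NONCONCORDANT - 130)
five_year = yates_chi_square(52, N_CONCORDANT - 52, 42, N_NONCONCORDANT - 42)

print(f"three-year: chi2 = {three_year:.2f}, p = {p_value_df1(three_year):.2e}")
print(f"five-year:  chi2 = {five_year:.2f}, p = {p_value_df1(five_year):.2e}")
```

The closed-form p-value uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so no statistics library is needed.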
The Kaplan-Meier survival analysis33 revealed a highly significant difference between the model-concordant and model-nonconcordant groups, with a log-rank p-value of 1.2804 × 10⁻¹⁴ shown in Fig. 8. The estimated one-year, three-year, and five-year survival probabilities for the model-concordant group were 46.4% (95% CI, 41.6–50.8%), 24.0% (95% CI, 20.0–27.8%), and 11.4% (95% CI, 8.5–14.3%), respectively, whereas for the model-nonconcordant group, the corresponding survival probabilities were markedly lower at 33.2% (95% CI, 30.8–35.6%), 8.7% (95% CI, 7.3–10.1%), and 2.8% (95% CI, 2.0–3.7%). These trends are consistent with the data presented in Table 8 and indicate that patients in the model-concordant group experienced substantially better survival outcomes over time. The Kaplan-Meier curves further illustrate that the divergence in survival probabilities between the two groups becomes most pronounced between 500 and 2000 days, and the number of patients at risk provided at each time point reinforces the reliability of these estimates.
Fig. 8.
Kaplan-Meier Survival Curve: A comparison between model-concordant group and model-nonconcordant group.
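For readers unfamiliar with the Kaplan-Meier method, the following minimal sketch implements the product-limit estimator on a hypothetical toy sample. The durations and event flags are invented for illustration and are unrelated to the study cohort; confidence intervals and the log-rank test are omitted.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct death time.

    events[i] is 1 if death was observed at durations[i], 0 if censored.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

def survival_at(curve, t):
    """Step-function lookup: estimated survival probability at time t."""
    value = 1.0
    for time, surv in curve:
        if time <= t:
            value = surv
        else:
            break
    return value

# Hypothetical follow-up in days; 1 = death observed, 0 = censored.
durations = [100, 200, 200, 365, 400, 500]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(durations, events)
print(curve)
print(survival_at(curve, 365))
```

Censored patients leave the risk set without lowering the curve, which is how the estimator uses partial follow-up rather than discarding it.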
Multivariable Cox proportional hazards regression analysis was performed to assess whether the association between model concordance and survival remained independent of other clinical factors, including disease origin, stage III subtype, age, and sex. The results shown in Table 9 demonstrated that model concordance was a strong independent predictor of survival, with a hazard ratio (HR) of 0.67 (95% CI, 0.60–0.76, p < 0.001), indicating that patients in the concordant group had a substantially lower risk of adverse outcomes. In addition, several other covariates also exhibited statistically significant associations with survival. Specifically, patients with recurrent or progressive disease had a higher risk of adverse outcomes compared to those with de novo stage III disease (HR 1.12, 95% CI 1.01–1.24, p = 0.03), and higher stage III subtypes were linked to progressively higher risk (stage IIIB vs. IIIA: HR 1.18, 95% CI 1.05–1.33, p = 0.007; stage IIIC vs. IIIA: HR 1.42, 95% CI 1.25–1.61, p < 0.001). Age was also a modest but significant risk factor (HR 1.02 per year, 95% CI 1.01–1.03, p < 0.001), whereas sex did not reach statistical significance (HR 1.08, 95% CI 0.97–1.20, p = 0.15). These results suggest that while model concordance remains the strongest independent predictor, survival outcomes are also influenced by disease characteristics and patient age.
Table 9.
Multivariable Cox proportional hazards analysis of factors associated with survival in non-MDT patients.
| Variable | HR | 95% CI | p-value |
|---|---|---|---|
| Model concordant | 0.67 | 0.60–0.76 | < 0.001 |
| Disease origin (recurrent/progression vs. de novo) | 1.12 | 1.01–1.24 | 0.03 |
| Stage IIIB vs. IIIA | 1.18 | 1.05–1.33 | 0.007 |
| Stage IIIC vs. IIIA | 1.42 | 1.25–1.61 | < 0.001 |
| Age (per year) | 1.02 | 1.01–1.03 | < 0.001 |
| Sex (male vs. female) | 1.08 | 0.97–1.20 | 0.15 |
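The hazard ratios in Table 9 can be read with a little arithmetic, sketched below. The helper names are hypothetical, and recovering the standard error from the CI assumes a Wald interval on the log scale with z = 1.96, which is the usual form for Cox regression output but is not stated in the text. Because the reported values are rounded, the derived quantities are approximate.

```python
import math

def hazard_ratio_over(hr_per_unit, units):
    """Hazard ratio implied for a difference of several units of a covariate."""
    return hr_per_unit ** units

def log_hr_and_se(hr, ci_low, ci_high, z=1.96):
    """Recover the log hazard ratio and its standard error from a Wald CI."""
    beta = math.log(hr)
    std_err = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, std_err

# Age: HR 1.02 per year implies roughly 1.22 for a 10-year age difference.
age_10y = hazard_ratio_over(1.02, 10)

# Model concordance: HR 0.67 (95% CI 0.60-0.76) from Table 9.
beta, std_err = log_hr_and_se(0.67, 0.60, 0.76)
hazard_reduction = 1 - 0.67  # about a 33% lower hazard in the concordant group

print(round(age_10y, 3), round(beta, 3), round(std_err, 3), round(hazard_reduction, 2))
```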
Together, the Kaplan–Meier analysis and multivariable Cox regression provide complementary evidence that model concordance is associated with markedly improved long-term survival. This underscores the potential utility of the model not only in predicting outcomes but also in stratifying patients for prognostic assessment, highlighting its clinical relevance in guiding management strategies.
Model interpretation
This study elucidates the model results at the word, phrase, and sentence levels, aiming to explore and reveal the rationale for treatment recommendations. During the interpretation process, we randomly selected two samples under different treatment plans to present the model interpretation results. Following the approach described in Sect. 3.4, the obtained key words, phrases, and sentences are demonstrated in Figs. 9–10.
Fig. 9.
The explanation for the treatment of chemotherapy + immunotherapy/targeted therapy.
Fig. 10.
The explanation for the treatment of surgery after neoadjuvant therapy.
Figs. 9–10 provide a comprehensive insight into the model’s interpretability. The figures first present the important words and sentences based on the attention values. They then highlight the key words from the last two layers obtained using attention flows (although the study considered key words from the last five layers, only the last two are displayed due to space limitations). Finally, the key phrases and important sentences are used to elucidate treatment options, such as the potential for surgery, the need for adjuvant therapy, and the suitability for immunotherapy. In summary, Figs. 9–10 make the basis of the model’s recommendations clear. These key sentences and phrases give decision-makers a reasonable basis for understanding the model’s results, provide valuable insights into the patient’s condition, and support the treatment recommended by the model.
Ablation study
To demonstrate the superior performance of the model and showcase the roles of each module in prediction, we conducted ablation experiments. That is to say, we examined the advantages of dual-level embedding, the strengths of knowledge-enhanced word embedding, and the benefits of splicing word embedding and sentence embedding through attention mechanisms. The observed performance improvements are based on the current dataset.
Effect of dual-level embedding
In the embedding representation of clinical texts, we employed a dual-level embedding approach. To investigate the impact of dual-level embeddings on recommendation performance, we evaluated the model under two scenarios: utilizing only word-level embedding (SMedBERT) and employing only sentence-level embedding. The comparative results are presented in Table 10, which reports the model’s performance for each category as well as the overall performance under macro-average metrics. Comparing Table 10 with Table 6 reveals that prediction performance is poorer for categories such as chemotherapy and radiotherapy when either sentence or word embeddings are used alone. This is primarily attributed to insufficient representation of the information in the text, which leads to suboptimal performance on these easily confused categories. Comparing Table 10 with Table 5 shows that, under macro-average evaluation metrics, the dual-level embedding approach improves each metric by roughly 5 to 7 percentage points over either single embedding method. This further supports the view that dual-level embeddings represent textual information more comprehensively and offer clear advantages over single-level embedding.
Table 10.
Effect of dual-level embedding.
| Methods | Category | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Only word embedding by SMedBERT | SUR | 0.9460 (0.9254–0.9665) | 0.9199 (0.8634–0.9763) | 0.8480 (0.7995–0.8965) | 0.8818 (0.8417–0.9219) |
| | SURANT | 0.9373 (0.9107–0.9639) | 0.8750 (0.8156–0.9344) | 0.8279 (0.8073–0.8485) | 0.8503 (0.8183–0.8824) |
| | CHE | 0.9442 (0.9387–0.9496) | 0.6085 (0.5145–0.7024) | 0.7459 (0.6544–0.8374) | 0.6694 (0.5797–0.7591) |
| | RAD | 0.9477 (0.9346–0.9607) | 0.5553 (0.4491–0.6615) | 0.7022 (0.5979–0.8066) | 0.6189 (0.5184–0.7194) |
| | CHE (S/C) | 0.9512 (0.9418–0.9605) | 0.8417 (0.7971–0.8863) | 0.9144 (0.8732–0.9557) | 0.8764 (0.8371–0.9156) |
| | CHE + IMM/TAR | 0.9634 (0.9345–0.9923) | 0.9691 (0.9087–1.0000) | 0.8610 (0.8016–0.9203) | 0.9113 (0.8599–0.9627) |
| | Overall | 0.8448 (0.8097–0.8800) | 0.7949 (0.7455–0.8443) | 0.8166 (0.7769–0.8562) | 0.8013 (0.7562–0.8465) |
| Only sentence embedding | SUR | 0.9457 (0.9250–0.9663) | 0.9191 (0.8631–0.9750) | 0.8458 (0.7954–0.8962) | 0.8802 (0.8401–0.9203) |
| | SURANT | 0.9335 (0.9028–0.9642) | 0.8668 (0.8027–0.9309) | 0.8212 (0.7827–0.8597) | 0.8428 (0.8018–0.8838) |
| | CHE | 0.9421 (0.9367–0.9476) | 0.5974 (0.4773–0.7174) | 0.7459 (0.6544–0.8374) | 0.6618 (0.5551–0.7684) |
| | RAD | 0.9544 (0.9400–0.9687) | 0.6176 (0.5606–0.6746) | 0.7022 (0.5979–0.8066) | 0.6552 (0.5928–0.7176) |
| | CHE (S/C) | 0.9491 (0.9398–0.9584) | 0.8409 (0.7975–0.8843) | 0.9078 (0.8813–0.9342) | 0.8729 (0.8406–0.9052) |
| | CHE + IMM/TAR | 0.9562 (0.9259–0.9866) | 0.9320 (0.8786–0.9854) | 0.8600 (0.8013–0.9187) | 0.8944 (0.8404–0.9485) |
| | Overall | 0.8405 (0.8036–0.8773) | 0.7956 (0.7492–0.8420) | 0.8138 (0.7726–0.8551) | 0.8012 (0.7567–0.8457) |
Effect of splicing through attention mechanisms
In the context of dual-level embeddings, we employed an attention mechanism to merge the two levels of embeddings rather than directly concatenating them. This approach serves two purposes: it facilitates the interpretation of recommendation results, and it enhances the efficiency of information integration. To investigate the impact of attention-based integration compared to direct concatenation, we compared their performance in the model. Table 11 presents the model’s prediction performance across different categories, and its overall performance under macro-average metrics, when using direct splicing. Comparing this with Table 6 shows that introducing the attention mechanism improves prediction performance for most categories (SURANT, CHE, CHE (S/C), and CHE + IMM/TAR), while the results for SUR and RAD are essentially unchanged. Further comparison with Table 5 indicates an overall performance improvement of approximately 1% due to the introduction of the attention mechanism. These results support the advantage of the attention mechanism in integrating information from the two hierarchical levels more effectively.
Table 11.
Effect of splicing through attention mechanisms.
| Methods | Category | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Direct splicing | SUR | 0.9548 (0.9341–0.9754) | 0.9230 (0.8686–0.9774) | 0.8857 (0.8333–0.9381) | 0.9032 (0.8632–0.9433) |
| | SURANT | 0.9478 (0.9184–0.9772) | 0.9002 (0.8295–0.9710) | 0.8557 (0.8108–0.9006) | 0.8768 (0.8294–0.9241) |
| | CHE | 0.9582 (0.9444–0.9720) | 0.6823 (0.5468–0.8179) | 0.8396 (0.7607–0.9185) | 0.7509 (0.6418–0.8600) |
| | RAD | 0.9704 (0.9580–0.9827) | 0.7262 (0.6400–0.8124) | 0.8511 (0.7989–0.9033) | 0.7820 (0.7282–0.8358) |
| | CHE (S/C) | 0.9704 (0.9580–0.9827) | 0.9216 (0.8799–0.9634) | 0.9302 (0.9087–0.9517) | 0.9257 (0.8999–0.9514) |
| | CHE + IMM/TAR | 0.9686 (0.9416–0.9957) | 0.9497 (0.9013–0.9982) | 0.9034 (0.8444–0.9624) | 0.9258 (0.8750–0.9766) |
| | Overall | 0.8851 (0.8494–0.9208) | 0.8505 (0.8107–0.8904) | 0.8776 (0.8448–0.9105) | 0.8607 (0.8231–0.8983) |
Effect of knowledge infusion
In the word-level embedding representation, a knowledge graph (CMeKG) is utilized to enhance the capabilities of the BERT model. To study the impact of knowledge infusion on model performance, we compared the results with and without the incorporation of the knowledge graph. Table 12 presents the model’s prediction performance across categories, and its overall performance under macro-average metrics, in the absence of a knowledge graph. Compared with Table 5, knowledge injection yields an overall performance improvement of approximately 1%. Although this improvement may seem modest, contrasting Table 12 with Table 6 reveals a clearer gain for radiotherapy (F1-score 0.7820 with the knowledge graph versus 0.7512 without), whereas chemotherapy does not show the same gain (F1-score 0.7635 versus 0.7777). This suggests that knowledge infusion can enhance the word representation method and provide more accurate information for the model, although the benefit is not uniform across categories.
Table 12.
Effect of knowledge infusion.
| Method | Category | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Word embedding by BERT without knowledge graph | SUR | 0.9547 (0.9315–0.9779) | 0.9241 (0.8731–0.9751) | 0.8897 (0.8166–0.9628) | 0.9050 (0.8669–0.9431) |
| | SURANT | 0.9477 (0.9212–0.9742) | 0.8886 (0.8222–0.9550) | 0.8692 (0.8391–0.8993) | 0.8782 (0.8387–0.9176) |
| | CHE | 0.9634 (0.9546–0.9722) | 0.7256 (0.6156–0.8356) | 0.8436 (0.7806–0.9067) | 0.7777 (0.7045–0.8508) |
| | RAD | 0.9651 (0.9519–0.9783) | 0.6743 (0.6027–0.7458) | 0.8511 (0.7989–0.9033) | 0.7512 (0.6981–0.8044) |
| | CHE (S/C) | 0.9651 (0.9574–0.9728) | 0.9194 (0.9028–0.9361) | 0.9017 (0.8821–0.9214) | 0.9104 (0.9020–0.9187) |
| | CHE + IMM/TAR | 0.9703 (0.9451–0.9954) | 0.9611 (0.9141–1.0000) | 0.9069 (0.8545–0.9592) | 0.9329 (0.8881–0.9777) |
| | Overall | 0.8831 (0.8573–0.9090) | 0.8489 (0.8145–0.8832) | 0.8770 (0.8517–0.9024) | 0.8592 (0.8281–0.8904) |
In summary, through the adoption of dual-level embedding and knowledge injection in the model’s architecture, along with the incorporation of attention mechanisms, we successfully develop a rational and effective model that can efficiently process information and achieve remarkable performance improvements in the treatment recommendation task.
Conclusions
Lung cancer, as the leading cause of cancer death worldwide, imposes a heavy economic burden and serious physical harm on patients. MDTs can provide the best treatment plan for cancer patients and improve their prognosis. However, MDTs consume considerable time and energy of specialist doctors in treatment decisions, which makes them impractical in countries with limited medical resources. This paper develops an interpretable MDT treatment recommendation approach to provide optimal treatment options for patients with highly heterogeneous stage III NSCLC. A dual-level embedding approach that integrates word embedding and sentence embedding for clinical text representation is proposed. A comparative analysis with word-embedding-only and sentence-embedding-only variants further illustrates that the proposed approach considers both local and global information and represents clinical text more adequately. Subsequently, taking into account the heterogeneity among patients, the basis for treatment decision-making is dissected in depth at three levels (word, phrase, and sentence) to identify the key factors influencing treatment. Specifically, attention flow is employed to consider the relationship between attentions across multiple layers, and important words and sentences are screened out through attention mechanisms. Finally, using stage III NSCLC in China as an example, the results indicate that the proposed model closely reproduces MDT treatment recommendations and that the recommended treatments show potential to be associated with improved survival prognosis.
This study also has several limitations. First, the evaluation of the proposed model was restricted to treatment decision-making for stage III NSCLC, and only textual information extracted from EMRs was included. Second, as this is a retrospective study, unmeasured confounding factors that could not be fully controlled may have influenced the observed results. Third, the dataset was derived from only two hospitals, which may limit the external validity and generalizability of the findings. Finally, the distribution of treatment categories was imbalanced in some cases, which might have affected the robustness of the model’s performance.
Future studies will aim to address these limitations. First, expanding the model evaluation to additional cancer types and incorporating structured and unstructured EMR data will broaden the applicability and improve decision support. Second, prospective study designs and advanced statistical methods for confounding control (e.g., propensity score matching or causal inference techniques) should be employed to mitigate the influence of unmeasured confounders. Third, validation on multi-center, large-scale, and geographically diverse datasets is necessary to enhance the generalizability and external validity of the model. Finally, data balancing techniques and more representative sampling strategies will be explored to reduce treatment category imbalances and further improve model stability and fairness.
Author contributions
Ziyu Chen: Conceptualization, Methodology, Writing-Original Draft. Naijie Chai: Visualization, Investigation, Writing-Review & Editing. Jianqiang Wang: Writing, Reviewing, Editing-Supervision. Xiaokang Wang: Data Curation, Software, Visualization-Review & Editing. Zhenqian Fan: Project administration, Writing-Review & Editing.
Funding
The authors thank the editors and anonymous reviewers for their helpful comments and suggestions. This work was supported by the Natural Science Foundation of Shaanxi Province (Nos. 2025JC-YBQN-996 and 2025JC-YBQN-988), the Research Project of Humanities and Social Sciences of the Ministry of Education (Nos. 25YJC630007 and 25YJCZH011), and the Postdoctoral Fellowship Program of CPSF under Grant Number GZC20251119.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA-CANCER J. CLIN.71, 209–249 (2021). [DOI] [PubMed] [Google Scholar]
- 2.Liao, H., Long, Y., Tang, M., Streimikiene, D. & Lev, B. Early lung cancer screening using double normalization-based multi-aggregation (DNMA) and Delphi methods with hesitant fuzzy information. COMPUT. IND. ENG.136, 453–463 (2019). [Google Scholar]
- 3.Chinese Anti-Cancer Association. o. L. C. S. & lung cancer group of oncology Branch, C. M. A. Chinese expert consensus on the multidisciplinary clinical diagnosis and treatment of stage Ⅲ non-small cell lung cancer (2019). Chin. J. Oncol.41, 881–890 (2019). [DOI] [PubMed] [Google Scholar]
- 4.Sesen, M. B. et al. Lung cancer assistant: a hybrid clinical decision support application for lung cancer care. J. R SOC. INTERFACE. 11, 20140534 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Hou, G. et al. Deep learning approach for predicting lymph node metastasis in non-small cell lung cancer by fusing image–gene data. Eng. Appl. Artif. Intell.122, 106140 (2023). [Google Scholar]
- 6.Spicer, J. D. et al. Neoadjuvant and adjuvant treatments for early stage resectable NSCLC: consensus recommendations from the international association for the study of lung cancer. J. THORAC. ONCOL.19 (10), 1373–1414 (2024). [DOI] [PubMed] [Google Scholar]
- 7.Batra, U., Munshi, A., Kabra, V. & Momi, G. Relevance of multi-disciplinary team approach in diagnosis and management of stage III NSCLC. Indian J. Cancer. 59, S46–S55 (2022). [DOI] [PubMed] [Google Scholar]
- 8.Zhong, W. Z. et al. Chinses expert consensus on the multidisciplinary team diagnosis and treatment of lung cancer. Chin. J. Oncol.42, 817–828 (2020). [DOI] [PubMed] [Google Scholar]
- 9.Lv, Y., Yang, J., Chen, Y. & Jiang, L. Investigation and analysis of the status quo of multidisciplinary collaborative diagnosis and treatment model in grade III hospitals in China. Chin. Hosp.25, 21–23 (2021). [Google Scholar]
- 10.Hao, J. et al. Status quo and analysis of quality management of multidisciplinary diagnosis and treatment for tumors in comprehensive hospitals. Mod. Hosp. Manage.20, 28–31 (2022). [Google Scholar]
- 11.Maguire, F. B. et al. A text-mining approach to obtain detailed treatment information from free-text fields in population-based cancer registries: A study of non-small cell lung cancer in California. PLoS One 14, e0212454 (2019).
- 12.Bouaud, J. et al. Implementation of an ontological reasoning to support the guideline-based management of primary breast cancer patients in the DESIREE project. Artif. Intell. Med. 108, 101922 (2020).
- 13.Kouz, H., Bouaud, J., Guézennec, G. & Séroussi, B. From atomic guideline-based recommendations to complete therapeutic care plans: A knowledge-based approach applied to breast cancer management. Stud. Health Technol. Inf. 275, 107–111 (2020).
- 14.Becker, M., Kasper, S., Böckmann, B., Jöckel, K. H. & Virchow, I. Natural language processing of German clinical colorectal cancer notes for guideline-based treatment evaluation. Int. J. Med. Inf. 127, 141–146 (2019).
- 15.Simon, G. et al. Applying artificial intelligence to address the knowledge gaps in cancer care. Oncologist 24, 772–782 (2019).
- 16.Werra, L., Schöngens, M., Uzun, E. D. G. & Eickhoff, C. Generative adversarial networks in precision oncology, in Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, 145–148 (2019).
- 17.Feuerriegel, S. et al. Causal machine learning for predicting treatment outcomes. Nat. Med. 30, 958–968 (2024).
- 18.Ioannidis, J. P. A. Why most clinical research is not useful. PLoS Med. 13, e1002049 (2016).
- 19.Greenhalgh, T. How to Read a Paper: The Basics of Evidence-Based Medicine 152–168 (BMJ Books / John Wiley & Sons, 2014).
- 20.Mateu-Sanz, M. et al. Redefining biomaterial biocompatibility: challenges for artificial intelligence and text mining. Trends Biotechnol. 42 (4), 402–417 (2024).
- 21.Lin, F. P., Pokorny, A., Teng, C., Dear, R. & Epstein, R. J. Computational prediction of multidisciplinary team decision-making for adjuvant breast cancer drug therapies: a machine learning approach. BMC Cancer 16, 929 (2016).
- 22.Zeng, J. et al. Natural language processing to identify cancer treatments with electronic medical records. JCO Clin. Cancer Inform. 5, 379–393 (2021).
- 23.Zeng, J., Gensheimer, M. F., Rubin, D. L., Athey, S. & Shachter, R. D. Uncovering interpretable potential confounders in electronic medical records. Nat. Commun. 13, 1014 (2022).
- 24.Wang, L., Tang, R., He, X. & He, X. Hierarchical imitation learning via subgoal representation learning for dynamic treatment recommendation, in Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 1081–1089 (2022).
- 25.Wang, L., Zhang, W., He, X. & Zha, H. Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2447–2456 (2018).
- 26.Naumzik, C., Feuerriegel, S. & Nielsen, A. M. Data-driven dynamic treatment planning for chronic diseases. Eur. J. Oper. Res. 305, 853–867 (2023).
- 27.Sun, X. et al. Effective treatment recommendations for type 2 diabetes management using reinforcement learning: treatment recommendation model development and validation. J. Med. Internet Res. 23, e27858 (2021).
- 28.Najafabadipour, M., Tuñas, J. M., Rodríguez-González, A. & Menasalvas, E. Analysis of electronic health records to identify the patient’s treatment lines: Challenges and opportunities, in Artificial Intelligence XXXVI (eds Bramer, M. & Petridis, M.) 437–442 (Springer International Publishing).
- 29.Solarte-Pabón, O. et al. Extracting cancer treatments from clinical text written in Spanish: A deep learning approach, in IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), 1–6 (2021).
- 30.Zhang, X. et al. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int. J. Med. Inf. 132, 103985 (2019).
- 31.Min, X., Li, W., Yang, J., Xie, W. & Zhao, D. Dual-level diagnostic feature learning with recurrent neural networks for treatment sequence recommendation. J. Biomed. Inf. 134, 104165 (2022).
- 32.Li, R., Zhao, X. & Moens, M. F. A brief overview of universal sentence representation methods: A linguistic view. ACM Comput. Surv. 55, Article 56 (2022).
- 33.Jiang, Y. et al. Predicting peritoneal recurrence and disease-free survival from CT images in gastric cancer with multitask deep learning: A retrospective study. Lancet Digit. Health 4, e340–e350 (2022).
- 34.Tamburini, N. et al. Multidisciplinary management improves survival at 1 year after surgical treatment for non-small-cell lung cancer: A propensity score-matched study. Eur. J. Cardio-Thorac. Surg. 53, 1199–1204 (2018).
- 35.Lucarini, A. et al. From cure to care: the role of the multidisciplinary team on colorectal cancer patients’ satisfaction and oncological outcomes. J. Multidiscip. Healthc. 15, 1415–1426 (2022).
- 36.Koco, L. et al. The effects of multidisciplinary team meetings on clinical practice for colorectal, lung, prostate and breast cancer: A systematic review. Cancers 13, 4159 (2021).
- 37.Hendriks, M. P. et al. Clinical decision trees support systematic evaluation of multidisciplinary team recommendations. Breast Cancer Res. Treat. 183, 355–363 (2020).
- 38.Yan, Z., Jian, C., Kunwei, S., Xiaosong, C. & Siji, Z. A multi-disciplinary medical treatment decision support system with intelligent treatment recommendation, in 2nd IEEE International Conference on Computer and Communications (ICCC), 838–842 (2016).
- 39.Zhu, N., Cao, J., Shen, K., Chen, X. & Zhu, S. A decision support system with intelligent recommendation for multi-disciplinary medical treatment. ACM Trans. Multimed. Comput. Commun. Appl. 16, 1–33 (2020).
- 40.Andrew, T. W. et al. Machine-learning algorithm to predict multidisciplinary team treatment recommendations in the management of basal cell carcinoma. Br. J. Cancer 126, 562–568 (2022).
- 41.Cypko, M. A. et al. Validation workflow for a clinical Bayesian network model in multidisciplinary decision making in head and neck oncology treatment. Int. J. Comput. Assist. Radiol. Surg. 12, 1959–1970 (2017).
- 42.She, Y. et al. Development and validation of a deep learning model for non-small cell lung cancer survival. JAMA Netw. Open 3, e205842 (2020).
- 44.Ettinger, D. S. et al. Non-small cell lung cancer, version 3.2022, NCCN clinical practice guidelines in oncology. J. Natl. Compr. Canc. Netw. 20, 497–530 (2022).
- 45.Zhang, T. et al. SMedBERT: A knowledge-enhanced pre-trained language model with structured semantics for medical text mining, in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 5882–5893 (2021).
- 46.Zan, H. et al. Construction of Chinese medical knowledge graph based on multi-source corpus. J. Zhengzhou Univ. (Nat. Sci. Ed.) 52, 45–51 (2020).
- 47.Lin, Y., Liu, Z., Sun, M., Liu, Y. & Zhu, X. Learning entity and relation embeddings for knowledge graph completion, in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2181–2187 (2015).
- 48.Page, L., Brin, S., Motwani, R. & Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web (Stanford InfoLab, 1999).
- 49.Long, D., Zhang, Y., Xu, G. & Xie, P. Retrieval oriented masking pre-training language model for dense passage retrieval. Preprint at https://doi.org/10.48550/arXiv.2210.15133 (2022).
- 50.Long, D. et al. Multi-CPR: A multi domain Chinese dataset for passage retrieval, in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3046–3056 (2022).
- 51.Trigueros, O., Blanco, A., Lebeña, N., Casillas, A. & Pérez, A. Explainable ICD multi-label classification of EHRs in Spanish with convolutional attention. Int. J. Med. Inf. 157, 104615 (2022).
- 52.Kim, Y. Convolutional neural networks for sentence classification, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751 (2014).
- 53.DeRose, J. F., Wang, J. & Berger, M. Attention flows: analyzing and comparing attention mechanisms in language models. IEEE Trans. Vis. Comput. Graph. 27, 1160–1170 (2021).
Data Availability Statement
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.