Journal of Translational Medicine
2025 Aug 5;23:864. doi: 10.1186/s12967-025-06795-7

DrugBERT: a BERT-based approach integrating LDA topic embedding and efficacy-aware mechanism for predicting anti-tumor drug efficacy

Weiwei Zhu 1,2, Xiaodong Jiang 3, Lei Zhang 4, Peng Zhou 5, Xinping Xie 6, Hongqiang Wang 2,
PMCID: PMC12326809  PMID: 40764962

Abstract

Background

Due to the complexity of tumor genetic heterogeneity, personalized medicine has progressively become a central focus of cancer research. However, accurately predicting a patient's drug response before treatment remains a critical challenge for the development of this field.

Methods

This paper proposes DrugBERT, a BERT-based framework that integrates LDA topic embedding and a drug efficacy-aware mechanism for predicting the efficacy of antitumor drugs. The method incorporates LDA-generated topic embeddings as a semantic enhancement module into the BERT language model and introduces a drug efficacy-aware attention mechanism to prioritize drug efficacy-related semantic features. An LSTM is employed to capture long-range dependencies in clinical text data. In addition, the SMOTE algorithm is used to synthesize minority-class samples to address data imbalance.

Results

The proposed method, DrugBERT, demonstrated remarkable performance on a dataset of 958 non-small cell lung cancer patients treated with antitumor drugs. Furthermore, when validated on an independent dataset of 266 bowel cancer patients, the model achieved a 3% improvement in AUC over previous methods, signifying its robust generalization capability.

Conclusions

DrugBERT can help predict the efficacy of antitumor drugs based on clinical text while exhibiting strong generalization capability. These findings highlight its potential for optimizing personalized therapeutic strategies through language models.

Keywords: Drug efficacy prediction, LDA topic embedding, BERT, Self-attention mechanism, Clinical text data

Background

Precision medicine can develop individualized medication regimens for each patient based on their unique characteristics [1]. Recent advances in pharmacogenomics have generated extensive genomic data, providing rich resources for constructing drug efficacy prediction models [2]. Although these efforts have achieved some progress in laboratory or clinical practice, the high cost of genomic data measurement remains a limiting factor. In contrast, clinical data are low-cost and abundant, particularly radiomic reports such as computed tomography (CT), magnetic resonance imaging (MRI), and B-mode ultrasound (BU) [3]. These data contain rich clinical knowledge and patient information, holding significant value for disease diagnosis, treatment, and prevention. In previous work, we proposed a series of clinical text-based drug efficacy prediction methods [4, 5], which demonstrated good predictive capabilities. However, their performance and generalizability require further improvement.

The advent of large language models (LLMs) has introduced novel opportunities to advance oncology research frameworks [6]. These deep learning-based models, pretrained on extensive datasets, demonstrate superior performance in diverse natural language processing (NLP) tasks. Crucially, their attention mechanisms enable selective focus on the most relevant input components, enhancing both task-specific accuracy and generalization capabilities [7]. Researchers have adapted LLMs to address varied medical applications [8]. Santos et al. [9] developed PathologyBERT, a model tailored for analyzing pathology reports from Emory University Hospital, which achieves promising diagnostic accuracy in breast cancer classification using 347,173 reports for training, validation, and testing. Similarly, Zhang et al. [10] proposed a BERT-based BiLSTM-Transformer network for lung cancer screening and staging via 359 Chinese CT reports, demonstrating robust clinical entity extraction with a macro-F1 score of 85.96% and a micro-F1 score of 90.67% under strict exact-match criteria. The increasing feasibility and utility of LLMs in medicine enable them to identify hidden patterns and correlations within healthcare big data. This capability assists researchers in interpreting complex medical information, predicting patient outcomes, and facilitating the development of personalized treatment plans.

This paper proposes DrugBERT, a radiomic text-based framework for predicting anticancer drug efficacy. The method improves BERT [11] text embedding by incorporating drug efficacy keyword topic information, employs attention-enhanced mechanisms to prioritize inter-keyword relationships, and integrates LSTM [12] for optimized prediction. The contributions of our work are as follows: (1) LDA-generated topic embeddings are injected as a semantic enhancement module into BERT’s input layer, effectively integrating topic information of drug efficacy keywords. (2) We propose a drug efficacy-aware attention mechanism that enables the model to focus on drug efficacy-related semantic features in text sequences, thereby improving the accuracy of drug response prediction. (3) The model uses an LSTM to better capture long-range dependencies, providing a feasible approach for modeling long sequences on small-scale clinical datasets.

Methods

Overview of DrugBERT

Figure 1 illustrates the architecture of the DrugBERT model based on the BERT language model. The workflow comprises the following steps: First, the special tokens [CLS] and [SEP] are used to demarcate different semantic segments. Second, the model incorporates prior knowledge of drug efficacy representation: drug-effect-related features are extracted using a Latent Dirichlet Allocation (LDA) topic model [13], and this domain knowledge is fused with the original word vectors through weighted integration of topic embeddings, generating a clinically enhanced text representation. Third, we design a drug efficacy-aware multi-head self-attention mechanism and incorporate it to compute novel attention scores [14], preparing the input for the subsequent prediction layer. We name this model the “Improved BERT”. Finally, the sequence processing capability of the LSTM model is integrated to obtain the final prediction results. Each of these key components is explained in detail in the following sections.

Fig. 1.

Fig. 1

Architecture of the Antitumor Drug Efficacy Prediction Model (DrugBERT). The input layer of the model incorporates LDA embeddings, while the attention mechanism integrates drug efficacy keyword-specific attention scores. The input representations undergo sequential processing through L identical structural layers before advancing to subsequent prediction layers. Then, the features are processed through an LSTM model, followed by a fully connected layer to classify drug efficacy labels

Constructing LDA model for clinical text data

We encode the text data for each patient based on LDA: Assuming $M$ patients, $N$ observable words and $K$ potential topics, the examination text for each patient can be represented as $W = \{W_1, W_2, \ldots, W_M\}$, where $W_j$ ($1 \le j \le M$) denotes the $j$th patient’s text and $w_t$ denotes the $t$th word of a text; the topic set can be denoted as $Z = \{z_1, z_2, \ldots, z_K\}$, where $z_i$ denotes the $i$th topic. The joint probability distribution of all variables can be calculated according to Eq. (1). The probability relationship of the variables is shown in Fig. 2.

Fig. 2.

Fig. 2

The Architecture of the LDA model. α: the hyperparameter of the prior Dirichlet distribution for each topic distribution; β: the hyperparameter of the prior Dirichlet distribution for each topic word distribution; θ: the probability distribution of topics; φ: the probability distribution of words; 𝑧: the hidden topics; w: the observable words in the document

$$p(W_j, z, \theta, \varphi \mid \alpha, \beta) = p(\theta \mid \alpha)\, p(\varphi \mid \beta) \prod_{t=1}^{N_j} p(z_t \mid \theta)\, p(w_t \mid z_t, \varphi) \quad (1)$$

Here $p(W_j \mid \alpha, \beta)$ denotes the probability that a document will appear; $p(\theta \mid \alpha)$ denotes the Dirichlet distribution of documents over topics; $p(\varphi \mid \beta)$ denotes the Dirichlet distribution of topics over words; $p(z_t \mid \theta)$ denotes the probability of a topic appearing given a document; and $p(w_t \mid z_t, \varphi)$ denotes the probability of a word appearing in a given topic.

We select the optimal number of hidden topics by the perplexity of the model [15], examining the perplexity–topic-number curve. The perplexity is calculated as:

$$\text{perplexity} = \exp\!\left(-\frac{\sum_{i=1}^{M} \log p(w_i)}{\sum_{i=1}^{M} N_i}\right) \quad (2)$$

where $N_i$ denotes the total number of words in the $i$th text document and $p(w_i)$ represents the probability of the word vector $w_i$. The lower the perplexity in Eq. (2), the better the model.
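As a concrete illustration, Eq. (2) reduces to a few lines of numpy once per-document log-likelihoods are available; the sketch below uses hypothetical likelihood values (the `candidates` dict and `doc_lengths` array are invented for illustration) to pick the topic count with the lowest perplexity.

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """Eq. (2): exp(-sum_i log p(w_i) / sum_i N_i)."""
    return float(np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths)))

# Hypothetical corpus log-likelihoods under LDA models with different
# topic counts K; the model with the lowest perplexity is selected.
doc_lengths = np.array([120, 95, 140])              # words per report
candidates = {
    8:  np.array([-610.0, -480.0, -700.0]),
    16: np.array([-560.0, -440.0, -650.0]),
    32: np.array([-580.0, -455.0, -670.0]),
}
perp = {k: perplexity(ll, doc_lengths) for k, ll in candidates.items()}
best_k = min(perp, key=perp.get)                    # lowest perplexity wins
```

With these toy numbers the curve bottoms out at K = 16, mirroring the selection procedure described above.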

Integrated LDA topic embedding layer

The clinical radiomic text data comprise tokenized and stopword-removed word sequences for each patient. To construct the model input, the input sequence format is structured as follows:

$$\text{Input} = [\mathrm{CLS}],\, w_1,\, w_2,\, \ldots,\, w_n,\, [\mathrm{SEP}] \quad (3)$$

The special tokens [CLS] (indicating sentence initiation) and [SEP] (delimiting sentence boundaries) are predefined in the BERT architecture. The BERT embedding module synthesizes three types of embeddings—token, position, and segment embeddings—whose summation generates the vector representation for each token: token embeddings map vocabulary terms to semantic vector spaces, encoding lexical meaning; position embeddings inject sequential ordering of tokens; and segment embeddings differentiate between distinct sentences within the input text. Collectively, these embeddings produce contextualized representations of the input sequence.

The LDA topic model is used to derive both the topic probability distribution and the word probability distribution of each topic within the text dataset. By combining the probabilities of each word across K topics, we obtain a K-dimensional embedding representation of the word, as shown in Fig. 3. Furthermore, this embedding is mapped to the same dimension as the original BERT input through a fully connected layer.

Fig. 3.

Fig. 3

Embedding Representation Based on LDA Model. The probability distribution values of each keyword across all K topics are concatenated to construct a 1×K-dimensional topic embedding representation

To incorporate the hidden topic information, this study introduces an LDA topic embedding layer based on the original embedding module, which is used to inject drug efficacy keyword information into each token in the input sequence. The final input embedding vector is defined as:

$$E_{\mathrm{final}} = E_{\mathrm{token}} + E_{\mathrm{segment}} + E_{\mathrm{position}} + E_{\mathrm{LDA}}(w) \quad (4)$$

where $E_{\mathrm{LDA}}(w)$ denotes the LDA embedding vector corresponding to word $w$, provided that $w$ belongs to the Drug Efficacy-Related Keyword Repository (DEKR, defined in the next subsection). The final embedding vector is denoted as $E_{\mathrm{final}}$, while words outside the DEKR remain unprocessed by LDA embedding, as illustrated in Fig. 4.
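A minimal numpy sketch of the summation in Eq. (4), under toy assumptions (hidden size 8, K = 4 topics; random matrices stand in for the trained BERT embeddings and the trained fully connected projection, and the `dekr` dictionary is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, K = 8, 4                       # toy hidden size and topic count

# Hypothetical per-token BERT embeddings (token + position + segment).
tokens = ["[CLS]", "lesion", "enlarged", "lung", "[SEP]"]
e_token = rng.normal(size=(len(tokens), d_model))
e_pos   = rng.normal(size=(len(tokens), d_model))
e_seg   = rng.normal(size=(len(tokens), d_model))

# Hypothetical DEKR with K-dim topic embeddings, projected to d_model by
# a fully connected layer (random weights stand in for trained ones).
dekr = {"lesion": rng.random(K), "enlarged": rng.random(K)}
W_fc = rng.normal(size=(K, d_model))

e_final = e_token + e_pos + e_seg       # standard BERT input sum
for i, w in enumerate(tokens):
    if w in dekr:                       # Eq. (4): add LDA embedding for DEKR words
        e_final[i] += dekr[w] @ W_fc
```

Only tokens found in the DEKR receive the extra LDA term; all others keep the plain three-way BERT sum.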

Fig. 4.

Fig. 4

Integrated drug efficacy-related keyword embedding representation. The original BERT input representation is augmented with drug efficacy keyword LDA topic embeddings. The final input embedding is formulated as the summation of token embeddings, segment embeddings, position embeddings, and LDA topic embeddings

Drug efficacy-Aware attention mechanism

DrugBERT is designed based on the BERT encoder. Each encoder layer contains a drug efficacy-aware multi-head self-attention mechanism [16], a feed-forward network, residual connections, and layer normalization. This module introduces drug efficacy feature attention weights to compute novel attention scores, aiming to strengthen the contextual relationship modeling between drug efficacy-relevant keywords within the input sequence. The conventional multi-head mechanism learns vector representations of tokens from different perspectives through parallel attention heads: each head computes context-aware attention weights via scaled dot-product operations between query (Q), key (K), and value (V) vectors. For the k-th attention head, the self-attention computation is formulated as:

$$A_k = \frac{Q_k K_k^{\top}}{\sqrt{d_k}} + M \quad (5)$$
$$\mathrm{head}_k = \mathrm{softmax}(A_k)\, V_k \quad (6)$$

Here, the scaling factor $\sqrt{d_k}$ (where $d_k$ denotes the dimensionality of the key vectors) is applied to normalize the dot-product matrix $Q_k K_k^{\top}$. Additionally, a mask matrix $M$ guides attention flow toward relevant positions by obscuring invalid tokens (e.g., [PAD] tokens) in the input sequence. Within this global attention mechanism, each token interacts with all other tokens to generate a similarity matrix $A_k$, which captures pairwise contextual dependencies across the sequence.

Additionally, to enable the model to learn relationships among medical entities across the entire input sequence, a drug efficacy-aware attention mechanism is implemented. This mechanism enhances attention weights between drug efficacy-relevant keywords by constraining attention computation to tokens associated with such keywords. Specifically, from the $K$ topics, $m$ topics demonstrating significant drug efficacy relevance are selected, and the top $w$ probability-ranked words of each chosen topic are extracted. After deduplication of the $m \times w$ candidate terms, a Drug Efficacy-Related Keyword Repository (DEKR) containing $n$ unique keywords is constructed. If tokens $\mathrm{word}_i$ and $\mathrm{word}_j$ both belong to the DEKR, mutual attention is permitted. The similarity matrix of this mechanism is derived by augmenting the standard dot-product attention matrix $A_{\mathrm{global}}$ with a local mask matrix $M_{\mathrm{local}}$, formulated as:

$$M_{\mathrm{local}}[i,j] = \begin{cases} 0, & \mathrm{word}_i \in \mathrm{DEKR} \text{ and } \mathrm{word}_j \in \mathrm{DEKR} \\ -\infty, & \text{otherwise} \end{cases} \quad (7)$$
$$A_{\mathrm{local}} = \frac{Q K^{\top}}{\sqrt{d_k}} + M_{\mathrm{local}} \quad (8)$$

Subsequently, a gated mechanism [17] is used to integrate the two similarity matrices: the global attention matrix $A_{\mathrm{global}}$ and the drug efficacy-aware attention matrix $A_{\mathrm{local}}$. This method dynamically adjusts attention scores through adaptive weighting, with the final attention score formulated as:

$$g = \sigma(W_g h + b_g) \quad (9)$$
$$A_{\mathrm{final}} = g \odot A_{\mathrm{global}} + (1-g) \odot A_{\mathrm{local}} \quad (10)$$

Here, $\sigma$ denotes the Sigmoid function [18], $h$ represents the hidden state from the preceding layer, $W_g$ is a linear transformation layer with randomly initialized weights, and $b_g$ denotes the corresponding bias term. All attention heads are concatenated to form the final multi-head output, as shown in Fig. 5.
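The attention computation of Eqs. (5)-(10) can be sketched for a single head in numpy (toy dimensions; random weights stand in for learned parameters, and the padding mask of Eq. (5) is omitted for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def efficacy_aware_attention(Q, K, V, in_dekr, W_g, b_g, h):
    """Single-head sketch of Eqs. (5)-(10): global scores, a
    DEKR-restricted local score matrix, and a sigmoid gate fusing the two."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaled dot products
    a_global = scores                            # Eq. (5), padding mask omitted
    # Eq. (7): attention permitted only between DEKR tokens (-1e9 ~ -inf).
    m_local = np.where(np.outer(in_dekr, in_dekr), 0.0, -1e9)
    a_local = scores + m_local                   # Eq. (8)
    g = 1.0 / (1.0 + np.exp(-(h @ W_g + b_g)))   # Eq. (9): sigmoid gate
    a_final = g * a_global + (1 - g) * a_local   # Eq. (10): adaptive fusion
    return softmax(a_final) @ V                  # Eq. (6) on the fused scores

rng = np.random.default_rng(1)
n, d = 4, 6
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
in_dekr = np.array([False, True, True, False])   # toy DEKR membership flags
h = rng.normal(size=(n, d))
W_g, b_g = rng.normal(size=(d, 1)), 0.0
out = efficacy_aware_attention(Q, K, V, in_dekr, W_g, b_g, h)
```

When the gate saturates toward 1 the head behaves like standard global attention; toward 0 it attends only among DEKR keywords, which is the intended prioritization effect.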

Fig. 5.

Fig. 5

Drug efficacy-Aware Attention Mechanism. Building upon the standard attention mechanism, this module enhances attention scores for drug efficacy relevant keywords and integrates them through a gated adaptive weighting strategy. The final attention output combines both global and domain-specific contextual dependencies

Subsequent operations include a feed-forward network, which comprises two linear transformation layers and a Gaussian Error Linear Unit (GELU) activation [19]. The encoder output is ultimately processed sequentially through L identical encoder layers, each maintaining identical architectural components.

Antitumor drug efficacy prediction based on LSTM

Subsequently, the word embeddings derived from the previous layer are fed into an LSTM model to obtain the final hidden state sequence:

$$H = \mathrm{LSTM}(X) \quad (11)$$

where $H \in \mathbb{R}^{n \times d_h}$ and $d_h$ corresponds to the dimensionality of the LSTM hidden layer.

Finally, the resultant sequence is processed through a fully connected layer and a softmax layer to generate the predicted labels of drug efficacy.
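A single-layer numpy sketch of this prediction head, with random weights standing in for trained LSTM and fully connected parameters (a simplified cell for illustration, not the authors' trained model):

```python
import numpy as np

def lstm_softmax_head(X, d_h, n_classes=2, seed=0):
    """Run a minimal LSTM over the token embeddings (Eq. (11)), then apply
    a fully connected layer + softmax to the last hidden state."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # One combined weight matrix for the input/forget/cell/output gates.
    W = rng.normal(scale=0.1, size=(d + d_h, 4 * d_h))
    b = np.zeros(4 * d_h)
    h, c = np.zeros(d_h), np.zeros(d_h)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for t in range(n):                          # sequential recurrence
        z = np.concatenate([X[t], h]) @ W + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    W_fc = rng.normal(scale=0.1, size=(d_h, n_classes))
    logits = h @ W_fc                           # fully connected layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # softmax class probabilities

probs = lstm_softmax_head(np.random.default_rng(2).normal(size=(10, 8)), d_h=4)
```

The output is a probability vector over the two drug-efficacy labels (responsive / non-responsive).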

Evaluation metrics

Model performance was evaluated using five established metrics: Precision (Prec), Recall (Rec), F1-score, Accuracy (Acc), and the Area Under the Receiver Operating Characteristic curve (AUC-ROC) [20]. Among these, classification performance is positively correlated with the AUC-ROC value, which quantifies the integrated area beneath the ROC curve. The remaining metrics are computed as follows:

$$\mathrm{Prec} = \frac{T_p}{T_p + F_p} \quad (12)$$
$$\mathrm{Rec} = \frac{T_p}{T_p + F_n} \quad (13)$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Prec} \times \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} \quad (14)$$
$$\mathrm{Acc} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \quad (15)$$

where $T_p$ and $T_n$ denote the numbers of correctly predicted positive and negative cases, and $F_p$ and $F_n$ denote the numbers of incorrectly predicted positive and negative cases, respectively.
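Eqs. (12)-(15) follow directly from the four confusion-matrix counts; for example (the counts below are invented for illustration):

```python
def metrics(tp, tn, fp, fn):
    """Eqs. (12)-(15): precision, recall, F1, and accuracy from counts."""
    prec = tp / (tp + fp)                       # Eq. (12)
    rec = tp / (tp + fn)                        # Eq. (13)
    f1 = 2 * prec * rec / (prec + rec)          # Eq. (14)
    acc = (tp + tn) / (tp + tn + fp + fn)       # Eq. (15)
    return prec, rec, f1, acc

# Toy confusion-matrix counts, not taken from the paper's experiments.
prec, rec, f1, acc = metrics(tp=80, tn=15, fp=10, fn=5)
```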

Experimental results

Dataset

To evaluate the proposed method, we collected a lung cancer dataset from the EMR system of a collaborating local hospital. The dataset comprised 958 non-small cell lung cancer (NSCLC) patients receiving first-line platinum-based therapy [21]. Each patient’s data included pre-chemotherapy imaging reports (CT, MRI, BUS) and textual drug efficacy ratings. Originally, tumor response was categorized according to the Response Evaluation Criteria in Solid Tumors (RECIST) [22] into four classes: complete response (CR), partial response (PR), stable disease (SD), or progressive disease (PD). A binary classification scheme was adopted, where CR, PR, and SD were labeled as responsive (1), while PD was labeled as non-responsive (0). The cohort consisted of 691 responsive and 267 non-responsive cases.

Five-fold cross-validation [23] was implemented to evaluate predictive performance, with distinct training/testing splits (764 vs. 194 patients, respectively), as shown in Table 1. To address class imbalance in the training set due to limited non-responsive samples, the Synthetic Minority Oversampling Technique (SMOTE) [24] algorithm was applied to equilibrate class distributions. SMOTE is a classic algorithm used to address the problem of data imbalance in machine learning. Its core idea is to improve the prediction performance of classification models on minority classes by artificially synthesizing minority class samples to increase the proportion of minority classes in the dataset.
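The interpolation step at SMOTE's core can be sketched as follows (a simplified nearest-neighbor variant in numpy, not the imbalanced-learn implementation presumably used in practice; the toy minority-class matrix is invented):

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """For each synthetic sample: pick a minority point, pick one of its
    k nearest minority neighbors, and interpolate at a random fraction."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]           # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                      # interpolation fraction in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Toy non-responder feature matrix; in the paper SMOTE balances the
# 552 responsive vs. 212 non-responsive training cases.
X_min = np.random.default_rng(3).normal(size=(12, 5))
X_new = smote_like_oversample(X_min, n_new=8)
```

Each synthetic point lies on a segment between two existing minority samples, so the oversampled set stays inside the minority class's feature envelope.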

Table 1.

Summary of lung cancer clinical dataset

RECIST class | No. patients | Response to cisplatin
CR/PR/SD | 691 (Training: 552, Test: 139) | responsive
PD | 267 (Training: 212, Test: 55) | non-responsive
Total | 958 (Training: 764, Test: 194) | –

Word embedding

LDA model was used on the dataset comprising 958 patients to derive latent topics, with the optimal number of hidden topics determined as 16 through perplexity variation. During previous drug response association analysis [25], two efficacy-related topic groups were identified: the Response Group (RG, including Topics 8, 12, and 16) and the Non-response Group (NG, including Topics 4, 7, 13, 14, and 15). A drug efficacy keyword repository was constructed by extracting the top 100 words from each group (totaling 800 words) followed by deduplication.
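The repository construction described above (top-w probability-ranked words per selected topic, then deduplication) can be sketched with a hypothetical topic-word probability matrix:

```python
import numpy as np

def build_dekr(topic_word, vocab, topic_ids, top_w):
    """Collect the top-w probability-ranked words of each selected topic
    and deduplicate them into a keyword repository (a set)."""
    dekr = set()
    for t in topic_ids:
        top = np.argsort(topic_word[t])[::-1][:top_w]
        dekr.update(vocab[i] for i in top)
    return dekr

# Toy vocabulary and random p(word | topic) rows; the paper uses 16
# topics, 8 selected efficacy-related topics, and the top 100 words each.
vocab = ["nodule", "effusion", "enlarged", "stable", "shrunk", "mass"]
rng = np.random.default_rng(4)
topic_word = rng.random((5, len(vocab)))
topic_word /= topic_word.sum(axis=1, keepdims=True)
dekr = build_dekr(topic_word, vocab, topic_ids=[0, 2, 3], top_w=2)
```

Because overlapping topics often share high-probability words, the deduplicated repository is typically smaller than m × w, as in the paper's 800-candidate extraction.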

Since the LDA model has computed the probability distribution of each word across the 16 topics, combining a word’s probabilities under these topics yields a 16-dimensional embedding representation for every word in the drug efficacy-related keyword repository. The embedding was mapped to the same dimension as the original BERT inputs via a fully connected layer. Finally, the word embedding vectors for patient textual data were obtained by augmenting the original BERT inputs with the LDA-derived topic embeddings.

Parameter setting

The components of our method include the basic BERT model, the integrated LDA topic embedding layer, the drug efficacy-aware attention mechanism, and the LSTM. The pre-trained BERT, which consists of 12 encoder layers with a hidden size of 768 and 12 attention heads, was used as the base model. For the LDA model, we determined the optimal number of hidden topics by examining the perplexity: the perplexity-versus-topic-number curve reaches its minimum at 16, indicating an optimal topic number of 16. The hyperparameters were then set to α = β = 0.0625 by default. The training parameters include Epoch, Batch size, Pad size, and Learning rate, briefly introduced as follows: an Epoch refers to one complete iteration through the entire training dataset; the Batch size is the number of training samples processed simultaneously in one forward/backward pass; the Pad size is the fixed length to which input sequences are standardized by adding placeholder tokens; and the Learning rate is the step size controlling the magnitude of parameter updates during optimization. During training, we tried Epoch numbers of 5, 10, 15, 20, 25, 50, 100, and 200; Batch sizes of 16, 32, and 64; Pad sizes of 32, 64, 128, 256, 300, 400, and 500; Learning rates of 1 × 10⁻³, 1 × 10⁻⁴, 1 × 10⁻⁵, 1 × 10⁻⁶, and 5 × 10⁻⁵; and Dropout rates of 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6. For the LSTM, the primary parameter is the hidden-layer size, i.e., the number of memory neurons in the network’s hidden layer, which we varied among {64, 128, 256}. The best performance was obtained with Epoch = 20, Batch size = 32, Pad size = 400, Learning rate = 1 × 10⁻⁵, Dropout rate = 0.5, and LSTM hidden size = 64. The descriptions of these parameters and the final values used are shown in Table 2.
All the experiments were performed using an NVIDIA GeForce RTX 4060 and an Intel Core i9-13900HX CPU.

Table 2.

The main parameters used in the model

Parameter | Description | Value
Number of topics | The optimal number of hidden topics of the LDA model | 16
α | The hyperparameter of the prior Dirichlet distribution for each topic distribution of the LDA model | 0.0625
β | The hyperparameter of the prior Dirichlet distribution for each topic word distribution of the LDA model | 0.0625
Epoch | One complete iteration over the entire dataset during training | 20
Batch size | The number of training samples processed together in one iteration during model training | 32
Pad size | The fixed length to which all input sequences are resized | 400
Learning rate | The step size at each iteration when updating model parameters (e.g., weights and biases) to minimize the loss function | 1 × 10⁻⁵
Dropout rate | A regularization technique trading off the precision and the complexity of the neural network | 0.5
LSTM hidden size | The number of hidden units in the LSTM layer | 64

Prediction performance

For systematic performance benchmarking, mainstream text embedding models and machine learning classifiers were incorporated, including LDA, TF-IDF, graph convolutional network (GCN) [26] embedding models, and classifiers such as KNN, LR, SVM, DT, and NN. An optimized validation protocol was implemented with meticulously selected parameter configurations for each model. Experimental results are summarized in Table 3. The proposed method demonstrates superior performance in predicting antitumor drug efficacy: precision (0.84), F1-score (0.86), recall (0.87), accuracy (0.83), and AUC (0.82). Evaluation experiments on real-world medical datasets validated its superior capability in individualized disease risk prediction, confirming the efficacy of our methodology. Prediction performance and loss curves of the proposed method are illustrated in Fig. 6.

Table 3.

Performance comparison (mean ± sd) with previous methods on lung cancer patient’s dataset

Methods Prec Rec F1 Acc AUC
LDA + LR [4] 0.89 ± 0.02 0.65 ± 0.03 0.75 ± 0.02 0.69 ± 0.03 0.72 ± 0.03
LDA + KNN [4] 0.85 ± 0.01 0.77 ± 0.06 0.81 ± 0.04 0.73 ± 0.04 0.70 ± 0.03
LDA + DT [4] 0.86 ± 0.02 0.74 ± 0.05 0.80 ± 0.03 0.73 ± 0.03 0.70 ± 0.05
LDA + SVM [4] 0.91 ± 0.02 0.73 ± 0.03 0.81 ± 0.01 0.75 ± 0.01 0.77 ± 0.02
TF-IDF + NN [5] 0.81 ± 0.02 0.80 ± 0.03 0.81 ± 0.02 0.73 ± 0.03 0.77 ± 0.05
GCN + NN [5] 0.79 ± 0.01 0.81 ± 0.03 0.80 ± 0.02 0.71 ± 0.02 0.62 ± 0.02
LDA + NN [5] 0.81 ± 0.01 0.89 ± 0.04 0.85 ± 0.02 0.77 ± 0.03 0.81 ± 0.03
DrugBERT 0.84 ± 0.01 0.87 ± 0.03 0.86 ± 0.01 0.83 ± 0.03 0.82 ± 0.01

Fig. 6.

Fig. 6

ROC curve (a) and loss curve (b) of the proposed method

Training and inference efficiency

As noted above, all experiments ran on an NVIDIA GeForce RTX 4060 GPU and an Intel Core i9-13900HX CPU, using CUDA 12.6 with Python 3.7 and TensorFlow 2.6.0 as the primary programming framework. On average, training our method takes 11 min, and inference for a single sample takes less than 1 s. This fast inference speed enables real-time diagnostic support for physicians in clinical practice.

Ablation study

To rigorously evaluate the contributions of our model components, we established a baseline model containing only the base BERT architecture. Each component was incrementally incorporated into this baseline to systematically analyze their individual impacts on predictive performance.

We used the basic BERT model as the baseline and conducted ablation experiments on three components: the integrated LDA entity embedding layer, the drug efficacy-aware attention mechanism, and the LSTM. Table 4 shows the results. When only the LDA entity embedding layer was added, the AUC increased from 0.76 to 0.79. When only the drug efficacy-aware attention was added, all indicators showed significant improvements, with the AUC increasing from 0.76 to 0.80. When both the LDA entity embedding layer and the drug efficacy-aware attention were added, the F1 score improved significantly. When only the LSTM was added, the F1 score reached 0.82, a substantial improvement. These results show that each of the three components makes a positive contribution to the model’s performance: the LSTM has the most significant impact, followed by the drug efficacy-aware attention mechanism, while the integrated LDA entity embedding layer also plays a role in enhancing performance.

Table 4.

The results of the ablation experiments (mean ± standard deviation)

Methods Acc AUC Prec Rec F1
Baseline 0.74 ± 0.02 0.76 ± 0.03 0.79 ± 0.03 0.74 ± 0.02 0.75 ± 0.02
+LDA Entity 0.76 ± 0.01 0.79 ± 0.02 0.79 ± 0.01 0.76 ± 0.02 0.78 ± 0.01
+Drug Efficacy-Aware Attention 0.81 ± 0.01 0.80 ± 0.01 0.82 ± 0.01 0.81 ± 0.02 0.79 ± 0.02

+LDA Entity

+Drug Efficacy-Aware Attention

0.78 ± 0.01 0.80 ± 0.02 0.82 ± 0.02 0.81 ± 0.02 0.82 ± 0.02
+LSTM 0.81 ± 0.01 0.80 ± 0.02 0.83 ± 0.01 0.81 ± 0.02 0.82 ± 0.01
DrugBERT 0.83 ± 0.03 0.82 ± 0.01 0.84 ± 0.01 0.87 ± 0.03 0.86 ± 0.01

Statistical significance analysis

Permutation testing [27] was employed to rigorously validate the statistical significance of the proposed method. Under the null hypothesis positing no difference between actual and randomized outcomes, we preserved the labels of the 5-fold cross-validation training set while permuting test-set labels. The model was retrained and evaluated under this randomized configuration. This procedure was repeated 1,000 times, with the p-value defined as the proportion of permutations whose AUC values exceeded the observed value (0.82). No permuted AUC exceeded the observed value (p < 0.001, well below 0.05), thereby rejecting the null hypothesis and confirming a substantive divergence between empirical and random outcomes. Figure 7 illustrates the distribution of randomized AUC values, where the red dashed line denotes the AUC of our method. This analysis confirms the statistical validity and clinical utility of the proposed framework for drug efficacy prediction.
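The permutation-test logic can be sketched in numpy; note this simplified version scores fixed predictions against permuted labels rather than retraining the model per permutation as the paper does, and all data here are synthetic:

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: probability that a positive outranks a negative."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

def permutation_pvalue(y_true, scores, n_perm=1000, seed=0):
    """p-value = fraction of label permutations whose AUC reaches the
    observed AUC (the paper's null hypothesis, without retraining)."""
    rng = np.random.default_rng(seed)
    observed = auc(y_true, scores)
    count = sum(
        auc(rng.permutation(y_true), scores) >= observed
        for _ in range(n_perm)
    )
    return observed, count / n_perm

# Synthetic labels and informative toy scores for illustration only.
rng = np.random.default_rng(5)
y = np.array([1] * 30 + [0] * 30)
s = y * 1.0 + rng.normal(scale=0.8, size=60)
obs_auc, p = permutation_pvalue(y, s, n_perm=500)
```

When the scores genuinely separate the classes, essentially no permutation matches the observed AUC and the estimated p-value collapses toward zero, mirroring the result reported above.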

Fig. 7.

Fig. 7

AUC distribution of permutation results

Evaluation on an independent test set

To further validate the reliability of the proposed method, an independent colorectal cancer dataset was utilized. This dataset comprised 266 cancer patients undergoing platinum-based chemotherapy as first-line treatment, with pre-chemotherapy radiomic reports for each patient. Among the 266 samples, 225 were responders and 41 were non-responders. Established benchmark methods were implemented on this independent dataset for performance comparison. The experimental results, as presented in Table 5, demonstrate that the proposed method achieved superior therapeutic efficacy prediction performance: accuracy = 0.87, recall = 0.92, F1-score = 0.89, precision = 0.89, and AUC = 0.63. These findings conclusively validate the significant predictive efficacy of our method in evaluating cancer treatment responses.

Table 5.

Performance comparison (mean ± sd) with previous methods on bowel cancer patient’s dataset

Methods Prec Rec F1 Acc AUC
LDA + LR[4] 0.91 ± 0.06 0.41 ± 0.05 0.56 ± 0.05 0.46 ± 0.04 0.58 ± 0.09
LDA + KNN[4] 0.86 ± 0.02 0.64 ± 0.07 0.73 ± 0.05 0.61 ± 0.06 0.53 ± 0.06
LDA + DT[4] 0.86 ± 0.04 0.72 ± 0.07 0.78 ± 0.05 0.66 ± 0.06 0.54 ± 0.09
LDA + SVM[4] 0.89 ± 0.04 0.68 ± 0.03 0.77 ± 0.01 0.66 ± 0.01 0.60 ± 0.03
LDA + NN[5] 0.85 ± 0.01 0.93 ± 0.04 0.89 ± 0.02 0.80 ± 0.03 0.60 ± 0.01
DrugBERT 0.89 ± 0.02 0.92 ± 0.02 0.89 ± 0.01 0.87 ± 0.01 0.63 ± 0.01

Discussion

The current study successfully demonstrates the potential of the DrugBERT method in predicting antitumor drug efficacy based on clinical text data. Compared to previously proposed methods, this model has proven more effective in mining hidden information embedded within clinical textual data.

The key advantage of our method lies in its integration of the LDA topic model to generate topic embeddings as a semantic enhancement module. These embeddings are injected into the BERT input layer, constructing a joint representation of topic and context features, thereby effectively learning complex patterns and relationships concealed in textual data. Furthermore, the designed dynamic weight allocation module enhances the model’s focus on drug efficacy-related keywords during self-attention computations, improving its ability to capture critical drug efficacy features.

In the field of clinical oncology, predicting the efficacy of antitumor drugs is of utmost significance. In particular, the oncology departments of large comprehensive tertiary Grade A hospitals and specialized cancer hospitals, which conduct in-depth research and treatment for specific tumor types, have a critical need for drug efficacy prediction.

In practical application, doctors first collect a patient’s radiological examination text data and then input it into the model. Based on the prediction results, clinicians preferentially select the predicted antitumor drug for patients with expected good efficacy. For patients with predicted poor efficacy, blind use of the drug can be avoided, and alternative therapies or combination treatment plans can be considered instead, thereby improving the specificity and effectiveness of treatment.

Limitations

Several limitations exist in this study. The first limitation is that the current cancer dataset is relatively limited in variety. Therefore, the proposed method in this paper requires validation on a broader range of cancer types and drugs to ensure its generalizability. Another limitation is that leveraging the BERT model demands substantial computational resources [28], which may impede model deployment in practical applications. Specifically, it is not well-suited for edge computing on devices due to the high resource requirements.

Conclusions

This study successfully applied the BERT model to clinical antitumor drug efficacy prediction. The proposed DrugBERT method integrates LDA-derived topic keywords into BERT’s text embeddings, uses a drug efficacy-aware attention mechanism to explore keyword relationships, incorporates an LSTM to capture long-term dependencies in clinical texts, and applies the SMOTE algorithm for data balancing. Experiments on NSCLC and bowel cancer datasets demonstrated the superior performance of the proposed method. Future work will focus on introducing advanced large language models [29] and integrating additional modalities such as genomic or proteomic data to further improve performance.

Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments.

Abbreviations

ACC

Accuracy

AUC-ROC

Area Under the Receiver Operating Characteristic curve

BU

B-mode ultrasound

CNN

Convolutional Neural Networks

CR

Complete response

CT

Computed tomography

GCN

Graph Convolutional Network

GELU

Gaussian Error Linear Unit

MRI

Magnetic resonance imaging

LDA

Latent Dirichlet Allocation

LLMs

Large language models

NLP

Natural language processing

NSCLC

Non-small cell lung cancer

PD

Disease progression

PR

Partial response

PREC

Precision

RAG

Retrieval-Augmented Generation

REC

Recall

SD

Stable disease

SMOTE

Synthetic Minority Oversampling Technique

TF-IDF

Term Frequency-Inverse Document Frequency

Author contributions

Hongqiang Wang, Xinping Xie, and Weiwei Zhu conceived the idea of the study and wrote the main manuscript text; Xiaodong Jiang, Lei Zhang, and Peng Zhou performed data collection, data analysis, and system programming. Xinping Xie and Hongqiang Wang reviewed the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China [grant numbers 61973295, 81872276, 52373160]; the Anhui Province’s key Research and Development Project [grant number 201904a07020092]; University Science Research Project of the Education Department of Anhui Province [grant number KJ2021A0633]; Natural Science Foundation of Higher Education in Anhui Province [grant number 2024AH051569]; Laboratory of Operations Research and Data Science of Anhui Jianzhu University [grant number YCSJ2024ZR02]; Research Initiation Fund Project of Introduced High-level Talents [grant number 60423018], Anhui Clinical Medical Research Transformation Special Project (202304295107020050); Anhui Provincial Higher Education Institutions Middle-aged and Young Faculty Development Program (JNFX2024029,YQYB2023011); Anhui Provincial Natural Science Foundation (grant number 2022AH050247).

Data availability

The data included in the study are available from the corresponding author upon reasonable request.

Declarations

Ethics approval and consent to participate

This study was conducted in accordance with the guidelines of the Declaration of Helsinki and approved by the Institutional Research Ethics Committee of Anhui Chest Hospital (NO. KJ2024-018).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Ladbury C, Amini A, Govindarajan A, Mambetsariev I, Raz DJ, Massarelli E, et al. Integration of artificial intelligence in lung cancer: rise of the machine. Cell Rep Med. 2023;4(2):100933. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Peng W, Lin J, Dai W, Yu N, Wang J. Hierarchical graph representation learning with multi-granularity features for anti-cancer drug response prediction. IEEE J Biomed Health Inform. 2024. [DOI] [PubMed]
  • 3.Zeng Q, Cao X, Feng J, Shan H, Chen X. Imaging technology in oncology pharmacological research. Front Pharmacol. 2021;12:711387. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Xie X, Li D, Pei Y, Zhu W, Du X, Jiang X, et al. Personalized anti-tumor drug efficacy prediction based on clinical data. Heliyon. 2024;10(6):e27300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Zhu W, Zhang L, Jiang X, Zhou P, Xie X, Wang H. A method combining LDA and neural networks for antitumor drug efficacy prediction. Digit Health. 2024;10:212–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Liu W, Li J, Tang Y, Zhao Y, Liu C, Song M, et al. DrBioRight 2.0: an LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis. Nat Commun. 2025;16(1):2256. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:1–8. [Google Scholar]
  • 8.Zhang Y, Liu C, Liu M, Liu T, Lin H, Huang C-B, et al. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief Bioinform. 2024;25(1):467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Santos T, Tariq A, Das S, Vayalpati K, Smith GH, Trivedi H et al. PathologyBERT-pre-trained vs. a new transformer language model for pathology domain. AMIA annual symposium proceedings; 2023:962–971. [PMC free article] [PubMed]
  • 10.Zhang H, Hu D, Duan H, Li S, Wu N, Lu X. A novel deep learning approach to extract Chinese clinical entities for lung cancer screening and staging. BMC Med Inf Decis Mak. 2021;21:1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Devlin J, Chang M-W, Lee K, Toutanova K, editors. Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers); 2019.
  • 12.Sembiring I, Wahyuni SN, Sediyono E. LSTM algorithm optimization for COVID-19 prediction model. Heliyon. 2024;10(4):e26158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003;3:993–1022. [Google Scholar]
  • 14.Nagra AA, Alyas T, Hamid M, Tabassum N, Ahmad A. [Retracted] training a feedforward neural network using hybrid gravitational search algorithm with dynamic multiswarm particle swarm optimization. Biomed Res Int. 2022;2022:2636515. [DOI] [PMC free article] [PubMed] [Retracted]
  • 15.Gan J, Qi Y. Selection of the optimal number of topics for LDA topic model—taking patent policy analysis as an example. Entropy. 2021;23(10):1301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Lin I, Wang T, Gao S, Tang S, Lee TS. Self-attention-based contextual modulation improves neural system identification. arXiv preprint arXiv:2406.07843. 2024.
  • 17.Munir HS, Ren S, Mustafa M, Siddique CN, Qayyum S. Attention based GRU-LSTM for software defect prediction. PLoS ONE. 2021;16(3):e0247444. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Gu J, Lu Y, Huang X, Yin Z. Sigmoid function model of parallel-connected DC–DC converters and analysis of their dynamic characteristics. Chaos: Interdisciplinary J Nonlinear Sci. 2024; 34(7). [DOI] [PubMed]
  • 19.Alavi SF, Chen Y, Hou Y-F, Ge F, Zheng P, Dral PO. ANI-1ccx-gelu universal interatomic potential and its Fine-Tuning: toward accurate and efficient anharmonic vibrational frequencies. J Phys Chem Lett. 2025;16:483–93. [DOI] [PubMed] [Google Scholar]
  • 20.Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061. 2020.
  • 21.Alduais Y, Zhang H, Fan F, Chen J, Chen B. Non-small cell lung cancer (NSCLC): A review of risk factors, diagnosis, and treatment. Medicine. 2023;102(8):e32899. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Litière S, Bogaerts J. Imaging endpoints for clinical trial use: a RECIST perspective. J Immunother Cancer. 2022;10(11):e005092. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Ngoc TT, Le Van Dai CMT, Thuyen CM. Support vector regression based on grid search method of hyperparameters for load forecasting. Acta Polytech Hungarica. 2021;18(2):143–58. [Google Scholar]
  • 24.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57. [Google Scholar]
  • 25.Xu S, Leng Y, Feng G, Zhang C, Chen M. A gene pathway enrichment method based on improved TF-IDF algorithm. Biochem Biophys Rep. 2023;34:101421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Lee D, Choo H, Jeong J. Gcn-based Lstm autoencoder with self-attention for bearing fault diagnosis. Sensors. 2024;24(15):4855. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Önder H, Cebeci Z. A review on the permutation tests. Biostatistics Biometrics Open Access J. 2017;3:68–9. [Google Scholar]
  • 28.Shool S, Adimi S, Saboori Amleshi R, Bitaraf E, Golpira R, Tara M. A systematic review of large Language model (LLM) evaluations in clinical medicine. BMC Med Inf Decis Mak. 2025;25(1):117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Liu S, McCoy AB, Wright A. Improving large Language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J Am Med Inform Assoc. 2025;32(4):605–15. [DOI] [PMC free article] [PubMed] [Google Scholar]



Articles from Journal of Translational Medicine are provided here courtesy of BMC
