npj Digital Medicine. 2026 Jan 19;9:180. doi: 10.1038/s41746-026-02362-6

LLM-driven collaborative framework for knowledge-enhanced cancer pain assessment and management

Haixiao Liu 1,#, Yue Hu 1,#, Dongtao Li 1,#, Boyuan Shi 2, Yupeng Niu 3, Xinche Zhang 2, Guangda Zheng 4, Changlin Li 5, Lingyun Wang 5, Yanju Bao 1,
PMCID: PMC12917166  PMID: 41554973

Abstract

Cancer pain remains a major challenge in oncology because of its multifactorial mechanisms, variable opioid response, and high-risk adverse reactions. To address these obstacles, we developed OncoPainBot, a collaborative framework based on large language models (LLMs) that simulates the reasoning and decision-making of multiple clinical experts to perform comprehensive cancer pain assessment and management. OncoPainBot integrates four specialized agents, Pain-Extraction, Pain-Mechanism Reasoning, Treatment-Planning, and Safety-Check, each corresponding to a distinct clinical role. We compared seven LLMs and three retrieval-augmented generation (RAG) strategies to determine the optimal model configuration, and verified the final framework on 516 real-world electronic medical records of cancer pain across multiple evaluation dimensions. Claude 4 combined with RAG achieved the best overall performance, demonstrating strong semantic consistency and evidence-based reasoning across multiple metrics. In clinical validation, the reports generated by OncoPainBot agreed closely with actual clinical documents, and the framework maintained high decision-making accuracy (0.841) in the analgesic recommendation task. Error analysis showed that most discrepancies arose from patient-specific factors and monitoring recommendations rather than incorrect drug selection, supporting the reliability of the framework. OncoPainBot demonstrates the feasibility of an LLM-based cancer pain management system, providing a transparent, evidence-based, and clinically grounded framework for personalized analgesic care.

Subject terms: Cancer, Computational biology and bioinformatics, Health care, Medical research, Oncology

Introduction

Cancer pain is one of the most common and distressing symptoms in cancer patients; approximately 70% of patients with advanced cancer suffer from it1. Its manifestations are diverse, mainly including bone metastasis pain, neuropathic pain, postoperative pain, and acute pain related to chemotherapy or radiotherapy2. Advances in cancer treatment have markedly improved survival, yet pain management remains a difficult challenge, especially in advanced cancer, where pain control is often unsatisfactory. Cancer pain not only impairs physical function and lowers quality of life but also triggers psychological problems such as depression and anxiety, further worsening patients’ overall condition. Effective pain assessment and management are therefore essential for improving quality of life and treatment outcomes3. Although existing treatment frameworks such as the WHO three-step analgesic ladder provide a theoretical basis for clinical management, they still face considerable challenges in practice4. Current pain assessment often relies on subjective judgment and lacks unified, quantifiable tools, so assessments may differ among medical staff, which in turn affects the choice of analgesic regimen. This is especially true for patients with advanced cancer, whose complex conditions and dynamically changing pain make effective management strategies difficult to formulate. In addition, cancer pain treatment usually requires individualized adjustment: factors such as age, sex, organ function, and response to previous treatments all influence the selection of treatment plans and the titration of drug dosages5.

The complexity of cancer pain management is first reflected in the diversity of pain mechanisms. Nociceptive and neuropathic pain often coexist in the same patient, and the pain type may change over the course of treatment6. Treatment should therefore be mechanism-directed: the underlying pain mechanism must first be accurately identified, after which opioids, non-steroidal anti-inflammatory drugs (NSAIDs), and adjuvant analgesics can be combined to achieve synergistic analgesia7. Clinicians must balance the analgesic effect of these combinations against adverse effects such as sedation. Second, determining the appropriate dosage is challenging, particularly for patients with impaired organ function8; doses must be carefully titrated to relieve pain while minimizing adverse reactions and avoiding overdose. Drug safety and availability are also paramount in cancer pain management: the impaired hepatic and renal function common in cancer patients increases the risk of drug toxicity. Because cancer pain management is individualized and a patient’s condition, medical history, and concurrent medications can interact, safe, comprehensive, and effective care requires the collaboration of multiple disciplines9.

As technology permeates every aspect of clinical practice, large language models (LLMs) such as ChatGPT and Claude have begun to demonstrate their potential in medicine, including cancer pain management. In clinical teaching and practice, LLMs are increasingly used to automate the review of large volumes of unstructured text (such as clinical notes and electronic medical records) and to support pain assessment, extracting information such as the type, intensity, and treatment history of pain10,11. LLMs can understand and integrate knowledge spanning treatment, medication, and clinical guidelines to assemble an evidence base for clinical care and keep it up to date12. By incorporating continuously updated clinical information after domain-specific training, LLMs can track evolving clinical guidelines, a capability that offers significant advantages. LLMs not only provide decision support for clinicians but also expose their reasoning process, thereby improving decision-making13. In this sense, LLMs can provide a ‘safety net’ for cancer pain management through intelligent, personalized optimization of the clinical process, while final responsibility remains under the crucial safety oversight of clinicians.

However, existing LLM-based assistants and agent-style clinical decision support approaches remain insufficient for cancer pain management in routine practice. First, many systems operate as a single-pass “chat” generator with limited grounding, which increases the risk of hallucinated medication details and non-traceable recommendations when clinical notes are incomplete or noisy14–16. Second, cancer pain decisions are inherently safety-critical and context-dependent (opioid titration, organ dysfunction, polypharmacy, constipation prophylaxis, breakthrough pain), yet most prior approaches do not implement an explicit safety-checking step that systematically verifies contraindications, interactions, and monitoring requirements17–19. Third, existing multi-agent frameworks often lack a standardized workflow decomposition that mirrors real multidisciplinary pain management (structured extraction → mechanism reasoning → planning → safety verification), making outputs harder to audit and integrate into documentation17,20. Finally, evidence drift and guideline updates can rapidly invalidate static prompts or knowledge snapshots; without a clear retrieval-and-versioning mechanism, maintaining guideline adherence over time is challenging16. Collectively, these limitations motivate a knowledge-grounded, auditable, clinician-in-the-loop framework tailored to the end-to-end cancer pain workflow.

LLMs have already made significant strides in fields such as medical diagnostics, clinical decision support, and drug discovery. In medical diagnostics, LLMs assist in interpreting radiology and pathology reports. In clinical decision support, they recommend appropriate treatments and check for potential drug interactions. In drug discovery, they are used to predict molecular properties and propose drug candidates14,15. However, the application of LLMs to cancer pain management has not previously been undertaken. Whereas in other medical fields LLMs are used for diagnosis or therapy planning, cancer pain management is more complex, because pain is multifactorial and requires intricate medication management21,22. This study is the first to focus on applying LLMs to cancer pain management, developing automated, evidence-based, and personalized solutions for the individual patient.

The aim of this study is to construct a new LLM-driven collaborative framework, named OncoPainBot, to enhance the assessment and management of cancer-related pain. The system combines LLMs with knowledge enhancement: it extracts pain-related data from clinical records without human intervention, automatically generates personalized pain management suggestions, and performs immediate safety checks. The design integrates state-of-the-art artificial intelligence with evidence-based medicine to improve the accuracy, consistency, and efficiency of cancer pain management. The system not only automates the characterization of pain symptoms, the generation of treatment plans, and safety checks against clinical standards, but also ensures that autonomously generated suggestions comply with the latest clinical standards and safety protocols. The main innovations of this study are:

  1. The system simulates the collaboration of four distinct clinical roles through four modules, conducting cancer-related pain management in a multidisciplinary team mode.

  2. The study compares multiple mainstream language models and retrieval-augmented generation strategies to identify the model configuration best suited to cancer pain management.

  3. The study validates the multi-dimensional superiority of the selected configuration within the cancer pain management framework on 516 real-world retrospective electronic medical records. The overall workflow of the study is shown in Fig. 1.

Fig. 1. Overall research workflow and system architecture.

Fig. 1

The workflow includes four major stages: a Construction of the Clinical Knowledge Base — integration of NCCN, CSCO, and WHO cancer pain management guidelines to build an authoritative evidence corpus for knowledge retrieval. b Model Architecture Design — simulation of a multidisciplinary clinical workflow through four specialized LLM agents: Pain-Extraction, Pain-Mechanism Reasoning, Treatment-Planning, and Safety-Check. c Comparative Evaluation of LLMs and RAG Strategies — systematic benchmarking of multiple foundation models and retrieval configurations (No RAG, Vanilla RAG, GraphRAG) to determine the optimal backbone for OncoPainBot. d Multi-Agent Clinical Validation — real-world evaluation on retrospective cancer pain medical records, assessing both semantic alignment (linguistic coherence) and clinical decision accuracy to ensure reliability, safety, and translational applicability.
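The four-agent decomposition in panel b can be sketched as a simple sequential pipeline. The agent names below follow the paper, but the rule-based function bodies are illustrative placeholders standing in for the actual LLM calls, not the system's implementation.

```python
# Illustrative sketch of the four-agent pipeline in Fig. 1b. Agent names
# follow the paper; the rule-based bodies are toy stand-ins for LLM calls.

def pain_extraction(note: str) -> dict:
    """Turn a free-text note into a structured pain record (toy heuristic)."""
    record = {"nrs": None, "location": None}
    for token in note.lower().split():
        if token.startswith("nrs="):
            record["nrs"] = int(token.split("=")[1])
    if "bone" in note.lower():
        record["location"] = "bone"
    return record

def mechanism_reasoning(record: dict) -> str:
    """Map structured features to a coarse pain-mechanism label."""
    return "nociceptive" if record["location"] == "bone" else "unspecified"

def treatment_planning(record: dict, mechanism: str) -> str:
    """Propose a plan keyed to severity (toy version of the WHO ladder)."""
    if record["nrs"] is not None and record["nrs"] >= 7:
        return "strong opioid with titration"
    return "non-opioid analgesic"

def safety_check(plan: str, labs: dict) -> str:
    """Flag the plan for closer monitoring if organ function is impaired."""
    if labs.get("renal_impaired") and "opioid" in plan:
        return plan + " + enhanced monitoring"
    return plan

def oncopainbot_pipeline(note: str, labs: dict) -> str:
    record = pain_extraction(note)
    mechanism = mechanism_reasoning(record)
    plan = treatment_planning(record, mechanism)
    return safety_check(plan, labs)

print(oncopainbot_pipeline("bone metastasis pain nrs=8", {"renal_impaired": True}))
```

Each stage consumes the structured output of the previous one, which is what makes the chain auditable: every intermediate record can be inspected and verified against the source note.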

Results

Experimental setup

All experiments were performed on a deep learning server running Ubuntu 22.04.4 LTS; the entire project was deployed on this server to ensure consistency of deployment and operational stability. The server is equipped with multiple NVIDIA RTX 5090 GPUs with FP8 support, providing significant acceleration for the pipeline. The software stack included Python 3.12.4 and PyTorch 2.8.0, whose improved asynchronous garbage collection and I/O help handle clinical datasets, together with CUDA 12.8 and cuDNN for GPU acceleration. Key libraries were Hugging Face Transformers, scikit-learn, and timm. The FastGPT framework provided orchestration of the inference services, enabling online inference and hosted, API-driven deployment alongside retrieval-augmented generation (RAG).

Performance comparison of LLMs for OncoPainBot

We evaluated several LLMs on three medical question-answering datasets: MedMCQA-Cancer, MedMCQA, and PubMedQA. MedMCQA-Cancer is a subset of MedMCQA narrowed to oncology by keyword filtering. The tested models were DeepSeek, Gemini 2.5 Pro, Claude 4, Kimi, ChatGPT 4o, Doubao 1.5 Pro, and GLM4.5. The objective was to find the base model most appropriate for OncoPainBot in terms of accuracy, speed, and efficiency for cancer pain management. Claude 4 was the most accurate of all models tested, especially on the MedMCQA and PubMedQA datasets, where it exceeded 80% accuracy (Fig. 2a, b); see Supplementary Tables 1–3 for details. Other models, including Gemini 2.5 Pro and GLM4.5, approached Claude 4 in accuracy, but Claude 4’s consistent outperformance indicates it is the most capable of understanding and processing complex cancer pain information.
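The keyword filtering used to derive an oncology subset like MedMCQA-Cancer can be sketched as follows. The keyword list is an assumption for illustration; the paper does not publish its exact filter terms.

```python
# Sketch of keyword filtering to derive an oncology QA subset.
# The keyword list is an illustrative assumption, not the paper's filter.

ONCOLOGY_KEYWORDS = {"cancer", "tumor", "tumour", "carcinoma",
                     "metastasis", "chemotherapy", "oncology", "malignant"}

def is_oncology_question(question: str) -> bool:
    """True if the question text mentions any oncology keyword."""
    text = question.lower()
    return any(kw in text for kw in ONCOLOGY_KEYWORDS)

def filter_oncology(dataset: list[dict]) -> list[dict]:
    """Keep only question items whose text matches the keyword filter."""
    return [item for item in dataset if is_oncology_question(item["question"])]

sample = [
    {"question": "Which opioid is preferred for bone metastasis pain?"},
    {"question": "What is the mechanism of action of aspirin?"},
    {"question": "First-line chemotherapy for small cell lung carcinoma?"},
]
print(len(filter_oncology(sample)))  # 2 of the 3 items match
```

Simple substring matching like this trades precision for recall; a production filter would likely also handle word boundaries and synonyms.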

Fig. 2. Comparative evaluation of large language models (LLMs) for integration into the OncoPainBot framework.

Fig. 2

a Performance comparison across datasets (MedMCQA, PubMedQA, and MedMCQA-Cancer). b Confusion matrix heatmap of Claude 4 on the MedMCQA dataset. c API response time analysis reveals Claude 4 as the slowest but most accurate model, while DeepSeek and ChatGPT 4o provided faster responses with moderate accuracy. d Comprehensive performance trade-off plot illustrates the relationship between accuracy, speed, and cost.

On response times (Fig. 2c), Claude 4 had a latency of about 30 seconds, at the higher end of the models tested. Trading speed for accuracy is reasonable in a pain management context such as OncoPainBot, where accurate recommendations take priority. Considering overall performance (Fig. 2d) in terms of accuracy, response time, and cost, ChatGPT 4o and Gemini 2.5 Pro offered the best speed-accuracy trade-off, with ChatGPT 4o standing out overall. Nevertheless, given OncoPainBot’s goals of personalized and accurate pain management, Claude 4 remained the preferred choice despite its longer response time.

Comparison of different RAG strategies

To improve OncoPainBot, we evaluated different RAG strategies with the four best-performing LLMs: Claude 4, ChatGPT 4o, Gemini 2.5 Pro, and GLM4.5. The evaluation used three criteria, response time, accuracy, and retrieval efficiency, to determine the optimal candidate for incorporation into OncoPainBot. As illustrated in Fig. 3a, integrating RAG improved accuracy for all models relative to their no-RAG baselines. Of the retrieval techniques tested, GraphRAG achieved the highest accuracy, especially in conjunction with Claude 4 and ChatGPT 4o on MedMCQA. This improvement illustrates the advantage of graph-structured knowledge retrieval for the complex semantics of cancer pain management. A closer look at the Vanilla RAG configurations (Fig. 3b) shows that Hybrid RAG, which combines semantic and keyword-based retrieval, outperformed both Semantic RAG and BM25 RAG alone, suggesting that the combination optimizes both precision and recall for clinical content retrieval. Nevertheless, as shown in Fig. 3c, the strategies differed in response time: GraphRAG had the highest latency owing to its multi-hop reasoning and graph traversal, while BM25 RAG was the fastest, incurring the least latency.
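The idea behind hybrid retrieval, fusing a keyword-based (BM25-style) ranking with a semantic ranking, can be illustrated with reciprocal rank fusion (RRF). The corpus, the toy rankings, and the RRF constant k=60 are assumptions for illustration; the paper does not specify its fusion method.

```python
# Minimal illustration of hybrid retrieval: fuse a keyword (BM25-style)
# ranking with a semantic ranking via reciprocal rank fusion (RRF).
# The toy rankings and the constant k=60 are illustrative assumptions.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked doc-id lists into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any input list get a larger boost.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings over three guideline chunks for the query
# "breakthrough pain morphine dosing":
keyword_ranking = ["chunk_morphine", "chunk_nsaid", "chunk_neuropathic"]
semantic_ranking = ["chunk_morphine", "chunk_neuropathic", "chunk_nsaid"]

fused = rrf_fuse([keyword_ranking, semantic_ranking])
print(fused[0])  # chunk_morphine tops both input lists, so it tops the fusion
```

RRF rewards documents that rank well under either retriever, which matches the paper's observation that the hybrid configuration improves both precision (exact drug-name matches) and recall (semantically related passages).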

Fig. 3. Comparison of Retrieval-Augmented Generation (RAG) strategies across different large language models (LLMs).

Fig. 3

a RAG methods comparison: Accuracy of four representative LLMs (Claude 4, ChatGPT 4o, Gemini 2.5 Pro, and GLM4.5) under three RAG conditions — No RAG, Vanilla RAG, and GraphRAG. b Vanilla RAG configuration comparison: Performance of the same models under three Vanilla RAG configurations — Semantic RAG, BM25 RAG, and Hybrid RAG. c RAG methods response time comparison: Average API response time (in seconds) for different RAG methods, reflecting efficiency differences among retrieval strategies. d Claude 4 + RAG performance vs. knowledge base size: Relationship between average model accuracy and knowledge base scale (in chunks), illustrating the effect of retrieval corpus size on model performance.

To evaluate the impact of knowledge base size, we incrementally expanded the Claude 4 + RAG knowledge base from 1000 to 5000 chunks (Fig. 3d). Accuracy peaked at around 1000–2000 chunks (0.869) and gradually declined beyond that point, indicating that overly large knowledge bases may introduce noise and reduce retrieval precision. The results demonstrate that Claude 4 + Hybrid RAG provides the optimal balance between accuracy and clinical reasoning depth. This configuration was selected based on the clinical priority of cancer pain management: safety over speed. In routine analgesic prescribing and ward rounds, clinicians tolerate latency (20–30 seconds) but have zero tolerance for hallucinations regarding opioid dosages or contraindications. We therefore prioritized the high-accuracy Hybrid RAG approach over faster but less precise alternatives. Furthermore, while the system uses a unified strategy rather than dynamic switching, Hybrid RAG implicitly addresses diverse clinical tasks: it leverages keyword-based retrieval (BM25) to accurately locate specific drug names and dosages, while simultaneously employing semantic retrieval to comprehend complex, unstructured narratives of pain mechanisms. This dual capability ensures robust performance across the heterogeneous queries inherent in real-world cancer pain scenarios. The specific statistical results are shown in Supplementary Tables 4–6.
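Since knowledge base size is reported "in chunks", a fixed-size chunker is the simplest way to produce such a corpus. The 500-character window and 100-character overlap below are assumptions for illustration; the paper does not describe its exact chunking scheme.

```python
# Sketch of fixed-size chunking for a retrieval corpus sized "in chunks".
# The 500-char window and 100-char overlap are illustrative assumptions.

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split guideline text into overlapping character windows."""
    chunks = []
    step = size - overlap  # advance by the non-overlapping portion
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

guideline = "x" * 1200  # stand-in for a guideline document
chunks = chunk_text(guideline)
print(len(chunks))  # 3 overlapping chunks
```

The overlap keeps dosing statements that straddle a window boundary retrievable from at least one chunk; larger corpora built this way can dilute retrieval precision, consistent with the accuracy decline seen beyond ~2000 chunks.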

Semantic alignment validation on clinical real-world datasets

To assess OncoPainBot’s real-world applicability, we examined 516 clinical notes concerning cancer pain in a comprehensive semantic alignment analysis. OncoPainBot, MDAgents, Claude 4, ChatGPT 4o, Gemini 2.5 Pro, and GLM 4.5 were compared across multiple natural language processing dimensions; see Supplementary Table 7 and the corresponding visualizations in Fig. 4 for a thorough breakdown. OncoPainBot achieved top or near-top results in the majority of dimensions, including AI evaluation scores (Fig. 4a), semantic similarity (Fig. 4b), BLEU score (Fig. 4c), longest common subsequence (ROUGE-L, Fig. 4d), and unigram (ROUGE-1, Fig. 4e) and bigram (ROUGE-2, Fig. 4f) overlap. OncoPainBot convincingly reproduced the reporting style and clinical reasoning of real physicians: its AI evaluation scores clearly exceeded all baselines, and the BERTScore results indicated strong semantic alignment with the reference notes. The ROUGE-L and ROUGE-2 metrics further confirmed the model’s robustness, with over 85% of the subsequences from the reference documents appearing in the generated documents, indicating that a complete narrative was produced for each record. In contrast, Gemini 2.5 Pro and GLM 4.5 often phrased text less fluently and lost sentence context, resulting in low overlap scores.
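ROUGE-L, one of the overlap metrics above, is derived from the longest common subsequence (LCS) between the reference and generated texts. A minimal token-level sketch of its recall component:

```python
# Minimal ROUGE-L sketch: recall of the longest common subsequence (LCS)
# between a reference note and a generated note, at token level.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming LCS length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference: str, generated: str) -> float:
    """Fraction of the reference recoverable as an in-order subsequence."""
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / len(ref)

ref = "pain in right hip NRS 8 on morphine"
gen = "pain right hip NRS 8 morphine started"
print(rouge_l_recall(ref, gen))  # 6 of 8 reference tokens preserved in order: 0.75
```

Because LCS respects token order but not adjacency, ROUGE-L rewards preserving the narrative structure of the reference note, which is why it complements the n-gram-based ROUGE-1/2 scores.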

Fig. 4. Semantic alignment validation of real-world clinical datasets (CRBP dataset) across different models.

Fig. 4

a AI evaluation performance: Overall evaluation scores (0–1) reflecting model-level text quality and coherence. b Semantic similarity analysis: BERTScore results showing alignment between generated and reference clinical texts. c Machine translation quality analysis: BLEU scores quantifying linguistic fidelity between predicted and real-world records. d Longest common subsequence analysis: ROUGE-L scores assessing the preservation of key phrase order and structure. e Lexical overlap analysis: ROUGE-1 scores evaluating unigram-level lexical consistency. f Bi-gram overlap analysis: ROUGE-2 scores measuring phrase-level overlap and contextual fluency in generated clinical outputs.

Analgesic recommendation validation of clinical real-world datasets

Using 516 retrospective EMRs, we assessed whether OncoPainBot’s recommended analgesic plans matched real clinical prescriptions and safety rules. As shown in Fig. 5a, b, overall agreement with ground-truth prescriptions was high; most errors clustered at the decision boundary between adjacent WHO ladder steps or dose-titration choices rather than gross drug-class mistakes. Stratified analyses (Fig. 5c, d) reveal clearer strengths and weaknesses: agreement was highest in the headache and severe-pain strata and lower for mixed/visceral or breakthrough-pain scenarios that require combining adjuvants (gabapentinoids, SNRIs) with opioids. Figure 5e breaks down the 82 errors observed in the analgesic recommendation validation. The most frequent issues were patient-factor errors (12/82, 14.6%) and monitoring errors (11/82, 13.4%). Although grouped in Fig. 5e for visual clarity, a granular analysis of the 12 patient-factor cases reveals specific clinical challenges: these discrepancies primarily stemmed from undocumented analgesic tolerance (where the model underestimated the required starting dose for tolerant patients), idiosyncratic descriptions (vague subjective complaints that hindered precise mechanism inference), and complex pharmacokinetic considerations in patients with multi-organ dysfunction that exceeded the model’s standard titration logic.

Fig. 5. Clinical validation results of the OncoPainBot framework on real-world cancer pain datasets (CRBP dataset).

Fig. 5

a Overall model performance assessment: Radar chart displaying key classification metrics—accuracy, precision, recall, specificity, and F1-score—summarizing model consistency. b Confusion matrix: Visualization of prediction outcomes comparing true versus predicted analgesic decisions. c Pain site recognition accuracy: Performance across major pain types (neuropathic, soft tissue, visceral, bone, headache). d Pain level recognition accuracy: Model accuracy in differentiating pain severity levels (severe, breakthrough, moderate, mild). e Error type distribution: Pie chart illustrating proportions of identified error sources, including dosage, monitoring, and timing errors. f Age group recognition accuracy: Classification accuracy across different age groups, reflecting the model’s generalizability in age-related pain response prediction.

Timing (12.2%), drug-interaction (12.2%), and dosage (12.2%) errors made up most of the remaining cases, reflecting the difficulty of balancing analgesic effectiveness with safety within complicated cancer pain treatment protocols. As shown in Fig. 5f, OncoPainBot’s analgesic recommendation accuracy showed no substantial differences across age strata, indicating reliability across age groups. The model scored highest in the 66–80-year elderly group (0.863), followed by young adults (0.839, 18–40 years) and the middle-aged (0.830, 41–65 years), with a small drop in the very old (81+, 0.818). The slight drop in the oldest group probably reflects the cautious adjustments the system makes for polypharmacy and comorbidity risks, traded against somewhat lower recognition accuracy in this stratum.
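The classification metrics summarized in Fig. 5a (accuracy, precision, recall, specificity, F1) all derive from a 2×2 confusion matrix. The counts below are illustrative only, not the study's actual matrix:

```python
# Deriving the Fig. 5a metrics from a binary confusion matrix.
# The counts here are illustrative placeholders, not the study's data.

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from cell counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

m = binary_metrics(tp=80, fp=10, fn=10, tn=20)
print({k: round(v, 3) for k, v in m.items()})
```

In this task, a "positive" is an appropriate analgesic recommendation, so the high recall reported in the Discussion (94.6%) means the system rarely withholds a treatment the clinicians actually gave.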

Practical analysis of real-world cases

A real-world patient case was used to trace the end-to-end reasoning process across all four agents in OncoPainBot. Figure 6 illustrates a representative case of a 65-year-old male with thoracic spine bone metastases and poorly controlled cancer pain. The workflow demonstrates the system’s layered reasoning pipeline, starting from unstructured clinical narratives. First, the Pain-Extraction Agent analyzes the raw free-text input (top box in Fig. 6) and transforms it into a standardized, structured record (blue box in Fig. 6), extracting key variables such as pain location, intensity (NRS = 8), and medication history; this structured output is generated automatically by the model, not entered manually. The Pain-Mechanism Reasoning Agent then inferred mixed nociceptive-neuropathic pain due to bone invasion and nerve compression; the Treatment-Planning Agent retrieved guideline-aligned evidence via RAG and proposed a morphine-based individualized regimen with stepwise titration and breakthrough-dose adjustment; and the Safety-Check Agent reviewed renal and hepatic profiles, detected mild anemia, and recommended closer monitoring rather than immediate dose reduction. The integrated decision chain mirrored multidisciplinary tumor-pain management logic, delivering interpretable reasoning at each step. Notably, the model’s feedback loop refined its recommendation when data were incomplete (“Data Not Enough”). This example highlights how OncoPainBot can bridge automated text reasoning and patient-specific decision-making, producing transparent, guideline-based analgesic plans that align closely with expert clinical workflows.
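The "Data Not Enough" behavior described above can be sketched as a completeness gate placed before treatment planning: required fields are checked, and when any are missing the pipeline requests verification instead of guessing. The field names below are illustrative assumptions.

```python
# Sketch of the "Data Not Enough" gate: verify that a structured pain
# record is complete before handing it to treatment planning.
# Field names are illustrative assumptions, not the system's schema.

REQUIRED_FIELDS = ("pain_intensity", "pain_location", "current_analgesics")

def completeness_gate(record: dict) -> tuple[bool, list[str]]:
    """Return (ok, missing_fields) for a structured pain record."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]
    return (not missing, missing)

def plan_or_request(record: dict) -> str:
    """Proceed to planning only when the record passes the gate."""
    ok, missing = completeness_gate(record)
    if not ok:
        return "Data Not Enough: please verify " + ", ".join(missing)
    return "proceed to Treatment-Planning Agent"

complete = {"pain_intensity": 8, "pain_location": "thoracic spine",
            "current_analgesics": ["morphine"]}
partial = {"pain_intensity": 8}

print(plan_or_request(complete))
print(plan_or_request(partial))
```

Explicitly surfacing the missing fields, rather than silently filling them, is what keeps the downstream recommendations auditable by the treating clinician.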

Fig. 6. Practical demonstration of OncoPainBot clinical reasoning on a real-world cancer pain case.

Fig. 6

The workflow illustrates the end-to-end reasoning process of OncoPainBot, from patient data extraction and completeness assessment to retrieval-augmented generation (RAG)-based treatment planning and individualized opioid dosing recommendations. The system provides evidence-grounded, interpretable analgesic guidance integrating patient-specific clinical context.

Discussion

This study presents the development of a novel collaborative framework, OncoPainBot, for optimizing the management of cancer-related pain using the capabilities of LLMs. The framework simulates a team of clinical experts through the actions of four agents: Pain-Extraction, Pain-Mechanism Reasoning, Treatment-Planning, and Safety-Check. It is designed around the clinical workflow, automating every step of the pain management cycle from pain identification to management and providing personalized recommendations that are clinically verified and safe for the patient. Validation proceeded in two phases. In Phase 1, combinations of base models and RAG strategies were evaluated in isolation from the system to identify the best-performing configuration. In Phase 2, real clinical data were used to evaluate the system’s linguistic consistency and the accuracy of its clinical decisions. The results of both phases demonstrated the system’s ability to complement automated workflows in the clinical domain and improve cancer pain management by offering safe, clinically verified recommendations.

The proposed LLM-driven collaborative framework advances cancer pain management by simulating four agents representing different members of the clinical team. The Pain-Extraction, Pain-Mechanism Reasoning, Treatment-Planning, and Safety-Check agents emulate real-world clinical workflows and interact to automate the entire pain management process. The Pain-Extraction Agent acts as a clinician, retrieving and structuring pain-related information, including intensity, location, and duration, from the patient’s medical record, ensuring that the framework’s outputs can be verified against real-world data. The Pain-Mechanism Reasoning Agent, acting as a pain specialist, determines the type of pain (nociceptive, neuropathic, or mixed) so that treatment options can be aligned with the pain mechanism. The Treatment-Planning Agent develops a treatment plan based on the patient’s pain profile, comorbidities, and clinical guidelines. Finally, emulating the role of the pharmacist, the Safety-Check Agent cross-checks the proposed treatment course against medication databases, looking for contraindications, drug-drug interactions, and dosage errors, ensuring that the final recommendations comply with safety requirements. The main advantage of this framework is its ability to synthesize the knowledge and expertise of clinicians in distinct fields into cohesive, cooperative intelligence.

To select the backbone for cancer pain management, we analyzed seven base LLMs and three RAG strategies; the combination of Claude 4 and Hybrid RAG proved the best performing. This stems from Claude 4’s strong natural language capabilities: it parses and manipulates complicated clinical language, pain management narratives in particular, with the fluency and nuance the task requires, and it articulates individualized clinical information in a coherent narrative suited to cancer pain management. The Hybrid RAG implementation further enhances Claude 4’s performance by improving retrieval of the most relevant, current, and evidence-informed content from external knowledge sources: clinical repositories, practice guidelines, medication databases, and recent investigations. This retrieval strategy keeps the generated clinical information relevant and up to date. For example, in managing cancer-related pain, the model can draw on the most current pharmacological guidelines or drug-interaction information and provide recommendations in line with best practices.

The real-world validation of OncoPainBot on 516 records of cancer pain patients addressed two areas of interest: semantic alignment and clinical decision-making accuracy. On semantic alignment, the framework showed an impressive ability to mimic the diction and subtle characteristics of actual clinical records. Its outputs aligned closely with real patient records, faithfully capturing critical pain details including location, intensity, and treatment history; that is, the framework synthesized pain assessments that were both clinically and linguistically plausible. On clinical decision-making accuracy, the framework’s proposed treatment was evaluated as a binary classification of whether the proposed treatment was appropriate given the actual clinical decision made. The framework performed well in recommending pain management strategies, with a clinical decision-making accuracy of 84.1%, precision of 85%, and recall of 94.6%. Nevertheless, in the description of pain management strategies (clinical decision accuracy 85.6%), some inaccuracies due to inappropriate suggested treatments or omitted clinical decisions were noted, particularly in complicated patients with multiple comorbidities or atypical pain presentations; these cases call for further fine-tuning of the model and additional work.

This study is poised to significantly impact the automation of pain management systems in healthcare and to empower clinicians managing cancer pain to make more informed decisions17. Healthcare professionals often face the challenge of managing complex and evolving pain presentations18,23. Our clinical automation system is designed to provide comprehensive support for pain assessment, treatment, and safety checks. By efficiently automating the pain assessment and treatment process, this integration aims to assist clinicians while prioritizing safe and effective patient outcomes. The system is engineered for seamless integration with multiple existing pain management programs, streamlining the overall workflow for clinicians who manage pain.

In future clinical practice, we envision OncoPainBot as a clinician-in-the-loop decision support tool integrated into routine cancer pain workflows rather than an autonomous prescriber. A realistic use scenario is EMR-triggered assistance at key checkpoints (e.g., initial pain evaluation, reassessment after opioid titration, discharge planning, or palliative care consultation). The system would ingest information already documented in routine care (pain scores (NRS/VAS), pain descriptors and location, analgesic exposure and prior response, breakthrough pain frequency, comorbidities, hepatic/renal function indices, and concurrent medications) and generate a structured report consisting of (i) standardized pain feature extraction, (ii) mechanism-oriented classification (nociceptive/neuropathic/mixed) with rationale, (iii) guideline-grounded treatment options with titration and breakthrough dosing logic, and (iv) safety alerts with monitoring and follow-up suggestions. Importantly, when key clinical variables are missing, the system explicitly reports uncertainty (“Data Not Enough”) and requests verification instead of forcing deterministic recommendations. The treating clinician remains responsible for final decision-making, documentation, and prescribing, and can accept, modify, or reject the system output.

Because clinical decision-making is iterative and sometimes conflicting, we define two additional core conflict-resolution workflows to increase the flexibility of the framework. First, when the Safety-Check Agent and the Treatment-Planning Agent disagree (e.g., the latter suggests a standard opioid dose while the former identifies a risk of accumulation due to renal insufficiency), the Safety-Check Agent’s opinion is taken as the primary reference, because patient safety considerations are unavoidable when using opioids for cancer pain management. The treating clinician then weighs the conflicting elements against the patient’s current clinical condition (e.g., renal function changes, adequacy of pain control) and makes the final adjustment (e.g., lowering the dose, substituting a drug, or intensifying monitoring). Second, when the clinician’s judgment contradicts the system’s final output (e.g., the system suggests step 3 opioids, but the clinician suspects underdiagnosed neuropathic pain that warrants an adjuvant), clinical decision-making rests entirely with the clinical team. In such cases, we recommend a consult with at least two relevant specialists (e.g., pain management physicians, oncologists) for a thorough review of the patient’s clinical data, the inferred pain mechanism, and the guidelines underlying the system’s output. The final decision on the treatment plan belongs to the consulting team, and the system output is retained as a reference for transparency, accountability, and improvement of the model.
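The safety-first arbitration rule in the first workflow can be sketched in a few lines. All names below (`AgentOpinion`, `resolve_conflict`) are illustrative and not part of the released codebase; the sketch only encodes the precedence logic: a flagged safety risk overrides the treatment plan and routes the case to the treating clinician.

```python
from dataclasses import dataclass

@dataclass
class AgentOpinion:
    agent: str            # e.g., "treatment_planning" or "safety_check"
    recommendation: str
    risk_flagged: bool = False

def resolve_conflict(treatment: AgentOpinion, safety: AgentOpinion) -> dict:
    """Safety-first arbitration: if the Safety-Check agent flags a risk,
    its recommendation becomes the primary reference and the case is
    routed to the treating clinician for final adjustment."""
    if safety.risk_flagged:
        return {
            "primary_reference": safety.recommendation,
            "overridden": treatment.recommendation,
            "requires_clinician_review": True,
        }
    return {
        "primary_reference": treatment.recommendation,
        "overridden": None,
        "requires_clinician_review": False,
    }
```

In the renal-insufficiency example above, the Treatment-Planning output ("standard opioid dose") would be recorded as overridden, while the dose-reduction advice becomes the primary reference pending clinician review.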

Translating an LLM-based decision support system into practice also requires explicit governance, regulatory awareness, and ethical safeguards. Depending on jurisdiction and intended use, a system that influences analgesic selection, dosing, or monitoring may fall under regulated clinical decision support software or software as a medical device (SaMD), necessitating risk management, verification/validation, cybersecurity controls, and quality management procedures. From an ethics perspective, deployment should prioritize privacy-preserving configurations (e.g., institution-approved on-premise or secure environments), strict access control, and audit logs. We further recommend an auditable design in which each recommendation is stored with model version, knowledge-base version, retrieved evidence context, and clinician feedback to support traceability and continuous improvement. Finally, fairness and generalizability should be evaluated across demographic subgroups, cancer types, and institutions to ensure the system does not amplify existing inequities in pain treatment. Under these conditions, OncoPainBot is positioned as a supportive tool that augments clinician reasoning and standardizes guideline adherence, rather than replacing clinical judgment.

This study has limitations that will need deeper exploration in the future. First, the framework was evaluated on retrospective data, which, while useful, cannot fully capture the real-time, dynamic nature of clinical decision-making or the variability of patient responses over time. Second, the framework depends on available clinical guidelines, literature, and databases, whose limitations in coverage and accuracy will improve only as studies and treatments accumulate16. Third, the model shows decent performance on typical scenarios, but atypical clinical scenarios lead to a measurable performance drop; for instance, error rates increased for patients with mixed pain mechanisms or multi-organ dysfunction, suggesting that for complex cases, current LLM reasoning paradigms may still lack the nuanced clinical judgment needed for polypharmacy and fine-grained dose adjustment. Finally, although the validation cohort comprised 516 patients from varied demographics, the data were derived from a single center, which introduces a certain uniformity in prescribing habits and clinical documentation practices and may therefore overestimate model performance; multi-center studies with more varied case distributions are needed. These factors indicate that the system remains at the stage of early feasibility exploration. Its core value lies in the preliminary verification of the technical framework’s feasibility and the effectiveness of its core functions, rather than a comprehensive validation of its clinical generalizability.
Future research should further expand diverse external validation datasets and conduct in-depth verification across a broader range of clinical scenarios, thereby optimizing the model’s reasoning strategies and enhancing its adaptability and robustness in complex clinical contexts. Finally, although the framework shows great promise in automating pain management, the system cannot replace the clinical judgment of experienced healthcare providers, particularly in cases where nuanced, individualized care is required19. Therefore, at the current stage, OncoPainBot should be used only as non-interventional, clinician-supervised decision support rather than as an autonomous prescribing system. The next step is a prospective, multi-center, workflow-embedded evaluation that measures not only prescription agreement but also patient-centered outcomes, clinician workload, and predefined safety endpoints, supported by continuous auditing and robustness/fairness monitoring for responsible deployment20.

In conclusion, we developed OncoPainBot, an LLM-driven collaborative framework that simulates multidisciplinary cancer pain management via four specialized agents and integrates retrieval-augmented generation to ground recommendations in clinical evidence. Across systematic benchmarking of foundation models and retrieval strategies, the selected configuration achieved strong performance, and real-world validation on 516 retrospective EMRs demonstrated high semantic alignment with clinician documentation and robust analgesic recommendation accuracy (0.841). By structuring outputs into interpretable modules, the framework achieves intrinsic interpretability through process transparency. Unlike post-hoc visualization methods, our approach relies on the explicit sequential dependency of four agents and their internal Chain-of-Thought (CoT) outputs. At the current stage, OncoPainBot is best positioned as a clinician-in-the-loop tool to assist documentation, guideline adherence, and safety checks, with final decisions remaining under physician control. Future work will prioritize prospective, multi-center, workflow-embedded evaluation with patient-centered outcomes and safety endpoints, together with auditable governance (model/knowledge-base versioning, logging, and update procedures) to enable responsible clinical translation.

Methods

Knowledge base construction

We systematically collected clinical practice guidelines from major international oncology organizations. The knowledge base encompassed 18 cancer types, including lung, breast, hepatocellular carcinoma, prostate, colorectal, gastric, pancreatic, renal, bladder, cervical, ovarian, endometrial, melanoma, lymphoma, head and neck, thyroid, testicular, and esophageal cancers. Guidelines were sourced from NCCN, ESMO, ASCO, AASLD, EAU, and ATA2,24–28. The knowledge base was organized into four layers: (1) Data Sources (international guidelines); (2) Knowledge Coverage (18 cancer types and pain protocols); (3) Knowledge Processing (extraction of pain assessment criteria, WHO ladder recommendations, dosing, safety, and contraindications); (4) Core Applications (integration into the LLM-driven framework for automated assessment and planning). Each guideline was processed to extract pain classification systems, recommended assessment tools (e.g., NRS, VAS), pharmacologic algorithms per the WHO ladder, and special considerations for organ dysfunction or comorbidities. We aggregated 2023–2024 clinical practice guidelines from NCCN, ESMO, ASCO, AASLD, EAU, and ATA, and organized them hierarchically by cancer category, guideline type, and processing pipeline (see Fig. 7a). A complete inventory of guideline sources, years, versions, and license notes is provided in Supplementary Table 8.

Fig. 7. Construction and composition of the clinical knowledge base and patient cohort used in OncoPainBot.

Fig. 7

a Authoritative Knowledge Sources: Integration of multidisciplinary cancer guidelines from NCCN, ASCO, ESMO, AASLD, and others, covering 19 cancer types and four major guideline categories (screening, diagnosis, treatment, follow-up). Knowledge was processed through structured extraction, indexing, and semantic encoding to form the clinical retrieval corpus. Real-World Clinical Cohort Summary (n = 516): Word cloud visualizations illustrate data distribution across key dimensions — b primary cancer types, c metastasis sites, d pain characteristics, e analgesic medications, f WHO analgesic ladder stages, and g treatment outcomes.

To reduce “evidence drift” as guidelines evolve, we planned the knowledge base as a living, versioned resource rather than a one-off snapshot. In practice, we periodically monitor primary sources (e.g., NCCN/ESMO/ASCO) for new releases using the published version/date information and document-level change detection. When an update is identified, we do not overwrite the existing corpus; instead, we rebuild the retrieval corpus in a controlled manner by re-parsing the updated guideline, regenerating chunks with the same segmentation rules while attaching consistent provenance metadata (organization, guideline title, release version/date, section heading, and processing timestamp), and re-embedding/re-indexing the updated corpus under the same retrieval configuration. Each rebuild produces an immutable “knowledge snapshot” with a unique identifier, and model outputs record the snapshot ID so that every recommendation can be traced to exactly which guideline version supported it at the time of inference. Historical snapshots are retained for auditability and reproducibility, and rollback is supported by switching the active snapshot to a previously validated version. Before activating a new snapshot, we run a lightweight regression suite on a fixed set of representative clinical queries (opioid escalation/titration, contraindications in organ dysfunction, drug–drug interactions, breakthrough pain, and monitoring recommendations) to ensure that safety-critical behaviors do not degrade after an update; any flagged changes are reviewed prior to release.
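As one illustration of the snapshot mechanism, a content-addressed identifier makes snapshots immutable and traceable. The sketch below is hypothetical (the function name, metadata fields, and version strings are assumptions, not the paper's implementation): any change to the corpus yields a new snapshot ID, while rebuilding the same corpus reproduces the previous one, supporting audit and rollback.

```python
import hashlib
import json

def build_snapshot(chunks: list) -> dict:
    """Create an immutable knowledge snapshot: each chunk carries provenance
    metadata, and the snapshot ID is a hash of the corpus content, so any
    guideline update produces a new, traceable identifier."""
    canonical = json.dumps(chunks, sort_keys=True).encode("utf-8")
    snapshot_id = hashlib.sha256(canonical).hexdigest()[:12]
    return {"snapshot_id": snapshot_id, "chunks": chunks}

# Illustrative chunk with provenance metadata (version string is made up).
chunk = {
    "organization": "NCCN",
    "guideline": "Adult Cancer Pain",
    "version": "2.2024",
    "section": "Opioid titration",
    "text": "guideline excerpt ...",
}
snap_a = build_snapshot([chunk])                              # active snapshot
snap_b = build_snapshot([{**chunk, "version": "1.2025"}])     # after an update
```

Because the identifier is derived only from corpus content, recording it alongside each model output is enough to trace any recommendation back to the exact guideline versions in force at inference time.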

LLM-driven collaborative framework: OncoPainBot

OncoPainBot consists of four specialized LLM agents that mimic the professional roles involved in comprehensive cancer pain management (Fig. 8). The system works as follows:

  1. Pain-Extraction LLM (Mimicking First-Contact Physician): This module focuses on the patient’s pain history (e.g., location, type, severity, duration). It parses medical records to document the pain history in a structured format, employing NLP and CoT to generate a structured pain evaluation used for triage and to guide treatment recommendations.

  2. Pain-Mechanism Reasoning LLM (Mimicking Pain Specialty Physician): This module uses the medical history and data to reason about and hypothesize the pain’s underlying mechanisms and to categorize its type. It distinguishes pain types, including musculoskeletal, neuropathic, and visceral, and explains the mechanism behind each. It combines RAG, guideline querying, and CoT to generate text describing the pain, its mechanisms, and its type.

  3. Treatment-Planning LLM (Simulating the Clinician/Multidisciplinary Team): This module formulates specific treatment strategies once the type and intensity of pain are established, outlining suitable medications and adjunct measures. It retrieves treatment protocols from primary-source guidelines via RAG and uses CoT to produce customized, fully specified pain management plans for varying degrees of pain intensity.

  4. Safety-Check LLM (Simulating the Pharmacist): This module checks the proposed treatment strategy for safety, determining whether the chosen medications could harm the patient given their condition. It uses RAG over drug-safety knowledge databases to verify that the treatment plan is safe and recommends alterations when necessary; no new drug is recommended if the patient’s condition is already adequately controlled by the current regimen.
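The four-stage flow above can be sketched as a sequential pipeline in which each agent consumes the upstream output. The orchestration function and stub agents below are illustrative assumptions, not the released implementation; in the actual framework each stage wraps an LLM call with its prompt template and, where applicable, RAG.

```python
def run_oncopainbot(record: str, agents: dict) -> dict:
    """Sequential four-agent pipeline mirroring the extraction ->
    mechanism reasoning -> treatment planning -> safety check flow.
    `agents` maps stage names to callables (LLM wrappers in practice)."""
    features = agents["pain_extraction"](record)
    mechanism = agents["mechanism_reasoning"](features)
    plan = agents["treatment_planning"](features, mechanism)
    verdict = agents["safety_check"](plan)
    return {"features": features, "mechanism": mechanism,
            "plan": plan, "safety": verdict}

# Stubbed agents for illustration only; outputs are hypothetical.
demo_agents = {
    "pain_extraction": lambda rec: {"site": "lumbar spine", "nrs": 7},
    "mechanism_reasoning": lambda f: "mixed nociceptive/neuropathic",
    "treatment_planning": lambda f, m: {"regimen": "WHO step 3 opioid + adjuvant"},
    "safety_check": lambda p: {"approved": True, "alerts": []},
}
report = run_oncopainbot("free-text EMR note ...", demo_agents)
```

The explicit sequential dependency is what gives the framework its process transparency: each intermediate output (features, mechanism, plan) is inspectable before the safety verdict is issued.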

Fig. 8. Multi-agent architecture of the OncoPainBot framework simulating multidisciplinary cancer pain management.

Fig. 8

This figure illustrates the collaborative workflow of OncoPainBot, which replicates real-world clinical roles through four specialized LLMs: (1) the Pain-Extraction LLM identifies key pain elements from records, (2) the Pain-Mechanism Reasoning LLM infers pain types and mechanisms, (3) the Treatment-Planning LLM formulates individualized analgesic regimens, and (4) the Safety-Check LLM reviews medication safety and interactions. Together, these agents form an end-to-end pipeline from record analysis to evidence-based, pharmacist-verified treatment recommendations.

Together, these modules simulate primary physicians, pain specialists, clinicians, and pharmacists, providing a collaborative, multi-agent framework that covers the full pain management process. Each module functions individually yet interdependently, ensuring nothing is overlooked when addressing cancer pain management strategies. The framework is publicly available on GitHub: https://github.com/nine-github/OncoPainBot. The prompt templates of all four agents (Pain-Extraction, Pain-Mechanism Reasoning, Treatment-Planning, and Safety-Check) are provided in Supplementary Table 9.

Comparison of LLM base models

In this study, the DeepSeek, Gemini 2.5 Pro, Claude 4, Kimi, ChatGPT 4o, Doubao 1.5 Pro, and GLM4.5 models were evaluated on the MedMCQA-Cancer, MedMCQA, and PubMedQA datasets29–31. We aimed to characterize each model’s error rate, latency, and proficiency across multiple dimensions. The selected models mix retrieval-oriented and generative designs, allowing performance to be analyzed from different perspectives in relation to task complexity. Models were rated on their correct answers, and performance was summarized using standard classification measures: accuracy, precision, sensitivity, and F1 score. In addition to accuracy, we monitored response time32 as an important metric of real-time responsiveness33. All models were deployed in an identical cloud-computing environment to ensure balanced resource allocation, and their relative pricing was compared to evaluate cost efficiency. Datasets were formatted as each model required, including normalization of medical terms and tokenization. Confusion matrices were used to analyze misclassified instances and improve classification. Finally, all models were compared in a simulated medical question-answering setting to assess real-world performance.

RAG strategy comparison

The performance of retrieval-augmented generation (RAG), which combines information retrieval and natural language processing, is heavily influenced by the strategy used to compute similarity during retrieval. We investigated three retrieval strategies: no RAG, Vanilla RAG, and GraphRAG34–36. Central to these strategies is the choice of similarity metric: in the traditional BM25 model, the relevance of a document D to a query Q is assessed by

$$\mathrm{Score}(D,Q)=\sum_{i=1}^{n}\mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\,(k_1+1)}{f(q_i,D)+k_1\left(1-b+b\,\frac{|D|}{\mathrm{avgdl}}\right)} \tag{1}$$

integrating inverse document frequency (IDF) and term frequency (TF), while using the saturation parameter k1 and the length-normalization parameter b to control the influence of high term counts and document length, respectively. By contrast, cosine similarity measures semantic proximity between a query vector A and a document vector B as

$$\mathrm{Similarity}(A,B)=\cos(\theta)=\frac{A\cdot B}{\|A\|\,\|B\|}=\frac{\sum_{i=1}^{N}A_iB_i}{\sqrt{\sum_{i=1}^{N}A_i^{2}}\,\sqrt{\sum_{i=1}^{N}B_i^{2}}} \tag{2}$$

with values in [−1, 1] reflecting the degree of semantic alignment.
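For concreteness, both metrics can be implemented directly from Eqs. (1) and (2). The sketch below uses one common IDF variant for BM25 (an assumption; the paper does not specify its exact IDF formula) and plain Python lists as vectors:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """BM25 relevance of one tokenized document to a query (Eq. 1).
    `corpus` is a list of tokenized documents, used for IDF and avgdl.
    IDF here follows a common smoothed variant (an assumption)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)                  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = doc_terms.count(q)                                # term frequency f(q, D)
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (Eq. 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In a hybrid RAG setup, lexical BM25 scores and embedding-based cosine scores are typically combined (e.g., by weighted sum or rank fusion) before the top chunks are passed to the generator.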

Building on this, we assessed three Vanilla RAG variants aimed at improving retrieval: semantic RAG, BM25 RAG, and hybrid RAG, focusing on the choice of similarity computation on the MedMCQA dataset with regard to accuracy, response time, and computational efficiency. We further studied knowledge-base scaling from 1000 to 5000 chunks to measure retrieval precision and response time. All experiments were performed under identical cloud conditions to maintain experimental integrity. We report precision, recall, F1, and accuracy, along with confusion-matrix analyses, which together indicate when RAG is appropriate for medical question answering and how large the knowledge base should be.

OncoPainBot clinical performance validation

We retrospectively analyzed the electronic medical records (EMRs) of 516 cancer pain patients from the China Academy of Chinese Medical Sciences Guang’anmen Hospital, between September 2023 and March 2025. All patient records were de-identified in compliance with institutional review board (IRB) approval and under the oversight of the Health Insurance Portability and Accountability Act (HIPAA)37. This retrospective analysis of patient data was conducted in accordance with the principles of the Declaration of Helsinki, and the requirement for written informed consent was waived. The ethical approval number for this study is MDKN-2024-058. The inclusion criteria were as follows: histologically confirmed malignancy, documented cancer-related pain requiring analgesic intervention, complete EMR with baseline assessment, treatment plan, and follow-up, and patients aged 18 years or older. The retrospective EMRs were de-identified and primarily used as free-text clinical narratives rather than structured database tables. Therefore, we did not apply conventional field-mapping rules between EMR columns. Instead, we operationalized “mapping” as schema-based information extraction from narrative notes. Specifically, the Pain-Extraction agent converts free-text statements into a fixed structured schema using a constrained prompt template. For example, a narrative such as “the patient may have a fever, around 38°C” is normalized into a structured item (temperature = 38°C; symptom = fever; certainty = uncertain) when such information is present. The extracted schema includes pain site, pain descriptors, pain intensity (NRS/VAS when documented), prior analgesic exposure and response, breakthrough pain description if available, comorbidities, concurrent medications, and key hepatic/renal laboratory indices when documented.
When schema items were unavailable, they were recorded as ‘N/A’; however, to ensure safety, the ‘Data Not Enough’ mechanism was governed by a clinical necessity logic derived from NCCN guidelines. This mechanism enforced a minimum data set where the absence of critical determinants—specifically pain intensity for WHO ladder placement, primary pain site for adjuvant selection—would immediately trigger a system halt, preventing the model from generating prescriptions based on insufficient clinical evidence. Laboratory values were preserved as documented in the EMR (including units) and used for safety-context reasoning; no additional numerical normalization was performed beyond keeping the recorded value/unit. To fit model context limitations, long notes were handled by prioritizing pain-related narrative, medication history, and laboratory lists, while non-essential text was truncated to retain clinically relevant content for analgesic decision support. Ground truth for evaluation was defined as the de-identified, clinician-documented cancer pain assessment and analgesic management decision recorded in the EMR during routine care after standard clinical evaluation. Reference labels (e.g., WHO analgesic ladder step/regimen category) were derived from the original EMR documentation using a prespecified field-to-label mapping based on recorded prescriptions and clinical notes; records with missing key fields or ambiguous documentation were flagged and resolved through clinician review.
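The minimum-data gate described above can be expressed as a simple check. The field names and required set below are illustrative (the paper names pain intensity and the primary pain site as the critical determinants); a real deployment would derive the required set from the NCCN-based clinical necessity logic:

```python
# Critical determinants whose absence triggers a system halt
# (illustrative field names, per the 'Data Not Enough' mechanism).
REQUIRED_FIELDS = {"pain_intensity", "pain_site"}

def check_minimum_data(extracted: dict) -> dict:
    """Enforce the 'Data Not Enough' gate: if a critical determinant is
    missing or recorded as 'N/A', halt instead of letting the downstream
    agents generate a prescription on insufficient evidence."""
    missing = [f for f in REQUIRED_FIELDS
               if extracted.get(f, "N/A") in (None, "", "N/A")]
    if missing:
        return {"status": "Data Not Enough", "missing": sorted(missing)}
    return {"status": "OK", "missing": []}
```

The gate runs after extraction and before mechanism reasoning, so a halted case surfaces a verification request to the clinician rather than a deterministic recommendation.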

We summarized term frequencies across cancer types, metastasis sites, pain characteristics, analgesics, WHO analgesic ladder steps, and outcomes (see Fig. 7b–g). The cohort consisted of 292 males (56.6%) and 224 females (43.4%), with a mean age of 59.69 ± 10.34 years (range: 18-85). The primary cancers included hepatocellular carcinoma (HCC) (163 patients, 31.6%), kidney cancer (137 patients, 26.6%), lung cancer (52 patients, 10.1%), breast cancer (46 patients, 8.9%), prostate cancer (28 patients, 5.4%), colorectal cancer (20 patients, 3.9%), gastric cancer (14 patients, 2.7%), and others (56 patients, 10.9%). The most common metastatic sites were bones (161 patients, 31.2%), followed by the spine (87 patients, 16.9%), and other sites (252 patients, 48.8%), including ribs (15 patients, 2.9%) and hip/femur (1 patient, 0.2%).

At baseline, the mean Numerical Rating Scale (NRS) score was 6.89 ± 1.85, with pain severity categorized as mild in 15 patients (2.9%), moderate in 159 patients (30.8%), and severe in 277 patients (53.7%). The most common NRS scores were 8 (20.7%), 7 (17.2%), and 6 (13.0%). Post-treatment, the mean NRS score was reduced to 2.47 ± 1.03, with most patients reporting scores of 3 (45.5%), 2 (40.1%), and 1 (5.7%). Complete pain resolution (NRS = 0) was achieved in 9 patients (2.4%). The primary cancers were dominated by hepatocellular carcinoma (31.6%) and kidney cancer (26.6%) (Fig. 9a). Age approximated a normal distribution (mean 59.7 ± 10.3 years; Fig. 9b). Pre-treatment NRS scores clustered at 6–8 and shifted to 1–3 after treatment (Fig. 9c). WHO analgesic ladder usage was predominantly Step 3 (64.5%; Fig. 9d), consistent with the high baseline pain severity. Metastases most frequently involved multiple bones and the spine (Fig. 9e). A mild negative association between baseline and post-treatment NRS was observed, indicating larger absolute reductions at higher baselines (Fig. 9f).

Fig. 9. Baseline characteristics and pain management outcomes of the 516 cancer pain patients included in the OncoPainBot validation cohort.

Fig. 9

a Primary cancer distribution shows hepatocellular carcinoma and kidney cancer as the most prevalent types. b Age distribution demonstrates a near-normal curve with a mean age of 59.7 ± 10.3 years (range 18–85). c NRS pain score comparison reveals a significant reduction in pain intensity from baseline (mean 6.9) to post-treatment (mean 2.5). d WHO analgesic ladder distribution indicates that most patients (64.5%) required step 3 opioids. e Metastasis sites were mainly multiple bones and spine, reflecting complex pain etiologies. f Pain score correlation analysis (Pearson r = 0.052, p < 0.001) confirms effective pain relief with minimal residual correlation between pre- and post-treatment scores.

In this set of experiments, we evaluated multiple AI models on real-world clinical documentation from cancer pain patients, including OncoPainBot, MDAgents, Claude 4, ChatGPT 4o, Gemini 2.5 Pro, and GLM4.5. The evaluation focused on two complementary dimensions: (i) semantic alignment between model-generated reports and clinician-documented EMR notes, and (ii) clinical decision accuracy of the generated analgesic recommendations under real-world prescribing constraints38,39.

For semantic alignment, we used a suite of natural-language evaluation metrics to quantify meaning preservation and textual consistency with the reference EMR documentation. Specifically, to evaluate the consistency of the model’s answers, BERTScore was applied to assess the embedding-based semantic equivalence between the generated text and the ground truth. A high BERTScore indicates that the model’s output maintains semantic consistency with the clinician’s expert consensus, regardless of superficial phrasing variations. BLEU and ROUGE (ROUGE-1/ROUGE-2/ROUGE-L) were used to quantify n-gram overlap and sequence-level retention between generated and reference texts40,41. These metrics were computed under a unified Python evaluation pipeline to ensure consistent preprocessing and scoring across models.
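As a simplified illustration of the n-gram metrics, a bare-bones ROUGE-N F1 can be computed in pure Python (whitespace tokenization only; the actual pipeline would use the standard BLEU/ROUGE/BERTScore packages rather than this sketch):

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: clipped n-gram overlap between a generated
    report and the reference EMR text, with whitespace tokenization."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A perfect lexical match yields 1.0; extra generated tokens reduce precision while missing reference tokens reduce recall, which is why these surface metrics are paired with BERTScore to credit semantically equivalent paraphrases.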

For clinical decision accuracy, the model outputs were evaluated as structured recommendation tasks, assessing whether the suggested analgesic plan matched the EMR-documented prescriptions and safety constraints. Performance was summarized using standard classification metrics, complemented by confusion-matrix analyses to characterize misclassification patterns42. In addition, we conducted error-type analyses by grouping discrepant cases into clinically interpretable categories (e.g., dosage, timing, monitoring/follow-up, drug-interaction, and patient-factor–related adjustments) to support transparent auditing of failure modes in complex cancer pain regimens.
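The standard classification summary can be derived directly from confusion-matrix counts; the counts in the usage example below are hypothetical, not the study's data:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    as used to score whether a recommended analgesic plan matched the
    EMR-documented prescription (appropriate vs. not)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

Error-type analysis then partitions the (fp + fn) discrepant cases into the clinically interpretable categories listed above (dosage, timing, monitoring/follow-up, drug interaction, patient factors) for transparent auditing.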

Statistical Analysis

Regarding the statistical analysis of patients’ clinical baseline data, continuous variables, such as age and NRS pain scores before and after treatment, were reported as mean ± standard deviation (SD) and range. Categorical variables, including gender, primary cancer type, metastasis sites, and WHO analgesic ladder steps, were presented as frequencies and percentages. Pain score distributions before and after treatment were visualized with histograms and kernel density estimation (KDE) plots to characterize distributional patterns and treatment response. Given the paired study design and the cohort size (N = 516), we used a two-sided paired Student’s t-test to compare pre-treatment and post-treatment NRS pain scores. All statistical analyses were conducted in Python43, and statistical significance was defined as two-sided α = 0.05. The change in pain intensity was categorized as excellent (≥75% reduction), good (50–74% reduction), moderate (25–49% reduction), and poor (<25% reduction). The correlation between pre-treatment and post-treatment NRS scores was assessed using Pearson’s correlation coefficient (r) with 95% confidence intervals (CIs). Subgroup analyses were performed across different categories (primary cancer type, metastasis sites, pain severity, and WHO analgesic ladder steps) using Chi-square tests for categorical variables and one-way ANOVA for continuous variables44.
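The response categories and descriptive summaries defined above translate directly into code; the score lists in the usage lines are illustrative, not patient data:

```python
from statistics import mean, stdev

def categorize_response(pre: float, post: float) -> str:
    """Categorize the percent reduction in NRS pain intensity using the
    thresholds defined in the analysis plan."""
    reduction = (pre - post) / pre * 100 if pre > 0 else 0.0
    if reduction >= 75:
        return "excellent"
    if reduction >= 50:
        return "good"
    if reduction >= 25:
        return "moderate"
    return "poor"

def summarize(scores: list) -> tuple:
    """Mean ± SD summary, as reported for baseline and post-treatment NRS."""
    return round(mean(scores), 2), round(stdev(scores), 2)

# Illustrative values only (not cohort data).
example_category = categorize_response(pre=8, post=2)   # 75% reduction
example_summary = summarize([6, 7, 8])
```

The paired t-test, Pearson correlation with CIs, Chi-square tests, and ANOVA would be run with standard scientific Python routines (e.g., SciPy) on the full paired score arrays.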

Ethics and Consent to Participate

The study has been reviewed and approved by the Ethics Committee of Guang’anmen Hospital, China Academy of Chinese Medical Sciences (Ethics approval number: MDKN-2024-058). The requirement for written informed consent was waived by the Ethics Committee of Guang’anmen Hospital, China Academy of Chinese Medical Sciences, because this study was a retrospective analysis of de-identified clinical data.

Supplementary information

Supplementary materials. (249.2KB, pdf)

Acknowledgements

The authors declare that there are no acknowledgements.

Author contributions

H.L., Y.H. and D.L. contributed equally to this work. Conceptualization: H.L., Y.H., D.L. and Y.B. Methodology & software (LLM/RAG design): B.S., X.Z. and Y.N. Data curation & clinical validation: H.L., Y.H., D.L., G.Z., C.L. and L.W. Formal analysis & visualization: H.L., Y.H., D.L. and Y.N. Supervision: Y.B. Writing—original draft: H.L., Y.H. and D.L. Writing—review & editing: all authors. Funding acquisition: Y.B. All authors reviewed and approved the final manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 81302961, 81973890); the Innovation Fund of the China Academy of Chinese Medical Sciences (No. CI2021A01817); the Clinical Research and Achievement Transformation Capacity Enhancement Project for High-Level TCM Hospitals (No. HLCMHPP2023078); and the Innovation Key Fund of the China Academy of Chinese Medical Sciences (No. YY3101202507001).

Data availability

The data supporting this study’s findings are not publicly available due to privacy and confidentiality restrictions. However, the datasets can be obtained from the corresponding author upon reasonable request, subject to approval and compliance with relevant data protection regulations.

Code availability

Core modules, prompts, and orchestration scripts of OncoPainBot are available at GitHub: https://github.com/nine-github/OncoPainBot.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Haixiao Liu, Yue Hu, Dongtao Li.

Supplementary information

The online version contains supplementary material available at 10.1038/s41746-026-02362-6.

