Abstract
Background
Artificial intelligence (AI) platforms, such as Gemini, ChatGPT, DeepSeek, and Perplexity, are increasingly utilized to support clinical decision-making, yet their accuracy in specific medical domains remains variable. This study assessed the performance of these AI chatbots in responding to clinical questions commonly posed by surgeons in the context of salivary gland cancer, a field closely related to oral and maxillofacial oncology.
Methods
Thirty clinical questions related to salivary gland malignancies were created according to the ASCO 2021 guidelines. Two researchers posed these questions on four AI chatbot platforms: ChatGPT-4o, DeepSeek, Gemini, and Perplexity. The questions were queried three times daily over ten days, yielding a total of 7,200 responses that were coded as correct or incorrect. The accuracy of each response was statistically analyzed, and overall accuracy rates for each AI platform were calculated.
Results
DeepSeek achieved the highest accuracy rate at 86.9%, followed by Gemini at 78.9%, ChatGPT-4o at 72.8%, and Perplexity at 71.6%.
Conclusion
Despite demonstrating substantial potential, current AI chatbots have not yet achieved sufficient accuracy for standalone clinical use in salivary gland cancer. Enhancements in AI capabilities and rigorous clinical validation are necessary to ensure patient safety and effectiveness in clinical practice.
Keywords: Artificial intelligence, LLMs, DeepSeek, Gemini, ChatGPT, Perplexity, Salivary gland cancer
Introduction
Artificial intelligence (AI) is described as a computer system that can accomplish tasks that previously required human intelligence, leveraging adaptive learning, experiential growth, and cognitive mimicry to execute complex human functions autonomously [1, 2]. It is fundamentally rooted in algorithm-driven frameworks that empower machines to analyze vast datasets, engage in data-driven decision-making, and employ dynamic problem-solving mechanisms, progressively optimizing their functionality through continuous self-enhancement [2, 3]. A subset of AI is machine learning, in which computer systems are designed to recognize patterns in data and carry out certain tasks in response to those patterns. A highly specialized method in this field, called deep learning, employs several data processing layers to gradually extract higher-level characteristics from unprocessed input data [4]. Machine learning models are divided into two primary varieties: discriminative models, which learn to distinguish between categories of data, and generative models, which are designed to create new instances of data. Large language models (LLMs), a type of artificial neural network (ANN), belong to the generative family: they learn to reproduce the style and content of the text in their training data by predicting the next word in a sequence based on the context of preceding words [4].
LLMs represent an important improvement over conventional techniques: they use transformer-based neural networks that dynamically contextualize sequential data through self-attention mechanisms while training on exascale textual datasets. To align their outputs with ethical and pragmatic human choices, these models are refined iteratively through human feedback. Dual scalability—the exponential increase in model parameters and the absorption of multiterabyte training corpora—gives them unmatched predictive power, demonstrating a clear relationship between computational scale and emergent language skill [5, 6]. LLMs are among the most widely used AI chatbots, having redefined digital communication paradigms by enabling context-aware, personalized interactions through dynamic information dissemination. Powered by sophisticated neural frameworks—inspired by the human brain’s synaptic architecture—these systems employ deep learning algorithms to engage in data-driven experiential learning, iteratively optimizing responses via cognitive mirroring and feedback integration. Their emergent proficiency hinges on scalable neural parameterization and exposure to multimodal training corpora, cementing a symbiotic relationship between computational depth and context-aware adaptability in human-AI discourse [7]. LLMs are neural networks trained on enormous volumes of text data from the internet, including digitized books, articles, Wikipedia, and websites, using deep-learning techniques and advanced modeling. Their objective is to process the incoming text (a question or prompt) and generate rational, human-like conversational answers based on its context [8]. Each AI chatbot draws on the internet through different sources of knowledge [1, 9]. Internet-based information is useful because it is always available and offers a variety of viewpoints. Advanced AI chatbots, such as ChatGPT, Gemini (formerly Google Bard), Perplexity, and DeepSeek, make knowledge easily accessible to anyone with an internet connection [9–12].
ChatGPT (Chat Generative Pretrained Transformer) was announced by OpenAI Inc. (San Francisco, CA, USA) in November 2022 [13, 14]. Its most recent version, GPT-4o, excels at processing media content; such multimodal models are a rapidly developing field. Google Bard, which debuted on March 21, 2023, was intended as an alternative to ChatGPT. Trained on comparable data, Bard aims to achieve goals similar to those of ChatGPT, and it has been equipped with an image analysis feature since May 10, 2023 [15].
ChatGPT’s performance is limited by usage restrictions that vary with subscription level and depend on token consumption and query complexity. OpenAI states that the generated material may be inaccurate, biased, or incomplete and therefore should not be used for professional advice. The API data policy states that client data are retained for 30 days and are not used to train models. Its API terms also make clear that users may not exploit the service in harmful ways, resell it without permission, or share their credentials, meaning users are responsible for verifying that all outputs are accurate and appropriate [16]. Perplexity is a chatbot built on OpenAI GPT technology with additional specialized training; it provides answers to questions and prompts along with links to references and related topics [17].
Perplexity’s services are provided “AS IS,” with no assurances of accuracy or completeness of the content, and it should not be used for professional advice. Its API is primarily configured for businesses, and the amount of purchased credit determines usage and feature access, directly affecting rate limits. Keeping API keys confidential is essential, and all users should follow best practices for API key management, token optimization, and compliance with data protection laws such as GDPR and HIPAA [18, 19]. In contrast, Google Gemini can analyze complicated datasets such as graphs and photographs [20]. As an experimental technology, Gemini’s outputs may be inaccurate or objectionable; it must not be used for medical or legal advice; its rules prohibit reverse engineering or building competing AI models; and Google Cloud-managed rate limits and usage caps apply to API access. Its data management involves temporary retention of submitted files and API inputs/outputs, and human reviewers may access API content after anonymization, which discourages the submission of sensitive data [21, 22]. On 20 January 2025, a small AI company in China announced DeepSeek, prompting some enthusiasts to call it a “revolution in AI.” Its DeepThink (R1) model is an innovative open-source LLM that quickly became well known worldwide [10, 23, 24]. DeepSeek users must validate output legitimacy and disclose AI generation, and the model may exhibit biases on politically sensitive themes. Its online interface and app permissions have raised security concerns. API users must register securely and manage API keys, and they are responsible for breaches. DeepSeek may review user material and interactions for compliance, and the platform’s API requests are subject to dynamic rate limits that cannot be increased manually [25, 26].
From its early stages as a theoretical concept to today’s sophisticated applications powered by deep learning algorithms [27], AI has reshaped healthcare delivery [28]. It continues to drive innovations in diagnostics, personalized medicine, drug discovery [29], and decision support systems while addressing challenges such as medical errors and resource optimization [30, 31]. However, its primary functional limitations lie in its inability to replace physician-conducted clinical examinations and in the potential for diagnostic errors, raising critical questions about the attribution of legal or ethical responsibility in such cases [32–34]. Since oncology is one of the specializations that involves both people and machines, the use of AI in this sector is steadily expanding as a result of rising demand and the wide range of potential uses across many cancer specialties. One can accordingly track its most recent applications in the early diagnosis of oral cancer, which is crucial because it helps lower patient morbidity and mortality rates [35]. AI-driven algorithms are now widely used in radiodiagnosis to improve imaging accuracy [36], in radiotherapy for precise tumor targeting, and in predicting patient responses to chemotherapy and immunotherapy [37, 38]. These technologies also support the identification of novel biomarkers and molecular targets [39], facilitate real-time surgical navigation, and contribute to the development of personalized nanomedicines [40]. Despite ongoing challenges related to data quality and ethical considerations, AI continues to advance cancer research and clinical practice, offering significant promise for improving patient outcomes [41, 42].
The most prevalent type of cancer in the head and neck area is oral cancer [43], which accounts for approximately 3.2% of cancer deaths [35, 44]. Consequently, early diagnosis is crucial, as it lowers patient morbidity and fatality rates [35]. AI-based deep learning models, such as convolutional neural networks (CNNs), show great promise in the identification of oral cancer. These models can diagnose oral potentially malignant disorders (OPMDs) and malignancies with a sensitivity of up to 98.75% and a specificity of up to 100% [45–47]. Such systems analyze histopathological images, intraoral photographs, and radiographic data to improve early detection, reduce diagnostic errors, and predict prognosis, outperforming traditional methods in accuracy (up to 99.7%) [46–48]. Key applications include early detection, where AI identifies subtle lesions in medical images and enables earlier intervention [49]; risk prediction, where machine learning models stratify patients based on clinical and genetic data [50]; and telemedicine, where smartphone-integrated AI tools provide low-cost screening in resource-limited settings [48, 51].
Salivary gland tumors (SGTs) are among the common cancers of the oral region; they are characterized by abnormal cellular growth within the salivary glands and generally present as swelling under the jaw or in the auricular area [52]. The incidence of SGTs ranges from 0.4 to 13.5 cases per 100,000 people [53]. The fifth edition of the WHO classification includes 15 types of benign SGTs and 22 types of malignant SGTs [54]. Even with imaging tests and fine needle aspiration cytology (FNAC), diagnosing SGTs is difficult because of their heterogeneous histology and differing degrees of malignancy [55, 56]. Preoperative evaluation of parotid gland tumors typically includes a medical history, physical examination, radiological examination, and FNAC [57]. Using deep learning and anomaly detection, researchers have improved how well MRI scans can detect parotid gland tumors, even with a small, unbalanced dataset; this AI-driven approach yielded more accurate tumor diagnoses than radiologists achieved [58].
Guideline adherence is critical in managing salivary gland malignancies because their histologic diversity and rarity necessitate standardized protocols for optimal outcomes [59]. The American Society of Clinical Oncology (ASCO) offers evidence-based clinical practice recommendations to guide clinicians and healthcare professionals in the management of salivary gland malignancies, published as "Management of Salivary Gland Malignancy: ASCO Guideline" [60]. Although AI has seen widespread adoption in oncology, the literature reveals a notable scarcity of research examining the reliability of AI-powered chatbots in generating accurate responses to clinical queries in this field [61]. Furthermore, no research has yet examined the precision of AI chatbot replies regarding the management of salivary gland cancer. The purpose of this study was therefore to assess the precision of answers produced by various AI chatbots to queries that surgeons may have about the management of procedures for salivary gland malignancy. The study's hypothesis was that the ChatGPT-4o, DeepSeek, Perplexity, and Gemini language models can provide accurate responses to questions about the management of procedures for salivary gland cancer.
Methods
Study design and data collection
A systematic evaluation of four artificial intelligence (AI) chatbots was conducted over 10 days (March 2–11, 2025) to assess their accuracy in addressing clinically relevant queries on salivary gland malignancy management. Two researchers, using different accounts [7], distributed the questions across three daily time intervals (morning, noon, evening) [62] to account for potential temporal variability in AI performance (Fig. 1). Thirty questions were meticulously designed based on the clinical guideline "Management of Salivary Gland Malignancy: ASCO Guideline" (2021) to ensure alignment with evidence-based practices. Of the 30 questions, 26 were yes/no (binary-response) questions and 4 were open-ended questions focused on diagnostic criteria, treatment protocols, and follow-up strategies (Table 1). All the questions were designed to meet predefined criteria: clinical relevance, practical applicability, and comprehensive coverage of salivary gland cancer management from diagnosis to follow-up.
Fig. 1.
Flowchart representing the steps of the evaluation of chatbot responses. Created in https://BioRender.com
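To make the factorial design concrete, the sketch below shows one way each elicited response can be represented as a record; the Python representation and field names are illustrative assumptions rather than the exact spreadsheet schema used (see the interaction protocol below).

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ResponseRecord:
    """One chatbot answer; field names are illustrative, not the exact
    spreadsheet schema used in the study."""
    question_id: int   # 1-30, from the ASCO-2021-derived question set
    platform: str      # "ChatGPT-4o", "DeepSeek", "Gemini", or "Perplexity"
    researcher: int    # 1 or 2 (separate accounts)
    day: int           # 1-10 (March 2-11, 2025)
    session: str       # "morning", "noon", or "evening"
    timestamp: datetime
    answer_text: str
    correct: bool      # coded against the guideline-derived answer

# The full design yields 30 questions x 3 sessions x 10 days
# x 2 researchers x 4 platforms = 7,200 records.
```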
Table 1.
Thirty questions were created according to ASCO 2021 guidelines [60]
| 1. Is there over 30 distinct salivary gland tumor types in the latest WHO classification system, including benign and malignant ones? |
| 2. Do the symptoms of SGC depend on the tumor location? |
| 3. Should symptoms like pain in the face or mouth, an externally or submucosally growing lump, or facial nerve paralysis prompt consideration of SGC? |
| 4. Is cytology or histology mandatory for diagnosing SGC? |
| 5. Should open biopsies be avoided in major salivary gland lesions due to the risk of complicating surgical treatment and spillage, except for skin ulcerating tumors? |
| 6. Is FDG-MRI recommended in high-grade SGC for detecting distant metastases? |
| 7. Should IHC and molecular testing be used as primary tools for diagnosing salivary duct carcinoma and adenocarcinoma NOS with distant metastases? |
| 8. Is analysis for NTRK fusion with NGS or whole genome sequencing mandatory for the differential diagnosis of secretory carcinoma and AcCC? |
| 9. Should FNA be used for screening, and if inadequate, can core needle biopsy be the next step? |
| 10. Should clinical classification be carried out after treatment, and pTNM be done before surgical resection using the UICC eighth edition staging classification? |
| 11. Should the pathological report for the correct management of major SGC unfollow the International Collaboration on Cancer Reporting guidelines? |
| 12. Should intraoperative frozen sections be indicated to evaluate margins of resection, perineural invasion, and lymph nodes only if the result is expected to alter management at the time of surgery? |
| 13. Are SGCs a common and complicated subgroup of head and neck cancers? |
| 14. Is the treatment of parotid gland cancer based on complete surgical excision with free margins? |
| 15. Is it important to collect as much information as possible about the tumor before surgery, discuss scenarios with the patient, and be able to do a graft during the ablative procedure? |
| 16. Should total parotidectomy always be the reference procedure for all parotid gland cancers, regardless of tumor grade? |
| 17. Should the facial nerve be kept intact if it is not affected or surrounded by the tumor? |
| 18. Does a preoperatively paralysed nerve VII require resection and primary reconstruction and/or reanimation procedures? |
| 19. Should malignant tumors confined within the submandibular gland require at least resection of the gland, but not require resection of the surrounding level Ib lymph nodes? |
| 20. Is a selective neck dissection involving levels I, II, and III contraindicated in case of high-grade malignancy without clinical evidence of cervical lymph node involvement, including the gland? |
| 21. Should patients with positive lymph nodes (clinical or radiological) undergo a comprehensive lymph node dissection involving levels I-V? |
| 22. Should END be carried out when the neck is entered as an approach to the primary or for reconstruction in minor SGC? |
| 23. Is postoperative local RT not recommended for T3-T4 and intermediate/high-grade tumors, as well as in cases with close resection margins (1–5 mm), incomplete resection margins, or perineural growth? |
| 24. Is there proof of a beneficial effect of adding ChT to post-operative RT of the primary tumor and neck? |
| 25. Should sequencing (NGS) of the tumor be considered in cases of recurrent/metastatic (R/M) disease? |
| 26. Should follow-up include the use of MRI for locoregional recurrence and chest CT for lung metastases? |
| 27. What are the most commonly used imaging techniques to assess lesions in the major salivary glands, and why is MRI considered the preferred modality? |
| 28. Why is it important for the pathological report of major SGC to follow the International Collaboration on Cancer Reporting guidelines, and what key information should be included in the report? |
| 29. What is the best imaging tool for locoregional recurrent disease? |
| 30. What is the recommended duration of follow-up for patients with AdCC, and how does the frequency of imaging change over time? |
AI chatbot platforms and interaction protocol
The following AI platforms were evaluated in terms of their advanced analytical features:
ChatGPT-4o: paid version, with the Resonance option applied for nuanced responses.
DeepSeek: Utilized DeepThink-R1 and the search option.
Perplexity: Operated in Pro Mode with Deep Search enabled.
Google Gemini: accessed via the 2.0 Flash model.
To simulate typical real-world interaction with the chatbots, data were collected through each platform's own access method. ChatGPT-4o was accessed via its official web interface (https://chat.openai.com), Gemini via https://gemini.google.com/app/, Perplexity via https://www.perplexity.ai/, and DeepSeek via https://chat.deepseek.com/. All questions were entered manually, and, to minimize contextual bias, a new chat session was initiated for each query, ensuring that no carryover of prior interactions occurred. Responses were recorded in real time within a structured Excel spreadsheet, with timestamps and platform identifiers. Yes/no and open-ended questions were coded as correct or incorrect based on ASCO guideline-derived answers [60]. To determine whether a chatbot's answer to a clinical question was "correct", each response was compared with the guideline from which the question was derived (the American Society of Clinical Oncology [ASCO] 2021 salivary gland malignancy guideline); for example, if a chatbot answered "yes" to a question for which the guideline-derived answer is "no", the response was considered incorrect. In summary, a response consistent with the guideline was categorized as correct, and a contradictory response was categorized as incorrect (Fig. 2).
Fig. 2.
There were two different responses for question number one across all chatbots. According to the ASCO 2021 guidelines, the correct answer is ‘Yes’. Therefore, any response of ‘Yes’ would be considered correct, while a ‘No’ response would be deemed incorrect.
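A minimal sketch of this coding rule for the binary questions follows; only the ‘Yes’ answer to question 1 is taken from Fig. 2, and the remaining key entry and function name are illustrative assumptions.

```python
# Guideline-derived answer key: question number -> expected answer.
# Only question 1 ("yes") is taken from Fig. 2; the other entry is
# illustrative, not the study's full key.
ANSWER_KEY = {1: "yes", 2: "yes"}

def code_binary_response(question_id: int, chatbot_answer: str) -> bool:
    """Return True (correct) iff the chatbot's yes/no answer matches the
    ASCO-2021-derived answer; any mismatch is coded as incorrect."""
    return chatbot_answer.strip().lower() == ANSWER_KEY[question_id]

print(code_binary_response(1, "Yes"))  # True: agrees with the guideline
print(code_binary_response(1, "No"))   # False: coded as incorrect
```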
Statistical analysis
The data were analyzed via R 4.5.0 and SPSS v24. Pearson’s chi-square test was employed to evaluate the associations between chatbot platform and response accuracy, and 95% confidence intervals were computed to gauge the reliability of, and variability in, chatbot performance comparisons. To assess whether having the questions posed by two different researchers influenced the accuracy of responses, a chi-square test was conducted to determine whether the distribution of correct and incorrect answers differed between researchers. The chi-square test was also applied to determine whether there was a statistically significant difference in response accuracy (correct vs. incorrect) across days for each chatbot application. The significance level was set at 0.05.
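As an illustration, the platform-level chi-square test can be reproduced from the correct/incorrect totals reported in Table 3; the snippet below is a minimal sketch in Python (SciPy) rather than the exact R/SPSS procedure used.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Correct/incorrect totals per platform, taken from Table 3.
counts = {
    "ChatGPT-4o": (1310, 490),
    "DeepSeek":   (1565, 235),
    "Gemini":     (1421, 379),
    "Perplexity": (1288, 512),
}
table = np.array(list(counts.values()))  # 4x2 contingency table

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")  # p < 0.001

for name, (correct, incorrect) in counts.items():
    print(f"{name}: {correct / (correct + incorrect):.1%} correct")
```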
Results
A total of 30 clinical questions were posed, each asked three times per day over ten days by two researchers across the four platforms, yielding 240 responses per question and a cumulative total of 7,200 chatbot-generated responses. Upon evaluation, 5,584 responses were deemed correct, while 1,616 were incorrect across the four AI chat platforms assessed.
DeepSeek demonstrated superior performance, achieving the highest number of accurate responses, 1,565 (86.9%, 95% CI [86.80, 87.00]), with exceptional measurement precision (CI width = 0.20). This narrow interval, coupled with the lowest observed standard deviation (SD = 1.53), indicates remarkable consistency across repeated assessments. Gemini ranked second in accuracy with 1,421 correct responses (78.9%, 95% CI [78.76, 79.04]), maintaining high precision (CI width = 0.28) and stability (SD = 2.12). In contrast, ChatGPT-4o and Perplexity exhibited significantly lower accuracy (1,310 [72.8%] and 1,288 [71.6%], respectively) alongside substantially wider CIs (1.14 and 0.72).
ChatGPT-4o’s pronounced variability (SD = 8.68) —reflected in the largest CI width—suggests contextual sensitivity, with performance fluctuating notably across question sets or temporal administrations. The non-overlapping CIs between DeepSeek and all other models (p < 0.001) confirm its statistical dominance. CI widths further reveal a precision hierarchy: DeepSeek > Gemini > Perplexity > ChatGPT-4o, inversely corresponding to observed standard deviations.
These findings highlight critical performance differentials: models with lower mean accuracy also exhibited greater measurement uncertainty, potentially impacting reliability in applied settings. The chi-square test yielded p < 0.001, indicating statistically significant differences in the accuracy distributions of the chatbots. The distribution of the response accuracy mean and standard deviation, along with 95% CIs (Fig. 3A, B, and C), for the AI platforms is shown in Table 2. This result indicates that the observed variations in chatbot performance are unlikely to be due to random chance, confirming the presence of significant differences in accuracy across the chatbots.
Fig. 3.
A and B Mean accuracy of chatbot responses; C 95% CIs
Table 2.
Mean accuracy, standard deviation, p-value, and confidence intervals for distribution accuracy among chatbot applications
| Chatbot | Mean Accuracy (%) | Standard Deviation | 95% CI | CI Width | P value |
|---|---|---|---|---|---|
| Gemini | 78.9 | 2.12 | (78.76, 79.04) | 0.28 | <0.001* |
| ChatGPT-4o | 72.8 | 8.68 | (72.23, 73.37) | 1.14 | |
| DeepSeek | 86.9 | 1.53 | (86.80, 87.00) | 0.20 | |
| Perplexity | 71.6 | 5.46 | (71.24, 71.96) | 0.72 | |
*Pearson Chi-Square test
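For context on the interval estimates above, a 95% confidence interval for a platform's overall accuracy can also be derived directly from its response counts. The sketch below uses the Wilson score construction, named explicitly here because the exact CI procedure is not specified above; being based on individual responses rather than repeated accuracy measurements, it yields wider intervals than those in Table 2.

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% CI for a binomial proportion, in percent."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return 100 * (center - half), 100 * (center + half)

lo, hi = wilson_ci(1565, 1800)  # DeepSeek's counts from Table 3
print(f"DeepSeek 95% CI: [{lo:.1f}%, {hi:.1f}%]")  # ~[85.3%, 88.4%]
```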
The results of the chi-square test demonstrated significant differences in the performance of the chatbot applications over time (Table 3). ChatGPT-4o exhibited statistically significant variation in response accuracy across days (p = 0.001), indicating that its performance was notably influenced by the time factor. Similarly, Perplexity showed a significant effect of day on response accuracy (p = 0.005), highlighting a clear change in performance over time. In contrast, DeepSeek (p = 0.949) and Gemini (p = 0.885) showed no significant differences in accuracy across days, suggesting that these applications maintained stable performance regardless of the day. Daily distribution trends for each chatbot's correct responses are illustrated in Fig. 4, emphasizing the dynamic variability in performance and stability and highlighting the overall superiority of DeepSeek and Gemini alongside the relative instability of ChatGPT-4o and Perplexity. Figure 2 shows the different responses to the same question among chatbots. The chi-square test also revealed no statistically significant effect of the data entry method on response accuracy (p = 0.955; Table 4), indicating that having the questions asked by different researchers across chatbot platforms did not produce any meaningful difference in the accuracy of the answers.
Table 3.
Pearson Chi-Square test to analyze the distribution accuracy of chatbots over 10 days
| Days | Total | p value* | ||||||||||||
| 1st Day | 2nd Day | 3rd Day | 4th Day | 5th Day | 6th Day | 7th Day | 8th Day | 9th Day | 10th Day | |||||
| ChatGPT-4o | Correct | N | 128 | 136 | 145 | 147 | 141 | 133 | 131 | 129 | 129 | 91 | 1310 | 0.001 |
| % | 9.8% | 10.4% | 11.1% | 11.2% | 10.8% | 10.2% | 10.0% | 9.8% | 9.8% | 6.9% | 100.0% | |||
| Incorrect | N | 52 | 44 | 35 | 33 | 39 | 47 | 49 | 51 | 51 | 89 | 490 | ||
| % | 10.6% | 9.0% | 7.1% | 6.7% | 8.0% | 9.6% | 10.0% | 10.4% | 10.4% | 18.2% | 100.0% | |||
| Total | N | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 1800 | ||
| % | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 100.0% | |||
| DeepSeek | Correct | N | 155 | 154 | 155 | 159 | 152 | 161 | 157 | 159 | 158 | 155 | 1565 | 0.949 |
| % | 9.9% | 9.8% | 9.9% | 10.2% | 9.7% | 10.3% | 10.0% | 10.2% | 10.1% | 9.9% | 100.0% | |||
| Incorrect | N | 25 | 26 | 25 | 21 | 28 | 19 | 23 | 21 | 22 | 25 | 235 | ||
| % | 10.6% | 11.1% | 10.6% | 8.9% | 11.9% | 8.1% | 9.8% | 8.9% | 9.4% | 10.6% | 100.0% | |||
| Total | N | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 1800 | ||
| % | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 100.0% | |
| Gemini | Correct | N | 141 | 145 | 141 | 142 | 141 | 146 | 141 | 149 | 140 | 135 | 1421 | 0.885 |
| % | 9.9% | 10.2% | 9.9% | 10.0% | 9.9% | 10.3% | 9.9% | 10.5% | 9.9% | 9.5% | 100.0% | |||
| Incorrect | N | 39 | 35 | 39 | 38 | 39 | 34 | 39 | 31 | 40 | 45 | 379 | ||
| % | 10.3% | 9.2% | 10.3% | 10.0% | 10.3% | 9.0% | 10.3% | 8.2% | 10.6% | 11.9% | 100.0% | |||
| Total | N | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 1800 | ||
| % | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 100.0% | |||
| Perplexity | Correct | N | 124 | 133 | 135 | 129 | 132 | 124 | 131 | 144 | 106 | 130 | 1288 | 0.005 |
| % | 9.6% | 10.3% | 10.5% | 10.0% | 10.2% | 9.6% | 10.2% | 11.2% | 8.2% | 10.1% | 100.0% | |||
| Incorrect | N | 56 | 47 | 45 | 51 | 48 | 56 | 49 | 36 | 74 | 50 | 512 | |
| % | 10.9% | 9.2% | 8.8% | 10.0% | 9.4% | 10.9% | 9.6% | 7.0% | 14.5% | 9.8% | 100.0% | |||
| Total | N | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 1800 | ||
| % | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 100.0% | |||
*Pearson Chi-Square test
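The day-to-day stability tests can likewise be reproduced from the per-day counts in Table 3. The sketch below runs the 2 × 10 chi-square for ChatGPT-4o under the same assumptions as the platform-level sketch; the printed p-value may differ from the tabulated 0.001 depending on the statistical package and rounding.

```python
import numpy as np
from scipy.stats import chi2_contingency

# ChatGPT-4o correct/incorrect responses for days 1-10 (from Table 3).
correct   = [128, 136, 145, 147, 141, 133, 131, 129, 129, 91]
incorrect = [ 52,  44,  35,  33,  39,  47,  49,  51,  51, 89]

chi2, p, dof, _ = chi2_contingency(np.array([correct, incorrect]))
# A significant day effect, consistent with Table 3 (reported p = 0.001).
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
```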
Fig. 4.
Distribution of the accuracy rate of correct responses by chatbot platform and days
Table 4.
Response distribution based on the researcher
| Correct | Incorrect | |||||
|---|---|---|---|---|---|---|
| N | % | N | % | P value | ||
| Researcher | 1st researcher | 2791 | 50.0% | 809 | 50.1% | 0.955* |
| 2nd researcher | 2793 | 50.0% | 807 | 49.9% | |
| Total | 5584 | 100.0% | 1616 | 100.0% | ||
*Pearson Chi-Square test
Discussion
The integration of AI technologies is actively transforming clinical workflows. They automate administrative tasks, efficiently analyze patient data, and improve diagnostic accuracy, ultimately increasing treatment speed and precision [63, 64]. These technologies serve as powerful collaborative tools, designed to support physicians, reduce inefficiencies, and enhance patient outcomes, rather than replace human expertise. Indeed, human knowledge remains paramount [65].
AI systems cannot replicate the nuanced evaluation or deep clinical judgment of medical professionals. Their optimal role is as part of a comprehensive support system that includes ongoing human interaction and rigorous scientific review. Consequently, it is crucial for scientists and researchers to collaborate closely with developers and users to improve AI applications in healthcare, ensuring their responsible and safe operation. Furthermore, scientific organizations, such as medical and dental associations, must critically assess the information provided by chatbots, especially in areas directly impacting people’s health and well-being. In a world of increasing AI usage, verifying the accuracy and reliability of information from these systems, particularly advice affecting patients’ lives, is essential [7, 66–68].
The inherent complexity of medical decision-making necessitates multimodal reasoning, positioning multimodal LLMs (M-LLMs) as potentially transformative tools. Despite their promise, M-LLMs have seen limited adoption in medical and dental research, signaling a need for deeper exploration to overcome existing technical, ethical, and practical challenges [69, 70]. While M-LLMs show significant potential for integrating diverse data types—such as imaging, genomics, and clinical notes—gaps persist in their clinical validation, domain-specific adaptation, and interoperability with established healthcare workflows [71, 72]. In this study, we investigated a range of advanced multimodal language models, specifically Gemini, ChatGPT-4o, DeepSeek, and Perplexity, to explore clinical procedures related to salivary gland treatments. We selected these platforms for their demonstrated ability to manage complex clinical scenarios across various specialties. Gemini, ChatGPT-4o, and Perplexity, with their robust multimodal frameworks, were chosen in particular for their adaptability in handling diverse and multidisciplinary medical queries, with the aim of obtaining accurate and comprehensive responses [1, 13, 20]. Similarly, DeepSeek offers unique features that complement this study’s focus, enabling a nuanced understanding of the clinical challenges involved in salivary gland cancer procedures [10]. These AI-powered platforms provided a dynamic approach to the complexities of this area of clinical practice, and they were chosen for this study because all four have become widespread and accessible to users. In line with previous studies, we used the paid version of ChatGPT-4o, which consistently outperforms the free ChatGPT-3.5 in accuracy and repeatability over time, making it more reliable for answering questions [73–75]. Additionally, our rationale for using Perplexity’s deep search feature was not only to assess the precision of its answers concerning salivary gland cancer procedures, but also to examine whether medical professionals can find the information they need more rapidly by asking AI chatbots to conduct thorough deep searches across numerous references [76].
Despite these advancements, recent analyses highlight AI’s still-emerging role in maxillofacial diagnostics and planning, with clinical integration lagging behind other specialties and necessitating further research [77, 78]. The most popular LLMs in clinical applications—ChatGPT, Perplexity, and Google Bard (now Gemini)—demonstrate variable accuracy and performance, directly influencing their clinical utility [79]. In this context, DeepSeek-R1 holds significant promise for transforming healthcare by potentially alleviating the escalating administrative and cognitive demands prevalent in contemporary clinical practice [10, 80].
The primary objective of this study was to assess the accuracy and consistency of responses provided by these four AI systems to inquiries regarding salivary gland malignancy, a crucial area in maxillofacial surgery. It is widely acknowledged that for AI to be safely integrated into clinical decision-making, an accuracy level of at least 90% is often considered acceptable to ensure patient safety and therapeutic efficacy [1, 62, 81]. In our study, chatbot responses were categorized as “correct” or “incorrect” based on the ASCO (2021) salivary gland malignancy guideline [60]. Considering the probabilistic nature of LLM outputs, our evaluation revealed that DeepSeek demonstrated the highest accuracy among the AI chatbot platforms, followed by Gemini and ChatGPT-4o, while Perplexity showed the lowest percentage of correct responses. Crucially, even DeepSeek’s promising performance did not reach the 90% threshold recommended for clinical applications to ensure patient safety and therapeutic efficacy [1, 53, 72]. This outcome led to the partial rejection of our study’s hypothesis, which suggested that AI chatbots could assist in clinical applications with a high level of accuracy, as no chatbot achieved the desired 90% accuracy. DeepSeek’s observed accuracy aligns with percentages reported in similar studies by Mondillo et al. [82], Mikhail et al. [83], and Hussain et al. [84]. Furthermore, ChatGPT’s accuracy rate was consistent with findings from a study that assessed its reliability in neurology, which reported a mean success rate of 71.3% relative to neurologists in diagnostic accuracy and decision-making [85]. This accuracy rate also aligned with similar studies that evaluated ChatGPT-4o [86–89].
In comparing our findings to the existing literature, a study by Sadeq et al. [90], which assessed publicly available LLMs using simulated UK medical board exam questions, found that ChatGPT-4 scored 78.2% and Perplexity scored 56.1%. These results resonate with our own, indicating ChatGPT’s superior accuracy compared with Perplexity. Another comparative study involved five AI systems (DeepSeek, Gemini, Perplexity, ChatGPT, and Copilot) and three experienced hand surgeons, evaluating decision-making based on twenty-two standardized clinical cues for Dupuytren’s disease. The reported accuracies were ChatGPT (81.8%), Gemini (86.4%), Perplexity (77.3%), DeepSeek (63.6%), and Copilot (40.9%). While direct numerical comparisons are difficult due to differing methodologies—their study used twenty-two standardized clinical prompts for Dupuytren’s disease management, whereas we employed four open-ended and twenty-six dichotomous questions—our study yielded accuracies of 72.8% (ChatGPT), 78.9% (Gemini), 71.6% (Perplexity), and 86.9% (DeepSeek), respectively. Additionally, a cross-sectional in silico study by Mahmoud et al. [53], which evaluated the accuracy of four AI chatbots on 714 OMS board exam questions, reported accuracies of 83.69% for ChatGPT and 66.85% for Gemini; the authors suggested that the uneven success across different areas shows that these LLMs need repeated improvement and testing to become more useful and accurate in the OMS field. In comparison, our results in maxillofacial surgery showed a higher accuracy level for DeepSeek, although still below the ideal threshold, even if DeepSeek presented promising results. These findings indicate that AI chatbots have still not achieved the 90% accuracy threshold for clinical applications and clinical decision-making, and we conclude that AI chatbot platforms need further enhancement. The results of this study also showed substantial variance in chatbot performance over time.
ChatGPT and Perplexity both exhibited significant variations in answer accuracy across different days. These findings corroborate reports by Srivastava et al. (2025) [91] and Ekmekci et al. (2025) [62], which also observed significant variation in AI chatbot responses over time.
Our study’s findings also suggest that the individual researchers posing the questions did not significantly influence response accuracy. This indicates that the phrasing or presentation of questions by chatbot users does not introduce biases or errors that substantially impact the accuracy of the responses. These results align with similar studies [1, 7, 62], which have shown that query entry, regardless of the person or account used, typically does not affect the outcome of chatbot responses.
AI chatbots are capable of processing various inquiry styles, including multiple-choice [92], open-ended [93], and yes/no [62] formats. In our study, we deliberately avoided relying solely on yes/no questions, recognizing that such a format inadequately captures the inherent complexity of clinical practice, which often requires weighing a variety of options and considerations in decision-making [62]. A primary limitation of this study, in contrast to some previous research [1, 62, 94], is that our questions and chatbot response evaluations were not developed and assessed exclusively according to guidelines posted by international scientific associations to ensure scientific accuracy. Furthermore, current AI chatbot platforms notably lack specialized training in salivary gland cancer and maxillofacial surgery, which likely compromised the accuracy of their responses in our specific domain. Another limitation is that the 10-day assessment period may have been too brief to fully capture the potential evolution or long-term consistency of the chatbots’ performance. Collectively, this indicates a significant gap between the current performance of AI models and the level of accuracy required for critical decision-making in clinical environments. Additionally, the observed variation in response accuracy among the different AI platforms underscores the crucial need for more domain-specific training and rigorous validation of these models to ensure their consistent reliability across diverse medical specialties.
Conclusions
Despite the encouraging outcomes of DeepSeek in assisting clinicians, it has not yet reached the level of accuracy required for independent clinical use in salivary gland cancer. Nevertheless, artificial intelligence chatbots have demonstrated significant potential in supporting clinicians in the management of this disease, showing promising levels of accuracy. Ongoing research may lead to the development of customized AI systems with enhanced precision, providing valuable support to healthcare professionals in the field of oral cancer. However, further advancements in AI capabilities and thorough clinical validation are essential to ensure both patient safety and effectiveness in real-world clinical practice.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (China, 81600818), Foundation of Liaoning Province Education Administration (Liaoning, LJKZ0855). The authors deny any conflicts of interest related to this study. Finally, I would like to express my deep gratitude to Gang Chen, my supervisor, for his continuous support, encouragement, and belief in me; it has meant a lot to me. His patience and encouragement were crucial in guiding me through the writing of this article.
Clinical Trial Number
Not applicable.
Abbreviations
- AI
Artificial Intelligence
- SGCs
Salivary Gland Cancers
- M-LLMs
Multimodal Large Language Models
- ChatGPT
Chat Generative Pretrained Transformer
- FDG-MRI
Fluorodeoxyglucose – Magnetic Resonance Imaging
- ANN
Artificial Neural Networks
- CNNs
Convolutional Neural Networks
- OPMDs
Oral Potentially Malignant Disorders
- SGTs
Salivary Gland Tumors
- FNA
Fine-Needle Aspiration
- ASCO
The American Society of Clinical Oncology
- NVII
Cranial Nerve VII (the facial nerve)
- IHC
Immunohistochemistry
- NOS
Not Otherwise Specified
- NTRK
Neurotrophic Tyrosine Receptor Kinase
- NGS
Next-Generation Sequencing
- AcCC
Acinic Cell Carcinoma
- pTNM
Pathological Tumor, Node, Metastasis
- UICC
Union for International Cancer Control
- RT
Radiotherapy
- T3-T4
Tumor size categories in the TNM staging system
- ChT
Chemotherapy
- R/M
Recurrent/Metastatic
- MRI
Magnetic Resonance Imaging
- CT
Computed Tomography
- AdCC
Adenoid Cystic Carcinoma
- END
Elective Neck Dissection
Authors’ contributions
A.B.; writing—original draft preparation and performed the analysis, A.S.; second researcher queried the questions and coded them in an Excel sheet, visualization, A.A.; designed the abstract chart, E.F. and N.W.; review statistical analysis, A.W. and O.T. developed questions, G.C.; review, editing, and supervision. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (China, 81600818) and, Foundation of Liaoning Province Education Administration (Liaoning, LJKZ0855).
Data availability
The data underlying this article will be shared upon reasonable request to the corresponding author.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Ozden I, Gokyar M, Ozden ME, Sazak Ovecoglu H. Assessment of artificial intelligence applications in responding to dental trauma. Dent Traumatol. 2024;40(6):722–9. 10.1111/edt.12965. [DOI] [PubMed] [Google Scholar]
- 2.Suárez A, Díaz-Flores García V, Algar J, Sánchez MG, de Pedro ML, Freire Y. Unveiling the ChatGPT phenomenon: Evaluating the consistency and accuracy of endodontic question answers. Int Endod J 2024;57(3):371–372. 10.1111/iej.13998. [DOI] [PubMed]
- 3.Rodrigues JA, Krois J, Schwendicke F. Demystifying artificial intelligence and deep learning in dentistry. Braz Oral Res 2021;35:e094. 10.1590/1807-3107bor-2021.vol35.0094. [DOI] [PubMed]
- 4.Kim TT, Makutonin M, Sirous R, Javan R. Optimizing large language models in radiology and mitigating pitfalls: prompt engineering and fine-tuning. Radiographics. 2025;45(4):e240073. 10.1148/rg.240073. [DOI] [PubMed]
- 5.Eggmann F, Weiger R, Zitzmann NU, Blatz MB. Implications of large language models such as ChatGPT for dental medicine. J Esthetic Restor Dentistry. 2023;35(7):1098–102. 10.1111/jerd.13046. [DOI] [PubMed]
- 6.Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JPA. Large language models for science and medicine. Eur J Clin Invest. 2024;54(6):e14183. 10.1111/eci.14183. [DOI] [PubMed]
- 7.Mohammad-Rahimi H, Ourang SA, Pourhoseingholi MA, Dianat O, Dummer PMH, Nosrat A. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endod J 2024;57(3):305–314. 10.1111/iej.14014. [DOI] [PubMed]
- 8.Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J Med Internet Res 2023;25:e51580. 10.2196/51580. [DOI] [PMC free article] [PubMed]
- 9.Engelmann J, Fischer C, Nkenke E. Quality assessment of patient information on orthognathic surgery on the internet. J Craniomaxillofac Surg 2020;48(7):661–665. 10.1016/j.jcms.2020.05.004. [DOI] [PMC free article] [PubMed]
- 10.Temsah A, Alhasan K, Altamimi I, Jamal A, Al-Eyadhy A, Malki KH, Temsah MH. DeepSeek in healthcare: revealing opportunities and steering challenges of a new Open-Source artificial intelligence frontier. Cureus. 2025;17(2):e79221. 10.7759/cureus.79221. [DOI] [PMC free article] [PubMed]
- 11.Kayaalp ME, Prill R, Sezgin EA, Cong T, Królikowska A, Hirschmann MT. DeepSeek versus ChatGPT: Multimodal artificial intelligence revolutionizing scientific discovery. From language editing to autonomous content generation-Redefining innovation in research and practice. Knee Surg Sports Traumatol Arthrosc 2025. 10.1002/ksa.12628. [DOI] [PubMed]
- 12.Gravina AG, Pellegrino R, Palladino G, Imperio G, Ventura A, Federico A. Charting new AI education in gastroenterology: Cross-sectional evaluation of ChatGPT and perplexity AI in medical residency exam. Dig Liver Dis 2024;56(8):1304–1311. 10.1016/j.dld.2024.02.019. [DOI] [PubMed]
- 13.Tokgöz Kaplan T, Cankar M. Evidence-based potential of generative artificial intelligence large language models on dental avulsion: ChatGPT versus Gemini. Dent Traumatol 2024. 10.1111/edt.12999. [DOI] [PubMed]
- 14.Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv 2024. 10.1101/2024.04.26.24306390. [DOI] [PMC free article] [PubMed]
- 15.Roos J, Martin R, Kaczmarczyk R. Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study. JMIR Form Res 2024;8:e57592. 10.2196/57592. [DOI] [PMC free article] [PubMed]
- 16.OpenAI. (n.d.). OpenAI O3 and O4-mini Usage Limits on ChatGPT and the API. [https://help.openai.com/en/articles/9824962-openai-o3-and-o4-mini-usage-limits-on-chatgpt-and-the-api].
- 17.Ömür Arça D, Erdemir İ, Kara F, Shermatov N, Odacioğlu M, İbişoğlu E, Hanci FB, Sağiroğlu G, Hanci V. Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine (Baltimore) 2024;103(22):e38352. 10.1097/md.0000000000038352. [DOI] [PMC free article] [PubMed]
- 18.Perplexity. (n.d.). Perplexity Terms of Service. [https://www.perplexity.ai/hub/legal/terms-of-service].
- 19.Perplexity. (n.d.). Rate Limits and Usage Tiers. [https://docs.perplexity.ai/guides/usage-tiers].
- 20.Ozduran E, Hancı V, Erkin Y, Özbek İC, Abdulkerimov V. Assessing the readability, quality and reliability of responses produced by ChatGPT, Gemini, and Perplexity regarding most frequently asked keywords about low back pain. PeerJ 2025, 13:e18847. 10.7717/peerj.18847. [DOI] [PMC free article] [PubMed]
- 21.Google Help. Can you provide detailed information on how data is processed and stored when using Gemini Advanced. 2025. [https://support.google.com/gemini/thread/327172086/can-you-provide-detailed-information-on-how-data-is-processed-and-stored-when-using-gemini-advanced?hl=en].
- 22.Google AI. for Developers. (n.d.). Gemini API Additional Terms of Service [https://ai.google.dev/gemini-api/terms].
- 23.Reflections on DeepSeek’s breakthrough. Natl Sci Rev. 2025;12(3):nwaf044. 10.1093/nsr/nwaf044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Peng Y, Malin BA, Rousseau JF, Wang Y, Xu Z, Xu X, Weng C, Bian J. From GPT to DeepSeek: Significant gaps remain in realizing AI in healthcare. J Biomed Inform 2025;163:104791. 10.1016/j.jbi.2025.104791. [DOI] [PMC free article] [PubMed]
- 25.DeepSeek. (n.d.). DeepSeek Terms of Use. [https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html].
- 26.DeepSeek API. Docs. (n.d.). FAQ. [https://api-docs.deepseek.com/faq].
- 27.Yang S, Zhu F, Ling X, Liu Q, Zhao P. Intelligent Health Care: Applications of Deep Learning in Computational Medicine. Front Genet 2021;12:607471. 10.3389/fgene.2021.607471. [DOI] [PMC free article] [PubMed]
- 28.Bajwa J, Munir U, Nori A, Williams B. Artificial intelligence in healthcare: transforming the practice of medicine. Future Healthc J. 2021;8(2):e188-e194. 10.7861/fhj.2021-0095. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Parekh AE, Shaikh OA, Simran, Manan S, Hasibuzzaman MA. Artificial intelligence (AI) in personalized medicine: AI-generated personalized therapy regimens based on genetic and medical history: short communication. Ann Med Surg (Lond) 2023;85(11):5831–5833. 10.1097/ms9.0000000000001320. [DOI] [PMC free article] [PubMed]
- 30.Ouanes K, Farhah N. Effectiveness of Artificial Intelligence (AI) in Clinical Decision Support Systems and Care Delivery. J Med Syst 2024;48(1):74. 10.1007/s10916-024-02098-4. [DOI] [PubMed]
- 31.Elhaddad M, Hamam S. AI-Driven clinical decision support systems: an ongoing pursuit of potential. Cureus. 2024;16(4):e57728. 10.7759/cureus.57728. [DOI] [PMC free article] [PubMed]
- 32.Rigamonti L, Estel K, Gehlen T, Wolfarth B, Lawrence JB, Back DA. Use of artificial intelligence in sports medicine: a report of 5 fictional cases. BMC Sports Sci Med Rehabil 2021;13(1):13. 10.1186/s13102-021-00243-x. [DOI] [PMC free article] [PubMed]
- 33.Solaiman B, Malik A. Regulating algorithmic care in the European union: evolving doctor-patient models through the artificial intelligence act (AI-Act) and the liability directives. Med Law Rev 2025;33(1). 10.1093/medlaw/fwae033. [DOI] [PMC free article] [PubMed]
- 34.Tozzo P, Angiola F, Gabbin A, Politi C, Caenazzo L. The difficult role of artificial intelligence in medical liability: to err is not only human. Clin Ter. 2021;172(6):527–8. 10.7417/ct.2021.2372. [DOI] [PubMed] [Google Scholar]
- 35.Çi Ti RM. ChatGPT and oral cancer: a study on informational reliability. BMC Oral Health 2025;25(1):86. 10.1186/s12903-025-05479-4. [DOI] [PMC free article] [PubMed]
- 36.Lotter W, Hassett MJ, Schultz N, Kehl KL, Van Allen EM, Cerami E. Artificial Intelligence in Oncology: Current Landscape, Challenges, and Future Directions. Cancer Discov 2024;14(5):711–726. 10.1158/2159-8290.Cd-23-1199. [DOI] [PMC free article] [PubMed]
- 37.Riaz IB, Khan MA, Haddad TC. Potential application of artificial intelligence in cancer therapy. Curr Opin Oncol. 2024;36(5):437–48. 10.1097/cco.0000000000001068. [DOI] [PubMed] [Google Scholar]
- 38.Luchini C, Pea A, Scarpa A. Artificial intelligence in oncology: current applications and future perspectives. Br J Cancer 2022;126(1):4–9. 10.1038/s41416-021-01633-1. [DOI] [PMC free article] [PubMed]
- 39.Alum EU. AI-driven biomarker discovery: enhancing precision in cancer diagnosis and prognosis. Discov Oncol 2025;16(1):313. 10.1007/s12672-025-02064-7. [DOI] [PMC free article] [PubMed]
- 40.Kok END, Eppenga R, Kuhlmann KFD, Groen HC, van Veen R, van Dieren JM, de Wijkerslooth TR, van Leerdam M, Lambregts DMJ, Heerink WJ et al. Accurate surgical navigation with real-time tumor tracking in cancer surgery. NPJ Precis Oncol 2020;4:8. 10.1038/s41698-020-0115-0. [DOI] [PMC free article] [PubMed]
- 41.Sebastian AM, Peter D. Artificial intelligence in cancer research: trends, challenges and future directions. Life (Basel) 2022;12(12). 10.3390/life12121991. [DOI] [PMC free article] [PubMed]
- 42.Al-Musawi MSA, Al-Alwany SG, Uinarni AA, Rasulova H, Rodrigues I, Alkhafaji P, Alshanberi AT, Alawadi AM, Abbas AH. Artificial intelligence in cancer diagnosis: Opportunities and challenges. Pathol Res Pract 2024;253:154996. 10.1016/j.prp.2023.154996. [DOI] [PubMed]
- 43.Alam MS, Siddiqui SA, Perween R. Epidemiological profile of head and neck cancer patients in Western Uttar Pradesh and analysis of distributions of risk factors in relation to site of tumor. J Cancer Res Ther 2017;13(3):430–435. 10.4103/0973-1482.180687. [DOI] [PubMed]
- 44.Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71(3):209–249. 10.3322/caac.21660. [DOI] [PubMed]
- 45.Sahoo RK, Sahoo KC, Dash GC, Kumar G, Baliarsingh SK, Panda B, Pati S. Diagnostic performance of artificial intelligence in detecting oral potentially malignant disorders and oral cancer using medical diagnostic imaging: a systematic review and meta-analysis. Front Oral Health 2024;5:1494867. 10.3389/froh.2024.1494867. [DOI] [PMC free article] [PubMed]
- 46.Al-Rawi N, Sultan A, Rajai B, Shuaeeb H, Alnajjar M, Alketbi M, Mohammad Y, Shetty SR, Mashrah MA. The Effectiveness of Artificial Intelligence in Detection of Oral Cancer. Int Dent J. 2022;72(4):436–447. 10.1016/j.identj.2022.03.001. [DOI] [PMC free article] [PubMed]
- 47.Khanagar SB, Alkadi L, Alghilan MA, Kalagi S, Awawdeh M, Bijai LK, Vishwanathaiah S, Aldhebaib A, Singh OG. Application and Performance of Artificial Intelligence (AI) in Oral Cancer Diagnosis and Prediction Using Histopathological Images: A Systematic Review. Biomedicines 2023;11(6). 10.3390/biomedicines11061612. [DOI] [PMC free article] [PubMed]
- 48.Vinay V, Jodalli P, Chavan MS, Buddhikot CS, Luke AM, Ingafou MSH, Reda R, Pawar AM, Testarelli L. Artificial intelligence in oral cancer: A comprehensive scoping review of diagnostic and prognostic applications. Diagnostics (Basel) 2025;15(3). 10.3390/diagnostics15030280. [DOI] [PMC free article] [PubMed]
- 49.García-Pola M, Pons-Fuster E, Suárez-Fernández C, Seoane-Romero J, Romero-Méndez A, López-Jornet P. Role of Artificial Intelligence in the Early Diagnosis of Oral Cancer. A Scoping Review. Cancers (Basel) 2021;13(18). 10.3390/cancers13184600. [DOI] [PMC free article] [PubMed]
- 50.Veeraraghavan VP, Minervini G, Russo D, Cicciù M, Ronsivalle V. Assessing artificial intelligence in oral cancer diagnosis: A systematic review. J Craniofac Surg 2024. 10.1097/scs.0000000000010663. [DOI] [PubMed]
- 51.Salwaji S, Pasupuleti MK, Manyam R, Pasupuleti S, Alapati NS, Birajdar SS, Rameswarapu M. A scoping review of the effects of artificial intelligence on oral cancer treatment outcome and early diagnosis. Front Oral Maxillofacial Med. 2023;7. 10.21037/fomm-23-17. https://fomm.amegroups.org/article/view/81168.
- 52. Guzzo M, Locati LD, Prott FJ, Gatta G, McGurk M, Licitra L. Major and minor salivary gland tumors. Crit Rev Oncol Hematol. 2010;74(2):134–148. 10.1016/j.critrevonc.2009.10.004.
- 53. Alsanie I, Rajab S, Cottom H, Adegun O, Agarwal R, Jay A, Graham L, James J, Barrett AW, van Heerden W, et al. Distribution and frequency of salivary gland tumours: an international multicenter study. Head Neck Pathol. 2022;16(4):1043–1054. 10.1007/s12105-022-01459-0.
- 54. Skálová A, Hyrcza MD, Leivo I. Update from the 5th edition of the World Health Organization classification of head and neck tumors: salivary glands. Head Neck Pathol. 2022;16(1):40–53. 10.1007/s12105-022-01420-1.
- 55. Peravali RK, Bhat HH, Upadya VH, Agarwal A, Naag S. Salivary gland tumors: a diagnostic dilemma! J Maxillofac Oral Surg. 2015;14(Suppl 1):438–442. 10.1007/s12663-014-0665-1.
- 56. Cheng PC, Chiang HK. Diagnosis of salivary gland tumors using transfer learning with fine-tuning and gradual unfreezing. Diagnostics (Basel). 2023;13(21):3333. 10.3390/diagnostics13213333.
- 57. Moore MG, Yueh B, Lin DT, Bradford CR, Smith RV, Khariwala SS. Controversies in the workup and surgical management of parotid neoplasms. Otolaryngol Head Neck Surg. 2021;164(1):27–36. 10.1177/0194599820932512.
- 58. Liu X, Pan Y, Zhang X, Sha Y, Wang S, Li H, Liu J. A deep learning model for classification of parotid neoplasms based on multimodal magnetic resonance image sequences. Laryngoscope. 2023;133(2):327–335. 10.1002/lary.30154.
- 59. van Herpen C, Vander Poorten V, Skalova A, Terhaard C, Maroldi R, van Engen A, Baujat B, Locati LD, Jensen AD, Smeele L, et al. Salivary gland cancer: ESMO-European Reference Network on Rare Adult Solid Cancers (EURACAN) Clinical Practice Guideline for diagnosis, treatment and follow-up. ESMO Open. 2022;7(6):100602. 10.1016/j.esmoop.2022.100602.
- 60. Geiger JL, Ismaila N, Beadle B, Caudell JJ, Chau N, Deschler D, Glastonbury C, Kaufman M, Lamarre E, Lau HY, et al. Management of salivary gland malignancy: ASCO guideline. J Clin Oncol. 2021;39(17):1909–1941. 10.1200/jco.21.00449.
- 61. Roldan-Vasquez E, Mitri S, Bhasin S, Bharani T, Capasso K, Haslinger M, Sharma R, James TA. Reliability of artificial intelligence chatbot responses to frequently asked questions in breast surgical oncology. J Surg Oncol. 2024;130(2):188–203. 10.1002/jso.27715.
- 62. Ekmekci E, Durmazpinar PM. Evaluation of different artificial intelligence applications in responding to regenerative endodontic procedures. BMC Oral Health. 2025;25(1):53. 10.1186/s12903-025-05424-5.
- 63. Bekbolatova M, Mayer J, Ong CW, Toma M. Transformative potential of AI in healthcare: definitions, applications, and navigating the ethical landscape and public perspectives. Healthcare (Basel). 2024;12(2):125. 10.3390/healthcare12020125.
- 64. Al Kuwaiti A, Nazer K, Al-Reedy A, Al-Shehri S, Al-Muhanna A, Subbarayalu AV, Al Muhanna D, Al-Muhanna FA. A review of the role of artificial intelligence in healthcare. J Pers Med. 2023;13(6):951. 10.3390/jpm13060951.
- 65. Sezgin E. Artificial intelligence in healthcare: complementing, not replacing, doctors and healthcare providers. Digit Health. 2023;9:20552076231186520. 10.1177/20552076231186520.
- 66. Xu L, Sanders L, Li K, Chow JCL. Chatbot for health care and oncology applications using artificial intelligence and machine learning: systematic review. JMIR Cancer. 2021;7(4):e27850. 10.2196/27850.
- 67. Singh K, Prabhu A, Kaur N. The impact and role of artificial intelligence (AI) in healthcare: systematic review. Curr Top Med Chem. 2025. 10.2174/0115680266339394250225112747.
- 68. Clerici CA, Chopard S, Levi G. [Rare disease in the age of artificial intelligence]. Recenti Prog Med. 2024;115(2):67–75. 10.1701/4197.41839.
- 69. AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R, Sheikh J. Multimodal large language models in health care: applications, challenges, and future outlook. J Med Internet Res. 2024;26:e59505. 10.2196/59505.
- 70. Meskó B. The impact of multimodal large language models on health care's future. J Med Internet Res. 2023;25:e52865. 10.2196/52865.
- 71. Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, Zhang M, Cao C, Wang J, Wang X, et al. The application of large language models in medicine: a scoping review. iScience. 2024;27(5):109713. 10.1016/j.isci.2024.109713.
- 72. Yang X, Li T, Su Q, Liu Y, Kang C, Lyu Y, Zhao L, Nie Y, Pan Y. Application of large language models in disease diagnosis and treatment. Chin Med J (Engl). 2025;138(2):130–142. 10.1097/cm9.0000000000003456.
- 73. Kochanek K, Skarzynski H, Jedrzejczak WW. Accuracy and repeatability of ChatGPT based on a set of multiple-choice questions on objective tests of hearing. Cureus. 2024;16(5):e59857. 10.7759/cureus.59857.
- 74. Haze T, Kawano R, Takase H, Suzuki S, Hirawa N, Tamura K. Influence on the accuracy in ChatGPT: differences in the amount of information per medical field. Int J Med Inform. 2023;180:105283. 10.1016/j.ijmedinf.2023.105283.
- 75. Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, Nadkarni G, Klang E. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492. 10.1038/s41598-023-43436-9.
- 76. Perplexity AI. Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research.
- 77. Sillmann YM, Monteiro J, Eber P, Baggio AMP, Peacock ZS, Guastaldi FPS. Empowering surgeons: will artificial intelligence change oral and maxillofacial surgery? Int J Oral Maxillofac Surg. 2025;54(2):179–190. 10.1016/j.ijom.2024.09.004.
- 78. Rasteau S, Ernenwein D, Savoldelli C, Bouletreau P. Artificial intelligence for oral and maxillo-facial surgery: a narrative review. J Stomatol Oral Maxillofac Surg. 2022;123(3):276–282. 10.1016/j.jormas.2022.01.010.
- 79. Sarikaya M, Ozcan Siki F, Ciftci I. Use of artificial intelligence in vesicoureteral reflux disease: a comparative study of guideline compliance. J Clin Med. 2025;14(7):2378. 10.3390/jcm14072378.
- 80. Wang R, He J, Liang H. Medicine's J.A.R.V.I.S. moment: how DeepSeek-R1 transforms clinical practice. J Thorac Dis. 2025;17(3):1784–1787. 10.21037/jtd-2025b-05.
- 81. Mahmoud R, Shuster A, Kleinman S, Arbel S, Ianculovici C, Peleg O. Evaluating artificial intelligence chatbots in oral and maxillofacial surgery board exams: performance and potential. J Oral Maxillofac Surg. 2025;83(3):382–389. 10.1016/j.joms.2024.11.007.
- 82. Mondillo G, Colosimo S, Perrotta A, Frattolillo V, Masino M. Comparative evaluation of advanced AI reasoning models in pediatric clinical decision support: ChatGPT O1 vs. DeepSeek-R1. medRxiv. 2025:2025.01.27.25321169. 10.1101/2025.01.27.25321169.
- 83. Mikhail D, Farah A, Milad J, Nassrallah W, Mihalache A, Milad D, Antaki F, Balas M, Popovic MM, Feo A, et al. Performance of DeepSeek-R1 in ophthalmology: an evaluation of clinical decision-making and cost-effectiveness. medRxiv. 2025:2025.02.10.25322041. 10.1101/2025.02.10.25322041.
- 84. Hussain ZS, Delsoz M, Elahi M, Jerkins B, Kanner E, Wright C, Munir WM, Soleimani M, Djalilian A, Lao PA, et al. Performance of DeepSeek, Qwen 2.5 MAX, and ChatGPT assisting in diagnosis of corneal eye diseases, glaucoma, and neuro-ophthalmology diseases based on clinical case reports. medRxiv. 2025. 10.1101/2025.03.14.25323836.
- 85. Fonseca Â, Ferreira A, Ribeiro L, Moreira S, Duque C. Embracing the future – is artificial intelligence already better? A comparative study of artificial intelligence performance in diagnostic accuracy and decision-making. Eur J Neurol. 2024;31(4):e16195. 10.1111/ene.16195.
- 86. Chen Z, Chambara N, Wu C, Lo X, Liu SYW, Gunda ST, Han X, Qu J, Chen F, Ying MTC. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine. 2025;87(3):1041–1049. 10.1007/s12020-024-04066-x.
- 87. Yıldırım A, Cicek O, Genç YS. Can AI-based ChatGPT models accurately analyze hand-wrist radiographs? A comparative study. Diagnostics (Basel). 2025;15(12):1513. 10.3390/diagnostics15121513.
- 88. Rokhshad R, Khoury ZH, Mohammad-Rahimi H, Motie P, Price JB, Tavares T, Jessri M, Bavarian R, Sciubba JJ, Sultan AS. Efficacy and empathy of AI chatbots in answering frequently asked questions on oral oncology. Oral Surg Oral Med Oral Pathol Oral Radiol. 2025;139(6):719–728. 10.1016/j.oooo.2024.12.028.
- 89. Şişman A, Acar AH. Artificial intelligence-based chatbot assistance in clinical decision-making for medically complex patients in oral surgery: a comparative study. BMC Oral Health. 2025;25(1):351. 10.1186/s12903-025-05732-w.
- 90. Sadeq MA, Ghorab RMF, Ashry MH, Abozaid AM, Banihani HA, Salem M, Aisheh MTA, Abuzahra S, Mourid MR, Assker MM, et al. AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study. Sci Rep. 2024;14(1):18859. 10.1038/s41598-024-68996-2.
- 91. Srivastava P, Tewari A, Al-Riyami AZ. Artificial intelligence chatbots in transfusion medicine: a cross-sectional study. Vox Sang. 2025. 10.1111/vox.70009.
- 92. Pan A, Musheyev D, Bockelman D, Loeb S, Kabarriti AE. Assessment of artificial intelligence chatbot responses to top searched queries about cancer. JAMA Oncol. 2023;9(10):1437–1440. 10.1001/jamaoncol.2023.2947.
- 93. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589–596. 10.1001/jamainternmed.2023.1838.
- 94. Suárez A, Díaz-Flores García V, Algar J, Gómez Sánchez M, Llorente de Pedro M, Freire Y. Unveiling the ChatGPT phenomenon: evaluating the consistency and accuracy of endodontic question answers. Int Endod J. 2024;57(1):108–113. 10.1111/iej.13985.
Data Availability Statement
The data underlying this article will be shared upon reasonable request to the corresponding author.