BMC Oral Health. 2026 Mar 6;26:662. doi: 10.1186/s12903-026-08060-9

Effect of language difference and time on the accuracy of artificial intelligence chatbot responses to questions about vital pulp therapy

Emine Şimşek, Mine Büker
PMCID: PMC13077940  PMID: 41792697

Abstract

Background

Artificial intelligence–based chatbots are increasingly being used as supportive tools in healthcare for accessing and interpreting medical information. However, variations in language and temporal updates may influence the accuracy and consistency of the information they provide. This study aimed to evaluate the effects of language difference (Turkish and English) and time (10 days) on the accuracy of responses provided by six different artificial intelligence (AI) chatbots (ChatGPT, ChatGPT-4o, ChatGPT-5, Gemini, Microsoft Copilot, and Perplexity) to questions related to vital pulp therapy.

Methods

Twenty questions were prepared in accordance with the clinical guidelines of the American Association of Endodontists (AAE) and were presented to each model in both Turkish and English three times a day (morning, noon, and evening) over a 10-day period. A total of 7,200 responses were collected. The responses were classified as correct or incorrect by two independent researchers. Group accuracy rates were analyzed using descriptive statistics, and Chi-square tests were used to compare the proportions of correct and incorrect responses.

Results

Microsoft Copilot achieved the highest accuracy rate (87.3%), while Perplexity showed the lowest (77.2%). When evaluated by language, ChatGPT-4o performed best in Turkish questions (83.8%), whereas Microsoft Copilot demonstrated the highest accuracy in English questions (87.3%). The time variable had no statistically significant effect on accuracy.

Conclusions

These results indicate that AI chatbots can serve as supportive tools in clinical decision-making processes; however, language differences and model-dependent performance variations should be carefully considered.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12903-026-08060-9.

Keywords: Accuracy, Artificial intelligence, Chatbot, Vital pulp therapy

Introduction

Vital pulp therapies (VPT) are modern, biologically driven treatment approaches that aim to preserve the physiological vitality and function of the dental pulp. Maintaining pulp vitality is critically important for sustaining the sensory and protective functions of dentin, preserving the natural defense mechanisms of the dentin–pulp complex, and ensuring the long-term functional retention of the tooth in the oral cavity [1]. While traditional restorative approaches often resort to root canal treatment in cases of pulp exposure due to deep carious lesions, current literature emphasizes that preserving the vitality of the pulp offers greater long-term advantages [2].

Advancements in biocompatible materials, improved clinical diagnostic criteria, and the growing understanding of pulp biology have significantly increased the success rates of vital pulp therapies. Particularly in young and mature permanent teeth, these treatments have emerged as minimally invasive options that preserve tooth structure in cases of reversible pulpitis, irreversible pulpitis, or traumatic pulp exposure [3]. In this context, when VPT are performed with proper case selection and appropriate clinical protocols, they are considered to enhance patient comfort while reducing the need for more invasive treatments in the future, thus offering substantial biological and economic benefits.

Artificial intelligence (AI) has become a key component driving transformation in healthcare by integrating advanced technologies such as big data analytics, machine learning, and natural language processing. With applications ranging from clinical decision support systems to diagnosis and treatment planning, AI contributes to improving diagnostic accuracy, optimizing treatment processes, and enhancing clinical workflow efficiency [4]. In recent years, AI-based chatbots have emerged as digital communication tools that provide rapid feedback in healthcare settings. Supported by natural language processing and large language models (LLMs), modern chatbots are capable of understanding context and generating coherent responses, which has led to their widespread use in healthcare, education, and various other sectors [5–7]. With the digitalization of healthcare services, rapid and accurate access to information has become a critical necessity. Chatbots address this need by providing information without time or location constraints. Although they demonstrate significant potential in dentistry—particularly in diagnosis, treatment planning, and patient education—the risk of “AI hallucinations,” defined as the generation of inaccurate or unsupported information, should not be overlooked. Therefore, AI systems should not function as independent decision-makers but rather as supportive tools under human supervision. Human oversight (gatekeeping) plays a crucial role in verifying the accuracy and clinical applicability of AI-generated information, ensuring the safe and ethical integration of these technologies into clinical practice.

A review of the current literature indicates a growing interest in the use of AI-based chatbot applications in the field of dentistry [8, 9]. Particularly in endodontics, an increasing number of studies have evaluated the performance of chatbots in areas such as root canal treatment procedures, management of dental emergencies, identification of endodontic complications, and patient education [10–12]. However, the majority of these studies have been conducted in a single language, and research examining the effect of language differences on the accuracy of chatbot responses remains extremely limited [13].

Variability in the performance of LLMs across different languages has been associated with factors such as the unequal distribution of training data among languages, as well as differences in linguistic structure, morphological characteristics, and overall linguistic complexity. In particular, agglutinative languages such as Turkish exhibit higher levels of morphological complexity from a natural language processing perspective, which may negatively affect model accuracy [14, 15]. Therefore, evaluating the accuracy of AI chatbots in delivering clinical information across different languages is of importance from both scientific and clinical perspectives. In addition, large language models are not static systems; they are subject to updates over time and may exhibit variations in response consistency depending on system conditions. As a result, the accuracy of responses to identical questions may vary across different time points [10]. Assessing the time-dependent performance of chatbots is therefore essential for understanding their short-term consistency and reliability in clinical decision-support processes.

Only one study in the literature has examined the effect of language differences on the accuracy of chatbot responses. In that study conducted by Büyüközer et al. [13], questions related to regenerative endodontic procedures (REPs) were posed to different AI-based chatbots in both English and Turkish to investigate the impact of language on response accuracy. However, the time variable was not included in their analysis. Accordingly, the aim of the present study was to evaluate the effects of language differences (Turkish and English) and time (10 days) on the accuracy of responses provided by six AI-based chatbots (ChatGPT, ChatGPT-4o, ChatGPT-5, Gemini, Microsoft Copilot, and Perplexity) to questions related to VPT.

The originality of this study lies in its comparative evaluation of six AI-based chatbots in a clinically current and decision-sensitive field such as vital pulp therapy, with language (Turkish–English) and time variables examined simultaneously. The null hypothesis of this study was that language differences and the time variable would not have a statistically significant effect on the accuracy of chatbot responses to VPT-related questions.

Materials and methods

Ethical approval and sample determination

The study did not involve human or animal participants and was conducted using only publicly available data; therefore, ethical approval was not required in accordance with the guidelines of the Mersin University Health Sciences Research Ethics Committee. According to the power analysis conducted using G*Power 3.1 (Heinrich Heine University, Düsseldorf, Germany), for a chi-square test (Goodness of Fit: Contingency Table analysis) involving six groups, with an alpha level (α) of 0.05, an effect size of 0.23, and a power (1-β) of 0.95, a minimum of 360 valid responses was determined to be required [13]. The selected effect size was based on a moderate effect assumption commonly adopted in similar studies comparing the performance of AI models.
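
For readers who wish to reproduce this calculation outside G*Power, the following is a minimal sketch of an equivalent computation, assuming Python with statsmodels; it is an illustration, not the tool the authors used, and the exact result may differ slightly depending on degrees-of-freedom conventions.

```python
# Approximate re-computation of the reported power analysis:
# chi-square goodness-of-fit, 6 groups, effect size 0.23,
# alpha = 0.05, power = 0.95 (values taken from the text).
from statsmodels.stats.power import GofChisquarePower

required_n = GofChisquarePower().solve_power(
    effect_size=0.23,  # moderate effect assumed in the study
    n_bins=6,          # six chatbot groups
    alpha=0.05,
    power=0.95,
)
print(round(required_n))  # on the order of the ~360 responses reported
```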

Study design and question preparation

This study was designed as a prospective, cross-sectional, and observational investigation. Questions related to vital pulp therapy were determined following an extensive literature review. A total of 20 closed-ended true/false questions were prepared by two experienced endodontists (E.S. and M.B.) who routinely perform VPT, based on the current guidelines published by the American Association of Endodontists (AAE) [16]. The question set covers contemporary clinical aspects of VPT, including its definition, objectives, advantages, indications and contraindications, treatment types, procedural steps, materials used, success criteria, prognostic factors, and potential complications, and the questions were distributed in a balanced manner across these subheadings. Accordingly, the questions were prepared to evaluate both theoretical knowledge and clinical decision-making ability while reflecting current VPT practices.

All questions used in the study were initially prepared in English and developed in accordance with the AAE clinical guidelines. Subsequently, the statements underwent a linguistic validation process conducted by the two researchers leading the study. First, the primary researcher (E.S.) with academic proficiency in both Turkish and English translated the questions into Turkish. Then, the second researcher (M.B.) performed a back-translation procedure, whereby the Turkish versions were translated back into English and compared with the original text. Any discrepancies in meaning or content were revised until consensus was reached between the two researchers. This process aimed to ensure linguistic and conceptual equivalence between the original and translated versions.

Selection of chatbots and model versions

Six publicly available chatbots were used in this study: ChatGPT, ChatGPT-4o, and ChatGPT-5 (OpenAI), Gemini (Google), Microsoft Copilot (Microsoft), and Perplexity (Perplexity AI).

Data collection process

The questions were presented to each chatbot in two different languages (Turkish and English), three times a day (morning, noon, and evening) over a ten-day period between October 22 and October 31, 2025. This design allowed for the observation of performance fluctuations across different time intervals and for the evaluation of the temporal stability of the model responses. Each question was posed individually in the format: “Is this information correct?”. Every question was asked in a separate session, and each query was conducted in a new chat window to ensure independence between interactions. To eliminate the influence of prior conversations, cache, cookies, and browsing history were cleared before each session. Chatbots were instructed to respond only with “True” or “False” for each statement. Although the chatbots were asked to indicate the correctness of the given statement, all responses included the expression “True” or “False” accompanied by a brief explanation. During the evaluation process, only the model’s final correctness judgment (True/False) was taken into account; the accompanying explanations were not included in the accuracy classification.

As a result, a total of 1,200 valid responses were collected from each chatbot (20 questions × 3 sessions × 10 days × 2 languages). Since the questions were asked in both Turkish and English across six chatbot platforms, a total of 7,200 responses were analyzed. All queries were conducted by two expert dentists (E.S. and M.B.) to maintain consistency. The responses obtained from the chatbots were saved in Microsoft Word 2016 (Microsoft Corporation, Redmond, WA, USA), anonymized, and coded systematically for analysis.
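
The full factorial design behind these totals can be made explicit in code. The snippet below is a hypothetical illustration (the variable names and structure are ours, not the authors' tooling) that verifies the arithmetic of the reported response counts.

```python
# Hypothetical enumeration of the study design; this only checks the
# reported totals, it does not perform the actual chatbot queries.
from itertools import product

chatbots = ["ChatGPT", "ChatGPT-4o", "ChatGPT-5",
            "Gemini", "Microsoft Copilot", "Perplexity"]
languages = ["Turkish", "English"]
sessions = ["morning", "noon", "evening"]
days = range(1, 11)       # 10-day collection period
questions = range(1, 21)  # 20 true/false questions

queries = list(product(chatbots, languages, sessions, days, questions))
assert len(queries) == 7200                   # total responses analyzed
assert len(queries) // len(chatbots) == 1200  # valid responses per chatbot
```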

Evaluation of responses

All chatbot responses were classified as “true” or “false” by two independent researchers based on the diagnostic criteria outlined in the American Association of Endodontists (AAE) Clinical Guidelines [16].

Prior to the main evaluation, a structured calibration session was conducted to standardize the classification criteria between the researchers. As part of this process, a total of 720 responses (20 questions × 2 languages × 6 chatbots × 3 sessions), which were not included in the main study dataset, were randomly selected and assessed as a pilot sample. The researchers independently evaluated these responses, after which the results were compared. Any discrepancies were discussed in light of the AAE guidelines and current literature to clarify interpretative differences. In cases of disagreement, the relevant guidelines and literature were re-examined. In the rare instances where consensus could not be reached, a third expert was consulted to make the final determination. Through this process, the evaluation criteria were finalized before proceeding to the main analysis.

In the primary dataset, all responses were independently evaluated by the same two researchers. To determine inter-rater reliability, Cohen’s kappa coefficient (κ) was calculated. The kappa coefficient is a statistical measure that assesses the level of agreement between raters while accounting for chance agreement. The mean κ value was 0.93 (range: 0.86–0.98), indicating an exceptionally high level of agreement between evaluators. These findings demonstrate that the accuracy assessment of chatbot responses was objective and reproducible.
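
Cohen's kappa can be computed directly from the two raters' parallel label vectors. The following is a minimal sketch assuming Python with scikit-learn; the label vectors are illustrative placeholders, not the study's actual ratings.

```python
# Inter-rater agreement on true/false classifications.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["true", "true", "false", "true", "false", "true"]   # rater A (placeholder)
rater_2 = ["true", "true", "false", "false", "false", "true"]  # rater B (placeholder)
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")  # the study reports a mean kappa of 0.93
```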

Statistical analysis

All statistical analyses were performed using IBM SPSS Statistics version 27 (IBM Corp., Armonk, NY, USA). The Chi-square test was applied to compare the proportions of correct and incorrect responses, with Bonferroni correction for multiple comparisons. Binary logistic regression was used to assess the joint effects of chatbot, language, time of day, and day on the odds of a correct answer. A p-value of less than 0.05 was considered statistically significant.
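
As an open-source analogue of this SPSS workflow, the sketch below runs the same analysis pipeline on simulated data; the column names and the randomly generated outcome are assumptions for illustration only, not the study dataset.

```python
# Illustrative re-implementation of the reported analyses on simulated data
# (scipy/statsmodels stand in for SPSS; the outcome is randomly generated).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 7200
df = pd.DataFrame({
    "chatbot": rng.choice(["ChatGPT", "ChatGPT-4o", "ChatGPT-5",
                           "Gemini", "Copilot", "Perplexity"], n),
    "language": rng.choice(["Turkish", "English"], n),
    "session": rng.choice(["morning", "noon", "evening"], n),
    "day": rng.integers(1, 11, n),
    "correct": rng.binomial(1, 0.8, n),  # simulated ~80% accuracy
})

# Chi-square test on the chatbot-by-correctness contingency table.
# Pairwise follow-ups would then be Bonferroni-adjusted
# (e.g., statsmodels.stats.multitest.multipletests).
chi2, p, dof, _ = chi2_contingency(pd.crosstab(df["chatbot"], df["correct"]))

# Binary logistic regression: chatbot, language, time of day, and day.
fit = smf.logit("correct ~ C(chatbot) + C(language) + C(session) + C(day)",
                data=df).fit(disp=0)
print(np.exp(fit.params))  # Exp(B), i.e., the odds ratios as in Table 5
```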

Results

In this study, a total of 20 questions were presented to six different chatbots in both Turkish and English, three times per day over a 10-day period, resulting in 7,200 total responses. According to the findings, there was a statistically significant difference in the distribution of correct responses among the groups for both Turkish and English answers (p < 0.05). For Turkish questions, ChatGPT-4o demonstrated the highest accuracy rate (83.8%), whereas Gemini showed the lowest accuracy rate (77.3%). For English questions, Microsoft Copilot achieved the highest accuracy rate (87.3%), while Perplexity demonstrated the lowest accuracy rate (77.2%) (Fig. 1; Table 1).

Fig. 1. Comparative analysis of chatbots’ correct response rates in Turkish and English questions

Table 1.

Comparative analysis of chatbots’ response performance in Turkish and English questions

                 ChatGPT         ChatGPT-4o      ChatGPT-5       Gemini           Microsoft Copilot  Perplexity      Total               p (Effect size)
Turkish  True    483 (80.5%) a   503 (83.8%) a   488 (81.3%) a   464 (77.3%) b    474 (79.0%) a,b    465 (77.5%) a   2877 (79.92%) a,b   0.041* (0.057)
         False   117 (19.5%)     97 (16.2%)      112 (18.7%)     136 (22.7%)      126 (21.0%)        135 (22.5%)     723 (20.08%)
English  True    504 (84.0%) a   519 (86.5%) a   506 (84.3%) a   488 (81.3%) b    524 (87.3%) a      463 (77.2%) b   3004 (83.44%) a     0.001* (0.092)
         False   96 (16.0%)      81 (13.5%)      94 (15.7%)      112 (18.7%)      76 (12.7%)         137 (22.8%)     596 (16.56%)
p (Effect size)  0.065           0.112           0.097           0.050* (0.049)   0.001* (0.111)     0.473           0.001* (0.046)

*p < 0.05. Superscript letters (a, b, c) indicate groupings from Bonferroni-corrected multiple comparisons.

When comparing the performance between Turkish and English questions for each chatbot, no statistically significant difference was found for ChatGPT, ChatGPT-4o, ChatGPT-5, and Perplexity (p > 0.05). However, a statistically significant difference was observed in Gemini and Microsoft Copilot (p < 0.05). In both platforms, the accuracy rate for English questions was significantly higher than that for Turkish questions (Fig. 1; Table 1).

Overall, when considering all platforms collectively, a statistically significant difference was found between the accuracy rates of Turkish and English questions (p < 0.05), indicating a generally higher performance in English-language responses.

In this study, responses from six AI chatbots were evaluated for both Turkish and English questions at three different times of the day (morning, noon, and evening) over a 10-day period.

For the Turkish questions, there were no statistically significant differences among the chatbots in terms of the distribution of correct answers at any time of the day (p > 0.05). In contrast, for the English questions, no statistically significant differences were observed among the chatbots in the morning or evening sessions (p > 0.05). However, a significant difference was detected during the noon session (p < 0.05). During this period, Microsoft Copilot demonstrated the highest accuracy rate (89.0%), whereas Perplexity exhibited the lowest accuracy rate (73.5%) (Table 2).

Table 2.

Comparison of the distribution of correct answers among chatbots for Turkish and English questions at different times of the day

                   ChatGPT        ChatGPT-4o     ChatGPT-5      Gemini         Microsoft Copilot  Perplexity     p (Effect size)
Turkish  Evening   162 (81.0%)    166 (83.0%)    158 (79.0%)    153 (76.5%)    156 (78.0%)        155 (77.5%)    0.615
         Noon      161 (80.5%)    170 (85.0%)    167 (83.5%)    156 (78.0%)    160 (80.0%)        155 (77.5%)    0.336
         Morning   160 (80.0%)    167 (83.5%)    163 (81.5%)    155 (77.5%)    158 (79.0%)        155 (77.5%)    0.630
         p         0.969          0.852          0.512          0.936          0.886              1.000
English  Morning   168 (84.0%)    172 (86.0%)    169 (84.5%)    162 (81.0%)    172 (86.0%)        158 (79.0%)    0.325
         Noon      165 (82.5%) a  172 (86.0%) b  171 (85.5%) b  163 (81.5%) a  178 (89.0%) b      147 (73.5%) c  0.001* (0.130)
         Evening   171 (85.5%)    175 (87.5%)    166 (83.0%)    163 (81.5%)    174 (87.0%)        158 (79.0%)    0.143
         p         0.715          0.879          0.787          0.989          0.656              0.318

*p < 0.05. Superscript letters (a, b, c) indicate groupings from Bonferroni-corrected multiple comparisons.

When the responses provided by the chatbots to Turkish and English questions were evaluated within each platform, no statistically significant differences were observed among the morning, noon, and evening measurements for ChatGPT, ChatGPT-4o, ChatGPT-5, Gemini, and Perplexity (p > 0.05).

In contrast, Microsoft Copilot demonstrated a statistically significant difference between Turkish and English responses across all three time periods (morning, noon, and evening) (p < 0.05). In each time frame, the English version consistently achieved higher accuracy rates compared to the Turkish version (Table 3).

Table 3.

Comparison of the distribution of correct answers within each chatbot for Turkish and English questions at different times of the day

                              Morning          Noon             Evening
ChatGPT            Turkish    160 (80.0%)      161 (80.5%)      162 (81.0%)
                   English    168 (84.0%)      165 (82.5%)      171 (85.5%)
                   p          0.181            0.350            0.142
ChatGPT-4o         Turkish    167 (83.5%)      170 (85.0%)      166 (83.0%)
                   English    172 (86.0%)      172 (86.0%)      175 (87.5%)
                   p          0.289            0.444            0.130
ChatGPT-5          Turkish    163 (81.5%)      167 (83.5%)      158 (79.0%)
                   English    169 (84.5%)      171 (85.5%)      166 (83.0%)
                   p          0.253            0.339            0.186
Gemini             Turkish    155 (77.5%)      156 (78.0%)      153 (76.5%)
                   English    162 (81.0%)      163 (81.5%)      163 (81.5%)
                   p          0.230            0.228            0.135
Microsoft Copilot  Turkish    158 (79.0%) a    160 (80.0%) a    156 (78.0%) a
                   English    172 (86.0%) b    178 (89.0%) b    174 (87.0%) b
                   p (ES)     0.043* (0.092)   0.009* (0.124)   0.012* (0.118)
Perplexity         Turkish    155 (77.5%)      155 (77.5%)      155 (77.5%)
                   English    158 (79.0%)      147 (73.5%)      158 (79.0%)
                   p          0.404            0.208            0.404

*p < 0.05. Superscript letters a and b indicate groupings from Bonferroni-corrected multiple comparisons.

When the accuracy rates of the chatbots’ responses to Turkish and English questions were analyzed by days, no statistically significant difference was found across any of the platforms (p > 0.05) (Table 4).

Table 4.

Comparison of the daily accuracy rates of Turkish and English responses of chatbots over a 10-day period

                   ChatGPT      ChatGPT-4o   ChatGPT-5    Gemini       Microsoft Copilot  Perplexity   p
Turkish  Day 1     52 (86.7%)   48 (80.0%)   44 (73.3%)   48 (80.0%)   49 (81.7%)         47 (78.3%)   0.617
         Day 2     48 (80.0%)   51 (85.0%)   48 (80.0%)   46 (76.7%)   46 (76.7%)         46 (76.7%)   0.853
         Day 3     48 (80.0%)   50 (83.3%)   52 (86.7%)   46 (76.7%)   46 (76.7%)         47 (78.3%)   0.695
         Day 4     47 (78.3%)   49 (81.7%)   50 (83.3%)   45 (75.0%)   48 (80.0%)         48 (80.0%)   0.910
         Day 5     49 (81.7%)   49 (81.7%)   51 (85.0%)   45 (75.0%)   45 (75.0%)         47 (78.3%)   0.701
         Day 6     47 (78.3%)   51 (85.0%)   43 (71.7%)   48 (80.0%)   48 (80.0%)         49 (81.7%)   0.607
         Day 7     47 (78.3%)   52 (86.7%)   47 (78.3%)   47 (78.3%)   49 (81.7%)         44 (73.3%)   0.607
         Day 8     48 (80.0%)   51 (85.0%)   52 (86.7%)   46 (76.7%)   47 (78.3%)         46 (76.7%)   0.616
         Day 9     49 (81.7%)   51 (85.0%)   50 (83.3%)   47 (78.3%)   48 (80.0%)         45 (75.0%)   0.779
         Day 10    48 (80.0%)   51 (85.0%)   51 (85.0%)   46 (76.7%)   48 (80.0%)         46 (76.7%)   0.747
         p         0.989        0.995        0.329        0.999        0.996              0.995
English  Day 1     49 (81.7%)   53 (88.3%)   48 (80.0%)   45 (75.0%)   54 (90.0%)         48 (80.0%)   0.249
         Day 2     52 (86.7%)   52 (86.7%)   51 (85.0%)   51 (85.0%)   53 (88.3%)         44 (73.3%)   0.245
         Day 3     52 (86.7%)   52 (86.7%)   48 (80.0%)   51 (85.0%)   54 (90.0%)         45 (75.0%)   0.254
         Day 4     44 (73.3%)   51 (85.0%)   51 (85.0%)   49 (81.7%)   52 (86.7%)         46 (76.7%)   0.348
         Day 5     48 (80.0%)   52 (86.7%)   51 (85.0%)   47 (78.3%)   50 (83.3%)         45 (75.0%)   0.574
         Day 6     53 (88.3%)   51 (85.0%)   48 (80.0%)   51 (85.0%)   52 (86.7%)         50 (83.3%)   0.861
         Day 7     51 (85.0%)   52 (86.7%)   51 (85.0%)   49 (81.7%)   53 (88.3%)         46 (76.7%)   0.569
         Day 8     51 (85.0%)   52 (86.7%)   54 (90.0%)   50 (83.3%)   54 (90.0%)         45 (75.0%)   0.198
         Day 9     52 (86.7%)   53 (88.3%)   53 (88.3%)   49 (81.7%)   51 (85.0%)         51 (85.0%)   0.904
         Day 10    52 (86.7%)   51 (85.0%)   51 (85.0%)   46 (76.7%)   51 (85.0%)         43 (71.7%)   0.196
         p         0.511        1.000        0.848        0.870        0.973              0.771

*p < 0.05

Within the scope of the study, all variables were examined together to identify the main factors affecting correct and incorrect responses. The goodness of fit of the model was evaluated using the Hosmer–Lemeshow test, and no statistically significant difference was found between the observed and predicted values (χ² = 5.089; p = 0.748), indicating that the logistic regression model fits the data well. The Cox & Snell R² (0.007) and Nagelkerke R² (0.011) values indicate that the variables included in the model (chatbot, language, day, and time of day) explain only approximately 1.1% of the variance in response accuracy; other, more influential factors not included in the model are therefore likely at play. The odds of a correct response were statistically significantly higher for queries posed in English than in the reference language (Turkish): English increased the odds of a correct answer by 26.8% (Exp(B), odds ratio = 1.268). Compared with the reference model (ChatGPT), only Perplexity showed a statistically significant difference, with the odds of a correct response decreasing by approximately 26.4% (Exp(B), odds ratio = 0.736). The other platforms did not differ significantly from ChatGPT within the overall model, and no significant differences were found in success rates across times of day or across days (Table 5).

Table 5.

Binary logistic regression results

                      B        S.E.    Wald      df   p       Exp(B) (Odds Ratio)
ChatGPT (reference)                    32.255    5    0.000
ChatGPT-4o            0.215    0.111   3.743     1    0.053   1.240
ChatGPT-5             0.041    0.108   0.142     1    0.706   1.041
Gemini                -0.189   0.104   3.293     1    0.070   0.828
Microsoft Copilot     0.064    0.108   0.353     1    0.552   1.066
Perplexity            -0.307   0.102   8.984     1    0.003   0.736
English               0.238    0.061   15.007    1    0.000   1.268
Morning (reference)                    0.097     2    0.953
Noon                  0.017    0.075   0.051     1    0.822   1.017
Evening               -0.006   0.075   0.006     1    0.940   0.994
Day 1 (reference)                      3.481     9    0.942
Day 2                 0.028    0.136   0.042     1    0.838   1.028
Day 3                 0.056    0.137   0.168     1    0.682   1.058
Day 4                 -0.045   0.135   0.113     1    0.737   0.956
Day 5                 -0.054   0.134   0.162     1    0.687   0.947
Day 6                 0.056    0.137   0.168     1    0.682   1.058
Day 7                 0.028    0.136   0.042     1    0.838   1.028
Day 8                 0.104    0.138   0.573     1    0.449   1.110
Day 9                 0.134    0.138   0.936     1    0.333   1.143
Day 10                -0.009   0.135   0.005     1    0.946   0.991
Constant              1.387    0.129   115.621   1    0.000   4.002

Cox & Snell R² = 0.007; Nagelkerke R² = 0.011; overall classification accuracy = 81.7%; Hosmer–Lemeshow test: χ² = 5.089, p = 0.748. Rows for the reference categories (ChatGPT, Morning, Day 1) report the omnibus Wald test for the corresponding factor.
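
As a consistency check on Table 5, each odds ratio is simply the exponential of its regression coefficient; for example, the reported values for the English and Perplexity terms are reproduced by

$$\mathrm{OR}_{\text{English}} = e^{B} = e^{0.238} \approx 1.268, \qquad \mathrm{OR}_{\text{Perplexity}} = e^{-0.307} \approx 0.736.$$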

Discussion

Vital pulp therapy (VPT) is a conservative and tissue-friendly treatment approach that has gained increasing importance in clinical decision-making processes with the reconstruction of modern endodontics based on biological principles [1]. Artificial intelligence chatbots are innovative digital tools that facilitate access to information in healthcare and have the potential to provide rapid information, support patient guidance, and offer educational content in dentistry [12]. In this context, AI chatbots have the potential to support clinical decision-making in treatment approaches such as VPT that require evidence-based knowledge. However, the accuracy and currency of the information provided by these systems must be carefully evaluated to ensure safe clinical use.

Today, AI models possess multilingual communication capabilities, a feature that is critically important for reaching diverse patient populations. Turkish was selected as one of the study languages due to its accessibility and its status as the primary language of the country where the research was conducted. English, on the other hand, is recognized as the primary language of international communication in the health sciences. Therefore, presenting the questions in both languages enabled an evaluation of chatbot performance variations associated with linguistic differences. Additionally, since AI models are dynamically updateable systems, it was anticipated that response accuracy rates might vary over time. Large language models may theoretically be influenced by factors such as server load, system traffic, or minor background model updates. Within this framework, three different time intervals—morning, noon, and evening—were designated to observe potential intraday performance fluctuations and to analyze short-term temporal variation. The selection of three equally spaced time points ensured methodological consistency and standardization in the data collection process. Accordingly, to evaluate the potential effect of the time variable on response accuracy, the queries were repeated three times per day over a 10-day period.

Thus, this study comprehensively analyzed the impact of language (Turkish–English) and time (10 days) variables on the accuracy of information related to VPT. The null hypothesis proposed that neither the language difference nor the time variable would produce a statistically significant effect on chatbot response accuracy. However, the findings indicated that language difference influenced the performance of certain chatbots (Gemini and Microsoft Copilot), whereas the time variable had no significant impact on accuracy. These results suggest that the null hypothesis was supported for the time variable but partially rejected regarding the language difference.

Overall, a statistically significant difference was found between the accuracy rates of Turkish and English questions across all chatbots (p < 0.05), with generally higher accuracy observed in English questions. When the chatbots were evaluated individually in terms of the language variable, no significant difference in accuracy was found between Turkish and English questions for most models, except for Gemini and Microsoft Copilot, which achieved higher accuracy rates in English. This finding can be explained by the fact that the majority of training datasets are predominantly in English. In addition, the agglutinative structure [14] of Turkish increases morphological complexity by generating numerous semantic variations through root–suffix combinations, which poses challenges for natural language processing (NLP) systems in terms of accurate comprehension and response generation [13]. Consequently, the relatively lower accuracy of chatbots in agglutinative languages such as Turkish can be considered a predictable limitation arising from linguistic structure. Indeed, previous studies have reported that agglutinative languages yield lower accuracy rates compared with fusional languages such as English and German [15]. Our findings are consistent with the study conducted by Büyüközer et al. in 2025, which reported significantly higher accuracy rates for English questions related to regenerative endodontic procedures (REP) compared with their Turkish counterparts [13].

When accuracy rates were evaluated independently of language, Microsoft Copilot demonstrated the highest accuracy, while Perplexity achieved the lowest. This finding indicates clear performance differences among various AI-based language models, with Microsoft Copilot exhibiting superior overall accuracy, whereas Perplexity showed comparatively lower reliability. When analyzed by language, ChatGPT-4o achieved the highest accuracy in Turkish questions (83.8%), while Gemini had the lowest (77.3%). In a study conducted by Öztürk et al. [17], the accuracy and consistency of chatbot responses to multiple-choice endodontic education questions asked at different times of the day and across several days were evaluated, and ChatGPT-4o was reported to have the highest accuracy rate, supporting the findings of the present study. For English questions, Microsoft Copilot achieved the highest accuracy (87.3%), while Perplexity again ranked lowest (77.2%). Similarly, in a recent study evaluating chatbot performance on prosthodontics questions from the Dental Specialty Recognition Examination (DSRE), Microsoft Copilot achieved the highest accuracy rate (73%), whereas Perplexity showed the lowest (54.8%) [18]. These results are consistent with the findings of our study, reinforcing that Perplexity tends to perform at a lower level, while Copilot demonstrates more reliable and accurate outcomes.

In recent years, studies evaluating the performance of large language models (LLMs) in dentistry have demonstrated significant differences in accuracy and consistency among chatbots [19–25]. These studies indicate that different platforms may exhibit varying accuracy rates when responding to the same set of questions and that performance is substantially influenced by model-specific characteristics. Chau et al. (2024) evaluated the performance of generative artificial intelligence systems in dental licensing examinations and reported marked differences in success rates among models [21]. Similarly, Chau et al. (2025) examined chatbot responses to text-based multiple-choice questions in prosthodontic and restorative dentistry and found statistically significant differences in accuracy across platforms [22]. Rokhshad et al. (2024) compared the accuracy and consistency of chatbots with clinicians in answering pediatric dentistry questions and demonstrated notable heterogeneity in model performance [23]. These findings support the view that LLM performance is model-dependent and may vary according to the specific clinical domain assessed. Therefore, the accuracy differences observed in the present study are consistent with the model-related variability in performance reported in the literature.

When the accuracy rates of chatbot responses to Turkish and English questions were examined across different days, no statistically significant differences were observed for any platform (p > 0.05). The absence of day-to-day variation in chatbot accuracy (p > 0.05) indicates that these models operate with short-term stability. This finding is consistent with previous chatbot studies investigating the effect of time on response accuracy [17, 26]. Furthermore, when accuracy rates were analyzed across different time periods (morning, noon, and evening), no statistically significant differences were found between Turkish and English responses in most cases. This result confirms that large language models tend to exhibit stable short-term performance and is in line with findings reported in the literature [27, 28]. Therefore, it can be concluded that the time variable has a limited effect on chatbot response accuracy. However, since only a short-term period (10 days) was evaluated in this study, longer-term investigations are needed to determine the long-term performance stability of AI chatbots.

This study has several methodological limitations. First, the evaluation of responses in a ‘true/false’ format represents an important limitation. Although this format allows for standardized and objective comparison, it may not fully reflect the complexity of clinical decision-making processes, including the interpretation of clinical findings, evaluation of imaging results, consideration of individual patient characteristics, risk–benefit analysis, incorporation of current clinical guidelines, and integration of the physician’s clinical experience. Second, only 20 questions related to vital pulp therapy were evaluated in this study. This limited number of questions may not fully represent the broad and heterogeneous range of clinical scenarios encountered in practice. Including a larger number of questions with varying levels of difficulty could have allowed for a more comprehensive and generalizable assessment of chatbot performance. Third, the assessment of the time variable covered only a 10-day period. Although this design was intended to analyze short-term performance consistency and intra-day/inter-day variability, long-term stability (at the level of months) could not be evaluated within the scope of this study, given that large language models are dynamically updatable systems. From a clinical perspective, while stable short-term performance is a positive finding in terms of daily usability and reliability, the primary clinical relevance lies in determining whether accuracy levels are maintained following potential model updates over the long term. Therefore, longitudinal follow-up studies conducted over months or years would provide more meaningful evidence regarding clinical reliability. In addition, the study included only six AI-based chatbot platforms. Other systems with different model architectures, training data, and update processes may demonstrate varying levels of accuracy and consistency. Therefore, the findings of this study cannot be generalized to all artificial intelligence–based language models. Furthermore, the inclusion of only Turkish and English limits the generalizability of the results to other languages and populations with different linguistic structures. Since language differences may affect model performance, studies incorporating a wider range of languages would provide more comprehensive and reliable conclusions. Future studies should include a much larger and more diverse question pool, incorporating a greater number of items with varying levels of difficulty. In addition, longer follow-up periods—particularly studies conducted over several months—should be planned, and a broader range of AI platforms as well as different language groups should be included. Such an approach would allow for a more reliable evaluation of both short- and long-term model stability and provide a more comprehensive understanding of the true contribution of AI systems to clinical decision-making processes.

Conclusions

This study examined the accuracy of responses provided by different AI chatbots to questions related to VPT. Microsoft Copilot achieved the highest accuracy rate, while Perplexity demonstrated the lowest performance. When analyzed by language, ChatGPT-4o showed the highest accuracy in Turkish questions, whereas Microsoft Copilot performed best in English questions. The time variable showed no significant effect on accuracy.

These findings indicate that AI-based chatbots can serve as supportive tools in clinical decision-making; however, language differences and model-specific performance variations should be taken into account. Future research should expand the scope of evaluation by including a larger number of questions and topics, comparing additional AI models and their versions, and monitoring long-term performance trends. Furthermore, studies conducted using real clinical case scenarios could enhance the reliability and practical applicability of these systems in dental education and clinical decision-support processes.

Acknowledgements

Not applicable.

Authors’ contributions

E.S. and M.B. contributed to the processes of conceptualization and design and played a pivotal role in data collection. They also contributed to the acquisition, analysis, and interpretation of the data. Both authors contributed to the preparation of the draft and critically reviewed it, approving the final manuscript.

Funding

The authors declare that this study has received no financial support.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Ethics approval and consent to participate

This study did not require ethical approval as it did not involve human or animal participants.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Duncan HF. Present status and future directions: vital pulp treatment and pulp preservation strategies. Int Endod J. 2022;55:497–511. doi: 10.1111/iej.13688.
2. Zhang L, Yin L, Wu J, Wang X, Huang J, Li Q. Clinical influencing factors of vital pulp therapy on pulpitis permanent teeth with two calcium silicate-based materials: a randomized clinical trial. Med (Baltim). 2024;103(18):e38015. doi: 10.1097/MD.0000000000038015.
3. Maloney B, Creavin G, Duncan HF. Pulp diagnosis: current guidelines, shortcomings, and future developments. J Ir Dent Assoc. 2023;69:1–6. doi: 10.58541/001c.90027.
4. Aminoshariae A, Kulild J, Nagendrababu V. Artificial intelligence in endodontics: current applications and future directions. J Endod. 2021;47(9):1352–7. doi: 10.1016/j.joen.2021.05.003.
5. King MR. The future of AI in medicine: a perspective from a chatbot. Ann Biomed Eng. 2023;51(2):291–5. doi: 10.1007/s10439-023-03166.
6. Huang H, Zheng O, Wang D, Yin J, Wang Z, Ding S, et al. ChatGPT for shaping the future of dentistry: the potential of multimodal large language models. Int J Oral Sci. 2023;15(1):29. doi: 10.1038/s41368-023-00238-0.
7. Rodrigues JA, Krois J, Schwendicke F. Demystifying artificial intelligence and deep learning in dentistry. Braz Oral Res. 2021;35:e094. doi: 10.1590/1807-3107bor-2021.vol35.0094.
8. Yurdakurban E, Topsakal KG, Duran GS. A comparative analysis of AI-based chatbots: assessing data quality in orthognathic surgery-related patient information. J Stomatol Oral Maxillofac Surg. 2024;125(5):101757. doi: 10.1016/j.jormas.2024.101757.
9. Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the performance of generative AI large language models ChatGPT, Google Bard, and Microsoft Bing Chat in supporting evidence-based dentistry: a comparative mixed-methods study. J Med Internet Res. 2023;25:e51580. doi: 10.2196/51580.
10. Büker M, Sümbüllü M, Arslan H. Comparative performance of chatbots in endodontic clinical decision support: a 4-day accuracy and consistency study. Int Dent J. 2025;75(5):100920. doi: 10.1016/j.idj.2025.100920.
11. de Moura JDM, Fontana CE, da Silva Lima VHR, de Souza Alves I, de Melo Santos PA, de Almeida Rodrigues P. Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: a cross-sectional study. Comput Biol Med. 2024;183:109332. doi: 10.1016/j.compbiomed.2024.109332.
12. Mohammad-Rahimi H, Setzer FC, Aminoshariae A, Dummer PMH, Duncan HF, Nosrat A. Artificial intelligence chatbots in endodontic education: concepts and potential applications. Int Endod J. 2025;00:1–14. doi: 10.1111/iej.14231.
13. Büyüközer Özkan H, Doğan Çankaya T, Kölüş T. The impact of language variability on artificial intelligence performance in regenerative endodontics. Healthc (Basel). 2025;13(10):1190. doi: 10.3390/healthcare13101190.
14. Onan B. The cognitive backgrounds that agglutinating language structure forms in Turkish teaching. Mustafa Kemal Univ J Soc Sci Inst. 2014;6(11):236–63.
15. Arnett C, Bergen BK. Why do language models perform worse for morphologically complex languages? arXiv. 2024. doi: 10.48550/arXiv.2411.14198.
16. American Association of Endodontists. Guide to clinical endodontics. 6th ed. Chicago: American Association of Endodontists; 2019.
17. Arılı Öztürk E, Turan Gökduman C, Çanakçi BC. Evaluation of the performance of ChatGPT-4 and ChatGPT-4o as a learning tool in endodontics. Int Endod J. 2025;00:1–13. doi: 10.1111/iej.14217.
18. Eraslan R, Ayata M, Yagci F, Albayrak H. Exploring the potential of artificial intelligence chatbots in prosthodontics education. BMC Med Educ. 2025;25(1):321. doi: 10.1186/s12909-025-06849-w.
19. Yilmaz BE, Gokkurt Yilmaz BN, Ozbey F. Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis. BMC Oral Health. 2025;25(1):573. doi: 10.1186/s12903-025-05926-2.
20. Gokkurt Yilmaz BN, Ozbey F, Yilmaz BE. Evaluation of the performance of different large language models on head and neck anatomy questions in the dentistry specialization exam in Turkey. Surg Radiol Anat. 2025;47(1):211. doi: 10.1007/s00276-025-03723-8.
21. Chau RCW, Thu KM, Yu OY, Hsung RTC, Lo ECM, Lam WYH. Performance of generative artificial intelligence in dental licensing examinations. Int Dent J. 2024;74(3):616–21. doi: 10.1016/j.identj.2023.12.007.
22. Chau RCW, Thu KM, Yu OY, Hsung RTC, Wang DCP, Man MWH, et al. Evaluation of chatbot responses to text-based multiple-choice questions in prosthodontic and restorative dentistry. Dent J. 2025;13(7):279. doi: 10.3390/dj13070279.
23. Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: a pilot study. J Dent. 2024;144:104938. doi: 10.1016/j.jdent.2024.104938.
24. Rokhshad R, Mohammad FD, Nomani M, Mohammad-Rahimi H, Schwendicke F. Chatbots for conducting systematic reviews in pediatric dentistry. J Dent. 2025;158:105733. doi: 10.1016/j.jdent.2025.105733.
25. Rokhshad R, Khoury ZH, Mohammad-Rahimi H, Motie P, Price JB, Tavares T, et al. Efficacy and empathy of AI chatbots in answering frequently asked questions on oral oncology. Oral Surg Oral Med Oral Pathol Oral Radiol. 2025;139(6):719–28. doi: 10.1016/j.oooo.2024.12.028.
26. Mustuloğlu Ş, Deniz BP. Evaluation of chatbots in the emergency management of avulsion injuries. Dent Traumatol. 2025;41(4):437–44. doi: 10.1111/edt.13041.
27. Büker M, Mercan G. Readability, accuracy, appropriateness, and quality of AI chatbot responses as a patient information source on root canal retreatment: a comparative assessment. Int J Med Inf. 2025;195:105948. doi: 10.1016/j.ijmedinf.2025.105948.
28. Şimşek E, Kurt Ö. The impact of language differences on the readability, quality, and reliability of information provided by artificial intelligence chatbots regarding vital pulp therapy: a cross-sectional study. BMC Oral Health. 2026;26:134. doi: 10.1186/s12903-025-07520-y.


