ABSTRACT
Background
This study assessed the accuracy and consistency of responses provided by six Artificial Intelligence (AI) applications, ChatGPT version 3.5 (OpenAI), ChatGPT version 4 (OpenAI), ChatGPT version 4.0 (OpenAI), Perplexity (Perplexity.AI), Gemini (Google), and Copilot (Bing), to questions related to emergency management of avulsed teeth.
Materials and Methods
Two pediatric dentists developed 18 true or false questions regarding dental avulsion and posed them to publicly available chatbots over 3 days. The responses were recorded and compared with the correct answers. The SPSS program was used to calculate the accuracy of the responses and their consistency.
Results
ChatGPT 4.0 achieved the highest accuracy rate of 95.6% over the entire time frame, while Perplexity (Perplexity.AI) had the lowest accuracy rate of 67.2%. ChatGPT version 4.0 (OpenAI) was the only AI that achieved perfect agreement with real answers, except at noon on day 1. ChatGPT version 3.5 (OpenAI) showed weak agreement most often (6 times).
Conclusions
With the exception of ChatGPT's paid version, 4.0, AI chatbots do not seem ready for use as the main resource in managing avulsed teeth during emergencies. It might prove beneficial to incorporate the International Association of Dental Traumatology (IADT) guidelines in chatbot databases, enhancing their accuracy and consistency.
Keywords: artificial intelligence, chatbot, emergency, large language models, tooth avulsion
1. Introduction
Avulsion is defined as the complete displacement of the tooth from the alveolar socket due to trauma. This trauma is most common between the ages of 7 and 9 years. The most commonly affected teeth in the primary and permanent dentition are the upper central incisors [1]. Avulsion occurs in 7%–21% of primary teeth and 0.5%–3% of permanent teeth [2]. Correct emergency intervention is very important in the successful treatment of avulsed teeth. The most appropriate treatment for permanent avulsed teeth is to immediately place the tooth in its socket. If this condition is not met, it must be delivered to the dentist in an optimal transport environment. Leaving avulsed teeth in a dry environment for an extended period of time results in failure [3].
Large language models (LLMs) are generative mathematical models that are trained on hundreds of terabytes of textual data (articles, books, and other online content) and are capable of sampling from statistical distributions of tokens, including words, graphemes, individual characters, and punctuation marks, in the public corpus [4]. Most chatbots, however, are proprietary models, and their functionality remains largely unknown [5]. Additionally, LLMs vary in their history, architecture, transformers, resources, training methods, and implementations. For example, GPT‐3.5 (OpenAI, San Francisco, CA) was trained using both supervised and unsupervised learning techniques on a substantial corpus of Internet text data. Unlike its predecessor, GPT‐4 (OpenAI, San Francisco, CA) introduced multimodal capabilities, allowing it to evaluate image inputs, although this feature was not yet available for public use. Nonetheless, GPT‐3.5 and GPT‐4 were mostly trained on text generated until September 2021. ChatGPT‐4.0 possesses distinctive attributes, including a substantial user base and collaborating experts who offer ongoing feedback for training, enhanced reasoning and instruction‐following abilities, as well as the incorporation of recent training data and insights from the practical application of prior models into GPT‐4.0's security research and monitoring framework. Google's Bard (Google, Mountain View, CA) and Bing AI (Microsoft Corporation, USA) have the ability to access and incorporate information from the Internet in real time while generating responses [6, 7]. Perplexity (Perplexity.AI) presents its sources in the output as hyperlinks to the websites it used. Furthermore, pre‐generated “related” inputs are supplied below the output, assisting the user in providing further comments or exploring the original output further [8].
Artificial intelligence (AI) chatbots can simulate human conversation using AI methods, including natural language processing and machine learning [4]. Nowadays, the ability of chatbots to understand, interpret, and answer questions has made it easier for people to comprehend complex topics, making them increasingly popular [4, 9]. AI chatbots can also be easily utilized in various healthcare domains, such as helping users evaluate the need to consult a healthcare professional, detecting symptoms, and providing clinical decision support [4, 9, 10]. Although the use of artificial intelligence, especially GPTs, in health‐related issues is becoming more widespread, there are also concerns about its validity and reliability [11, 12, 13, 14, 15, 16, 17, 18]. Chatbots have several inherent limitations, including the potential for factually inaccurate responses known as “hallucinations” [17], the possibility of giving different responses at different times, and concerns regarding the reliability of their training data and processes [11, 14]. The performance of chatbots in answering dentistry‐related questions on endodontics [13, 14], pediatric dentistry [15], oral and maxillofacial surgery [16], and dental trauma [11] has been evaluated previously. However, these studies have generally assessed a limited number of chatbot types. Additionally, further research is needed to better understand the variability in responses provided by various chatbots, particularly in the context of medical applications where consistency in response content is crucial for ensuring patient safety.
In the case of an avulsion, it may not always be feasible for parents or healthcare professionals to access a dentist specialized in this area (endodontist, pediatric dentist). As a result, discussing the benefits and drawbacks of AI applications in the emergency intervention of avulsed teeth can greatly contribute to this matter. Therefore, the aim of this study was to assess the consistency and accuracy of responses to dental avulsion‐related questions provided by various chatbots such as ChatGPT version 3.5 (OpenAI), ChatGPT version 4 (OpenAI), ChatGPT version 4.0 (OpenAI), Perplexity (Perplexity.AI), Gemini (Google), and Copilot (Bing, Microsoft).
2. Method
In this cross‐sectional study, an educational research design was employed to evaluate six different chatbots based on their responses to standardized test questions. Ethical approval was not required for this research due to the lack of human participants. The guidelines of the International Association of Dental Traumatology (IADT), which outline widely accepted treatment approaches in the field of dental trauma, were used to design the questions. The IADT guidelines have been developed with the agreement of these departments for the management of traumatic dental injuries and are current [3]. A true/false questionnaire comprising 18 questions, addressing frequently asked inquiries from patients and parents about the emergency management of avulsed teeth, was created by a pediatric dentistry faculty member and a research assistant (Ş.M. and B.P.D.) during their specialist training (Table 1). The questionnaire was sent to three experts (a pediatric dentist, an endodontist, and an oral and maxillofacial surgeon) for validation. Based on the feedback from the experts, some questions were simplified for better clarity.
TABLE 1.
Eighteen questions about avulsion injuries and correct answers.
| Statement | Correct answer |
|---|---|
| The avulsed primary tooth needs to be reimplanted | False |
| The avulsed permanent tooth should be reimplanted | True |
| In case of tooth avulsion, the patient should be advised to get a tetanus vaccine | True |
| It is a suitable method to store an avulsed permanent tooth in a sponge, cotton, or tissue until consulting a dentist | False |
| Milk is the most suitable and practical storage medium for an avulsed tooth until consulting a dentist | True |
| The condition of periodontal ligament cells in an avulsed permanent tooth depends on the development level of the root and is not affected by the time spent outside the mouth and the storage medium in which the avulsed tooth is kept | False |
| A rigid splint should be applied for 2 weeks after reimplantation of the avulsed tooth | False |
| In permanent tooth avulsion, root canal treatment should be started within 2 weeks after reimplantation | True |
| In primary tooth avulsion, root canal treatment should be started within 2 weeks after reimplantation | False |
| Periodontal ligament cells can survive for 2 h in a dry environment after avulsion | False |
| In the case of permanent tooth avulsion, root canal treatment should be started immediately after reimplantation, since the healing potential of open‐apex teeth is higher than closed‐apex teeth | False |
| In the case of permanent tooth avulsion, the long‐term prognosis of delayed replantation is poor. The expected result is replacement resorption | True |
| Immediate reimplantation of the avulsed tooth at the site of the accident is the best treatment | True |
| Holding an avulsed permanent tooth root and avoiding touching the tooth crown should be avoided | False |
| Avulsion is the complete displacement of a tooth as a result of trauma | True |
| Maxillary central incisors are the teeth most prone to avulsion | True |
| In the case of a permanent tooth avulsion with an open apex, obliteration of the pulp canal on follow‐up radiographs is a positive result indicating that the pulp is healing | True |
| Amoxicillin or penicillin should be prescribed as the first‐choice antibiotic for a child with an avulsed permanent tooth | True |
A user with expertise in using chatbots (B.P.D.) posed the prepared questions to ChatGPT version 3.5 (OpenAI), ChatGPT version 4 (OpenAI), ChatGPT version 4.0 (OpenAI), Perplexity (Perplexity.AI), Gemini (Google), and Copilot (Bing, Microsoft) platforms between June 25, 2024, and June 27, 2024. An example question is, “Should an avulsed primary tooth be reimplanted? Correct or incorrect?” The questions were asked over 3 days. Each question was asked three times a day (morning, afternoon, and evening), and the “new chat” option was selected each time, yielding a total of 54 answers for each question (3 days × 3 times of day × 6 chatbots). These conversations were conducted using a single account from the user's computer IP address. Additionally, the history between these conversations was cleared: the “delete our chat history” command was applied, and the messages were manually deleted. To avoid time differences, the test was applied to the AI programs in order, starting at the same times each day (9:00 AM, 2:00 PM, and 7:00 PM). This approach allowed us to investigate the consistency of the responses and whether any changes or fluctuations were observed at different times [11, 14]. All responses were stored in an Excel spreadsheet (Microsoft, Redmond, WA, USA). At the end of the 3 days, the responses were compared with the correct answers in the guide and coded as “correct” or “incorrect” by the researchers. To address possible sources of bias, the generated response list was presented in blinded form to two researchers, who evaluated the responses independently. Discrepancies were then resolved through consensus.
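The querying design described above can be enumerated in a few lines, which makes the response counts easy to check. The following is only a sketch: the dates, time slots, and platform list come from the text, while the names `build_schedule`, `PLATFORMS`, `DAYS`, and `SLOTS` are illustrative and not part of the study.

```python
from datetime import date, time
from itertools import product

# Design parameters taken from the Methods text.
N_QUESTIONS = 18
PLATFORMS = [
    "ChatGPT 3.5", "ChatGPT 4", "ChatGPT 4.0",
    "Perplexity", "Gemini", "Copilot",
]
DAYS = [date(2024, 6, 25), date(2024, 6, 26), date(2024, 6, 27)]
SLOTS = [time(9, 0), time(14, 0), time(19, 0)]  # morning, afternoon, evening


def build_schedule():
    """Return one (day, slot, platform, question_no) tuple per query."""
    questions = range(1, N_QUESTIONS + 1)
    return list(product(DAYS, SLOTS, PLATFORMS, questions))


schedule = build_schedule()
per_question = sum(1 for _, _, _, q in schedule if q == 1)
per_platform = sum(1 for _, _, p, _ in schedule if p == "Gemini")

print(len(schedule), per_question, per_platform)  # 972 54 162
```

Each question therefore contributes 54 answers and each platform 162, for 972 responses in total.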
This study tested two questions: whether the answers provided by different chatbot types regarding dental avulsion match the correct responses, and whether the accuracy of the information generated by chatbots varies at different times of the day and on different days. The IBM SPSS Statistics V22 software package (SPSS Inc., Chicago, IL, USA) was utilized for statistical analysis. Weighted kappa was used to assess the agreement of the responses provided by the AI programs with the correct answers. The following criteria for kappa values were established: < 0.10 = no agreement; 0.10–0.40 = weak agreement; 0.41–0.60 = moderate agreement; 0.61–0.80 = strong agreement; and 0.81–1.00 = perfect agreement [19]. The Cochran Q test was used to compare changes across three time points (days 1, 2, and 3; and morning, noon, and evening). The significance level was set at 5% (p < 0.05).
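As an illustration of the agreement statistic, the sketch below computes Cohen's kappa between the Table 1 answer key and one set of responses. It assumes that, with only two answer categories, weighted kappa reduces to the unweighted form. The function name `cohen_kappa` and the response set are hypothetical; the study itself used SPSS.

```python
def cohen_kappa(key, responses):
    """Cohen's kappa between the answer key and one set of chatbot responses."""
    if len(key) != len(responses):
        raise ValueError("key and responses must have the same length")
    n = len(key)
    labels = set(key) | set(responses)
    # Observed proportion of agreement.
    observed = sum(k == r for k, r in zip(key, responses)) / n
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(
        (key.count(lab) / n) * (responses.count(lab) / n) for lab in labels
    )
    if expected == 1.0:
        return 1.0  # both sides used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Answer key from Table 1, in table order (True = statement is correct).
KEY = [False, True, True, False, True, False, False, True, False,
       False, False, True, True, False, True, True, True, True]

# Hypothetical response set: identical to the key except that the first
# statement is wrongly marked as true.
responses = list(KEY)
responses[0] = True

kappa = cohen_kappa(KEY, responses)
print(round(kappa, 3))  # 0.886
```

Under these assumptions, a single wrong answer out of 18 gives a kappa of 0.886, a value that also appears in Table 5.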
3. Results
When 972 responses were evaluated (18 questions × 6 platforms × 3 time periods × 3 days), the percentage of correct answers given by AI applications for all questions was 78.3%. ChatGPT 4.0 achieved the highest accuracy rate with 95.6% in the overall time period, while Perplexity (Perplexity.AI) had the lowest accuracy rate with 67.2% (Table 2).
TABLE 2.
Accuracy distribution of answers obtained from artificial intelligence platforms compared to real answers.
| Artificial intelligence | 1st day correct responses, n = 54 (%) | 2nd day correct responses, n = 54 (%) | 3rd day correct responses, n = 54 (%) | Total correct responses, n = 162 (%) |
|---|---|---|---|---|
| ChatGPT version 3.5 (OpenAI) | 34 (62.9) | 37 (68.5) | 40 (74) | 111 (68.5) |
| ChatGPT version 4 (OpenAI) | 40 (74) | 43 (79.6) | 46 (83.3) | 129 (79.6) |
| ChatGPT version 4.0 (OpenAI) | 50 (92.5) | 52 (96.2) | 53 (98.1) | 155 (95.6) |
| Perplexity (Perplexity.AI) | 36 (66.6) | 36 (66.6) | 37 (68.5) | 109 (67.2) |
| Gemini (Google) | 38 (70.3) | 44 (81.4) | 45 (83.3) | 127 (78.3) |
| Copilot (Bing) | 39 (72.2) | 47 (87) | 45 (83.3) | 131 (80.8) |
| Total | 237 (73.1) | 259 (79.9) | 266 (82) | 762 (78.3) |
Note: The last row presents information regarding the responses obtained over different time periods, specifically 1 day and over a total duration of 3 days (n = 324, n = 972).
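The total percentages in Table 2 can be recomputed from the counts. The sketch below does so under one assumption that is inferred from the numbers and not stated by the authors: percentages appear to be cut off after one decimal place, not rounded (for example, 155/162 is about 95.68%, reported as 95.6%). The helper `pct_truncated` is illustrative.

```python
# Correct-response totals per platform over 3 days, copied from Table 2.
CORRECT = {
    "ChatGPT version 3.5": 111,
    "ChatGPT version 4": 129,
    "ChatGPT version 4.0": 155,
    "Perplexity": 109,
    "Gemini": 127,
    "Copilot": 131,
}
PER_PLATFORM = 18 * 3 * 3  # questions x times of day x days = 162


def pct_truncated(correct, total):
    """Percentage cut off (not rounded) after one decimal place."""
    return (1000 * correct // total) / 10


totals = {name: pct_truncated(c, PER_PLATFORM) for name, c in CORRECT.items()}
overall = pct_truncated(sum(CORRECT.values()), PER_PLATFORM * len(CORRECT))

for name, pct in totals.items():
    print(f"{name}: {pct}%")
print(f"Overall: {overall}%")
```

With truncation, the recomputed totals agree with the "Total" column and with the overall figure of 78.3%.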
Table 3 shows the variations in the responses of AI systems at different times of the day, including morning, noon, and evening. Although no difference was statistically significant (p > 0.05), the numbers of correct and incorrect responses delivered by the AI platforms varied across the three times of day. Only Perplexity (Perplexity.AI) provided the same number of correct and incorrect responses in the morning, at noon, and in the evening of day 2.
TABLE 3.
Changes in AI responses at different times of the day.
| AI and day | Morning (n = 18), correct, n (%) | Noon (n = 18), correct, n (%) | Evening (n = 18), correct, n (%) | p |
|---|---|---|---|---|
| ChatGPT version 3.5 (OpenAI) | | | | |
| 1st day | 14 (77.7) | 12 (66.6) | 14 (77.7) | 0.607 |
| 2nd day | 13 (72.2) | 15 (83.3) | 15 (83.3) | 0.641 |
| 3rd day | 15 (83.3) | 13 (72.2) | 14 (77.7) | 0.368 |
| ChatGPT version 4 (OpenAI) | | | | |
| 1st day | 12 (66.6) | 13 (72.2) | 11 (61.1) | 0.651 |
| 2nd day | 11 (61.1) | 12 (66.6) | 12 (66.6) | 0.717 |
| 3rd day | 13 (72.2) | 12 (66.6) | 13 (72.2) | 0.368 |
| ChatGPT version 4.0 (OpenAI) | | | | |
| 1st day | 11 (61.1) | 12 (66.6) | 11 (61.1) | 0.368 |
| 2nd day | 11 (61.1) | 11 (61.1) | 10 (55.5) | 0.368 |
| 3rd day | 10 (55.5) | 9 (50.0) | 10 (55.5) | 0.368 |
| Perplexity (Perplexity.AI) | | | | |
| 1st day | 6 (33.3) | 7 (38.8) | 5 (27.7) | 0.607 |
| 2nd day | 6 (33.3) | 6 (33.3) | 6 (33.3) | 1.000 |
| 3rd day | 7 (38.8) | 11 (61.1) | 7 (38.8) | 0.069 |
| Gemini (Google) | | | | |
| 1st day | 10 (55.5) | 9 (50.0) | 11 (61.1) | 0.794 |
| 2nd day | 11 (61.1) | 10 (55.5) | 11 (61.1) | 0.819 |
| 3rd day | 12 (66.6) | 11 (61.1) | 12 (66.6) | 0.717 |
| Copilot (Bing) | | | | |
| 1st day | 15 (83.3) | 11 (61.1) | 15 (83.3) | 0.169 |
| 2nd day | 11 (61.1) | 10 (55.5) | 12 (66.6) | 0.607 |
| 3rd day | 13 (72.2) | 12 (66.6) | 14 (77.7) | 0.368 |
Table 4 shows the differences in the responses provided by the AI chatbots at the same time on days 1, 2, and 3. Accordingly, although there is no statistically significant difference (except for Perplexity [Perplexity.AI] at noon [p < 0.001]), the correct–incorrect status of the answers provided by all AI platforms differs from day to day within the same time periods.
TABLE 4.
Variations in AI responses on different days (same time of day).
| AI and time of day | 1st day (n = 18), correct, n (%) | 2nd day (n = 18), correct, n (%) | 3rd day (n = 18), correct, n (%) | p |
|---|---|---|---|---|
| ChatGPT version 3.5 (OpenAI) | | | | |
| Morning | 14 (77.7) | 13 (72.2) | 15 (83.3) | 0.741 |
| Noon | 12 (66.6) | 15 (83.3) | 13 (72.2) | 0.368 |
| Evening | 14 (77.7) | 15 (83.3) | 14 (77.7) | 0.882 |
| ChatGPT version 4 (OpenAI) | | | | |
| Morning | 12 (66.6) | 11 (61.1) | 13 (72.2) | 0.368 |
| Noon | 13 (72.2) | 12 (66.6) | 12 (66.6) | 0.819 |
| Evening | 11 (61.1) | 12 (66.6) | 13 (72.2) | 0.368 |
| ChatGPT version 4.0 (OpenAI) | | | | |
| Morning | 11 (61.1) | 11 (61.1) | 10 (55.5) | 0.368 |
| Noon | 12 (66.6) | 11 (61.1) | 9 (50.0) | 0.097 |
| Evening | 11 (61.1) | 10 (55.5) | 10 (55.5) | 0.368 |
| Perplexity (Perplexity.AI) | | | | |
| Morning | 6 (33.3) | 6 (33.3) | 7 (38.8) | 0.819 |
| Noon | 7 (38.8) | 6 (33.3) | 11 (61.1) | < 0.001 |
| Evening | 5 (27.7) | 6 (33.3) | 7 (38.8) | 0.741 |
| Gemini (Google) | | | | |
| Morning | 10 (55.5) | 11 (61.1) | 12 (66.6) | 0.549 |
| Noon | 9 (50.0) | 10 (55.5) | 11 (61.1) | 0.717 |
| Evening | 11 (61.1) | 11 (61.1) | 12 (66.6) | 0.846 |
| Copilot (Bing) | | | | |
| Morning | 15 (83.3) | 11 (61.1) | 13 (72.2) | 0.180 |
| Noon | 11 (61.1) | 10 (55.5) | 12 (66.6) | 0.472 |
| Evening | 15 (83.3) | 12 (66.6) | 14 (77.7) | 0.368 |
Note: Perplexity (Perplexity.AI), Noon, p value = 0.030.
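For readers unfamiliar with the Cochran Q test used for these comparisons, the following is a minimal pure-Python sketch and not the SPSS procedure used in the study. The input coding is hypothetical, and the p-value shortcut `chi2_sf_df2` holds only for three occasions (2 degrees of freedom), where the chi-square upper tail equals exp(-Q/2).

```python
import math


def cochran_q(rows):
    """Cochran's Q for binary data: one row per question, one column per occasion."""
    k = len(rows[0])
    col_totals = [sum(row[j] for row in rows) for j in range(k)]
    row_totals = [sum(row) for row in rows]
    grand = sum(row_totals)
    denom = k * grand - sum(r * r for r in row_totals)
    if denom == 0:
        # No question changed between occasions; Q is treated as 0 here
        # (an assumption of this sketch).
        return 0.0, k - 1
    q = (k - 1) * (k * sum(c * c for c in col_totals) - grand ** 2) / denom
    return q, k - 1


def chi2_sf_df2(q):
    """Upper-tail chi-square probability, valid only for 2 degrees of freedom."""
    return math.exp(-q / 2)


# Hypothetical coding (1/0 per occasion) for 18 questions at three occasions:
# 12 always coded 1, 4 always coded 0, and 2 coded 0 only at the second occasion.
rows = [(1, 1, 1)] * 12 + [(0, 0, 0)] * 4 + [(1, 0, 1)] * 2

q, df = cochran_q(rows)
p = chi2_sf_df2(q)
print(q, df, round(p, 3))
```

Under the same 2-degrees-of-freedom assumption, a p value of 0.368 corresponds to Q = 2, since exp(-1) is approximately 0.368.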
Table 5 shows the level of reliability of the AI platforms' responses compared to the actual responses. Accordingly, ChatGPT version 4.0 (OpenAI) consistently achieved perfect agreement, except at noon on day 1. Additionally, ChatGPT version 3.5 (OpenAI) showed weak agreement 6 times, while Perplexity (Perplexity.AI) showed it 5 times, and ChatGPT version 4 (OpenAI) and Gemini (Google) showed it once. Copilot (Bing) showed it twice.
TABLE 5.
Response consistency of AI platforms relative to real responses.
| Day | Time | ChatGPT version 3.5 (OpenAI) | ChatGPT version 4 (OpenAI) | ChatGPT version 4.0 (OpenAI) | Perplexity (Perplexity.AI) | Gemini (Google) | Copilot (Bing) |
|---|---|---|---|---|---|---|---|
| 1st day | Morning | 0.289 | 0.308 | 0.886 | 0.357 | 0.775 | 0.160 |
| 1st day | Noon | 0.538 | 0.416 | 0.769 | 0.458 | 0.222 | 0.886 |
| 1st day | Evening | 0.184 | 0.658 | 0.886 | 0.259 | 0.658 | 0.160 |
| 2nd day | Morning | 0.182 | 0.430 | 0.886 | 0.357 | 0.658 | 0.658 |
| 2nd day | Noon | 0.400 | 0.538 | 0.886 | 0.571 | 0.550 | 0.775 |
| 2nd day | Evening | 0.400 | 0.769 | 1.000 | 0.143 | 0.658 | 0.779 |
| 3rd day | Morning | 0.400 | 0.649 | 0.889 | 0.458 | 0.769 | 0.649 |
| 3rd day | Noon | 0.416 | 0.769 | 1.000 | 0.430 | 0.658 | 0.769 |
| 3rd day | Evening | 0.526 | 0.649 | 1.000 | 0.241 | 0.538 | 0.536 |
Note: Response consistency of AIs was assessed by Kappa values (< 0.10 = no agreement; 0.10–0.40 = weak agreement; 0.41–0.60 = moderate agreement; 0.61–0.80 = strong agreement; and 0.81–1.00 = perfect agreement).
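The agreement bands in the note can be applied mechanically to the Table 5 values. The sketch below does this and counts the weak-agreement occasions per platform. The function name `agreement_band` is illustrative, and assigning values that fall between two stated bands (for example, between 0.40 and 0.41) to the higher band is an assumption.

```python
def agreement_band(kappa):
    """Map a kappa value to the agreement label used in the Table 5 note."""
    if kappa < 0.10:
        return "none"
    if kappa <= 0.40:
        return "weak"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "strong"
    return "perfect"


# Kappa values copied from Table 5, ordered day 1 morning ... day 3 evening.
KAPPAS = {
    "ChatGPT version 3.5": [0.289, 0.538, 0.184, 0.182, 0.400, 0.400, 0.400, 0.416, 0.526],
    "ChatGPT version 4": [0.308, 0.416, 0.658, 0.430, 0.538, 0.769, 0.649, 0.769, 0.649],
    "ChatGPT version 4.0": [0.886, 0.769, 0.886, 0.886, 0.886, 1.000, 0.889, 1.000, 1.000],
    "Perplexity": [0.357, 0.458, 0.259, 0.357, 0.571, 0.143, 0.458, 0.430, 0.241],
    "Gemini": [0.775, 0.222, 0.658, 0.658, 0.550, 0.658, 0.769, 0.658, 0.538],
    "Copilot": [0.160, 0.886, 0.160, 0.658, 0.775, 0.779, 0.649, 0.769, 0.536],
}

weak_counts = {
    name: sum(agreement_band(k) == "weak" for k in values)
    for name, values in KAPPAS.items()
}
print(weak_counts)
```

The resulting counts (6, 1, 0, 5, 1, and 2 in the order listed) match the weak-agreement frequencies reported in the text.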
4. Discussion
Health‐related information collected via the Internet is a significant and commonly utilized resource for the general population. Prior research has evaluated the precision of information disseminated by TikTok and YouTube videos for the emergency care of an avulsed tooth. The research findings suggest that the videos possess low to moderate utility [20, 21]. Recent advancements in artificial intelligence and human interactions have led to the growing prevalence of AI‐based chatbots. Consequently, patients or parents may rely on AI chatbots as a principal source of information on dental avulsion, a critical dental trauma necessitating immediate intervention. Delays in treatment adversely impact prognosis; thus, it is essential to establish the reliability and consistency of chatbot responses to pertinent inquiries. AI chatbots can facilitate conversations by answering patient questions and providing patients with information about treatments, potential adverse effects, complications, prognoses, and outcomes [4, 10, 11]. Although all of these models are based on language processing, they produce varying answers to the same questions due to differences in their neural network architectures; the quality, quantity, and diversity of training data; the optimization strategies employed (e.g., fine‐tuning); and the techniques used to develop the models. Each model typically has distinct strengths, weaknesses, capabilities, and limitations. For instance, regarding the neural network architectures employed to identify relationships in the input text and generate meaningful and relevant responses, ChatGPT models and Copilot (Bing) utilize the GPT (Generative Pre‐Trained Transformer) architecture, while Gemini (Google) employs LaMDA (Language Model for Dialogue Applications) and, more recently, PaLM 2 (Pathways Language Model) in conjunction with web search [16].
For these reasons, a comprehensive assessment of their performance and potential mistakes is essential before determining their feasibility [22]. In this context, a study was conducted to evaluate the consistency and accuracy of the answers provided by various AI platforms to questions concerning the emergency management of avulsed teeth. This study is one of the first to evaluate the accuracy and consistency of six different chatbots, including both freely accessible versions and a version requiring a paid membership, with a focus on the emergency management of avulsed teeth in dentistry.
While the use of binary options for responses is consistent with previous studies, this research approached the models by asking whether the information provided was “correct or incorrect.” In contrast, prior studies employed a “yes/no” format to achieve the same objective [11, 14]. This distinction helps to mitigate the inclusion of extraneous or fabricated information.
Previous studies have extensively evaluated the performance of ChatGPT, particularly the free version 3.5, in the context of medical research and clinical practice. Additionally, research has indicated that the paid version, ChatGPT version 4.0, demonstrates superior accuracy and reliability, consistent with the findings of this study [19]. In the medical field, although chatbots such as Perplexity (Perplexity.AI), Gemini (Google), and Copilot (Bing) have not undergone as extensive testing as ChatGPT, Perplexity (Perplexity.AI) has found applications in oncology [12]. Unlike ChatGPT, Perplexity (Perplexity.AI) allows users to immediately evaluate the accuracy and scientific reliability of its responses by citing sources in its output [23]. This feature may enhance the platform's appeal for users seeking in‐depth exploration of topics through the comprehensive sources that inform the generated output. Consequently, Perplexity (Perplexity.AI) warrants additional investigation in the domains of health and education. Gemini (Google), announced in December 2023 as Google's largest and most advanced artificial intelligence following Google Bard [24], has also been discussed in prior studies concerning endodontics [11, 13]. Meanwhile, Copilot (Bing) has demonstrated high performance in evaluating biochemical data compared to ChatGPT‐3.5 and Gemini [25]; however, it has shown relatively lower accuracy in responding to hematology questions compared to both ChatGPT 3.5 and Bard [26]. This study has revealed important data on these emerging technologies, specifically regarding Perplexity (Perplexity.AI), Gemini (Google), Copilot (Bing), areas that have not been extensively explored to date.
In this study, the lowest accuracy rate based on real answers was observed in ChatGPT version 3.5 (OpenAI) on the first day, with an accuracy of 62.9%. Conversely, Perplexity (Perplexity.AI) exhibited the lowest accuracy rates on the second and third days, with scores of 66.6% and 68.5%, respectively. The highest accuracy rate was consistently recorded by ChatGPT version 4.0 (OpenAI), a paid version, across all days (92.5%, 96.2%, and 98.1%, respectively). Özden et al. [11] reported that ChatGPT provided 51.0% correct answers, while Google Bard achieved a correct answer rate of 64.0% for questions pertaining to dental trauma. Additionally, Suárez et al. [14] indicated that ChatGPT had a response accuracy rate of 57.33%. In an evaluation of various chatbots, excluding ChatGPT 4.0 and ChatGPT 3.5, it was found that ChatGPT 4 achieved the highest accuracy in answering pediatric dentistry questions at 77.8%, with all chatbots, except ChatGPT 3.5, demonstrating acceptable consistency (Cronbach's alpha > 0.7) [15]. Tokgöz Kaplan and Cankar [27] conducted a study evaluating the responses of ChatGPT and Gemini to inquiries regarding dental avulsion, finding that Gemini's responses were statistically significantly more accurate (p = 0.004). Johnson et al. [28] conducted a study assessing the responses of different AI chatbots to inquiries regarding dental trauma. Their findings indicated that the Claude AI chatbot demonstrated greater validity and reliability than ChatGPT and Google Gemini, whereas Bing exhibited lower reliability. Our results align with previous studies that highlight the promising potential of AI chatbots. However, Umer and Habib suggest that an accuracy threshold of greater than 90% should be established as acceptable for diagnostic tasks [29].
Given the widespread accessibility of these chatbots, it is imperative that their performance is rigorously evaluated, and the information they provide is thoroughly examined by researchers and relevant assessment organizations. Chatbots offer accurate and reliable information on the emergency intervention of avulsed teeth; however, it is crucial for patients and parents to comprehend and interpret this information correctly to avoid adversely affecting the prognosis. Health professionals must educate the public to assess and utilize information derived from chatbots.
AI chatbots are probabilistic in nature, forming responses by estimating the probability of the next word based on statistical patterns in their training data. They operate differently from regular deterministic software. As a result, asking the same question twice may not always yield the same answers [30]. In previous studies evaluating the consistency of AI chatbots in the field of dental trauma and endodontics, questions were posed three times a day (morning, afternoon, and evening) for 10 days [11, 14]. In this study, to determine the consistency level of AI chatbots and to examine whether fluctuations occur at different times of the day, questions were asked three times daily (morning, afternoon, and evening) for 3 days using a single account. It is believed that temporal variation is mitigated through these simultaneous applications. In the current study, although there were no statistically significant differences in the responses of the chatbots throughout the day or across different days, the chatbots did not consistently produce similar responses. While only Perplexity (Perplexity.AI) provided the same number of correct and incorrect answers in the morning, at noon, and in the evening of day 2, the correct–incorrect status of the answers provided by all AI platforms varied from day to day within the same time periods. Suárez et al. [14] reported that the consistency of responses provided by ChatGPT in the field of endodontics was insufficient for clinical decision‐making. Johnson et al. [28] found that all four chatbots demonstrated an acceptable level of consistency in the domain of dental trauma, ranked from high to low as follows: Claude AI, GPT‐3.5, Bing, and Gemini. Özden et al. [11] evaluated ChatGPT and Google Bard in the field of dental trauma and found that Google Bard produced more consistent results than ChatGPT, although it did not reach a sufficient level of consistency. Additionally, Özden et al. [11] reported that the software models of chatbots affect the consistency of their responses. The results of this study also support this finding.
Mohammad‐Rahimi et al. [13] reported that in the field of endodontics, Bing consistently provided similar answers, whereas Google Bard and GPT 3.5 generated different narratives each time the same question was posed. While the provision of varied responses can offer users a more comprehensive perspective on a given topic, it is crucial that the sources utilized by chatbots to generate responses are credible, such as public health organizations or high‐impact scientific journals. Failure to ensure this may result in the dissemination of unrealistic information. The long‐term prognosis of an avulsed tooth is significantly influenced by the provision of appropriate treatment within the initial 15 min. In such instances, primarily involving the anterior teeth, the absence of timely emergency intervention can lead to functional, psychological, and aesthetic complications. This condition incurs further expenses, including treatment charges, transportation, rehabilitation, and loss of labor. Therefore, enhancing public knowledge of the intrinsic limits and potential hazards of employing chatbots for the management of avulsed teeth in emergencies is crucial, as it will influence both physical and psychological growth and development, particularly in youngsters.
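The probabilistic behavior described in this part of the discussion can be illustrated with a toy example. This is only a sketch with a made-up two-label distribution; it does not reproduce how any of the evaluated chatbots actually decode text, and the names `sample_answer` and `greedy_answer` are hypothetical.

```python
import random


def sample_answer(probabilities, rng):
    """Draw one answer label according to its probability."""
    labels = list(probabilities)
    weights = [probabilities[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def greedy_answer(probabilities):
    """Always return the most probable label (deterministic choice)."""
    return max(probabilities, key=probabilities.get)


# Hypothetical output distribution for one true/false question.
probs = {"correct": 0.7, "incorrect": 0.3}
rng = random.Random(0)  # fixed seed so the sketch is repeatable

sampled = [sample_answer(probs, rng) for _ in range(200)]
greedy = [greedy_answer(probs) for _ in range(200)]

print(sampled.count("correct"), sampled.count("incorrect"))
print(set(greedy))
```

When answers are sampled, repeated queries return both labels; when the most probable label is always chosen, the answer never changes. This is one way to picture why identical prompts can yield different responses.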
This study has several limitations. The AI platforms were instructed to respond to a limited number of questions and provide only binary “correct‐incorrect” answers. This approach may not have adequately captured the comprehensive knowledge required for the emergency management of avulsed teeth, as real‐world scenarios would likely present more diverse and complex inquiries. However, to reduce the risk of “hallucinated” or fabricated responses, a prompt was used that explicitly stated the desired type of response [31]. In further studies, utilizing multiple‐choice, fill‐in‐the‐blank, short‐answer, and short essay questions may better reflect the breadth of information provided by chatbots. Furthermore, the AI platforms were not specifically trained in the emergency management of avulsed teeth, which may have affected the accuracy of their responses. Additionally, this study did not compare the knowledge of clinicians with that of the AI in addressing the prepared questions. Including clinicians with varying levels of experience would have provided valuable insights into the accuracy and efficiency of AI applications in this context.
5. Conclusion
Although the paid version of ChatGPT (4.0) demonstrates promising results in the immediate management of avulsed teeth, notable discrepancies in performance are observed both among different chatbots and within the same chatbot at varying times. Given the limitations of this study, it appears that chatbots, with the exception of the paid version of ChatGPT 4.0, are not yet suitable for use as a primary source for emergency management of avulsed teeth. Accordingly, it is essential that the datasets used for training chatbots incorporate the 2020 avulsion guidelines from the IADT.
Author Contributions
All authors contributed to the study conception and design and questionnaire preparation. Data collection was performed by Büşra Pınar Deniz. The first draft of the manuscript was written by Seyma Mustuloglu. Statistical analyses were performed by Seyma Mustuloglu. All authors commented on previous versions of the manuscript, and all authors read and approved the final manuscript.
Ethics Statement
Our study used The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines and did not require ethics approval, given that it used only publicly accessible data not involving research participants.
Consent
Informed consent was obtained from all individual participants included in the study.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
The authors would like to thank Sevilay KARAHAN for her help in analyzing the data of the study.
Funding: The authors received no specific funding for this work.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
- 1. Addo M., Parekh S., Moles D., and Roberts G., “Knowledge of Dental Trauma First Aid (DTFA): The Example of Avulsed Incisors in Casualty Departments and Schools in London,” British Dental Journal 202 (2007): E27.
- 2. Gábris K., Tarján I., and Rózsa N., “Dental Trauma in Children Presenting for Treatment at the Department of Dentistry for Children and Orthodontics, Budapest, 1985–1999,” Dental Traumatology 17 (2001): 103–108.
- 3. Fouad A. F., Abbott P. V., Tsilingaridis G., et al., “International Association of Dental Traumatology Guidelines for the Management of Traumatic Dental Injuries: 2. Avulsion of Permanent Teeth,” Dental Traumatology 36 (2020): 331–342, 10.1111/edt.12573.
- 4. Eggmann F., Weiger R., Zitzmann N. U., and Blatz M. B., “Implications of Large Language Models Such as ChatGPT for Dental Medicine,” Journal of Esthetic and Restorative Dentistry 35, no. 7 (2023): 1098–1102, 10.1111/jerd.13046.
- 5. Thirunavukarasu A. J., Ting D. S. J., Elangovan K., Gutierrez L., Tan T. F., and Ting D. S. W., “Large Language Models in Medicine,” Nature Medicine 29, no. 8 (2023): 1930–1940, 10.1038/s41591-023-02448-8.
- 6. Ali R., Tang O. Y., Connolly I. D., et al., “Performance of ChatGPT, GPT‐4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank,” Neurosurgery 93, no. 5 (2023): 1090–1098, 10.1227/neu.0000000000002551.
- 7. Raiaan M. A. K., Mukta M. S. H., Fatema K., et al., “A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges,” IEEE Access 12 (2024): 26839–26874.
- 8. Gravina A. G., Pellegrino R., Palladino G., Imperio G., Ventura A., and Federico A., “Charting New AI Education in Gastroenterology: Cross‐Sectional Evaluation of ChatGPT and Perplexity AI in Medical Residency Exam,” Digestive and Liver Disease 56, no. 8 (2024): 1304–1311, 10.1016/j.dld.2024.02.019.
- 9. Pandey S. and Sharma S., “A Comparative Study of Retrieval‐Based and Generative‐Based Chatbots Using Deep Learning and Machine Learning,” Healthcare Analytics 3 (2023): 100198.
- 10. Ghanem Y. K., Rouhi A. D., Al‐Houssan A., et al., “Google to Dr. ChatGPT: Assessing the Content and Quality of Artificial Intelligence‐Generated Medical Information on Appendicitis,” Surgical Endoscopy 38, no. 5 (2024): 2887–2893, 10.1007/s00464-024-10739-5.
- 11. Ozden I., Gokyar M., Ozden M. E., and Sazak Ovecoglu H., “Assessment of AI Applications in Responding to Dental Trauma,” Dental Traumatology 40 (2024): 722–729, 10.1111/edt.12965.
- 12. Pan A., Musheyev D., Bockelman D., Loeb S., and Kabarriti A. E., “Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer,” JAMA Oncology 9, no. 10 (2023): 1437–1440, 10.1001/jamaoncol.2023.2947.
- 13. Mohammad‐Rahimi H., Ourang S. A., Pourhoseingholi M. A., Dianat O., Dummer P. M. H., and Nosrat A., “Validity and Reliability of AI Chatbots as Public Sources of Information on Endodontics,” International Endodontic Journal 57, no. 3 (2024): 305–314, 10.1111/iej.14014.
- 14. Suárez A., Díaz‐Flores García V., Algar J., Gómez Sánchez M., Llorente de Pedro M., and Freire Y., “Unveiling the ChatGPT Phenomenon: Evaluating the Consistency and Accuracy of Endodontic Question Answers,” International Endodontic Journal 57, no. 1 (2024): 108–113, 10.1111/iej.13985.
- 15. Rokhshad R., Zhang P., Mohammad‐Rahimi H., Pitchika V., Entezari N., and Schwendicke F., “Accuracy and Consistency of Chatbots Versus Clinicians for Answering Pediatric Dentistry Questions: A Pilot Study,” Journal of Dentistry 144 (2024): 104938, 10.1016/j.jdent.2024.104938.
- 16. Balel Y., “Can ChatGPT‐3.5 Be Used in Oral and Maxillofacial Surgery?,” Journal of Stomatology, Oral and Maxillofacial Surgery 124, no. 5 (2023): 101471, 10.1016/j.jormas.2023.101471.
- 17. Chen S., Kann B. H., Foote M. B., et al., “Use of Artificial Intelligence Chatbots for Cancer Treatment Information,” JAMA Oncology 9, no. 10 (2023): 1459–1462, 10.1001/jamaoncol.2023.2954.
- 18. Giannakopoulos K., Kavadella A., Aaqel Salim A., Stamatopoulos V., and Kaklamanos E. G., “Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence‐Based Dentistry: Comparative Mixed Methods Study,” Journal of Medical Internet Research 25 (2023): e51580, 10.2196/51580.
- 19. Landis J. R. and Koch G. G., “The Measurement of Observer Agreement for Categorical Data,” Biometrics 33 (1977): 159–174, 10.2307/2529310.
- 20. Hutchison C. M., Cave V., Walshaw E. G., Burns B., and Park C., “YouTube as a Source for Patient Education About the Management of Dental Avulsion Injuries,” Dental Traumatology 36, no. 2 (2020): 207–211, 10.1111/edt.12517.
- 21. Saygili S., Gezer I., Oner H. S., Tuna‐Ince E. B., and Kasimoglu Y., “Evaluation of the Reliability and Accuracy of YouTube and TikTok Contents About Storage Media for Avulsed Teeth: A Cross‐Sectional Study,” Dental Traumatology 40, no. 5 (2024): 522–529, 10.1111/edt.12952.
- 22. Arif T. B., Munaf U., and Ul‐Haque I., “The Future of Medical Education and Research: Is ChatGPT a Blessing or Blight in Disguise?,” Medical Education Online 28, no. 1 (2023): 2181052, 10.1080/10872981.2023.2181052.
- 23. Meyer A., Riese J., and Streichert T., “Comparison of the Performance of GPT‐3.5 and GPT‐4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study,” JMIR Medical Education 10 (2024): e50965, 10.2196/50965.
- 24. Anon., “Introducing Gemini: Google's Most Capable AI Model Yet,” accessed July 15, 2024, https://blog.google/technology/ai/google-gemini-ai/#sundar-note.
- 25. Kaftan A. N., Hussain M. K., and Naser F. H., “Response Accuracy of ChatGPT 3.5 Copilot and Gemini in Interpreting Biochemical Laboratory Data a Pilot Study,” Scientific Reports 14, no. 1 (2024): 8233, 10.1038/s41598-024-58964-1.
- 26. Kumari A., Kumari A., Singh A., et al., “Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT‐3.5, Google Bard, and Microsoft Bing,” Cureus 15, no. 8 (2023): e43861, 10.7759/cureus.43861.
- 27. Tokgöz Kaplan T. and Cankar M., “Evidence‐Based Potential of Generative Artificial Intelligence Large Language Models on Dental Avulsion: ChatGPT Versus Gemini,” Dental Traumatology (2024), 10.1111/edt.12999.
- 28. Johnson A. J., Singh T. K., Gupta A., et al., “Evaluation of Validity and Reliability of AI Chatbots as Public Sources of Information on Dental Trauma,” Dental Traumatology (2024), 10.1111/edt.13000.
- 29. Umer F. and Habib S., “Critical Analysis of Artificial Intelligence in Endodontics: A Scoping Review,” Journal of Endodontics 48, no. 2 (2022): 152–160, 10.1016/j.joen.2021.11.007.
- 30. Ghahramani Z., “Probabilistic Machine Learning and Artificial Intelligence,” Nature 521, no. 7553 (2015): 452–459, 10.1038/nature14541.
- 31. Athaluri S. A., Manthena S. V., Kesapragada V. S. R. K. M., Yarlagadda V., Dave T., and Duddumpudi R. T. S., “Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References,” Cureus 15, no. 4 (2023): e37432, 10.7759/cureus.37432.
