Abstract
Background
With the increasing reliance on artificial intelligence (AI) in healthcare information delivery, it is essential to evaluate the accuracy and reliability of AI-generated responses. This study aimed to assess the quality of responses provided by three AI-based language models—ChatGPT-4, Gemini, and Copilot—on temporomandibular disorders (TMD), a complex and prevalent group of musculoskeletal conditions.
Methods
A total of 83 questions, categorized into seven key domains of TMD (Anatomy, Signs and Symptoms, Etiology, Evaluation and Diagnosis, Treatment Options, Complications, and Prognosis), were presented independently to each AI model. Each response was evaluated and classified into one of five accuracy levels: False, Nonfactual, Minimal Facts, Selected Facts, and Objectively True. Statistical analysis, including Pearson Chi-Square and Fisher’s Exact tests, was conducted to determine the relationship between AI model and response accuracy.
Results
ChatGPT-4 produced the highest proportion of Objectively True answers (78.3%), significantly outperforming Gemini (53%) and Copilot (20.5%) (p < 0.05). Gemini’s responses predominantly consisted of Selected Facts, while Copilot’s outputs were largely incomplete or minimally informative. Statistically significant differences in response accuracy were observed across all thematic domains (p < 0.05).
Conclusion
ChatGPT-4 demonstrated superior reliability in delivering accurate and comprehensive information about TMD, though inconsistencies remain in specific areas such as joint anatomy and prognosis. AI models should undergo rigorous validation before being employed in clinical or patient education settings.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12903-025-07468-z.
Keywords: ChatGPT-4, Gemini, Copilot, Temporomandibular disorders, Artificial intelligence, Patient education
Background
The temporomandibular joint (TMJ) is a highly specialized articulation that enables essential functions such as mastication, speech, and facial expression. Temporomandibular disorders (TMD) represent a group of musculoskeletal conditions affecting the TMJ, masticatory muscles, and associated structures, frequently resulting in pain, functional limitations, and decreased quality of life [1–6]. The prevalence of TMD is substantial, affecting a significant proportion of both adult and pediatric populations, and its multifactorial etiology (encompassing biological, psychological, and environmental factors) creates a complex clinical picture that challenges both diagnosis and management [7–9]. Consequently, patients often seek information to understand their condition, explore treatment options, and make informed decisions regarding their care.
In recent years, artificial intelligence (AI)–based language models have emerged as promising tools for delivering accessible, rapid, and interactive health-related information [10–12]. Large language models (LLMs), including OpenAI’s GPT-4, Google’s Gemini, and Microsoft’s Copilot, are trained on extensive datasets to generate human-like text, answer clinical and patient questions, and provide educational support [13–19]. Prior studies have demonstrated the utility of AI-generated responses in diverse medical fields; however, evidence regarding their reliability, comprehensiveness, and clinical relevance in the context of TMD remains limited. Evaluating AI performance in this area is critical, as inaccuracies or omissions in information could influence patient understanding, decision-making, and adherence to recommended care.
Emerging evidence suggests that AI applications can support both clinical and educational aspects of TMD management. For example, Reda et al. (2023) demonstrated the potential of AI to facilitate early diagnosis, improve patient understanding, and enhance patient-clinician communication in TMD, highlighting its role in supporting clinical decision-making and patient education [20]. Despite these promising findings, variations in model architecture, training data, and update frequency can lead to differences in accuracy and applicability, emphasizing the need for systematic evaluation before widespread clinical integration.
Access to accurate, reliable, and contextually relevant information is particularly important in complex health conditions like TMD, where chronic pain, functional limitations, and psychological comorbidities are common. Patients may turn to online resources, including AI-based tools, to fill knowledge gaps and make informed decisions regarding treatment. The capacity of LLMs to provide timely, understandable, and evidence-based information could enhance patient engagement and potentially improve outcomes. However, the limitations of these models—including potential biases, outdated information, and variable accuracy—underscore the importance of assessing their reliability within specific clinical contexts.
Therefore, this study aims to evaluate the reliability of widely used AI language models in providing information about TMD. By comparing the responses generated by GPT-4, Gemini, and Copilot, the study seeks to identify their strengths and limitations and provide insights into their potential role in supporting patient education and healthcare communication. Ensuring that AI-generated information is accurate, trustworthy, and aligned with current clinical standards is essential to optimize its application in TMD management and to minimize the risk of misinformation.
Methods
In this study, a content analysis of the responses to questions about TMD given by ChatGPT-4 (OpenAI LLC, San Francisco, CA, USA), Gemini (Alphabet Inc., Mountain View, CA, USA), and Copilot (Microsoft Corporation, Redmond, WA, USA) was conducted. For the analysis, a question set consisting of a total of 83 questions under 7 main headings [Anatomy of temporomandibular joint (13 questions), Signs and symptoms (10 questions), Etiology (11 questions), Evaluation and diagnosis (12 questions), Treatment options (20 questions), Complications of treatment (5 questions), Prognosis and outcomes (12 questions)] was prepared by two researchers (Ö.Ö.Z., Ş.H.) (Table 1).
Table 1.
Questions under domains related to TMD
| Domains | Questions |
|---|---|
| Anatomy of temporomandibular joint | 1. What is the temporal bone? 2. Where is the temporal bone located in the body? 3. What tissues are adjacent to the temporal bone? 4. What is the mandible? 5. Where is the mandible located in the body? 6. What tissues are adjacent to the mandible? 7. What is the temporomandibular articular disc? 8. Where is the temporomandibular articular disc located in the body? 9. What tissues are adjacent to the temporomandibular articular disc? 10. What are the masticatory muscles? 11. Where in the body are the masticatory muscles located? 12. What are the tissues adjacent to the masticatory muscles? 13. What is the translation movement that occurs during mouth opening and closing? |
| Signs and symptoms | 1. What are the signs or symptoms of temporomandibular disorders? 2. What kind of pain occurs in temporomandibular disorders? 3. How is mouth-opening limitation affected in temporomandibular disorders? 4. How does biting or chewing activity change in temporomandibular disorders? 5. In what types of temporomandibular diseases does joint noise occur? 6. What is the relationship between temporomandibular disorders and chronic headache? 7. Is there a connection between temporomandibular disorders and ear pain? 8. Do temporomandibular disorders cause malocclusion? 9. Does malocclusion cause temporomandibular disorders? 10. What is the effect of temporomandibular diseases on the masticatory muscles? |
| Etiology | 1. What are the factors that cause temporomandibular disorders? 2. What kind of injuries are effective in the occurrence of temporomandibular disorders? 3. Which parafunctional behaviors can cause temporomandibular disorders? 4. What skeletal malformations may be associated with temporomandibular disorders? 5. What is the effect of bruxism on the occurrence of temporomandibular diseases? 6. Does poor posture cause temporomandibular disorders? 7. What is the role of stress in temporomandibular disorders? 8. How is depression related to temporomandibular disorders? 9. What is the role of anxiety in temporomandibular disorders? 10. Does malocclusion caused by existing or various dental treatments cause temporomandibular disorders? 11. How does overuse of the temporomandibular joint, which is one of the causes of temporomandibular disorders, occur? |
| Evaluation and diagnosis | 1. How is joint palpation performed in the diagnosis of temporomandibular disorders? 2. How is muscle palpation performed in the diagnosis of temporomandibular disorders? 3. How is the clinical evaluation of mouth opening done in the diagnosis of temporomandibular disorders? 4. Is taking a history of temporomandibular disorders effective in diagnosis? 5. What is the contribution of mouth opening measurement to the evaluation of temporomandibular disorders? 6. What should be considered in the evaluation of temporomandibular disorders during radiographic examination? 7. Which findings on radiographic examination are important in diagnosing temporomandibular disorders? 8. What is the importance of joint noise evaluation in the diagnosis of temporomandibular disorders? 9. How is intraoral evaluation performed in the diagnosis of temporomandibular disorders? 10. What is the role of magnetic resonance imaging evaluation in the diagnosis of temporomandibular disorders? 11. Does psychological health have an impact on the evaluation of temporomandibular disorders? 12. What is the role of arthroscopy in the diagnosis of temporomandibular disorders? |
| Treatment options | 1. How are temporomandibular disorders treated? 2. For which temporomandibular disorders is a stabilization splint an effective treatment? 3. How long is the stabilization splint used in the temporomandibular disorders? 4. What is the role of oral anti-inflammatory medications in the treatment of temporomandibular disorders? 5. Which temporomandibular disorders is physical therapy effective in treating? 6. What physical therapy methods are used in the treatment of temporomandibular disorders? 7. For which temporomandibular disorders are home-based exercises effective in the treatment? 8. What are the home-based exercises that have a therapeutic effect in the treatment of temporomandibular disorders? 9. What behavioral modifications are effective in treating temporomandibular disorders? 10. Which temporomandibular disorders are treated with joint injection? 11. What is used in joint injections to treat temporomandibular disorders? 12. In which temporomandibular disorders is botulinum toxin injection applied? 13. To which area is botulinum toxin injection applied in temporomandibular disorders? 14. Which temporomandibular disorders are treated with joint manipulation? 15. How is the joint manipulation method used in the treatment of temporomandibular disorders applied? 16. What is the effect of biofeedback therapy in the treatment of temporomandibular disorders? 17. Which temporomandibular disorders is arthrocentesis used to treat? 18. How is the arthrocentesis method used in the treatment of temporomandibular disorders applied? 19. In which temporomandibular disorders is temporomandibular joint surgery applied? 20. What are the temporomandibular joint surgery applications applied in temporomandibular disorders? |
| Complications of treatment | 1. Can tooth pain occur during or after treatment for temporomandibular disorders? 2. What are the risks of malocclusion occurring during or after treatment for temporomandibular disorders? 3. What is the probability of encountering the complication of acute mouth-opening limitation in the treatment of temporomandibular disorders? 4. Is skin irritation observed during the treatment of temporomandibular disorders? 5. What kind of pain can be encountered in the treatment of temporomandibular disorders? |
| Prognosis and outcomes | 1. What are the prognosis and treatment results of the stabilization splint used in the treatment of temporomandibular disorders? 2. What are the prognosis and treatment results of oral anti-inflammatory medications used in the treatment of temporomandibular disorders? 3. What are the prognosis and treatment results of physical therapy used in the treatment of temporomandibular disorders? 4. What are the prognosis and treatment results of home-based exercises used in the treatment of temporomandibular disorders? 5. What are the prognosis and treatment results of behavioral modification used in the treatment of temporomandibular disorders? 6. What are the prognosis and treatment results of joint injection used in the treatment of temporomandibular disorders? 7. What are the prognosis and treatment results of botulinum toxin injection used in the treatment of temporomandibular disorders? 8. What are the prognosis and treatment results of joint manipulation used in the treatment of temporomandibular disorders? 9. What are the prognosis and treatment outcomes of biofeedback therapy used in the treatment of temporomandibular disorders? 10. What are the prognosis and treatment outcomes of arthrocentesis used in the treatment of temporomandibular disorders? 11. What are the prognosis and treatment outcomes of temporomandibular joint surgery used in the treatment of temporomandibular disorders? 12. What are the prognosis and treatment outcomes of orthodontic therapy used in the treatment of temporomandibular disorders? |
The questions were reviewed and finalized by a third researcher (E.Ç.Ö.). Each question asked to the chatbots was initiated as an independent conversation, preventing the systems from being affected by previous responses or creating feedback loops. All questions and answers were written in English.
During the interactions with the language models, no specific prompt engineering or parameter adjustment was performed. All questions were entered manually into the web interfaces of ChatGPT (OpenAI), Gemini (Google), and Copilot (Microsoft) without modifying the model parameters, so each platform's default settings were used. Based on the providers' standard configurations, the approximate parameters were as follows: ChatGPT: temperature ≈ 0.7, top-p = 1.0, max tokens ≈ 1024; Gemini: temperature ≈ 0.7, top-p = 1.0; Copilot (Balanced mode): temperature ≈ 0.7, top-p = 1.0. The questions were posed to the AI chatbots between May 3 and 6, 2025, through each provider's web client interface using a logged-in account, one question at a time. ChatGPT was queried over two days (May 3 and 4, 2025), whereas Copilot and Gemini were each queried on a single day (May 5 and 6, 2025, respectively).
The responses received from each AI model were collected by a designated researcher (Ö.Ö.Z.) and processed into an Excel sheet specifically created for data recording. Before the evaluation process began, a meeting was held with the participation of the researchers to establish a common evaluation criterion. The accuracy of the answers was evaluated using the accuracy scale given in Table 2 [21].
Table 2.
Definition of the different accuracy categories
| Category | Definition |
|---|---|
| Objectively True | A claim that is based on scientific evidence and presents all relevant information, whether positive or negative. |
| Selected Facts | A claim that presents some true selected facts based on scientific evidence but omits important information related to a product. |
| Minimal Facts | A claim that exaggerates the benefit of the product, with an overemphasis on the benefit supported by poor-quality scientific evidence. |
| Nonfacts | A claim that presents an intangible characteristic. Often these claims are in the form of product opinions or lifestyle claims, leaving clinicians/patients to misinterpret the opinion as an objective product evaluation. |
| False | A claim that is objectively false either due to lack of evidence to support it or contradicting available evidence. |
To enhance objectivity, multiple independent evaluators assessed each response, and any discrepancies were resolved through discussion until consensus was reached. Evaluators were blinded to the source of each response during the assessment process to minimize bias. Inter-rater reliability was assessed using Fleiss’ Kappa, resulting in a kappa value of 0.82, indicating excellent agreement among evaluators.
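For transparency, the Fleiss' kappa statistic used to quantify inter-rater agreement can be sketched in a few lines of plain Python. This is a minimal illustration with a hypothetical toy ratings matrix, not the study data; the function name and example values are our own.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i][j] = number of raters who assigned
    item i to category j (each item rated by the same number of raters)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Mean observed agreement across items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Expected chance agreement from overall category proportions
    p_e = sum((sum(col) / (n_items * n_raters)) ** 2 for col in zip(*ratings))
    return (p_bar - p_e) / (1 - p_e)

# Toy data: 4 responses rated by 3 evaluators into 3 accuracy categories
toy = [[3, 0, 0],   # unanimous
       [0, 3, 0],   # unanimous
       [2, 1, 0],   # 2-vs-1 split
       [0, 0, 3]]   # unanimous
print(round(fleiss_kappa(toy), 3))  # prints 0.745
```

A value of 0.82, as obtained in this study, falls in the range conventionally interpreted as excellent (almost perfect) agreement.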
The panel explicitly adopted the INfORM/IADR standard-of-care key points as the evidence-based reference against which each LLM statement was judged [22].
The results of this analysis are not treated as absolute and immutable truths, because the evaluation of the answers rests on subjective interpretation and differing perspectives. To measure answer quality more objectively, a comprehensive approach incorporating the judgments of multiple evaluators was adopted; accordingly, the analysis was carried out in accordance with the principles of the multi-evaluator "Audience Scoring Strategy" [16].
Since only the responses obtained from AI-based language models were evaluated in the study, institutional ethics committee approval was not required.
Reference standard (comparator)
To benchmark answer accuracy against current best practices, we used the INfORM/IADR position paper “Temporomandibular disorders: key points for good clinical practice based on standard of care” [22] as the reference standard across all domains (anatomy, signs/symptoms, etiology, evaluation/diagnosis, treatment options, complications, and prognosis). For each item, raters judged whether model statements were consistent with these key points and the Diagnostic Criteria for Temporomandibular Disorders (DC/TMD) diagnostic principles, when applicable. Answers fully aligned with the INfORM/IADR recommendations were coded “Objectively True,” partial alignment with omissions mapped to “Selected Facts,” and departures from the recommendations (or contraindicated practices) mapped to lower categories. Disagreements were resolved by consensus.
Statistical method
In this study, descriptive statistics (counts and percentages) are reported. For testing relationships between categorical variables, the Pearson chi-square test was applied when the sample-size assumption (expected cell counts > 5) was met, and Fisher's exact test was applied when it was not. Analyses were performed in IBM SPSS Statistics 27.
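As an illustration, the Pearson chi-square computation can be reproduced in a few lines of plain Python (a sketch only; the study itself used SPSS, and for small expected counts Fisher's exact test was used instead). The observed counts below are taken from Table 3 in the Results; the function name and critical-value constant are our own.

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Observed counts from Table 3 (rows: Minimal facts, Selected facts,
# Objectively true; columns: ChatGPT-4, Gemini, Copilot)
table3 = [[0, 4, 15],
          [18, 35, 51],
          [65, 44, 17]]

chi2 = chi_square_stat(table3)
df = (len(table3) - 1) * (len(table3[0]) - 1)  # 4 degrees of freedom
CRITICAL_001 = 18.467  # chi-square critical value for df = 4, alpha = 0.001
print(f"chi2 = {chi2:.2f}, df = {df}, p < 0.001: {chi2 > CRITICAL_001}")
```

The statistic (about 62.3 on 4 degrees of freedom) far exceeds the 0.001 critical value, consistent with the p < 0.001 reported in Table 3.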
Results
Table 3 shows the distribution of answers by AI model; the Pearson chi-square test was used to examine the relationship between them. The analysis revealed a statistically significant relationship between AI model and answer category (p < 0.05). ChatGPT-4 mostly gave 'Objectively true' answers, Gemini gave considerably more correct answers than Copilot, and 'Minimal Facts' answers came predominantly from Copilot (Table 3; Fig. 1).
Table 3.
Distribution of responses according to AI types and relationships between them
| ChatGPT-4 | Gemini | Copilot | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Answers | n | % | %AI | n | % | %AI | n | % | %AI | p |
| Minimal facts | 0a | 0.0 | 0.0 | 4a | 21.1 | 4.8 | 15b | 78.9 | 18.1 | < 0.001* |
| Selected facts | 18a | 17.3 | 21.7 | 35b | 33.7 | 42.2 | 51c | 49.0 | 61.4 | |
| Objectively true | 65a | 51.6 | 78.3 | 44b | 34.9 | 53.0 | 17c | 13.5 | 20.5 | |
*p < 0.05; %: row percentage; %AI: column percentage within each AI model
Fig. 1.
Bar chart of the distribution of responses by AI types
Table 4 shows the distribution of responses by AI model within each domain; Fisher's exact tests were used to examine the relationships between them. Statistically significant relationships between AI model and response category were found in all domains (p < 0.05). Copilot most often responded with 'Selected facts' in 'Anatomy of temporomandibular joint'; ChatGPT-4 most often responded 'Objectively true' in 'Signs and symptoms', 'Etiology', 'Evaluation and diagnosis', 'Treatment options', and 'Complications of treatment', while Gemini and Copilot mostly responded with 'Selected facts' (Table 4).
Table 4.
Distribution of responses to domains according to AI types and relationships between them
| ChatGPT-4 | Gemini | Copilot | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Domains | Answers | n | % | %AI | n | % | %AI | n | % | %AI | p |
| Anatomy of temporomandibular joint | Selected facts | 7a, b | 33.3 | 53.8 | 3b | 14.3 | 23.1 | 11a | 52.4 | 84.6 | 0.007* |
| | Objectively true | 6a, b | 33.3 | 46.2 | 10b | 55.6 | 76.9 | 2a | 11.1 | 15.4 | |
| Signs and symptoms | Minimal facts | 0a | 0.0 | 0.0 | 3a | 60.0 | 30.0 | 2a | 40.0 | 20.0 | 0.046* |
| | Selected facts | 2a | 18.2 | 20.0 | 3a | 27.3 | 30.0 | 6a | 54.5 | 60.0 | |
| | Objectively true | 8a | 57.1 | 80.0 | 4a, b | 28.6 | 40.0 | 2b | 14.3 | 20.0 | |
| Etiology | Minimal facts | 0a | 0.0 | 0.0 | 0a | 0.0 | 0.0 | 1a | 100.0 | 9.1 | 0.003* |
| | Selected facts | 0a | 0.0 | 0.0 | 4a, b | 36.4 | 36.4 | 7b | 63.6 | 63.6 | |
| | Objectively true | 11a | 52.4 | 100.0 | 7a, b | 33.3 | 63.6 | 3b | 14.3 | 27.3 | |
| Evaluation and diagnosis | Minimal facts | 0a | 0.0 | 0.0 | 0a | 0.0 | 0.0 | 2a | 100.0 | 16.7 | 0.002* |
| | Selected facts | 1a | 7.1 | 8.3 | 5a, b | 35.7 | 41.7 | 8b | 57.1 | 66.7 | |
| | Objectively true | 11a | 55.0 | 91.7 | 7a, b | 35.0 | 58.3 | 2b | 10.0 | 16.7 | |
| Treatment options | Selected facts | 3a | 12.0 | 15.0 | 9a, b | 36.0 | 45.0 | 13b | 52.0 | 65.0 | 0.005* |
| | Objectively true | 17a | 48.6 | 85.0 | 11a, b | 31.4 | 55.0 | 7b | 20.0 | 35.0 | |
| Complications of treatment | Minimal facts | 0a | 0.0 | 0.0 | 0a | 0.0 | 0.0 | 4b | 100.0 | 80.0 | < 0.001* |
| | Selected facts | 1a | 14.3 | 20.0 | 5b | 71.4 | 100.0 | 1a | 14.3 | 20.0 | |
| | Objectively true | 4a | 100.0 | 80.0 | 0b | 0.0 | 0.0 | 0b | 0.0 | 0.0 | |
| Prognosis and outcomes | Minimal facts | 0a | 0.0 | 0.0 | 1a, b | 14.3 | 8.3 | 6b | 85.7 | 50.0 | 0.008* |
| | Selected facts | 4a | 26.7 | 33.3 | 6a | 40.0 | 50.0 | 5a | 33.3 | 41.7 | |
| | Objectively true | 8a | 57.1 | 66.7 | 5a, b | 35.7 | 41.7 | 1b | 7.1 | 8.3 | |
*p < 0.05; %: row percentage; %AI: column percentage within each AI model; different superscript letters indicate differences between column proportions
Table 5 shows the distribution of ChatGPT-4's answers by domain; Fisher's exact test was used to examine the relationship between them. The analysis revealed a statistically significant relationship between domain and answer category (p < 0.05): answers rated 'Selected Facts' concerned mostly 'Anatomy of Temporomandibular Joint' (Table 5; Fig. 2).
Table 5.
Distribution of responses by domain for ChatGPT-4 and their relationships
| Selected Facts | Objectively True | ||||||
|---|---|---|---|---|---|---|---|
| Domains | n | % | %C. | n | % | %C. | p |
| Anatomy of temporomandibular joint | 7a | 53.8 | 38.9 | 6b | 46.2 | 9.2 | 0.031* |
| Signs and symptoms | 2a | 20.0 | 11.1 | 8a | 80.0 | 12.3 | |
| Etiology | 0a | 0.0 | 0.0 | 11a | 100.0 | 16.9 | |
| Evaluation and diagnosis | 1a | 8.3 | 5.6 | 11a | 91.7 | 16.9 | |
| Treatment options | 3a | 15.0 | 16.7 | 17a | 85.0 | 26.2 | |
| Complications of treatment | 1a | 20.0 | 5.6 | 4a | 80.0 | 6.2 | |
| Prognosis and outcomes | 4a | 33.3 | 22.2 | 8a | 66.7 | 12.3 | |
*p < 0.05; %: row percentage; %C.: column percentage for answers; different superscript letters indicate differences between column proportions
Fig. 2.
Bar chart of the distribution of responses by domain for ChatGPT-4
Table 6 shows the distribution of Gemini's responses by domain; Fisher's exact test was used to examine the relationship between them. The analysis revealed a statistically significant relationship between domain and response category (p < 0.05): responses rated 'Minimal Facts' concerned mostly 'Signs and Symptoms' (Table 6; Fig. 3).
Table 6.
Distribution of responses for Gemini by domain and relationships between them
| Minimal Facts | Selected Facts | Objectively True | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Domains | n | % | %C. | n | % | %C. | n | % | %C. | p |
| Anatomy of temporomandibular joint | 0a | 0.0 | 0.0 | 3a | 23.1 | 8.6 | 10a | 76.9 | 22.7 | 0.034* |
| Signs and symptoms | 3a | 30.0 | 75.0 | 3b | 30.0 | 8.6 | 4b | 40.0 | 9.1 | |
| Etiology | 0a | 0.0 | 0.0 | 4a | 36.4 | 11.4 | 7a | 63.6 | 15.9 | |
| Evaluation and diagnosis | 0a | 0.0 | 0.0 | 5a | 41.7 | 14.3 | 7a | 58.3 | 15.9 | |
| Treatment options | 0a | 0.0 | 0.0 | 9a | 45.0 | 25.7 | 11a | 55.0 | 25.0 | |
| Complications of treatment | 0a, b | 0.0 | 0.0 | 5b | 100.0 | 14.3 | 0a | 0.0 | 0.0 | |
| Prognosis and outcomes | 1a | 8.3 | 25.0 | 6a | 50.0 | 17.1 | 5a | 41.7 | 11.4 | |
*p < 0.05; %: row percentage; %C.: column percentage for answers; different superscript letters indicate differences between column proportions
Fig. 3.
Bar chart of the distribution of responses by domain for Gemini
Table 7 shows the distribution of Copilot's responses by domain; Fisher's exact test was used to examine the relationship between them. The analysis revealed a statistically significant relationship between domain and response category (p < 0.05): responses rated 'Selected Facts' concerned mostly 'Treatment Options', while responses rated 'Minimal Facts' concerned mostly 'Complications of Treatment' and 'Prognosis and outcomes' (Table 7; Fig. 4).
Table 7.
Distribution of responses for Copilot by domain and relationships between them
| Minimal Facts | Selected Facts | Objectively True | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Domains | n | % | %C. | n | % | %C. | n | % | %C. | p |
| Anatomy of temporomandibular joint | 0a | 0.0 | 0.0 | 11a | 84.6 | 21.6 | 2a | 15.4 | 11.8 | 0.002* |
| Signs and symptoms | 2a | 20.0 | 13.3 | 6a | 60.0 | 11.8 | 2a | 20.0 | 11.8 | |
| Etiology | 1a | 9.1 | 6.7 | 7a | 63.6 | 13.7 | 3a | 27.3 | 17.6 | |
| Evaluation and diagnosis | 2a | 16.7 | 13.3 | 8a | 66.7 | 15.7 | 2a | 16.7 | 11.8 | |
| Treatment options | 0a | 0.0 | 0.0 | 13a, b | 65.0 | 25.5 | 7b | 35.0 | 41.2 | |
| Complications of treatment | 4a | 80.0 | 26.7 | 1b | 20.0 | 2.0 | 0a, b | 0.0 | 0.0 | |
| Prognosis and outcomes | 6a | 50.0 | 40.0 | 5b | 41.7 | 9.8 | 1a, b | 8.3 | 5.9 | |
*p < 0.05; %: row percentage; %C.: column percentage for answers; different superscript letters indicate differences between column proportions
Fig. 4.
Bar chart of the distribution of responses by domain for Copilot
Discussion
Recent advancements in artificial intelligence (AI) have introduced new opportunities in patient education, yet they have also raised serious concerns about the accuracy and reliability of the information provided. Individuals often exhibit a high level of trust in AI-generated responses, particularly when such responses appear conversational or human-like. This uncritical acceptance can result in the dissemination of unverified or even inaccurate medical information, especially among individuals lacking sufficient health literacy or domain-specific knowledge [23]. Given the widespread use of AI tools for acquiring health-related information [10], this issue becomes particularly critical in healthcare, where misinformation may carry significant consequences for diagnosis, treatment decisions, and patient well-being.
This study sought to assess the accuracy of three widely used LLMs (ChatGPT-4, Gemini, and Copilot) by analyzing their responses to structured questions concerning temporomandibular joint (TMJ) disorders. The results demonstrate substantial differences in information quality across the models. ChatGPT-4 clearly outperformed both Gemini and Copilot in the proportion of responses rated "Objectively True," a statistically significant difference (p < 0.001). In particular, ChatGPT-4 provided accurate and reliable answers in complex areas such as etiology (100%), diagnosis (91.7%), treatment options (85%), and treatment-related complications (80%).
Conversely, Gemini and Copilot more frequently generated responses categorized as “Selected Facts” or “Minimal Facts,” highlighting their limitations in delivering complete or clinically actionable information. Gemini’s answers were often superficial and lacked depth, summarizing content without addressing clinical nuances. While it occasionally provided relevant insights, its overall performance was marked by gaps in coverage and a lack of specificity. In comparison, Copilot consistently delivered responses with the lowest accuracy, most of which failed to meet the threshold for meaningful or safe medical communication.
These discrepancies may be attributed to a range of factors, including variation in training datasets, model architecture, frequency of content updates, and differing capabilities in sourcing evidence-based materials [19]. Similar observations have been made in prior literature across various medical specialties. For instance, in ophthalmology-related queries, both ChatGPT-4 and Gemini provided relevant content regarding retinal detachment, but ChatGPT-4 delivered more consistent and clinically appropriate answers, particularly for complex inquiries [24]. In a separate study focusing on keratoconus, ChatGPT-4 again outperformed Gemini and Copilot in terms of detail and accuracy, although its responses were found to be more challenging for lay users to interpret [25].
Comparable patterns were observed in other medical contexts. For example, both ChatGPT and Gemini performed reliably in providing information about hypertension [26] and contraception [27], though concerns regarding content readability were raised. A study evaluating five chatbots (ChatGPT, Bard, Gemini, Copilot, and Perplexity) on palliative care questions concluded that the outputs were often too complex and lacked clarity for general patient use [28]. Further, inadequate readability has been noted in chatbot responses to subdural hematoma questions [29], and the potential for misinformation has also been documented in oropharyngeal cancer-related outputs [30].
While some researchers have expressed skepticism regarding the reliability and clarity of AI-generated health content [31], others argue that these tools can deliver content of satisfactory quality and acceptable readability when applied judiciously [28, 30, 32, 33]. Overall, ChatGPT has been regarded as offering relatively accurate and comprehensive medical information across multiple domains [34]. Nevertheless, even ChatGPT-4 cannot be deemed universally reliable, as it occasionally provided only partially correct answers, especially in topics like TMJ anatomy and long-term prognosis.
All standard answers in this study were established by two board-certified orthodontists and one oral and maxillofacial surgeon, all with extensive clinical and research experience in TMD. These experts independently reviewed all questions and reached consensus through discussion, ensuring that the responses reflect widely accepted evidence-based knowledge. We acknowledge that many questions, such as “How are temporomandibular disorders treated?”, are inherently controversial and may have conflicting evidence in the literature. To address this, we incorporated a structured evaluation framework considering both expert consensus and the clinical weight of potential errors, allowing for interpretation of AI responses even when absolute truth is unattainable.
The inaccuracies identified in this study should also be considered in terms of their relative clinical weight. Minor factual inconsistencies, such as those concerning background or etiological aspects, are unlikely to directly endanger patient outcomes. In contrast, misleading responses regarding diagnostic procedures, therapeutic strategies, or long-term prognosis carry greater clinical risk by potentially influencing treatment decisions. When interpreted within the framework of the recent INfORM/IADR standard of care recommendations for TMD management, our findings highlight both the potential utility and the critical limitations of LLMs. These systems may serve as valuable supplementary tools for patient education and professional support, but their outputs must remain anchored to evidence-based guidelines to ensure patient safety and high-quality care.
To better contextualize the potential consequences of AI-generated responses, we introduced a framework categorizing AI errors according to their clinical impact (Table 8). Minor errors include preferential statements among conservative therapies, which have minimal consequences for patient care. Moderate errors may involve incomplete reporting of conservative options, potentially limiting informed patient decisions. Major and critical errors, such as recommending occlusal interventions, disc repositioning, or providing misleading safety information, carry significant risk for overtreatment or patient harm. This categorization allows readers to gauge the clinical weight of potential AI errors and underscores the importance of careful evaluation when integrating AI tools into patient education and clinical decision-making.
Table 8.
AI response errors and clinical significance in TMD
| AI Response Error Type | Example | Clinical Impact | Severity Classification |
|---|---|---|---|
| Minor Error | States that physical therapy is “better” than cognitive-behavioral therapy | Both are accepted conservative treatments; minimal impact on patient care | Low |
| Moderate Error | Suggests one conservative therapy while omitting another (e.g., CBT not mentioned) | Could limit patient awareness of options; moderate effect on informed decision-making | Medium |
| Major Error | Recommends occlusal adjustments or disc recapture procedures without indication | Risk of overtreatment, unnecessary cost, potential harm | High |
| Critical Error | Provides inaccurate or misleading information regarding contraindications or comorbidities | May lead to unsafe clinical decisions or exacerbate patient condition | Very High |
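The categorization in Table 8 can be expressed as a simple rating structure. The sketch below is purely illustrative (the error-type labels, field names, and `triage` helper are our own hypothetical constructs, not part of the study's actual analysis pipeline); it shows how reviewer-assigned error types could be mapped onto the four severity bands and tallied.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Severity bands for AI response errors, mirroring Table 8."""
    LOW = 1        # Minor: preference among accepted conservative therapies
    MEDIUM = 2     # Moderate: omission of a conservative option
    HIGH = 3       # Major: unindicated irreversible intervention recommended
    VERY_HIGH = 4  # Critical: misleading safety/contraindication information


@dataclass
class RatedResponse:
    """A chatbot answer tagged with the error type assigned by reviewers."""
    question: str
    error_type: str  # "minor" | "moderate" | "major" | "critical" | "none"


# Hypothetical mapping from reviewer-assigned error type to clinical severity.
ERROR_SEVERITY = {
    "minor": Severity.LOW,
    "moderate": Severity.MEDIUM,
    "major": Severity.HIGH,
    "critical": Severity.VERY_HIGH,
}


def triage(responses: list[RatedResponse]) -> dict[Severity, int]:
    """Count rated responses per severity band, skipping error-free answers."""
    counts: dict[Severity, int] = {s: 0 for s in Severity}
    for r in responses:
        severity = ERROR_SEVERITY.get(r.error_type)
        if severity is not None:
            counts[severity] += 1
    return counts
```

In practice such a tally would let a reviewer see at a glance whether a model's errors cluster in the low-stakes or high-stakes bands, which is the distinction the framework is designed to surface.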
To enhance the interpretability of our findings, we systematically categorized AI-generated inaccuracies according to their potential clinical consequences, providing a structured lens through which clinicians can weigh the safety and educational utility of AI-derived content against its limitations in TMD management. This error framework complements the expert-based evaluation described in the Methods section: by integrating the clinical expertise of two orthodontists and one oral and maxillofacial surgeon, the categorization reflects both scientific accuracy and real-world clinical implications, while acknowledging the multifactorial and sometimes controversial nature of TMD management.
Several mitigation strategies can address these limitations and enhance the reliability of AI-generated responses. First, AI outputs should always be interpreted in conjunction with expert clinical judgment, particularly in high-risk domains such as diagnosis and treatment planning. Second, integrating continuous updates from peer-reviewed literature and validated clinical guidelines can reduce the risk of outdated or incomplete information. Third, structured error frameworks, like the one applied in this study, offer a practical tool for evaluating the clinical weight of AI errors and guiding both clinicians and patients in interpreting outputs appropriately. Finally, educating users about the limitations of AI systems and promoting critical appraisal skills can minimize uncritical acceptance of inaccurate responses, safeguarding patient safety and supporting evidence-based decision-making [35, 36].
From an ethical standpoint, the interpretation of AI-generated outputs should also incorporate current clinical evidence and patient protection principles. Recent studies indicate that occlusion features are not considered causal factors in TMD [37, 38], disc repositioning procedures are generally discouraged [39], and overtreatment should be avoided to prevent unnecessary risk to patients [40]. In addition, education regarding bruxism remains a critical component of management, emphasizing preventive strategies and patient understanding [41, 42]. Integrating these perspectives reinforces that AI-driven tools must not only provide accurate information but also support ethical clinical decision-making and safeguard patient well-being.
Our domain-level results are congruent with the INfORM/IADR standard-of-care key points, which emphasize DC/TMD-based diagnosis, conservative and reversible therapies, and avoidance of occlusion-driven or disc-repositioning interventions without indication. The main areas where chatbot outputs diverged (particularly selected items in joint anatomy and long-term prognosis) mirror topics in which adherence to these key points is critical for safe patient communication and decision-making [22].
The findings from this study reinforce the position that LLMs hold potential value as supplementary tools in health communication. However, significant limitations remain. Although such systems can access vast repositories of publicly available data, they often lack access to the most recent peer-reviewed literature or subscription-based academic resources. Moreover, they may fail to differentiate between high- and low-quality sources or to reflect updates in clinical guidelines, issues that can critically undermine medical reliability [11]. Compounding this, previous work has indicated that the accuracy and consistency of AI-generated content may fluctuate across languages, and that translation can distort or oversimplify information [43].
In the current study, chatbot responses were collected at a specific point in time, yet AI-generated outputs are dynamic and may evolve with system updates. As prior research suggests, repeated queries submitted at different times can yield variable responses [44], further emphasizing the need for time-sensitive validation and transparency regarding data sources.
In addition to improving accuracy and clarity, AI-based systems should also be trained to recognize and appropriately manage advertisement-oriented statements within health communication. Developing such competencies could strengthen the reliability of AI outputs and serve a crucial function in shielding patients from deceptive or commercially motivated content, thereby supporting the ethical obligation of protecting public health.
In conclusion, while AI-based language models offer promise in facilitating access to health-related information, their variability in accuracy and completeness underscores the necessity for critical oversight. The potential for these systems to disseminate incomplete or misleading information (especially in the absence of professional guidance) limits their suitability as independent sources in clinical decision-making. To optimize their utility in healthcare, AI systems must undergo rigorous expert validation, incorporate ongoing updates from trusted medical databases, and align with ethical standards that prioritize patient safety and information integrity.
Conclusion
This study revealed notable disparities in the accuracy of AI-based language models when answering questions related to temporomandibular disorders. ChatGPT-4 proved to be the most reliable, offering a higher rate of accurate and comprehensive responses compared to Gemini and Copilot. However, it still showed limitations in certain domains, such as anatomy and prognosis.
Gemini often produced context-limited responses with incomplete information, while Copilot consistently delivered the least accurate and most superficial answers across all topics. These findings highlight the importance of critically evaluating AI-generated medical content.
Before being adopted in clinical settings, such models must be rigorously validated and supported by trustworthy medical sources. AI tools should be used to complement, not replace, expert medical guidance.
Acknowledgements
The authors thank statistician Ayça Ölmez for performing the statistical analyses.
Abbreviations
- AI
Artificial intelligence
- LLM
Large language model
- TMD
Temporomandibular disorders
- TMJ
Temporomandibular joint
- DC/TMD
Diagnostic Criteria for Temporomandibular Disorders
- INfORM
International Network for Orofacial Pain and Related Disorders Methodology
- IADR
International Association for Dental Research
- CBT
Cognitive–behavioural therapy
Authors’ contributions
Ö.Ö.Z. conceived and designed the study, collected the chatbot responses, and drafted the manuscript. Ş.H. contributed to the development of the question set, data interpretation, and critical revision of the manuscript for important intellectual content. E.Ç.Ö. refined the question set, prepared the figures and tables, and contributed to the interpretation and revision of the manuscript. All authors read and approved the final version of the manuscript.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Data availability
The datasets generated and/or analysed during the current study (Excel files containing the questions and rated chatbot responses) are available from the corresponding author on reasonable request.
Declarations
Ethics approval and consent to participate
Not applicable. This study analysed anonymised outputs generated by publicly available AI-based language models (ChatGPT-4, Gemini, and Copilot) and did not involve human participants, human data, or animals. In accordance with institutional and national regulations, formal approval from an ethics committee was therefore not required.
Consent for publication
Not applicable. This manuscript does not contain any individual person’s data (including images, videos, or case details) that would require consent for publication.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Gauer RL, Semidey MJ. Diagnosis and treatment of temporomandibular disorders. Am Fam Physician. 2015;91(6):378–86. [PubMed] [Google Scholar]
- 2.Valesan LF, Da-Cas CD, Réus JC, et al. Prevalence of temporomandibular joint disorders: a systematic review and meta-analysis. Clin Oral Investig. 2021;25(2):441–53. [DOI] [PubMed] [Google Scholar]
- 3.Schiffman E, Ohrbach R, Truelove E, et al. Diagnostic criteria for temporomandibular disorders (DC/TMD) for clinical and research applications: recommendations of the international RDC/TMD consortium Network* and orofacial pain special interest group. J Oral Facial Pain Headache. 2014;28:6–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Sun R, Zhang S, Si J, et al. Association between oral behaviors and painful temporomandibular disorders: A Cross-Sectional study in the general population. J Pain Res. 2024;17:431–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Scrivani SJ, Keith DA, Kaban LB. Temporomandibular disorders. N Engl J Med. 2008;359(25):2693–705. [DOI] [PubMed] [Google Scholar]
- 6.Dworkin SF, LeResche L. Research diagnostic criteria for temporomandibular disorders: Review, criteria, examinations and specifications, critique. J Craniomandib Disord. 1992;6:301–55. [PubMed] [Google Scholar]
- 7.List T, Jensen RH. Temporomandibular disorders: old ideas and new concepts. Cephalalgia. 2017;37(7):692–704. [DOI] [PubMed] [Google Scholar]
- 8.Schiffman E, Ohrbach R. Executive summary of the diagnostic criteria for temporomandibular disorders for clinical and research applications. J Am Dent Assoc. 2016;147(6):438–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Shinchuk LM, Chiou P, Czarnowski V, et al. Demographics and attitudes of chronic-pain patients who seek online painrelated medical information: implications for healthcare providers. Am J Phys Med Rehabil. 2010;89:141–6. [DOI] [PubMed] [Google Scholar]
- 10.Shen Y, Heacock L, Elias J, et al. ChatGPT and other large Language models are Double-edged swords. Radiology. 2023;307(2):e230163. [DOI] [PubMed] [Google Scholar]
- 11.Javaid M, Haleem A, Singh RP. ChatGPT for healthcare services: an emerging stage for an innovative perspective. BenchCouncil Trans Benchmarks Stand Evaluations. 2023;1–4. 10.1016/j.tbench.2023.100105.
- 12.Zhu JJ, Jiang J, Yang M, Ren ZJ. ChatGPT and environmental research. Environ Sci Technol. 2023;57(46):17667–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Nastasi AJ, Courtright KR, Halpern SD, Weissman GE. A vignette-based evaluation of chatgpt’s ability to provide appropriate and equitable medical advice across care contexts. Sci Rep. 2023;13(1):17885. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Armbruster J, Bussmann F, Rothhaas C, Titze N, Grützner PA, Freischmidt H. Doctor ChatGPT, can you help me? The patient’s perspective: Cross-Sectional study. J Med Internet Res. 2024;26:e58831. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of chatgpt’s (artificial intelligence) role in research, clinical practice, education, and patient interaction. Med (Baltim). 2024;103(32):e39250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Nadarzynski T, Miles O, Cowie A, Ridge D. Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: A mixed-methods study. Digit Health. 2019;5:2055207619871808. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Kurniawan MH, Handiyani H, Nuraini T, Hariyati RTS, Sutrisno S. A systematic review of artificial intelligence-powered (AI-powered) chatbot intervention for managing chronic illness. Ann Med. 2024;56(1):2302980. 10.1080/07853890.2024.2302980. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Suhaili S, Salim N, Jambli M. Service chatbots: A systematic review. Expert Syst Appl. 2021;184:115461. [Google Scholar]
- 20.Faerber AE, Kreling DH. Content analysis of false and misleading claims in television advertising for prescription and nonprescription drugs. J Gen Intern Med. 2014;29(1):110–8. 10.1007/s11606-013-2604-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Deiana G, Dettori M, Arghittu A, Azara A, Gabutti G, Castiglia P. Artificial intelligence and public health: evaluating ChatGPT responses to vaccination Myths and misconceptions. Vaccines (Basel). 2023;11(7):1217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Manfredini D, Häggman-Henrikson B, Al Jaghsi A, Baad-Hansen L, Beecroft E, Bijelic T, Bracci A, Brinkmann L, Bucci R, Colonna A, Ernberg M, Giannakopoulos NN, Gillborg S, Greene CS, Heir G, Koutris M, Kutschke A, Lobbezoo F, Lövgren A, Michelotti A, Nixdorf DR, Nykänen L, Oyarzo JF, Pigg M, Pollis M, Restrepo CC, Rongo R, Rossit M, Saracutu OI, Schierz O, Stanisic N, Val M, Verhoeff MC, Visscher CM, Voog-Oras U, Wrangstål L, Bender SD, Durham J, International Network for Orofacial Pain and Related Disorders Methodology. Temporomandibular disorders: INfORM/IADR key points for good clinical practice based on standard of care. Cranio. 2025;43(1):1–5. Epub 3 Oct 2024. PMID: 39360749. 10.1080/08869634.2024.2405298.
- 23.Strzalkowski P, Strzalkowska A, Chhablani J, et al. Evaluation of the accuracy and readability of ChatGPT-4 and Google gemini in providing information on retinal detachment: a multicenter expert comparative study. Int J Retina Vitreous. 2024;10:61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Demir S. Evaluation of responses to questions about keratoconus using ChatGPT-4.0, Google Gemini and Microsoft Copilot: a comparative study of large language models on keratoconus. Eye Contact Lens. 2024;2024:1158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Lee TJ, Campbell DJ, Patel S, Hossain A, Radfar N, Siddiqui E, Gardin JM. Unlocking health literacy: the ultimate guide to hypertension education from ChatGPT versus Google gemini. Cureus. 2024;16:e59898. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Muluk E. A comparative analysis of artificial intelligence platforms: ChatGPT-4o and Google gemini in answering questions about birth control methods. Cureus. 2025;17(1):e76745. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Hancı V, Ergün B, Gül Ş, Uzun Ö, Erdemir İ, Hancı FB. Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care. Med (Baltim). 2024;103(33):e39305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Gül Ş, Erdemir İ, Hanci V, Aydoğmuş E, Erkoç YS. How artificial intelligence can provide information about subdural hematoma: assessment of readability, reliability, and quality of ChatGPT, BARD, and perplexity responses. Med (Baltim). 2024;103(18):e38009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Davis RJ, Ayo-Ajibola O, Lin ME, et al. Evaluation of oropharyngeal cancer information from revolutionary artificial intelligence chatbot. Laryngoscope. 2024;134(5):2252–7. [DOI] [PubMed] [Google Scholar]
- 30.McCarthy CJ, Berkowitz S, Ramalingam V, Ahmed M. Evaluation of an artificial intelligence chatbot for delivery of interventional radiology patient education material: a comparison with societal website content. J Vasc Interv Radiol. 2023;34:1760–e832. [DOI] [PubMed] [Google Scholar]
- 31.Musheyev D, Pan A, Loeb S, Kabarriti AE. How well do artificial intelligence chatbots respond to the top search queries about urological malignancies? Eur Urol. 2024;85:13–6. [DOI] [PubMed] [Google Scholar]
- 32.Erden Y, Temel MH, Bağcıer F. Artificial intelligence insights into osteoporosis: assessing chatgpt’s information quality and readability. Arch Osteoporos. 2024;19:17. [DOI] [PubMed] [Google Scholar]
- 33.Johnson SB, King AJ, Warner EL, Aneja S, Kann BH, Bylund CL. Using ChatGPT to evaluate cancer Myths and misconceptions: artificial intelligence and cancer information. JNCI Cancer Spectr. 2023;7(2):pkad015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Kang B, Hong M. Development and evaluation of a mental health chatbot using ChatGPT 4.0: mixed methods user experience study with Korean users. JMIR Med Inf. 2025;13:e63538. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Xue J, Zhang B, Zhao Y, et al. Evaluation of the current state of chatbots for digital health: scoping review. J Med Internet Res. 2023;25:e47217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Reda B, Contardo L, Prenassi M, Guerra E, Derchi G, Marceglia S. Artificial intelligence to support early diagnosis of temporomandibular disorders: A preliminary case study. J Oral Rehabil. 2023;50(1):31–38. Epub 3 Nov 2022. PMID: 36285513. 10.1111/joor.13383. [DOI] [PubMed]
- 37.Jacob C, Brasier N, Laurenzi E, Heuss S, Mougiakakou SG, Cöltekin A, Peter MK. AI for IMPACTS framework for evaluating the Long-Term Real-World impacts of AI-Powered clinician tools: systematic review and narrative synthesis. J Med Internet Res. 2025;27:e67485. 10.2196/67485. PMID: 39909417; PMCID: PMC11840377. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Ng FYC, Thirunavukarasu AJ, Cheng H, Tan TF, Gutierrez L, Lan Y, Ong JCL, Chong YS, Ngiam KY, Ho D, Wong TY, Kwek K, Doshi-Velez F, Lucey C, Coffman T, Ting DSW. Artificial intelligence education: an evidence-based medicine approach for consumers, translators, and developers. Cell Rep Med. 2023;4(10):101230. 10.1016/j.xcrm.2023.101230. PMID: 37852174; PMCID: PMC10591047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Landi N, Manfredini D, Tognini F, Romagnoli M, Bosco M. Quantification of the relative risk of multiple occlusal variables for muscle disorders of the stomatognathic system. J Prosthet Dent. 2004;92(2):190-5. 10.1016/j.prosdent.2004.05.013. PMID: 15295330. [DOI] [PubMed]
- 40.Manfredini D, Lombardo L, Siciliani G. Temporomandibular disorders and dental occlusion. A systematic review of association studies: end of an era? J Oral Rehabil. 2017;44(11):908–23. Epub 2 July 2017. PMID: 28600812. 10.1111/joor.12531. [DOI] [PubMed] [Google Scholar]
- 41.Mercuri LG, Greene CS, Manfredini D. The temporomandibular joint disc: A complex fable about an elusive butterfly. Cranio. 2025;1–8. Epub ahead of print 10 Apr 2025. PMID: 40205916. 10.1080/08869634.2025.2477963. [DOI] [PubMed]
- 42.Manfredini D, Bender SD, Häggman-Henrikson B, Durham J, Greene CS. TMD management standards updated. Br Dent J. 2025;238(5):293. Epub 14 Mar 2025. PMID: 40087414. 10.1038/s41415-025-8515-8. [DOI] [PubMed] [Google Scholar]
- 43.Näsänen J, Karaharju-Suvanto T, Lobbezoo F, Verhoeff MC, Lappalainen OP, Nykänen L. Self-assessed competence in relation to bruxism among undergraduate dental students in Finland. Cranio. 2025;1–11. Epub ahead of print 6 Mar 2025. PMID: 40047365. 10.1080/08869634.2025.2472085. [DOI] [PubMed]
- 44.Mungia R, Lobbezoo F, Funkhouser E, Glaros A, Manfredini D, Ahlberg J, Taverna M, Galang-Boquiren MT, Rugh J, Truong C, Boone H, Cheney C 3rd, Verhoeff MC, Gilbert GH, National Practice-Based Research Network Collaborator Group. Dental practitioner approaches to bruxism: preliminary findings from the National dental practice-based research network. Cranio. 2025;43(3):480–8. Epub 4 Apr 2023. PMID: 37016587; PMCID: PMC11011247. [DOI] [PMC free article] [PubMed] [Google Scholar]