Abstract
Background
Artificial intelligence (AI)-based chatbots are increasingly used by parents as convenient and fast-access sources of information on health-related topics. This study aimed to assess the readability, accuracy and overall quality of responses provided by ChatGPT-4o, Google Gemini and Microsoft Copilot to questions concerning deleterious oral habits in children.
Methods
A total of 43 questions, derived from real-life discussions on the Reddit platform, were revised for clarity and demographic diversity. These were classified into seven categories based on specific types of deleterious oral habits: thumb sucking, pacifier use, bruxism, tongue thrusting, lip sucking, nail biting, and mouth breathing. Responses from each AI chatbot were evaluated using multiple tools, including Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), the modified DISCERN tool (mDISCERN), the Global Quality Score (GQS), and a misinformation scoring system. Statistical analyses were performed using the Kruskal–Wallis test followed by Dunn’s post hoc test for non-normally distributed variables, and one-way ANOVA with Tukey’s post hoc test for normally distributed variables (p < .05).
Results
ChatGPT-4o generated responses with significantly lower readability and higher textual complexity compared to Gemini and Copilot, as reflected by its lower FRE (p = .0022) and higher FKGL (p = .0062) scores. ChatGPT-4o had 76.74% of its responses rated as excellent quality (GQS score of 5), compared to 44.19% for Gemini and 30.23% for Copilot. In terms of accuracy, ChatGPT-4o provided correct information for 93% of the questions (misinformation scores of 4 or 5), while Gemini and Copilot achieved 88.37% and 81.4%, respectively. Google Gemini achieved the highest mDISCERN score (34.1) due to better source referencing.
Conclusion
AI chatbots may serve as supplementary tools for parental education on oral health, yet their performance varies by platform. ChatGPT-4o excelled in accuracy and structure, Gemini in transparency, and Copilot in simplicity. However, these tools should not substitute for professional dental guidance. Enhancing readability and source referencing remains essential for improving the reliability of AI-generated health information.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12903-025-07298-z.
Keywords: Deleterious oral habits, Chatbots, Artificial intelligence
Introduction
Deleterious oral habits are defined as repetitive orofacial behaviors performed without functional purpose, including thumb sucking, tongue thrusting, nail biting, bruxism, lip sucking, and mouth breathing [1]. Although some of these habits are considered normal during infancy and tend to decrease with age, their continuation beyond early childhood can negatively affect facial growth and occlusal development [2]. Several studies have shown that deleterious oral habits are highly prevalent among children, with reported rates ranging from 36% to over 50% across different age groups and countries [3, 4].
Given their high prevalence and potential impact, deleterious oral habits have been closely linked to the development of malocclusions [5]. Mouth breathing, especially when combined with nasal obstruction, showed a strong association with bimaxillary protrusion, while nail biting was moderately linked to increased overbite [6]. Thumb sucking during early childhood has been associated with spacing, crowding, and increased overjet, often necessitating orthodontic intervention later in life [7]. Despite the high prevalence and the clinical significance of these habits, studies indicate that many parents lack sufficient awareness regarding their causes and consequences. In a cross-sectional survey, more than 25% of mothers either did not know or incorrectly answered when asked whether habits such as thumb sucking or lip and cheek biting could affect their child’s dentition [8]. In the absence of professional dental guidance, many parents seek information through online sources, with platforms like YouTube being among the most commonly used for pediatric dental advice [9]. However, the quality and accuracy of the content on these platforms vary widely, raising concerns about misinformation. A study analyzing YouTube videos related to thumb-sucking habits found that less than half contained scientifically accurate information. This highlights the need for more reliable, accessible and evidence-based tools to support caregivers in making informed decisions [10].
Technology has significantly changed how individuals access healthcare information, including guidance on children’s oral and dental health. Many parents increasingly turn to online resources, including artificial intelligence (AI)-based platforms, to obtain quick and convenient information. Factors such as the time required for in-person consultations, potential financial constraints, and the need for immediate guidance contribute to this preference [11]. However, the reliability and accuracy of online content cannot always be assured. Unverified or incorrect information may lead to misguided decisions and may even worsen health issues. In certain instances, parents rely on AI platforms as a temporary solution when access to healthcare professionals is limited [12]. Therefore, it is crucial to evaluate the accuracy and trustworthiness of AI-generated information, particularly in sensitive topics such as children’s oral health.
Although many studies have examined the reliability and accuracy of AI chatbots in various dental fields such as dental trauma [13, 14], orthodontics [15, 16], and pediatric dentistry [17, 18], only one study has investigated deleterious oral habits [19]. That study focused on a single oral habit (bruxism) and assessed only readability, accuracy, and consistency, without parameters such as response quality and misinformation that were evaluated in the present study. To the best of our knowledge, the present study is the first to evaluate AI chatbot responses across seven deleterious oral habits—thumb sucking, pacifier use, nail biting, tongue thrusting, tongue/lip sucking, tooth grinding, and mouth breathing. Therefore, this study aimed to evaluate the quality and accuracy of responses provided by different AI chatbots to questions concerning deleterious oral habits, which are common in children and can contribute to malocclusions if left unaddressed.
The first null hypothesis stated that there would be no statistically significant differences among the three AI chatbots in terms of the readability of their responses concerning deleterious oral habits. The second null hypothesis proposed that the quality and accuracy of the responses would not significantly differ across the chatbots.
Materials and methods
Question source and categorization
The questions used in this study were derived from patient inquiries on a popular question-answer platform, Reddit (https://www.reddit.com), specifically from dentistry- and parenting-related subreddits such as r/askdentist, r/Dentistry, and r/Parenting, where users post questions for public discussion. This platform allows users to ask real-life questions, often based on their own or their children’s oral health concerns. From this source, all questions related to deleterious oral habits were extracted to form the initial pool. To create a representative and structured question set, some inquiries were modified by adjusting variables such as the patient’s age or the phrasing. These modifications were made specifically to evaluate how AI chatbots adapt their responses to different scenarios and demographics. For example, altering the age of the patient within the same question allowed us to assess whether the chatbot could adjust its responses appropriately while maintaining accuracy and relevance. This method kept the questions realistic while covering a wide range of situations, and allowed for a better evaluation of the quality and adaptability of the AI-generated responses.
Pediatric dentistry specialists independently screened and refined the initial pool of questions according to predefined inclusion and exclusion criteria. Inclusion criteria were: (1) parent- or patient-oriented questions written in English; (2) questions relevant to one of the seven targeted deleterious oral habits; (3) child age between 0 and 16 years; and (4) content addressing at least one of the following: dental/craniofacial consequences, etiology, timing (when to stop/seek care), or management. Exclusion criteria were: (1) questions unrelated to oral habits; (2) questions about oral habits but without any dental, oral, or craniofacial relevance; (3) adult-focused questions; and (4) duplicates or near-identical questions.
The final dataset consisted of 43 questions, categorized into seven groups based on common deleterious oral habits to provide balanced representation across all categories and adequate variation for meaningful comparison. Table 1 presents one representative sample question from each category.
Table 1.
Oral habit categories and representative sample questions used in the study
| Oral Habit Category | Sample Question |
|---|---|
| Thumb sucking (7 questions) | My 2-year-old still sucks her thumb. I’ve tried the usual approaches like positive reinforcement and icky nail polish—but nothing has worked. What should I do? |
| Pacifier use (7 questions) | What are the potential dental problems caused by pacifier use? |
| Tooth grinding (6 questions) | What are the treatment options for teeth grinding in children aged 6–12? |
| Tongue thrust (6 questions) | My almost 3-year-old has a tongue thrust still, and I’ve noticed that his teeth are starting to buck out a little bit. Is this something I should be worried about or get treated, or will he likely grow out of it on his own? |
| Tongue/lip sucking (7 questions) | 3-year-old constantly licking and biting his lips. It’s getting raw. Any advice? |
| Nail biting (6 questions) | My 9-year-old child has developed a nail-biting habit. Will this damage his teeth? How can I stop this? |
| Mouth breathing (6 questions) | I noticed that my six-year-old always sleeps with his mouth open. Does this affect his facial development? |
Chatbot query process
Each question was submitted to the latest publicly accessible versions of ChatGPT-4o (model: gpt-4) and Google Gemini (model: Gemini Pro 1.0), both updated as of October 31, 2024. Microsoft Copilot was used in its latest available configuration at that time, powered by the Microsoft Prometheus model with underlying GPT-4 variants (version updated October 10, 2024). To maintain consistency, both platforms were accessed using their default settings. For ChatGPT-4o, the “temporary” chat setting was used to ensure no memory of previous interactions influenced the responses. A new chat window was opened for each query to prevent contextual carryover, commonly referred to as the “Token Effect”. All questions were entered on the same day, by the same user profile, and from the same Wi-Fi network and IP address to minimize potential environmental variability. No follow-up or repeat questions were asked. All responses were recorded and are presented in Supplementary Table 1 for evaluation.
All references cited by the chatbots were manually checked. Academic sources such as PubMed links, peer-reviewed journals, official guidelines, or professional society websites were cross-verified for accuracy and relevance. References from non-academic, commercial, or unverifiable sources, as well as dead or mismatched links, were excluded from the analysis. Reference verification was performed by two pediatric dentistry specialists.
Evaluation of responses
Readability assessment
To objectively assess the readability of the AI-generated responses, two validated readability measures were used: Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). FRE scores range from 0 to 100, with higher scores indicating easier-to-read content. Scores above 80 are recommended for healthcare materials intended for the general public [20]. FKGL indicates the U.S. school grade level required to understand the text, with scores ranging from 0 to 18. For example, a score of 6 corresponds to a sixth-grade reading level, while a score of 17 indicates graduate-level complexity [21]. Both metrics were calculated in Microsoft Office Word using the following formulas:
FRE score = 206.835 − 1.015 (total number of words/total number of sentences) − 84.6 (total number of syllables/total number of words).
FKGL score = 0.39 (total number of words/total number of sentences) + 11.8 (total number of syllables/total number of words) − 15.59.
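The two formulas above can be sketched in Python as follows. This is only an illustration: the study obtained its scores from Microsoft Office Word, whose word, sentence, and syllable counting is not described here. The regex tokenization and the vowel-group syllable heuristic in `count_syllables` and `text_counts` are assumptions made for this sketch, so values may differ from Word's output.

```python
import re
from typing import Tuple


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not Word's algorithm): count vowel groups."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_counts(text: str) -> Tuple[int, int, int]:
    """Return (words, sentences, syllables) using simple regex tokenization."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


# Worked example with fixed counts: 100 words, 5 sentences, 150 syllables.
fre = flesch_reading_ease(100, 5, 150)
fkgl = flesch_kincaid_grade(100, 5, 150)
print(round(fre, 2), round(fkgl, 2))
```

With 20 words per sentence and 1.5 syllables per word, the formulas give an FRE of about 59.6 and an FKGL of about 9.9. The FRE falls because both terms are subtracted, and the FKGL rises because both are added, so longer sentences and longer words move the two scores in opposite directions.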
Assessing quality and accuracy
To assess the quality and accuracy of AI chatbot responses, three evaluation tools were used: a modified version of the DISCERN tool (mDISCERN), the Global Quality Scale (GQS), and a misinformation analysis. The original DISCERN scale contains sixteen questions divided into three sections: reliability (8 questions), quality of information about treatment options (7 questions), and an overall quality rating (1 question) [22]. Given that several questions in this study were not related to treatment options, the second section was excluded. Additionally, the remaining questions were adapted from a previously modified DISCERN version specifically tailored for evaluating AI-generated content, to ensure relevance to chatbot responses [23]. The mDISCERN tool included eight questions, each rated on a 5-point Likert scale, yielding a total score ranging from 8 to 40. Two pediatric dentistry specialists (with 3 and 6 years of clinical and academic experience), blinded to the identity of the AI chatbots, independently evaluated the responses, and the mean score was calculated for each item. Total scores of 15 or below were classified as poor, scores from 16 to 31 as fair, and scores of 32 or above as good.
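The mDISCERN scoring described above (eight items, two raters, item-wise means, three quality bands) can be sketched as below. The function names and the example ratings are hypothetical. Because averaging two raters can produce half-point totals, the sketch also assumes that a total falling between two bands (for example 31.5) is assigned to the lower band, a case the text does not specify.

```python
from statistics import mean
from typing import List


def mdiscern_total(rater_a: List[int], rater_b: List[int]) -> float:
    """Average the two raters' scores item by item, then sum the 8 item means."""
    if len(rater_a) != 8 or len(rater_b) != 8:
        raise ValueError("mDISCERN has eight items")
    if any(not 1 <= s <= 5 for s in rater_a + rater_b):
        raise ValueError("each item is rated from 1 to 5")
    return float(sum(mean(pair) for pair in zip(rater_a, rater_b)))


def mdiscern_band(total: float) -> str:
    """Bands: up to 15 poor, 16-31 fair, 32-40 good.

    Assumption: half-point totals between bands (e.g. 31.5) go to the lower band.
    """
    if total < 16:
        return "poor"
    if total < 32:
        return "fair"
    return "good"


# Hypothetical ratings for one chatbot response.
rater_a = [5, 5, 4, 5, 4, 5, 4, 5]  # sums to 37
rater_b = [5, 4, 4, 5, 4, 5, 4, 5]  # sums to 36
total = mdiscern_total(rater_a, rater_b)
print(total, mdiscern_band(total))
```

Here the two raters differ on a single item, so the total is the midpoint of their individual sums.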
To assess the overall quality of the AI-generated responses, the Global Quality Scale (GQS) was also utilized. This tool provides a simple yet effective measure of content quality, depth, and reliability. Each response was rated on a 5-point scale: 1 for poor quality (misleading or incomplete, not helpful), 2 for fair quality (some value but with inaccuracies), 3 for good quality (mostly accurate but missing depth), 4 for very good quality (accurate and useful for general readers), and 5 for excellent quality (highly accurate, detailed and clear with actionable insights) [24].
The extent of misinformation and comprehensiveness of AI chatbot responses were evaluated using a five-point Likert scale (from 1, strongly disagree, to 5, strongly agree), adapted from Mohammad-Rahimi et al. [25].
GQS and misinformation assessments were independently conducted by two pediatric dentistry specialists. Discrepancies were resolved through discussion, and if consensus could not be reached, a third senior pediatric dentist with 20 years of experience provided the final judgment.
Statistical analysis
Statistical analyses were performed using IBM SPSS Version 23. The Kolmogorov–Smirnov test was applied to evaluate the normality of data distribution. For variables that did not follow a normal distribution, the Kruskal–Wallis test with Dunn’s post hoc analysis was conducted. For normally distributed variables, one-way ANOVA followed by Tukey’s post hoc test was used. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Categorical variables were analyzed using Pearson’s chi-square test, with post hoc comparisons adjusted using the Bonferroni-corrected Z test. Quantitative data are reported as mean ± standard deviation and median (minimum–maximum), while categorical data are presented as frequencies and percentages. A p-value of less than 0.05 was considered statistically significant.
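As a rough illustration of the Kruskal–Wallis step, the sketch below computes the tie-corrected H statistic in plain Python. It is not the authors' procedure (the study used SPSS), and the FRE values are hypothetical, invented only to make the example runnable. The p-value uses the chi-square approximation, which has the closed form exp(−H/2) for three groups (2 degrees of freedom). Dunn’s post hoc test is not shown.

```python
import math
from collections import Counter


def kruskal_wallis(groups):
    """Return (H, df) for independent samples, with the standard tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    counts = Counter(pooled)
    # Tied observations share the mean of the ranks they would occupy.
    rank_of = {}
    start = 1
    for value in sorted(counts):
        c = counts[value]
        rank_of[value] = start + (c - 1) / 2
        start += c
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    tie_correction = 1 - sum(c**3 - c for c in counts.values()) / (n**3 - n)
    if tie_correction == 0:
        raise ValueError("all observations are identical")
    return h / tie_correction, len(groups) - 1


def chi2_sf_df2(x):
    """Chi-square survival function for exactly 2 degrees of freedom."""
    return math.exp(-x / 2)


# Hypothetical FRE scores for three chatbots (not data from the study).
chatgpt = [45.2, 49.1, 50.3, 47.8, 44.0]
gemini = [51.6, 53.0, 50.9, 55.2, 49.5]
copilot = [54.9, 56.1, 52.4, 57.3, 53.8]

h_stat, dof = kruskal_wallis([chatgpt, gemini, copilot])
print(f"H = {h_stat:.3f}, df = {dof}, p = {chi2_sf_df2(h_stat):.4f}")
```

A rank-based test of this kind suits scores such as FRE and mDISCERN when the normality check fails, because it compares rank sums rather than means.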
Results
Readability (FRE and FKGL Scores)
The overall readability of the responses generated by ChatGPT-4o, Gemini, and Copilot was evaluated using FRE and FKGL scores, as illustrated in Fig. 1a and b. The median FRE score for ChatGPT-4o was 49.1, corresponding to the “difficult to read” category. In contrast, Gemini and Copilot had higher median FRE scores of 51.6 and 54.9, respectively, both falling within the “fairly difficult to read” range. This difference was statistically significant, with ChatGPT-4o scoring lower than both Gemini and Copilot (p = .0022), indicating reduced readability. Similarly, FKGL scores revealed that ChatGPT-4o had a significantly higher mean score (10.47) compared to Gemini (9.53) and Copilot (9.61) (p = .0062), reflecting greater textual complexity. These findings indicate that ChatGPT-4o produces responses that are quantitatively more difficult to read than those generated by Gemini and Copilot.
Fig. 1.
Readability scores for responses generated by ChatGPT-4o, Gemini, and Copilot for 43 questions. a FRE scores are presented as violin plots, with each dot representing a response and the median (min–max) values indicated within each distribution. The Kruskal–Wallis test with post hoc Dunn’s test was applied. b FKGL scores are displayed as scatter plots, showing individual responses along with mean ± standard deviation. One-way ANOVA with post hoc Tukey test was applied. (ns, not significant; **p < .01; *p < .05)
Quality and accuracy of the responses (mDISCERN, GQS and misinformation Scores)
Median mDISCERN scores differed significantly among the AI chatbots (p < .001) (Fig. 2). Gemini and Copilot had median mDISCERN scores of 36.5 (21.5–39) and 34.5 (17.5–38), respectively, corresponding to the ‘good quality’ range, whereas ChatGPT-4o had a lower median score of 28.5 (25–30), falling within the ‘fair quality’ category. Gemini achieved a statistically significantly higher median mDISCERN score than both Copilot (p = .0045) and ChatGPT-4o (p < .0001).
Fig. 2.

mDISCERN scores are presented as violin plots, with each dot representing a response and the median (min–max) values indicated within each distribution. The mDISCERN scores are categorized as follows: 0–15 (poor quality), 16–31 (fair quality), and 32–40 (good quality). The Kruskal–Wallis test with post hoc Dunn’s test was applied (****p < .0001; **p < .01)
GQS ratings revealed significant differences among ChatGPT-4o, Gemini, and Copilot (p = .004) (Table 2). The proportion of responses rated 4 or 5, indicating very good and excellent quality respectively, was significantly higher for ChatGPT-4o compared to both Gemini and Copilot. Notably, ChatGPT-4o had no responses with scores of 1 or 2 (poor or fair quality), whereas three of Gemini’s and two of Copilot’s responses received such scores (Fig. 3a). Overall, ChatGPT-4o consistently generated responses of at least very good quality.
Table 2.
Comparison of GQS according to AI chatbots
| Scores | ChatGPT-4o n (%) | Gemini n (%) | Copilot n (%) | Test Statistic | p* |
|---|---|---|---|---|---|
| 1 | 0 (0) | 1 (2.33) | 1 (2.33) | 22.421 | 0.004 |
| 2 | 0 (0) | 2 (4.65) | 1 (2.33) | | |
| 3 | 0 (0) | 2 (4.65) | 4 (9.3) | | |
| 4 | 10 (23.26)a | 19 (44.19)a, b | 24 (55.81)b | | |
| 5 | 33 (76.74)a | 19 (44.19)b | 13 (30.23)b | | |
*Pearson Chi-Square Test
a-b: Different superscript letters indicate statistically significant differences
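The Pearson chi-square statistic reported in Table 2 can be recomputed from the listed counts. The sketch below does this in plain Python rather than SPSS, so it illustrates the calculation and is not the authors' workflow. The closed-form p-value in `chi2_sf_even_df` is valid only for an even number of degrees of freedom, which holds here (df = 8). All function and variable names are illustrative.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_sf_even_df(x, df):
    """Chi-square survival function; this closed form requires even df."""
    if df % 2:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(
        half**k / math.factorial(k) for k in range(df // 2)
    )


# Rows: GQS scores 1-5; columns: ChatGPT-4o, Gemini, Copilot (counts from Table 2).
gqs_counts = [
    [0, 1, 1],
    [0, 2, 1],
    [0, 2, 4],
    [10, 19, 24],
    [33, 19, 13],
]
stat, df = pearson_chi_square(gqs_counts)
p = chi2_sf_even_df(stat, df)
print(f"chi2 = {stat:.3f}, df = {df}, p = {p:.3f}")
```

Run on the Table 2 counts, this gives a statistic of about 22.421 with 8 degrees of freedom and a p-value of about 0.004, matching the reported values.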
Fig. 3.
Histograms with the score distribution for each AI chatbot. a GQS. b Misinformation scores
The chatbots evaluated in this study generated a small number of incorrect or irrelevant statements in response to the questions. These are highlighted in red in Supplementary Table 1. Notably, none of ChatGPT-4o’s responses were rated as incorrect or irrelevant (scores 1 or 2), while Gemini and Copilot each produced two such responses, indicating a small proportion of misleading content. ChatGPT-4o had the highest proportion of responses rated as Score 5 (83.72%), indicating completely correct and comprehensive information (Fig. 3b). This was followed by Gemini with 58.14% and Copilot with 46.51%. In contrast, Gemini (30.23%) and Copilot (34.88%) had a higher proportion of responses rated as Score 4, which were considered correct but lacking important information, compared to ChatGPT-4o (9.30%) (p < .05).
Interrater reliability for the mDISCERN scores was found to be excellent, with an ICC of 0.943 (95% CI: 0.920–0.960; p < .001), indicating strong agreement between the two evaluators. Reliability was good for the GQS (ICC = 0.800; 95% CI: 0.728–0.855; p < .001) and for the misinformation scores (ICC = 0.872; 95% CI: 0.823–0.908; p < .001).
Table 3.
Distribution of misinformation scores according to AI chatbots
| Scores | ChatGPT-4o n (%) | Gemini n (%) | Copilot n (%) | Test Statistic | p* |
|---|---|---|---|---|---|
| Score 1 (Strongly Disagree): The answer and the entire content are incorrect or irrelevant. | 0 (0) | 2 (4.65) | 1 (2.33) | 16.9 | 0.031 |
| Score 2 (Disagree): The answer is incorrect, but the content includes some correct elements. | 0 (0) | 0 (0) | 1 (2.33) | | |
| Score 3 (Neutral): The answer is somewhat correct, but details are primarily incorrect, missing, or irrelevant. | 3 (6.98) | 3 (6.98) | 6 (13.95) | | |
| Score 4 (Agree): The answer is correct and most of the content is correct, but it lacks information. | 4 (9.30)a | 13 (30.23)b | 15 (34.88)b | | |
| Score 5 (Strongly Agree): The answer is correct, and the content is comprehensive. | 36 (83.72)a | 25 (58.14)b | 20 (46.51)b | | |
*Pearson Chi-Square Test
a-b: Different superscript letters indicate statistically significant differences
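The share of responses counted as correct (misinformation scores of 4 or 5) follows directly from the counts in Table 3. A minimal Python sketch of that arithmetic is given below; the dictionary layout and the helper name `share_at_or_above` are illustrative choices.

```python
# Counts per misinformation score (1 to 5), taken from Table 3.
misinformation_counts = {
    "ChatGPT-4o": [0, 0, 3, 4, 36],
    "Gemini": [2, 0, 3, 13, 25],
    "Copilot": [1, 1, 6, 15, 20],
}


def share_at_or_above(counts, threshold):
    """Percentage of responses with score >= threshold (scores start at 1)."""
    total = sum(counts)
    hits = sum(counts[threshold - 1:])
    return 100 * hits / total


for name, counts in misinformation_counts.items():
    print(f"{name}: {share_at_or_above(counts, 4):.2f}% rated 4 or 5")
```

Each chatbot answered 43 questions, so the percentages correspond to 40, 38, and 35 responses rated 4 or 5 for ChatGPT-4o, Gemini, and Copilot, respectively.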
Discussion
The present study evaluated the readability, understandability, quality and accuracy of responses generated by ChatGPT-4o, Copilot and Gemini to inquiries related to deleterious oral habits. The first null hypothesis was rejected, since ChatGPT-4o’s responses were significantly less readable than those of Gemini and Copilot, as shown by its lower FRE and higher FKGL scores.
The readability of health information is critical to ensuring that it is both understandable and actionable for its intended audience. Studies have shown that when health materials are written at a reading level that matches the average patient’s literacy skills, comprehension improves significantly [26]. The American Medical Association recommends that patient education materials be written below a 6th-grade reading level to enhance accessibility for the general population [27, 28]. In the present study, readability scores based on the FRE and FKGL indicated that responses from Gemini and Copilot were statistically significantly easier to read compared to those from ChatGPT-4o. ChatGPT-4o provided the most detailed and comprehensive responses among the AI platforms. However, its college-level readability significantly exceeded the recommended standards for patient education materials. While this level of complexity can benefit users seeking in-depth explanations, it may reduce accessibility for individuals with lower health literacy. In contrast, Copilot offers a better balance between readability and ease of use. Its responses are comparable in clarity to those of Gemini; however, the shortness of the answers sometimes leads to the omission of important details. Notably, when users seek advice rather than general information, Copilot tends to show empathy through supportive language and the use of emojis, reinforcing its identity as a chatbot [29]. There has been a notable increase in studies examining ChatGPT’s role in healthcare, particularly in medicine and dentistry. Many of these studies report findings similar to those of the present study, emphasizing ChatGPT-4o’s strengths in content depth but also pointing to ongoing challenges with readability [29, 30]. A recent study by Camargo et al. compared the responses of ChatGPT-3.5, ChatGPT-4, and Gemini to frequently asked questions about bruxism and reported findings similar to ours, with Gemini scoring higher on readability scales, indicating that its responses were easier to understand [19].
The second null hypothesis was rejected, as all quality and accuracy assessment tools, including mDISCERN, GQS, and misinformation scores, revealed differences among the chatbots. It is important to note that previous studies have primarily relied on the standard DISCERN tool, which is designed to assess the quality and reliability of written information, particularly in the context of treatment-related content. In those studies, ChatGPT often outperformed other chatbots due to its comprehensive and detailed responses [31, 32]. In the present study, we used the modified version, mDISCERN, which places greater emphasis on whether responses include source references and are supported by evidence. This tool helps evaluate not just the content, but also how reliable and verifiable the response is [33]. Gemini achieved the highest mDISCERN scores, largely due to its frequent inclusion of credible references from reliable academic sources such as PubMed. Gemini performed particularly well on “general information” questions, often citing peer-reviewed publications and directing users to specific, highlighted sections within those sources. This approach offered clear, contextual evidence to support its responses. However, Gemini’s referencing was inconsistent. For “advice-based” questions, it often relied on general statements without citing specific sources [34]. In contrast, ChatGPT-4o consistently produced detailed, comprehensive, and well-structured responses. However, it did not include any source attributions. This lack of transparency significantly reduced its mDISCERN scores, as the reliability of even high-quality content could not be independently verified. Although ChatGPT-4o demonstrated superior analytical capabilities and effectively synthesized information, its inability to provide references diminished its perceived reliability compared to Gemini [35].
However, given the known tendency of large language models to generate fabricated or incorrect citations, often referred to as “hallucinated references”, all chatbot-generated references included in the scoring process were manually verified. Our findings were further supported by a recent study in orthodontics by Asiri et al., which reported that only 15.76% of AI-generated citations were accurate, emphasizing the necessity of human verification when evaluating chatbot-generated health content [36].
In the present study, Copilot received lower mDISCERN and GQS scores due to limited referencing and the use of less reliable sources. However, other studies have reported more balanced outcomes. For example, a recent study evaluating the performance of ChatGPT, Gemini, and Copilot in providing breastfeeding-related information found no significant difference between Copilot and Gemini in terms of mDISCERN and GQS scores [37]. Many of the references cited by Copilot lacked academic rigor or originated from less reliable websites, which contributed to its lower mDISCERN scores. While both Gemini and Copilot used web-based sources, including general internet search results, a significant portion of their references were linked to private dental clinics and commercial websites. These sources often appear on the first page of search results, likely due to search engine optimization (SEO) strategies. SEO refers to digital marketing techniques that use specific keywords and other methods to boost a website’s visibility in search engine rankings [38]. For example, private dental clinics with strong SEO strategies or paid advertisements are often prioritized in search results. As a result, AI chatbots that gather data from the web are more likely to include these sites as references. This reliance on SEO-driven or commercial sources introduces a level of bias in the content provided by Gemini and Copilot, as it reflects the prominence of these websites in search engine algorithms rather than their scientific credibility. While such references may offer general guidance, they raise concerns about the quality and neutrality of the information, particularly when compared to peer-reviewed academic sources.
The evaluation of misinformation and GQS scores revealed significant differences among the chatbots, consistent with findings from other recent studies analyzing the performance of AI chatbots in medical and dental fields [39–41]. In the present study, ChatGPT-4o achieved the highest scores in both GQS and misinformation assessments, reflecting its ability to generate detailed, structured and highly accurate responses. This aligns with previous studies indicating that ChatGPT-4o has consistently provided more comprehensive and scientifically accurate information than earlier versions and other platforms [42–44]. In contrast, a study by Naz et al. comparing the accuracy and quality of ChatGPT-3.5, Gemini and Copilot in responses to chronic kidney disease-related questions reported that Gemini received higher GQS ratings across all categories. It is important to note that this study used ChatGPT-3.5, not version 4o, which may partly explain the difference [45]. In the study by Reyhan et al., which evaluated the responses of six AI chatbots to questions about keratoconus, the Gemini model had the highest percentage of responses rated as “good” quality according to the GQS, with 40% of its answers scoring 4 or 5. In contrast, in the present study, all responses generated by ChatGPT-4o, and more than 85% of those generated by Gemini and Copilot, received GQS scores of 4 or 5 [34].
Although the responses regarding deleterious oral habits provided by the three chatbots included limited misinformation, they often presented overly detailed and generalized content, covering a broad range of possibilities rather than offering focused, in-depth information specific to the topic. A few responses did contain inaccurate or potentially misleading advice. For example, in response to bruxism-related questions across different pediatric age groups, Gemini failed to provide age-specific recommendations: it suggested a custom-made nightguard for a 3-year-old child, which is generally not recommended at that age due to developmental and compliance concerns.
In the study conducted by Camargo et al., ChatGPT-3.5, ChatGPT-4, and Gemini were evaluated based on their responses to 30 frequently asked questions about bruxism. Gemini provided no incorrect answers, while ChatGPT-3.5 and ChatGPT-4 gave three and one inaccurate responses, respectively [19]. In the present study, ChatGPT-4o did not produce any misinformation, whereas Gemini and Copilot each provided two inaccurate responses, as reflected by misinformation scores of 1 or 2. The higher accuracy observed for ChatGPT-4o in the present study compared to Gemini may be explained by differences in the chatbot versions and the nature of the questions used. While Camargo et al. evaluated older versions such as ChatGPT-3.5 and ChatGPT-4, our study assessed the performance of the more recent ChatGPT-4o model, using a different set of questions. Notably, ChatGPT-3.5 has also been reported in other studies to produce a higher rate of inaccurate responses compared to other chatbots [13, 30, 46].
In this study, all three chatbots provided potentially misleading recommendations when responding to questions about managing mouth breathing or oral habits during sleep. For example, in response to a question on how to prevent mouth breathing during sleep, all chatbots recommended mouth taping. This practice is strongly discouraged in pediatric populations due to significant safety concerns and the risk of airway obstruction. Although some preliminary studies have explored mouth taping in the context of obstructive sleep apnea in adults, the scientific evidence remains extremely limited [47, 48]. Despite this lack of robust data, mouth taping has gained popularity through social media platforms such as TikTok, raising concerns about the influence of unverified online trends [49, 50]. This example highlights a key limitation of AI chatbots in their inability to critically assess the credibility of sources, which may lead to the spread of potentially unsafe health information.
Additionally, ChatGPT-4o recommended the use of a chin strap to prevent lip sucking during sleep, which is not recommended for children. Such interventions may interfere with normal breathing, lack scientific validation, and can cause discomfort or psychological distress in children. In the literature, chin strap use is documented only in adults, typically as an adjunct to nasal CPAP therapy for obstructive sleep apnea. However, even in that context, its effectiveness remains questionable, with studies indicating limited clinical benefit [51]. These findings highlight the need for chatbot-generated content to be not only accurate but also safe, age-appropriate, and contextually appropriate.
This study is the first to comprehensively evaluate the performance of multiple AI chatbots, including ChatGPT-4o, Gemini, and Copilot, in responding to questions related to deleterious oral habits. This study has several limitations. One of its main limitations is the lack of evaluation tools specifically designed to assess the accuracy and quality of chatbot-generated responses. The instruments used in this study, such as DISCERN and GQS, were originally developed to evaluate online health information presented on platforms like websites and YouTube videos. Therefore, they may not fully reflect the unique features of AI-generated content. Furthermore, because these evaluation tools have often been modified differently across studies, direct comparisons between findings are challenging.
Another limitation is that the consistency or stability of chatbot responses was not evaluated by submitting repeated queries. Since generative AI models may produce varying outputs for identical inputs due to the inherent randomness in their response generation, assessing stability through repetition is important to strengthen the reliability of the findings [52].
In addition, although the questions were derived from real-life patient inquiries, minor modifications in wording or demographic details may have influenced how chatbots interpreted and responded to them. All chatbot queries were conducted on a single day to maintain standardization; however, given that chatbots are continuously updated and trained, achieving perfect standardization remains challenging. Finally, while the readability formulas applied in this study are widely used in health research, there are currently no validated metrics specifically designed to assess the readability of AI-generated text. As a result, existing formulas may not fully capture the conversational style and linguistic nuances of chatbot responses.
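To illustrate why such formulas may miss conversational nuance, the sketch below computes FRE and FKGL from their standard published coefficients, which depend only on word, sentence, and syllable counts. It is a minimal illustration and not the tool used in this study: the function names, the vowel-group syllable heuristic, and the punctuation-based sentence splitting are all assumptions made for the example.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels.

    This heuristic is an assumption for illustration; dedicated
    readability tools use more refined rules or dictionaries.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) for a plain-text response."""
    # Sentences are split on terminal punctuation (a simplification).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For a chatbot response stored in a string `response`, `flesch_reading_ease(*text_counts(response))` gives an approximate FRE score. Because both formulas use only sentence length and syllable density, features such as bullet lists, headings, or a conversational tone do not enter the score directly.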
Conclusion
AI chatbots including ChatGPT-4o, Gemini, and Copilot offer a convenient way for users to access information on deleterious oral habits. However, their effectiveness in providing accurate and reliable health guidance varies. ChatGPT-4o achieved the best overall performance among the evaluated chatbots, consistently producing responses of good to excellent quality, with the highest proportion of fully correct and comprehensive answers (Score 5) and no misleading content. In contrast, Gemini and Copilot, despite showing good quality ratings in some responses, generated a small proportion of poor-quality or incorrect statements, which lowered their overall accuracy and performance.
Although ChatGPT-4o outperformed the other two models in overall performance, its higher reading level may limit accessibility, and the lack of source attribution remains a concern. While AI chatbots can serve as useful educational tools, their responses should not replace professional dental advice. Future developments should aim to improve readability, reference transparency, and user-friendliness while ensuring evidence-based content.
Supplementary Information
Acknowledgements
Not applicable.
Abbreviations
- AI: Artificial Intelligence
- FRE: Flesch Reading Ease
- FKGL: Flesch-Kincaid Grade Level
- GQS: Global Quality Scale
- mDISCERN: Modified DISCERN
- SEO: Search Engine Optimization
Authors’ contributions
OTO and MY contributed to data collection and interpretation and drafted the initial manuscript. YG conceptualized the study, performed the data analysis and interpretation, and contributed to the final writing. All authors critically reviewed and approved the final version of the manuscript.
Funding
The authors did not receive support from any organization for the submitted work.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author Omer Tarik Ozdemir (omertarik.ozdemir@ogr.iu.edu.tr) upon reasonable request.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Abd-Elsabour MAAA, Hanafy RMH, Omar OM. Effect of self-perceived oral habits on orofacial dysfunction and oral health-related quality of life among a group of egyptian children: A cohort study. Eur Arch Paediatr Dent. 2022;23(6):935–44.
- 2. Lawal FB, Idiga E, Fagbule OF, Ajayi IJ, Amusa F, Adejumo O, et al. Association between self-reported oral habits and oral health related quality of life of adolescents in ibadan, nigeria. PLOS Glob Public Health. 2024;4(5):e0003218.
- 3. Dhull KS, Verma T, Dutta B. Prevalence of deleterious oral habits among 3- to 5-year-old preschool children in bhubaneswar, odisha, india. Int J Clin Pediatr Dent. 2018;11(3):210–3.
- 4. Garde JB, Suryavanshi RK, Jawale BA, Deshmukh V, Dadhe DP, Suryavanshi MK. An epidemiological study to know the prevalence of deleterious oral habits among 6 to 12 year old children. J Int Oral Health. 2014;6(1):39–43.
- 5. Jan HE, Abuhamda ISB, Assiri AT, Samanodi HS, Alsulami AA, Alghamdi M, et al., editors. Meta-analysis of prevalence of bad oral habits and relationship with prevalence of malocclusion. 2017;11(4):111–7.
- 6. Baeshen HA. Malocclusion trait and the parafunctional effect among young female school students. Saudi J Biol Sci. 2021;28(1):1088–92.
- 7. Kolawole KA, Folayan MO, Agbaje HO, Oyedele TA, Onyejaka NK, Oziegbe EO. Oral habits and malocclusion in children resident in ile-ife nigeria. Eur Arch Paediatr Dent. 2019;20(3):257–65.
- 8. Almugairin S, Alwably A, Alayed N, Algazlan A, Alrowaily H, Eldwakhly E, et al. Parental knowledge, awareness, and attitudes towards children’s oral habits: A descriptive cross-sectional study. Acta Odontol Scand. 2025;84:65–75.
- 9. Madathil KC, Rivera-Rodriguez AJ, Greenstein JS, Gramopadhye AK. Healthcare information on youtube: A systematic review. Health Informatics J. 2015;21(3):173–94.
- 10. Hakami Z, Maganur PC, Khanagar SB, Naik S, Alhakami K, Bawazeer OA, et al. Thumb-sucking habits and oral health: An analysis of youtube content. Children (Basel). 2022;9(2):225.
- 11. Amjad A, Kordel P, Fernandes G. A review on innovation in healthcare sector (telehealth) through artificial intelligence. Sustainability. 2023;15(8):6655.
- 12. Oviedo-Trespalacios O, Peden AE, Cole-Hunter T, Costantini A, Haghani M, Rod JE, et al. The risks of using chatgpt to obtain common safety-related information and advice. Safety Science. 2023;167:106244.
- 13. Guven Y, Ozdemir OT, Kavan MY. Performance of artificial intelligence chatbots in responding to patient queries related to traumatic dental injuries: A comparative study. Dent Traumatol. 2024;41(3):338–47.
- 14. Kuru HE, Asik A, Demir DM. Can artificial intelligence language models effectively address dental trauma questions? Dent Traumatol. 2025;41(5):567–80.
- 15. Hatia A, Doldo T, Parrini S, Chisci E, Cipriani L, Montagna L, et al. Accuracy and completeness of chatgpt-generated information on interceptive orthodontics: A multicenter collaborative study. J Clin Med. 2024;13(3):735.
- 16. Zhou X, Chen Y, Abdulghani EA, Zhang X, Zheng W, Li Y. Performance in answering orthodontic patients’ frequently asked questions: Conversational artificial intelligence versus orthodontists. J World Fed Orthod. 2025;14(4):202–7.
- 17. Dermata A, Arhakis A, Makrygiannakis MA, Giannakopoulos K, Kaklamanos EG. Evaluating the evidence-based potential of six large language models in paediatric dentistry: A comparative study on generative artificial intelligence. Eur Arch Paediatr Dent. 2025;26:527–35.
- 18. Bayraktar Nahir C. Can chatgpt be guide in pediatric dentistry? BMC Oral Health. 2025;25(1):9.
- 19. Camargo ES, Quadras ICC, Garanhani RR, de Araujo CM, Stuginski-Barbosa J. A comparative analysis of three large language models on bruxism knowledge. J Oral Rehabil. 2025;52(6):896–903.
- 20. Flesch R. A new readability yardstick. J Appl Psychol. 1948;32(3):221–33.
- 21. Kincaid P, Fishburne RP, Rogers RL, Chissom BS, editors. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. 1975.
- 22. Charnock D, Shepperd S, Needham G, Gann R. Discern: An instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health. 1999;53(2):105–11.
- 23. Ghanem YK, Rouhi AD, Al-Houssan A, Saleh Z, Moccia MC, Joshi H, et al. Dr. Google to dr. Chatgpt: Assessing the content and quality of artificial intelligence-generated medical information on appendicitis. Surg Endosc. 2024;38(5):2887–93.
- 24. Bernard A, Langille M, Hughes S, Rose C, Leddin D, Veldhuyzen van Zanten S. A systematic review of patient inflammatory bowel disease information resources on the world wide web. Am J Gastroenterol. 2007;102(9):2070–7.
- 25. Mohammad-Rahimi H, Ourang SA, Pourhoseingholi MA, Dianat O, Dummer PMH, Nosrat A. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endod J. 2024;57(3):305–14.
- 26. Boutemen L, Miller AN. Readability of publicly available mental health information: A systematic review. Patient Education and Counseling. 2023;111:107682.
- 27. Brach C. Ahrq health literacy universal precautions toolkit. Rockville, MD: Agency for Healthcare Research and Quality. AHRQ Publication No. 23–0075. 2023;23(24):0091–EF.
- 28. Weiss BD, Doak LG, Doak CC. Health literacy : a manual for clinicians : part of an educational program about health literacy. Chicago: AMA Foundation, American Medical Association; 2003.
- 29. Aydın FO, Aksoy BK, Ceylan A, Akbaş YB, Ermiş S, Kepez Yıldız B, et al. Readability and appropriateness of responses generated by chatgpt 3.5, chatgpt 4.0, gemini, and microsoft copilot for faqs in refractive surgery. Turk J Ophthalmol. 2024;54(6):313–7.
- 30. Dursun D, Bilici Geçer R. Can artificial intelligence models serve as patient information consultants in orthodontics? BMC Med Inform Decis Mak. 2024;24(1):211.
- 31. Durmaz Engin C, Karatas E, Ozturk T. Exploring the role of chatgpt-4, bingai, and gemini as virtual consultants to educate families about retinopathy of prematurity. Children (Basel). 2024;11(6):750.
- 32. Xie Y, Seth I, Hunter-Smith DJ, Rozen WM, Seifman MA. Investigating the impact of innovative ai chatbot on post-pandemic medical education and clinical assistance: A comprehensive analysis. ANZ J Surg. 2024;94(1–2):68–77.
- 33. Akdemir E, Çiçek M, Çakmak BS, Akbulut ML, Buğday MS. Educational quality of youtube(™) videos on laparoscopic radical prostatectomy. J Laparoendosc Adv Surg Tech A. 2025;35(5):373–78.
- 34. Reyhan AH, Mutaf Ç, Uzun İ, Yüksekyayla F. A performance evaluation of large language models in keratoconus: A comparative study of chatgpt-3.5, chatgpt-4.0, gemini, copilot, chatsonic, and perplexity. J Clin Med. 2024;13(21).
- 35. Behers BJ, Stephenson-Moe CA, Gibons RM, Vargas IA, Wojtas CN, Rosario MA, et al. Assessing the quality of patient education materials on cardiac catheterization from artificial intelligence chatbots: An observational cross-sectional study. Cureus. 2024;16(9):e69996.
- 36. Asiri SN. Assessing the reliability of chatgpt and gemini in identifying relevant orthodontic literature. European Journal of General Dentistry. 2025. (EFirst).
- 37. Kacer EO. Evaluating ai-based breastfeeding chatbots: Quality, readability, and reliability analysis. PLoS One. 2025;20(3):e0319782.
- 38. Shahzad A, Mohd Nawi N, Hamid N, Khan S, Aamir M, Ulah A, et al. The impact of search engine optimization on the visibility of research paper and citations. JOIV : International Journal on Informatics Visualization. 2017;1(no. 4-2):195–8.
- 39. Bragazzi NL, Buchinger M, Atwan H, Tuma R, Chirico F, Szarpak L, et al. Proficiency, clarity, and objectivity of large language models versus specialists’ knowledge on covid-19’s impacts in pregnancy: Cross-sectional pilot study. JMIR Form Res. 2025;9:e56126.
- 40. Tokgöz Kaplan T, Cankar M. Evidence-based potential of generative artificial intelligence large language models on dental avulsion: Chatgpt versus gemini. Dent Traumatol. 2025;41(2):178–86.
- 41. Barlas İ, Tunç L. Quality of chatbot responses to the most popular questions regarding erectile dysfunction. Urol Res Pract. 2025;50(4):253–60.
- 42. Hancı V, Ergün B, Gül Ş, Uzun Ö, Erdemir İ, Hancı FB. Assessment of readability, reliability, and quality of chatgpt®, bard®, gemini®, copilot®, perplexity® responses on palliative care. Medicine. 2024;103(33):e39305.
- 43. Mavrych V, Ganguly P, Bolgova O. Using large language models (chatgpt, copilot, palm, bard, and gemini) in gross anatomy course: Comparative analysis. Clin Anat. 2025;38(2):200–10.
- 44. Cetin HK, Demir T. Assessing the knowledge of chatgpt and google gemini in answering peripheral artery disease-related questions. Vascular. 2025:17085381251315999.
- 45. Naz R, Akacı O, Erdoğan H, Açıkgöz A. Can large language models provide accurate and quality information to parents regarding chronic kidney diseases? J Eval Clin Pract. 2024;30(8):1556–64.
- 46. Özbay Y, Erdoğan D, Dinçer GA. Evaluation of the performance of large language models in clinical decision-making in endodontics. BMC Oral Health. 2025;25(1):648.
- 47. Labarca G, Sands SA, Cohn V, Demko G, Vena D, Messineo L, et al. Mouth closing to improve the efficacy of mandibular advancement devices in sleep apnea. Ann Am Thorac Soc. 2022;19(7):1185–92.
- 48. Lee YC, Lu CT, Cheng WN, Li HY. The impact of mouth-taping in mouth-breathers with mild obstructive sleep apnea: A preliminary study. Healthcare (Basel). 2022;10(9).
- 49. Fangmeyer SK, Badger CD, Thakkar PG. Nocturnal mouth-taping and social media: A scoping review of the evidence. Am J Otolaryngol. 2025;46(1):104545.
- 50. O’Halloran KD. Mouth taping: A little less conversation, a little more action, please! J Physiol. 2024;602(15):3605–7.
- 51. Mansell SK, Devani N, Shah A, Schievano S, Main E, Mandal S. Current treatment strategies in managing side effects associated with domiciliary positive airway pressure (pap) therapy for patients with sleep disordered breathing: A systematic review and meta-analysis. Sleep Med Rev. 2023;72:101850.
- 52. Zhu L, Mou W, Hong C, Yang T, Lai Y, Qi C, et al. The evaluation of generative ai should include repetition to assess stability. JMIR Mhealth Uhealth. 2024;12:e57978.