Abstract
Background
Generative Artificial Intelligence (AI) models, such as ChatGPT (GPT-4) and Gemini, offer potential benefits in educational settings, including dental education. These tools have shown promise in enhancing learning and assessment processes, particularly in dental prosthetic technology (DPT) and oral health (OH) programs.
Objective
This study aimed to evaluate the accuracy, reliability, and consistency of GPT-4 and Gemini AI models in answering examination questions in dental education. The study focused on multiple-choice questions (MCQs), true/false (T/F) questions, and short-answer questions (SAQs).
Methods
An exploratory study design was used with 30 questions (10 MCQs, 10 T/F, and 10 SAQs) covering key topics in DPT and OH education. ChatGPT and Gemini were tested with the same set of questions on two separate occasions to assess consistency. Responses were evaluated by two independent researchers using a predefined answer key. Data were analyzed using descriptive statistics, the Kappa coefficient for agreement, and the Chi-square test for categorical variables.
Results
ChatGPT demonstrated high accuracy in MCQs (90%) and T/F questions (85%) but showed reduced performance in SAQs (60%). Gemini’s accuracy ranged between 60% and 70%, with the highest accuracy in SAQs (70%). ChatGPT showed substantial agreement across the two testing dates (Kappa = 0.754; p = 0.001), whereas Gemini’s responses were less consistent (Kappa = 0.634; p = 0.001).
Conclusion
While both AI models offer valuable support in dental education, ChatGPT exhibited greater accuracy and consistency in structured assessments. The findings suggest that AI tools can enhance teaching and assessment methods if integrated thoughtfully, supporting personalized learning while maintaining academic integrity.
Keywords: ChatGPT, GPT-4, Gemini, Artificial intelligence, Dental education, Assessment methods, Dental prosthetics, Oral health
Introduction
The integration of artificial intelligence (AI) into education has rapidly evolved, offering transformative potential for teaching, learning, and assessment practices. Among the most advanced AI technologies are natural language processing (NLP) models such as ChatGPT (GPT-4) by OpenAI and Gemini by Google DeepMind. These generative AI systems are capable of producing human-like responses and have shown promise in various academic contexts, including specialized domains like dental education [1, 2]. Gemini, the successor to Google’s LaMDA-based (Language Model for Dialogue Applications) Bard chatbot, is optimized for dialogue-based learning and clinical reasoning support [3].
In dental education—particularly in disciplines like dental prosthetic technology (DPT) and oral health (OH)—student assessments often go beyond simple recall, requiring critical thinking and application of clinical knowledge. Generative AI tools can assist in these domains by offering immediate feedback and enhancing self-directed learning. However, their effectiveness across different types of assessment formats remains inconsistent, especially in open-ended or context-dependent questions [4, 5].
Several recent studies have explored the role of AI in dental training. For instance, Revilla-León et al. (2024) demonstrated the applicability of ChatGPT in responding to implant dentistry certification exam questions, while Danesh et al. (2024) evaluated its utility in periodontal in-service training [6, 7]. Chau et al. (2024) further investigated the performance of generative AI in dental licensing assessments, revealing fluctuations in accuracy depending on question complexity [8]. These findings highlight both the promise and limitations of AI integration in specialized educational domains.
Despite the growing body of research, there is limited focus on how AI models perform in practical and highly specialized areas like DPT and OH, where content precision directly affects patient outcomes. These fields require a deep understanding of prosthetic materials, anatomical morphology, and treatment workflows, making them suitable benchmarks for AI’s clinical relevance [9].
Moreover, external factors—such as question phrasing, language ambiguity, and the inherent stochasticity of large language models—can significantly affect AI output. The linguistic training data, prompt structure, and underlying algorithmic design may lead to variability in response quality and consistency [10, 11].
This study hypothesizes that while GPT-4 and Gemini may demonstrate strong performance in structured formats such as multiple-choice questions (MCQs) and true/false (T/F) statements, their effectiveness will decline in short-answer questions (SAQs) due to the open-ended and interpretative nature of such tasks. The null hypothesis posits that there will be no significant difference in accuracy or consistency across AI-generated responses in SAQs.
Study objective
The primary objective of this study is to evaluate and compare the performance of GPT-4 (ChatGPT) and Gemini in answering MCQs, T/F, and SAQs within the context of DPT and OH education. By analyzing accuracy, consistency, and response variability, the study seeks to provide pedagogical insights into the responsible integration of AI tools in specialized dental curricula.
Materials and methods
Study design
This study employed an exploratory design to assess the accuracy and consistency of ChatGPT (GPT-4) and Gemini AI models in answering exam questions commonly used in dental prosthetic technology (DPT) and oral health (OH) education. The research focused on three question formats: multiple-choice questions (MCQs), true/false (T/F) questions, and short-answer questions (SAQs).
A total of 30 questions (10 for each format) were developed to reflect real-world assessments in DPT and OH curricula. These questions encompassed a mix of knowledge-based and applied topics to evaluate both AI models’ ability to handle objective and open-ended formats. The questions were administered on two separate occasions, February 4th and February 15th, 2024, to assess consistency over time.
Sample size and standardization of questions
The sample size of 30 questions was determined based on prior studies indicating that this quantity is sufficient to identify performance trends and assess the reliability of AI responses. A standardized set of questions was presented to both AI models to ensure a fair comparison. All questions were provided in both Turkish and English to evaluate the impact of language on the models’ performance.
The decision to include ten questions per format was informed by previous educational research and aimed to strike a balance between analytical depth and feasibility. This quantity allowed for meaningful cross-format comparisons while maintaining clarity and consistency in evaluating AI performance within a structured educational framework.
Assessment procedure
Responses generated by ChatGPT and Gemini were independently evaluated by two researchers (ŞB and KYD) using a predefined answer key. Each response was scored dichotomously: 1 point for a correct answer and 0 for an incorrect answer. Discrepancies in scoring between the researchers were resolved through discussion until consensus was reached.
To enhance transparency, examples of the questions used (including MCQs, T/F, and SAQs) are provided in Supplementary Material 1. The total number of correctly answered questions for each AI model was recorded and analyzed.
The predefined answer key comprised model answers and scoring rubrics developed by experienced educators in prosthodontics and dental public health. Evaluation criteria focused on factual accuracy, contextual relevance, and clarity of expression, particularly for short-answer questions where variability in interpretation was more likely.
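For illustration, the dichotomous scoring step can be expressed in a few lines of code. The sketch below uses hypothetical question IDs and answers rather than the study’s actual key, and it applies exact string matching, which is only appropriate for objective items; in the study, SAQ responses were scored by the two raters against the rubric.

```python
# Minimal sketch of dichotomous scoring (1 = correct, 0 = incorrect).
# Question IDs and answers are hypothetical placeholders, not study data.
answer_key = {"MCQ01": "B", "MCQ02": "D", "TF01": "True", "TF02": "False"}
model_answers = {"MCQ01": "B", "MCQ02": "A", "TF01": "True", "TF02": "False"}

scores = {
    qid: int(model_answers.get(qid, "").strip().lower() == expected.lower())
    for qid, expected in answer_key.items()
}
print(scores)                                   # {'MCQ01': 1, 'MCQ02': 0, ...}
print(f"{sum(scores.values())}/{len(scores)} correct")
```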
Controlled testing and elimination of external factors
Testing conditions were standardized to minimize external influences. Both AI models were tested under similar conditions, with careful control of variables such as input phrasing, session settings, and contextual prompts. A pilot study was conducted before the main study to verify that the questions were clear and that the AI models could interpret them without bias.
Statistical analysis
Descriptive statistics were calculated as frequencies and percentages. The consistency of AI responses between the two test dates was evaluated using the Kappa coefficient, interpreted according to the standard ranges summarized in Table 1 (e.g., Kappa of 0.61–0.80 indicating substantial agreement). Relationships between categorical variables were analyzed using the Chi-square test. All analyses were performed using SPSS 22.0 software, with a significance threshold set at p < 0.05.
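As an illustration of this analysis pipeline, the sketch below recomputes Cohen’s Kappa and a Chi-square test in Python; the study itself used SPSS. The 0/1 score vectors are reconstructed from the ChatGPT cell counts later reported in Table 3, so the per-question ordering is an assumption made only for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# 1 = correct, 0 = incorrect. Vectors follow the Table 3 cell counts:
# 20 questions correct on both dates, 2 correct only on Feb 4,
# 1 correct only on Feb 15, and 7 incorrect on both dates.
feb_04 = np.array([1] * 20 + [1] * 2 + [0] * 1 + [0] * 7)
feb_15 = np.array([1] * 20 + [0] * 2 + [1] * 1 + [0] * 7)

kappa = cohen_kappa_score(feb_04, feb_15)
print(f"Cohen's Kappa: {kappa:.3f}")  # ~0.754, matching Table 3

# Chi-square test on the 2x2 cross-tabulation of the two dates.
crosstab = np.array([[20, 2],   # correct on Feb 4: correct / incorrect on Feb 15
                     [1, 7]])   # incorrect on Feb 4: correct / incorrect on Feb 15
chi2, p, dof, expected = chi2_contingency(crosstab)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```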
Results
Accuracy and consistency analysis of AI models
ChatGPT demonstrated higher overall accuracy across most question formats, with correct response rates of 90% in multiple-choice questions (MCQs), 85% in true/false (T/F) questions, and 60% in short-answer questions (SAQs). Its performance was strongest in structured formats (MCQs and T/F), where questions required factual recall or clear decision-making. Gemini’s accuracy ranged from 60 to 70%, with its highest result in SAQs (70%), but performance was less consistent across formats.
Consistency analysis showed that ChatGPT achieved substantial agreement between the two testing sessions (Kappa = 0.754, p = 0.001), indicating stable performance over time. Gemini’s between-session agreement was lower (Kappa = 0.634, p = 0.001), with greater variability in repeated responses to similar prompts.
When examined by question type, ChatGPT maintained higher accuracy than Gemini in MCQs and T/F questions. In SAQs, Gemini’s numerical score was higher (70% vs. 60%); however, qualitative review indicated that ChatGPT’s responses, even when incomplete, more frequently matched the intended content in the reference answers.
Evaluator notes, though not part of a formal learner feedback process, recorded that ChatGPT’s answers were generally more structured and complete in objective formats. In contrast, Gemini occasionally produced shorter or less detailed SAQ responses.
Discussion
This study is among the first to comprehensively evaluate and compare the performance of generative AI tools, specifically ChatGPT (GPT-4) and Gemini, within the context of dental prosthetic technology (DPT) and oral health (OH) education assessments. The findings demonstrate that while both AI models have considerable potential to assist in educational evaluations, their performance varies notably by question format and complexity. ChatGPT exhibited superior accuracy in structured formats—multiple-choice questions (MCQs: 90%) and true/false (T/F: 85%)—and maintained substantial reliability between testing sessions (Kappa = 0.754, p = 0.001). In contrast, Gemini’s performance ranged from 60 to 70%, with greater variability in reproducibility (Kappa = 0.634, p = 0.001). Both models showed reduced accuracy in short-answer questions (SAQs), highlighting persistent challenges in generating nuanced, context-specific responses.
These results align with previous literature on AI performance in dental and medical education. Revilla-León et al. (2024) reported that ChatGPT performed strongly in structured dental implant certification assessments, particularly MCQs [7]. Similarly, Danesh et al. (2024) observed high generative AI accuracy in periodontics in-service examinations, with lower performance in open-ended formats [2]. Chau et al. (2024) documented variability in AI performance for dental licensing examinations, consistent with Gemini’s fluctuating accuracy in this study [3]. Dashti et al. (2024) found that ChatGPT scored highly on the INBDE, ADAT, and DAT examinations in knowledge-based items but showed reduced reliability in complex reasoning tasks [8]. In medical education, Gilson et al. (2023) confirmed high ChatGPT accuracy in standardized exams [10], while Budhwar et al. (2023) cautioned against over-reliance on AI, recommending integration as a complement rather than a replacement for traditional pedagogy [12].
ChatGPT’s consistent high performance in MCQs and T/F questions suggests strong potential for reinforcing factual knowledge and providing immediate formative feedback. Its high reliability across sessions enhances its value as a stable assessment resource. Both AI models demonstrated adaptability across different formats, supporting the possibility of integration into adaptive learning systems. However, the marked reduction in SAQ performance—particularly for Gemini—emphasizes the difficulty of producing detailed, contextually accurate responses. This limitation is critical in dental education, where complex clinical scenarios demand nuanced reasoning. Model architecture differences, as well as potential algorithm updates, may contribute to performance variability. Additionally, AI-generated answers raise concerns about academic integrity; without safeguards, students might overuse AI in ways that undermine intended learning outcomes.
The integration of AI tools such as ChatGPT and Gemini into dental education offers opportunities for enhancing personalized learning, formative assessment, and self-directed study. AI can provide immediate feedback in objective formats, supporting identification of knowledge gaps, as emphasized by Samaranayake (2025) [6] and Perera & Lankathilake (2023) [9]. Educators may incorporate AI-generated examples into discussions or problem-based learning, as supported by Ali et al. (2023) [13] and Baidoo-Anu & Ansah (2023) [14], who demonstrated that blended AI-enhanced learning can improve engagement and critical thinking. However, to mitigate risks of academic dishonesty, strategies such as oral examinations, practical assessments, and requiring students to critically evaluate AI-generated outputs are recommended. Ethical guidelines should be developed at the institutional level to regulate AI use in coursework and examinations, as advocated by Gerke et al. (2020) [15] and Susnjak (2022) [16].
This study ensured fairness by using a standardized set of questions across two testing dates, with objective scoring for MCQs and T/F questions and qualitative evaluation for SAQs. This dual approach allowed a more comprehensive understanding of AI capabilities beyond accuracy percentages. Qualitative analysis of SAQ responses revealed that while Gemini occasionally produced more concise answers, ChatGPT’s outputs were more contextually aligned with expected responses. This underscores the importance of assessing AI not only on correctness but also on depth, clarity, and clinical relevance—critical in health sciences education.
While AI offers transformative potential, disparities in digital access may hinder equitable adoption. Institutions should address the digital divide by providing institutional AI access, structured training, and digital literacy programs to ensure fair use among students.
Table 1.
Kappa Coefficient Interpretation Ranges
| Kappa Value | Interpretation |
|---|---|
| < 0 | No agreement |
| 0.00–0.20 | Negligible agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–1.00 | Almost perfect agreement |
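When scripting the analysis, the bands in Table 1 can be encoded as a small helper; this is a convenience sketch rather than part of the study’s SPSS workflow.

```python
def interpret_kappa(kappa: float) -> str:
    """Return the Table 1 interpretation band for a Kappa value."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("Kappa must lie in [-1, 1]")
    if kappa < 0:
        return "No agreement"
    for upper, label in [(0.20, "Negligible agreement"),
                         (0.40, "Fair agreement"),
                         (0.60, "Moderate agreement"),
                         (0.80, "Substantial agreement"),
                         (1.00, "Almost perfect agreement")]:
        if kappa <= upper:
            return label

print(interpret_kappa(0.754))  # Substantial agreement (ChatGPT)
print(interpret_kappa(0.634))  # Substantial agreement (Gemini)
```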
Limitations of the study
This study has several limitations. First, the assessment was static and conducted at a single point in time, which may not fully reflect the rapidly evolving capabilities of AI models such as GPT-4 and Gemini. Second, the relatively small sample size of 30 questions and the use of data from a single institution may limit the generalizability of the findings. Third, the evaluation was restricted to written question formats (MCQs, T/F, and SAQs); other assessment formats commonly used in dental education, such as practical or clinical examinations, were not evaluated. Finally, the study compared two AI models without considering future updates or alternative large language models that may yield different results. Future research should adopt longitudinal designs, include larger and more diverse datasets, and explore a broader range of assessment types to provide a more comprehensive understanding of AI integration in dental education (Table 2).
Table 2.
Differences in Getting Correct Answers Between Question Types
| Category | T/F Correct n (%) | MCQ Correct n (%) | SAQ Correct n (%) | p-value |
|---|---|---|---|---|
| ChatGPT February 4th | 7 (70%) | 9 (90%) | 6 (60%) | 0.270 |
| ChatGPT February 15th | 7 (70%) | 8 (80%) | 6 (60%) | 0.617 |
| Gemini February 4th | 6 (60%) | 6 (60%) | 7 (70%) | 0.864 |
| Gemini February 15th | 7 (70%) | 6 (60%) | 7 (70%) | 0.862 |
Future research directions
Future research should include longitudinal studies that evaluate the impact of AI tools on student learning outcomes over extended periods. It is also important to expand research to diverse educational settings by including different levels of dental education and comparing AI performance in both undergraduate and postgraduate programs. Another essential direction is investigating whether AI-assisted learning supports or hinders the development of critical thinking and clinical decision-making skills in dental students (Table 3).
Table 3.
Internal Consistency of ChatGPT Between Two Days
| Response on February 15th | Correct on February 4th (n = 22) | Incorrect on February 4th (n = 8) | Kappa Value | p-value |
|---|---|---|---|---|
| Correct | 20 (90.9%) | 1 (12.5%) | 0.754 | 0.001* |
| Incorrect | 2 (9.1%) | 7 (87.5%) | | |
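As a check, the Kappa value in Table 3 can be recovered by hand from the cell counts above, using the standard formula with observed agreement $p_o$ and chance agreement $p_e$ computed from the marginal totals:

$$p_o = \frac{20 + 7}{30} = 0.900, \qquad p_e = \frac{22}{30}\cdot\frac{21}{30} + \frac{8}{30}\cdot\frac{9}{30} \approx 0.593,$$

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.900 - 0.593}{1 - 0.593} \approx 0.754.$$

The same computation applied to the Gemini counts in Table 4 yields $\kappa \approx 0.634$.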
Generative AI tools like ChatGPT and Gemini present promising opportunities for enhancing educational assessments in dental education. While ChatGPT demonstrated higher accuracy and consistency, particularly in structured formats, both models exhibited challenges with short-answer questions. These findings suggest that AI should be viewed as a supplementary tool, with careful consideration of its limitations. By adopting a strategic and balanced approach, educators can harness AI technologies to improve educational outcomes while maintaining high academic standards (Table 4).
Table 4.
Internal Consistency of Gemini Between Two Days
| Response on February 15th | Correct on February 4th (n = 19) | Incorrect on February 4th (n = 11) | Kappa Value | p-value |
|---|---|---|---|---|
| Correct | 17 (89.5%) | 3 (27.3%) | 0.634 | 0.001* |
| Incorrect | 2 (10.5%) | 8 (72.7%) | | |
*Significant at p < 0.05 (Chi-square test and Kappa coefficient)
Conclusion
This study provides a comprehensive evaluation of generative AI tools, specifically ChatGPT (GPT-4) and Gemini, in the context of dental education assessments. The findings demonstrate that while both AI models exhibit strong performance in structured assessment formats such as multiple-choice questions and true/false questions, their ability to generate accurate and contextually appropriate responses to short-answer questions is more variable. ChatGPT outperformed Gemini in terms of accuracy and consistency, highlighting its potential as a more reliable tool for supporting educational assessments.
Key findings of the study include the observation that ChatGPT achieved up to 90% accuracy in multiple-choice questions, showcasing its ability to handle structured and knowledge-based queries effectively. The performance of both AI models was lower in short-answer questions, with ChatGPT showing higher accuracy than Gemini but still struggling with complex, open-ended questions. Consistency analysis revealed that ChatGPT maintained a higher level of response reliability across different testing sessions compared to Gemini, as indicated by a Kappa coefficient of 0.754 versus 0.634 for Gemini. The study also identified variability in AI responses influenced by the phrasing of questions and contextual factors, underscoring the need for standardized question formats when utilizing AI in assessments.
Implications for dental education
The study highlights the potential of generative AI tools as valuable educational aids in dental prosthetic technology and oral health programs. Educators can utilize AI to enhance personalized learning experiences by providing students with immediate feedback and tailored study recommendations. These tools can also support formative assessments, allowing students to self-evaluate their knowledge before formal exams. Furthermore, AI can offer supplementary educational materials that can be integrated into digital learning platforms. However, the variability in AI performance, particularly in short-answer questions, emphasizes the necessity for educators to approach these tools with caution. While AI can assist in automating certain aspects of teaching and assessment, it should not replace human judgment or the critical evaluation skills that are integral to clinical education.
Strategies for effective integration of AI
Effective integration of AI into dental education requires thoughtful curriculum design that incorporates blended learning approaches combining AI tools with traditional teaching methods. Ensuring assessment integrity is essential, and this can be achieved by creating formats that minimize opportunities for AI misuse, such as practical demonstrations, oral exams, and case-based assessments. Educators should also receive professional development to understand the capabilities and limitations of AI tools, enabling them to guide students effectively. Finally, systems for monitoring and providing feedback on AI-generated content should be implemented to ensure alignment with educational objectives and academic standards.
Limitations and future directions
The primary limitation of this study is its focus on a specific set of questions within dental education, which may not fully capture the generalizability of AI performance across different subjects and educational contexts. Additionally, the study’s design did not account for the rapid evolution of AI technologies, which may influence future performance outcomes. Future research should explore longitudinal studies to assess how AI integration affects learning outcomes over time, particularly focusing on critical thinking, problem-solving, and clinical decision-making skills in dental education.
Generative AI tools like ChatGPT and Gemini present significant opportunities for enhancing dental education by supporting both teaching and assessment processes. However, to harness their full potential, educators and institutions must adopt a balanced approach that leverages AI for learning support while maintaining robust academic integrity measures. As AI technologies continue to evolve, ongoing research and adaptive educational strategies will be essential to ensure these tools contribute positively to student learning and clinical preparedness in dental and healthcare education.
Acknowledgements
The authors would like to express their gratitude to the institutions and colleagues who contributed to the development of this study. Special thanks to the students and academic staff who provided valuable insights during the study design phase.
Authors’ contributions
• **Kübra Yıldız Domaniç:** Conceptualization, Methodology, Data Collection, Writing—Original Draft Preparation, Project Administration.
• **Şükran Baycan:** Data Analysis, Validation, Writing—Review & Editing, Supervision.
Both authors have read and approved the final version of the manuscript.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Declarations
Ethics approval and consent to participate
This study did not involve human participants, animals, or clinical data requiring ethical approval. Therefore, ethical approval and consent to participate are not applicable.
Consent for publication
Not applicable. No personal or identifiable data are included in this manuscript.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Bao Y. A comprehensive investigation for ChatGPT’s applications in education. Appl Comput Eng. 2024;35:116–22.
2. Google DeepMind. Gemini: our next generation foundation model. 2023. Available from: https://deepmind.google/technologies/gemini/
3. Thoppilan R, De Freitas D, Hall J, et al. LaMDA: language models for dialog applications. arXiv. 2022. Available from: https://arxiv.org/abs/2201.08239
4. Hao Y. The application and challenges of ChatGPT in educational transformation: new demands for teachers’ roles. Heliyon. 2024;10:e24289. [Retracted]
5. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys Syst. 2023;3:121–54.
6. Samaranayake LP. The transformative role of artificial intelligence in dentistry: a comprehensive overview. Part 1: fundamentals of AI and its contemporary applications in dentistry. Int Dent J. 2025. 10.1016/j.identj.2025.02.005
7. Revilla-León M, et al. ChatGPT’s performance on the European certification in implant dentistry examination. J Prosthodont. 2024;33(2):145–52.
8. Dashti M, Ghasemi S, Ghadimi N, et al. Performance of ChatGPT 3.5 and 4 on U.S. dental examinations: the INBDE, ADAT, and DAT. Imaging Sci Dent. 2024;54(3):271–5. 10.5624/isd.20240037
9. Perera P, Lankathilake M. AI in higher education: a literature review of ChatGPT and guidelines for responsible implementation. Educ Adv. 2023;12:307–14.
10. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
11. Rojanasarot S, Milone A, Balestrieri R, Pittenger AL. Personalized learning in an online drugs and US health care system controversies course. Am J Pharm Educ. 2018;82(8):1–10.
12. Budhwar P, Chowdhury S, Wood G, et al. Human resource management in the age of generative artificial intelligence: perspectives and research directions on ChatGPT. Hum Resour Manag J. 2023;33(3):606–59.
13. Ali K, Alhaija ESA, Raja M, et al. Blended learning in undergraduate dental education: a global pilot study. Med Educ Online. 2023;28(1):2171700.
14. Baidoo-Anu D, Ansah LO. Education in the era of generative artificial intelligence (AI): understanding the potential benefits of ChatGPT in promoting teaching and learning. J AI. 2023;7(1):52–62.
15. Gerke S, Minssen T. Ethical and legal challenges of artificial intelligence-driven healthcare. In: Bohr A, Memarzadeh K, editors. Artificial intelligence in healthcare. Netherlands: Elsevier; 2020. p. 295–336. 10.1016/B978-0-12-818438-7.00012-5
16. Susnjak T. ChatGPT: the end of online exam integrity? arXiv. 2022. Available from: https://arxiv.org/abs/2212.09292
