BMC Oral Health. 2025 Feb 1;25:173. doi: 10.1186/s12903-025-05554-w

ChatGPT-4 Omni’s superiority in answering multiple-choice oral radiology questions

Melek Tassoker
PMCID: PMC11786404  PMID: 39893407

Abstract

Objectives

This study evaluates and compares the performance of ChatGPT-3.5, ChatGPT-4 Omni (4o), Google Bard, and Microsoft Copilot in responding to text-based multiple-choice questions related to oral radiology, as featured in the Dental Specialty Admission Exam conducted in Türkiye.

Materials and methods

A collection of text-based multiple-choice questions was sourced from the open-access question bank of the Turkish Dental Specialty Admission Exam, covering the years 2012 to 2021. The study included 123 questions, each with five options and one correct answer. The accuracy levels of ChatGPT-3.5, ChatGPT-4o, Google Bard, and Microsoft Copilot were compared using descriptive statistics, the Kruskal-Wallis test, Dunn’s post hoc test, and Cochran’s Q test.

Results and discussion

The accuracy of the responses generated by the four chatbots exhibited statistically significant differences (p < 0.001). ChatGPT-4o achieved the highest accuracy at 86.1%, followed by Google Bard at 61.8%. ChatGPT-3.5 demonstrated an accuracy rate of 43.9%, while Microsoft Copilot recorded a rate of 41.5%.

Conclusion

ChatGPT-4o showcases superior accuracy and advanced reasoning capabilities, positioning it as a promising educational tool. With regular updates, it has the potential to serve as a reliable source of information for both healthcare professionals and the general public.

Clinical trial number

Not applicable.

Keywords: ChatGPT, Bard, Copilot, Oral radiology, Question

Introduction

Natural Language Processing (NLP) is a field dedicated to understanding and enabling computers to interpret and process human language in textual and spoken forms [1]. In 1950, Alan Turing asked whether a computer program could communicate effectively with humans. This inquiry has evolved into the Turing test, a foundation for developing chatbots [2].

Since the launch of ChatGPT by OpenAI in November 2022, the landscape of generative AI chatbots has experienced remarkable advancements. These developments profoundly impact higher education as academics and students increasingly turn to chatbots like Ernie, Bard (now known as Gemini), and Grok [3]. These text-based AI tools have been trained and refined using various datasets, including books, articles, and websites [4]. The efficacy and precision of these AI tools depend on multiple factors, including their expertise, frequency of model updates, and the complexity of the inquiries they address [5]. It is anticipated that these AI technologies, which offer personalized solutions, may soon supplant traditional search engines [6].

Previous studies have investigated the effectiveness of various AI chatbots in specialty medical exams across multiple fields, including family medicine [7], sleep medicine [8], internal medicine [9], dermatology [10], ophthalmology [11], and radiology [12]. A study conducted in Taiwan [13] evaluated the performance of ChatGPT-3.5 on the pharmacy licensing examination, revealing a correct answer rate of 54.4%. Furthermore, another study [14] compared the correct answer rates of ChatGPT-4 and Bard on the American College of Radiology’s Diagnostic Radiology In-Training examination, showcasing that ChatGPT-4 outperformed Bard with rates of 87.11% and 70.44%, respectively. While numerous studies have assessed chatbot performance across various disciplines, data regarding their effectiveness in dentistry remains limited. Recent research [15] indicated that ChatGPT-3.5 and ChatGPT-4 achieved success rates of 61.3% and 76.9%, respectively, on the dental board exam evaluating dental knowledge.

AI is transforming education by enhancing e-learning platforms and assisting with tasks such as answering multiple-choice questions and completing assignments [16]. Many students prefer digital learning resources over traditional materials like textbooks [17]. Large language models (LLMs) can particularly benefit dental students preparing for undergraduate and graduate multiple-choice exams. These models can facilitate academic development by generating new questions and case scenarios inspired by prior exams.

Concerns continue to arise regarding the accuracy and reliability of AI-generated responses, particularly when addressing specialized topics or questions. Oral radiology, a specialized field in dentistry with a limited number of practitioners [18] and a shortage of educators, stands to gain significantly from integrating LLMs. Consequently, examining their accuracy and reliability is crucial for students in this discipline.

By examining the latest advancements in AI tools, particularly ChatGPT-4 and ChatGPT-4o, we can provide valuable insights for dental students specializing in oral radiology and their educators. This analysis will support informed decision-making that enhances learning and teaching methodologies within this specialized field. Notably, the effectiveness of these advanced chatbots in addressing questions related to oral radiology remains to be thoroughly investigated.

ChatGPT-4o excels in delivering accurate and reliable answers, capable of swiftly tackling complex queries and conducting in-depth analyses. Its enhanced language comprehension allows for responses that are increasingly natural and human-like. Microsoft Copilot, developed using GPT-4, exemplifies this technology in practice, as it effectively understands and responds to intricate prompts, creates more innovative and informative text, and assists users across various tasks. Conversely, ChatGPT-3.5 is recognized as the fastest model in the ChatGPT series, while Bard, integrated into the Google ecosystem, offers detailed and comprehensive responses [19–21].

To specialize in oral radiology in Türkiye, passing the Dental Specialty Admission Exam (DUS) after completing a five-year undergraduate program in dentistry is crucial. This exam is held annually and consists of 120 multiple-choice questions, with 10 dedicated explicitly to oral radiology. This study aims to assess and compare the performance of ChatGPT-4o, ChatGPT-3.5, Google Bard, and Microsoft Copilot in answering text-based multiple-choice questions related to oral radiology as part of the DUS in Türkiye.

Methods

Study design and data collection

This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4 Omni (4o), Google Bard, and Microsoft Copilot in answering multiple-choice questions related to oral radiology. The dataset pertains to oral radiology and was derived from the open-access question bank of the DUS covering the years 2012–2021 (https://www.osym.gov.tr/TR,15070/dus-cikmis-sorular.html). It comprises 123 multiple-choice questions, each featuring five answer options with one correct response. These questions primarily assess theoretical knowledge and diagnostic skills in oral radiology. To maintain authenticity and prevent bias, the original questions, which were written in Turkish, were input into the chatbots without translation.

  • Inclusion Criteria: All multiple-choice questions from the DUS database that exclusively assessed oral radiology were included.

  • Exclusion Criteria: Questions with non-multiple-choice formats (e.g., open-ended or rank-based) or visual elements (e.g., images, radiographs) were excluded to ensure uniformity in the analysis.

Querying procedure

The study utilized the following AI chatbots:

  1. ChatGPT-3.5 (OpenAI, San Francisco, USA, March 2024 version).

  2. ChatGPT-4 Omni (OpenAI, San Francisco, USA, June 2024 version).

  3. Google Bard (Google, Menlo Park, USA, April 2024 version).

  4. Microsoft Copilot (Microsoft, Redmond, USA, May 2024 update).

Each question was individually inputted into the chatbots in Turkish to ensure accurate understanding and minimize contextual contamination. Each interaction was treated as a standalone session to prevent memory effects from influencing subsequent prompts. The complete text of the questions, including punctuation and syntax, was preserved during input. No additional prompt optimization or pre-testing was conducted, allowing for conditions that closely reflect real-world usage.

Responses from the chatbots were categorized as either “correct” or “incorrect” based on the official answer key provided by the DUS question bank. The accuracy rate for each chatbot was calculated as the percentage of correct answers out of the total number of questions attempted.
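For illustration only, the querying-and-scoring workflow can be sketched in Python. The sketch assumes programmatic access through the openai client and a hypothetical QUESTIONS list holding each Turkish question with its official answer key; in the study itself, each question was entered manually into the chatbots' public interfaces as a fresh session.

```python
# Illustrative sketch only: the study entered each question manually into the
# chatbots' web interfaces; this assumes API access via the openai client and
# a hypothetical QUESTIONS list (question text, five options, official key).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTIONS = [
    {"id": 1, "text": "...", "options": ["A) ...", "B) ...", "C) ...", "D) ...", "E) ..."], "key": "D"},
    # ... 123 questions in total
]

def ask_once(question, model="gpt-4o"):
    """Send one question as a standalone conversation so no memory carries over."""
    prompt = question["text"] + "\n" + "\n".join(question["options"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # single-turn, no prior context
    )
    return response.choices[0].message.content

def accuracy(questions, get_answer):
    """Accuracy = correct answers / total questions attempted (crude letter match)."""
    correct = sum(1 for q in questions if q["key"] in get_answer(q))
    return correct / len(questions)
```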

Categories and topics

The questions were categorized into 17 oral radiology topics, including but not limited to: Tooth caries radiology, Dental anomalies, Extraoral imaging, Advanced imaging techniques, Oral medicine, Jaw pathologies, Radiobiology. To better analyze the chatbots’ performance, these topics were further grouped into three educational content areas:

  1. Fundamental Knowledge: Topics such as physics, radiobiology, and projection geometry.

  2. Imaging and Equipment: Topics covering panoramic, intraoral, and extraoral imaging techniques.

  3. Image Interpretation: Topics requiring diagnostic reasoning, such as jaw pathologies and systemic diseases.
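For analysis, this grouping amounts to a simple lookup from each of the 17 topics to one of the three content areas. The sketch below illustrates the idea with a subset of topics; the assignments shown follow the examples in the text and are otherwise illustrative.

```python
# Sketch: mapping oral radiology topics to the three educational content areas
# (only a subset of the 17 topics is shown, following the examples in the text).
CONTENT_AREA = {
    "Physics": "Fundamental knowledge",
    "Radiobiology": "Fundamental knowledge",
    "Projection geometry": "Fundamental knowledge",
    "Panoramic imaging": "Imaging and equipment",
    "Intraoral imaging": "Imaging and equipment",
    "Extraoral imaging": "Imaging and equipment",
    "Jaw pathologies": "Image interpretation",
    "Systemic diseases": "Image interpretation",
}

def area_accuracy(results):
    """results: iterable of (topic, is_correct) pairs -> accuracy per content area."""
    totals, correct = {}, {}
    for topic, ok in results:
        area = CONTENT_AREA.get(topic, "Other")
        totals[area] = totals.get(area, 0) + 1
        correct[area] = correct.get(area, 0) + int(ok)
    return {area: correct[area] / totals[area] for area in totals}
```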

Fig. 1 illustrates the English translation of a case question from the DUS and displays the answer screen. Table 1 provides examples of oral radiology questions across various topics.

Fig. 1

The correct answer of ChatGPT-4o to a case question. (Questions were submitted to the chatbots in Turkish without translation; the example in this figure has been translated into English for the article's readership.)

Table 1.

Examples of the multiple choice questions according to different oral radiology topics*

Oral radiology topics Examples of the questions Choices
Intraoral imaging Choose the anatomical structure that is not visible in a periapical radiograph of the maxillary anterior region.

1. Spina nasalis anterior

2. Concha nasalis superior

3. Lateral walls of the nasopalatine canal

4. Incisive foramen

5. Nasal septum

Oral medicine Choose the option that is not a predisposing factor for recurrent herpetic infections.

1. Immunosuppression

2. Sunlight

3. High fever

4. Trauma

5. Smoking

Jaw pathology Choose the option that is associated with multiple osteomas.

1. Gorlin syndrome

2. Ehlers-Danlos syndrome

3. Hand-Schüller-Christian disease

4. Gardner syndrome

5. Eosinophilic granuloma

Tooth caries radiology Choose the option that cannot be confused with caries in the radiographic diagnosis.

1. Cervical burn-out

2. Hypoplastic fissures

3. Loss of substance due to attrition

4. Mach band effect

5. Enamel pearl

Advanced imaging Choose the most appropriate imaging modality for evaluating temporomandibular joint (TMJ) disc dislocations.

1. Lateral TMJ view

2. Panoramic radiography

3. Digital subtraction radiography

4. Magnetic resonance imaging

5. Open and closed TMJ radiograph

* The questions uploaded to the chatbots were in Turkish without any English translation. However, the table has been translated into English for the readers.

Reliability assessment

To ensure consistency, a single observer posed each question twice to each chatbot. The initial responses were used for primary analysis, while the second set was employed to evaluate reliability. The agreement between the two query attempts was assessed using Cohen’s Kappa (κ), resulting in the following values: ChatGPT-4 Omni: κ = 0.86, ChatGPT-3.5: κ = 0.78, and Microsoft Copilot: κ = 0.72.
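Cohen's kappa for the two query attempts can be reproduced directly from the binary correct/incorrect codings; the sketch below uses placeholder values rather than the study data.

```python
# Sketch: test-retest agreement between the two query attempts for one chatbot,
# coded as 1 = correct, 0 = incorrect (placeholder values, not the study data).
from sklearn.metrics import cohen_kappa_score

first_attempt  = [1, 0, 1, 1, 0, 1, 1, 0]   # responses to the same questions, run 1
second_attempt = [1, 0, 1, 0, 0, 1, 1, 0]   # responses to the same questions, run 2

kappa = cohen_kappa_score(first_attempt, second_attempt)
print(f"Cohen's kappa = {kappa:.2f}")  # the study reports kappa values of 0.72-0.86
```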

Evaluation metrics

Three metrics were assessed to evaluate chatbot performance:

  1. Accuracy: The proportion of correct answers provided by each chatbot.

  2. Mean Word Count: The average word count of the responses.

  3. Response Time: The time taken by each chatbot to generate a response.

Word Count Calculation: Responses were copied into Microsoft Word, and word count was measured using the built-in feature.

Response Time Measurement: An online stopwatch was used to measure the time from input submission to response completion.
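Both auxiliary metrics can also be captured programmatically, as in the sketch below; the study itself used Microsoft Word's counter and an online stopwatch, and query_chatbot is a hypothetical stand-in for submitting one question.

```python
# Sketch: measuring response word count and response time programmatically.
# The study used Microsoft Word's word counter and an online stopwatch;
# query_chatbot() is a hypothetical stand-in for sending one question.
import time

def measure(question_text, query_chatbot):
    start = time.perf_counter()
    answer = query_chatbot(question_text)       # one standalone query
    elapsed = time.perf_counter() - start       # seconds from submission to completion
    word_count = len(answer.split())            # whitespace-delimited word count
    return word_count, elapsed
```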

Statistical analysis

Statistical analyses were conducted using IBM SPSS Statistics version 21.0. Descriptive statistics were computed for all metrics. Comparative analyses were then performed: the Kruskal-Wallis test was used to compare word counts and response times among the four chatbots, followed by Dunn’s post hoc test for pairwise comparisons. Cochran’s Q test was applied to evaluate differences in accuracy rates on the same questions across the chatbots.
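The analyses were run in SPSS; an equivalent open-source workflow is sketched below using scipy, scikit-posthocs, and statsmodels on per-question data. The values, the long-format layout, and the Bonferroni adjustment for Dunn’s test are illustrative assumptions, not the study’s settings.

```python
# Illustrative sketch of the analyses outside SPSS (placeholder data, not the
# study's): Kruskal-Wallis and Dunn's post hoc test on word counts, and
# Cochran's Q on a question-by-chatbot binary accuracy matrix.
import numpy as np
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp
from statsmodels.stats.contingency_tables import cochrans_q

# Long-format word counts, one row per chatbot response (placeholder values).
df = pd.DataFrame({
    "chatbot":    ["GPT-4o"] * 3 + ["GPT-3.5"] * 3 + ["Bard"] * 3 + ["Copilot"] * 3,
    "word_count": [150, 160, 164, 118, 120, 126, 270, 281, 288, 130, 134, 138],
})

# Kruskal-Wallis across the four chatbots, then Dunn's pairwise post hoc test
# (the Bonferroni adjustment here is an assumption, not the study's setting).
groups = [g["word_count"].to_numpy() for _, g in df.groupby("chatbot")]
print(kruskal(*groups))
print(sp.posthoc_dunn(df, val_col="word_count", group_col="chatbot", p_adjust="bonferroni"))

# Cochran's Q on the 123 x 4 matrix of per-question scores (1 correct, 0 incorrect).
accuracy_matrix = np.random.randint(0, 2, size=(123, 4))  # placeholder matrix
print(cochrans_q(accuracy_matrix))
```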

Results

The study revealed statistically significant differences (p < 0.001) in the accuracy of the responses provided by the four evaluated chatbots. ChatGPT-4o achieved the highest accuracy rate at 86.1%, followed by Google Bard at 61.8%. In contrast, ChatGPT-3.5 had an accuracy rate of 43.9%, while Microsoft Copilot recorded 41.5%. Word counts of the responses also varied significantly, with Google Bard producing the most verbose replies and ChatGPT-3.5 the most concise (p < 0.001). Differences in response times were likewise significant: ChatGPT-3.5 delivered the fastest responses, whereas ChatGPT-4o was the slowest (p < 0.001) (Table 2).

Table 2.

Comparison of the four chatbots

Metric                          ChatGPT-4o    ChatGPT-3.5   Google Bard   Microsoft Copilot   Significance
Correct responses, n (%)        106 (86.1)    54 (43.9)     76 (61.8)     51 (41.5)           p < 0.001*
Incorrect responses, n (%)      17 (13.9)     69 (56.1)     47 (38.2)     72 (58.5)
Total questions, n (%)          123 (100.0)   123 (100.0)   123 (100.0)   123 (100.0)
Mean word count                 158.14        121.57        279.71        133.92              p < 0.001*
Mean response time (seconds)    16.00         7.54          8.00          15.14               p < 0.001*

* p-value is significant at p < 0.001 level

Table 3 presents the p-values for the pairwise comparisons among the four chatbots. The comparisons of ChatGPT-3.5 versus Google Bard, Microsoft Copilot versus Google Bard, and ChatGPT-4o versus Google Bard revealed statistically significant differences in word count. Significant differences in mean response time were observed for ChatGPT-3.5 versus ChatGPT-4o and for ChatGPT-3.5 versus Google Bard.

Table 3.

Pairwise comparisons of 4 chatbots

Pairwise chatbots                     p-value (word count)   p-value (mean response time)
ChatGPT-3.5 - Microsoft Copilot       0.513                  0.062
ChatGPT-3.5 - ChatGPT-4o              0.157                  0.021**
ChatGPT-3.5 - Google Bard             0.000*                 0.003*
Microsoft Copilot - ChatGPT-4o        0.448                  0.660
Microsoft Copilot - Google Bard       0.000*                 0.292
ChatGPT-4o - Google Bard              0.001*                 0.539

* p-value is significant at p < 0.01 level, ** p-value is significant at p < 0.05 level

Table 4 breaks down the chatbot responses by oral radiology topic; jaw pathologies and systemic diseases account for the largest numbers of questions.

Table 4.

Chatbot responses according to oral radiology topics

Topics Total number of questions ChatGPT-4o ChatGPT-3.5 Google Bard Microsoft Copilot
True False True False True False True False
Tooth caries radiology 1 1 0 0 1 1 0 0 1
Dental anomalies 4 4 0 2 2 1 3 1 3
Extraoral imaging 3 2 1 1 2 0 3 1 2
Film imaging 8 7 1 2 6 5 3 2 6
Advanced imaging 13 10 3 4 9 8 5 6 7
Implant radiology 1 1 0 0 1 1 0 0 1
Intraoral anatomy 3 3 0 0 3 1 2 1 2
Intraoral imaging 5 3 2 2 3 3 2 2 3
Oral medicine 16 16 0 9 7 13 3 8 8
Panoramic imaging 4 3 1 2 2 2 2 1 3
Projection geometry 2 1 1 1 1 0 2 0 2
Physics 12 12 0 6 6 9 3 5 7
Radiobiology 4 4 0 3 1 4 0 1 3
Artifacts 2 2 0 1 1 0 2 1 1
Jaw pathologies 20 17 3 9 11 13 7 8 12
Systemic diseases 18 15 3 9 9 11 7 10 8
Soft tissue calcifications 7 5 2 3 4 4 3 4 3
Total number of questions 123 106 17 54 69 76 47 51 72

Table 5 summarizes the oral radiology questions by educational content area. Figure 2 presents graphs of chatbot performance by educational content area.

Table 5.

Multiple choice oral radiology questions divided by educational content

Educational content        Total questions   ChatGPT-4o (True / False)   ChatGPT-3.5 (True / False)   Google Bard (True / False)   Microsoft Copilot (True / False)
Fundamental knowledge      30 (100%)         28 (93.3%) / 2 (6.7%)       13 (43.3%) / 17 (56.7%)      19 (63.3%) / 11 (36.7%)      11 (36.7%) / 19 (63.3%)
Imaging and equipment      27 (100%)         20 (74.1%) / 7 (25.9%)      9 (33.3%) / 18 (66.7%)       14 (51.9%) / 13 (48.1%)      9 (33.3%) / 18 (66.7%)
Image interpretation       66 (100%)         58 (87.9%) / 8 (12.1%)      32 (48.5%) / 34 (51.5%)      43 (65.2%) / 23 (34.8%)      31 (47.0%) / 35 (53.0%)
Total number of questions  123 (100%)        106 (86.1%) / 17 (13.9%)    54 (43.9%) / 69 (56.1%)      76 (61.8%) / 47 (38.2%)      51 (41.5%) / 72 (58.5%)

Fig. 2

Chatbot performance by oral radiology educational content area

Discussion

ChatGPT-3.5, ChatGPT-4o, Bard, and Copilot are large language models (LLMs) constructed using deep neural networks and trained on extensive text datasets to understand and process human language effectively [22]. The potential applications of LLMs in medicine garner increasing attention; however, it is crucial to understand their strengths and limitations before implementation. LLMs can aid in selecting appropriate radiological imaging modalities and reporting images under human supervision, enhancing efficiency and quality in radiology [23]. Conversely, integrating LLMs into medical education can enrich students’ learning experiences by enabling personalized study plans [24]. Nonetheless, there is a notable lack of sufficient data in the literature regarding the performance of LLMs, particularly concerning their similar potential in dental education.

Oral radiology is a specialized branch within clinical dentistry that has a limited number of specialists. Education in this field requires a robust theoretical foundation and practical application to actual cases. However, many institutions face challenges in providing adequate access to diverse cases that enhance learning and maintain a sufficient number of specialist trainers for each student. This situation prompts the inquiry into how large language models (LLMs) can contribute to education in this domain.

In the present study, ChatGPT-4o achieved the highest accuracy among the tested tools, with a correct response rate of 86.1%. In contrast, ChatGPT-3.5 demonstrated lower accuracy, which may be attributed to its less advanced architecture and training, limiting its ability to address complex scenarios effectively [5].

In a separate study evaluating the performance of ChatGPT-3.5 on the Israeli Internal Medicine National Residency Exam, it answered 36.6% of 133 questions correctly [9]; ChatGPT-3.5 demonstrated a comparable correctness rate of 43.9% in the present study.

Jeong et al. [25] compared LLM-based chatbots with dental students in oral and maxillofacial radiology. Faculty members developed a series of questions and categorized them into various educational content areas, standardizing the inputs in Korean. The dental students achieved an impressive overall accuracy rate of 81.2%, whereas the chatbots' accuracy rates were 50.0% for ChatGPT, 65.4% for ChatGPT Plus, 50.0% for Bard, and 63.5% for Bing Chat.

This study examined chatbot performance using publicly available DUS questions, encompassing a wide array of oral radiology topics, without juxtaposing the results with student performance. The variations in accuracy rates observed may be attributed to the different algorithms utilized and the question complexity in the two studies. ChatGPT-4o, launched on May 13, 2024, is the most recent and best-performing chatbot examined in this research. By comparison, ChatGPT-3.5 became publicly available with the launch of ChatGPT in November 2022, Google Bard on March 21, 2023, and Microsoft Copilot on November 1, 2023. ChatGPT-4o boasts enhanced natural language processing capabilities, enabling seamless and rapid responses in both English and non-English texts. In addition to its success in the educational domain and improvements over previous iterations, it offers an integrated online consulting experience for addressing complex cases within telemedicine systems [26].

Within this study, ChatGPT-3.5 demonstrated the fastest response time, likely due to its concise answers, which employed the fewest words among the evaluated chatbots. Google Bard ranked second fastest, generating the highest word count in its replies while maintaining relatively quick response times; however, its accuracy rate did not match that of ChatGPT-4o.

ChatGPT-4o is distinguished by its high accuracy, though it typically has longer response times, making it particularly well-suited for high-stakes situations that demand precise information. In contrast, ChatGPT-3.5 and Google Bard offer quicker response times, which may be more appropriate for applications that require fast answers, even at the expense of some accuracy. Google Bard's tendency toward higher word counts can benefit users seeking extensive explanations, whereas ChatGPT-4o strikes a well-balanced response length that effectively meets various needs.

In Table 4, ChatGPT-4o demonstrated high accuracy, answering all questions correctly in areas such as “Oral Medicine” (16/16), “Physics” (12/12), and “Radiobiology” (4/4). This suggests ChatGPT-4o is reliable for oral radiology topics encompassing complex and detailed information. The performance disparities among the chatbots were especially pronounced in detail-rich areas such as “Jaw Pathologies” (20 questions) and “Systemic Diseases” (18 questions): ChatGPT-4o scored 17 and 15 correct answers in these categories, respectively, while the other models achieved lower accuracy rates. These findings underscore the significance of LLMs and the need for comprehensive data sets to address advanced clinical issues, and the lower accuracy rates of the other chatbots indicate areas that require improvement.

Table 5 reveals that the chatbots had lower accuracy rates in oral radiology topics related to imaging and equipment. These results indicate that LLM performance varies across specific educational content areas and that enhancements are needed, particularly for complex tasks such as imaging and equipment.

In a study assessing multiple-choice questions from dental licensing exams in the US and UK, ChatGPT-3.5 achieved an accuracy rate of 68.3% on US questions and 43.3% on UK questions, while ChatGPT-4 demonstrated improved accuracy, scoring 80.7% and 62.7%, respectively. In an evaluation using the pre-course Advanced Life Support (ALS) Multiple Choice Questionnaire of the European Resuscitation Council (ERC), Copilot attained an accuracy of 62.5%, Bard 57.5%, and ChatGPT-3.5 42.25%, whereas ChatGPT-4 excelled with the highest accuracy at 87.5%; ChatGPT-3.5 provided the quickest responses, averaging three seconds per answer. Although studies involving different language models yield varying accuracy results, it is clear that the latest versions show a general improvement in performance.

This study reveals substantial differences among various chatbot models, aiding users in selecting the most appropriate option for their needs. Several factors may explain the variations in accuracy observed.

  1. Model Training Data and Architecture: ChatGPT-4o likely utilizes a more advanced architecture and a wider range of training data than its predecessors, which may enhance its accuracy.

  2. Fine-tuning and Updates: Advanced models like ChatGPT-4o may have undergone more rigorous fine-tuning and received updates more frequently than earlier models, contributing to improved precision.

  3. Response Generation Strategy: Some chatbot models may prioritize swift response times. For example, the faster responses of ChatGPT-3.5, along with its lower accuracy, indicate a strategy focused on speed rather than precision.

In summary, ChatGPT-4o excels in scenarios where high accuracy is imperative, while ChatGPT-3.5 and Google Bard are more fitting for applications that prioritize speed. Additionally, Google Bard is particularly effective in contexts requiring detailed information.

Limitations

The primary limitation of this study is its exclusive focus on Turkish text-based questions. Exploring various question formats, including open-ended and rank-based questions, would be advantageous. Future research could also benefit from increasing the number of questions, uploading radiological images to different chatbots, and comparing their responses with those of human students. Furthermore, generating new questions based on the uploaded ones could provide valuable insights into the potential support available to students as they prepare for exams.

Conclusion

ChatGPT-4o stands out as the most accurate among the four available chatbots. With ongoing advancements in educational content and the underlying architectures of these tools, chatbots are becoming increasingly integral to the academic landscape. They present the potential to facilitate the swift resolution of complex dental and medical scenarios, thereby enhancing outcomes in these vital fields.

However, it is crucial to recognize that, at this stage, AI tools still fall short of matching the expertise and nuanced understanding of human specialists. This gap underscores the need for continued research and development in AI technology. Looking ahead, there is considerable curiosity and anticipation surrounding how these innovations will evolve and their implications for the future of education and healthcare.

Acknowledgements

I extend my special thanks to PhD Candidate Büşra Nisa Yilmaz for her professional editing of the manuscript in accordance with English language rules.

Author contributions

Material preparation, data collection, and analysis were performed by MT. The manuscript was written by MT. The final version of the manuscript was approved by MT.

Funding

No external funding was obtained for this study.

Data availability

No datasets were generated or analysed during the current study.

Declarations

Ethics approval and consent to participate

Not applicable. Open-access public data were used in this study.

Consent for publication

Not applicable since there was no direct human contact.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Chowdhary K. Natural language processing. In: Fundamentals of artificial intelligence. 2020. pp. 603–49.
  • 2. Turing AM. Computing machinery and intelligence. Springer; 2009.
  • 3. Rudolph J, Tan S, Tan S. War of the chatbots: Bard, Bing Chat, ChatGPT, Ernie and beyond. The new AI gold rush and its impact on higher education. J Appl Learn Teach. 2023;6(1):364–89.
  • 4. Tepe M, Emekli E. Decoding medical jargon: the use of AI language models (ChatGPT-4, BARD, Microsoft Copilot) in radiology reports. Patient Educ Couns. 2024;126:108307. doi: 10.1016/j.pec.2024.108307.
  • 5. Semeraro F, Gamberini L, Carmona F, Monsieurs KG. Clinical questions on advanced life support answered by artificial intelligence. A comparison between ChatGPT, Google Bard and Microsoft Copilot. Resuscitation. 2024;195.
  • 6. AlAfnan MA, Dishari S, Jovic M, Lomidze K. ChatGPT as an educational tool: opportunities, challenges, and recommendations for communication, business writing, and composition courses. J Artif Intell Technol. 2023;3(2):60–8.
  • 7. Weng T-L, Wang Y-M, Chang S, Chen T-J, Hwang S-J. ChatGPT failed Taiwan’s family medicine board exam. J Chin Med Assoc. 2023;86(8):762–6.
  • 8. Cheong RCT, et al. Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard. Eur Arch Otorhinolaryngol. 2024;281(4):2137–43.
  • 9. Ozeri DJ, Cohen A, Bacharach N, Ukashi O, Oppenheim A. Performance of ChatGPT in Israeli Hebrew Internal Medicine National Residency Exam. Isr Med Assoc J. 2024;26(2):86–8.
  • 10. Lewandowski M, Łukowicz P, Świetlik D, Barańska-Rybak W. ChatGPT-3.5 and ChatGPT-4 dermatological knowledge level based on the Specialty Certificate Examination in Dermatology. Clin Exp Dermatol. 2024;49(7):686–91. doi: 10.1093/ced/llad255.
  • 11. Teebagy S, Colwell L, Wood E, Yaghy A, Faustina M. Improved performance of ChatGPT-4 on the OKAP examination: a comparative study with ChatGPT-3.5. J Acad Ophthalmol. 2023;15(2):e184–7.
  • 12. Almeida LC, Farina EM, Kuriki PE, Abdala N, Kitamura FC. Performance of ChatGPT on the Brazilian radiology and diagnostic imaging and mammography board examinations. Radiol Artif Intell. 2023;6(1):e230103.
  • 13. Wang Y-M, Shen H-W, Chen T-J. Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc. 2023;86(7):653–8. doi: 10.1097/jcma.0000000000000942.
  • 14. Patil NS, Huang RS, van der Pol CB, Larocque N. Comparative performance of ChatGPT and Bard in a text-based radiology knowledge assessment. Can Assoc Radiol J. 2024;75(2):344–50. doi: 10.1177/08465371231193716.
  • 15. Danesh A, Pazouki H, Danesh K, Danesh F, Danesh A. The performance of artificial intelligence language models in board-style dental knowledge assessment: a preliminary study on ChatGPT. J Am Dent Assoc. 2023;154(11):970–4. doi: 10.1016/j.adaj.2023.07.016.
  • 16. Sabaner MC, Hashas ASK, Mutibayraktaroglu KM, Yozgat Z, Klefter ON, Subhi Y. The performance of artificial intelligence-based large language models on ophthalmology-related questions in Swedish proficiency test for medicine: ChatGPT-4 Omni vs Gemini 1.5 Pro. AJO International. 2024;1(4):100070. doi: 10.1016/j.ajoint.2024.100070.
  • 17. Santos GNM, et al. Effectiveness of E-learning in oral radiology education: a systematic review. J Dent Educ. 2016;80(9):1126–39.
  • 18. Scarfe WC. Oral and maxillofacial radiology as a dental specialty: the first decade. Oral Surg Oral Med Oral Pathol Oral Radiol Endod. 2010;110(4):405–8. doi: 10.1016/j.tripleo.2010.06.023.
  • 19. OpenAI. GPT-4 Technical Report. 2023.
  • 20. Google. Introducing Bard: a conversational AI service powered by LaMDA. 2023.
  • 21. Microsoft. Introducing Microsoft Copilot powered by OpenAI. 2023.
  • 22. Keshavarz P, et al. ChatGPT in radiology: a systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging. 2024;105(7):251–65. doi: 10.1016/j.diii.2024.04.003.
  • 23. Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology. 2024;310(1):e232756. doi: 10.1148/radiol.232756.
  • 24. Abd-alrazaq A, et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9:e48291. doi: 10.2196/48291.
  • 25. Jeong H, Han S-S, Yu Y, Kim S, Jeon KJ. How well do large language model-based chatbots perform in oral and maxillofacial radiology? Dentomaxillofac Radiol. 2024;53(6):390–5.
  • 26. Temsah MH, et al. Transforming virtual healthcare: the potentials of ChatGPT-4 Omni in telemedicine. Cureus. 2024;16(5):e61377. doi: 10.7759/cureus.61377.


