Scientific Reports. 2026 Feb 27;16:11272. doi: 10.1038/s41598-026-41533-z

Comparative analysis of large language models as decision support tools in oral pathology

Valentina Ignacia Alvarez-Silberberg 1,2, Camila Paz Alvarez-Silberberg 2, Cosimo Galletti 3, Javier Flores-Fraile 4,5, Valeria Ramirez 2, Cristian Bravo Palma 2, Victor Gil-Manich 1, Luca Fiorillo 3,5, Vini Mehta 4,5, Maria-Teresa Fernández-Figueras 6,7, Maria Cuevas-Nunez 1,7
PMCID: PMC13049154  PMID: 41760843

Abstract

This study evaluated the performance of four large language model (LLM)-based chatbots (ChatGPT-4.0, ChatGPT o1-preview, Gemini, and Meta AI) as decision-support systems for interpreting histopathologic descriptions of oral lesions. Each model generated a suggested primary interpretation and three differential diagnoses. Outputs were categorized as Different, Similar, or Correct compared to the consensus reference diagnosis established by two board-certified pathologists. Statistical analyses included the Friedman test to compare model performance, Wilcoxon signed-rank tests for pairwise comparisons, Cohen’s κ to assess agreement, and regression analyses to evaluate the influence of age and sex. Differential diagnosis performance was also analyzed. ChatGPT o1-preview demonstrated the highest proportion of outputs concordant with the reference diagnosis (68.6%), followed by Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). In terms of agreement with oral pathologists, ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) showed substantial agreement, ChatGPT-4.0 demonstrated moderate agreement (κ = 0.57), and Gemini showed poor agreement (κ = 0.24). Increasing patient age was associated with a mild but statistically significant reduction in model performance for ChatGPT-4.0, Meta AI, and Gemini, while no significant age effect was observed for ChatGPT o1-preview; patient sex had no significant impact. Among the evaluated chatbots, ChatGPT o1-preview showed the highest alignment with oral pathologists’ reference diagnoses. These findings support the potential role of LLMs as complementary decision-support tools for interpreting oral histopathology descriptions, while highlighting substantial inter-model variability and the need for cautious implementation with continued human oversight.

Keywords: Artificial intelligence, Chatbot, Oral and maxillofacial pathology, ChatGPT, Gemini, Meta AI, Large language models

Subject terms: Computational biology and bioinformatics, Diseases, Medical research

Introduction

Artificial intelligence (AI), particularly through large language models (LLMs) such as ChatGPT, Gemini, and Meta AI, has emerged as a promising tool in healthcare, offering faster, more accessible, and potentially more accurate clinical decision support. These platforms, often freely available to the public, may contribute to improved oral health education and assist both patients and clinicians in diagnostic reasoning and treatment planning1,2. Their underlying machine learning and deep learning techniques enable the analysis of large datasets and have already shown promise in dentistry, including caries detection and radiographic interpretation3. However, while ChatGPT has been relatively well studied, other widely adopted LLMs remain insufficiently evaluated in healthcare contexts.

Importantly, different LLMs have achieved widespread adoption through distinct access and distribution strategies, leading to major differences in user reach and interaction patterns rather than demonstrated differences in clinical performance4–7. This underscores the need to assess multiple platforms when considering real-world clinical applicability, including oral lesion analysis. Despite their potential, LLMs face significant challenges related to opacity, reliability, trust, bias, and data privacy, which limit transparency and acceptance in healthcare settings8,9.

In oral pathology, AI applications have shown potential to improve diagnostic accuracy, personalize treatment, and support the interpretation of histopathological features critical for definitive diagnosis10,11. Nevertheless, diagnostic variability persists due to the inherent subjectivity of histological interpretation, particularly in oral potentially malignant disorders10,12. AI-based tools may help reduce this subjectivity by providing standardized, reproducible support, although further evidence is required1,13,14. Current studies on dental chatbots suggest they can deliver generally accurate information but may lack depth when addressing complex histopathological or medical concepts1,9,15.

Additional limitations include the “black box” nature of AI systems, unknown training databases, restricted access in public healthcare settings, and ethical concerns related to data privacy and algorithmic bias10,13,16–19. Consequently, few studies have specifically examined the role of chatbots as complementary tools for interpreting histopathological descriptions in oral pathology.

In this context, the intended clinical use of LLM-based chatbots is suggested as adjunctive rather than diagnostic, with potential applications including educational support, second-opinion assistance for clinicians in underserved areas, and cognitive support to possibly improve interpretation of complex histopathologic descriptions. These systems are not intended to replace expert pathologist judgment but to complement existing diagnostic workflows under appropriate supervision.

Therefore, this study aimed to evaluate and compare the performance of four publicly accessible large language model (LLM)-based chatbots (ChatGPT-4.0, ChatGPT o1-preview, Meta AI [LLaMA-3], and Gemini) as decision-support tools for interpreting textual histopathologic descriptions of oral and maxillofacial lesions, by assessing their concordance with consensus reference diagnoses provided by certified oral pathologists. Secondary aims were to assess inter-model and model–pathologist agreement, evaluate performance in generating differential diagnoses, and explore the influence of patient age and sex on model performance.

Methods

Reporting standards and data collection

This retrospective study was approved by the Ethics Committee for Clinical Studies (CEIC) of the Quirónsalud–Catalonia Hospital Group (approval code: 2024/40-APA-HUGC). The requirement for informed consent was waived due to the retrospective, anonymized nature of the data. All procedures were conducted in accordance with the Declaration of Helsinki. The study was reported in accordance with the STARD-AI guidelines, where applicable, for retrospective evaluations of AI-based decision-support systems.

A total of 102 histopathological descriptions of oral and maxillofacial lesions were retrieved retrospectively from the Department of Pathology at the General Hospital of the International University of Catalonia in Barcelona, Spain, in 2024. To maintain patient confidentiality, cases were de-identified and numerically coded starting from number 1.

Four different LLMs were used: ChatGPT-4.0 (OpenAI), Gemini (Google), LLaMA-3 (Meta Platforms), and ChatGPT o1-preview (OpenAI). The study was conducted from September 2024 to June 2025, before OpenAI’s subsequent updates in which ChatGPT-4.0 was superseded by GPT-5.0 and o1-preview by o3. Each model was provided with the corresponding histopathological description. In addition, the following participant details were included: age, sex, and lesion location, to further assess the diagnostic capabilities of the models.

The large language models (LLMs) evaluated in this study (ChatGPT-4.0, ChatGPT o1-preview, Meta AI (LLaMA-3), and Gemini) were selected based on their accessibility, public use, and technological relevance. ChatGPT-4.0 represents a high-performance, subscription-based model widely adopted in academic and professional settings, while ChatGPT o1-preview corresponds to a reasoning-optimized free-tier model available to the public through the ChatGPT platform. Meta AI was included due to its integration into widely used consumer platforms, such as smartphones and messaging apps, making it one of the most readily available conversational agents for non-specialists. Gemini was selected because it is embedded within Google’s search ecosystem (e.g., Google Search and Bard), providing real-time interaction and high exposure to end users. By including LLMs from three major technology developers (OpenAI, Meta, and Google), we aimed to compare LLMs as decision-support systems for interpreting histopathologic descriptions of oral lesions across diverse model architectures, training approaches, and user interfaces, reflecting the range of tools currently accessible in clinical and consumer environments.

The process was as follows: researchers MCCN and MTFF accessed the Atlas Path system (DBSoft, Madrid) at the General University Hospital of Catalonia to search for histopathological diagnoses of oral and maxillofacial lesions recorded between January 2022 and April 2024. An advanced search was conducted using the keyword “oral,” along with filters for types of biopsies, including gingival and periodontal tissues, salivary gland, tongue, and oral and labial mucosa. In addition, SNOMED filters were applied for the oral cavity and/or mouth. Cases were eligible for inclusion if they corresponded to primary oral or oral–maxillofacial lesions with complete histopathologic descriptions and finalized diagnoses rendered by certified pathologists. Only one specimen per lesion was included; duplicate biopsies and re-excisions of the same lesion were excluded to avoid case duplication. Histopathologic descriptions were used exactly as reported in the original pathology records, without simplification or modification. Accordingly, cases with composite or mixed diagnoses (e.g., oral epithelial dysplasia associated with candidal infection) were retained in their entirety. All microscopic findings and associated diagnoses were included as reported, in order to reflect the real-world histopathologic complexity encountered in routine oral pathology practice (Fig. 1). Notably, the researchers were trained in Oral and Maxillofacial Pathology (MCCN) or Dermatopathology, including oral pathology (MTFF).

Fig. 1. Flow diagram of case selection and inclusion in the study.

The types of oral lesions included in this study were mainly: infectious diseases such as candidiasis; immune-mediated diseases such as oral lichen planus; lichenoid lesions; oral epithelial dysplasia; fibromas; odontogenic cysts; benign reactive epithelial hyperplasia; squamous cell carcinoma; granulomatous reactions; mucoceles; and melanosis.

Data entry and use of chatbots

Using the free versions of ChatGPT-4.0, Gemini, LLaMA-3, and ChatGPT o1-preview, the previously described details were transcribed for each case into each LLM. Thus, all systems received the same information as the pathologists, without participant identifiers, ensuring fairness in the study.

The raw histopathological descriptions provided standardized accounts of all the tissues present in the specimen, such as oral epithelium, connective tissue, inflammatory infiltrate, or bone, when available. For example, for a fibroma the description would include: “The histological sections show mucosa lined by hyperplastic squamous epithelium without atypia. In the subepithelial connective tissue, there is a dense nodular collagen proliferation and a mild chronic inflammatory infiltrate.” This ensured consistency in the type of information entered into the chatbots and simulated the clinical context in which the professional receives the description. For cases in which the histopathological findings involved immunohistochemical studies (for example, p53 in cases of epithelial dysplasia), these were also included in the text entered into the chatbot.

Each chatbot was prompted to provide a proposed diagnosis and three differentials. Thus, all data were collected using a single standardized prompt requesting both a primary suggested diagnosis and three differential diagnoses: “Tell me the definitive diagnosis and give me 3 differentials?” The question specifically included the differential diagnoses to evaluate whether a chatbot might arrive at the correct diagnosis but list it as a differential rather than as the definitive one. Also, although the prompt used the term “definitive diagnosis,” this was intended to elicit a primary suggested interpretation for comparison purposes and does not reflect autonomous diagnostic use. All large language models were accessed through their standard public user interfaces and used with default settings. No system prompts, role instructions, or model parameters influencing response generation (such as randomness or creativity settings) were modified. All interactions were conducted in single-turn sessions, with the conversation history cleared between cases to avoid contextual carryover. Each histopathologic case was submitted once per model, and only the first response generated was recorded and analyzed, without response regeneration. This approach was chosen to reflect typical real-world use by clinicians interacting with publicly available chatbot tools.
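For readers who wish to reproduce this single-turn protocol programmatically, the sketch below illustrates one way to do so with the OpenAI Python SDK. It is an illustration under stated assumptions rather than the study's tooling: the study used the public web interfaces, and the `build_input` helper, the case dictionary fields, and the model identifier are hypothetical.

```python
# Illustrative sketch only; the study used the public chat interfaces, not an API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Tell me the definitive diagnosis and give me 3 differentials?"

def build_input(case: dict) -> str:
    """Assemble the case text as entered in the study: age, sex, lesion
    location, then the verbatim histopathologic description."""
    return (f"Age: {case['age']}. Sex: {case['sex']}. "
            f"Location: {case['location']}.\n"
            f"Histopathologic description: {case['description']}\n\n{PROMPT}")

def query_once(case: dict, model: str = "gpt-4o") -> str:
    # A fresh, single-turn message list per case avoids contextual carryover,
    # mirroring the cleared conversation history described above.
    response = client.chat.completions.create(
        model=model,  # hypothetical identifier; the study's models differ
        messages=[{"role": "user", "content": build_input(case)}],
    )
    # Only the first generated response is kept; no regeneration.
    return response.choices[0].message.content
```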

Diagnostic coding and pathologist consensus

The answers provided by the chatbots were compared with the original diagnoses made by the pathologists and were classified as follows:

(Different): The chatbot’s suggested diagnosis was clinically or pathologically different from the original diagnosis in the hospital’s database.

(Similar/Partial): This category comprises two classification options, described below. It was intentionally defined to capture cases of partial diagnostic concordance, encompassing both clinically aligned terminological variants and incomplete diagnostic formulations, which were grouped together for analytical consistency within a three-level outcome framework.

(a) The chatbot’s suggested diagnosis used different terminology but was clinically aligned with the official patient pathology. For example, the pathologists’ “oral mucosa with hyperplasia, melanosis, and chronic inflammatory changes, with no evidence of neoplasia” vs. the chatbot’s diagnosis of “oral hypermelanosis.”

(b) The chatbot’s suggested diagnosis was partially correct. For instance, if the pathologists’ diagnosis was “gingival inflammatory hyperplasia with the presence of candida” and the chatbot gave only “candida” as the diagnosis, the answer was not incorrect, but it lacked the completeness required for classification in the Correct category.

(Correct): The chatbot’s suggested diagnosis matches the pathologists’ original diagnosis.

To ensure standardization of the classification process and minimize subjectivity, each chatbot output was independently reviewed by two certified pathologists, who classified the suggested diagnosis according to the predefined outcome categories (different, similar/partial, correct). For the similar/partial category, the assessment was guided by predefined examples illustrating terminological equivalence versus incomplete diagnostic specification. Initial classifications were performed independently. Disagreements were subsequently resolved through consensus discussion, and the final agreed-upon classification was used for analysis. Formal inter-rater agreement among the pathologists was not calculated, as the consensus classification served as the reference standard for model evaluation.
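A minimal sketch of this three-level coding, assuming the score mapping used in the Results (Different = 1, Similar = 2, Correct = 3) and a deliberately strict stand-in for the consensus step; in the study, disagreements were resolved by discussion rather than programmatically.

```python
# Outcome coding used for analysis (scores follow the Results section).
SCORE = {"different": 1, "similar": 2, "correct": 3}

def consensus(rater_a: str, rater_b: str) -> str:
    """Return the agreed category; real disagreements were resolved by
    consensus discussion, represented here only by an explicit error."""
    if rater_a == rater_b:
        return rater_a
    raise ValueError("Disagreement: resolve by consensus discussion")

def encode(labels: list[str]) -> list[int]:
    """Map consensus categories onto the 1-3 ordinal scale."""
    return [SCORE[label] for label in labels]

print(encode(["correct", "similar", "different"]))  # [3, 2, 1]
```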

Statistical analysis

To evaluate differences in performance among the different AI methods, a Friedman test was applied, appropriate for comparing repeated measurements in paired data. Subsequently, multiple post-hoc comparisons were performed using the Wilcoxon signed-rank test for paired samples, with Bonferroni adjustment to control Type I error, using a significance level of 0.0083.
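These tests were run in Stata; a minimal Python equivalent using SciPy is sketched below. The `scores` arrays are placeholders standing in for the per-case ordinal scores (1 = Different, 2 = Similar, 3 = Correct), not study data.

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Placeholder per-case scores (1 = Different, 2 = Similar, 3 = Correct); not study data.
scores = {
    "ChatGPT 4.0": np.array([3, 1, 3, 2, 3, 1, 3, 3, 2, 1]),
    "Gemini":      np.array([1, 1, 2, 1, 3, 1, 1, 2, 1, 1]),
    "Meta AI":     np.array([3, 2, 3, 2, 3, 1, 3, 3, 2, 2]),
    "ChatGPT o1":  np.array([3, 2, 3, 3, 3, 2, 3, 3, 2, 1]),
}

# Omnibus test across the four paired methods.
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman: chi2 = {stat:.3f}, p = {p:.4f}")

# Post-hoc pairwise Wilcoxon signed-rank tests with Bonferroni adjustment.
alpha = 0.05 / 6  # 6 pairwise comparisons -> 0.0083, as in the study
for a, b in combinations(scores, 2):
    w, p_pair = wilcoxon(scores[a], scores[b])
    flag = "*" if p_pair < alpha else ""
    print(f"{a} vs {b}: p = {p_pair:.4f}{flag}")
```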

A multilevel ordinal logistic regression model was used to assess the relationship between the method as the independent variable and the score as the dependent (ordinal) variable, estimating marginal probabilities with multiple comparisons and Bonferroni adjustment. In addition, an analysis was carried out using ordinal logistic regression models to evaluate the relationship between age and the degree of agreement in lesion classification for each of the different AI methods, considering age and sex as independent variables. Given that the proportional odds assumption was not met for Gemini and age, a nominal logistic regression model was used for that method. Odds ratios (OR) and relative risk ratios (RRR) were reported for Gemini and age, with their respective 95% confidence intervals and p-values, using a significance level of 0.05.
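A hedged statsmodels sketch of these regressions, using a randomly generated placeholder dataset (not study data): `OrderedModel` fits the proportional-odds (ordinal logit) model, and `MNLogit` fits the nominal alternative applied to Gemini when the proportional-odds assumption failed.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
# Placeholder dataset, one row per case; not study data.
df = pd.DataFrame({
    "score": rng.integers(1, 4, 102),    # 1 = Different, 2 = Similar, 3 = Correct
    "age":   rng.integers(14, 97, 102),  # years
    "sex":   rng.integers(0, 2, 102),    # 0 = female, 1 = male
})
score = df["score"].astype(pd.CategoricalDtype([1, 2, 3], ordered=True))

# Proportional-odds model, as applied to ChatGPT-4.0, Meta AI, and o1-preview.
ordinal = OrderedModel(score, df[["age", "sex"]], distr="logit").fit(
    method="bfgs", disp=False)
print(np.exp(ordinal.params[["age", "sex"]]))  # odds ratios per unit increase

# Nominal (multinomial) logit, as applied to Gemini.
nominal = sm.MNLogit(df["score"], sm.add_constant(df[["age", "sex"]])).fit(disp=False)
print(np.exp(nominal.params))  # relative risk ratios vs. the base outcome
```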

Across the 18 grouped diagnosis types, Cohen’s kappa statistics were calculated for each method versus the pathologists’ consensus diagnosis. A heatmap was also created showing the percentage of correct diagnoses by method for each of the 18 grouped diagnoses. Of note, Bonferroni-adjusted significance thresholds were applied for multiple pairwise comparisons to reduce the risk of false-positive findings.
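Agreement of this kind can be computed with scikit-learn's `cohen_kappa_score`, as sketched below on placeholder diagnosis labels (not study data).

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels drawn from the grouped diagnosis types; not study data.
pathologist = ["mucocele", "fibroma", "oral lichen planus", "mucocele", "fibroma"]
model_outputs = {
    "ChatGPT o1": ["mucocele", "fibroma", "oral lichen planus", "fibroma", "fibroma"],
    "Gemini":     ["fibroma", "fibroma", "mucocele", "mucocele", "odontogenic cyst"],
}

# One kappa per method, each measured against the pathologists' consensus labels.
for name, dx in model_outputs.items():
    print(f"{name}: kappa = {cohen_kappa_score(pathologist, dx):.2f}")
```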

Statistical analyses were performed using Stata 18 software (StataCorp LLC, College Station, TX, USA). For the heatmap, Python programming language (version 3.11.11) was used in the Google Colaboratory environment. Data visualization, including heatmap creation, was carried out using Python’s ‘seaborn’ and ‘matplotlib’ libraries.
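A minimal version of the heatmap step using seaborn and matplotlib, the libraries named above; the `pct_correct` table holds placeholder percentages, not study results.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Placeholder diagnoses-by-method table of % correct (code 3); not study results.
pct_correct = pd.DataFrame(
    {"ChatGPT 4.0": [85, 60], "Gemini": [40, 20],
     "Meta AI": [80, 55], "ChatGPT o1": [90, 65]},
    index=["Mucocele", "Oral epithelial dysplasia"],
)

# Annotated heatmap of correct-diagnosis percentages by lesion type and method.
sns.heatmap(pct_correct, annot=True, fmt=".0f", cmap="viridis",
            cbar_kws={"label": "% correct (code 3)"})
plt.tight_layout()
plt.show()
```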

Diagnostic coding of differential diagnoses

The pathologists’ diagnoses were also compared to the differential diagnoses provided by the chatbots and were classified as follows:

  • (Incorrect): The chatbot’s differential diagnosis was clinically or pathologically different from the definitive diagnosis provided by the pathologists.

  • (Correct): Among the 3 differential diagnoses provided by the chatbot, one matches the definitive diagnosis provided by the pathologists.

Only those chatbot diagnoses classified as “Similar” or “Different” were included. Therefore, 41 cases were included for ChatGPT-4.0, 32 for ChatGPT o1-preview, 73 for Gemini, and 35 for Meta AI. Differential diagnosis performance was evaluated only in cases where the primary suggested diagnosis was classified as different or similar/partial, in order to assess whether the correct reference diagnosis was nonetheless included among the differential diagnoses despite not being selected as the primary option. This approach was chosen to specifically examine model behavior in clinically relevant scenarios of partial or incorrect initial interpretation.
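This subset rule can be expressed as a simple filter, sketched below with pandas on placeholder records (column names and values are illustrative, not study data).

```python
import pandas as pd

# Placeholder records: model, primary outcome category, and whether the
# reference diagnosis appeared among the three differentials.
results = pd.DataFrame({
    "model": ["ChatGPT 4.0", "ChatGPT 4.0", "Gemini", "Gemini"],
    "primary_category": ["correct", "different", "similar", "different"],
    "reference_in_differentials": [pd.NA, True, False, True],  # NA when primary was correct
})

# Keep only cases whose primary output was "different" or "similar/partial".
subset = results[results["primary_category"].isin(["different", "similar"])].copy()
subset["reference_in_differentials"] = subset["reference_in_differentials"].astype("boolean")

# Share of retained cases where the reference appeared among the differentials.
print(subset.groupby("model")["reference_in_differentials"].mean())
```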

Results

The performance of all artificial intelligence systems was categorized into three groups: Different (score = 1), Similar (score = 2), and Correct (score = 3). The counts for each category were as follows: ChatGPT-4.0: 31 cases were classified as Different (30.4%), 10 cases as Similar (9.8%), and 61 cases as Correct (59.8%). ChatGPT o1-preview: 10 cases were classified as Different (9.8%), 22 cases as Similar (21.6%), and 70 cases as Correct (68.6%). Gemini: 60 cases were classified as Different (58.8%), 14 cases as Similar (13.7%), and 28 cases as Correct (27.5%). Meta AI: 13 cases were classified as Different (12.7%), 22 cases as Similar (21.6%), and 67 cases as Correct (65.7%).

The distribution of results is shown in Fig. 2, and the median scores and dispersion measures are shown in Table 1.

Fig. 2. Distribution of the results of each AI model compared to the consensus diagnosis by pathologists.

Table 1. Description of the values of the results by AI method.

Statistic             ChatGPT 4.0   Gemini   Meta AI   ChatGPT o1
Median                3             1        3         3
Interquartile range   2             2        1         1
Minimum value         1             1        1         1
Maximum value         3             3        3         3

Table 1 shows that the median scores were 3 for ChatGPT 4.0, Meta AI, and ChatGPT o1-preview, and 1 for Gemini. The interquartile range was 2 for ChatGPT 4.0 and Gemini, and 1 for Meta AI and ChatGPT o1-preview. All methods had a minimum score of 1 and a maximum score of 3.

The median age was 56 years with an interquartile range of 34 years. Ages ranged from 14 to 96 years, with a total of 102 observations, of which 44.12% were male. A total of 17 distinct diagnoses plus an “Other” category were observed, as shown in Table 2.

Table 2. Distribution of grouped histopathological diagnoses.

Diagnosis n (%)
Mucocele 20 (19.6%)
Fibroma 15 (14.7%)
Inflammatory changes 12 (11.8%)
Oral epithelial dysplasia 9 (8.8%)
Oral lichen planus 9 (8.8%)
Others 8 (7.8%)
Odontogenic cyst 6 (5.9%)
Reactive epithelial changes 5 (4.9%)
Oral squamous cell carcinoma 4 (3.9%)
Oral candidiasis 3 (2.9%)
Hemangioma 2 (2.0%)
Oral melanosis 2 (2.0%)
Nonspecific ulcer 2 (2.0%)
Foreign body granuloma 1 (1.0%)
Traumatic neuroma 1 (1.0%)
Pemphigus 1 (1.0%)
Salivary duct cyst 1 (1.0%)
Syphilis 1 (1.0%)

The Friedman test yielded a p-value of 0.0078, indicating that at least one of the evaluated methods showed significant differences in the scores obtained. Post-hoc multiple comparisons, performed using the Wilcoxon signed-rank test for paired samples, are presented in Table 3. The results showed that Gemini scores were significantly lower compared to ChatGPT 4.0, Meta AI, and ChatGPT o1-preview. Additionally, Meta AI performed better than ChatGPT 4.0.

Table 3. AI methods comparison matrix: p-values from the Wilcoxon test with Bonferroni adjustment.

AI Method     ChatGPT 4.0   Gemini      Meta AI
Gemini        < 0.0001*
Meta AI       0.0046*       < 0.0001*
ChatGPT o1    0.0091        < 0.0001*   0.5073

*Bonferroni-adjusted significance p < 0.0083.

The multilevel ordinal logistic regression model showed significant differences among the AI methods. Gemini had lower odds of obtaining higher (i.e., correct) scores compared to ChatGPT 4.0 (OR = 0.15, p < 0.001), while Meta AI and ChatGPT o1-preview did not show significant differences compared to ChatGPT 4.0. Meta AI and ChatGPT o1-preview showed significantly greater odds of achieving better results compared to Gemini (OR = 15.15 and OR = 17.70, respectively, p < 0.001). No significant differences were found between Meta AI and ChatGPT o1-preview (p > 0.99). See Table 4.

Table 4. Results of the multilevel ordinal logistic regression model with marginal estimates and Bonferroni correction for multiple comparisons.

AI Method                 OR      95% CI       p-value
Gemini vs. ChatGPT 4.0    0.15    0.06–0.37    < 0.001
Meta AI vs. ChatGPT 4.0   2.23    0.93–5.35    0.094
GPT o1 vs. ChatGPT 4.0    2.61    1.07–6.35    0.028
Meta AI vs. Gemini        15.15   5.64–40.70   < 0.001
GPT o1 vs. Gemini         17.70   6.46–48.51   < 0.001
GPT o1 vs. Meta AI        1.17    0.48–2.83    > 0.99

OR: odds ratio; CI: confidence interval.

The agreement between the four AI models (ChatGPT-4.0, Gemini, Meta AI, and ChatGPT o1-preview) and the pathologists’ diagnoses was evaluated. The results of Cohen’s kappa coefficient analysis showed substantial agreement between ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) and the pathologists’ reference diagnoses, moderate agreement for ChatGPT-4.0 (κ = 0.57), and poor agreement for Gemini (κ = 0.24).

The results showed that for ChatGPT-4.0 (OR = 0.97, 95% CI: 0.95–0.99, p = 0.004) and Meta AI (OR = 0.97, 95% CI: 0.95–0.99, p = 0.006), age had a significant association with the classification of oral lesions. However, for ChatGPT o1-preview (OR = 0.99, 95% CI: 0.97–1.01, p = 0.236), age did not show a statistically significant relationship with the ability to correctly report the diagnosis. In the case of Gemini, the nominal logistic regression model indicated that age was significantly associated with the classification, with a relative risk ratio of 0.95 (95% CI: 0.93–0.98, p < 0.001).

None of the AI methods showed a significant association between diagnostic ability and sex. The p-values for ChatGPT-4.0 (0.928), Gemini (0.258), Meta AI (0.812), and ChatGPT o1-preview (0.171) were all greater than 0.05, suggesting that sex does not significantly influence the results obtained by these models.

The percentage of correct diagnoses shown in Fig. 3 varied by lesion type and across AI methods. ChatGPT o1-preview achieved the best overall performance but also showed inconsistencies. Across all LLMs, higher concordance was observed for common benign lesions such as mucoceles and fibromas, whereas lower concordance was consistently noted for more diagnostically challenging entities, including oral epithelial dysplasia. The diagnoses of foreign body granuloma, pemphigus, and salivary duct cyst were among those with the poorest agreement across all AI methods.

Fig. 3. Percentage of correct agreement (code 3) by lesion type and AI method.

Regarding differential diagnoses, the quantities for each category were: ChatGPT o1-preview had 20 diagnoses marked as “correct” and 12 as “incorrect.” ChatGPT-4.0 had 20 diagnoses marked as “correct” and 21 as “incorrect.” Gemini had 11 “correct” and 62 “incorrect” responses. Lastly, Meta AI had 5 diagnoses marked as “correct” and 30 as “incorrect.” The results are shown in Table 5.

Table 5. Description of the outcome values according to AI method.

Result      ChatGPT 4.0   Gemini       Meta AI      ChatGPT o1
Correct     20 (48.8%)    11 (15.1%)   5 (14.3%)    20 (62.5%)
Incorrect   21 (51.2%)    62 (84.9%)   30 (85.7%)   12 (37.5%)

Discussion

This study evaluated the diagnostic performance of different chatbots (ChatGPT-4.0, ChatGPT o1-preview (OpenAI o1), Gemini, and Meta AI) as decision-support systems, using histopathological descriptions and comparing their interpretations of lesions with diagnoses provided by certified pathologists. Importantly, LLMs are not proposed as diagnostic tools, but as adjunctive decision-support systems intended to reduce cognitive load, enhance consistency, and support training in oral pathology. These findings reveal significant variability in accuracy and consistency among the models, with ChatGPT o1-preview showing the highest alignment with the pathologists’ diagnoses.

ChatGPT o1-preview achieved a correct-response rate of 68.6%, outperforming Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). This suggests that more advanced models, such as ChatGPT o1-preview or Meta AI, could offer better reliability in providing standardized, data-driven diagnoses. Nevertheless, the variability in performance observed among the LLMs underscores that improvements are needed before these tools can be carefully integrated into clinical practice.

It is hypothesized that the superior performance of ChatGPT o1-preview may reflect differences in model architecture, optimization strategies, or training approaches; however, these factors cannot be directly evaluated, as details regarding underlying algorithms and training datasets are not publicly available. Additionally, ChatGPT o1-preview had the lowest rate of divergent suggested diagnoses (9.8%). These results align with previous research demonstrating AI’s value as a support tool in diagnostic oral pathology when adequately trained14,20. Moreover, models like Meta AI and both versions of ChatGPT show potential by possibly reducing observer variability and standardizing diagnostic interpretations21,22.

As this study progresses in examining the various chatbots and their applications, it reveals both their capabilities and limitations. A major limitation affecting AI performance relates to the datasets used for training, which are retrospective and exclude real-time clinical scenarios. Moreover, if training data are incorrect or outdated, this may lead to biases and geographic and demographic disparities that could affect diagnostic outcomes and, ultimately, patient treatment16,23. Studies such as those by Chen et al. (2021) and Umer et al. (2024), which analyzed AI as a tool to enhance medical diagnostics, found that while the models showed promise, they are not infallible and must be used in conjunction with professionals. These studies also emphasize the need for further refinement of algorithms and AI databases to improve diagnostic capabilities22,24.

Other studies note that one of AI’s limitations is the lack of interpretability of histopathological descriptions, complicating clinical application since a single error could have major consequences25,26. Thus, it is recommended that AI be used only as a complementary tool and not as a full replacement for professionals in clinical settings. Some studies showed that in complex cases with significant variability, or in reports using very specific pathological language, AI performed worse than humans. This is due in part to its strong dependence on prompts, which can lead to misinterpretation, oversimplification of complex medical terminology, or even varying responses to the same question, affecting final diagnoses27. While AI is competent in pattern recognition and data categorization, it may still struggle with more complex diagnoses and histopathological descriptions where minimal tissue differences can significantly impact diagnosis and treatment26,28.

Furthermore, the results showed a moderate decrease in AI performance scores correlated with patient age for most models, with ChatGPT o1-preview being the exception. This observation does not establish causality, but may reflect increased histopathologic complexity commonly encountered in older patients. It is hypothesized that age-related inflammatory, degenerative, or overlapping pathological changes could result in more heterogeneous microscopic descriptions, thereby posing greater challenges for LLM-based interpretation. The absence of a significant age effect for ChatGPT o1-preview may reflect greater robustness to descriptive complexity; however, this finding should be interpreted cautiously given the retrospective design and sample size, and warrants confirmation in larger, prospectively designed cohorts. These observations are consistent with prior work by Cuevas-Nuñez et al. and Cirillo et al., which reported associations between demographic variables and AI performance while emphasizing that direct causal relationships remain unproven and that further research is needed to develop models resilient to clinical and demographic variability23,29.

Overall, demographic performance differences observed in AI systems may reflect multiple interacting factors, including lesion mix, case complexity, and potential biases in training data. Because model training datasets and architectures are not accessible in the present study, such explanations remain speculative and cannot be directly evaluated here30,31. Addressing these factors will be essential for the safe and equitable integration of AI-based decision-support tools into clinical oral pathology.

There were significant differences in performance when comparing different types of oral and maxillofacial lesions, with benign cases that have more robust histopathologic criteria showing the highest accuracy. Yet, as reported in other studies, while good results were observed for simpler diagnoses, there was a notable drop in accuracy for more complex cases like oral epithelial dysplasia. Sandra et al. (2022) reviewed how AI shows significant potential in oral cancer diagnosis using automated and deep learning tools to detect precancerous lesions through risk factor analysis, databases, and other variables14. For more complex lesions with significant variability, Mahdi et al. reported that atypical and unexpected information reduces AI accuracy. These findings reinforce the importance of professional supervision in oral pathology and emphasize AI’s role as a complementary tool32.

On the other hand, the study by Sivari et al. (2023) focused on dental diseases and anomalies. Although this was not the focus of this study, their findings offer valuable insight into improving AI algorithms. The study analyzed the application of deep learning in AI for dental anomalies and found that greater control over training data improved accuracy33. Deep learning methods performed better with images, suggesting that future research should explore integrating these methods with histological descriptions in oral and maxillofacial pathology.

Moreover, LLMs are not only useful for histopathological descriptions, but also have applications in oral and maxillofacial anesthesia. A study found that conversational AIs (ChatGPT-4, Claude 3, and Gemini 1.0) answered certification exam questions with moderate success, with room for improvement if datasets and algorithms are enhanced34. Khurana et al. (2023), on the other hand, point out that ChatGPT has limitations in image-based questions and suffers from a lack of validation or authenticity in its content. To address some of these issues, the study suggested refining the prompts35. These studies highlight the limitations of AI in diagnosing complex or varied cases, not only in text descriptions but also in images and other tests.

Regarding differential diagnoses when the chatbots were provided with histopathologic descriptions of lesions, the analysis conducted serves as a starting point for this type of research, as it requires a multidisciplinary team (health professionals and computer scientists) and access to the datasets and algorithms used by the chatbots to thoroughly analyze the root of LLM errors. Therefore, this study provides only a preliminary estimate of the sources of discordance with pathologists. The 48.8% success rate of ChatGPT-4.0 suggests possible issues in both its data and algorithms, but this remains a hypothesis, as these algorithms are unknown.

This study has several limitations. First, the single-center design and relatively small overall sample size may limit generalizability. In addition, histopathologic descriptions were originally generated by pathologists who also contributed to the consensus reference diagnoses, introducing a potential risk of incorporation bias. However, to mitigate this risk, the evaluation of chatbot outputs was performed through a blinded re-reading of the histopathologic descriptions without access to, or consideration of, the original final diagnoses. Thus, future studies should include fully independent or external blinded reviewers to further minimize this bias.

Likewise, although a broad spectrum of oral and maxillofacial lesions was included, some diagnostic categories were underrepresented, reducing the certainty of lesion-specific comparisons and warranting caution when interpreting apparent model strengths or weaknesses for individual lesion types. The analysis relied exclusively on text-based histopathologic descriptions, without access to whole-slide images or other visual data that are central to routine diagnostic pathology practice33. Advanced deep learning methods, as suggested by Sandra et al. (2022), could also refine AI’s ability to detect subtle morphological changes in tissues, improving diagnostic accuracy for conditions like epithelial dysplasia14,23. Further, the LLMs were provided with no additional clinical context beyond the pathology report, such as patient history, imaging, or risk factors, which may have constrained performance. Finally, the evaluated chatbots are proprietary systems that may evolve over time, posing challenges for strict reproducibility of results. These limitations should be considered when interpreting the findings as well as for future research projects, including the development of standardized prompting strategies, curated oral pathology datasets, and controlled benchmarking frameworks to enable reproducible and comparable evaluation of LLM-based decision-support systems.

Conclusion

The four chatbots selected for this study are characterized by their widespread use and accessibility, with ChatGPT-4.0 being the most used. Gemini, Meta AI, and ChatGPT o1-preview are developed by major technology companies (Google, Meta, and OpenAI). These chatbots are also accessible and have advanced natural language processing capabilities. This study serves as a starting point to evaluate the potential of conversational AI in histopathological diagnosis and to compare their performance.

ChatGPT o1-preview demonstrated the highest concordance with oral pathologists among the evaluated chatbots; however, overall performance remained below levels required for independent clinical use. The observed variability across models and reduced performance in complex cases underscore that current LLMs should be regarded as adjunctive decision-support tools, rather than standalone diagnostic systems. Accordingly, the integration of artificial intelligence into oral and maxillofacial pathology should proceed cautiously, with continued expert oversight, clear awareness of current limitations, and further validation before broader clinical adoption.

Author contributions

Conceptualization: VIA-S, VM; Methodology: VIA-S, CA-S, CG, JF-F, LF, VM; Investigation: VIA-S, CA-S, VR, CBP, MCC-N, MTFF; Data curation: VIA-S, CA-S, VR, CBP, MCC-N, MTFF; Formal analysis: CG, JF-F, VG-M, VM; Visualization: CG, VG-M; Writing – original draft: VIA-S, CA-S; Writing – review & editing: VIA-S, CA-S, CG, JF-F, VR, CBP, VG-M, LF, MCC-N, MTFF, VM; Supervision: LF, MCC-N, MTFF, VM; Project administration: VM; All authors read and approved the final manuscript and agree to be accountable for all aspects of the work.

Funding

Open access funding provided by Dr. DY Patil Vidyapeeth, Pune (Deemed to be University). This research did not receive any specific grant from funding agencies in the public, commercial, or not‑for‑profit sectors.

Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

This study was approved by the Ethics Committee for Clinical Studies (CEIC) of the Quirónsalud–Catalonia Hospital Group (reference: 2024/40-APA-HUGC). All procedures were conducted in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Declaration of Helsinki and its later amendments. As this study used anonymized retrospective histopathological descriptions without direct patient contact or identifiable information, the requirement for informed consent to participate was waived by the Ethics Committee for Clinical Studies (CEIC) of the Quirónsalud–Catalonia Hospital Group.

Statement of clinical relevance

Comparing LLM-based AI models such as ChatGPT, Meta AI, and Gemini offers a broader view of their unique capabilities and possible applications in clinical oral pathology settings, given differences in information-processing efficiency, data handling, and user adaptability.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Mohammad-Rahimi, H. et al. Performance of AI chatbots on controversial topics in oral medicine, pathology, and radiology. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 137, 508–514; 10.1016/j.oooo.2024.01.015 (2024).
  • 2. Jovanovic, M., Baez, M. & Casati, F. Chatbots as conversational healthcare services. IEEE Internet Comput. 25, 44–51; 10.1109/MIC.2020.3037151 (2021).
  • 3. Ahmed, N. et al. Artificial intelligence techniques: Analysis, application, and outcome in dentistry – A systematic review. BioMed Res. Int. 2021, 9751564; 10.1155/2021/9751564 (2021).
  • 4. Meta Platforms Inc. Meta Reports First Quarter 2025 Results [Internet]. 2025 Apr [cited 2025 Dec 31]. Available from: https://investor.fb.com/investor-news/press-release-details/2025/Meta-Reports-First-Quarter-2025-Results
  • 5. Robertson, A. Google reveals Gemini AI has 350 million monthly active users [Internet]. The Verge. 2025 [cited 2025 Dec 31]. Available from: https://www.theverge.com/2025/4/23/google-gemini-350-million-monthly-users
  • 6. Dolan, B. Google’s Gemini usage is surging, but rivals still dominate [Internet]. Business Insider. 2025 [cited 2025 Dec 31]. Available from: https://www.businessinsider.com/google-gemini-usage-surging-rivals-chatgpt-meta-dominating-2025-4
  • 7. Statcounter Global Stats. New Statcounter AI data finds ChatGPT sends 79.8% of all chatbot referrals to websites [Internet]. 2025 [cited 2025 Dec 31]. Available from: https://gs.statcounter.com/press/new-statcounter-ai-data-finds-chatgpt-sends-79-perc-of-all-chatbot-referrals-to-websites
  • 8. Gunning, D. & Aha, D. W. DARPA’s explainable artificial intelligence program. AI Mag. 40, 44–58; 10.1609/aimag.v40i2.2850 (2019).
  • 9. Wang, L. et al. A systematic review of ChatGPT and other conversational large language models in healthcare. medRxiv; 10.1101/2024.04.26.24306390 (2024).
  • 10. Kariamal, N. & Angadi, P. V. Artificial intelligence in oral pathology practice – An overview. Ann. Dent. Spec. 11, 82–86 (2023).
  • 11. Sultan, A. S., Elgharib, M. A., Tavares, T., Jessri, M. & Basile, J. R. The use of artificial intelligence, machine learning and deep learning in oncologic histopathology. J. Oral Pathol. Med. 49, 849–856 (2020).
  • 12. Hegde, S., Ajila, V., Zhu, W. & Zeng, C. Artificial intelligence in early diagnosis and prevention of oral cancer. Asia-Pac. J. Oncol. Nurs. 9, 100133 (2022).
  • 13. Hirosawa, T. et al. Diagnostic accuracy of differential-diagnosis lists generated by Generative Pretrained Transformer 3 chatbot for clinical vignettes with common chief complaints: A pilot study. Int. J. Environ. Res. Public Health 20, 3378 (2023).
  • 14. Sandra, S. C., Raghavan, A. & Madan Kumar, P. D. Application of artificial intelligence in the diagnosis and survival prediction of patients with oral cancer. J. Oral Res. Rev. 14, 154 (2022).
  • 15. Rao, A. et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: Development and usability study. J. Med. Internet Res. 25, e48659 (2023).
  • 16. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118; 10.1038/nature21056 (2017).
  • 17. Mahesh, N. Advancing healthcare: the role and impact of AI and foundation models. Am. J. Transl. Res. 16, 2166–2179 (2024).
  • 18. Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
  • 19. Abdul, N. S. et al. Applications of artificial intelligence in the field of oral and maxillofacial pathology: a systematic review and meta-analysis. BMC Oral Health 24, 122 (2024).
  • 20. Farhadi Nia, M., Ahmadi, M. & Irankhah, E. Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care. Front. Dent. Med. 5, 1456208 (2024).
  • 21. Sabri, H. et al. Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education. J. Periodontal Res. 60, 121–133 (2025).
  • 22. Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
  • 23. Cirillo, D. et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. npj Digit. Med. 3, 81 (2020).
  • 24. Umer, F., Batool, I. & Naved, N. Innovation and application of large language models (LLMs) in dentistry – a scoping review. BDJ Open 10, 90 (2024).
  • 25. Cai, X., Zhang, H., Wang, Y., Zhang, J. & Li, T. Digital pathology-based artificial intelligence models for differential diagnosis and prognosis of sporadic odontogenic keratocysts. Int. J. Oral Sci. 16, 16 (2024).
  • 26. Buhr, C. R. et al. Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology. Acta Otolaryngol. 144, 237–242 (2024).
  • 27. Bélisle-Pipon, J. C. Why we need to be careful with LLMs in medicine. Front. Med. (Lausanne) 11, 1495582 (2024).
  • 28. Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: The potentials and pitfalls: A narrative review. Ann. Intern. Med. 177, 210–220 (2024).
  • 29. Cuevas-Nunez, M. et al. Diagnostic performance of ChatGPT-4.0 in histopathological description analysis of oral and maxillofacial lesions: A comparative study with pathologists. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 139, 453–461; 10.1016/j.oooo.2024.11.087 (2025).
  • 30. Nadeem, A., Marjanovic, O. & Abedin, B. Gender bias in AI-based decision-making systems: a systematic literature review. Australas. J. Inf. Syst. 26, 1–34 (2022).
  • 31. Fletcher, R. R., Nakeshimana, A. & Olubeko, O. Addressing fairness, bias, and appropriate use of artificial intelligence and machine learning in global health. Front. Artif. Intell. 3, 561802; 10.3389/frai.2020.561802 (2021).
  • 32. Mahdi, S. S. et al. How does artificial intelligence impact digital healthcare initiatives? A review of AI applications in dental healthcare. Int. J. Inf. Manag. Data Insights 3, 100144; 10.1016/j.jjimei.2022.100144 (2023).
  • 33. Sivari, E. et al. Deep learning in diagnosis of dental anomalies and diseases: A systematic review. Diagnostics (Basel) 13, 2512; 10.3390/diagnostics13152512 (2023).
  • 34. Fujimoto, M. et al. Evaluating large language models in dental anesthesiology: A comparative analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology board certification exam. Cureus 16, e70302 (2024).
  • 35. Khurana, S. & Vaddi, A. ChatGPT from the perspective of an academic oral and maxillofacial radiologist. Cureus 15, e40053 (2023).
