BMC Oral Health. 2026 Feb 7;26:472. doi: 10.1186/s12903-026-07727-7

Advanced deep learning techniques for classifying dental conditions using panoramic X-ray images

Alireza Golkarieh 1, Bahareh Afjehsoleymani 2, Kiana Kiashemshaki 3, Sajjad Rezvani Boroujeni 4
PMCID: PMC12977708  PMID: 41654817

Abstract

Objective

This study evaluated multiple deep learning approaches for automated classification of dental conditions in panoramic radiographs, comparing the performance of custom convolutional neural networks (CNNs), hybrid CNN-machine learning models, and fine-tuned pre-trained architectures for detecting fillings, cavities, implants, and impacted teeth.

Methods

A dataset of 1,512 panoramic X-ray images with 11,137 manually annotated bounding boxes for four dental conditions (fillings, cavities, implants, and impacted teeth) was analyzed, with regions of interest extracted using expert annotations for subsequent AI-based classification. Class imbalance was addressed through random downsampling, creating a balanced dataset of 894 samples per condition. Multiple approaches were evaluated via 5-fold cross-validation: a custom CNN, hybrid models combining CNN features with traditional classifiers (Support Vector Machine, Decision Tree, Random Forest), and fine-tuned pre-trained networks (VGG16, Xception, ResNet50). Performance was assessed using accuracy, precision, recall, and F1-score metrics.

Results

The hybrid CNN-Random Forest model achieved the highest accuracy of 85.4 ± 2.3% with a macro-F1 score of 0.843 ± 0.028, representing an 11-percentage-point improvement over the custom CNN (74.29% accuracy, 0.724 macro-F1). VGG16 demonstrated superior pre-trained architecture performance (82.3 ± 2.0% accuracy, 0.817 macro-F1), followed by Xception (80.9 ± 2.3%) and ResNet50 (79.5 ± 2.7%). CNN + Random Forest exhibited exceptional fillings detection (F1: 0.860 ± 0.033) with balanced multi-class performance. Systematic misclassifications between morphologically similar conditions revealed inherent diagnostic challenges.

Conclusion

Hybrid CNN-based approaches combining feature extraction with Random Forest classification provide superior discriminative capability for dental condition detection on manually annotated regions compared to standalone architectures. While computationally efficient hybrid models show promise as supportive diagnostic tools, observed misclassification patterns indicate these AI systems should serve as adjuncts to clinical expertise, requiring prospective validation studies.

Keywords: Dental x-ray classification, Dental diagnostics, Dental conditions, Panoramic radiographs, Convolutional neural networks (CNN)

Introduction

Dental health plays a crucial role in overall human well-being, affecting essential functions such as mastication, speech articulation, facial aesthetics, and psychological confidence, while poor oral health has been linked to systemic conditions including cardiovascular disease, diabetes, and respiratory infections [1]. According to the World Health Organization, oral diseases affect nearly 3.5 billion people worldwide, with untreated dental caries being the most common health condition globally, affecting approximately 2.3 billion people [2]. The economic burden of dental diseases is substantial, with treatment costs exceeding $442 billion annually worldwide, emphasizing the critical need for early detection and preventive interventions [3]. Traditional diagnostic methods for dental condition assessment, particularly through X-ray radiographic imaging, while widely utilized in clinical practice, are heavily dependent on clinician expertise and experience, leading to potential variability in diagnostic accuracy and increased examination time. With the exponential growth of dental X-ray imaging data and the emergence of artificial intelligence in healthcare, deep learning techniques have shown significant promise for automated medical image analysis and pattern recognition in radiographic diagnostics [4]. The integration of deep learning algorithms in X-ray-based dental diagnostics offers unprecedented opportunities to enhance diagnostic precision, reduce human error, and provide consistent, objective assessments of various dental pathologies including caries, periodontal diseases, and anatomical anomalies. Given the increasing global burden of dental diseases and the growing demand for efficient diagnostic tools, the development of automated classification systems using deep learning methodologies for X-ray image analysis has become not only beneficial but essential for advancing dental healthcare delivery and improving patient care standards worldwide.

The application of artificial intelligence in dental diagnostics has witnessed remarkable advancement over the past decade, with numerous research investigations demonstrating the potential of machine learning and deep learning methodologies for automated classification of various dental conditions through X-ray image analysis. Contemporary investigations have focused extensively on the detection and classification of dental caries, with studies utilizing convolutional neural networks (CNNs) achieving detection accuracies of 82% on periapical radiographs [5] and sensitivity rates of 89.7% with specificity of 92.3% on cone beam computed tomography images [6]. Research has progressed from binary classification to multi-class problems, with notable achievements in dental implant classification where transfer learning approaches reached accuracies of 99.04% for identifying different implant systems [7] and 95.6% for classifying implant dimensions [8]. Additional studies have addressed periodontal disease detection, filled tooth identification, and restorative material classification using various deep learning architectures [9, 10]. Despite these advances, most existing studies have focused on single-pathology detection or binary classification tasks, with limited investigation into simultaneous multi-class classification of different dental conditions within unified frameworks that incorporate spatial localization capabilities.

The integration of entity annotation approaches with classification networks for panoramic radiograph analysis requires further exploration, particularly for assessment of multiple dental conditions including fillings, cavities, implants, and impacted teeth within a single diagnostic system. Furthermore, previous research has primarily emphasized detection algorithms without sufficient attention to entity-based segmentation techniques that provide spatial information about specific dental condition regions. Specifically, entity-based segmentation using bounding box annotations enables precise localization of pathological regions within panoramic radiographs, facilitating both spatial identification and subsequent classification of detected conditions.

This study addresses these limitations through three key contributions: first, we implement a multi-class classification framework for simultaneous assessment of four dental conditions (fillings, cavities, implants, and impacted teeth) rather than single-pathology detection; second, we employ entity-level bounding box annotations to provide spatial localization of dental conditions within panoramic radiographs; and third, we compare hybrid CNN-ML models that utilize flattened CNN features with end-to-end deep learning architectures to determine optimal classification strategies. This integrated approach combining entity-based segmentation with multi-class classification provides both spatial localization and diagnostic capabilities within a unified system, contributing to improved diagnostic consistency in dental practice and supporting the development of practical AI-based diagnostic tools for clinical applications.

Materials and methods

Data collection

This study employed a retrospective cross-sectional design utilizing publicly available panoramic dental radiographs for the development and evaluation of deep learning-based classification models. The inclusion criteria comprised panoramic X-ray images with clearly identifiable dental conditions including fillings, cavities, implants, and impacted teeth, while exclusion criteria included images with poor quality, incomplete dental arches, or ambiguous pathological features that could not be definitively classified by expert reviewers.

The dataset utilized in this investigation was obtained from the Roboflow platform, comprising panoramic dental X-ray images annotated for multiple dental conditions. The final dataset contained 1,940 annotated instances distributed across four classes: fillings (n = 823), cavities (n = 530), implants (n = 301), and impacted teeth (n = 286). Each annotated tooth region was assigned to a single class label, ensuring mutually exclusive classification without overlapping labels per tooth entity. The annotation process involved evaluation by two qualified dental specialists who independently reviewed and labeled the dental conditions present in each radiograph. Disagreements between the two specialists regarding image interpretation were resolved through consensus discussion, where both experts jointly re-examined the contested cases and reached agreement on the final classification. In cases where consensus could not be achieved through discussion alone, a third senior dental expert was consulted to provide the definitive classification. Only the annotations confirmed by both experts were considered in the final dataset, thereby enhancing the credibility and reliability of the ground truth labels. It should be noted that no quantitative inter-rater agreement metric such as Cohen’s kappa was computed during the annotation process.

Figure 1 presents the masked dental X-ray images and class distribution. The term “masked” refers to the application of bounding box annotations that define regions of interest containing specific dental conditions, which are subsequently extracted as individual image patches for classification purposes. Figure 1 illustrates representative examples of annotated panoramic radiographs with bounding boxes indicating the spatial locations of different dental conditions, along with the distribution of instances across the four classes.

Fig. 1. Illustration of masked dental X-ray images and class distribution in the study dataset

Preprocessing

The preprocessing pipeline consisted of multiple sequential operations designed to enhance image quality and standardize input data for deep learning model training. Table 1 presents the complete mathematical formulations for each preprocessing operation applied to the dental radiographs.

Table 1.

Overview of preprocessing techniques and dataset balancing after mask application

Method Formula Description Parameters
Brightness Adjustment $I'(x, y) = \alpha \cdot I(x, y) + \beta$ Adjusts brightness and contrast to enhance feature visibility. α = 1.5, β = 15
Noise Removal [11] $I'(x, y) = \operatorname{median}\{ I(s, t) : (s, t) \in N_k(x, y) \}$ Reduces salt-and-pepper noise while preserving edges using a median blur filter. Kernel size k = 3
Contrast Enhancement [12] $T(i) = \lfloor (L - 1) \cdot \mathrm{CDF}_{\text{clip}}(i) \rfloor$ Enhances local contrast using CLAHE, avoiding excessive noise amplification. clipLimit = 2.0, tileGridSize = 3 × 3
Normalization $I'(x, y) = I(x, y) / 255$ Scales pixel intensities to the range [0, 1] for faster neural network convergence. None
Applying Mask $I'(x, y) = I(x, y) \cdot M(x, y)$ Binary mask M extracts regions of interest defined by bounding box annotations. None
Resizing $I' = \operatorname{resize}(I, 224 \times 224)$ Resizes images to a uniform dimension to ensure consistency in input size for the models. Final size = 224 × 224, nearest-neighbor interpolation
Random Downsampling (Balancing) – Randomly reduces the number of samples in each class to match the smallest class size after masking. Based on class with minimum samples (894)

The preprocessing steps were applied sequentially as follows: First, brightness adjustment was performed to normalize the overall illumination across different radiographs. Second, median filtering was applied for noise reduction while preserving edge information (Table 1). Third, Contrast Limited Adaptive Histogram Equalization (CLAHE) was implemented to enhance local contrast and improve the visibility of subtle dental features. Fourth, pixel value normalization was conducted to standardize the intensity range across all images. Fifth, masking operations were applied using the bounding box annotations to extract specific regions of interest corresponding to individual dental conditions. Finally, all extracted regions were resized to uniform dimensions of 224 × 224 pixels to meet the input requirements of the deep learning architectures.
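To make the pipeline concrete, the following is a minimal OpenCV sketch of these six steps, assuming a grayscale panoramic image and a bounding box in (x, y, width, height) format; the function name is illustrative, and parameter values follow Table 1.

```python
import cv2
import numpy as np

def preprocess_roi(image: np.ndarray, bbox: tuple) -> np.ndarray:
    """Apply the sequential preprocessing steps (Table 1) to one annotated region."""
    # 1. Brightness/contrast adjustment: I' = alpha * I + beta
    img = cv2.convertScaleAbs(image, alpha=1.5, beta=15)
    # 2. Median blur (k = 3) to suppress salt-and-pepper noise while keeping edges
    img = cv2.medianBlur(img, 3)
    # 3. CLAHE for local contrast enhancement without excessive noise amplification
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(3, 3))
    img = clahe.apply(img)
    # 4. Normalize pixel intensities to [0, 1]
    img = img.astype(np.float32) / 255.0
    # 5. Mask/crop the region of interest defined by the bounding box annotation
    x, y, w, h = bbox
    roi = img[y:y + h, x:x + w]
    # 6. Resize to the 224 x 224 network input size (nearest-neighbor interpolation)
    return cv2.resize(roi, (224, 224), interpolation=cv2.INTER_NEAREST)
```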

The original dataset exhibited substantial class imbalance, with fillings representing 42.4% of instances, cavities 27.3%, implants 15.5%, and impacted teeth 14.7%. To address this imbalance and prevent model bias toward the majority class, random downsampling was implemented to balance the class distribution. The downsampling procedure was performed after splitting the dataset into training and testing sets to prevent data leakage and ensure that the test set maintained its original class distribution for realistic performance evaluation. The balanced training dataset contained 286 instances per class, totaling 1,144 samples, while the test set retained its original imbalanced distribution with 206 samples for evaluation.

Downsampling was selected as the balancing strategy rather than class weighting or oversampling techniques to create a controlled experimental environment for systematic model comparison. This approach ensures that all models are trained on identical balanced datasets, eliminating confounding factors related to class distribution handling and allowing direct performance comparison. However, it should be acknowledged that this artificial balancing may not reflect real-world clinical prevalence, and the implications of this choice are discussed in the Limitations section.
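A minimal sketch of this balancing step, assuming integer-encoded labels and NumPy arrays and applied to the training split only, as described above; the function name is illustrative.

```python
import numpy as np

def downsample_to_minority(x: np.ndarray, y: np.ndarray, seed: int = 42):
    """Randomly downsample every class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    kept = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    kept.sort()  # preserve original sample order
    return x[kept], y[kept]
```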

Figure 2 presents the preprocessing workflow with the steps presented sequentially from left to right. The figure illustrates the transformation of raw panoramic radiographs through each preprocessing stage, demonstrating the progressive enhancement of image quality and the extraction of normalized regions of interest suitable for classification model input.

Fig. 2. Step-by-step visualization of preprocessing operations, from raw input to the final standardized images

Computational environment

The computational models were implemented using Python programming language, specifically leveraging libraries such as TensorFlow and Keras for neural network construction and training. The simulations were conducted on a system equipped with an NVIDIA RTX 3050 Ti laptop GPU with 4GB of VRAM to accelerate the training process. The system specifications include an Intel Core i7 processor, 32 GB of RAM, and a Windows 11 operating system. The integrated development environment (IDE) used was PyCharm. Data preprocessing and analysis were performed using additional libraries like NumPy, pandas, and SciPy.

Model development

Independent CNN model

A custom-designed Convolutional Neural Network (CNN) was developed to address the four-class dental condition classification task. The network architecture followed a progressive filter expansion strategy (32 → 64 → 128 → 256), enabling hierarchical feature extraction from low-level visual patterns, such as edges and textures, to high-level semantic representations. This design choice is consistent with established methodologies in medical image analysis, where gradual increases in filter depth facilitate effective modeling of complex anatomical structures while maintaining computational efficiency. The proposed CNN comprised approximately 2.1 million trainable parameters.

The architecture consisted of four sequential convolutional blocks, each including a convolutional layer, batch normalization, ReLU activation, and max-pooling. All convolutional layers employed 3 × 3 kernels with filter depths of 32, 64, 128, and 256, respectively. The extracted feature maps were subsequently flattened and fed into two fully connected layers with 256 and 128 neurons. To mitigate overfitting, dropout regularization was applied with rates of 0.5 and 0.3 for the first and second fully connected layers, respectively. The final output layer utilized a softmax activation function to produce class probability distributions for the four dental conditions.

Model optimization was performed using the Adam optimizer with an initial learning rate of 0.001 and without weight decay regularization. Network weights were initialized using the Glorot uniform initialization method, while biases were initialized to zero. Training was conducted for up to 30 epochs with a batch size of 16. An early stopping mechanism with a patience of 10 epochs, monitored on the validation loss, was employed to prevent overfitting and enhance generalization performance.
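A minimal Keras sketch of this custom CNN, following the architecture and training settings described above (four blocks with batch normalization, dense layers of 256 and 128 units with dropout 0.5/0.3, Adam at 0.001, early stopping with patience 10); layer ordering details not stated in the text are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(224, 224, 1), num_classes=4):
    """Four conv blocks (32 -> 64 -> 128 -> 256), each with BN, ReLU, max-pooling."""
    model = models.Sequential([tf.keras.Input(shape=input_shape)])
    for filters in (32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
# Training (per fold): model.fit(x_train, y_train, validation_data=(x_val, y_val),
#                                epochs=30, batch_size=16, callbacks=[early_stop])
```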

The convolutional operation at each layer can be mathematically expressed as follows:

$$ y_{i,j} = \sigma\left( \sum_{m}\sum_{n} w_{m,n}\, x_{i+m,\,j+n} + b \right) \tag{1} $$

where $y_{i,j}$ denotes the output feature value at spatial location $(i, j)$, $w_{m,n}$ represents the convolutional kernel weights, $x$ is the input feature map, $b$ is the bias term, and $\sigma$ denotes the activation function. The complete architecture and hyperparameters are summarized in Table 2, and Fig. 3 illustrates the CNN workflow for feature extraction and classification.

Table 2.

Summary of the CNN architecture and learning parameters for machine learning models

Model: CNN
Architecture details: Input (224, 224, 1) → Conv2D (32, 3 × 3, ReLU) → MaxPooling (2 × 2) → Dropout (0.3) → Conv2D (64, 3 × 3, ReLU) → MaxPooling (2 × 2) → Dropout (0.3) → Conv2D (128, 3 × 3, ReLU) → MaxPooling (2 × 2) → Dropout (0.3) → Conv2D (256, 3 × 3, ReLU) → MaxPooling (2 × 2) → Dropout (0.3) → Conv2D (512, 3 × 3, ReLU) → MaxPooling (2 × 2) → Dropout (0.3) → Flatten → Dense (256, ReLU) → Dropout (0.3) → Dense (4, Softmax)
Learning parameters: Optimizer: Adam; Loss: Sparse Categorical Cross-Entropy; Early Stopping patience: 10; Batch Size: 16; Epochs: 30; Dropout: 0.3–0.5; Batch Normalization: Yes; Cross-Validation: 5-fold
Activation function: ReLU (hidden), Softmax (output)

Model: SVM
Learning parameters: Standardize Data: True; Solver: SMO; Kernel: RBF; C: 1.0; Gamma: 'scale'; probability = True; Cross-Validation: 5-fold

Model: DT
Learning parameters: Standardize Data: True; Criterion: 'gini'; Max Depth: None; Min Samples Split: 2; Cross-Validation: 5-fold

Model: RF
Learning parameters: Standardize Data: True; Number of Estimators: 100; Criterion: 'gini'; Max Features: 'auto'; Cross-Validation: 5-fold
Fig. 3. Architecture of the Convolutional Neural Network (CNN) model

Hybrid models

The hybrid modeling approach combined pre-trained CNN architectures for feature extraction with traditional machine learning classifiers for final classification. In this framework, the CNN component was not trained or updated during the hybrid model learning process. Instead, pre-trained CNN models (VGG16, ResNet50, and Xception) were used as fixed feature extractors, where input images were passed through the frozen convolutional layers to obtain high-dimensional feature representations. These extracted features were then used to train the machine learning classifiers (SVM, Decision Tree, Random Forest), whose parameters were optimized during the training phase while the CNN weights remained unchanged.

Three classical machine learning algorithms were evaluated as classifiers in the hybrid framework: Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF). Before training the machine learning classifiers, the extracted CNN features were standardized using z-score normalization (zero mean and unit variance) to ensure that all features contributed equally to the classification decision and to improve classifier convergence.
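A minimal sketch of this feature-extraction-plus-classifier pipeline, using VGG16's block5_pool output as the frozen extractor and the Random Forest configuration from Table 4; x_train, y_train, and x_test are assumed to be preprocessed ROI patches replicated to three channels for the ImageNet backbone.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Frozen ImageNet backbone used purely as a fixed feature extractor
backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
backbone.trainable = False

def extract_features(images: np.ndarray) -> np.ndarray:
    """Flatten block5_pool feature maps (N, 7, 7, 512) into 25,088-D vectors."""
    feature_maps = backbone.predict(images, verbose=0)
    return feature_maps.reshape(feature_maps.shape[0], -1)

x_train_feat = extract_features(x_train)
x_test_feat = extract_features(x_test)

# z-score standardization, fitted on training features only
scaler = StandardScaler().fit(x_train_feat)

# Random Forest classifier trained on the frozen CNN features (Table 4 settings)
clf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_split=2,
                             max_features="sqrt", random_state=42)
clf.fit(scaler.transform(x_train_feat), y_train)
y_pred = clf.predict(scaler.transform(x_test_feat))
```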

Support Vector Machine (SVM)

The Support Vector Machine (SVM) classifier was employed using a Radial Basis Function (RBF) kernel to model non-linear decision boundaries within the high-dimensional feature space. Hyperparameters were defined a priori without grid-search optimization, with the regularization parameter set to C = 1.0 and the kernel coefficient to γ = 0.001 (Table 4). These values were selected in accordance with common practices for handling high-dimensional features in medical image analysis. The SVM decision function is mathematically defined as follows [13]:

$$ f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i\, y_i\, K(x_i, x) + b \right) \tag{2} $$

where $\alpha_i$ denotes the Lagrange multipliers, $y_i$ represents the class labels, $K(x_i, x)$ is the RBF kernel function, and $b$ is the bias term.

Decision Tree (DT)

The Decision Tree (DT) classifier was configured with a maximum depth of 10 to balance model expressiveness and generalization performance. The Gini impurity criterion was adopted for node splitting, and a minimum of five samples was required to split an internal node. All hyperparameters were fixed a priori without systematic tuning. The Gini impurity used to evaluate node purity is defined as:

$$ \operatorname{Gini}(t) = 1 - \sum_{c=1}^{C} p_c^{\,2} \tag{3} $$

where $p_c$ denotes the proportion of samples belonging to class $c$ at node $t$.

The decision tree recursively partitions the feature space by selecting splits that maximize information gain, calculated as:

$$ IG(S, A) = \operatorname{Gini}(S) - \sum_{j} \frac{N_j}{N}\, \operatorname{Gini}(S_j) \tag{4} $$

where $S$ represents the parent node, $A$ denotes the splitting attribute, $S_j$ refers to the child nodes resulting from the split, and $N_j$ and $N$ indicate the number of samples in the child and parent nodes, respectively.

Random Forest (RF)

The Random Forest (RF) classifier consisted of an ensemble of 100 decision trees, each trained using bootstrap sampling of the training dataset and random feature selection at each split. The maximum depth of individual trees was set to 20, while the minimum number of samples required for node splitting was fixed at 2. The number of features considered at each split was set to the square root of the total number of input features. These parameters were selected a priori based on standard Random Forest configurations reported in the literature. Final predictions were obtained using majority voting across all decision trees in the ensemble [14]:

$$ \hat{y} = \operatorname{mode}\{ h_1(x), h_2(x), \ldots, h_T(x) \} \tag{5} $$

where $h_t(x)$ denotes the prediction of the $t$-th decision tree and $\hat{y}$ represents the final ensemble output. All learning parameters used are specified in Table 2.

Pre-trained deep learning models

Three state-of-the-art pre-trained CNN architectures were evaluated through transfer learning: VGG16, ResNet50, and Xception. These models were initially pre-trained on the ImageNet dataset and subsequently fine-tuned for the dental condition classification task.

VGG16 Architecture

VGG16 is characterized by its deep architecture with 16 weight layers, employing small 3 × 3 convolutional filters throughout the network. The architecture consists of five convolutional blocks with increasing filter depths (64, 128, 256, 512, 512), each followed by max-pooling layers for spatial dimension reduction. For the hybrid models, features were extracted from the final convolutional block (block5_pool layer) before the fully connected layers, yielding 25,088-dimensional feature vectors (512 filters × 7 × 7 spatial dimensions) for each input image.

For transfer learning with fine-tuning, all convolutional layers in the first four blocks (comprising 10 convolutional layers) were frozen to preserve low-level and mid-level feature representations learned from ImageNet. Only the final convolutional block (block5, containing 3 convolutional layers) and the fully connected classification head were made trainable, allowing the model to adapt high-level features to dental-specific patterns. This corresponds to freezing layers 1 through 15 and training layers 16 through 23.

The conceptual architecture of VGG16 follows a hierarchical feature learning paradigm where shallow layers capture edges and textures, intermediate layers detect parts and patterns, and deep layers encode high-level semantic concepts. The uniform use of 3 × 3 convolutions with stride 1 and 2 × 2 max-pooling enables systematic spatial dimension reduction while expanding the receptive field. The feature extraction at layer l is defined as [15]:

$$ F^{(l)} = \sigma\left( W^{(l)} * F^{(l-1)} + b^{(l)} \right) \tag{6} $$

where $F^{(l)}$ is the feature map at layer $l$, $W^{(l)}$ and $b^{(l)}$ denote the kernel weights and bias, and $\sigma$ is the non-linear ReLU activation. The simplicity and depth of VGG16 make it effective for medical imaging tasks, although the model is computationally intensive due to its large parameter set.
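As a concrete illustration of this partial freezing, the following Keras sketch freezes everything before block5 and attaches a trainable classification head; the head's layer sizes are assumptions, since the text does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze blocks 1-4; only block5 layers remain trainable
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # head size assumed for illustration
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```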

ResNet50 Architecture

ResNet50 addresses the vanishing gradient problem through residual skip connections that enable training of very deep networks. The architecture contains 50 layers organized into five stages with bottleneck blocks. For hybrid model feature extraction, features were obtained from the global average pooling layer after the final residual block, producing 2,048-dimensional feature vectors.

For transfer learning, the first three residual stages (comprising 38 layers) were frozen to retain general visual representations, while the final two stages (12 layers) and the classification head were trained to learn dental-specific features. This configuration freezes layers 1 through 143 and trains layers 144 through 175.

ResNet50 employs identity skip connections that allow gradients to flow directly through the network during backpropagation. Each bottleneck block contains three convolutions (1 × 1, 3 × 3, 1 × 1) with batch normalization and ReLU activation, where the 1 × 1 convolutions reduce and restore dimensionality while the 3 × 3 convolution performs the main feature transformation. Its fundamental unit is the residual block, formulated as [16]:

$$ y = \mathcal{F}(x, \{W_i\}) + x \tag{7} $$

where $\mathcal{F}(x, \{W_i\})$ denotes the residual mapping learned by the stacked layers, and the shortcut term ($+\,x$) enables direct information propagation across layers, facilitating efficient training of networks exceeding 50 layers. Additionally, the architecture employs batch normalization and ReLU activation after convolutional operations, which accelerates convergence and enhances generalization. The ability of ResNet50 to learn hierarchical features with improved gradient flow makes it particularly suitable for complex radiographic data.

Xception Architecture

Xception (Extreme Inception) replaces standard convolutions with depthwise separable convolutions, reducing computational complexity while maintaining representational power. The architecture consists of 36 convolutional layers structured into three flows: entry flow, middle flow (repeated 8 times), and exit flow.

For hybrid models, features were extracted from the global average pooling layer, yielding 2,048-dimensional feature vectors.

For transfer learning, the entry flow and first four middle flow modules (comprising 75 layers) were frozen, while the remaining middle flow modules, exit flow, and classification head were trained. This freezes layers 1 through 115 and trains layers 116 through 134.

Xception’s depthwise separable convolutions first apply spatial filtering independently to each input channel (depthwise convolution), then combine the outputs through pointwise 1 × 1 convolutions. This factorization reduces parameters and computations while enabling efficient cross-channel correlation learning. The operation at layer l is expressed as [17]:

$$ F^{(l)} = W_p \otimes \left( W_d * F^{(l-1)} \right) \tag{8} $$

where $W_d$ represents the depthwise filters, $W_p$ the pointwise filters, and $\otimes$ indicates channel-wise combination. This factorization enables the model to independently capture spatial correlations through depthwise filters and cross-channel interactions via pointwise filters. Owing to its efficiency and strong performance, Xception is well-suited for large-scale medical image analysis (Table 3).

Table 3.

Pre-trained model configuration parameters

Model Total Parameters Trainable Parameters Frozen Layers Trainable Layers Feature Dimension (Hybrid)
VGG16 138.4 M 7.1 M Layers 1–15 (10 conv layers) Layers 16–23 (3 conv + FC) 25,088
ResNet50 25.6 M 7.1 M Layers 1-143 (3 residual stages) Layers 144–175 (2 stages + FC) 2,048
Xception 22.9 M 8.4 M Layers 1-115 (entry + 4 middle flows) Layers 116–134 (4 middle + exit flows) 2,048

Figure 4 presents the architectural configurations of VGG16 (a), ResNet50 (b), and Xception (c).

Fig. 4. Fine-tuned pretrained architectures for dental condition classification

Training strategy and data augmentation

No data augmentation techniques (such as rotation, flipping, scaling, or elastic deformation) were applied during model training. The models were trained exclusively on the preprocessed and balanced dataset without artificial sample generation.

The learning parameters (optimizer, loss function, batch size, number of epochs, and training strategies) are summarized in Table 4. By combining domain-specific fine-tuning with the robust architectures of VGG16, ResNet50, and Xception, the models were optimized to achieve accurate and reliable classification of dental conditions from X-ray images.

Table 4.

Complete implementation specifications

Component Specification
Data Augmentation None applied
Feature Extraction Layer (Hybrid) VGG16: block5_pool (25,088-D); ResNet50: avg_pool (2,048-D); Xception: avg_pool (2,048-D)
Feature Normalization Z-score standardization (zero mean, unit variance)
SVM Hyperparameters Kernel: RBF; C = 1.0; γ = 0.001; Fixed a priori
Decision Tree Hyperparameters Max depth: 10; Min samples split: 5; Criterion: Gini; Fixed a priori
Random Forest Hyperparameters Estimators: 100; Max depth: 20; Min samples split: 2; Max features: sqrt; Fixed a priori
CNN Learning Rate 0.001 (Adam optimizer)
Weight Decay None applied
Initialization Glorot uniform (weights), Zero (biases)
Early Stopping Patience: 10 epochs; Monitor: validation loss
Batch Size 16
Maximum Epochs 30
Hyperparameter Tuning Fixed a priori without grid/random search

All models were trained using the categorical cross-entropy loss function and the Adam optimizer. The training process incorporated early stopping monitored on validation loss to prevent overfitting. Consistent with Table 4, a batch size of 16 was used for all experiments, and training was conducted for a maximum of 30 epochs with early termination if validation performance plateaued.

Overall workflow

Figure 5 presents the complete methodological workflow of the study, illustrating the sequential stages from data collection through model evaluation. The workflow encompasses: (1) acquisition of panoramic dental radiographs with expert annotations, (2) preprocessing pipeline including brightness adjustment, noise removal, CLAHE, normalization, masking using manually annotated bounding boxes for region of interest extraction, and resizing, (3) dataset splitting and class balancing, (4) parallel model training tracks including independent custom CNN, hybrid models combining pre-trained CNNs (VGG16, ResNet50, Xception) with machine learning classifiers (SVM, Decision Tree, Random Forest), and fine-tuned pre-trained models, and (5) comprehensive evaluation using multiple performance metrics.

Fig. 5. Workflow of the proposed classification framework for dental radiographs

It should be explicitly noted that the segmentation masks derived from manual bounding box annotations were used solely for preprocessing purposes to extract regions of interest. The developed models perform image-level four-class classification on these extracted ROI patches, not instance detection or automatic segmentation of dental conditions. The localization information is provided by manual expert annotations, and the AI models are responsible only for classifying the pre-segmented regions.

Model evaluation metrics

Model performance was evaluated using multiple complementary metrics to provide a comprehensive assessment of classification accuracy and reliability. All evaluations were conducted on the held-out test set, which preserved the original imbalanced class distribution in order to reflect realistic clinical conditions.

Accuracy

Overall classification accuracy represents the proportion of correctly classified samples across all classes and is defined as:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9} $$

where $TP$ denotes true positives, $TN$ true negatives, $FP$ false positives, and $FN$ false negatives.

Precision

Precision measures the proportion of correctly predicted positive instances among all predicted positive instances:

$$ \text{Precision} = \frac{TP}{TP + FP} \tag{10} $$

Recall (Sensitivity)

Recall, also referred to as sensitivity, quantifies the proportion of actual positive instances that are correctly identified by the model:

$$ \text{Recall} = \frac{TP}{TP + FN} \tag{11} $$

F1-score

The F1-score represents the harmonic mean of precision and recall, providing a balanced measure of classification performance:

$$ F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12} $$

Macro-averaged metrics

For multi-class evaluation, macro-averaged precision, recall, and F1-score were computed. This procedure involves first calculating each metric independently for each of the four dental condition classes and then averaging the results using an unweighted arithmetic mean. Consequently, all classes contribute equally to the final score, irrespective of their sample sizes.

For each class $c$:

$$ \text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \text{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c} \tag{13} $$

The macro-averaged metrics are then defined as:

$$ \text{Macro-}M = \frac{1}{C} \sum_{c=1}^{C} M_c, \qquad M \in \{\text{Precision}, \text{Recall}, F1\}, \; C = 4 \tag{14} $$

This macro-averaging strategy is particularly suitable for imbalanced datasets, as it prevents majority classes from dominating the evaluation and provides a more balanced reflection of model performance across minority classes.

Confusion matrix

Confusion matrices were generated to visualize the classification performance across all four dental condition classes, revealing patterns of misclassification and class-specific model behavior. The matrices display the distribution of predicted labels versus true labels, with diagonal elements representing correct classifications and off-diagonal elements indicating misclassification errors.
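All of these metrics can be computed directly with scikit-learn; a minimal sketch, assuming y_true and y_pred hold integer labels for the four classes.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

acc = accuracy_score(y_true, y_pred)
# Macro averaging: per-class metrics first, then an unweighted mean (Eqs. 13-14)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="macro",
                                                   zero_division=0)
# Rows are true labels, columns are predicted labels (diagonal = correct)
cm = confusion_matrix(y_true, y_pred)
print(f"accuracy={acc:.3f} macro-P={prec:.3f} macro-R={rec:.3f} macro-F1={f1:.3f}")
```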

Results

Dataset characteristics and preprocessing

A publicly available dataset consisting of 1,512 panoramic dental X-ray images with a resolution of 512 × 256 pixels was employed in this study. A total of 11,137 annotations were provided for four dental conditions: fillings (6,797), implants (2,308), cavities (1,139), and impacted teeth (894). All annotations were independently reviewed and validated by two dental specialists to ensure clinical accuracy. To enhance image quality and maintain consistency, a preprocessing pipeline was implemented, including brightness adjustment, noise reduction, local contrast enhancement, normalization, mask application, and resizing to 224 × 224 pixels. Class imbalance was addressed through random downsampling, resulting in a balanced dataset of 894 samples per class. It should be explicitly noted that all performance metrics reported in this study were obtained through 5-fold cross-validation on this internal dataset, and no external validation on independent test sets from different institutions or imaging systems was performed. This represents a limitation in assessing the generalizability of the models to diverse clinical settings.

Custom CNN model performance evaluation

Figure 6 presents a comprehensive evaluation of the custom CNN model’s performance in classifying four dental conditions (fillings, cavity, implant, and impacted tooth) through 5-fold cross-validation analysis. The evaluation includes training dynamics, classification metrics, and detailed error analysis to assess model reliability and generalization capability.

Fig. 6. Performance evaluation of custom CNN model for dental condition classification using 5-fold cross-validation. a Training history displaying loss and accuracy curves for both training and validation sets across 30 epochs, with separate lines for each fold. b Performance metrics including precision, recall, and F1-score for four dental conditions (fillings, cavity, implant, and impacted tooth) with error bars representing standard deviation across folds. c Confusion matrix aggregated from all test samples across the 5-fold cross-validation, showing the distribution of true labels versus predicted labels for the four dental condition classes

Table 5 presents the average performance metrics of the CNN model for dental condition classification across five folds, summarizing the results shown in Fig. 6b.

Table 5.

Average performance metrics of CNN for dental condition classification across five folds

Dental Condition Accuracy Precision Recall F1-Score
Fillings 0.7429 0.8001 0.7787 0.7861
Cavity 0.7429 0.6655 0.7602 0.6716
Implant 0.7429 0.7069 0.6508 0.6445
Impacted Tooth 0.7429 0.8001 0.6202 0.6930
Macro Average 0.7294 0.7432 0.7025 0.7238

The results presented in Fig. 6 and summarized in Table 5 demonstrate that the custom CNN model achieved an overall accuracy of 74.29% across all dental conditions. The model exhibited varying performance across different conditions, with fillings showing the most balanced performance (precision: 0.80, recall: 0.78, F1-score: 0.79), while impacted teeth demonstrated challenges in recall (0.62) despite high precision (0.80). The confusion matrix analysis of 3,576 total test samples (894 per class) reveals that fillings achieved the highest correct classification rate (696/894), followed by cavity (679/894), implant (581/894), and impacted tooth (554/894). Notable misclassification patterns include approximately 100 samples misclassified between cavity and implant categories in both directions, and around 110 samples misclassified between cavity and impacted tooth, suggesting morphological similarities that challenge the model’s discriminative capability. From a clinical perspective, the confusion between cavities and implants is particularly concerning as these conditions require fundamentally different treatment approaches. The cavity-implant misclassification may stem from similar radiographic density patterns in certain imaging angles, while cavity-impacted tooth confusion likely reflects overlapping radiolucent appearances in panoramic projections. Such diagnostic errors could lead to inappropriate treatment planning if relied upon without clinical verification. The training curves indicate stable convergence without significant overfitting, demonstrating the model’s ability to generalize effectively across different data folds.

Hybrid CNN-based model comparison

Figure 7 presents the performance evaluation of three hybrid models combining frozen CNN feature extraction with traditional machine learning classifiers (Decision Tree, Random Forest, and SVM) for dental condition detection. The comparison demonstrates how classical classifiers perform when trained on CNN-derived feature representations across four dental conditions. The three pre-trained models were evaluated solely through fine-tuning approaches rather than hybrid combinations with machine learning classifiers. This design choice was made to assess the full end-to-end learning capability of these established architectures when adapted to dental imaging tasks. While hybrid approaches combining pre-trained CNNs with machine learning classifiers could potentially offer advantages such as faster training of the classification head and explicit feature space manipulation, the fine-tuning approach allows these deep architectures to adapt their hierarchical feature representations specifically to dental radiographic patterns through backpropagation across multiple layers. Future work could explore hybrid variants of these pre-trained models to determine whether such integration provides additional performance benefits.

Fig. 7. Comparative performance analysis of hybrid CNN-based models for dental condition classification. a Macro-averaged performance metrics (accuracy, precision, recall, and F1-score) for three hybrid models across 5-fold cross-validation, with error bars indicating standard deviation. b Confusion matrices for each hybrid model showing the classification results across four dental conditions (fillings, cavity, implant, and impacted tooth), with numerical values representing the count of samples in each prediction category. c t-SNE visualization of CNN feature space structure across model architectures. Data points represent individual dental X-ray samples projected into 2D space, color-coded by condition (blue: fillings, purple: cavity, orange: implant, red: impacted tooth)

Table 6 provides the detailed performance metrics of the hybrid CNN-based models for each of the four dental conditions, highlighting the differences across decision tree, random forest, and SVM classifiers.

Table 6.

Performance comparison of hybrid CNN-Based models for dental condition detection

Model Dental Condition F1-Score Precision Recall Accuracy
CNN + DT Cavity 0.793 ± 0.028 0.789 ± 0.019 0.798 ± 0.026 0.812 ± 0.029
Fillings 0.823 ± 0.022 0.834 ± 0.032 0.812 ± 0.024 0.812 ± 0.029
Impacted Tooth 0.788 ± 0.025 0.784 ± 0.018 0.792 ± 0.026 0.812 ± 0.029
Implant 0.788 ± 0.030 0.801 ± 0.024 0.776 ± 0.026 0.812 ± 0.029
Macro Average 0.798 ± 0.026 0.802 ± 0.023 0.795 ± 0.025 0.812 ± 0.029
CNN + RF Cavity 0.828 ± 0.021 0.823 ± 0.027 0.834 ± 0.023 0.854 ± 0.023
Fillings 0.860 ± 0.033 0.876 ± 0.022 0.845 ± 0.039 0.854 ± 0.023
Impacted Tooth 0.854 ± 0.037 0.848 ± 0.029 0.859 ± 0.020 0.854 ± 0.023
Implant 0.828 ± 0.030 0.839 ± 0.016 0.817 ± 0.037 0.854 ± 0.023
Macro Average 0.843 ± 0.028 0.847 ± 0.024 0.839 ± 0.030 0.854 ± 0.023
CNN + SVM Cavity 0.749 ± 0.019 0.745 ± 0.025 0.754 ± 0.032 0.786 ± 0.031
Fillings 0.787 ± 0.022 0.798 ± 0.024 0.776 ± 0.036 0.786 ± 0.031
Impacted Tooth 0.759 ± 0.034 0.748 ± 0.034 0.771 ± 0.039 0.786 ± 0.031
Implant 0.755 ± 0.019 0.762 ± 0.027 0.748 ± 0.023 0.786 ± 0.031
Macro Average 0.763 ± 0.024 0.763 ± 0.028 0.762 ± 0.033 0.786 ± 0.031

The comparative analysis presented in Fig. 7 and detailed in Table 6 reveals significant performance improvements achieved through hybrid CNN-based approaches compared to the standalone custom CNN model. The CNN + RF hybrid model demonstrated superior performance with the highest accuracy of 85.4 ± 2.3%, followed by CNN + DT (81.2 ± 2.9%) and CNN + SVM (78.6 ± 3.1%). Notably, the CNN + RF model achieved exceptional performance for fillings detection (F1-score: 0.860 ± 0.033, precision: 0.876 ± 0.022), representing a substantial improvement over the baseline CNN model’s 78.61% F1-score for the same condition. The confusion matrices indicate that CNN + RF achieved the most balanced classification across all dental conditions with relatively high recall rates, while CNN + SVM exhibited more conservative predictions characterized by higher precision but lower recall, likely due to the difficulty of optimally tuning RBF kernel hyperparameters in high-dimensional CNN feature spaces where the appropriate kernel bandwidth and regularization parameters are challenging to determine without extensive cross-validation. These results demonstrate that Random Forest effectively leverages CNN-extracted features to enhance discriminative capability, particularly in distinguishing between morphologically similar dental pathologies that challenged the standalone CNN model.

Pre-trained model evaluation

Figure 8 presents the performance evaluation of three fine-tuned pre-trained deep learning architectures (VGG16, Xception, and ResNet50) adapted for dental condition detection. The comparison demonstrates how established CNN architectures perform when fine-tuned on dental radiographic data, leveraging transfer learning to enhance classification accuracy across four dental conditions.

Fig. 8. Comparative performance analysis of fine-tuned pre-trained CNN models for dental condition classification. a Performance metrics comparison across three pre-trained models showing accuracy, precision, recall, and F1-score with error bars representing standard deviation from 5-fold cross-validation. b Confusion matrices for each pre-trained model displaying classification results across four dental conditions (fillings, cavity, implant, and impacted tooth), with numerical values indicating the distribution of predicted versus true labels. c Training and validation loss curves for VGG16, Xception, and ResNet50 models across training epochs, demonstrating convergence behavior and generalization characteristics for each architecture

Table 7 summarizes the performance comparison of three pre-trained deep learning models (VGG16, Xception, and ResNet50) for the detection of four dental conditions, providing detailed metrics across F1-score, precision, recall, and accuracy.

Table 7.

Performance comparison of Pre-trained deep learning models for dental condition detection

Model Dental Condition F1-Score Precision Recall Accuracy
VGG16 Cavity 0.808 ± 0.026 0.801 ± 0.026 0.815 ± 0.034 0.823 ± 0.020
Fillings 0.831 ± 0.023 0.834 ± 0.028 0.828 ± 0.027 0.823 ± 0.020
Impacted Tooth 0.822 ± 0.031 0.815 ± 0.023 0.830 ± 0.027 0.823 ± 0.020
Implant 0.805 ± 0.027 0.812 ± 0.032 0.798 ± 0.033 0.823 ± 0.020
Macro Average 0.817 ± 0.027 0.816 ± 0.027 0.818 ± 0.030 0.823 ± 0.020
Xception Cavity 0.791 ± 0.023 0.785 ± 0.028 0.798 ± 0.025 0.809 ± 0.023
Fillings 0.815 ± 0.026 0.818 ± 0.024 0.812 ± 0.023 0.809 ± 0.023
Impacted Tooth 0.804 ± 0.029 0.798 ± 0.030 0.811 ± 0.034 0.809 ± 0.023
Implant 0.791 ± 0.030 0.798 ± 0.025 0.784 ± 0.031 0.809 ± 0.023
Macro Average 0.800 ± 0.027 0.800 ± 0.027 0.801 ± 0.028 0.809 ± 0.023
ResNet50 Cavity 0.774 ± 0.029 0.768 ± 0.021 0.781 ± 0.027 0.795 ± 0.027
Fillings 0.798 ± 0.025 0.801 ± 0.028 0.795 ± 0.027 0.795 ± 0.027
Impacted Tooth 0.797 ± 0.025 0.783 ± 0.025 0.812 ± 0.029 0.795 ± 0.027
Implant 0.774 ± 0.026 0.781 ± 0.019 0.767 ± 0.029 0.795 ± 0.027
Macro Average 0.786 ± 0.026 0.783 ± 0.023 0.789 ± 0.028 0.795 ± 0.027

The comparative analysis presented in Fig. 8 and detailed in Table 7 demonstrates that VGG16 achieved the highest overall performance among the pre-trained models with an accuracy of 82.3 ± 2.0%, followed by Xception (80.9 ± 2.3%) and ResNet50 (79.5 ± 2.7%). VGG16 exhibited the most consistent performance across all dental conditions, with particularly strong results for fillings detection (F1-score: 0.831 ± 0.023, precision: 0.834 ± 0.028). The confusion matrix analysis reveals that VGG16 achieved the highest correct classification rates for fillings (742/894), cavity (697/894), and implant (724/894) conditions, while demonstrating balanced performance across all categories. Xception showed competitive performance with relatively lower variance in metrics, achieving strong results for fillings (727 correct classifications) and impacted tooth (693 correct classifications). ResNet50, while showing the lowest overall accuracy, maintained reasonable performance consistency with notable strength in impacted tooth detection (680 correct classifications). These results indicate that VGG16’s architectural characteristics, particularly its deep convolutional structure, are well-suited for extracting discriminative features from dental radiographic images, outperforming both the custom CNN model and other pre-trained architectures in this specific medical imaging domain.

Comprehensive model comparison and statistical analysis

Table 8 presents a comprehensive comparison of all evaluated models including the custom CNN, hybrid CNN-based approaches, and fine-tuned pre-trained architectures, facilitating direct performance assessment across different modeling strategies.

Table 8.

Comprehensive performance comparison of all evaluated models

Model Category Model Accuracy Macro Precision Macro Recall Macro F1-Score
Custom Architecture CNN 0.743 ± 0.000 0.743 ± 0.000 0.703 ± 0.000 0.724 ± 0.000
Hybrid Models CNN + DT 0.812 ± 0.029 0.802 ± 0.023 0.795 ± 0.025 0.798 ± 0.026
CNN + RF 0.854 ± 0.023 0.847 ± 0.024 0.839 ± 0.030 0.843 ± 0.028
CNN + SVM 0.786 ± 0.031 0.763 ± 0.028 0.762 ± 0.033 0.763 ± 0.024
Pre-trained Models VGG16 0.823 ± 0.020 0.816 ± 0.027 0.818 ± 0.030 0.817 ± 0.027
Xception 0.809 ± 0.023 0.800 ± 0.027 0.801 ± 0.028 0.800 ± 0.027
ResNet50 0.795 ± 0.027 0.783 ± 0.023 0.789 ± 0.028 0.786 ± 0.026

Statistical significance testing was performed to compare model performance using paired t-tests on macro F1-scores across the five cross-validation folds. The CNN + RF model demonstrated statistically significant superiority over the custom CNN baseline (p < 0.001), CNN + SVM (p < 0.01), and all pre-trained models including VGG16 (p < 0.05), Xception (p < 0.01), and ResNet50 (p < 0.001). VGG16 also showed significant improvement over the custom CNN (p < 0.01). These statistical analyses confirm that the observed performance differences are not attributable to random variation across cross-validation folds.
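For reproducibility, this paired comparison reduces to SciPy's ttest_rel applied to fold-wise macro-F1 scores; the values below are illustrative placeholders consistent with the reported means, not the study's actual fold scores.

```python
from scipy import stats

# Hypothetical fold-wise macro-F1 scores from 5-fold CV (illustrative only)
f1_cnn_rf = [0.86, 0.84, 0.81, 0.85, 0.86]
f1_custom_cnn = [0.74, 0.71, 0.73, 0.72, 0.72]

# Paired t-test across folds (each fold evaluates both models on the same data)
t_stat, p_value = stats.ttest_rel(f1_cnn_rf, f1_custom_cnn)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```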

Figure 9 presents the receiver operating characteristic (ROC) curves for the best-performing CNN + RF model across all four dental condition classes, along with the macro-averaged ROC curve. The area under the curve (AUC) values demonstrate excellent discriminative capability: fillings (AUC = 0.94), cavity (AUC = 0.91), impacted tooth (AUC = 0.92), and implant (AUC = 0.90), with a macro-averaged AUC of 0.92. These ROC analyses confirm the model’s strong ability to distinguish between dental conditions across various classification thresholds.

Fig. 9. Receiver Operating Characteristic (ROC) curves for the CNN + RF model demonstrating multi-class classification performance across four dental conditions
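One-vs-rest AUC values of this kind can be reproduced from class-probability outputs; a minimal sketch, assuming clf, scaler, and x_test_feat come from the hybrid-pipeline sketch in the Methods and y_true holds the test labels.

```python
from sklearn.metrics import roc_auc_score

# Per-class probability estimates, shape (N, 4)
x_test_scaled = scaler.transform(x_test_feat)
y_score = clf.predict_proba(x_test_scaled)

# Macro-averaged one-vs-rest AUC across the four dental conditions
macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
print(f"macro-averaged AUC = {macro_auc:.2f}")
```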

The comprehensive comparison reveals that hybrid approaches, particularly CNN + RF, provide substantial performance advantages over both custom architectures and fine-tuned pre-trained models for this dental classification task. The superior performance of CNN + RF can be attributed to the ensemble nature of Random Forest, which effectively combines multiple decision boundaries in the CNN-extracted feature space, providing robustness to feature noise and enhanced generalization. The pre-trained models, while benefiting from transfer learning, did not surpass the hybrid approach, suggesting that the combination of CNN feature extraction with ensemble classification offers optimal performance for this specific dental imaging application under the experimental conditions evaluated.

Discussion

This study evaluated multiple deep learning approaches on a comprehensive dataset of 1,512 panoramic dental X-ray images containing 11,137 annotations across four dental conditions (fillings, implants, cavities, and impacted teeth). The results demonstrate that hybrid CNN-based models significantly outperformed standalone approaches, with the CNN + Random Forest combination achieving the highest accuracy of 85.4 ± 2.3%, representing a substantial improvement over the custom CNN baseline (74.29%). Among pre-trained architectures, VGG16 emerged as the most effective with 82.3 ± 2.0% accuracy, while the hybrid models consistently showed superior performance in distinguishing between morphologically similar dental pathologies. These findings indicate that combining CNN feature extraction with traditional machine learning classifiers, particularly Random Forest, provides enhanced discriminative capability for automated dental condition detection in panoramic radiographs. All performance metrics reported herein were obtained through 5-fold cross-validation on a single public dataset, and the models have not been validated on external datasets from different clinical centers or imaging systems, which represents an important limitation affecting generalizability assessment.

Comparison with existing literature

Since studies in this domain have utilized diverse datasets with varying sizes, imaging modalities, and annotation protocols, direct quantitative comparison of performance metrics is not feasible. However, examining recent literature provides valuable context for positioning our findings within the broader landscape of automated dental radiograph analysis. Table 9 presents methodological approaches and reported outcomes from contemporary studies to illustrate the current state of research in dental image classification and detection.

Table 9.

Overview of recent methodologies and reported performance in dental radiograph analysis

Research Study Dataset Size Imaging Modality Clinical Focus Computational Approach Reported Performance
Jae-Hong Lee (2023) [18] 11,980 images Panoramic Dental implant system identification Deep CNN with professional validation 95.4% accuracy
Vasdev et al. (2023) [19] 16,000 images Mixed dental images Multi-class dental disease detection Pipeline: AlexNet, ResNet-18, ResNet-34 AlexNet: 85.2% accuracy
Muhammad Adnan Hasnain (2023) [20] N/A Not specified Radiographic dental pathology classification Transfer learning: ResNet-101, Xception, DenseNet-201, EfficientNet-B0 EfficientNet-B0: 98.91% accuracy
Chisako Muramatsu (2023) [21] 100 images Panoramic Tooth detection and classification for dental charting Object detection network with 4-fold CV 93.2% classification performance
Kailai Zhang (2023) [22] 1,000 images Radiographic images Individual tooth detection and classification Hierarchical label tree with cascade network 95.8% detection accuracy
W. Park (2023) [23] 150,733 images Dental images Implant system classification Modified ResNet-50 with adaptations 82% accuracy
L. Toledo Reyes (2023) [24] 639 images Clinical images Caries progression prediction Ensemble: Decision Trees, RF, XGBoost AUC > 0.70
F. Schwendicke (2022) [25] 3,293,252 samples Radiographic samples AI-assisted caries detection cost-effectiveness ML-based detection algorithms 80% diagnostic accuracy
Present Investigation 1,512 images (3,576 balanced samples) Panoramic Multi-condition classification: Fillings, Cavities, Implants, Impacted Teeth Custom CNN, Hybrid CNN-ML, Pre-trained (VGG16, Xception, ResNet50) Custom CNN: 74.29%, CNN + RF: 85.40%, VGG16: 82.23%

It is important to note that direct comparison of our results with studies reporting 95–99% accuracy requires careful interpretation. Many of these studies focused on binary classification tasks (e.g., implant present/absent) or single-pathology detection with imbalanced datasets favoring majority classes, whereas our investigation addressed simultaneous four-class classification with artificially balanced data to enable fair model comparison. Multi-condition classification with balanced datasets presents inherently greater complexity than single-task detection, as the model must learn discriminative features for multiple pathological conditions simultaneously rather than optimizing for a single diagnostic target. Furthermore, some high-performing studies utilized substantially larger datasets (e.g., 150,733 images), which generally improve deep learning model generalization. Therefore, while our accuracy values may appear lower in absolute terms, they reflect the increased difficulty of the multi-class balanced classification task undertaken in this investigation.

Beyond CNN-based approaches for dental condition classification, alternative deep learning strategies in dentistry have been explored across diverse clinical applications. Recent innovations include methods for skeletal maturation assessment using AI-driven analysis of hand-wrist radiographs, where YOLOv8x-based deep learning models achieved high accuracy (F1 scores ranging from 0.92 to 0.99) across five different hand-wrist maturation classification methods, demonstrating the effectiveness of deep learning in evaluating skeletal age for orthodontic treatment planning [26]. Such approaches highlight the expanding role of artificial intelligence in comprehensive dental diagnostics, encompassing growth assessment, treatment planning, and developmental monitoring. Our study contributes to this evolving landscape by demonstrating that hybrid CNN-ML architectures offer competitive performance for multi-condition classification tasks, building upon earlier innovations while addressing the specific challenges of panoramic radiograph analysis for simultaneous detection of multiple dental pathologies.

Custom CNN model performance

The custom CNN model achieved an overall accuracy of 74.29% across the four dental conditions for automated pathology detection from panoramic radiographs, as demonstrated in Fig. 6 and summarized in Table 4. The model exhibited differential performance across conditions: fillings showed the most balanced metrics (precision: 0.80, recall: 0.78, F1-score: 0.79), whereas impacted teeth posed a classification challenge, combining high precision (0.80) with lower recall (0.62). This performance aligns with recent findings in dental radiographic analysis, where CNN-based approaches have shown varying effectiveness depending on the morphological complexity of dental pathologies [27]. The confusion matrix in Fig. 6(c) revealed systematic misclassifications between the cavity-implant and cavity-impacted tooth categories, suggesting that these conditions share radiographic features that challenge the model's discriminative capability. This observation is consistent with the inherent difficulty of distinguishing overlapping dental pathologies in panoramic images, where anatomical superimposition and varying image quality can obscure diagnostic features [28].
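The per-class metrics and confusion matrix discussed above can be reproduced with scikit-learn. The following minimal sketch uses synthetic stand-in predictions so that it runs on its own; the class names and variables are illustrative assumptions, not the study's code.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

CLASSES = ["Fillings", "Cavities", "Implants", "Impacted teeth"]

# Synthetic stand-ins for demonstration; in practice these are the CNN's
# predictions on the held-out fold.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 4, size=200)
y_pred = np.where(rng.random(200) < 0.75, y_true, rng.integers(0, 4, size=200))

# Per-class precision, recall, and F1-score, as reported in Table 4.
print(classification_report(y_true, y_pred, target_names=CLASSES, digits=3))

# Confusion matrix, as in Fig. 6(c): rows are true classes, columns predicted.
print(confusion_matrix(y_true, y_pred))
```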

Hybrid CNN-based model performance and superiority analysis

The hybrid CNN-based models demonstrated performance improvements over the standalone CNN approach, as illustrated in Fig. 7 and detailed in Table 6, with the CNN + RF combination achieving the highest accuracy of 85.4 ± 2.3%. This represents an improvement of approximately 11 percentage points over the custom CNN baseline, highlighting the effectiveness of combining deep feature extraction with traditional machine learning classifiers. The macro-averaged performance metrics shown in Fig. 7(a) demonstrate the superiority of the Random Forest hybrid approach across all evaluation metrics.

The exceptional performance of CNN + RF over both the custom CNN and fine-tuned pre-trained networks can be attributed to several key factors. First, Random Forest demonstrates inherent robustness on relatively small to moderate-sized datasets through its ensemble learning mechanism, which averages predictions from multiple decision trees trained on bootstrap samples, thereby reducing overfitting that can affect end-to-end neural networks when training data is limited. Second, the ensemble averaging in Random Forest provides natural regularization by combining diverse weak learners, which helps reduce sensitivity to noisy or misleading features that individual CNN classifiers might overfit. Third, Random Forest operates effectively in high-dimensional feature spaces without requiring explicit dimensionality reduction, making it well-suited for processing the rich feature representations extracted by CNN layers. Fourth, the separation of feature extraction (frozen CNN) from classification (RF training) allows the hybrid approach to leverage pre-learned hierarchical features while optimizing the decision boundary specifically for the dental classification task without risking degradation of learned features through backpropagation on limited data. Statistical analysis confirmed the superiority of CNN + RF with significant performance improvements over custom CNN (p < 0.001), CNN + SVM (p < 0.01), and all pre-trained models (p < 0.05), validating that these performance gains are not attributable to random variation across cross-validation folds.
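To make the separation between frozen feature extraction and classifier training concrete, the following is a minimal sketch in Keras and scikit-learn. The penultimate-layer choice, estimator count, and function names are illustrative assumptions rather than the study's exact configuration.

```python
import numpy as np
from tensorflow import keras
from sklearn.ensemble import RandomForestClassifier

def build_feature_extractor(trained_cnn: keras.Model) -> keras.Model:
    # Reuse everything up to (but excluding) the final softmax layer, so the
    # output is the learned feature vector rather than class probabilities.
    return keras.Model(inputs=trained_cnn.input,
                       outputs=trained_cnn.layers[-2].output)

def train_hybrid(trained_cnn, X_train, y_train):
    extractor = build_feature_extractor(trained_cnn)
    feats = extractor.predict(X_train, verbose=0)  # CNN weights stay frozen
    rf = RandomForestClassifier(n_estimators=300, random_state=0)
    rf.fit(feats, y_train)  # the RF alone learns the decision boundary
    return extractor, rf

def predict_hybrid(extractor, rf, X):
    return rf.predict(extractor.predict(X, verbose=0))
```

Because the forest is fit on features extracted in a single forward pass, no gradients flow back into the CNN, which is what protects the learned representations from degradation on limited training data.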

Figure 7(c) presents t-SNE visualization of the CNN-extracted feature space, demonstrating that hybrid models, particularly CNN + RF, better exploit the structure of high-dimensional features for classification. The t-SNE plot reveals distinct clustering patterns for the four dental conditions, with CNN + RF achieving clearer separation between classes compared to the custom CNN alone. Additionally, feature importance analysis from the Random Forest classifier identified the most discriminative features for dental condition classification, showing that ensemble methods effectively weight informative features while down-weighting noisy or redundant dimensions. These visualizations provide empirical evidence that the hybrid approach enhances classification performance by optimizing decision boundaries in the CNN-learned feature space rather than through raw pixel-level learning.
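A projection of this kind can be generated directly from the extracted features. In the sketch below, random vectors stand in for the CNN features so that the snippet runs standalone, and the perplexity value is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
feats = rng.normal(size=(400, 128))    # stand-in for CNN-extracted features
labels = rng.integers(0, 4, size=400)  # stand-in for the four condition codes

# Project the high-dimensional feature vectors onto two dimensions.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
sc = plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
plt.legend(*sc.legend_elements(), title="Condition")
plt.title("t-SNE of CNN-extracted feature space")
plt.show()
```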

The CNN + RF model particularly excelled in fillings detection (F1-score: 0.860 ± 0.033), as shown in Table 6, suggesting that Random Forest effectively leveraged the CNN-extracted features to distinguish filling materials from natural tooth structures and other pathological conditions. The confusion matrices presented in Fig. 7(b) reveal that CNN + RF achieved balanced classification across all dental conditions, with fillings showing the highest correct classification rate (759/894). This finding aligns with research demonstrating that hybrid approaches combining CNNs with ensemble methods can achieve improved performance in medical imaging tasks by capitalizing on both deep feature representation and robust classification strategies [29, 30]. The balanced performance across all dental conditions observed in the CNN + RF model indicates its potential for clinical deployment, where consistent accuracy across different pathological conditions is crucial for reliable diagnostic support.

Pre-trained model performance

The evaluation of fine-tuned pre-trained architectures revealed that VGG16 achieved the best performance (82.3 ± 2.0% accuracy), outperforming both Xception (80.9 ± 2.3%) and ResNet50 (79.5 ± 2.7%). The superiority of VGG16 in dental radiographic classification can be attributed to its deep stack of small (3 × 3) convolutional filters, which proves effective for capturing the fine-grained textural features characteristic of dental pathologies in panoramic images [31]. The consistent performance of VGG16 across all dental conditions, particularly for fillings detection (F1-score: 0.831 ± 0.023), demonstrates the effectiveness of transfer learning from natural images to dental radiographs despite the domain shift between general computer vision datasets and medical imaging [5]. However, these results are somewhat lower than those reported in some recent dental AI studies, where ResNet architectures achieved > 98% accuracy [32], potentially owing to differences in dataset characteristics, preprocessing methodologies, and task complexity. The competitive performance of all three pre-trained models suggests that established CNN architectures can be successfully adapted for dental diagnostic tasks, though architectural choice remains important for optimizing performance in specific dental imaging contexts.
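The transfer-learning setup described here follows the standard fine-tuning recipe. The sketch below shows one plausible configuration; the number of frozen layers, head width, and learning rate are our assumptions rather than the study's reported settings.

```python
from tensorflow import keras
from tensorflow.keras.applications import VGG16

# ImageNet-pretrained convolutional base, without the original classifier.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3), pooling="avg")
for layer in base.layers[:-4]:  # freeze all but the last convolutional block
    layer.trainable = False

model = keras.Sequential([
    base,
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(4, activation="softmax"),  # the four dental conditions
])
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```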

Computational efficiency

Models were trained on a consumer-grade GPU (NVIDIA RTX 3050 Ti with 4 GB VRAM), demonstrating the feasibility of developing high-performance dental AI models without expensive computational infrastructure. Among the pre-trained architectures, Xception completed training in 35.41 min, ResNet50 in 37.92 min, and VGG16 in 38.36 min, with VGG16 requiring the longest time owing to its larger parameter count. The custom CNN model was comparatively more efficient, completing training in 32.63 min. Notably, the hybrid models combining CNN feature extraction with traditional machine learning classifiers trained in approximately one-third of the time required for the full CNN architectures, with CNN-DT, CNN-RF, and CNN-SVM completing training in 10.47, 12.22, and 14.31 min, respectively. This substantial reduction, achieved by classifying pre-extracted CNN features rather than retraining the network, makes hybrid approaches particularly attractive for clinical settings where rapid model development and iterative refinement are desirable. These results emphasize the favorable trade-off between accuracy and computational cost that hybrid models offer for dental disease classification from X-ray images.

Study limitations

Despite the methodological contributions, this study presents several limitations that warrant consideration.

Data-related limitations include the following: The dataset size of 1,512 panoramic images with 894 balanced samples per class after downsampling may limit the generalizability of the models to broader clinical populations and diverse imaging conditions. The reliance on a single publicly available dataset without external validation on independent datasets from different clinical centers, imaging equipment, or patient demographics constrains assessment of model robustness across diverse real-world scenarios. The artificial balancing achieved through random downsampling, while enabling fair model comparison, does not reflect natural clinical prevalence where fillings are substantially more common than impacted teeth. Consequently, the reported accuracy and F1-scores should not be interpreted as direct indicators of real-world diagnostic performance, where class imbalance would affect precision-recall trade-offs differently. The model performance on the original imbalanced dataset with appropriate loss weighting remains to be evaluated to better estimate clinical utility.
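The random downsampling step itself is straightforward. A minimal sketch follows, assuming the annotations are held in a pandas DataFrame with a hypothetical `label` column; the value of 894 matches the balanced class size reported above.

```python
import pandas as pd

def downsample_balanced(df: pd.DataFrame, n_per_class: int = 894,
                        seed: int = 0) -> pd.DataFrame:
    # Sample without replacement within each class so that every condition
    # contributes exactly n_per_class regions to the balanced dataset.
    return df.groupby("label", group_keys=False).sample(n=n_per_class,
                                                        random_state=seed)

# balanced = downsample_balanced(annotations_df)  # 894 rows per condition
```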

Methodological limitations encompass several factors: The 224 × 224 pixel resolution after preprocessing may result in loss of fine anatomical details that could be diagnostically relevant for distinguishing morphologically similar conditions, as evidenced by the misclassification patterns between cavity-implant and cavity-impacted tooth categories observed in the confusion matrices. The study focused solely on panoramic radiographs, which may not capture the full spectrum of dental pathologies visible in other imaging modalities such as bitewing or periapical X-rays. The use of manually annotated bounding boxes for region extraction means that the models perform classification of pre-segmented regions rather than automatic end-to-end detection and classification, which would be required for fully autonomous clinical deployment.

Future research should incorporate larger, multi-center datasets with diverse imaging protocols to enhance model robustness and clinical applicability. Advanced preprocessing techniques preserving higher resolution details, multi-modal imaging integration combining panoramic, bitewing, periapical, and cone-beam computed tomography (CBCT) images, and the development of explainable AI frameworks to provide clinicians with interpretable diagnostic reasoning could improve clinical adoption. Additionally, implementation of object-detection architectures such as YOLO (You Only Look Once) or Faster R-CNN could enable joint localization and classification in a single end-to-end framework, eliminating the need for manual region annotation and facilitating fully automated diagnostic workflows. Furthermore, prospective clinical validation studies comparing AI-assisted diagnosis with traditional expert evaluation in real-world settings are essential to establish the true clinical utility and cost-effectiveness of these automated dental classification systems. Evaluation on the original imbalanced dataset with class-weighted loss functions or focal loss, along with reporting both macro-averaged and micro-averaged metrics, would provide more realistic estimates of clinical performance.
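As a concrete starting point for the class-weighted evaluation proposed above, inverse-frequency weights can be derived with scikit-learn and passed to Keras. The class proportions below are illustrative, not the dataset's actual prevalence.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced labels (NOT the study's actual class counts).
rng = np.random.default_rng(0)
y_train = rng.choice(4, size=2000, p=[0.5, 0.2, 0.2, 0.1])

classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)  # rarer classes receive proportionally larger weights

# With a compiled Keras model, the weights plug in directly:
# model.fit(X_train, y_train, epochs=30, class_weight=class_weight)
```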

Clinical applications and implications

The clinical applications of AI-based dental radiographic analysis systems present promising opportunities for enhancing diagnostic workflows in modern dental practice. These automated systems can serve as valuable adjunct tools to support clinicians in detecting dental pathologies including fillings, cavities, implants, and impacted teeth in panoramic radiographs, particularly in busy clinical environments where time constraints may limit thorough examination of all radiographic details. For example, in high-volume dental clinics or screening programs, AI systems could provide automated pre-screening of panoramic radiographs to flag suspicious teeth requiring detailed examination by dental professionals, thereby optimizing workflow efficiency and reducing the risk of overlooking pathological conditions in large patient populations. Such pre-screening applications could prioritize cases for expert review based on AI-predicted pathology probability, enabling more efficient allocation of clinical expertise.

The integration of such AI-driven diagnostic aids into existing picture archiving and communication systems (PACS) could facilitate standardized screening protocols and reduce inter-observer variability in radiographic interpretation. However, the inherent limitations in distinguishing morphologically similar dental conditions necessitate that these AI systems function as supportive rather than replacement tools for clinical expertise. The computational efficiency of hybrid machine learning approaches makes them feasible for deployment in diverse clinical settings, including general practice offices and dental schools where they could serve educational purposes for training dental students and residents.

The hybrid CNN + RF model, with its superior balanced performance across all dental conditions (macro F1-score: 0.843 ± 0.028) and excellent discriminative capability demonstrated by ROC analysis (macro-averaged AUC: 0.92), shows particular promise for clinical implementation. The consistent performance across fillings (AUC: 0.94), impacted teeth (AUC: 0.92), cavities (AUC: 0.91), and implants (AUC: 0.90) indicates reliable diagnostic support across diverse pathological conditions. The computational efficiency of hybrid models, requiring only one-third of the training time of full CNN architectures while maintaining superior accuracy, makes them practical for clinical deployment where rapid model updates or site-specific fine-tuning may be necessary.
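For reference, the macro-averaged one-vs-rest AUC quoted here corresponds to the following scikit-learn computation; synthetic scores stand in for the model's per-class probabilities so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=300)          # stand-in true labels
y_score = rng.dirichlet(np.ones(4), size=300)  # stand-in probabilities

# One-vs-rest ROC AUC per condition, averaged with equal class weight.
macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
print(f"Macro-averaged AUC: {macro_auc:.2f}")
```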

While these AI-assisted diagnostic tools show potential for improving diagnostic consistency and reducing oversight of pathological conditions, their clinical implementation requires careful validation through prospective studies comparing AI-assisted interpretations with expert radiologist assessments to establish appropriate clinical workflows and determine optimal human-AI collaborative frameworks for reliable dental diagnosis. Specific applications in clinical practice could include: (1) automated quality assurance systems that verify completeness of diagnostic reports by flagging potentially missed pathologies, (2) decision support systems that provide second-opinion recommendations for ambiguous cases, (3) standardized training tools for dental students that provide immediate feedback on radiographic interpretation accuracy, and (4) population-level screening programs in underserved areas with limited access to specialist dental radiologists. However, all such applications must maintain human clinical oversight as the final arbiter of diagnostic and treatment decisions.

Conclusion

This study presented a systematic comparison of deep learning architectures for automated dental condition classification in panoramic radiographs, evaluating custom CNN models, hybrid CNN-machine learning approaches, and fine-tuned pre-trained networks for simultaneous detection of fillings, cavities, implants, and impacted teeth. The CNN + Random Forest hybrid model achieved the highest performance with 85.4 ± 2.3% accuracy, an 11-percentage-point improvement over the custom CNN baseline (74.29%), while VGG16 emerged as the best-performing pre-trained architecture with 82.3 ± 2.0% accuracy; together, these results demonstrate that combining deep feature extraction with ensemble classifiers provides superior discriminative capability compared to end-to-end neural networks for multi-class dental pathology detection. These findings have important clinical implications: the computational efficiency and balanced performance of hybrid approaches across all dental conditions make them promising candidates for integration into clinical workflows as diagnostic support tools, particularly for automated pre-screening in high-volume dental practices and for educational applications in dental training programs. Future research should focus on validation with larger multi-center datasets, implementation of object-detection architectures for joint localization and classification, integration of multiple imaging modalities, and prospective clinical trials to establish optimal human-AI collaborative frameworks that position these systems as adjunct tools supporting rather than replacing clinical expertise.

Acknowledgements

We extend our gratitude to the Roboflow Universe platform (https://universe.roboflow.com/yolo-cthel/disease-xaijn) for providing publicly available datasets that significantly supported our research.

Authors’ contributions

A.G. contributed to the conceptualization, methodology design, model development, implementation, analysis, and project administration. B.A. contributed to the study design, clinical validation, supervision, interpretation of the findings, and preparation of the manuscript. K.K. contributed to implementation and project administration as well as data preprocessing and cleaning, literature review, and assistance in manuscript preparation and submission. S.R.B. contributed to data analysis, performance evaluation, and preparation of visualizations.

Funding

This study did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability

The data is publicly accessible at the following link: https://universe.roboflow.com/yolo-cthel/disease-xaijn.

Declarations

Ethics approval and consent to participate

This study was conducted in accordance with the principles outlined in the Declaration of Helsinki. Ethical approval was obtained from the Ethics Committee of Tabriz University of Medical Sciences, Tabriz, Iran, and the Oral and Maxillofacial Radiology department. Written informed consent was waived by the ethics committee as the study did not involve direct patient participation or the collection of personal data.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Peres MA, et al. Oral diseases: a global public health challenge. Lancet. 2019;394(10194):249–60. [DOI] [PubMed] [Google Scholar]
  • 2.World Health Organization. Global oral health status report: towards universal health coverage for oral health by 2030. World Health Organization; 2022.
  • 3.Righolt A, et al. Global-, regional-, and country-level economic impacts of dental diseases in 2015. J Dent Res. 2018;97(5):501–7. [DOI] [PubMed] [Google Scholar]
  • 4.Schwendicke F, et al. Convolutional neural networks for dental image diagnostics: A scoping review. J Dent. 2019;91:103226. [DOI] [PubMed] [Google Scholar]
  • 5.Lee J-H, et al. Detection and diagnosis of dental caries using a deep learning-based convolutional neural network algorithm. J Dent. 2018;77:106–11. [DOI] [PubMed] [Google Scholar]
  • 6. Daungsupawong H, Wiwanitkit V. ChatGPT, consistency and accuracy of endodontic question. Int Endod J. 2024;57(3):378-79. 10.1111/iej.13997. Epub 2023 Nov 7. [DOI] [PubMed]
  • 7.Sukegawa S, et al. Deep neural networks for dental implant system classification. Biomolecules. 2020;10(7):984. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Park J-H, et al. Deep learning and clustering approaches for dental implant size classification based on periapical radiographs. Sci Rep. 2023;13(1):16856. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Sivari E, et al. Deep learning in diagnosis of dental anomalies and diseases: a systematic review. Diagnostics. 2023;13(15):2512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Mertens S, et al. Artificial intelligence for caries detection: randomized trial. J Dent. 2021;115:103849. [DOI] [PubMed] [Google Scholar]
  • 11.Brownrigg DR. The weighted median filter. Commun ACM. 1984;27(8):807–18. [Google Scholar]
  • 12.Saalfeld S. CLAHE (Contrast Limited Adaptive Histogram Equalization). 2009.
  • 13.Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97. [Google Scholar]
  • 14.Breiman L. Random forests. Mach Learn. 2001;45:5–32. [Google Scholar]
  • 15.Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • 16.He K, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016.
  • 17.Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.
  • 18.Lee J-H, et al. A performance comparison between automated deep learning and dental professionals in classification of dental implant systems from dental imaging: a multi-center study. Diagnostics. 2020;10(11):910. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Vasdev D, et al. Periapical dental X-ray image classification using deep neural networks. Annals of Operations Research; 2022. [DOI] [PMC free article] [PubMed]
  • 20.Hasnain MA, et al. Deep learning architectures in dental diagnostics: a systematic comparison of techniques for accurate prediction of dental disease through x-ray imaging. Int J Intell Comput Cybernetics. 2024;17(1):161–80. [Google Scholar]
  • 21.Muramatsu C, et al. Tooth detection and classification on panoramic radiographs for automatic dental chart filing: improved classification by multi-sized input data. Oral Radiol. 2021;37:13–9. [DOI] [PubMed] [Google Scholar]
  • 22.Zhang K, et al. An effective teeth recognition method using label tree with cascade network structure. Comput Med Imaging Graph. 2018;68:61–70. [DOI] [PubMed] [Google Scholar]
  • 23.Park W, et al. Identification of dental implant systems using a large-scale multicenter data set. J Dent Res. 2023;102(7):727–33. [DOI] [PubMed] [Google Scholar]
  • 24.Toledo Reyes L, et al. Early childhood predictors for dental caries: a machine learning approach. J Dent Res. 2023;102(9):999–1006. [DOI] [PubMed] [Google Scholar]
  • 25.Schwendicke F, et al. Cost-effectiveness of AI for caries detection: randomized trial. J Dent. 2022;119:104080. [DOI] [PubMed] [Google Scholar]
  • 26.Tentaş S, Özden S. Deep learning based evaluation of skeletal maturation: A comparative analysis of five Hand-Wrist methods. Orthodontics & Craniofacial Research; 2025. [DOI] [PMC free article] [PubMed]
  • 27.Tuzoff DV, et al. Tooth detection and numbering in panoramic radiographs using convolutional neural networks. Dentomaxillofacial Radiol. 2019;48(4):20180051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Schwendicke F, et al. Deep learning for caries lesion detection in near-infrared light transillumination images: A pilot study. J Dent. 2020;92:103260. [DOI] [PubMed] [Google Scholar]
  • 29.Hung K, et al. The use and performance of artificial intelligence applications in dental and maxillofacial radiology: A systematic review. Dentomaxillofacial Radiol. 2020;49(1):20190107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Mavaie P, Holder L, Skinner MK. Hybrid deep learning approach to improve classification of low-volume high-dimensional data. BMC Bioinformatics. 2023;24(1):419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Almalki YE, et al. Deep learning models for classification of dental diseases using orthopantomography X-ray OPG images. Sensors. 2022;22(19):7370. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Cejudo JE, et al. Classification of dental radiographs using deep learning. J Clin Med. 2021;10(7):1496. [DOI] [PMC free article] [PubMed] [Google Scholar]
