2025 Jun 22. Online ahead of print. doi: 10.1159/000546448

Advancements in Caries Diagnostics Using Bitewing Radiography: A Systematic Review of Deep Learning Approaches

Kristof Sebastian Hansson Horvath 1, Nils Roar Gjerdet 1, Xie-Qi Shi 1
PMCID: PMC12331221  PMID: 40544827

Abstract

Introduction

Deep learning techniques have emerged as promising tools for enhancing the radiographic diagnosis of caries, particularly when utilizing bitewing radiographs.

Methods

Following the PRISMA guidelines, a systematic review was conducted to assess the use of deep learning for caries diagnosis in bitewing radiographs. Literature searches were performed across Web of Science and PubMed databases for studies published before March 2025 that utilized deep learning for caries detection, segmentation, and classification using bitewing radiographs. Data extraction focused on model architectures, dataset characteristics, annotation processes, diagnostic performance metrics, and potential biases, as assessed by the QUADAS-2.

Results

Twenty-three studies met the inclusion criteria, encompassing caries detection, segmentation, and severity classification. The most frequently applied deep learning models were classification models, such as ResNet and detection models, such as YOLO architectures. Dataset sizes varied widely, ranging from 112 to 8,539 images. Most studies reported high diagnostic performance, with accuracies ranging from 70% to 99%. Some AI models outperformed or matched the performance of human experts, particularly in detecting advanced carious lesions. However, considerable variability was observed in model architectures, dataset characteristics, the applied diagnostic performance metrics, and reporting standards. The risk of bias assessment revealed concerns in patient selection, index test interpretation, and reference standards, with all studies rated as having a high risk of bias in at least one domain.

Conclusion

The review identified challenges in currently developed deep learning models regarding methodological heterogeneity, lack of standardization, limited dataset diversity, insufficient clinical validation, and concerns about bias and data transparency. Nevertheless, all studies concluded that deep learning models are promising as an assistive diagnostic tool in caries diagnostics using bitewing radiography.

Keywords: Caries detection, Radiography, Deep learning, Systematic review

Plain Language Summary

This systematic review examines the use of deep learning techniques for detecting, segmenting, and classifying dental caries in bitewing radiographs. The review included 23 studies published before March 2025, analysing the performance of various deep learning models. The most commonly used models were ResNet and YOLO, with dataset sizes ranging from 112 to 8,539 images. Most studies showed high diagnostic accuracy, ranging from 70% to over 99%, and some AI models even outperformed or matched human experts, especially in detecting advanced caries. However, there were significant differences between the models, datasets, and the applied performance metrics. Many studies had potential biases, particularly in patient selection and test interpretation. Despite these challenges, the review concluded that deep learning models hold great potential as supportive tools for caries diagnosis using bitewing radiographs, though further standardization and clinical validation are needed.

Introduction

Early detection and staging of caries are crucial for effective treatment and prevention of their progression [1]. Conventional methods for caries diagnosis include visual-tactile examinations and radiographic assessments using the bitewing technique [2]. However, radiographic diagnosis of caries presents several challenges. Variations in image quality, overlapping anatomical structures, and the subtle appearance of early-stage lesions can make accurate diagnosis difficult, even for experienced clinicians [3]. Moreover, radiographic interpretation can be time-consuming and prone to subjectivity, leading to potential inconsistencies in diagnosis [3].

To address these challenges, integrating deep learning, a recent subset of artificial intelligence, in dental imaging has emerged as a promising approach to assist clinicians and enhance the performance of radiographic caries diagnostics, ultimately improving treatment outcomes [4]. Deep learning can be applied in radiographic caries diagnostics through three key tasks: detection, segmentation, and classification. Detection focuses on identifying the presence and location of caries, with the AI typically marking the affected area using a bounding box or marker around the detected region. Segmentation takes this a step further by locating the lesion and precisely outlining its exact boundaries and shape. Classification models categorize the lesions into sub-classes by displaying textual information and color-coded labels directly on the radiograph.
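The outputs of the three task types can be contrasted in a minimal sketch. This is an illustrative example only (the data shapes and values below are assumptions, not drawn from any reviewed study): detection yields bounding boxes, segmentation yields a per-pixel mask, and classification yields a categorical label.

```python
# Illustrative sketch (assumed, not from any reviewed study) of the three
# deep learning output types for a bitewing radiograph.
import numpy as np

# Detection: each finding is a bounding box (x, y, width, height) with a confidence score.
detection_output = [{"bbox": (120, 45, 30, 22), "score": 0.91, "label": "caries"}]

# Segmentation: a per-pixel mask the same size as the radiograph,
# where 1 marks pixels belonging to the lesion.
image_height, image_width = 600, 800
segmentation_mask = np.zeros((image_height, image_width), dtype=np.uint8)
segmentation_mask[45:67, 120:150] = 1  # lesion region outlined pixel by pixel

# Classification: a categorical sub-class label, optionally colour-coded for display.
classification_output = {"label": "dentine caries", "colour": "red"}

print(detection_output[0]["bbox"], int(segmentation_mask.sum()), classification_output["label"])
```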

In recent years, an increasing number of studies have investigated the application of deep learning in caries prediction and detection, utilizing visual inspection, clinical photographs, and radiographs [5]. To our knowledge, one recently published meta-analysis has explicitly examined the diagnostic performance of deep learning models for caries detection using bitewing radiographs, while another focused solely on their performance in diagnosing approximal caries [6, 7]. Beyond diagnostic performance, the present review systematically examines the deep learning models developed for all types of caries diagnostics using bitewing radiography, considering model architectures, dataset characteristics, annotation processes, and bias assessment. Addressing this gap is crucial for understanding the current state of research, identifying its limitations, and informing future studies.

Methods

This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [8], and the PRISMA 2020 checklist was followed (online suppl. Table S1; for all online suppl. material, see https://doi.org/10.1159/000546448). The search strategy was reviewed by both K.S.H.H and X.Q.S for accuracy using the Peer Review of Electronic Search Strategies (PRESS) criteria [9].

Literature Search Strategy

A comprehensive literature search was conducted to identify relevant articles on the application of deep learning algorithms for caries detection using bitewing radiographs. The databases searched included PubMed and Web of Science. The search was limited to articles published in English between January 2019 and March 2025. Initial searches were conducted in February 2023, with updates in March and August 2023, May 2024, and March 2025. All references were managed using EndNote 21 software (Clarivate Analytics). Detailed search queries and the number of results for each database are provided in online supplementary Tables S2 and S3.

Inclusion Criteria

Studies were included if they were peer-reviewed original research articles published in English between January 2019 and March 2025, utilized deep learning algorithms for caries detection in bitewing radiographs, and reported diagnostic performance metrics.

Exclusion Criteria

Studies were excluded if they did not report sample size; used “self-trained” models without the involvement of dentists’ diagnoses; included fewer than 100 images; or used other radiographic modalities (e.g., panoramic radiography, CBCT).

Study Selection

A total of 1,101 articles were obtained from the database searches. After the removal of duplicates, 425 unique records remained. These were uploaded to the Rayyan Web application for systematic reviews to streamline the review process [10]. The titles and abstracts were screened independently by two reviewers (K.S.H.H and X.Q.S) to identify studies that met the inclusion criteria. Full-text articles were obtained for the studies that appeared relevant or when eligibility was uncertain. Any discrepancies between the reviewers were resolved through discussion or consultation with a third reviewer (N.R.G). The search strategy is presented in the PRISMA flowchart (Fig. 1).

Fig. 1.


PRISMA flowchart of the article selection and inclusion process.

Data Extraction

Data were extracted independently by K.S.H.H and X.Q.S using a standardized data extraction form developed for this study. Any uncertainty during data extraction was resolved through discussion or consultation with the co-authors. The extracted data included study characteristics (authors, publication year, country); dataset characteristics (number of images, augmentation techniques, dataset allocation, and lesion prevalence in training data); deep learning models (model architecture, model task, training parameters, and training time); annotation details (number and clinical experience of annotators, annotation tasks performed); data availability (whether the dataset is publicly available or accessible upon request); diagnostic performance metrics of deep learning models and any comparisons with human performance. Dentists are familiar with diagnostic metrics such as accuracy, sensitivity, and specificity, which align well with clinical diagnostic needs. However, when developing deep learning models for medical imaging, metrics such as F1 score, recall (also known as sensitivity), precision (positive predictive value), average precision (AP), and intersection over union (IoU) are often used. Table 1 illustrates the commonly used performance metrics, their formulas, and what they measure.

Table 1.

Commonly used performance metrics, their formulas, and their corresponding interpretations in imaging diagnostics

Metric Formula What it measures
Accuracy (TP + TN) / (TP + TN + FP + FN) Proportion of correctly classified instances
F1 score 2 × (Precision × Recall) / (Precision + Recall) Balance of precision and recall
Intersection over union Area of overlap / Area of union Overlap between predicted and ground truth regions
Sensitivity (recall) TP / (TP + FN) Proportion of actual positives correctly identified
Specificity TN / (TN + FP) Proportion of actual negatives correctly identified
Precision (positive predictive value) TP / (TP + FP) Proportion of predicted positives that are actually positive
Negative predictive value TN / (TN + FN) Proportion of predicted negatives that are actually negative
False-positive rate FP / (FP + TN) Proportion of actual negatives incorrectly identified as positives
False-negative rate FN / (FN + TP) Proportion of actual positives incorrectly identified as negatives
Average precision Area under the precision-recall curve Model's precision at all points where recall increases
Area under the curve Area under the receiver operating characteristic curve Model’s ability to distinguish between classes across all classification thresholds
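The count-based metrics in Table 1 can be computed directly from a confusion matrix, and IoU from two rectangles. The sketch below is illustrative only; the function names and example counts are assumptions, not drawn from any reviewed study.

```python
# Illustrative computation of the Table 1 metrics from confusion-matrix counts
# (TP, FP, TN, FN) and of IoU from two rectangular bounding boxes.

def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "npv": tn / (tn + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

m = metrics(tp=80, fp=20, tn=90, fn=10)
print(round(m["accuracy"], 2), round(m["f1"], 2))         # 0.85 0.84
print(round(box_iou((0, 0, 10, 10), (5, 0, 15, 10)), 2))  # 0.33
```

Note that a detection is typically counted as a true positive only when its IoU with the ground truth annotation exceeds a threshold (e.g., the IoU 0.5 used by several included studies), which is why AP values are reported "at IoU 0.5".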

Data Evaluation

Descriptive Analyses on Dataset Characteristics, Deep Learning Models, Annotation Details, Reported Performance Metrics, and Risk of Bias Assessment

The quality and risk of bias of the included studies were assessed independently by K.S.H.H using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [11]. The risk of bias was assessed across four domains for each study: patient selection, index test, reference standard, and flow and timing. In addition, applicability concerns were assessed for the first three domains. Any uncertainty during bias assessment was resolved through discussion or consultation with co-authors.

Results

Twenty-eight unique records were identified for full-text review [12–39], of which 23 were deemed suitable for inclusion [12–34]. Two studies were excluded because they used periapical radiographs [36, 37]; one did not use a gold standard or validate against dentists, making it unclear how its results were obtained [38]. The remaining two focused on commercially available deep learning products or Web-based services [35, 39].

The extracted data from the 23 studies are summarized in Table 2a and b. Table 2a provides clinical details regarding the origin of the radiographs, patient demographics, radiographic acquisition systems, and exposure parameters. Table 2b presents information on the applied deep learning architectures, including annotator characteristics (number and clinical experience), prevalence and types of caries, and diagnostic performance metrics of the AI models.

Table 2.

Summary of the included studies and their characteristics

Authors Bitewing origin Patient demographic Intraoral X-ray unit Image receptor Exposure parameters
a Included studies on the origin of the retrieved bitewing radiographs, patient demographics, radiographic system, image receptor used, and exposure parameters
Ayhan (2024) [14] Kırıkkale University Age: >18 Gendex Expert DC (Gendex Dental Systems, ILB, USA) 65 kVp, 7 mA
Faculty of Dentistry
Turkey
Karakuş (2024) [26] Necmettin Erbakan University, Faculty of Dentistry, Turkey Morita Veraview X Type R Intraoral Digital Imaging Device (J. Morita Corp., Kyoto, Japan) Phosphor plates: Soredex Digora Optime image scanner system from Finland 70 kVp, 7 mA, 0.25 s
de Frutos (2024) [23] Six Norwegian institutes and research centres, Norway 56% female 44% male
Age: 19–94
Azhari (2023) [15] King Abdulaziz University Dental Hospital in Jeddah and dental practitioner, Saudi Arabia
Ahmed (2023) [12] King Abdulaziz University Dental Hospital in Jeddah, Saudi Arabia
Panyarak (2022) [31] Chiang Mai University, Faculty of Dentistry, Thailand Permanent teeth Heliodent Plus (Dentsply Sirona, Charlotte, NC, USA) Carestream 7600 Phosphor Plate (Carestream, Rochester, NY, USA) and VistaScan Imaging Plates Plus (DÜRR DENTAL, Bietigheim-Bissingen, Germany)
Kunt (2023) [27] A hospital information system (not specified) Adult Three direct radiography and one indirect radiography (manufacture not specified)
Vimalarani (2022) [34] A faculty (not specified)
Mao (2021) [29]
Moran (2021) [30] Sirona Heliodent Plus oral X-ray unit (Kavo Brasil Focus) The EXPRESS™ Origo imaging plate (KaVo Dental, Biberach an der Riss, Germany) 70 kVp, 7 mA, 0.25–0.64 s
Bayrakdar (2022) [18] Ordu University, School of Dentistry, Turkey Age: >18 Kodak CS 2200 (Carestream Dental, Atlanta, GA) 2 different phosphor plate systems: Carestream CS 7600(Carestream Dental, Atlanta, GA) and Kodak CR (Carestream Dental, Atlanta, GA) 50 kVp, 5 mA, 0.1 s
Forouzeshfar (2024) [25] Samin Oral and Maxillofacial Radiology Center in Tehran, Iran Planmeca periapical device, Helsinki, Finland 60 kVp, 8 mA, 0.128 s
Cantu (2020) [20] Dental clinic at Charité-Universitätsmedizin, Berlin, Germany 52% male Orthophos XG, Dentsply Sirona (Bensheim, Germany), and Dürr Dental (Bietigheim-Bissingen, Germany)
48% female
Mean age: 36.5
Age: 14–89
Chaves (2024) [21] Dutch dental practice-based research network, The Netherlands
Chen (2022) [22] Peking University School and Hospital of Stomatology in Beijing, China Gendex Expert DC system (Gendex Dental Systems, Hatfield, PA, USA) Storage phosphor plate (manufacture not specified) 65 kVp, 7 mA, 0.125–0.32 s
Bayraktar (2022) [19] Kirikkale University, Faculty of Dentistry, Turkey Permanent teeth Gendex Expert DC (Gendex Dental Systems, IL, USA) 65 kVp, 7 mA
Panyarak (2023) [32] Chiang Mai University, Faculty of Dentistry, Thailand Permanent teeth Heliodent Plus (Dentsply Sirona, Charlotte, NC, USA) Carestream 7600 Phosphor Plate (Carestream, Rochester, NY, USA) and VistaScan Imaging Plates Plus (DÜRR DENTAL, Bietigheim-Bissingen, Germany)
Estai (2022) [24] Dental clinics in Perth, Australia Age: 18–60 Photostimulable phosphor plate (not specified)
Lee (2021) [28] Yonsei University, Conservative Dentistry, Korea Permanent teeth Kodak RVG 6200 digital radiography system with CS 2200 (Carestream, Rochester, NY, USA)
Baydar (2023) [17] Eski¸sehir Osmangazi University, Faculty of Dentistry, Turkey ProX periapical X-ray unit (Planmeca, Helsinki, Finland) ProScanner Phosphor Plates and Scanning System (Planmeca, Helsinki, Finland) 60 kVp, 2 mA, 0.05 s
van Nistelrooij (2024) [33] Dutch dental practice-based research network, The Netherlands 15–88 years
202 females, 179 males, and 2 unknown
Ayan (2024) [13] Kirikkale University, Faculty of Dentistry, Turkey Not considered Gendex Expert DC (Gendex Dental Systems, IL, USA) 65 kVp, 7 mA
Bayati (2025) [16] Tehran University of Medical Science, Faculty of Dentistry, Iran Permanent teeth Owandy dental X-ray machine Photostimulable phosphor plates, and a Sordex Digora scanner 60 kVp, 6 mA, 0.32 s
Authors AI models Images, n, training/total Annotators, n/ experience Lesion prevalence Lesion type Diagnostic performance of AI models
b Included studies on model architecture, dataset characteristics, annotation details, lesion prevalence in the training model, and diagnostic performance metrics of the AI models
Ayhan (2024) [14] YOLOv7-AP-CBAM 1,000/1,170 2/>4 years 0.24 Interproximal F1: 0.82, recall: 0.83, precision: 0.87
Karakuş (2024) [26] YOLOv8 −/860 2/>3 years Overall caries F1: 0.95, recall: 0.93, precision: 0.98
Interproximal
D1 F1: 0.96
D2 F1: 0.96
D3 F1: 0.93
Occlusal F1: 0.95
Secondary caries F1: 0.96
de Frutos (2024) [23] YOLOv5 RetinaNet EfficientDet (D0 and D1) 8,342/8,539 6/– Interproximal YOLOv5 performed best, except for the F1 for enamel caries (EfficientDet D1)
Enamel F1: 0.53, AP: 0.60 (at IoU 0.3), FNR: 0.15
Dentine F1: 0.59, AP: 0.62 (at IoU 0.3), FNR: 0.22
Secondary lesions F1: 0.56, AP: 0.68 (at IoU 0.3), FNR: 0.16
Azhari (2023) [15] Ensemble U-Net (ResNet50, ResNext101, VGG19) –/771 3/– Interproximal adult
Sound IoU: 0.98
Primary IoU: 0.23
Moderate IoU: 0.19
Advanced IoU: 0.51
Overall Mean F1: 0.52, mean IoU: 0.54
Interproximal Ped
Sound IoU: 0.97
Primary IoU: 0.08
Moderate IoU: 0.17
Advanced IoU: 0.25
Overall Mean F1: 0.44, mean IoU: 0.38
Ahmed (2023) [12] U-Net ensemble (ResNet50, ResNext101, Inception-ResNet-v2) 443/554 2/– Overall caries Mean F1: 0.54, mean IoU: 0.55
Modified ICDAS system
Sound IoU: 0.96
Outer 1/2 enamel IoU: 0.21
Inner 1/2 enamel IoU: 0.23
Outer 1/2 dentine IoU: 0.35
Inner 1/2 dentine IoU: 0.41
Panyarak (2022) [31] YOLOv3 994/1,425 3/>7 0.42 ICCMS™ 2 class At IoU 0.50
Overall caries Precision 0.77, recall: 0.82, F1: 0.79, AP:0.72
0.34 ICCMS™ 4 class At IoU 0.50
0.035 Initial stage Precision: 0.75, recall: 0.67, F1: 0.71, AP: 0.57
0.046 Moderate stage Precision: 0.62, recall: 0.53, F1: 0.57, AP: 0.36
Extensive stage Precision: 0.75, recall: 0.78, F1: 0.76, AP: 0.67
0.08 ICCMS™ 7 class At IoU 0.50
0.12 Outer 1/2 enamel Precision: 0.35, recall: 0.23, F1: 0.27, AP: 0.10
0.13 Inner 1/2 enamel Precision: 0.48, recall: 0.63, F1: 0.55, AP: 0.38
0.035 Outer 1/3 dentine Precision: 0.55, recall: 0.55, F1: 0.55, AP: 0.37
0.026 Mid 1/3 dentine Precision: 0.48, recall: 0.56, F1: 0.52, AP: 0.36
0.02 Inner 1/3 dentine Precision: 0.52, recall: 0.61, F1: 0.56, AP:0.39
Pulp Precision: 0.77, recall: 0.61, F1: 0.68, AP:0.52
Kunt (2023) [27] YOLOv5 RetinaNet 2,992/3,989 1/5 Overall caries Ensemble models
Faster R-CNN EfficientDet
F1: 0.79–0.80, Precision: 0.81–0.83
Recall: 0.77–0.79, AP: 0.84–0.86 (at IoU 0.5)
Individual models
F1: 0.72–0.76, Precision: 0.71–0.77
Recall: 0.71–0.79, AP: 0.74–0.83 (at IoU 0.5)
Small lesion Individual models at IoU 0.5
AP: 0.66–0.79
Medium lesion AP: 0.78–0.85
Large lesion AP: 0.67–0.85
Vimalarani (2022) [34] DG-LeNet 800/1,000 –/– 0.16 Overall caries Accuracy: 0.99
Sensitivity: 0.91, specificity: 0.99
Premolars Accuracy: 0.99
Molars Sensitivity: 0.93, specificity: 0.99
Accuracy: 0.99
Sensitivity: 0.92, specificity: 0.99
Mao (2021) [29] AlexNet 195/278 3/>3 0.023 Overall caries Accuracy: 0.90 (AlexNet)
(GoogleNet Accuracy: 0.87 (Google Net)
Vgg19 Accuracy: 0.80 (Vgg19)
ResNet50) Accuracy: 0.83 (Resnet50)
Moran (2021) [30] Inception –/112 1/– 0.24 Interproximal The inception model with a 0.001 learning rate performed best
ResNet Incipient Precision: 0.72, recall: 0.87
Specificity: 0.83, NPV: 0.93, AUC: 0.86
0.13 Interproximal Precision: 0.69, recall: 0.733
Advanced Specificity: 0.83, NPV: 0.86, AUC: 0.81
Bayrakdar (2022) [18] VGG-16 –/621 2/>9 Overall caries VGG-16 (detection)
Sensitivity: 0.84, precision: 0.84, F1: 0.84
U-Net (segmentation)
Sensitivity: 0.81, precision: 0.86, F1: 0.84
Overall caries VGG-16 (detection)
External dataset Sensitivity: 0.77, precision: 0.78, F1: 0.78
U-Net (segmentation)
Sensitivity: 0.76, precision: 0.87, F1: 0.81
Forouzeshfar (2024) [25] VGG16, VGG19, AlexNet, ResNet50 –/1,517 2/– Overall caries VGG19 performed best
Accuracy: 0.91–0.94
F1: 0.90–0.93, precision: 0.90–0.93
Sensitivity: 0.91–0.95, specificity: 0.89–0.96
Cantu (2020) [20] U-Net with EfficientNet-B5 encoder 3,293/3,686 4/>3 years Overall caries Accuracy: 0.80
F1: 0.73
Sensitivity: ≥0.70
Chaves (2024) [21] Mask-RCNN with Swin Transformer backbone 340/425 2/2+3* Interproximal Interproximal
Primary caries Sensitivity: 0.74, precision: 0.68
F1: 0.69; AUC: 0.81
Secondary caries Sensitivity: 0.70, precision: 0.75
F1: 0.72; AUC: 0.80
Chen (2022) [22] Faster R-CNN 818/978 3/>5 years 0.25 Interproximal Accuracy: 0.87, F1: 0.74, precision: 0.77
All caries Sensitivity: 0.72, specificity: 0.93
E1, E2 Sensitivity: 0.65, precision: 0.69
D1 Sensitivity: 0.69, precision: 0.71
D2 and D3 Sensitivity: 0.85, precision: 0.92
Bayraktar (2022) [19] Modified YOLOv3 800/1,000 2/>10 years 0.16 Interproximal caries Accuracy: 0.95, AUC: 0.87
Sensitivity: 0.72, specificity: 0.98
Precision: 0.87, NPV: 0.97
Panyarak (2023) [32] ResNet-18 ResNet-50 2,333/2,758 3/>7 0.32–0.34 Overall caries ResNet-50 performed best
ResNet-101 0.035–0.036 ICCMS™ 4 class Accuracy: 0.70–0.75, AUC: 0.88–0.91
ResNet-152 0.046–0.047 Initial stage Sensitivity: 0.78–0.83
Moderate stage Specificity: 0.54–0.67
Extensive stage
0.08 Overall caries ResNet-152 performed best
0.12 ICCMS™ 7 class Accuracy: 0.53–0.62
0.13 Outer 1/2 enamel AUC: 0.84–0.87
0.035 Inner 1/2 enamel Sensitivity: 0.63–0.75
0.026 Outer 1/3 dentine Specificity: 0.30–0.34
0.02 Middle 1/3 dentine
Inner 1/3 dentine
Pulp
Estai (2022) [24] Faster R-CNN Inception-ResNet-v2 −/2,468 3/− 0.38 Interproximal Accuracy: 0.87, F1: 0.87, recall: 0.89, precision: 0.86, specificity: 0.86
Lee (2021) [28] U-Net −/304 2/>5 Overall caries Precision: 0.63, sensitivity: 0.65, F1: 0.64
Baydar (2023) [17] U-Net 400/500 2/>3 Overall caries Sensitivity: 0.82, precision: 0.95, F1: 0.88
Van Nistelrooij (2024) [33] Mask-RCNN with Swin Transformer backbone 330/413 2/5 0.15 Secondary caries AUC: 0.95, F1 score: 0.77, precision: 0.81
4 Sensitivity: 0.74, specificity: 0.97
Ayan (2024) [13] Modified YOLOv5 1,000/1,200 2/>10 years 0.26 Overall caries mAP: 0.63–0.65
0.06 All caries mAP: 0.52–0.54
0.20 Enamel caries mAP: 0.74–0.77
Dentine caries
Bayati (2025) [16] YOLOv8 442/552 1/– Interproximal caries Precision: 0.84, recall: 0.80, F1: 0.82, FNR: 0.20
1/>10 years All caries Precision: 0.96, recall: 0.96, F1: 0.96, FNR: 0.04
Enamel caries Precision: 0.80, recall: 0.73, F1: 0.77, FNR: 0.27
Dentine caries

A dash (“–”) indicates missing information.

A star (“*”) indicates that cases with annotator discrepancies were reviewed by additional annotators.

F1, F1 score; AUC, area under the curve; FNR, false-negative rate; NPV, negative predictive value; AP, average precision; mAP, mean average precision; IoU, intersection over union.

Dataset Allocation

The total number of images used across the studies varied widely, ranging from 112 in Moran et al. [30] to 8,539 in de Frutos et al. [23], with a mean of 1,526 images and a median of 978. Not all initially collected images were used for model training, as some studies excluded poor-quality images during data preparation, while others did not.

There was variation in the distribution of bitewing images used for training, validation, and testing across the included studies (Fig. 2). Six studies [13, 15, 24, 25, 28, 30] did not specify their data splits or provided unclear descriptions and were therefore excluded from this figure. Two of the 18 included studies reported data allocation based on the image count after augmentation [18, 26]. Most studies allocated a large portion of their data to training, with smaller subsets reserved for validation and testing. On average, 81% of images were used for training (range 70–98%). Among the studies that specified a validation set [12, 16–18, 20, 21, 26, 27, 31–33], an average of 11% of the images was allocated for validation (range 7–18%). For testing, the mean was 9% (range 2–14%).

Fig. 2.


Distribution of bitewing images used for training, validation, and testing in the studies that provided data allocation information. Two studies provided data split information for the augmented data, which is indicated with a “*.”
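An 80/10/10 allocation of this kind can be sketched as follows. This is an assumed, minimal illustration (the function name and patient IDs are hypothetical): splitting at the patient level, so that all bitewings from one patient land in the same subset, avoids the data-leakage risk noted later in the bias assessment.

```python
# Minimal sketch (assumed, not from any reviewed study) of an 80/10/10
# train/validation/test split performed at the patient level.
import random

def split_by_patient(patient_ids, train=0.8, val=0.1, seed=42):
    """Shuffle unique patient IDs reproducibly, then cut into three disjoint sets."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

train_ids, val_ids, test_ids = split_by_patient(range(100))
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```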

Data Augmentation

Most studies employed data augmentation techniques – including horizontal and vertical flipping, rotation, resizing, and cropping – to artificially increase the number of training images and enhance model generalization. Three studies did not specify whether they used augmentation [14, 25, 33]. Among the eight studies that reported image counts post-augmentation [16, 18, 19, 24, 26, 29, 30, 34], the datasets expanded significantly, to between 1,506 and 11,114 images, with an average increase of 9.1 times (range 2.7–30.9) compared to the original dataset. One study [28] did not use augmentation and instead utilized 12-bit depth images in the manufacturer’s raw bitewing format, potentially preserving more detailed image information.
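Flip- and rotation-based augmentation of the kind described above can be sketched with plain NumPy. This is an assumed illustration, not the pipeline of any reviewed study; it expands one greyscale image into the eight orientations generated by 90-degree rotations and flips.

```python
# Illustrative augmentation sketch (assumed, not from any reviewed study):
# 90-degree rotations combined with horizontal flips yield all eight
# orientations (including vertical flips) of a greyscale radiograph.
import numpy as np

def augment(image):
    """Return eight orientation variants of the input image."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # plus a horizontal flip of each
    return variants

image = np.arange(12, dtype=np.uint8).reshape(3, 4)
augmented = augment(image)
print(len(augmented))  # 8
```

In practice, augmentation pipelines also apply resizing, cropping, and intensity changes, and the reported post-augmentation counts (2.7–30.9 times the originals) suggest the included studies combined several such transforms.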

Applied AI Models

Various deep learning models were employed, either individually or in combination, to automate detection, classification, and segmentation of caries. The majority were architectures primarily designed for classification tasks, such as ResNet50 and VGG19. For detection tasks, commonly used models included YOLO (you only look once) variants, such as YOLOv3 and YOLOv5, while U-Net variants were frequently used for segmentation tasks (Fig. 3).

Fig. 3.


Overview of the deep learning models used independently or in combination in the included studies, grouped by the primary automation tasks of detection, classification, and segmentation.

Training Time

The number of training epochs varied significantly among the studies, ranging from 100 epochs [29] to 5,000 epochs [19], with a mean of 488 epochs and a median of 200 epochs.

Annotator Characteristics

Subjective evaluation of bitewing radiographs was employed to establish the ground truth of caries presence and, in some studies, the depth of the lesion. One study provided an unclear description of annotation [34]. The remaining 22 studies established the ground truth for caries presence based on annotations from 1 to 6 annotators, with considerable differences in clinical experience (Table 2b). Most studies employed two or three annotators, typically experienced dentists or oral radiologists with several years of clinical practice. The annotators’ role was not specified in one study [17] and was unclear in another, in which automated segmentation via CranioCatch was applied [25].

Caries Prevalence in Image Material

Ten studies [13, 14, 19, 22, 24, 29–31, 33, 34] reported the prevalence of caries lesions in their datasets. The same image material was applied in two studies [31, 32]; thus, only one of them was included in calculating the mean and median caries prevalence. Lesion prevalence ranged from 2.3% to 41.7%, with a mean of 24.9% and a median of 24.9%.

Diagnostic Performance

The last two columns in Table 2b summarize the lesion types analysed in each study, classified by affected tooth surface and lesion depth, along with the corresponding diagnostic performance of the deep learning models. While most studies evaluated overall caries diagnostic performance [12, 13, 17, 18, 20, 25, 27–29, 31, 32, 34], some provided a more detailed classification, reporting AI performance specifically for interproximal caries detection [14–16, 21–24, 26, 30], secondary caries detection [21, 23, 26, 33], and occlusal caries detection [26]. The performance of the AI models was assessed using various metrics, as shown for each study in Table 2b. Some models outperformed or matched the performance of human dentists [18, 20, 22, 23, 27, 28, 30], particularly in detecting advanced rather than initial lesions. Only one study presented the AI’s performance on an external dataset [18].

QUADAS-2 Bias Assessment

The QUADAS-2 evaluation indicated a generally high risk of bias across the included studies (Fig. 4 and Table 3). In the patient selection domain, all studies were rated high risk due to reliance on single-institution datasets and the exclusion of poor-quality or complex images, limiting generalizability. The index test domain frequently showed high or uncertain risk, often due to a lack of reported blinding to the reference standard. The reference standard domain consistently exhibited high risk as all studies relied solely on expert annotations without external gold standard validation. The flow and timing domain was often rated as uncertain due to insufficient reporting of data handling and timing procedures. Overall, these findings suggest that biases in patient selection, index test interpretation, reference standard usage, and study transparency may limit the confidence and applicability of the results.

Fig. 4.


QUADAS-2 summary plot.

Table 3.

QUADAS-2 risk of bias assessment

Authors Domain Risk of bias Applicability concern Narrative summary
Ayhan (2024) [14] Patient selection High High This study used 1,170 anonymized bitewing radiographs. However, radiographs with technical and positioning errors, or those with complex prosthetics, were excluded, which introduces a high risk of selection bias by not representing real-world variability and poses an applicability concern
Index test Unclear Low YOLOv7 was employed for tooth numbering and caries detection, achieving high accuracy. However, the study does not mention if blinding was used during test interpretation, raising concerns about observer bias
Reference standard High Low Annotations by two experienced professionals provided reliability, but the lack of an external gold standard, such as histological validation, raises the risk of reference standard bias
Flow and timing Unclear Low The timing between radiograph acquisition and labelling was not specified, and the handling of missing data is unclear, leading to uncertainty in this domain
Karakuş (2024) [26] Patient selection High High The study utilized 860 bitewing radiographs, excluding images with unknown grades or annotation issues. While the dataset is robust, exclusion criteria may introduce selection bias by omitting real-world cases, such as ambiguous or poor-quality images; this also poses an applicability concern
Index test Unclear Low The YOLOv8 model showed high precision (97.7%), recall (93.2%), and F1 score (95.4%). However, the lack of details regarding blinding during interpretation raises potential concerns about observer bias
Reference standard High Low Ground truth was established by six dental clinicians through consensus annotations. Despite this effort, the absence of an external gold standard, such as histological validation, introduces reference standard bias
Flow and timing Unclear Low The timing between radiograph acquisition and annotation is not detailed, nor is the handling of missing data, resulting in uncertainty in this domain
de Frutos (2024) [23] Patient selection High Low This study utilized 13,887 bitewing radiographs from the HUNT4 Oral Health Study, excluding images with unknown grades or annotation issues. Although it features a large dataset, the exclusion criteria could bias results by removing difficult or ambiguous cases, and this also poses an applicability concern
Index test Low Low Three state-of-the-art models (YOLOv5, RetinaNet, and EfficientDet) were evaluated, showing performance on par or better than clinicians. The index test aligns well with the study objective
Reference standard High Low Annotations by six dental clinicians and consensus annotation for the test set ensured reliability. However, no external gold standard (e.g., histological validation) was used, which could affect bias
Flow and timing Unclear Low The study lacks details on how data splits were performed at the patient level, risking potential leakage. The timing between data collection and annotation was also not described
Azhari (2023) [15] Patient selection High High This study used 771 bitewing radiographs, split into adult and paediatric datasets. Exclusion criteria, such as poor image quality or identifiable patient information, likely introduced selection bias by excluding cases representative of real-world variability; this also poses an applicability concern
Index test Unclear Low Semantic segmentation models trained on U-Net architecture with transfer learning showed high accuracy for advanced caries but struggled with primary and moderate caries. The lack of clarity about blinding during the interpretation raises concerns about observer bias
Reference standard High Low Ground truth labelling by two calibrated examiners was reliable, with kappa >0.8. However, no external gold standard (e.g., histological validation) was used, introducing reference standard bias
Flow and timing Unclear Low No information was provided on the timing of radiograph acquisition and annotation. Handling of missing data was not explicitly discussed, raising uncertainty about this domain
Ahmed (2023) [12] Patient selection High High This study utilized 554 bitewing radiographs and excluded those with technical issues, overlapping images, or indirect restorations. These exclusions could introduce selection bias by limiting the generalizability of the findings to real-world clinical scenarios; this also poses an applicability concern
Index test Unclear Low The U-Net model, using three encoders (ResNet50, ResNeXt101, and VGG19), achieved an F1 score of 0.535. However, the absence of information on blinding during test interpretation raises potential observer bias concerns
Reference standard High Low Two calibrated restorative dentists labelled the images according to a modified ICDAS protocol, but the lack of an external gold standard such as histological validation introduces potential reference standard bias
Flow and timing Unclear Low The timing between image acquisition and labelling was not reported, and no mention was made of how missing or excluded data were handled, resulting in an unclear risk of bias in this domain
Panyarak (2022) [31] Patient selection High High The study analysed 994 bitewing radiographs, but the exclusion of teeth with overlapping surfaces exceeding half of the enamel likely introduced selection bias as this does not reflect real-world variability in clinical practice; this also poses an applicability concern
Index test Low Low YOLOv3 showed acceptable performance for caries detection and classification under IoU50 and IoU75 thresholds. However, there was a slight performance reduction at the higher IoU threshold. The test design aligns well with clinical and research objectives
Reference standard High Low Caries annotations were conducted by three experienced radiologists using a consensus approach. However, the absence of an external gold standard, such as histological validation, increases the risk of reference standard bias
Flow and timing Unclear Low The study lacks detailed information on the timing of radiograph acquisition and annotation, as well as how missing or excluded data were managed, leading to uncertainty in this domain
Kunt (2023) [27] Patient selection High High This study analysed 3,989 bitewing radiographs from four X-ray units. The exclusion of low-quality images and inconsistent scaling of radiographs introduces potential selection bias as these exclusions do not fully reflect clinical variability; this also poses an applicability concern
Index test Low Low Multiple object detection CNNs (YOLOv5, RetinaNet, Faster R-CNN, EfficientDet) were evaluated, with the best-performing ensemble achieving an F1 score of 0.80 and AP of 0.86. The use of advanced models and ensembling aligns well with clinical objectives
Reference standard High Low Caries annotations were performed by a single expert using bounding boxes. The reliance on a single annotator without external validation (e.g., histological confirmation) increases the risk of reference standard bias
Flow and timing Unclear Low The study lacks detailed information on the timing of image acquisition and annotation, as well as handling of missing data, leading to uncertainty in this domain
Vimalarani (2022) [34] Patient selection High High This study analysed 1,000 bitewing radiographs, but the exclusion of poor-quality images and augmentation-based dataset creation introduce selection bias as it does not reflect the variability seen in real-world clinical data; this also poses an applicability concern
Index test Unclear Low The proposed DG-LeNet classifier showed high accuracy (98.74%) and sensitivity (91.37%) for detecting caries. However, the study lacks details about blinding during test interpretation, raising concerns about potential observer bias
Reference standard High High The ground truth was determined by faculty-provided annotations without external validation, such as histological confirmation. This introduces a high risk of reference standard bias
Flow and timing Unclear Low The timing between image acquisition, annotation, and testing was not explicitly described, and handling of missing data was not reported, leading to uncertainty in this domain
Mao (2021) [29] Patient selection High High This study analysed 278 bitewing radiographs and excluded low-quality images. Additionally, image augmentation was applied to balance the dataset, which may not reflect real-world clinical diversity, introducing selection bias; this also poses an applicability concern
Index test Low Low The modified AlexNet CNN achieved an accuracy of 90.3% for caries detection and 95.56% for restoration detection, demonstrating strong performance and alignment with the study’s clinical goals
Reference standard High Low Annotations were provided by three professional dentists, but no external gold standard (e.g., histological confirmation) was employed, which raises the potential for reference standard bias
Flow and timing Unclear Low The timing between image acquisition and annotation is not detailed, nor is the handling of missing data explicitly described, leading to uncertainty in this domain
Moran (2021) [30] Patient selection High High The study analysed 112 bitewing radiographs, with exclusion criteria such as dental implants, crowding, and malocclusion. These exclusions may not fully represent real-world variability, introducing a high risk of selection bias; this also poses an applicability concern
Index test Low Low The CNN models (ResNet and Inception) achieved promising accuracy for caries classification, with the best-performing model achieving 73.3% accuracy. The models align well with the study’s objectives and clinical requirements
Reference standard High Low Ground truth annotations were based on a single expert’s evaluation without external gold standard validation, such as histological confirmation, introducing reference standard bias
Flow and timing Unclear Low The study lacks detailed information on timing between radiograph acquisition and annotation, as well as how missing or excluded data were handled, leading to uncertainty in this domain
Bayrakdar (2022) [18] Patient selection High High The study used 621 anonymized bitewing radiographs, excluding radiographs with artifacts and resizing images for standardization. These exclusions may introduce selection bias by not fully representing real-world clinical scenarios; this also poses an applicability concern
Index test Low Low The VGG-16 model for caries detection and U-Net for segmentation achieved strong internal performance metrics (F1 score = 0.84) and promising external validation results (F1 score = 0.78), aligning well with clinical and research objectives
Reference standard High Low Ground truth was established through annotations by two experienced clinicians without external gold standard validation, such as histological data, introducing a high risk of reference standard bias
Flow and timing Unclear Low The timing of radiograph acquisition and annotation, as well as the management of excluded or missing data, was not detailed, resulting in uncertainty in this domain
Forouzeshfar (2024) [25] Patient selection High High The study analysed 713 bitewing radiographs, excluding low-quality images and performing data augmentation to increase the dataset to 6,032 images. These exclusions and augmentations may introduce selection bias as they do not fully reflect real-world variability; this also poses an applicability concern
Index test Low Low Four CNN architectures (VGG16, VGG19, AlexNet, ResNet50) were employed, with VGG19 achieving the highest accuracy (93.93%). The models align well with clinical requirements and the research objectives
Reference standard High Low Annotations were performed by a specialist dentist and a radiologist assistant, but no external gold standard validation (e.g., histological confirmation) was used, increasing the risk of reference standard bias
Flow and timing Unclear Low The timing of radiograph acquisition, annotation, and model evaluation is not clearly specified. Handling of missing data was not reported, resulting in uncertainty in this domain
Cantu (2022) [20] Patient selection High High The study analysed 3,686 bitewing radiographs but excluded those where assessment was deemed impossible (e.g., poor-quality images). This introduces selection bias as these exclusions may not reflect real-world clinical challenges; this also poses an applicability concern
Index test Low Low The U-Net CNN with an EfficientNet-B5 encoder achieved an accuracy of 0.80 and outperformed dentists in sensitivity (0.75 vs. 0.36). The model aligns well with the study’s objectives and demonstrates robustness for detecting both initial and advanced caries lesions
Reference standard High Low Annotations were conducted independently by three experienced dentists and later reviewed by a fourth. However, the absence of an external gold standard (e.g., histological validation) increases the risk of reference standard bias
Flow and timing Unclear Low The timing of radiograph acquisition and annotation was not explicitly reported, nor was the handling of missing data, leading to uncertainty in this domain
Chaves (2024) [21] Patient selection High High This study analysed 425 bitewing radiographs from a dental practice network. Exclusion of radiographs with inappropriate angulation or distortion introduces potential selection bias as these conditions are common in real-world practice; this also poses an applicability concern
Index test Low Low The Mask-RCNN architecture with Swin Transformer backbone demonstrated high performance for detecting primary and secondary caries lesions, with FROC AUCs of 0.806 (primary) and 0.804 (secondary). This aligns well with the study’s clinical objectives
Reference standard High Low Caries annotations were conducted in three rounds by graduate and PhD students, with final consensus by experts. However, the absence of an external gold standard, such as histological validation, increases the risk of reference standard bias
Flow and timing Unclear Low The study lacks details on the timing of radiograph acquisition and annotation or how missing data were managed, leading to uncertainty in this domain
Chen (2022) [22] Patient selection High High The study used 978 bitewing radiographs and excluded primary teeth, poor-quality images, and distorted radiographs. These exclusions, while improving data quality, introduce selection bias by limiting the real-world applicability; this also poses an applicability concern
Index test Low Low The Faster R-CNN model achieved a sensitivity of 0.72 and specificity of 0.93, significantly outperforming postgraduate students. Its diagnostic accuracy and robustness across different lesion stages demonstrate strong clinical applicability
Reference standard High Low Annotations were conducted by two endodontic experts and one radiologist, but no external gold standard such as histological validation was used, leading to a high risk of reference standard bias
Flow and timing Unclear Low The study did not provide sufficient details on the timing of radiograph acquisition and annotation or the handling of missing data, resulting in uncertainty in this domain
Bayraktar (2022) [19] Patient selection High High The study analysed 1,000 bitewing radiographs, excluding images that lacked sufficient diagnostic quality. This exclusion may introduce selection bias by removing challenging cases seen in real-world clinical scenarios; this also poses an applicability concern
Index test Low Low The YOLOv3-based CNN demonstrated an accuracy of 94.59%, sensitivity of 72.26%, and specificity of 98.19%. The use of a state-of-the-art object detection model aligns well with the study’s objectives and shows strong applicability for clinical use
Reference standard High Low Caries lesions were annotated by two experienced dentists through consensus, but the lack of external validation with a histological gold standard increases the risk of reference standard bias
Flow and timing Unclear Low The study lacks detailed information on the timing of radiograph acquisition, annotation, and testing, as well as handling of excluded or missing data, leading to uncertainty in this domain
Panyarak (2023) [32] Patient selection High High The study analysed 2,758 annotated bitewing radiographs, excluding images with severe occlusal attrition or poor quality. These exclusions may limit real-world variability, introduce selection bias, and pose an applicability concern
Index test Low Low Four ResNet models (18, 50, 101, 152) were employed for ICCMS™-RSS classification. ResNet-50 achieved the best balance of sensitivity (78%) and specificity (67.19%) in the 4-class system, aligning well with clinical applications
Reference standard High Low Ground truth was annotated by three radiologists using a consensus approach but lacked an external gold standard (e.g., histological validation), increasing the risk of reference standard bias
Flow and timing Unclear Low The study lacks detailed information on timing of radiograph acquisition and annotation, and no mention of how missing data were managed, leading to uncertainty in this domain
Estai (2022) [24] Patient selection High High The study analysed 2,468 bitewing radiographs, excluding those with extensive restorations, primary teeth, or poor-quality images. These exclusions likely introduced selection bias by not representing real-world diversity in clinical scenarios; this also poses an applicability concern
Index test Low Low The 2-step DL system combining Faster R-CNN for ROI detection and Inception-ResNet-v2 for classification achieved an F1 score of 0.87. The system demonstrated high diagnostic accuracy, aligning well with clinical and research objectives
Reference standard High Low Three experienced dentists annotated the images through a consensus approach. However, the lack of an external gold standard, such as histological validation, increases the risk of reference standard bias
Flow and timing Unclear Low The study does not provide details on the timing of image acquisition, annotation, or validation, nor does it specify how missing or excluded data were managed, leading to uncertainty in this domain
Lee (2021) [28] Patient selection High High The study used 304 training radiographs and 50 evaluation radiographs, excluding images with severe overlapping or poor quality. These exclusions, while improving image clarity, introduce selection bias and limit real-world applicability
Index test Low Low The U-Net CNN model achieved moderate performance with an F1 score of 64.14%. The model aligns well with clinical objectives and demonstrated utility in improving clinicians’ sensitivity, particularly for early caries detection
Reference standard High Low Annotations were performed by two experienced observers, with consensus reached for discrepancies. However, no external gold standard, such as histological validation, was used, increasing the risk of reference standard bias
Flow and timing Unclear Low The study lacks detailed information on the timing of radiograph acquisition and annotation. Handling of missing or excluded data was not explicitly described, leading to uncertainty in this domain
Baydar (2023) [17] Patient selection High High The study used 500 anonymized bitewing radiographs, excluding those with artefacts or poor quality. This exclusion likely introduced selection bias by not representing real-world clinical diversity; this also poses an applicability concern
Index test Low Low The U-Net model demonstrated strong performance in segmentation tasks with F1 scores ranging from 0.8818 (caries detection) to 0.9722 (root canal filling). These results indicate high diagnostic accuracy, aligning well with the study’s objectives
Reference standard High Low Ground truth annotations were performed by two clinicians with differing experience levels, but no external gold standard (e.g., histological confirmation) was used, increasing the risk of reference standard bias
Flow and timing Unclear Low The study does not provide details on the timing of radiograph acquisition and annotation or the handling of missing or excluded data, leading to uncertainty in this domain
van Nistelrooij (2024) [33] Patient selection High Low The study used 413 bitewing radiographs containing 2,612 restored teeth from 383 patients. Exclusion of radiographs with distortions or artifacts may introduce selection bias by omitting challenging real-world cases
Index test Low Low The Mask R-CNN with Swin Transformer backbone achieved high specificity (0.966) for detecting secondary caries, with lower sensitivity (0.737). The model’s ability to stage caries severity aligns well with clinical decision-making
Reference standard High Low Two annotators performed pixel-wise labelling, reviewed by senior dentists. The absence of an external gold standard, such as histological validation, increases the risk of reference standard bias
Flow and timing Unclear Low The study does not specify the timing of radiograph acquisition or annotation. Additionally, it does not provide details on the handling of missing or excluded data, leading to uncertainty in this domain
Ayan (2024) [13] Patient selection High Low The study used 1,200 bitewing radiographs, excluding canines, third molars, and images with restorations or secondary caries. Such exclusions may introduce selection bias by limiting the generalizability to typical clinical scenarios encountered in practice
Index test Low Low The modified YOLOv5 (YOLOv5xs) deep learning model demonstrated good detection performance (mAP: 65%, with 54% for enamel and 77% for dentin lesions). The model’s performance aligns well with clinical diagnostic objectives, indicating its strong applicability in educational contexts
Reference standard High Low Two specialist dentists labelled caries lesions through consensus. However, the absence of external validation (e.g., histological confirmation) introduces a high risk of reference standard bias
Flow and timing Unclear Low Details regarding the timing of radiograph acquisition and annotation procedures were insufficiently described. Additionally, there was no explicit information provided about how excluded or missing data were handled, leading to uncertainty within this domain
Bayati (2025) [16] Patient selection High Low The study used 552 bitewing radiographs, excluding images with poor quality, severe distortion, or significant overlapping proximal surfaces. Such stringent exclusions introduce selection bias by omitting more challenging real-world cases that clinicians typically encounter
Index test Low Low The YOLOv8 deep learning model achieved strong performance metrics, notably high precision (96.03%) and recall (96.03%) for enamel caries, demonstrating the model’s suitability and strong clinical applicability for caries detection tasks
Reference standard High Low Two annotators performed pixel-wise labelling, reviewed by senior dentists. The absence of an external gold standard, such as histological validation, increases the risk of reference standard bias
Flow and timing Unclear Low The study does not specify the timing of radiograph acquisition or annotation. Additionally, it does not provide details on the handling of missing or excluded data, leading to uncertainty in this domain

Risk of bias was assessed across four domains: patient selection, index test, reference standard, and flow and timing. For each domain, the first rating indicates the risk of bias and the second the concern regarding applicability. A narrative summary is provided alongside the tabulated results.

mAP, mean average precision; AP, average precision; AUC, area under the curve; IoU, intersection over union; FROC, free-response receiver operating characteristic.

Discussion

The applied performance metrics vary significantly among studies, making it challenging to compare them. AI models for dental applications are primarily adapted from medical imaging diagnostics. In medical AI, sensitivity is often prioritized over specificity, particularly in conditions like cancer detection, where missing a diagnosis can have severe consequences. Many studies emphasize recall, precision, and F1 score, as medical datasets are often imbalanced, and decision thresholds can be adjusted to optimize performance.

However, dental AI requires a more balanced approach. In caries detection, sensitivity and specificity are equally essential for ensuring accurate diagnosis and making informed treatment decisions. Overdiagnosis can lead to unnecessary interventions, while underdiagnosis may allow disease progression. Therefore, AI models for caries detection must balance detecting actual carious lesions against producing false positives. To ensure clinical relevance, studies should report a comprehensive set of diagnostic metrics, including recall (sensitivity), specificity, AUC-ROC, F1 score, and AP, to provide an accurate and meaningful evaluation of model performance in real-world dental practice.
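As a minimal illustration of how these core metrics relate to one another, the following sketch computes them from a confusion matrix. The labels and predictions are hypothetical and do not come from any reviewed study.

```python
# Hypothetical per-surface labels: 1 = carious, 0 = sound.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical model predictions at a fixed decision threshold.
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

sensitivity = tp / (tp + fn)  # recall: carious surfaces correctly found
specificity = tn / (tn + fp)  # sound surfaces correctly left untreated
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(sensitivity, specificity, precision, round(f1, 3))
```

Reporting sensitivity and specificity together, rather than a single accuracy or F1 figure, makes the overdiagnosis/underdiagnosis trade-off explicit, since F1 alone ignores true negatives entirely.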

The preference for architectures like ResNet and YOLO indicates their effectiveness in image recognition and object detection within dental radiography. These models excel at processing complex image patterns and detecting features indicative of caries, especially advanced lesions. However, several studies highlighted limitations of AI models in detecting early-stage caries [12, 13, 15, 27, 31], while two studies reported lower performance in detecting dentinal caries [16, 30]. This discrepancy may partly stem from variations in the training data, such as the distribution of lesion stages used for model training. AI models tend to perform better when the dataset adequately represents the types of lesions being analysed. This underscores the need for larger, more diverse datasets that encompass caries at various stages. Developing models specifically tailored to recognize subtle early-stage demineralization remains challenging [3].

The included studies explored various deep learning architectures and training techniques for caries detection, segmentation, and severity classification, directly addressing practical needs. The development of these models relied on clinicians’ subjective evaluation for image annotation. The lack of a “gold standard” (histological validation) for caries detection may lead to lowered sensitivity for initial caries diagnostics [40] and potential bias. The number and expertise of annotators varied widely, with some studies not specifying qualifications, impacting the reliability of annotations and model training [41].

For deep learning models in caries diagnostics to be robust and generalizable, they must be trained on diverse, high-quality datasets that include images from various imaging systems, collected across multiple geographic locations, and representing a wide range of patient demographics and imaging conditions. Such diversity enhances the model’s ability to adapt to real-world clinical scenarios, improving reliability and diagnostic accuracy.

In this systematic review, several studies excluded radiographs of poor image quality [12, 14–17, 23, 24, 26, 28, 31–33]. Most studies relied on bitewing radiographs from a single institution or local clinics, resulting in a lack of demographic diversity in the training data (Table 2a). Only one study utilized radiographs collected from six institutes or research centers on a national scale [23]. However, no AI models were developed using internationally sourced, multicenter imaging datasets. Among the ten studies that reported the type of image receptor used for image acquisition, all employed storage phosphor plates except one [27], which used three direct acquisition systems and one indirect system (Table 2b). This raises concerns about the generalizability of the models to broader populations and varying clinical settings.

Data scarcity is common in medical imaging AI research [40]. Only a few studies provided data upon request; just two made the data fully open source [27, 33].

The reported diagnostic performance of deep learning models underscores their potential as adjuncts in the radiographic diagnosis of caries (Table 2b). A recent meta-analysis of 14 studies reported mean sensitivity and specificity values of 0.87 and 0.89 for caries detection using bitewing radiographs [6]. However, our systematic review highlights significant heterogeneity, including variations in radiographic acquisition, dataset characteristics, AI models and their tasks (detection, segmentation, and classification), annotation details, lesion types, and lesion prevalence. Additionally, limitations in the annotation process further restrict direct comparisons of diagnostic performance and impede universal generalizations. The lack of standardized methodologies and variability in reporting outcomes make it challenging to fully assess how these models compare to traditional diagnostic methods or each other. This inconsistency mirrors concerns raised in other systematic reviews of AI in medical imaging, where variability hampers evidence synthesis [42].

Risks of Bias

The risk of bias assessment revealed several concerns. Selection bias may result from using datasets from single institutions and excluding images with artefacts or poor quality, potentially producing models that are not generalizable to broader populations or environments [43]. Measurement bias arises from variability in annotation practices and the lack of standardization, introducing inconsistencies in the training data. The subjectivity inherent in caries diagnosis affects model performance. Algorithmic bias is another concern, where algorithms may inadvertently prioritize certain features, leading to missed subtle signs of caries or overemphasis on irrelevant patterns [44]. Although many of these biases are intrinsic to the field and do not fundamentally negate the findings, they should be carefully considered when interpreting the results.

Implications for Clinical Practice and Future Research

The high risk of bias and lack of external validation limit immediate applicability in diverse clinical settings. Models trained on homogeneous datasets may not perform adequately across different populations or imaging conditions. Therefore, these AI tools should currently be viewed as complementary aids rather than replacements for professional judgment. Ethical considerations, such as patient privacy, data security, and transparency in AI decision-making processes, must be addressed before widespread adoption [45].

High-quality, adequately reported AI studies on caries diagnostics using bitewings are needed [6]. Future research should prioritize large, multi-institutional datasets that encompass diverse patient demographics, imaging conditions, and caries presentations [46]. Efforts should also focus on minimizing selection, measurement, and algorithmic bias through techniques such as cross-validation in the training process. Additionally, standardizing annotation protocols is crucial: establishing clear guidelines for annotator expertise, the number of annotators, annotation procedures, and consensus-building methods will improve the quality and reliability of ground truth data [47].
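One concrete safeguard against leakage-driven bias, noted for several included studies, is splitting data at the patient rather than the image level, so that multiple bitewings from the same patient never straddle the training and test sets. A minimal sketch, using hypothetical image and patient identifiers:

```python
import random

# Hypothetical dataset: 30 bitewings, with roughly three images per patient.
# Splitting at the image level would let a patient appear in both partitions,
# inflating apparent test performance.
images = [(f"img_{i:03d}", f"patient_{i // 3:02d}") for i in range(30)]

# Shuffle the unique patient identifiers, then assign 80% of PATIENTS
# (not images) to training.
patients = sorted({pid for _, pid in images})
random.Random(0).shuffle(patients)
cut = int(0.8 * len(patients))
train_patients = set(patients[:cut])

train = [img for img, pid in images if pid in train_patients]
test = [img for img, pid in images if pid not in train_patients]
```

The same grouping principle extends to cross-validation, where each fold should hold out whole patients; libraries such as scikit-learn provide grouped splitters for this purpose.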

Conclusion

Deep learning models show promise in detecting and classifying dental caries using bitewing radiographs. While the studies demonstrate high diagnostic performance, especially in detecting advanced lesions, significant challenges remain. The high risk of bias, lack of standardization, limited dataset diversity, and absence of external validation constrain generalizability and clinical applicability.

Acknowledgments

The article was written with the assistance of Claude, ChatGPT, and Google Gemini for advice on formulations and proofreading but not as a source of facts. Figures were generated using Microsoft Excel, Python (v3.12.1), and the Matplotlib library (v3.5.2).

Statement of Ethics

Ethical approval was not required for this systematic review; a Statement of Ethics is not applicable because this study is based exclusively on published literature.

Conflict of Interest Statement

The authors confirm that there are no known financial or personal relationships that could have potentially influenced the work reported in this paper.

Funding Sources

The University of Bergen provided funding for the project by allocating research resources.

Author Contributions

K.S.H.H. has contributed substantially to the concept and study design, data acquisition, analysis or interpretation of data, and manuscript writing and revision. N.R.G. has contributed substantially to the concept or design, analysis or interpretation of data, and manuscript revision. X.Q.S. has contributed substantially to the concept and study design, data acquisition, analysis or interpretation of data, and manuscript writing and revision.

Funding Statement

The University of Bergen provided funding for the project by allocating research resources.

Data Availability Statement

All data analysed in this study are included in this published article and its online supplementary information files. Further enquiries can be directed to the corresponding author. Thirteen studies included in this review explicitly commented on data availability [13, 16, 17, 22, 23, 25–28, 30, 31, 33, 34]. Of these, seven studies [16, 22, 23, 25, 26, 30, 34] offered access to their data upon request. Two studies [27, 33] made their data fully available as open source.

Supplementary Material.


References

  • 1. Kazeminia M, Abdi A, Shohaimi S, Jalali R, Vaisi-Raygani A, Salari N, et al. Dental caries in primary and permanent teeth in children's worldwide, 1995 to 2019: a systematic review and meta-analysis. Head Face Med. 2020;16(1):22.
  • 2. Young DA, Nový BB, Zeller GG, Hale R, Hart TC, Truelove EL; American Dental Association Council on Scientific Affairs, et al. The American Dental Association caries classification system for clinical practice: a report of the American Dental Association Council on Scientific Affairs. J Am Dent Assoc. 2015;146(2):79–86.
  • 3. Wenzel A. Radiographic display of carious lesions and cavitation in approximal surfaces: advantages and drawbacks of conventional and advanced modalities. Acta Odontol Scand. 2014;72(4):251–64.
  • 4. Mohammad-Rahimi H, Motamedian SR, Rohban MH, Krois J, Uribe SE, Mahmoudinia E, et al. Deep learning for caries detection: a systematic review. J Dent. 2022;122:104115.
  • 5. Abbott LP, Saikia A, Anthonappa RP. Artificial intelligence platforms in dental caries detection: a systematic review and meta-analysis. J Evid Based Dent Pract. 2025;25(1):102077.
  • 6. Ammar N, Kühnisch J. Diagnostic performance of artificial intelligence-aided caries detection on bitewing radiographs: a systematic review and meta-analysis. Jpn Dent Sci Rev. 2024;60:128–36.
  • 7. Mahizha SI, Annrose J, Mano Christaine Angelo J, Domilin Shyni I, Veda Giri GV. Deep convolutional neural networks for early detection of interproximal caries using bitewing radiographs: a systematic review. Evid Based Dent. 2025.
  • 8. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.
  • 9. McGowan J, Sampson M, Salzwedel DM, Cogo E, Foerster V, Lefebvre C. PRESS peer review of electronic search strategies: 2015 guideline statement. J Clin Epidemiol. 2016;75:40–6.
  • 10. Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan-a web and mobile app for systematic reviews. Syst Rev. 2016;5(1):210.
  • 11. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–36.
  • 12. Ahmed W, Azhari A, Fawaz K, Ahmed H, Alsadah Z, Majumdar A, et al. Artificial intelligence in the detection and classification of dental caries. J Prosthet Dent. 2025;133(5):1326–32.
  • 13. Ayan E, Bayraktar Y, Çelik Ç, Ayhan B. Dental student application of artificial intelligence technology in detecting proximal caries lesions. J Dent Educ. 2024;88(4):490–500.
  • 14. Ayhan B, Ayan E, Bayraktar Y. A novel deep learning-based perspective for tooth numbering and caries detection. Clin Oral Investig. 2024;28(3):178.
  • 15. Azhari AA, Helal N, Sabri LM, Abduljawad A. Artificial intelligence (AI) in restorative dentistry: performance of AI models designed for detection of interproximal carious lesions on primary and permanent dentition. Digit Health. 2023;9:20552076231216681.
  • 16. Bayati M, Alizadeh Savareh B, Ahmadinejad H, Mosavat F. Advanced AI-driven detection of interproximal caries in bitewing radiographs using YOLOv8. Sci Rep. 2025;15(1):4641.
  • 17. Baydar O, Różyło-Kalinowska I, Futyma-Gąbka K, Sağlam H. The U-net approaches to evaluation of dental bite-wing radiographs: an artificial intelligence study. Diagnostics. 2023;13(3):453.
  • 18. Bayrakdar I, Orhan K, Akarsu S, Çelik Ö, Atasoy S, Pekince A, et al. Deep-learning approach for caries detection and segmentation on dental bitewing radiographs. Oral Radiol. 2022;38(4):468–79.
  • 19. Bayraktar Y, Ayan E. Diagnosis of interproximal caries lesions with deep convolutional neural network in digital bitewing radiographs. Clin Oral Investig. 2022;26(1):623–32.
  • 20. Cantu A, Gehrung S, Krois J, Chaurasia A, Rossi J, Gaudin R, et al. Detecting caries lesions of different radiographic extension on bitewings using deep learning. J Dent. 2020;100.
  • 21. Chaves E, Vinayahalingam S, van Nistelrooij N, Xi T, Romero V, Flügge T, et al. Detection of caries around restorations on bitewings using deep learning. J Dent. 2024;143:104886.
  • 22. Chen X, Guo J, Ye J, Zhang M, Liang Y. Detection of proximal caries lesions on bitewing radiographs using deep learning method. Caries Res. 2022;56(5–6):455–63.
  • 23. Pérez de Frutos J, Holden Helland R, Desai S, Nymoen L, Langø T, Remman T, et al. AI-Dentify: deep learning for proximal caries detection on bitewing x-ray-HUNT4 Oral Health Study. BMC Oral Health. 2024;24(1):344.
  • 24. Estai M, Tennant M, Gebauer D, Brostek A, Vignarajan J, Mehdizadeh M, et al. Evaluation of a deep learning system for automatic detection of proximal surface dental caries on bitewing radiographs. Oral Surg Oral Med Oral Pathol Oral Radiol. 2022;134(2):262–70.
  • 25. Forouzeshfar P, Safaei AA, Ghaderi F, Hashemikamangar SS. Dental caries diagnosis from bitewing images using convolutional neural networks. BMC Oral Health. 2024;24(1):211.
  • 26. Karakuş R, Öziç M, Tassoker M. AI-assisted detection of interproximal, occlusal, and secondary caries on bite-wing radiographs: a single-shot deep learning approach. J Imaging Inform Med. 2024;37(6):3146–59.
  • 27. Kunt L, Kybic J, Nagyová V, Tichý A. Automatic caries detection in bitewing radiographs: part I-deep learning. Clin Oral Investig. 2023;27(12):7463–71.
  • 28. Lee S, Oh S, Jo J, Kang S, Shin Y, Park J. Deep learning for early dental caries detection in bitewing radiographs. Sci Rep. 2021;11(1):16807.
  • 29. Mao YC, Chen TY, Chou HS, Lin SY, Liu SY, Chen YA, et al. Caries and restoration detection using bitewing film based on transfer learning with CNNs. Sensors (Basel). 2021;21(13):4613.
  • 30. Moran M, Faria M, Giraldi G, Bastos L, Oliveira L, Conci A. Classification of approximal caries in bitewing radiographs using convolutional neural networks. Sensors (Basel). 2021;21(15):5192.
  • 31. Panyarak W, Suttapak W, Wantanajittikul K, Charuakkra A, Prapayasatok S. Assessment of YOLOv3 for caries detection in bitewing radiographs based on the ICCMS radiographic scoring system. Clin Oral Investig. 2023;27(4):1731–42.
  • 32. Panyarak W, Wantanajittikul K, Suttapak W, Charuakkra A, Prapayasatok S. Feasibility of deep learning for dental caries classification in bitewing radiographs based on the ICCMS radiographic scoring system. Oral Surg Oral Med Oral Pathol Oral Radiol. 2023;135(2):272–81.
  • 33. van Nistelrooij N, Chaves ET, Cenci MS, Cao L, Loomans BAC, Xi T, et al. Deep learning-based algorithm for staging secondary caries in bitewings. Caries Res. 2024:1–11.
  • 34. Vimalarani G, Ramachandraiah U. Automatic diagnosis and detection of dental caries in bitewing radiographs using pervasive deep gradient based LeNet classifier model. Microprocess Microsyst. 2022;94:104654.
  • 35. García-Cañas Á, Bonfanti-Gris M, Paraíso-Medina S, Martínez-Rus F, Pradíes G. Diagnosis of interproximal caries lesions in bitewing radiographs using a deep convolutional neural network-based software. Caries Res. 2022;56(5–6):503–11.
  • 36. Kaur A, Jyoti D, Sharma A, Yelam D, Goyal R, Nath A. Deep caries detection using deep learning: from dataset acquisition to detection. Clin Oral Investig. 2024;28(12):677.
  • 37. Majanga V, Viriri S. Automatic blob detection for dental caries. Appl Sci (Basel). 2021;11(19):9232.
  • 38. Rajee MV, Mythili C. Dental image segmentation and classification using Inception ResNetV2. IETE J Res. 2023;69(8):4972–88.
  • 39. Szabó V, Szabó BT, Orhan K, Veres DS, Manulis D, Ezhov M, et al. Validation of artificial intelligence application for dental caries diagnosis on intraoral bitewing and periapical radiographs. J Dent. 2024;147:105105.
  • 40. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9.
  • 41. Galić I, Habijan M, Leventić H, Romić K. Machine learning empowering personalized medicine: a comprehensive review of medical image analysis methods. Electronics. 2023;12(21):4411.
  • 42. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271–e297.
  • 43. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317–8.
  • 44. Aquino YS. Making decisions: bias in artificial intelligence and data-driven diagnostic tools. Aust J Gen Pract. 2023;52(7):439–42.
  • 45. Price WN, Cohen IG. Privacy in the age of medical big data. Nat Med. 2019;25(1):37–43.
  • 46. Yasaka K, Abe O. Deep learning and artificial intelligence in radiology: current applications and future directions. PLoS Med. 2018;15(11):e1002707.
  • 47. Yang F, Zamzmi G, Angara S, Rajaraman S, Aquilina A, Xue Z, et al. Assessing inter-annotator agreement for medical image segmentation. IEEE Access. 2023;11:21300–12.


Articles from Caries Research are provided here courtesy of Karger Publishers
