Abstract
Automated machine learning (AutoML) has emerged as a clinician-accessible approach to artificial intelligence by automating key stages of model development, including algorithm selection and hyperparameter optimization. This systematic review aimed to evaluate the application, performance, and translational relevance of AutoML in oral healthcare. Electronic searches of PubMed, Web of Science, Cochrane Library, and grey literature were conducted. Nineteen primary studies met the inclusion criteria. AutoML was applied across multiple dental specialties, including periodontology, orthodontics, pediatric dentistry, oral surgery, oral radiology, and public oral health, using heterogeneous data modalities such as radiographic images, clinical records, photographs, and omics data. Most studies reported strong performance on internal validation (e.g., cross-validation or validation splits) and/or independent hold-out test sets, particularly for imaging-based classification tasks, with several models achieving high accuracy and discrimination. However, external validation was uncommon, sample sizes were frequently limited, and substantial risk of bias was identified, particularly in analysis and participant selection. No study evaluated real-world clinical implementation or patient outcomes. Overall, AutoML shows promise in reducing technical barriers and supporting clinician-led artificial intelligence research in dentistry. Nevertheless, the current evidence base remains exploratory, and rigorous external validation, standardized reporting, and prospective clinical evaluation are required before routine clinical adoption.
Keywords: Automated machine learning, Dental artificial intelligence, Clinical decision support, Diagnostic model, Predictive modeling, Oral healthcare
1. Introduction
Artificial intelligence (AI) has become a central methodological framework in contemporary dental research, enabling computational approaches for diagnosis, prognosis, risk stratification, and treatment planning across multiple dental subspecialties [1]. Despite this rapid expansion, the development of conventional machine learning (ML) and deep learning (DL) models in oral healthcare remains technically complex and resource-intensive [2]. Standard AI workflows typically require advanced expertise in data preprocessing, feature engineering, model architecture design, algorithm selection, hyperparameter tuning, and performance optimization [3], [4]. These technical demands create substantial barriers for many dental clinicians and researchers, limiting the independent development and widespread adoption of AI systems within dental institutions [5], [6].
Automated machine learning (AutoML) has emerged as a paradigm aimed at reducing these technical barriers by automating key stages of the machine-learning pipeline, particularly algorithm search, model selection, and hyperparameter optimization. Depending on the platform, AutoML frameworks may also provide partial automation of data preprocessing, feature handling, and model evaluation within standardized workflows [7]. Many AutoML platforms are delivered through no-code or low-code interfaces, enabling users without formal training in programming or data science to develop predictive and diagnostic models directly from clinical datasets [8], [9]. This automation has the potential to shift AI development from computational specialists toward domain experts, thereby altering how AI models are created and evaluated in dental research.
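For illustration, the sketch below shows what this automation looks like in practice using H2O AutoML, one of the open-source frameworks encountered later in this review. The file name and column names are hypothetical and do not correspond to any included study; the point is that the user supplies labeled data and a search budget, while algorithm search, model selection, and hyperparameter optimization are delegated to the framework.

```python
# Minimal AutoML sketch (hypothetical file and column names).
import h2o
from h2o.automl import H2OAutoML

h2o.init()

frame = h2o.import_file("dental_records.csv")    # hypothetical tabular dataset
frame["outcome"] = frame["outcome"].asfactor()   # treat target as categorical

train, test = frame.split_frame(ratios=[0.8], seed=42)

aml = H2OAutoML(max_runtime_secs=600, seed=42)   # 10-minute search budget
aml.train(y="outcome", training_frame=train)     # algorithm + hyperparameter search

print(aml.leaderboard.head())                    # ranked candidate models
print(aml.leader.model_performance(test).auc())  # discrimination on hold-out data
```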
Oral health represents a particularly suitable domain for AutoML applications due to its reliance on high-volume imaging data, structured clinical records, and outcome-driven decision-making [9]. By reducing the need for manual coding and algorithm engineering, AutoML may allow dental researchers to concentrate on clinically relevant problem formulation, data quality assurance, and result interpretation. Moreover, the standardized nature of AutoML pipelines may facilitate methodological consistency, support reproducibility, and enable more uniform performance benchmarking across studies and institutions, while potentially reducing certain forms of subjective, human-driven modeling bias [10]. From a translational perspective, AutoML aligns closely with the practical requirements of digital dentistry. AutoML frameworks have been shown to substantially lower the technical and resource barriers to building competitive AI models, facilitating development by clinicians without deep coding or ML expertise [11], [12]. Lowering these entry thresholds can accelerate clinician-driven AI research, promote multi-center collaboration, and support scalable evaluation of AI systems across diverse dental specialties. Prior evidence in dentistry demonstrates promising AutoML applications across diagnostic and predictive tasks [9].
Despite the rapid global expansion of AI research in oral healthcare, the specific contribution of AutoML within this field remains poorly characterized. Existing systematic reviews have largely concentrated on conventionally developed ML and DL models that rely on manual feature engineering and model optimization [13], [14]. In contrast, AutoML-based approaches have received minimal attention in dental research and have been addressed only in a single narrative review to date [9]. As a result, the scope of AutoML adoption in oral healthcare, its application across different dental specialties, and the methodological characteristics of the existing evidence base remain insufficiently understood.
Therefore, the aim of this systematic review was to examine the use of AutoML in oral healthcare. Specifically, this review sought to (i) describe how AutoML has been applied across dental specialties and data modalities; (ii) summarize reported model performance and evaluation approaches; and (iii) consider the potential translational relevance of AutoML-based models for clinical research and practice. By addressing these objectives, this review aims to outline the current evidence base, identify key methodological limitations, and highlight areas for future investigation regarding AutoML as a clinician-accessible artificial intelligence approach in dental research.
2. Methodology
2.1. Protocol and Registration
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [15]. The review protocol was prospectively registered in the PROSPERO (International Prospective Register of Systematic Reviews) database under the registration number CRD420251266034. As this review synthesized previously published data and did not involve the collection of primary data from human participants or animals, ethical approval and informed consent were not required.
2.2. Review Question
The review question was structured according to the Population, Intervention, Comparison, and Outcome (PICO) framework as follows. Population (P): Individuals assessed for dental and maxillofacial conditions. Intervention (I): AutoML-based models. Comparison (C): Conventional diagnostic approaches, expert clinical assessment, or traditional statistical or manually developed machine learning models, when applicable. Outcome (O): Model performance in terms of diagnostic or predictive accuracy, including accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, precision, recall, and F1-score.
Based on this framework, the review sought to answer the following question: “What is the performance (O) of AutoML–based models (I) in individuals with dental and maxillofacial conditions (P) compared with conventional diagnostic approaches or expert assessment (C)?”
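For reference, every outcome metric named in this PICO can be computed from a binary confusion matrix together with a continuous model score. A minimal scikit-learn sketch with hypothetical labels (not data from any included study):

```python
# Outcome metrics from hypothetical binary predictions and scores.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # reference standard
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # predicted class labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("AUC:        ", roc_auc_score(y_true, y_score))
print("sensitivity:", recall_score(y_true, y_pred))   # recall of positive class
print("specificity:", tn / (tn + fp))
print("precision:  ", precision_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))
```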
2.3. Eligibility Criteria
This systematic review included all full-text primary studies that evaluated the application of AutoML or highly automated machine learning pipelines in dentistry and oral and maxillofacial research for diagnostic, predictive, classification, or risk-assessment purposes. Eligible studies were required to: (1) implement an AutoML or automated ML framework for model development, such as Google Cloud AutoML, H2O AutoML, Auto-WEKA, TPOT, ORIENTATE, Vertex AI, LandingLens, or a similar platform; (2) utilize dental, oral, or maxillofacial datasets, including radiographic, photographic, microbiome, metabolomic, clinical, biochemical, or questionnaire-based data; (3) report quantitative model performance metrics such as accuracy, AUC, sensitivity, specificity, precision, recall, F1-score, or related model evaluation outcomes (internal validation, testing, or external validation); and (4) involve human subjects or human-derived datasets. Studies related to oral and maxillofacial oncology (e.g., oral squamous cell carcinoma, jaw tumors, and odontogenic malignancies) were included provided they met all other eligibility criteria.
Exclusion criteria comprised case reports, narrative reviews, systematic reviews, meta-analyses, book chapters, editorials, letters to the editor, conference abstracts, animal studies, in vitro studies, and purely technical algorithmic papers without direct clinical dental, oral, or maxillofacial application. Studies focusing exclusively on non-oral head and neck oncology (e.g., laryngeal, pharyngeal, thyroid, or salivary gland tumors without oral or jaw involvement) were excluded. No restrictions were applied regarding the year of publication or language.
2.4. Information Sources and Search Strategy
An electronic literature search was performed across the PubMed, Web of Science, and Cochrane Library databases up to 1 December 2025. The search strategy combined Medical Subject Headings (MeSH) terms and free-text keywords related to automated machine learning and dental and maxillofacial applications. The complete electronic search strategy is provided in Supplementary File 1.
To minimize the risk of publication and selection bias, an extensive search of the grey literature was conducted using Google Scholar and ProQuest Dissertations & Theses. In addition, a manual screening of the reference lists of all included studies and relevant review articles was undertaken to identify any additional eligible publications not captured through the electronic database search.
All retrieved citations were imported into EndNote 2025 software (Clarivate, Philadelphia, PA, USA) for duplicate identification and removal, followed by title, abstract, and full-text screening.
2.5. Study Selection and Data Extraction
Two independent reviewers screened the titles and abstracts of all retrieved records to assess potential eligibility. Studies that appeared to meet the inclusion criteria were subsequently evaluated at the full-text level. Full-text articles were independently reviewed to determine final eligibility for inclusion. Any discrepancies between the two reviewers were resolved through discussion, and when consensus could not be reached, a third reviewer was consulted to make the final decision.
Data extraction was performed independently by the two reviewers using a standardized data extraction form. The extracted variables included: study title, authors, year of publication, country of origin, clinical application domain, data modality (e.g., radiographic images, clinical records, photographic images, microbiome or metabolomic data), AutoML platform or framework used, study design, dataset size, type of input features, training, internal validation, and testing strategy (including cross-validation and hold-out test evaluation when applicable), external validation when reported, ground truth reference standard, performance evaluation metrics (including accuracy, AUC, sensitivity, specificity, precision, recall, and F1-score), and major study outcomes. To ensure terminological consistency across studies, the following definitions were applied throughout this review. The training set refers to data used to fit (develop) the model. Internal validation refers to procedures performed during model development for model selection or tuning (e.g., a validation split within the training data, k-fold cross-validation, or bootstrap resampling). The test set refers to an independent hold-out subset used exclusively for final model evaluation after training and tuning were completed. Performance assessed using data obtained from a different institution, population, or time period was considered external validation.
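These definitions can be made concrete in code. The sketch below, on purely synthetic data, separates an independent hold-out test set from the training data and uses 5-fold cross-validation within the training data as internal validation; external validation would require a separate cohort and is not shown.

```python
# Terminology sketch (synthetic data): training vs. internal validation
# vs. independent hold-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # hypothetical predictors
y = rng.integers(0, 2, size=300)          # hypothetical binary outcome

# Independent hold-out test set, untouched during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42)

# Internal validation: 5-fold cross-validation within the training data,
# used for model assessment/selection during development.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("internal validation AUC:", cv_auc.mean())

# Final evaluation on the hold-out test set after training is complete.
model.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```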
Inter-reviewer agreement for study selection was assessed using Cohen’s kappa (κ) statistic at both the title/abstract screening and full-text eligibility stages, and agreement was excellent (κ = 0.95). Disagreements were resolved through discussion and, when necessary, consultation with a third reviewer.
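Cohen’s κ corrects observed agreement for agreement expected by chance. A minimal sketch with hypothetical screening decisions (not the actual screening data of this review):

```python
# Cohen's kappa for two reviewers' hypothetical include/exclude decisions.
from sklearn.metrics import cohen_kappa_score

reviewer_1 = ["include", "exclude", "exclude", "include", "exclude", "include"]
reviewer_2 = ["include", "exclude", "include", "include", "exclude", "include"]

print(cohen_kappa_score(reviewer_1, reviewer_2))  # ≈ 0.67 here; 1.0 = perfect agreement
```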
2.6. Risk of Bias Assessment
The risk of bias and applicability of all included studies were independently assessed by two reviewers using the PROBAST (Prediction Model Risk of Bias Assessment Tool) [16], as all studies involved the development and/or validation of machine learning or automated machine learning–based prediction models. PROBAST evaluates risk of bias across four domains: participants, predictors, outcome, and analysis.
Each domain was classified as having low, high, or unclear risk of bias in accordance with the standardized PROBAST signaling questions. Any disagreements between the two reviewers were resolved through discussion, and when consensus could not be achieved, a third senior reviewer was consulted. An overall risk of bias judgment for each study was determined based on the domain-level assessments and is reported descriptively.
3. Results
3.1. Study Selection
The study selection process is presented in the PRISMA flow diagram (Fig. 1). The initial database search identified 769 records (PubMed: 443; Web of Science: 286; Cochrane: 38; grey literature: 2). After the removal of 197 duplicate records, 572 unique records remained and were screened based on titles and abstracts. Of these, 553 records were excluded for not meeting the eligibility criteria. A total of 19 full-text articles were sought for retrieval and assessed for eligibility, with no reports excluded at this stage. Ultimately, 19 studies fulfilled the inclusion criteria and were included in the systematic review. Due to substantial heterogeneity in study design, outcomes, data modalities, and validation strategies, quantitative meta-analysis was not feasible.
Fig. 1.
PRISMA flowchart.
3.2. Qualitative synthesis of included studies
Nineteen studies met the inclusion criteria (Table 1) [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. Collectively, they covered a broad range of dental specialties, including periodontology and implantology, orthodontics, pediatric dentistry, operative dentistry, oral and maxillofacial surgery, oral and maxillofacial radiology, community/preventive dentistry and public oral health. Most investigations were single-center observational studies, predominantly retrospective designs based on clinical records or image archives, complemented by a small number of prospective cross-sectional, cohort, diagnostic-accuracy and in-silico analyses. Studies were conducted in North and South America, Europe, Asia and Africa.
Table 1.
Study and clinical–data characteristics of the included studies.
| Author | Country | Study Design | Dental Specialty | Clinical Application | Data Type | Dataset Nature | Data Source | Modality / Format | Task Type | Target / Output |
|---|---|---|---|---|---|---|---|---|---|---|
| Lee et al.[17] | South Korea | Retrospective | Periodontology / Implantology | Dental implant system classification | Dental Imaging | Panoramic & periapical radiographs | Three Korean dental hospitals | 2D DICOM | Multi-Class Classification | Implant system type |
| Tobias et al.[18] | Israel | Prospective Observational | Periodontology / Community Dentistry | Gingival health monitoring and Modified Gingival Index (MGI) classification | Dental Imaging | Smartphone intraoral photographs | iGAM mobile-app pilot study, Hebrew University–Hadassah School of Dental Medicine | 2D RGB images | Binary Classification | Gingivitis severity |
| Heimisdottir et al.[19] | USA | Case–Control | Pediatric Dentistry / Public Health | Early childhood caries (ECC) risk classification | Metabolomics | Supragingival biofilm dataset | ZOE 2.0 preschool cohort, North Carolina Head Start centers | Biospecimen + metabolite matrix | Binary Classification | Localized ECC status |
| Karhade et al.[20] | USA | Cross-Sectional | Pediatric Dentistry / Public Health | Early childhood caries (ECC) status screening | Clinical Tabular | Demographic & behavioral ECC data | ZOE 2.0 cohort with external validation in NHANES 2011–2018 | Tabular variables | Binary Classification | ECC case status |
| Del Real et al.[21] | Chile | Retrospective | Orthodontics | Orthodontic extraction requirement prediction | Clinical Tabular | Clinical & cephalometric orthodontic data | Private orthodontic clinic, Santiago, Chile (2018–2019) | Tabular + numeric | Binary Classification | Extraction decision |
| Boitor et al.[22] | Romania | Cross-Sectional | Periodontology | Identification of periodontal and lifestyle risk factors associated with metabolic syndrome | Clinical Tabular | Periodontal & lifestyle dataset (A) | Hospitalized patients, Sibiu County Emergency Clinical Hospital (Cardiology/Diabetes departments) | Tabular clinical + questionnaire | Binary Classification | Metabolic syndrome |
| Boitor et al.[23] | Romania | Cross-Sectional | Periodontology | Prediction of metabolic syndrome in hospitalized patients with periodontal disease | Clinical Tabular | Periodontal, systemic & HRQoL dataset (B) | Hospitalized patients, Sibiu County Emergency Clinical Hospital (periodontal exam dataset) | Tabular + EQ-5D-5L | Binary Classification | Metabolic syndrome |
| Fatapour et al.[24] | USA | Retrospective | Oral and Maxillofacial Surgery | Prediction of locoregional recurrence in oral tongue squamous cell carcinoma | Oncology Registry | Population-based cancer registry | SEER 2000–2018 cancer registry (USA) | Tabular longitudinal data | Binary Classification | Locoregional recurrence |
| Gomez-Rios et al.[35] | Spain | Cohort Study | Pediatric Dentistry / Special Care Dentistry | Prediction of second deep sedation requirement in pediatric dental patients | Clinical Tabular | Pediatric clinical & sedation dataset | Private dental clinic, Cartagena, Spain (pediatric deep sedation, 2006–2018) | Tabular + tooth-level encoding | Binary Classification | Second deep sedation |
| Hamdan et al.[25] | USA | Retrospective | Operative Dentistry / Oral & Maxillofacial Radiology | Dental restoration segmentation and detection | Dental Imaging | Panoramic radiographs | Adult radiograph records, Marquette University School of Dentistry (2018–2023) | 2D pixel-level labeled images | Segmentation | Dental restorations |
| Kong et al.[26] | South Korea | Retrospective | Prosthodontics / Implantology | Dental implant system classification | Dental Imaging | Periapical radiographs | Implant periapical radiographs, Wonkwang University Dental Hospital (2005–2019) | 2D ROI crops | Multi-Class Classification | Implant system type |
| Kwack et al.[27] | South Korea | Retrospective | Oral and Maxillofacial Surgery | Prediction of medication-related osteonecrosis of the jaw (MRONJ) | Clinical Tabular | MRONJ questionnaire & clinical dataset | Osteoporosis patients on antiresorptive therapy, Dankook University Dental Hospital (2019–2022) | Tabular | Binary Classification | MRONJ occurrence |
| Cheong et al.[28] | United Kingdom | Retrospective | Oral and Maxillofacial Radiology | Paranasal sinus disease detection | Medical Imaging | MRI sinus slices | OASIS-3 public MRI dataset | 2D grayscale MRI | Binary Classification | Paranasal sinus disease |
| Gonzalez et al.[29] | USA | Diagnostic Accuracy (In Vitro) | Pediatric Dentistry | Primary proximal surface caries detection | Dental Imaging | Bitewing radiographs | Pediatric bitewings, Marquette University School of Dentistry | 2D radiographs | Object Detection | Proximal caries location |
| Jeong et al.[30] | South Korea | Experimental | Endodontics / Oral & Maxillofacial Radiology | Multiple root dental disease diagnosis | Dental Imaging | Periapical X-rays | Public Kaggle dataset: “Root Disease X-Ray Images” | 2D root-region images | Multi-Class / Multi-Label Classification | Root disease category |
| Nantakeeratipat et al.[31] | Thailand | Prospective Cross-Sectional | Periodontology / Preventive Dentistry | Dental plaque detection and grading | Dental Imaging | Smartphone plaque images | Dental student plaque images, Srinakharinwirot University, Bangkok | 2D RGB images | Multi-Class / Binary Classification | Plaque severity |
| Yoon et al.[32] | South Korea | Retrospective | Oral and Maxillofacial Surgery | Prediction of intensive care unit (ICU) admission in odontogenic infections | Clinical Tabular | ICU admission dataset | OMFS Department, Dankook University Dental Hospital (2013–2023) | Tabular clinical & laboratory data | Binary Classification | ICU admission |
| Akadiri et al.[33] | Nigeria | Diagnostic Accuracy (In Vitro) | Oral & Maxillofacial Surgery / Oral & Maxillofacial Radiology | Bony jaw lesion classification (benign vs malignant) | Medical Imaging | Panoramic & CT jaw lesions | Radiology archives, University of Port Harcourt Teaching Hospital, Nigeria | 2D panoramic + sectional CT | Binary Classification | Jaw lesion pathology |
| Pais et al.[34] | Portugal | In Silico | Periodontology / Implantology | Peri-implantitis prediction | Metagenomics | Saliva & peri-implant biofilm | NCBI SRA BioProject PRJNA1163384 (oral metagenomes) | Microbiome OTU/relative abundance table | Binary Classification | Peri-implantitis |
2D: two-dimensional; CT: computed tomography; DICOM: Digital Imaging and Communications in Medicine; ECC: early childhood caries; EQ-5D-5L: EuroQol five-dimension five-level questionnaire; HRQoL: health-related quality of life; ICU: intensive care unit; MGI: Modified Gingival Index; MRI: magnetic resonance imaging; MRONJ: medication-related osteonecrosis of the jaw; NCBI: National Center for Biotechnology Information; NHANES: National Health and Nutrition Examination Survey; OMFS: Oral and Maxillofacial Surgery; OTU: operational taxonomic unit; RGB: red–green–blue color model; ROI: region of interest; SEER: Surveillance, Epidemiology, and End Results Program; SRA: Sequence Read Archive
3.3. Study characteristics
Data sources were heterogeneous, spanning three broad domains (Table 1). First, imaging studies used panoramic and periapical radiographs, bitewings, smartphone intraoral photographs, MRI sinus slices and combined panoramic/CT images for tasks such as implant system classification, restoration segmentation, proximal caries detection, root-disease diagnosis, plaque and gingivitis grading, sinonasal disease detection and bony jaw lesion classification [17], [18], [25], [26], [28], [29], [30], [31], [33]. Second, clinical/tabular datasets incorporated demographic, behavioral, orthodontic, periodontal, systemic, sedation, oncologic and ICU-related variables for early childhood caries (ECC) status, metabolic syndrome, orthodontic extraction, locoregional cancer recurrence, second deep sedation, MRONJ and ICU admission [20], [21], [22], [23], [24], [27], [32], [35]. Third, omics studies analyzed supragingival biofilm metabolomics and oral metagenomic profiles to discriminate ECC and peri-implantitis, respectively [19], [34]. Almost all problems were framed as supervised classification (binary or multi-class), with one object-detection and one segmentation task.
As summarized in Table 2, ground truth definitions were generally aligned with accepted standards: implant types derived from electronic health records and inventory logs, radiology ground-truth databases or expert consensus, histopathological examination for jaw lesions, International Caries Detection and Assessment System and decayed–missing–filled surfaces (ICDAS/dmfs) criteria for ECC, National Cholesterol Education Program Adult Treatment Panel III (NCEP ATP III) and related criteria for metabolic syndrome, American Association of Oral and Maxillofacial Surgeons (AAOMS) criteria for medication-related osteonecrosis of the jaw (MRONJ), prospectively recorded intensive care unit (ICU) admission and sedation outcomes, and study-defined clinical groupings for peri-implant health and disease. Imaging labels were usually provided by board-certified specialists, sometimes with resident or student assistance; however, public datasets and some omics-based studies reported less detail regarding annotation procedures.
Table 2.
Dataset annotation, validation, and preprocessing characteristics of the included studies.
| Author | Ground Truth Source | Annotation Method | Annotation Expertise | Dataset Size | Training Split | Validation Split | Test Split | External Validation Size | Validation approach | Class Balance | Missing Data | Data Augmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lee et al. [17] | Electronic dental & medical records | Implant system labeling from EMR and clinical documentation | 5 periodontal residents + 3 board-certified periodontists | 11,980 radiographs | 9584 (80%) | Nil | 2396 (20%) | 0 | Random train–test split | Imbalanced: 6 implant classes (TSIII 46.9%, Implantium 21.0%, Superline 19.7%, Astra 3.2%, SLActive BL 4.5%, BLT 4.7%) | NR | No (only resizing + brightness/contrast normalization) |
| Tobias et al. [18] | Dental selfie images | MGI scoring (0–4) and tooth localization | Single board-certified dentist (blinded) | 1520 tooth images from 190 selfies (44 participants) | 1336 images used in 5-fold CV | 5-fold CV (no fixed validation set) | 184 images held out as test set | 0 | 5-fold cross-validation + hold-out | Imbalanced: MGI scores 0–4 = 471, 475, 195, 98, 283; healthy vs inflamed = 471 vs 1049 | NR | No (hand-crafted features; no image augmentation) |
| Heimisdottir et al. [19] | Community dental screening records | ICDAS-based caries diagnosis | Calibrated dentists/examiners | 289 metabolomic samples | 10-fold CV on full dataset | 10% per fold (within CV) | No independent test set | 0 | 10-fold cross-validation | Imbalanced: ECC 39% cases vs 61% noncases | QRILC imputation for left-censored metabolite values; samples with > 30% missing excluded | No (preprocessing only: normalization, log2 transform, QRILC imputation) |
| Karhade et al. [20] | Clinical dental exams | dmfs> 0 and ICDAS ≥ 3 criteria | Trained and calibrated examiners in ZOE 2.0; NHANES examiners for external validation | 6404 participants | 80% | 20% (internal validation) | No independent test set | 2308 | External validation (NHANES) | Imbalanced: ECC 54% (ZOE 2.0) vs 27% (NHANES external) | Complete-case analysis; no imputation reported | No |
| Del Real et al. [21] | Orthodontic treatment records | Extraction decision + cephalometric landmarking | Two experienced orthodontists (>20–40 yrs); cephalometric tracing by trained operator | 214 orthodontic cases | ∼81% (inner loop of nested 10-fold CV) | ∼9% (inner validation folds) | ∼10% (outer-fold test per CV) | 0 | Nested cross-validation | Imbalanced: Extraction 38% vs non-extraction 62% | Incomplete records excluded; no imputation | No (tabular numeric data only; no augmentation) |
| Boitor et al. [22] | Hospital cardiology/diabetes records | Literature-based metabolic syndrome criteria | Single dentist performed periodontal exam; metabolic syndrome diagnosis by internists | 148 patients | 70% | Nil | 30% | 0 | Random split | NR: Metabolic syndrome class distribution not reported | NR (final analyzed sample n = 148) | No (categorical recoding only; no augmentation) |
| Boitor et al. [23] | Hospital records | NCEP ATP III criteria | Single dentist performed periodontal exam; medical diagnosis by hospital physicians | 296 patients | 70% | Nil | 30% | 0 | Random split | Imbalanced: Metabolic syndrome 168 vs 128 | NR (final analyzed sample n = 296) | No (categorical recoding only; no augmentation) |
| Fatapour et al. [24] | SEER registry | Algorithmic derivation of recurrence label | SEER tumor registrars + automated algorithm | 14,530 (5-year) + 7100 (10-year) OTSCC patients | 80% | 5-fold CV within training set | 20% | 0 | Split + CV | Imbalanced: Recurrence 657 vs 13,873 (5-year); 971 vs 6129 (10-year) | Cases with missing key variables excluded (complete-case analysis) | No (class balancing via under-sampling only; no augmentation) |
| Gomez-Rios et al. [35] | Clinical treatment charts | Rule-based extraction of sedation outcomes | Treating clinicians; examiner calibration not reported | 230 pediatric records | 80% | 0% (3-fold CV within training set) | 20% | 0 | Split + CV | Imbalanced: Second sedation 55 vs 175 | “Not recorded” category used for some variables; no imputation | No (automated preprocessing only: normalization + one-hot encoding) |
| Hamdan et al. [25] | Radiology ground truth database | Pixel/tooth-level restoration labeling with calibration and final OMFR review | Two radiology faculty + three trained students; final reviewer = oral & maxillofacial radiologist | 100 panoramic radiographs (1108 restorations on 960 teeth; 2601 teeth total) | 70% | 20% | 10% | 0 | Random split | Imbalanced: Pixel-level 745,545 restoration vs 66,081,455 background; tooth-level 960 vs 1641 | NR | Yes (horizontal & vertical flip on training images) |
| Kong et al. [26] | Medical & inventory records | Implant system labeling based on EMR and inventory logs | Prosthodontist curated dataset | 4800 periapical implant images (1200 per system) | 80% | 10% | 10% | 0 | Random split | Balanced: 4 implant classes, 1200 images per class | Poor-quality images excluded; no imputation | No (geometric standardization only: cropping, vertical alignment, rotation) |
| Kwack et al. [27] | Postoperative clinical follow-up | AAOMS criteria for MRONJ | Oral & maxillofacial surgeons | 340 patients | 71.5% (n = 243) | 0% (5-fold CV within training set) | 28.5% (n = 97) | 0 | Split + CV | Imbalanced: MRONJ 130 vs controls 210 | Incomplete clinical records excluded (complete-case analysis) | No (tabular preprocessing only; no augmentation) |
| Cheong et al. [28] | Radiology consensus | Majority voting | Three consultant head & neck radiologists | 1376 MRI slices | 80% (n = 1100) | 10% (n = 138) | 10% (n = 138) | 0 | Random split | Imbalanced: Sinonasal disease 777 “Yes” vs 599 “No” | Low-quality images excluded during preprocessing; no imputation | No (no augmentation; only selection of a single MRI slice) |
| Gonzalez et al. [29] | Expert bounding-box annotations | Bounding box annotation + arbitration | Two expert pediatric dentists + board-certified oral & maxillofacial radiologist | 66 bitewing radiographs (755 surfaces: 178 caries, 577 normal) | 70% | 20% | 10% | 0 | Random split | Imbalanced: Carious surfaces 178 vs sound 577 | Unsuitable radiographs excluded; no imputation | Yes (horizontal flip + random augmentations; ∼8 augmented images per original) |
| Jeong et al. [30] | Public Kaggle dataset | Pre-labeled dataset | NR (original annotators not specified) | 2999 intraoral X-rays | 80% | 10% | 10% | 0 | Random split | Balanced: Root disease classes ∼280–380 images each | NR | No (standard preprocessing only; no augmentation) |
| Nantakeeratipat et al. [31] | Image-based quantification | ImageJ surface area analysis for plaque levels | Dental researchers; clinical plaque measured on dental students | 600 undyed tooth images (AutoML subsets: 300 for 3-class; 299 for 2-class) | 80% (240 per subset) | 10% (30 per subset) | 10% (30 for 3-class; 29 for 2-class) | 0 | Random split | Balanced: 3-class plaque dataset (100/class); Imbalanced: 2-class dataset 196 vs 103 | NR (all participants contributed images) | Yes (image flipping and rotation for plaque dataset balancing) |
| Yoon et al. [32] | Hospital records | Direct clinical outcome (ICU admission) | Oral & maxillofacial surgeons | 210 patients | ∼80% per fold (CV training) | ∼20% per fold (CV validation) | Nil | 0 | Cross-validation | Imbalanced: 176 non-ICU vs 34 ICU (16.2% ICU) | Incomplete examinations excluded (complete-case analysis) | No (tabular data only; no augmentation) |
| Akadiri et al. [33] | Histopathology reports | Biopsy-confirmed diagnosis | Pathologists; images curated by oral & maxillofacial surgeons | 86 images across two projects (46 + 40; plus 5 test per project) | All labeled images used for training | Nil | 5 test images per project | 0 | Small hold-out split | Imbalanced: Jaw lesions Project 1 (26 benign vs 20 malignant); Balanced: Project 2 (20 vs 20) | Poor-quality or unconfirmed pathology images excluded; no imputation | No (no augmentation; only high-quality image selection) |
| Pais et al. [34] | Original clinical implant study | Study-defined peri-implant health vs disease grouping | Examiner qualifications not specified in current article | 100 metagenomes (60 biofilm, 40 saliva) from 40 patients | No fixed training split (5-fold CV used) | Nil | Nil | 0 | Cross-validation | NR: HI vs PI counts not fully reported (40 saliva, 60 biofilm samples) | Low-quality sequence reads excluded during QC; no imputation | No (no augmentation; only metagenomic preprocessing and normalization) |
Note: Training split refers to data used to fit the model. Validation split refers to internal validation during model development (e.g., cross-validation folds or a validation subset used for model selection/tuning). Test split refers to an independent hold-out subset used only for final performance evaluation after training/tuning. External validation refers to evaluation on an independent cohort (different institution, population, or time period).
EMR: Electronic Medical Records; MGI: Modified Gingival Index; CV: Cross-Validation; ICDAS: International Caries Detection and Assessment System; ECC: Early Childhood Caries; QRILC: Quantile Regression Imputation of Left-Censored data; dmfs: decayed, missing, and filled surfaces; NHANES: National Health and Nutrition Examination Survey; ZOE 2.0: ZOE Oral Health Study version 2.0; NCEP ATP III: National Cholesterol Education Program Adult Treatment Panel III; SEER: Surveillance, Epidemiology, and End Results Program; OTSCC: Oral Tongue Squamous Cell Carcinoma; OMFR: Oral and Maxillofacial Radiology; MRONJ: Medication-Related Osteonecrosis of the Jaw; AAOMS: American Association of Oral and Maxillofacial Surgeons; MRI: Magnetic Resonance Imaging; ICU: Intensive Care Unit; AutoML: Automated Machine Learning; ImageJ: Image Processing and Analysis in Java; QC: Quality Control; HI: Healthy Implant; PI: Peri-implantitis; NR: Not Reported
Sample sizes ranged from 66 bitewings or 86 CT/panoramic images in the smallest diagnostic studies to > 20,000 registry patients in the SEER-based cancer recurrence work, with many datasets in the low hundreds. Class imbalance was frequent, particularly for adverse outcomes (recurrence, ICU admission, MRONJ, second sedation) and in pixel-level segmentation. Handling of missing data was inconsistent: several studies used complete-case analysis, some introduced “not recorded” categories, and only the metabolomics work reported more advanced censoring/imputation strategies. Imaging studies typically excluded non-diagnostic images rather than imputing missingness. Data augmentation was used in a minority of imaging studies (mostly flips and rotations) and was absent from all tabular and omics work.
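For context, the flip-and-rotation augmentations reported by the minority of imaging studies correspond to standard geometric transforms. The torchvision pipeline below is a hypothetical equivalent; the included studies relied on platform built-ins rather than hand-written code.

```python
# Illustrative geometric augmentation pipeline (hypothetical; the included
# imaging studies used platform-specific augmentation features instead).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip
    transforms.RandomVerticalFlip(p=0.5),    # vertical flip
    transforms.RandomRotation(degrees=15),   # small random rotations
    transforms.ToTensor(),                   # PIL image -> tensor
])
# augmented = augment(pil_image)  # applied independently per training image
```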
3.4. AutoML platforms
A wide variety of AutoML systems was employed (Table 3). Local, general-purpose frameworks (H2O AutoML, TPOT, Auto-WEKA, Auto-sklearn, and the in-house ORIENTATE platform) were used mainly for tabular, metabolomic, and metagenomic data [18], [19], [21], [22], [23], [24], [27], [32], [34], [35]. Cloud-based and commercial platforms, including Google Cloud AutoML Tables and AutoML Vision, Vertex AI, LandingLens, Google Teachable Machine, and Neuro-T, dominated imaging tasks [17], [25], [26], [28], [29], [30], [31], [33], with AutoML Tables also applied to tabular ECC screening [20]. One study combined local evolutionary TPOT with a proprietary cloud “digital phenomics” system (O2Pmgen) to analyze metagenomic data [34].
Table 3.
AutoML platform characteristics and computational considerations of included studies.
| Author | AutoML Platform | Vendor | Platform Type | Automated Feature Engineering (Yes/No) | Automated Model Selection (Yes/No) | Automated Hyperparameter Optimization (Yes/No) | Automated Pipeline End to End (Yes/No) | Explainability (Yes/No) | Code or Model Public (Yes/No) | Hardware or Computation | Computational or Cost Barrier |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lee et al. [17] | Neuro-T AutoML (Neuro-T 2.0.1) | Neurocle Inc. | Local | Yes | Yes | Yes | No | No | No | NR | Computing power constraints and need for larger high-quality datasets; commercial AutoML software |
| Tobias et al. [18] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | NR (method described as lightweight; no hardware reported) | NR |
| Heimisdottir et al. [19] | TPOT AutoML | TPOT Developers | Local | Yes | Yes | Yes | No | Yes | No | TPOT AutoML: 50 random seeds (up to 24 h each) with 10-fold CV; hardware NR | NR |
| Karhade et al. [20] | Google Cloud AutoML Tables | Google Cloud | Cloud | Yes | Yes | Yes | No | No | No | Google Cloud AutoML Tables; compute resources NR | NR |
| Del Real et al. [21] | Auto-WEKA | Auto-WEKA Project Team | Local | Yes | Yes | Yes | No | No | No | Local machine; Auto-WEKA memory cap 2 GB, time budgets 5–60 min and ≥ 18 h; hardware NR | NR |
| Boitor et al. [22] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML; CPU/GPU and memory configuration NR | NR |
| Boitor et al. [23] | H2O AutoML; Auto-sklearn | H2O.ai; Auto-sklearn Team | Local | No | Yes | Yes | No | Yes | No | H2O AutoML + Auto-sklearn with 15-min time budget; hardware NR | NR |
| Fatapour et al. [24] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML with 5-fold CV, 600-s runtime per run (×5); hardware NR | NR |
| Gomez-Rios et al. [35] | ORIENTATE AutoML | In-house Development Team | Local (web-based application) | Yes | Yes | No | No | Yes | Yes | Server with Intel i9–9280X CPU + 2 × NVIDIA RTX 2080Ti GPUs + 32 GB RAM; compute-limited feature selection | Full feature-selection search over many predictors is computationally expensive and constrained by available processing power; case study uses a small dataset |
| Hamdan et al. [25] | LandingLens | Landing AI | Cloud | Yes | Yes | Yes | Yes | No | No | Cloud platform; hardware abstracted from user (CPU/GPU NR) | NR |
| Kong et al. [26] | Google Cloud AutoML Vision | Google LLC | Cloud | Yes | Yes | Yes | Yes | No | No | Google Cloud n1-standard-8 VM + NVIDIA Tesla V100 GPU; 32 node-hours (∼US$100.80) | Cloud training and deployment incur ongoing node-hour costs; ethical and security concerns about uploading medical images to third-party cloud servers |
| Kwack et al. [27] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML training multiple model families; CPU/GPU configuration NR | NR |
| Cheong et al. [28] | Vertex AI AutoML | Google Cloud | Cloud | Yes | Yes | Yes | Yes | No | No | Local PC (Intel i7-8750H + 16 GB RAM) for labeling; Vertex AI training on Google Cloud (∼158.5 min/model) | Cloud AutoML requires paid compute time and internet access; hand-coded ML pipelines would demand substantial specialist time and cost |
| Gonzalez et al. [29] | LandingLens | Landing AI | Cloud | Yes | Yes | Yes | Yes | No | No | LandingLens cloud training (40 epochs); CPU/GPU NR | Data collection and expert annotation for pediatric bitewing datasets are time- and resource-intensive, limiting sample size and generalizability |
| Jeong et al. [30] | Vertex AI AutoML | Google Cloud | Cloud | Yes | Yes | Yes | Yes | No | No | Vertex AI training in Iowa region; 1 h 38 min (8 node-hours); ∼202,315 KRW | Cloud usage incurs node-hour costs; high volumes of training data could increase expense; dependence on Google Cloud platform |
| Nantakeeratipat et al. [31] | Vertex AI AutoML | Google Cloud | Cloud | Yes | Yes | Yes | Yes | No | No | Vertex AI (Gemini 1.5 Pro, M125 release); training region specified; hardware NR | Manual cropping of teeth is time-consuming; model generalization limited by small dataset; cloud compute costs not quantified |
| Yoon et al. [32] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML with 5-fold CV; CPU/GPU configuration NR | NR |
| Akadiri et al. [33] | Google Teachable Machine | Google | Cloud | Yes | No | No | Yes | No | No | In-browser Teachable Machine (MobileNet transfer learning); ∼10–12 min training; hardware NR | Low computational and financial cost; main barriers are small data volume, limited AI literacy and infrastructure in resource-limited settings |
| Pais et al. [34] | TPOT AutoML; O2Pmgen | TPOT Developers; Bioenhancer Systems Ltd. | Hybrid (Local AutoML + Cloud SaaS) | Yes | Yes | Yes | Yes | Yes | No | TPOT (100 generations, pop. size 50, 5-fold CV) on 16-core 2.6 GHz VPS; O2Pmgen run on cloud SaaS digital phenomics platform | High cost and complexity of metagenomic sequencing and bioinformatics; need for specialized computing resources, evolutionary AutoML runs and commercial SaaS licenses |
AutoML: Automated Machine Learning; AI: Artificial Intelligence; ML: Machine Learning; CV: Cross-Validation; CPU: Central Processing Unit; GPU: Graphics Processing Unit; RAM: Random Access Memory; VM: Virtual Machine; NR: Not Reported; GB: Gigabyte; SaaS: Software as a Service; VPS: Virtual Private Server; TPOT: Tree-based Pipeline Optimization Tool; ORIENTATE: Optimization and Resource-efficient INtelligent Tool for Automated feaTure sElection; RTX: Real-Time Ray Tracing; KRW: South Korean Won; US$: United States Dollar
Automated model selection and hyperparameter optimization were near-universal features across platforms. Automated feature engineering, however, was less consistent and often limited in H2O-based tabular studies, where feature derivation remained largely manual. True end-to-end automation (from data upload to deployable model/API) was most evident in commercial cloud systems (LandingLens, AutoML Vision, Vertex AI, Teachable Machine) and in the hybrid TPOT/O2Pmgen workflow, whereas local tools required separate preprocessing and integration steps.
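The gap between local frameworks and end-to-end systems is visible in TPOT, one of the local tools used by the included studies: the evolutionary search automates preprocessing, model selection, and tuning, but data preparation and integration of the exported pipeline remain manual steps. A minimal sketch on synthetic data, assuming the classic TPOT API:

```python
# TPOT sketch (synthetic data): evolutionary search over preprocessing,
# model, and hyperparameter choices; the winning pipeline is exported as
# plain Python, which keeps the result auditable but leaves data
# preparation and deployment to the user.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=0, verbosity=2)
tpot.fit(X_train, y_train)              # evolutionary pipeline search
print(tpot.score(X_test, y_test))       # hold-out performance
tpot.export("best_pipeline.py")         # exported, human-readable pipeline
```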
Use of explainability methods was uneven. Clinical and omics studies using H2O, TPOT, Auto-sklearn or ORIENTATE frequently reported feature importance or SHapley Additive exPlanations (SHAP) analyses to identify influential predictors [19], [22], [23], [24], [27], [32], [34], [35]. In contrast, most imaging papers using commercial cloud platforms did not report saliency mapping or case-level explanations and remained largely “black box” to end-users. None of the studies released full code or pre-trained models publicly.
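As an illustration of the feature-importance analyses these studies reported, the sketch below computes SHAP values for a synthetic gradient-boosting model; this is not any included study’s pipeline, and H2O offers comparable built-in SHAP contributions for its tree models.

```python
# SHAP sketch (synthetic data and model): global feature importance of the
# kind several tabular AutoML studies reported for their best models.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)        # ranks predictors by mean |SHAP| value
```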
Computational descriptions ranged from short browser-based training times in Teachable Machine [33] to multi-hour or multi-node cloud training on Google infrastructure [26], [28], [30], and long TPOT runs with repeated cross-validation [19], [34]. Reported barriers included limited local processing power and memory for exhaustive feature search, ongoing node-hour costs and data-governance concerns for cloud platforms, annotation burden for imaging datasets, and the high cost and complexity of sequencing and bioinformatics pipelines in omics work.
3.5. Model performance and evaluation characteristics
Across tasks, AutoML models generally achieved strong discrimination, particularly for imaging applications (Table 4). Implant-classification, root-disease and restoration-segmentation models reached accuracies and AUCs around 0.95–0.98 and, in some small studies, perfect metrics [17], [25], [26], [29], [30], [33]. Smartphone-based gingivitis and plaque models achieved AUC/area under the precision–recall curve (AUPRC) values above 0.90 with F1-scores > 0.80 [18], [31]. For tabular data, ECC screening and status models performed moderately (AUC ∼0.74–0.80) but were parsimonious and relied on readily available variables [20]. Orthodontic extraction, metabolic-syndrome, and cancer-recurrence studies frequently reported very high performance on internal validation procedures such as cross-validation or on independent hold-out test sets, depending on the study design, with F1-scores often ≥ 0.90 [21], [22], [23], [24]. MRONJ and ICU-admission prediction achieved AUROCs of 0.75–0.84, suggesting clinically useful risk stratification using questionnaire and routine clinical data [27], [32]. Metagenomic peri-implantitis models produced AUCs up to 0.98 for small combinations of taxa, whereas metabolomic ECC models achieved more modest discrimination (AUC ∼0.75) but yielded biologically plausible metabolite signatures [19], [34].
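When comparing these figures, the class imbalance documented in Table 2 matters: AUROC can remain high while AUPRC, which is anchored to the rare positive class, is considerably lower. A small synthetic illustration:

```python
# AUROC vs. AUPRC under class imbalance (synthetic scores): with a ~10%
# positive class, AUROC here is high while AUPRC is substantially lower.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.r_[np.ones(30), np.zeros(270)]                     # 10% positives
y_score = np.r_[rng.uniform(0.4, 1.0, 30), rng.uniform(0.0, 0.7, 270)]

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))
```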
Table 4.
Performance outcomes and clinical implications of included AutoML studies.
| Author | Evaluation metrics | Main findings | Clinical Impact | Limitations | Future Work |
|---|---|---|---|---|---|
| Lee et al. [17] | AUC: 0.954 (95% CI 0.933–0.970; SE 0.011), Youden index: 0.808, sensitivity: 0.955, specificity: 0.853; panoramic AUC: 0.929; periapical AUC: 0.961; per-class AUC range: 0.903–0.981 | Automated DCNN with Neuro-T achieved high accuracy (AUC 0.954) for six implant-system classifications and outperformed most dental professionals. | May serve as a clinical decision-support tool to identify implant systems from routine radiographs and assist management of implant complications. | Limited to six implant types from three hospitals; small dataset; no CBCT/3D inputs; computing-resource constraints; limited interpretability; no external validation. | Larger and more diverse 2D/3D datasets, improved optimization, and clinical feasibility and efficacy studies are required. |
| Tobias et al. [18] | AUC (training 5-fold CV, per tooth position): 0.72–0.99, AUC (test-set, stage 1 healthy vs inflamed): 0.69–1.0, AUC (test-set, stage 2 mild vs severe): 0.84–1.0, F1-scores: 0.82–1.0 across tooth positions | H2O AutoML with tailored image features and a two-stage cascade accurately classified gum-health status (healthy vs inflamed; mild vs severe) with high AUC despite a small dataset. | Supports remote, noninvasive gingivitis monitoring via dental selfies and mHealth apps, especially when in-person visits are limited. | Small single-center pilot dataset; limited to anterior teeth; single-dentist gingivitis grading; manual bounding boxes required; no external validation; restricted generalizability. | Collect larger, diverse datasets with multiple graders; automate tooth localization; explore deep learning; validate links between frontal and overall gum health; evaluate real-time use in the iGAM app. |
| Heimisdottir et al. [19] | Balanced accuracy: 0.66 (10-fold CV), AUC: 0.75 for ECC classification, feature-importance (top metabolites): catechin 1.1% decrease in accuracy when permuted, epicatechin 0.87%, fucose 0.48%, imidazole propionate 0.50%, 9,10-DiHOME 0.47%, N-acetylneuraminate 0.43% | TPOT AutoML using 503 metabolites achieved modest ECC discrimination (AUC ∼0.75) and highlighted biologically relevant metabolites. | Provides biochemical insight into ECC and suggests metabolite biomarkers that may support future ECC risk assessment or disease-activity monitoring. | Cross-sectional design; prevalent ECC only; pooled plaque samples; metabolomics alone insufficient for strong prediction; potential dietary/environmental confounding; no external or longitudinal validation. | Conduct longitudinal and site-specific studies integrating metabolomics with microbial, host, behavioral and environmental data; develop refined ECC taxonomies and multi-omics AutoML models with external validation. |
| Karhade et al. [20] | AUC: 0.74 (best model), sensitivity: 0.67, specificity: 0.66, PPV: 0.64, accuracy: 0.67; external replication-AUC: 0.80, sensitivity: 0.73, PPV: 0.49 | A parsimonious AutoML model using only child age and parent-reported oral health provided modest but useful ECC screening and status imputation. | May augment ECC screening where clinical exams are unavailable and enable ECC burden estimation using minimal proxy information. | Cross-sectional design; ECC prevalence rather than risk prediction; high-risk cohort; reliance on parent-reported oral health; modest performance; no prospective validation | Refine models with biological/contextual predictors; test in diverse prospective cohorts; evaluate feasibility and public health impact in real-world settings. |
| Del Real et al. [21] | Accuracy: up to 93.93%, sensitivity: 0.939, false-positive rate: 0.08, precision: 0.94, F-score: 0.939, MCC: 0.87, ROC-AUC: 0.915, PR-AUC: 0.913 (best model using combined clinical + radiographic data) | Auto-WEKA AutoML generated accurate extraction-prediction models (accuracy up to 93.9%, ROC-AUC 0.915) using cephalometric and clinical features. | May support orthodontists in complex extraction vs non-extraction decisions by providing an additional data-driven recommendation. | Single-center retrospective dataset with small sample; no external validation; risk of overfitting; measurement reliability not assessed; AutoML models difficult to interpret. | Include larger, multi-orthodontist datasets; assess generalizability and overfitting; explore model interpretability; evaluate AutoML support tools in clinical practice. |
| Boitor et al. [22] | Precision: 0.897, recall: 1.000, accuracy: 0.932, F1-score: 0.945 (H2O AutoML – Distributed Random Forest model) | H2O AutoML Distributed Random Forest achieved high metabolic-syndrome prediction accuracy and identified periodontal and systemic factors as major predictors. | Supports the idea that controlling periodontal disease and lifestyle factors may help reduce systemic inflammation and cardiometabolic risk in metabolic syndrome. | Single-center cross-sectional design; limited sample size; incomplete class-distribution reporting; no external/prospective validation; potential overfitting from single train–test split. | Use larger multi-center, longitudinal cohorts; incorporate lifestyle/biomarker data; perform external validation; clarify causal links and clinical usefulness of AutoML models. |
| Boitor et al. [23] | Precision: 1.00 (XGBoost), recall: 1.00, accuracy: 1.00, specificity: 1.00, balanced accuracy: 1.00, F1-score: 1.00; DRF model—precision: 1.00, recall: 0.885, accuracy: 0.932; Auto-sklearn RF—precision: 0.929, recall: 1.00, accuracy: 0.955, balanced accuracy: 0.944 | H2O and Auto-sklearn produced high-performing metabolic-syndrome models; H2O XGBoost reached 100% accuracy, with SHAP identifying key periodontal and systemic predictors. | May aid identification of individuals at high risk for metabolic syndrome and support multidisciplinary prevention strategies based on periodontal and systemic factors. | Single-center retrospective dataset; small sample; no external validation; simple 70/30 split risks model drift; causality between periodontal disease and metabolic syndrome cannot be inferred. | Larger multi-center cohorts with external/prospective validation; update models periodically to address drift; expand AutoML + SHAP pipelines to guide metabolic-syndrome management. |
| Fatapour et al. [24] | AutoML (H2O): Gradient Boosting Machine (GBM) selected as best model; Accuracy: 0.818 (5-year), 0.800 (10-year); Precision: 0.977 / 0.940; Recall (Sensitivity): 0.830 / 0.828; AUC (ROC): 0.75 / 0.74 | H2O AutoML–based SEER framework predicted 5- and 10-year oral tongue squamous cell carcinoma recurrence with ∼80–82% accuracy; SHAP analysis identified number of prior tumors, patient age, tumor site, histology, and treatment variables as the most influential predictors. | May support population-level risk stratification and follow-up planning for OTSCC by identifying patients at higher risk of locoregional recurrence using routinely available clinical variables; positioned as a screening and decision-support tool rather than a standalone clinical predictor. | Single population-based retrospective study using registry data; potential information and coding bias; absence of detailed histopathologic and treatment-timing variables (e.g., depth of invasion, perineural invasion, radiation dose); recurrence rate lower than reported literature due to strict inclusion criteria; no independent external cohort or prospective validation. | Apply the AutoML framework to other cancer types and updated SEER releases; incorporate additional clinical and histopathologic variables; evaluate alternative tree-based models and extended training; perform independent external and prospective validation to assess real-world generalizability. |
| Gomez-Rios et al. [35] | F1-score (class of interest – second sedation): 0.83, balanced accuracy: 0.81, precision: 0.92, recall: 0.75; AUC (ROC): 0.92; class “no sedation”—precision: 0.87, recall: 0.96, F1-score: 0.92 | ORIENTATE AutoML enabled non-technical researchers to build interpretable models; an 8-predictor decision tree achieved F1 = 0.83 and AUC 0.92 for sedation prediction. | May support preventive planning by identifying children—especially SHCN—at higher risk of second sedation and highlighting modifiable factors. | Single-center retrospective cohort of 230 cases; highly imbalanced outcome; no external validation; limited generalizability; SHAP importance may be misinterpreted causally. | Apply ORIENTATE to larger datasets; improve and benchmark feature-selection algorithms; integrate causal inference; explore clinical deployment and broad validation. |
| Hamdan et al. [25] | Sensitivity (pixel): 86.64%, specificity (pixel): 99.78%, accuracy (pixel): 99.63%, precision (pixel): 82.4%, F1-score (pixel): 0.844; sensitivity (tooth): 98.34%, specificity (tooth): 98.13%, accuracy (tooth): 98.21%, precision (tooth): 98.85%, F1-score (tooth): 0.98; AUC: 0.978 (95% CI 0.959–0.998) | LandingLens no-code vision model accurately segmented dental restorations (AUC 0.978) with pixel- and tooth-level performance comparable to coded deep-learning systems. | Suggests that no-code AI platforms can democratize dental imaging model development and potentially improve diagnostic workflow efficiency. | Only 100 panoramic images from a single institution/device; limited age and inclusion criteria; no cross-validation or external validation; only one no-code platform tested; difficulty segmenting certain restorations. | Validate multiple no-code platforms on larger, diverse datasets; explore additional imaging tasks; incorporate cross-validation and ensembling; evaluate regulatory pathways and clinical deployment. |
| Kong et al. [26] | Accuracy: 0.981, precision: 0.963, recall: 0.961, specificity: 0.985, F1-score: 0.962; average precision (AUPRC): 0.983; per-class accuracy ranged from 0.965 to 1.000 | Google AutoML Vision CNN achieved high accuracy for classifying four implant systems from periapical radiographs, enabling accurate no-code implant identification. | May support future tools for identifying unknown implant systems in clinical practice, aiding treatment planning and complication management. | Single-center high-quality periapical images; only four implant systems; no external validation; real-world variability not represented; AutoML algorithmic transparency limited; potential cloud-storage data concerns. | Collect larger, multi-center datasets; include more implant systems; test performance on heterogeneous real-world images; address privacy and governance for cloud deployment. |
| Kwack et al. [27] | AUC (training): 0.8283, sensitivity (training): 83.7%, specificity (training): 71.3%; AUC (test): 0.7526, sensitivity (test): 88.6%, specificity (test): 52.8% | H2O AutoML developed MRONJ prediction models (train AUC 0.8283; test AUC 0.7526) identifying medication duration, age, and surgical factors as key predictors. | May help clinicians estimate MRONJ risk preoperatively and counsel osteoporotic patients using easily obtainable questionnaire data. | Single-center retrospective dataset; limited to female osteoporosis patients; reliance solely on questionnaire data; no imaging/lab variables; no external validation; non-uniform variables may lead to over/underfitting. | Conduct multi-center, larger studies including radiographic/laboratory/genetic data; perform prospective validation; develop advanced models for MRONJ decision-support. |
| Cheong et al. [28] | Sensitivity: 91.3%, specificity (precision): 92.8%, accuracy: 92%; TP: 63, TN: 64, FP: 5, FN: 6 (Vertex AI image classification model) | Vertex AI AutoML MRI classifier achieved high sinonasal disease detection accuracy (∼92%), showing clinicians can develop MRI models without coding. | May support MRI pre-screening systems for sinonasal disease to streamline radiology workflows and reduce reporting backlogs. | Only one coronal MRI slice analyzed; may miss disease in other planes; dataset differs from real-world scans; no symptom correlation; no external validation or multi-slice/multi-label testing. | Expand to multi-slice and multi-label MRI tasks; implement saliency mapping; test generalizability on real-world clinical MRI; evaluate CDSS deployment. |
| Gonzalez et al. [29] | Sensitivity: 0.888, specificity: 0.988, precision: 0.958, accuracy: 0.964, F1-score: 0.921, ROC-AUC: 0.965 (no-code AI, LandingLens™) | LandingLens no-code caries model accurately identified pediatric proximal caries (ROC-AUC 0.965), demonstrating strong clinical potential. | Supports AI-assisted early detection of proximal caries in pediatric patients, especially where expert radiology support is limited. | Small single-center dataset (66 bitewings); no external validation; exclusion of low-quality images reduces real-world applicability; radiographic gold standard instead of histology; possible missed early lesions. | Build larger multi-institutional pediatric datasets; include varied sensors and image quality; perform external validation; conduct prospective radiographic–clinical studies. |
| Jeong et al. [30] | AUC: 0.967, precision: 95.6%, recall (reproduction rate): 95.2%, log loss: 0.085 (Vertex AI AutoML, multi-label dental root disease classification) | Google Vertex AutoML root-disease classifier achieved high performance (AUC 0.967; precision 95.6%; recall 95.2%) supporting clinical diagnostic use. | Shows that non-programmers can use no-code AutoML to build clinically meaningful dental X-ray diagnostic models, improving accessibility of AI in dentistry. | Limited to 2999 images from a public non-Korean dataset; uncertain ground-truth process; no external validation or clinical deployment; restricted to root-disease classification. | Use larger, locally collected datasets; expand disease categories; validate externally and prospectively; assess clinical workflow integration and cost-effectiveness. |
| Nantakeeratipat et al. [31] | AUPRC (3-class model): 0.907, precision: 86.2%, recall: 83.3%, F1-score: 84.7%; heavy-label precision 100% and recall 90%; AUPRC (2-class model): 0.964, precision: 93.1%, recall: 93.1%, F1-score: 93.1% | Vertex AI AutoML on smartphone photos accurately classified plaque levels; binary plaque acceptability achieved AUPRC 0.964 and F1 93.1%, enabling dye-free detection. | Could support chairside and at-home plaque monitoring via smartphones, reducing the need for disclosing dyes and enabling scalable preventive care. | Single-center sample of 100 dental students; restricted to upper anterior teeth; small, standardized dataset; manual cropping required; no external validation or testing in diverse populations. | Develop larger, more diverse datasets; build multi-tooth/image-level models; validate across different clinics, smartphones and lighting; integrate into user-friendly apps. |
| Yoon et al. [32] | AUROC (model 1 training): 0.9191, AUROC (model 1 cross-validation): 0.8289; AUROC (model 2 training): 0.9263, AUROC (model 2 cross-validation): 0.8415 (H2O AutoML GLM best performer) | H2O-AutoML GLM using clinical ± MDCT fascial-space data achieved AUROC ≥ 0.83 for predicting ICU admission in odontogenic infection. | Suggests ML models based on simple clinical and laboratory data could assist early ICU triage decisions for patients with oral infections. | Single-center retrospective dataset of 210 patients; no external validation; chart-based variables only; ML model complexity prevents simple clinical thresholds; limited generalizability. | Develop systems to gather/share large-scale blood test and exam data; perform external and prospective validation of ICU prediction models across institutions. |
| Akadiri et al. [33] | Accuracy: 1.00, precision: 1.00, recall (sensitivity): 1.00, F1-score: 1.00, ROC-AUC: 1.00 for both models (Project 1: malignant vs benign; Project 2: fibrous dysplasia vs ossifying fibroma) | Google Teachable Machine no-code CNN distinguished malignant vs benign lesions and FD vs OF with perfect metrics (AUC 1.0) despite very small datasets. | May improve diagnostic workflows in OMFS in low-resource settings by augmenting radiologist and surgeon interpretations. | Very small training and test samples; no external validation; high similarity between train and test sets; overfitting likely; black-box CNN with limited interpretability. | Collect larger multi-center datasets with standardized imaging; use rigorous validation (cross-validation + external); incorporate explainable AI; expand AI education for OMFS clinicians. |
| Pais et al. [34] | AUC: 0.70–0.98 (ML-driven biomarker scoring models), sensitivity: 80–100%, specificity: 80–95% for top models; saliva-based models achieved sensitivity: 95–100% and specificity: 95%; single-microorganism models showed sensitivity: up to 95% but specificity: ≤ 42%, AUC: ≤ 0.56 | AI-driven AutoML using metagenomic microbiome data yielded high-performing peri-implantitis classifiers; saliva-based 2–4 microbe combinations achieved 80–100% AUC, sensitivity and specificity. | Indicates that saliva- and biofilm-based microbial biomarkers may form the basis of future non-invasive, cost-effective diagnostic tools. | Small sample (40 patients, 100 metagenomes) from a single population; high-dimensional data increases overfitting risk; no demographic/clinical covariates; dependence on reference databases; no external validation. | Build larger, diverse cohorts; externally validate microbiome models; integrate demographic and clinical factors; develop affordable qPCR-style biomarker assays; create interpretable clinical decision tools. |
AI: Artificial Intelligence; AutoML: Automated Machine Learning; AUC: Area Under the Curve; ROC-AUC: Receiver Operating Characteristic Area Under the Curve; AUROC: Area Under the Receiver Operating Characteristic Curve; AUPRC: Area Under the Precision–Recall Curve; CBCT: Cone-Beam Computed Tomography; CDSS: Clinical Decision Support System; CI: Confidence Interval; CNN: Convolutional Neural Network; CV: Cross-Validation; DCNN: Deep Convolutional Neural Network; DiHOME: Dihydroxy-octadecenoic Acid; DRF: Distributed Random Forest; ECC: Early Childhood Caries; F1: F1 Score; FD: Fibrous Dysplasia; FN: False Negative; FP: False Positive; GBM: Gradient Boosting Machine; GLM: Generalized Linear Model; ICU: Intensive Care Unit; iGAM: intelligent Gingival Assessment Model; IOS: Intraoral Scanner; Log Loss: Logarithmic Loss; MCC: Matthews Correlation Coefficient; MDCT: Multidetector Computed Tomography; mHealth: Mobile Health; ML: Machine Learning; MRI: Magnetic Resonance Imaging; MRONJ: Medication-Related Osteonecrosis of the Jaw; OF: Ossifying Fibroma; OMFS: Oral and Maxillofacial Surgery; ORIENTATE: Outcome-oRiented Interpretable ENsemble Auto-ML Tool; OTSCC: Oral Tongue Squamous Cell Carcinoma; PPV: Positive Predictive Value; PR: Precision–Recall; PR-AUC: Precision–Recall Area Under the Curve; qPCR: Quantitative Polymerase Chain Reaction; RF: Random Forest; ROC: Receiver Operating Characteristic; SE: Standard Error; SEER: Surveillance, Epidemiology, and End Results Program; SHAP: SHapley Additive exPlanations; SHCN: Special Health Care Needs; TN: True Negative; TP: True Positive; TPOT: Tree-based Pipeline Optimization Tool; XGBoost: Extreme Gradient Boosting; Youden Index: Youden's J Statistic (Sensitivity + Specificity − 1).
Despite frequently strong discrimination metrics, key methodological components necessary for clinical decision support were seldom evaluated. Across all nineteen included studies, none reported formal calibration assessment, such as calibration plots, Brier scores, or calibration slope and intercept. Similarly, quantitative evaluation of subgroup performance was not undertaken; results were almost universally presented as overall accuracy or AUC values without stratification by demographic characteristics, acquisition settings, or disease severity. Although one investigation illustrated ambiguous cases, it did not provide statistical comparison between groups [31]. Reporting of failure behavior was limited. Sixteen studies relied exclusively on aggregate metrics or confusion matrices without describing the clinical contexts of error [17], [18], [19], [20], [21], [22], [23], [24], [27], [28], [29], [30], [32], [33], [34], [35], whereas only three offered narrative or visual insight into misclassifications [25], [26], [31]. Transparency of the AutoML process also varied. Explicit version identifiers or sufficiently detailed pipeline information enabling reproducibility were available in only three studies [19], [31], [34], partial configuration details without version traceability were present in one report [35], and the remaining investigations did not provide enough information to reconstruct the modeling workflow. Finally, none of the studies clearly documented safeguards ensuring separation of training and testing data at the patient level. Together, these observations indicate that, although discrimination performance was frequently high, reporting of calibration, subgroup robustness, failure characteristics, and reproducibility requirements was generally insufficient to assess clinical deployment considerations.
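To make these omitted quantities concrete, the following minimal sketch (in Python, using synthetic stand-in data rather than data from any included study; scikit-learn ≥ 1.2 is assumed for `penalty=None`) shows how a Brier score and a calibration slope and intercept could be reported from a hold-out set:

```python
# Illustrative only: synthetic outcomes and predicted risks stand in for
# study data; no included study reported these quantities.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.05, 0.95, 500)   # model-predicted risks (stand-in)
y_true = rng.binomial(1, y_prob)        # observed binary outcomes (stand-in)

# Brier score: mean squared difference between predicted risk and outcome.
brier = brier_score_loss(y_true, y_prob)

# Calibration slope and intercept: refit the outcome on the logit of the
# predicted risk; slope ~1 and intercept ~0 indicate good calibration.
logit = np.log(y_prob / (1 - y_prob)).reshape(-1, 1)
recal = LogisticRegression(penalty=None).fit(logit, y_true)  # sklearn >= 1.2

print(f"Brier={brier:.3f}, slope={recal.coef_[0][0]:.2f}, "
      f"intercept={recal.intercept_[0]:.2f}")
```

A slope near 1 and an intercept near 0 indicate well-calibrated risks; systematic deviations signal over- or underconfidence that discrimination metrics alone cannot reveal.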
Across the included studies, performance was predominantly estimated using internal validation (18 of 19 studies), including single random train–test splits, k-fold cross-validation, or nested cross-validation within the same dataset. As summarized in Table 2, external validation was uncommon, with most studies reporting an external validation size of 0; the principal exception was Karhade et al. [20], which evaluated the model in an independent cohort.
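Because patient-level separation safeguards were likewise undocumented, the brief sketch below (all identifiers, labels, and array shapes are hypothetical) illustrates one standard safeguard, grouped cross-validation, which keeps every image from a given patient on one side of the training–test boundary:

```python
# Minimal sketch, assuming several images per patient; identifiers and
# shapes here are hypothetical, not drawn from any included study.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(12, 4)                      # e.g., features for 12 radiographs
y = np.random.randint(0, 2, 12)                # image-level labels
patients = np.repeat([101, 102, 103, 104], 3)  # 3 images per patient

# GroupKFold assigns whole patients to folds, so patient-specific patterns
# can never leak from training data into the test split.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, patients):
    assert set(patients[train_idx]).isdisjoint(patients[test_idx])
```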
Studies consistently framed AutoML outputs as decision-support tools rather than standalone diagnostic systems. Proposed uses included assisting identification of unknown implant systems, guiding complex extraction decisions, flagging children at risk of second deep sedation or severe ECC, supporting cardiometabolic risk assessment from periodontal data, triaging oncologic follow-up or ICU admission, and enabling smartphone-based plaque and gingivitis monitoring at home. Omics studies emphasized AutoML’s role in biomarker discovery and prioritization rather than immediate clinical deployment.
At the same time, every study acknowledged significant limitations. Many datasets were small, single-center, and subject to spectrum and selection bias; external validation was rare, and prospective or interventional evaluations were absent. Several of the highest-performing models (including those with perfect metrics) were trained and tested on very limited image sets, raising a high risk of overfitting. Cross-sectional designs limited inference to disease status rather than future risk; reliance on self-reported or registry variables introduced potential misclassification; calibration and fairness were largely unreported; and explainability for complex models remained limited. None of the AutoML systems described had obtained regulatory approval, and real-world deployment was not evaluated.
Overall, the synthesis of Tables 1–4 indicates that AutoML can achieve competitive, and often excellent, predictive performance across diverse dental applications using heterogeneous data sources, while lowering technical barriers for non-programmer clinicians. However, substantial heterogeneity in study design, outcome definitions, AutoML platforms, and validation strategies, together with small samples and lack of external or prospective evaluation, precludes formal meta-analysis and underscores the need for larger, multicenter, rigorously validated and implementation-focused AutoML studies in dentistry.
3.6. Risk of Bias
Using the PROBAST tool (Table 5), a high risk of bias was most frequently observed in the analysis domain, with 15 of 19 studies (78.9%) rated as high risk, mainly due to small sample sizes, lack of external validation, reliance on single random data splits, and overfitting concerns. The participants domain also showed substantial risk of bias, with 13 studies (68.4%) rated as high risk, largely attributable to single-center recruitment, spectrum bias, case–control enrichment, and restricted study populations.
Table 5.
Risk of bias of included studies.
| Study | Participants | Predictors | Outcome | Analysis | Overall |
|---|---|---|---|---|---|
| Lee et al. [17] | low | low | low | high | high |
| Tobias et al. [18] | high | high | high | high | high |
| Heimisdottir et al. [19] | high | high | low | high | high |
| Karhade et al. [20] | low | low | low | high | high |
| Del Real et al. [21] | high | low | low | low | high |
| Boitor et al. [22] | high | high | high | high | high |
| Boitor et al. [23] | high | high | low | high | high |
| Fatapour et al. [24] | low | low | low | low | low |
| Gomez-Rios et al. [35] | low | low | low | high | high |
| Hamdan et al. [25] | high | low | low | high | high |
| Kong et al. [26] | low | low | low | low | low |
| Kwack et al. [27] | high | low | low | high | high |
| Cheong et al. [28] | high | low | low | high | high |
| Gonzalez et al. [29] | high | low | low | high | high |
| Jeong et al. [30] | low | low | low | high | high |
| Nantakeeratipat et al. [31] | high | low | low | high | high |
| Yoon et al. [32] | high | low | low | low | high |
| Akadiri et al. [33] | high | low | low | high | high |
| Pais et al. [34] | high | low | low | high | high |
| Overall (%) | low: 31.6%, high: 68.4% | low: 78.9%, high: 21.1% | low: 89.5%, high: 10.5% | low: 21.1%, high: 78.9% | low: 10.5%, high: 89.5% |
In contrast, the predictors domain demonstrated predominantly low risk of bias, with 4 studies (21.1%) rated as high risk, mainly due to subjective feature engineering or high-dimensional handcrafted predictors. The outcome domain showed the lowest overall risk of bias, with only 2 studies (10.5%) rated as high risk, as most studies employed objective and well-defined reference standards.
Overall, while predictor measurement and outcome assessment were methodologically robust in most studies, the dominant limitation across the literature remains inadequate statistical analysis and validation strategies, substantially limiting the generalizability of many reported models.
4. Discussion
4.1. Translational Maturity of AutoML in Dentistry
This systematic review synthesizes current evidence on the application of AutoML in oral healthcare and clarifies its present role within dental AI research. Overall, AutoML was primarily used as a methodological facilitator for model development, exploratory analysis, and feasibility testing, rather than as a clinically integrated or deployment-ready technology. This mirrors trends observed across broader healthcare domains, where AutoML is largely employed to reduce technical barriers and accelerate experimentation rather than support validated clinical decision-making systems [36], [37]. The predominance of retrospective designs and moderate sample sizes therefore reflects the early translational stage of AutoML adoption in oral health.
4.2. Performance, Validation, and Generalizability
Across included studies, AutoML-based models demonstrated consistently strong discrimination on internal validation (e.g., cross-validation/validation splits) and/or hold-out testing, particularly in imaging tasks such as radiographic diagnosis, implant classification, and photographic assessment of oral conditions. Similar findings have been reported in general medical imaging, where deep learning models often achieve high internal accuracy [38]. However, these results were typically derived from single-center datasets with narrow inclusion criteria, which can inflate performance estimates. Models trained on such homogeneous data may capture site-specific patterns that fail to generalize, as evidenced by marked performance drops when medical imaging models are tested across institutions or populations [39], [40]. Consequently, although current dental AutoML models exhibit strong internal discrimination, their true clinical utility remains uncertain without external validation on diverse datasets.
Most studies in this review remained at the internal validation stage, while external validation was uncommon and prospective clinical evaluation was absent. Consequently, the available literature supports the technical feasibility of clinician-accessible AutoML, but does not yet justify conclusions regarding implementation effectiveness, workflow impact, or patient outcomes.
4.3. Platform Differences: Local versus Cloud Ecosystems
Interestingly, several studies reported comparable performance across different AutoML platforms. This suggests that data quality, outcome definition, and task formulation are more influential determinants of model performance than the specific algorithm or platform used. Emphasis on standardized annotations, refined cohort definitions, and consistent outcome variables is therefore likely to yield greater performance gains than increasing algorithmic complexity [39], [41], [42]. In this context, richer and more representative data inputs are more valuable than fine-tuning sophisticated models on poorly curated datasets.
However, comparable predictive performance did not imply equivalence in how systems were implemented, governed, or interpreted in practice. While predictive accuracy did not appear to depend consistently on the specific AutoML engine, important differences emerged in infrastructure, transparency, computational control, and implementation pathways. Local frameworks such as H2O, TPOT, Auto-WEKA, Auto-sklearn, and ORIENTATE were used predominantly for structured clinical, registry, and omics datasets. These environments typically require greater user involvement in preprocessing and pipeline preparation; however, they simultaneously enable closer methodological supervision of model construction. Studies employing local systems far more frequently incorporated interpretability strategies, including feature-importance ranking and SHAP-based analyses [19], [22], [23], [24], [27], [32], [34], [35], suggesting stronger compatibility with research scenarios in which auditability and reproducibility are central aims. In addition, maintaining computation within institutional infrastructure generally avoids external data transfer and may therefore simplify governance of sensitive clinical information.
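As a hedged illustration of such a local workflow (the file name, target column, and settings below are hypothetical and not drawn from any reviewed study), an H2O AutoML run keeps computation and interpretability outputs within institutional infrastructure:

```python
# Hypothetical example: dataset path, target column, and settings are
# assumptions for illustration, not details from a reviewed study.
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                           # local JVM; no cloud upload
frame = h2o.import_file("periodontal_records.csv")   # hypothetical dataset
frame["outcome"] = frame["outcome"].asfactor()       # binary target as categorical

aml = H2OAutoML(max_models=20, nfolds=5, seed=1)
aml.train(y="outcome", training_frame=frame)         # remaining columns as predictors

print(aml.leaderboard.head())      # cross-validated ranking of candidate models
h2o.explain(aml.leader, frame)     # variable importance and SHAP-style summaries
```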
In contrast, cloud ecosystems such as LandingLens, Vertex AI, AutoML Vision, and Teachable Machine were concentrated primarily in image-based applications and emphasized accessibility, rapid deployment, and abstraction of algorithmic complexity. Multiple reports demonstrated that clinicians with minimal programming background could produce high-performing models within short time frames through graphical interfaces and automated optimization pipelines [25], [28], [30], [33]. Nevertheless, this convenience is accompanied by reduced visibility into internal search processes, restricted customization, limited reporting of intermediate modeling decisions, and infrequent application of explainability tools.
Operational trade-offs are evident. Cloud implementations introduce ongoing node-hour expenses, reliance on internet connectivity, and dependence on vendor-managed environments [26], [28], [30], whereas locally deployed platforms are more often constrained by hardware capacity, memory limitations, and prolonged optimization cycles, particularly in evolutionary approaches [19], [34]. Taken together, platform selection therefore appears not as a purely technical preference but as a strategic alignment with the primary objective of a project. Investigations focused on biomarker discovery, hypothesis exploration, or transparent clinical risk modeling may derive greater value from local environments that prioritize interpretability and methodological control. By contrast, rapid prototyping, educational use, and early feasibility demonstrations may benefit more from cloud infrastructures that minimize engineering overhead and accelerate experimentation.
For the field to progress toward clinical translation, future reports should describe platform-specific constraints with greater clarity, including data custody, ownership of derived models, retraining permissions, financial dependencies, and availability of explanation interfaces. Without such information, evaluation of scalability and readiness for real-world integration remains incomplete. Strengthening reporting in these areas will be essential for transforming AutoML from a proof-of-concept accelerator into a trustworthy component of deployable dental decision-support systems.
4.4. Democratization of Model Development
Beyond these technical and operational considerations, AutoML also reshapes who can participate in model development and how quickly clinically relevant questions can be translated into empirical investigation. Its contribution in dentistry is not limited to a simple decrease in programming effort: although AutoML does not remove the need for appropriate data preparation, clinical supervision, or critical interpretation of results, it alters how and by whom artificial intelligence investigations can be initiated and executed. By automating algorithm selection, model construction, and hyperparameter optimization, AutoML enables clinicians to move from passive recipients of externally developed systems toward active participants in model development [3], [11], [36]. This transition is particularly relevant in healthcare environments where access to dedicated data-science teams may be limited, financially restrictive, or associated with long development timelines [3], [36].
AutoML also markedly increases development velocity. Instead of investing substantial time in building bespoke pipelines before any evaluation is possible, clinicians can explore feasibility, compare alternative predictor definitions, and test multiple hypotheses within short timeframes. This capacity for rapid iteration has been identified as a central advantage of AutoML in clinical research [8], [10], [12]. Furthermore, AutoML promotes methodological consistency. Standardized and automated workflows reduce variability arising from individual programming strategies, thereby improving reproducibility and facilitating comparisons across studies and institutions, which are prerequisites for trustworthy medical AI [3], [36].
Dependence on specialized engineers is not only a logistical matter but may also restrict which ideas are pursued. When development requires extensive technical mediation, many clinically relevant hypotheses may remain untested. AutoML lowers this structural barrier and broadens participation in AI research, enabling more practitioner-driven questions to be explored [11]. Therefore, even if AutoML does not eliminate all technical requirements, its principal value lies in democratizing innovation, accelerating exploratory research, and enabling clinician-led model development, thereby expanding the research capacity of the dental community.
4.5. Methodological Quality, Reporting, and Evaluation Practices
At present, AutoML functions mainly as an efficient framework for model selection and hyperparameter optimization within predefined search spaces and remains fundamentally constrained by the quality and consistency of input data, which it cannot inherently correct [36], [43]. This reinforces the need to prioritize data curation and standardization over reliance on black-box optimization for marginal performance improvements. In dental AI, assembling high-quality, heterogeneous datasets with precise outcome definitions is likely to have a greater impact than switching between AutoML platforms.
Despite favorable internal metrics, evaluation in dental AutoML studies remains largely discrimination-focused. Our synthesis highlights a persistent translational gap: high AUC or accuracy does not ensure that predictions are reliable, equitable, or safe for routine care. Without calibration analysis, predicted risks cannot be assumed to represent true probabilities at clinically relevant thresholds [44], [45]. Similarly, the lack of subgroup evaluation limits confidence in performance across populations, acquisition settings, and disease spectra, while sparse reporting of failure modes obscures where models may break down. Reproducibility is further constrained by incomplete documentation of platform configuration, workflow traceability, and safeguards against data leakage, particularly in no-code or low-code environments [46]. Future studies should therefore report calibration, subgroup performance, structured error analysis, and transparent patient-level data separation in addition to discrimination metrics. These elements are also essential for interpreting decision-analytic methods such as decision-curve analysis (DCA) when clinical use is proposed [47].
Most studies emphasized discrimination metrics such as accuracy, AUC, sensitivity, and specificity, whereas calibration and DCA were infrequently reported. DCA is designed to estimate clinical utility by comparing the net benefit of model-guided decisions with default strategies (e.g., treat all vs treat none) across clinically reasonable threshold probabilities. Its interpretation therefore requires explicit specification of the intended clinical action, appropriate comparators, and a justified threshold range aligned with the predicted outcome [48], [49]. Importantly, DCA should clarify the consequences of decisions under prespecified preferences rather than be used to identify an “optimal” threshold by maximizing net benefit [49]. Practical guidance further stresses restricting analyses to plausible thresholds and avoiding presentation choices that may exaggerate benefit [50], [51]. Although conclusions are often directionally correct, shortcomings in application and reporting, particularly unclear threshold justification and limited attention to optimism, are common [52].
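For orientation, net benefit at a threshold probability pt is the true-positive proportion minus the false-positive proportion weighted by the odds of the threshold, pt/(1 − pt) [48]. The sketch below (synthetic outcomes and risks; the threshold range is illustrative and would require clinical justification in any real application) shows the computation underlying a decision curve:

```python
# Illustrative computation only; outcomes, risks, and thresholds are
# synthetic stand-ins, not values from any reviewed study.
import numpy as np

rng = np.random.default_rng(1)
y_prob = rng.uniform(0.05, 0.95, 500)   # model-predicted risks (stand-in)
y_true = rng.binomial(1, y_prob)        # observed binary outcomes (stand-in)

def net_benefit(y_true, y_prob, pt):
    """Net benefit of acting on patients whose predicted risk exceeds pt."""
    n = len(y_true)
    act = y_prob >= pt
    tp = np.sum(act & (y_true == 1))
    fp = np.sum(act & (y_true == 0))
    return tp / n - fp / n * pt / (1 - pt)  # FPs weighted by threshold odds

for pt in (0.1, 0.2, 0.3):                  # prespecified plausible range
    model_nb = net_benefit(y_true, y_prob, pt)
    treat_all = net_benefit(y_true, np.ones_like(y_prob), pt)
    print(f"pt={pt:.1f}: model={model_nb:.3f}, treat-all={treat_all:.3f}")
# 'Treat none' has net benefit 0 at every threshold by definition.
```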
In this synthesized AutoML evidence, explicit linkage between predicted risk, clinical action, and threshold justification was uncommon. Consequently, even when favorable curves were presented, implications for patient management remain uncertain, and clinical readiness may be overstated. This limitation parallels broader critiques of medical AI, where models with high discrimination often fail clinically due to poor calibration or misaligned decision thresholds [44], [47]. Notably, almost no study examined how AutoML-based predictions would influence clinical decisions or patient outcomes. Without such evaluations, high AUC values may be misleading, particularly in dentistry, where diagnostic and treatment decisions are often subjective and carry significant consequences.
4.6. Transparency, Interpretability, and Accountability
In addition to numerical validity, translation into practice also depends on whether systems are understandable, auditable, and accountable. Across studies, AutoML models were consistently framed as decision-support tools rather than autonomous systems. This aligns with ethical and regulatory consensus in healthcare AI, which emphasizes augmenting clinician judgment while preserving human oversight and accountability [53], [54]. None of the reviewed studies proposed unsupervised or closed-loop automation, and all retained clinicians as final decision-makers. This cautious, assistive approach is appropriate given current technological and regulatory constraints and supports the paradigm of augmented intelligence rather than replacement.
Differences in AutoML application across data modalities were evident. Imaging-based studies frequently reported high predictive performance but rarely explored model reasoning, failure modes, or dataset biases. Few provided visual explanations such as saliency maps or systematic misclassification analyses, reflecting broader concerns about opacity in medical imaging AI [55], [56], [57]. In contrast, studies using structured clinical or omics data more frequently incorporated interpretability analyses, such as feature importance rankings or SHAP values [58], [59]. However, these efforts were largely associative and did not establish causal relationships or yield novel clinical insights beyond existing knowledge [46], [60]. Overall, interpretability in dental AutoML remains limited, which may hinder trust, regulatory approval, and clinical uptake. Future studies should adopt more robust explainability methods, particularly for image-based models, and routinely report known limitations and performance degradation under domain shifts.
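As a minimal, synthetic example of the SHAP-style attribution those studies reported (the model, predictors, and data below are illustrative assumptions; the open-source `shap` package is required):

```python
# Synthetic illustration of SHAP-based feature attribution; not a
# reproduction of any reviewed analysis. Requires the 'shap' package.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                    # 5 hypothetical predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # outcome driven by two of them

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)            # efficient for tree ensembles
shap_values = explainer.shap_values(X)

# Mean absolute SHAP value per predictor gives a global importance ranking;
# the first two features should dominate by construction.
print(np.abs(shap_values).mean(axis=0))
```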
Concerns regarding transparency, reproducibility, and accountability were also prominent. While automation lowers technical barriers, it often obscures critical aspects of model development [61]. Many dental AutoML studies provided limited documentation of model selection, feature engineering, or training procedures, which undermines trust and reproducibility. In regulated healthcare settings, insufficient access to technical details complicates auditing and accountability, particularly since clinicians and institutions remain responsible for AI-influenced decisions [62], [63]. Emerging guidelines such as DECIDE-AI and MINIMAR emphasize the need for improved documentation and governance, even for automated systems [64], [65]. Dental AI studies should therefore report sufficient technical detail to enable independent validation and clearly define responsibility for model errors.
4.7. Task Formulation and the Economics of Annotation
Most AutoML applications were framed as supervised classification, while object detection and segmentation were comparatively rare. This imbalance is notable, given the growing interest in localization tasks within dental AI. The three formulations differ substantially in both informational output and clinical utility. Classification provides a direct image- or patient-level decision, making it attractive for screening, triage, and rapid support, but it lacks spatial information regarding lesion extent or anatomical relationships [66]. Detection augments classification by localizing candidate regions, typically via bounding boxes, thereby offering spatial context without pixel-level delineation [67]. Segmentation yields the most detailed representation, enabling quantitative assessment, surgical planning, and longitudinal monitoring [68]. The richer outputs of detection and segmentation, however, come with substantially higher methodological demands. Pixel-level annotation requires expert labor, standardized protocols, and specialized tools and remains a central bottleneck in medical imaging AI [69]. In contrast, classification labels are often retrievable from routine documentation because diagnoses are already captured during daily care [70]. This dramatically lowers the threshold for dataset creation, an important consideration in AutoML projects that are frequently initiated without dedicated annotation infrastructure. The dominance of classification may also mirror the nature of many dental decisions, which commonly involve categorical judgments such as disease presence, referral, or risk assignment [71], [72]. In these situations, image-level outputs may sufficiently support the clinical objective, whereas fine-grained delineation does not necessarily yield proportional gains in actionability.
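A purely schematic contrast of the three output granularities (shapes, labels, and values below are illustrative assumptions, unrelated to any reviewed system) may help clarify why annotation costs diverge so sharply:

```python
# Schematic only: contrasts the information content of the three task
# formulations; all values and shapes are illustrative assumptions.
import numpy as np

image = np.zeros((512, 512), dtype=np.uint8)     # one radiograph

# Classification: a single image-level judgment (cheapest to label).
classification = {"caries_present": 0.91}

# Detection: coarse localization via bounding boxes (moderate labeling cost).
detection = [{"box_xywh": (120, 80, 60, 40), "label": "caries", "score": 0.87}]

# Segmentation: a pixel-level mask delineating lesion extent (most costly).
segmentation = np.zeros(image.shape, dtype=bool)
segmentation[80:120, 120:180] = True
print(f"labeled pixels: {segmentation.sum()}")   # quantitative lesion extent
```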
There is no indication that AutoML itself is intrinsically limited to classification. On the contrary, AutoML approaches have demonstrated effectiveness across segmentation and other complex imaging tasks [73]. The distribution observed here is therefore more plausibly explained by practical constraints, particularly the expert time required for dense annotation [66], [69], alongside governance considerations [74] and the computational demands of pixel-wise optimization, which collectively make classification the most accessible pathway. Comparable trends are reported throughout dental AI research, where classification commonly outweighs segmentation in areas such as caries and periodontal imaging [75], [76]. Taken together, these observations suggest that prevailing task choices might be driven primarily by feasibility and resource considerations rather than technological limitations of AutoML.
4.8. Gaps in Clinical Translation
We identified no studies addressing survival analysis, longitudinal modeling, or multi-step treatment decision pathways. This absence likely reflects technical limitations of current AutoML platforms rather than a lack of clinical relevance. Complex questions involving disease progression or time-to-event outcomes require modeling approaches that are not yet well supported by off-the-shelf AutoML systems [77]. As AutoML tools evolve to accommodate longitudinal and survival data, dental research should expand into these clinically meaningful applications. Another major gap is the lack of evaluation beyond retrospective development and internal validation. No study advanced to prospective clinical implementation. This mirrors broader findings that most published clinical AI models do not progress to prospective validation or implementation [78], [79]. Without real-world evidence of effectiveness, usability, and generalizability, claims of clinical benefit remain speculative.
4.9. Real-World Barriers to Adoption
Several practical factors may limit the incorporation of AutoML systems into routine oral healthcare. Clinical translation requires alignment with infrastructure, governance, regulatory frameworks, professional responsibilities, and economic sustainability [6], [80]. Consequently, movement from experimental validation to everyday deployment remains largely theoretical [12], [65].
Integration into existing workflows represents a primary challenge [5], [6]. Dental practices function within heterogeneous digital environments that combine imaging devices, proprietary software, practice-management platforms, and electronic records [5], [81]. Embedding automated outputs without disrupting established routines often demands interface development, vendor cooperation, and continuous technical support [6], [65]. In high-throughput settings, even minor delays may compromise usability. While AutoML reduces the expertise required for model development, deployment, interoperability, and lifecycle management remain substantial challenges [12], [36].
Data protection and governance create further constraints. Cloud-based computation raises concerns regarding confidentiality, cybersecurity, data ownership, and cross-border transfer [6], [64]. Such issues are particularly relevant for small or privately run practices with limited IT resources. Legal ambiguity around secondary use and breach responsibility may deter adoption despite promising performance.
Regulatory approval forms another major boundary between development and patient care. Systems influencing diagnosis or treatment planning increasingly fall within medical-device legislation, requiring evidence of safety, effectiveness, transparency, and post-market oversight [46], [64], [65]. Automated model search may complicate certification by reducing traceability and hindering reconstruction of development pathways [12], [36]. Limited access to proprietary processes can conflict with expectations for auditability and reproducibility [46], [64]. At the same time, clinicians generally retain legal responsibility, creating uncertainty around liability when decisions are AI-supported [62], [63].
Human and organizational dimensions are equally influential. Uptake depends on trust, perceived usefulness, and compatibility with established diagnostic reasoning [5], [6]. Tools viewed as opaque or as constraining professional autonomy may face resistance unless accompanied by transparent communication of limitations and appropriate indications [46], [64]. Implementation research repeatedly demonstrates that strong accuracy metrics alone rarely guarantee adoption without tangible workflow benefit and adequate training [6], [65].
Economic feasibility further shapes decision-making. Licensing, subscriptions, hardware, cybersecurity, integration, and staff education require substantial investment [5], [6]. With reimbursement pathways often undefined, return on investment remains uncertain, particularly for smaller providers. Thus, progression from promising performance to sustainable routine use cannot yet be assumed [36], [80]. These barriers frame the broader question of whether AutoML is presently ready for reliable chairside deployment.
4.10. Readiness for Chairside Use
From a clinical standpoint, the key issue is whether the body of evidence summarized in this review is sufficient to justify routine use of AutoML systems in patient care. When considered collectively, the answer remains cautious. Although many investigations report encouraging internal performance, the broader conditions that typically precede clinical deployment, including consistent multicenter validation, prospective demonstration of benefit, seamless incorporation into everyday workflows, regulatory approval pathways, and sustainable governance mechanisms, have seldom been satisfied within the same framework.
Readiness for chairside adoption depends on more than algorithmic accuracy. Systems intended to inform diagnosis or treatment must demonstrate reliability across diverse clinical environments, produce outputs that clinicians can interpret and defend, and operate within legal, ethical, and financial structures acceptable to healthcare providers. Across these dimensions, the current literature remains preliminary and fragmented. Accordingly, AutoML in dentistry should presently be viewed primarily as an enabling methodology for accelerating research, hypothesis generation, and rapid prototyping rather than as a mature clinical solution. Transition toward dependable real-world use will require coordinated progress not only in predictive performance but also in validation methodology, implementation science, and institutional governance.
4.11. Future Directions
Looking ahead, several priorities emerge to ensure AutoML delivers meaningful clinical value in oral health. External validation on independent, multi-center datasets is essential to establish generalizability. Promising models should undergo prospective clinical evaluation, including pilot implementation studies or randomized trials. Reporting standards should require calibration assessment, decision-analytic evaluation, and full disclosure of model development details. In addition, fairness and bias analyses should be routinely conducted to ensure equitable performance across demographic subgroups. Without these steps, oral healthcare risks replicating patterns seen elsewhere in AI, where technically impressive models fail to translate into patient benefit.
4.12. Strengths and Limitations
This review has notable strengths. To our knowledge, it represents the first systematic review focused specifically on AutoML in oral healthcare, enabling a targeted assessment of its opportunities and challenges. Nevertheless, limitations must be acknowledged. The number of eligible studies was small and heterogeneous, precluding meta-analysis and reflecting the early stage of adoption. Most studies relied on retrospective, single-center datasets with limited population diversity, and none evaluated implementation in routine clinical workflows or reported patient outcome data. As such, the available evidence remains largely methodological rather than clinically outcome-driven. Additionally, although major biomedical databases and grey literature sources were searched, AutoML studies might be published in computer-science or engineering venues rather than traditional biomedical journals and may therefore not be consistently indexed in PubMed, Web of Science, or Cochrane. Therefore, some relevant evidence may not have been captured, and the database scope should be interpreted as a potential limitation. Finally, clinical implications should remain conservative because most included studies were internally validated within single datasets; robust multi-center external validation and real-world clinical evaluation are required before AutoML models can be considered ready for routine use.
5. Conclusion
AutoML in oral healthcare currently serves primarily as a methodological tool for early-stage model development rather than a clinically validated technology. While AutoML has democratized access to advanced machine learning and accelerated exploratory research, its application remains largely confined to proof-of-concept studies. Progress toward clinical adoption will require rigorous external validation, transparent and standardized reporting, and evaluation within real-world clinical settings. As AutoML technologies mature, their contribution to dental practice will depend on robust data curation, appropriate governance, and careful integration into clinical decision-making processes to ensure reliability, reproducibility, and patient benefit.
Ethical approval
Nil.
Funding
This study was not supported by any funding.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
Nil.
Consent to Participate
For this type of study, informed consent is not required.
Code Availability
Nil.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.jdsr.2026.03.001.
Appendix A. Supplementary material
Data availability
This review is based on previously published studies. All data extracted and analyzed during this systematic review are included within the article and its Supplementary Materials.
References
- 1.Al-Raeei M. From diagnosis to treatment: the comprehensive integration of artificial intelligence in modern dental care. Curr Trends Dent. 2025;2(1):13–17. [Google Scholar]
- 2.Sohrabniya F., Hassanzadeh-Samani S., Ourang S.A., Jafari B., Farzinnia G., Gorjinejad F., et al. Exploring a decade of deep learning in dentistry: a comprehensive mapping review. Clin Oral Invest. 2025;29:143. doi: 10.1007/s00784-025-06216-5. [DOI] [PubMed] [Google Scholar]
- 3.Luo G., Stone B.L., Johnson M.D., Tarczy-Hornoch P., Wilcox A.B., Mooney S.D., et al. Automating Construction of Machine Learning Models With Clinical Big Data: Proposal Rationale and Methods. JMIR Res Protoc. 2017;6 doi: 10.2196/resprot.7757. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Luu J., Borisenko E., Przekop V., Patil A., Forrester J.D., Choi J. Practical guide to building machine learning-based clinical prediction models using imbalanced datasets. Trauma Surg Acute Care Open. 2024;9 doi: 10.1136/tsaco-2023-001222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Müller A., Mertens S.M., Göstemeyer G., Krois J., Schwendicke F. Barriers and enablers for artificial intelligence in dental diagnostics: a qualitative study. J Clin Med. 2021;10(8):1612. doi: 10.3390/jcm10081612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Hassan M., Kushniruk A., Borycki E. Barriers to and Facilitators of Artificial Intelligence Adoption in Health Care: Scoping Review. JMIR Hum Factors. 2024;11 doi: 10.2196/48633. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Baratchi M., Wang C., Limmer S., Van Rijn J.N., Hoos H., Bäck T., et al. Automated machine learning: past, present and future. Artif Intell Rev. 2024;57:122. [Google Scholar]
- 8.Park Y. Automated machine learning with R: AutoML tools for beginners in clinical research. J Minim Invasive Surg. 2024;27:129–137. doi: 10.7602/jmis.2024.27.3.129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Shujaat S. Automated machine learning in dentistry: a narrative review of applications, challenges, and future directions. Diagn (Basel) 2025;15(3):273. doi: 10.3390/diagnostics15030273. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Karmaker S.K., Hassan M.M., Smith M.J., Xu L., Zhai C., Veeramachaneni K. AutoML to date and beyond: challenges and opportunities. ACM Comput Surv. 2021;54(8):1–35. [Google Scholar]
- 11.Faes L., Wagner S.K., Fu D.J., Liu X., Korot E., Ledsam J.R., et al. Automated deep learning design for medical image classification by health-care professionals with no coding experience: a feasibility study. Lancet Digit Health. 2019;1:e232–e242. doi: 10.1016/S2589-7500(19)30108-6. [DOI] [PubMed] [Google Scholar]
- 12.Thirunavukarasu A.J., Elangovan K., Gutierrez L., Hassan R., Li Y., Tan T.F., et al. Clinical performance of automated machine learning: A systematic review. Ann Acad Med Singap. 2024;53:187–207. doi: 10.47102/annals-acadmedsg.2023113. [DOI] [PubMed] [Google Scholar]
- 13.Luke A.M., Rezallah N.N.F. Accuracy of artificial intelligence in caries detection: a systematic review and meta-analysis. Head Face Med. 2025;21:24. doi: 10.1186/s13005-025-00496-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Macrì M., D’Albis V., D’Albis G., Forte M., Capodiferro S., Favia G., et al. The Role and Applications of Artificial Intelligence in Dental Implant Planning: A Systematic Review. Bioeng (Basel) 2024;11:778. doi: 10.3390/bioengineering11080778. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Page M.J., McKenzie J.E., Bossuyt P.M., Boutron I., Hoffmann T.C., Mulrow C.D., et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. doi: 10.1136/bmj.n71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Wolff R.F., Moons K.G.M., Riley R.D., Whiting P.F., Westwood M., Collins G.S., et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med. 2019;170:51–58. doi: 10.7326/M18-1376. [DOI] [PubMed] [Google Scholar]
- 17.Lee J.H., Kim Y.T., Lee J.B., Jeong S.N. A performance comparison between automated deep learning and dental professionals in classification of dental implant systems from dental imaging: a multi-center study. Diagn (Basel) 2020;10(11):910. doi: 10.3390/diagnostics10110910. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Tobias G., Spanier A.B. Modified gingival index (MGI) classification using dental selfies. Appl Sci. 2020;10:8923. [Google Scholar]
- 19.Heimisdottir L.H., Lin B.M., Cho H., Orlenko A., Ribeiro A.A., Simon-Soro A., et al. Metabolomics Insights in Early Childhood Caries. J Dent Res. 2021;100:615–622. doi: 10.1177/0022034520982963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Karhade D.S., Roach J., Shrestha P., Simancas-Pallares M.A., Ginnis J., Burk Z.J.S., et al. An Automated Machine Learning Classifier for Early Childhood Caries. Pedia Dent. 2021;43:191–197. [PMC free article] [PubMed] [Google Scholar]
- 21.Real A.D., Real O.D., Sardina S., Oyonarte R. Use of automated artificial intelligence to predict the need for orthodontic extractions. Korean J Orthod. 2022;52:102–111. doi: 10.4041/kjod.2022.52.2.102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Boitor O., Stef L., Antonescu E., Stoica F., Mihăilă R. The importance of some risk factors in patients presenting the association of metabolic syndrome with periodontal disease. Acta Med Transilv. 2023;28:53. [Google Scholar]
- 23.Boitor O., Stoica F., Mihaila R., Stoica L.F., Stef L. Automated Machine Learning to Develop Predictive Models of Metabolic Syndrome in Patients with Periodontal Disease. Diagn (Basel) 2023;13:3631. doi: 10.3390/diagnostics13243631. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Fatapour Y., Abiri A., Kuan E.C., Brody J.P. Development of a Machine Learning Model to Predict Recurrence of Oral Tongue Squamous Cell Carcinoma. Cancers (Basel) 2023;15:2769. doi: 10.3390/cancers15102769. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Hamdan M., Badr Z., Bjork J., Saxe R., Malensek F., Miller C., et al. Detection of dental restorations using no-code artificial intelligence. J Dent. 2023 doi: 10.1016/j.jdent.2023.104768. [DOI] [PubMed] [Google Scholar]
- 26.Kong H.J. Classification of dental implant systems using cloud-based deep learning algorithm: an experimental study. J Yeungnam Med Sci. 2023;40:S29–S36. doi: 10.12701/jyms.2023.00465. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Kwack D.W., Park S.M. Prediction of medication-related osteonecrosis of the jaw (MRONJ) using automated machine learning in patients with osteoporosis associated with dental extraction and implantation: a retrospective study. J Korean Assoc Oral Maxillofac Surg. 2023;49:135–141. doi: 10.5125/jkaoms.2023.49.3.135. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Cheong R.C.T., Jawad S., Adams A., Campion T., Lim Z.H., Papachristou N., et al. Enhancing paranasal sinus disease detection with AutoML: efficient AI development and evaluation via magnetic resonance imaging. Eur Arch Otorhinolaryngol. 2024;281:2153–2158. doi: 10.1007/s00405-023-08424-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Gonzalez C., Badr Z., Gungor H.C., Han S., Hamdan M.D. Identifying Primary Proximal Caries Lesions in Pediatric Patients From Bitewing Radiographs Using Artificial Intelligence. Pedia Dent. 2024;46:332–336. [PubMed] [Google Scholar]
- 30.Jeong H.-J. Preliminary test of google vertex artificial intelligence in root dental x-ray imaging diagnosis. J Korean Soc Radio. 2024;18:267–273. [Google Scholar]
- 31.Nantakeeratipat T., Apisaksirikul N., Boonrojsaree B., Boonkijkullatat S., Simaphichet A. Automated machine learning for image-based detection of dental plaque on permanent teeth. Front Dent Med. 2024;5 doi: 10.3389/fdmed.2024.1507705. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Yoon J.H., Park S.M. Prediction of intensive care unit admission using machine learning in patients with odontogenic infection. J Korean Assoc Oral Maxillofac Surg. 2024;50:216–221. doi: 10.5125/jkaoms.2024.50.4.216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Akadiri O.A., Yarhere K.S. Embracing Computer Vision for Diagnostic Maxillofacial Imaging - An Artificial Intelligence Machine Learning (AIML) Pilot Project. Niger Med J. 2025;66:973–982. doi: 10.71480/nmj.v66i3.748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Pais R.J., Botelho J., Machado V., Alcoforado G., Mendes J.J., Alves R., et al. Exploring AI–driven machine learning approaches for optimal classification of peri-implantitis based on oral microbiome data: a feasibility study. Diagn (Basel) 2025;15:425. doi: 10.3390/diagnostics15040425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Gomez-Rios I., Egea-Lopez E., Ortiz Ruiz A.J. ORIENTATE: automated machine learning classifiers for oral health prediction and research. BMC Oral Health. 2023;23:408. doi: 10.1186/s12903-023-03112-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Waring J., Lindvall C., Umeton R. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif Intell Med. 2020;104 doi: 10.1016/j.artmed.2020.101822. [DOI] [PubMed] [Google Scholar]
- 37.Thirunavukarasu A.J., Elangovan K., Gutierrez L., Li Y., Tan I., Keane P.A., et al. Democratizing Artificial Intelligence Imaging Analysis With Automated Machine Learning: Tutorial. J Med Internet Res. 2023;25 doi: 10.2196/49949. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Aggarwal R., Sounderajah V., Martin G., Ting D.S.W., Karthikesalingam A., King D., et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med. 2021;4:65. doi: 10.1038/s41746-021-00438-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Zech J.R., Badgeley M.A., Liu M., Costa A.B., Titano J.J., Oermann E.K. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Med. 2018;15 doi: 10.1371/journal.pmed.1002683. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Yu A.C., Mohajer B., Eng J. External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review. Radio Artif Intell. 2022;4 doi: 10.1148/ryai.210064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Beam A.L., Kohane I.S. Big Data and Machine Learning in Health Care. JAMA. 2018;319:1317–1318. doi: 10.1001/jama.2017.18391. [DOI] [PubMed] [Google Scholar]
- 42.Cahan E.M., Hernandez-Boussard T., Thadaney-Israni S., Rubin D.L. Putting the data before the algorithm in big data addressing personalized healthcare. NPJ Digit Med. 2019;2:78. doi: 10.1038/s41746-019-0157-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Zöller M.A., Huber M.F. Benchmark and survey of automated machine learning frameworks. J Artif Intell Res. 2021;70:409–474. [Google Scholar]
- 44.Van Calster B., McLernon D.J., Van Smeden M., Wynants L., Steyerberg E.W., Bossuyt P., et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17:230. doi: 10.1186/s12916-019-1466-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Steyerberg E.W., Vickers A.J., Cook N.R., Gerds T., Gonen M., Obuchowski N., et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–138. doi: 10.1097/EDE.0b013e3181c30fb2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Supplementary Materials
Supplementary material associated with this article is available online.
Data Availability Statement
This review is based on previously published studies. All data extracted and analyzed during this systematic review are included within the article and its Supplementary Materials.

