Abstract
Automated machine learning (AutoML) has emerged as a clinician-accessible approach to artificial intelligence by automating key stages of model development, including algorithm selection and hyperparameter optimization. This systematic review aimed to evaluate the application, performance, and translational relevance of AutoML in oral healthcare. Electronic searches of PubMed, Web of Science, Cochrane Library, and grey literature were conducted. Nineteen primary studies met the inclusion criteria. AutoML was applied across multiple dental specialties, including periodontology, orthodontics, pediatric dentistry, oral surgery, oral radiology, and public oral health, using heterogeneous data modalities such as radiographic images, clinical records, photographs, and omics data. Most studies reported strong performance on internal validation (e.g., cross-validation or validation splits) and/or independent hold-out test sets, particularly for imaging-based classification tasks, with several models achieving high accuracy and discrimination. However, external validation was uncommon, sample sizes were frequently limited, and substantial risk of bias was identified, particularly in analysis and participant selection. No study evaluated real-world clinical implementation or patient outcomes. Overall, AutoML shows promise in reducing technical barriers and supporting clinician-led artificial intelligence research in dentistry. Nevertheless, the current evidence base remains exploratory, and rigorous external validation, standardized reporting, and prospective clinical evaluation are required before routine clinical adoption.
Keywords: Automated machine learning, Dental artificial intelligence, Clinical decision support, Diagnostic model, Predictive modeling, Oral healthcare
1. Introduction
Artificial intelligence (AI) has become a central methodological framework in contemporary dental research, enabling computational approaches for diagnosis, prognosis, risk stratification, and treatment planning across multiple dental subspecialties [1]. Despite this rapid expansion, the development of conventional machine learning (ML) and deep learning (DL) models in oral healthcare remains technically complex and resource-intensive [2]. Standard AI workflows typically require advanced expertise in data preprocessing, feature engineering, model architecture design, algorithm selection, hyperparameter tuning, and performance optimization [3], [4]. These technical demands create substantial barriers for many dental clinicians and researchers, limiting the independent development and widespread adoption of AI systems within dental institutions [5], [6].
Automated machine learning (AutoML) has emerged as a paradigm aimed at reducing these technical barriers by automating key stages of the machine-learning pipeline, particularly algorithm search, model selection, and hyperparameter optimization. Depending on the platform, AutoML frameworks may also provide partial automation of data preprocessing, feature handling, and model evaluation within standardized workflows [7]. Many AutoML platforms are delivered through no-code or low-code interfaces, enabling users without formal training in programming or data science to develop predictive and diagnostic models directly from clinical datasets [8], [9]. This automation has the potential to shift AI development from computational specialists toward domain experts, thereby altering how AI models are created and evaluated in dental research.
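For illustration, the sketch below shows what this automation looks like in practice using H2O AutoML, one of the open-source frameworks encountered later in this review. The file name and column names are hypothetical and do not correspond to any included study; the point is that the user supplies labeled data and a search budget, while algorithm search, model selection, and hyperparameter optimization are delegated to the framework.

```python
# Minimal AutoML sketch (hypothetical file and column names).
import h2o
from h2o.automl import H2OAutoML

h2o.init()

frame = h2o.import_file("dental_records.csv")    # hypothetical tabular dataset
frame["outcome"] = frame["outcome"].asfactor()   # treat target as categorical

train, test = frame.split_frame(ratios=[0.8], seed=42)

aml = H2OAutoML(max_runtime_secs=600, seed=42)   # 10-minute search budget
aml.train(y="outcome", training_frame=train)     # algorithm + hyperparameter search

print(aml.leaderboard.head())                    # ranked candidate models
print(aml.leader.model_performance(test).auc())  # discrimination on hold-out data
```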
Oral health represents a particularly suitable domain for AutoML applications due to its reliance on high-volume imaging data, structured clinical records, and outcome-driven decision-making [9]. By reducing the need for manual coding and algorithm engineering, AutoML may allow dental researchers to concentrate on clinically relevant problem formulation, data quality assurance, and result interpretation. Moreover, the standardized nature of AutoML pipelines may facilitate methodological consistency, support reproducibility, and enable more uniform performance benchmarking across studies and institutions, while potentially reducing certain forms of subjective, human-driven modeling bias [10]. From a translational perspective, AutoML aligns closely with the practical requirements of digital dentistry. AutoML frameworks have been shown to substantially lower the technical and resource barriers to building competitive AI models, facilitating development by clinicians without deep coding or ML expertise [11], [12]. Lowering these entry thresholds can accelerate clinician-driven AI research, promote multi-center collaboration, and support scalable evaluation of AI systems across diverse dental specialties. Prior evidence in dentistry demonstrates promising AutoML applications across diagnostic and predictive tasks [9].
Despite the rapid global expansion of AI research in oral healthcare, the specific contribution of AutoML within this field remains poorly characterized. Existing systematic reviews have largely concentrated on conventionally developed ML and DL models that rely on manual feature engineering and model optimization [13], [14]. In contrast, AutoML-based approaches have received minimal attention in dental research and have been addressed only in a single narrative review to date [9]. As a result, the scope of AutoML adoption in oral healthcare, its application across different dental specialties, and the methodological characteristics of the existing evidence base remain insufficiently understood.
Therefore, the aim of this systematic review was to examine the use of AutoML in oral healthcare. Specifically, this review sought to (i) describe how AutoML has been applied across dental specialties and data modalities; (ii) summarize reported model performance and evaluation approaches; and (iii) consider the potential translational relevance of AutoML-based models for clinical research and practice. By addressing these objectives, this review aims to outline the current evidence base, identify key methodological limitations, and highlight areas for future investigation regarding AutoML as a clinician-accessible artificial intelligence approach in dental research.
2. Methodology
2.1. Protocol and Registration
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [15]. The review protocol was prospectively registered in the PROSPERO (International Prospective Register of Systematic Reviews) database under the registration number CRD420251266034. As this review synthesized previously published data and did not involve the collection of primary data from human participants or animals, ethical approval and informed consent were not required.
2.2. Review Question
The review question was structured according to the Population, Intervention, Comparison, and Outcome (PICO) framework as follows. Population (P): Individuals assessed for dental and maxillofacial conditions. Intervention (I): AutoML-based models. Comparison (C): Conventional diagnostic approaches, expert clinical assessment, or traditional statistical or manually developed machine learning models, when applicable. Outcome (O): Model performance in terms of diagnostic or predictive accuracy, including accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, precision, recall, and F1-score.
Based on this framework, the review sought to answer the following question: “What is the performance (O) of AutoML–based models (I) in individuals with dental and maxillofacial conditions (P) compared with conventional diagnostic approaches or expert assessment (C)?”
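For reference, every outcome metric named in this PICO can be computed from a binary confusion matrix together with a continuous model score. A minimal scikit-learn sketch with hypothetical labels (not data from any included study):

```python
# Outcome metrics from hypothetical binary predictions and scores.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # reference standard
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # predicted class labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("AUC:        ", roc_auc_score(y_true, y_score))
print("sensitivity:", recall_score(y_true, y_pred))   # recall of positive class
print("specificity:", tn / (tn + fp))
print("precision:  ", precision_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))
```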
2.3. Eligibility Criteria
This systematic review included all full-text primary studies that evaluated the application of AutoML or highly automated machine learning pipelines in dentistry and oral and maxillofacial research for diagnostic, predictive, classification, or risk-assessment purposes. Eligible studies were required to: (1) implement an AutoML or automated ML framework for model development, such as Google Cloud AutoML, H2O AutoML, Auto-WEKA, TPOT, ORIENTATE, Vertex AI, LandingLens, or a similar platform; (2) utilize dental, oral, or maxillofacial datasets, including radiographic, photographic, microbiome, metabolomic, clinical, biochemical, or questionnaire-based data; (3) report quantitative model performance metrics such as accuracy, AUC, sensitivity, specificity, precision, recall, F1-score, or related model evaluation outcomes (internal validation, testing, or external validation); and (4) involve human subjects or human-derived datasets. Studies related to oral and maxillofacial oncology (e.g., oral squamous cell carcinoma, jaw tumors, and odontogenic malignancies) were included provided they met all other eligibility criteria.
Exclusion criteria comprised case reports, narrative reviews, systematic reviews, meta-analyses, book chapters, editorials, letters to the editor, conference abstracts, animal studies, in vitro studies, and purely technical algorithmic papers without direct clinical dental, oral, or maxillofacial application. Studies focusing exclusively on non-oral head and neck oncology (e.g., laryngeal, pharyngeal, thyroid, or salivary gland tumors without oral or jaw involvement) were excluded. No restrictions were applied regarding the year of publication or language.
2.4. Information Sources and Search Strategy
An electronic literature search was performed across the PubMed, Web of Science, and Cochrane Library databases up to 1 December 2025. The search strategy combined Medical Subject Headings (MeSH) terms and free-text keywords related to automated machine learning and dental and maxillofacial applications. The complete electronic search strategy is provided in Supplementary File 1.
To minimize the risk of publication and selection bias, an extensive search of the grey literature was conducted using Google Scholar and ProQuest Dissertations & Theses. In addition, a manual screening of the reference lists of all included studies and relevant review articles was undertaken to identify any additional eligible publications not captured through the electronic database search.
All retrieved citations were imported into EndNote 2025 software (Clarivate, Philadelphia, PA, USA) for duplicate identification and removal, followed by title, abstract, and full-text screening.
2.5. Study Selection and Data Extraction
Two independent reviewers screened the titles and abstracts of all retrieved records to assess potential eligibility. Studies that appeared to meet the inclusion criteria were subsequently evaluated at the full-text level. Full-text articles were independently reviewed to determine final eligibility for inclusion. Any discrepancies between the two reviewers were resolved through discussion, and when consensus could not be reached, a third reviewer was consulted to make the final decision.
Data extraction was performed independently by the two reviewers using a standardized data extraction form. The extracted variables included: study title, authors, year of publication, country of origin, clinical application domain, data modality (e.g., radiographic images, clinical records, photographic images, microbiome or metabolomic data), AutoML platform or framework used, study design, dataset size, type of input features, training, internal validation, and testing strategy (including cross-validation and hold-out test evaluation when applicable), external validation when reported, ground truth reference standard, performance evaluation metrics (including accuracy, AUC, sensitivity, specificity, precision, recall, and F1-score), and major study outcomes. To ensure terminological consistency across studies, the following definitions were applied throughout this review. The training set refers to data used to fit (develop) the model. Internal validation refers to procedures performed during model development for model selection or tuning (e.g., a validation split within the training data, k-fold cross-validation, or bootstrap resampling). The test set refers to an independent hold-out subset used exclusively for final model evaluation after training and tuning were completed. Performance assessed using data obtained from a different institution, population, or time period was considered external validation.
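These definitions can be made concrete in code. The sketch below, on purely synthetic data, separates an independent hold-out test set from the training data and uses 5-fold cross-validation within the training data as internal validation; external validation would require a separate cohort and is not shown.

```python
# Terminology sketch (synthetic data): training vs. internal validation
# vs. independent hold-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # hypothetical predictors
y = rng.integers(0, 2, size=300)          # hypothetical binary outcome

# Independent hold-out test set, untouched during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42)

# Internal validation: 5-fold cross-validation within the training data,
# used for model assessment/selection during development.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("internal validation AUC:", cv_auc.mean())

# Final evaluation on the hold-out test set after training is complete.
model.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```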
Inter-reviewer agreement for study selection was assessed using Cohen’s kappa (κ) statistic at both the title/abstract screening and full-text eligibility stages, and agreement was excellent (κ = 0.95). Disagreements were resolved through discussion and, when necessary, consultation with a third reviewer.
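Cohen’s κ corrects observed agreement for agreement expected by chance. A minimal sketch with hypothetical screening decisions (not the actual screening data of this review):

```python
# Cohen's kappa for two reviewers' hypothetical include/exclude decisions.
from sklearn.metrics import cohen_kappa_score

reviewer_1 = ["include", "exclude", "exclude", "include", "exclude", "include"]
reviewer_2 = ["include", "exclude", "include", "include", "exclude", "include"]

print(cohen_kappa_score(reviewer_1, reviewer_2))  # ≈ 0.67 here; 1.0 = perfect agreement
```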
2.6. Risk of Bias Assessment
The risk of bias and applicability of all included studies were independently assessed by two reviewers using the PROBAST (Prediction Model Risk of Bias Assessment Tool) [16], as all studies involved the development and/or validation of machine learning or automated machine learning–based prediction models. PROBAST evaluates risk of bias across four domains: participants, predictors, outcome, and analysis.
Each domain was classified as having low, high, or unclear risk of bias in accordance with the standardized PROBAST signaling questions. Any disagreements between the two reviewers were resolved through discussion, and when consensus could not be achieved, a third senior reviewer was consulted. An overall risk of bias judgment for each study was determined based on the domain-level assessments and is reported descriptively.
3. Results
3.1. Study Selection
The study selection process is presented in the PRISMA flow diagram (Fig. 1). The initial database search identified 769 records (PubMed: 443; Web of Science: 286; Cochrane: 38; grey literature: 2). After the removal of 197 duplicate records, 572 unique records remained and were screened based on titles and abstracts. Of these, 553 records were excluded for not meeting the eligibility criteria. A total of 19 full-text articles were sought for retrieval and assessed for eligibility, with no reports excluded at this stage. Ultimately, 19 studies fulfilled the inclusion criteria and were included in the systematic review. Due to substantial heterogeneity in study design, outcomes, data modalities, and validation strategies, quantitative meta-analysis was not feasible.
Fig. 1.
PRISMA flowchart.
3.2. Qualitative synthesis of included studies
Nineteen studies met the inclusion criteria (Table 1) [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. Collectively, they covered a broad range of dental specialties, including periodontology and implantology, orthodontics, pediatric dentistry, operative dentistry, oral and maxillofacial surgery, oral and maxillofacial radiology, community/preventive dentistry and public oral health. Most investigations were single-center observational studies, predominantly retrospective designs based on clinical records or image archives, complemented by a small number of prospective cross-sectional, cohort, diagnostic-accuracy and in-silico analyses. Studies were conducted in North and South America, Europe, Asia and Africa.
Table 1.
Study and clinical–data characteristics of the included studies.
| Author | Country | Study Design | Dental Specialty | Clinical Application | Data Type | Dataset Nature | Data Source | Modality / Format | Task Type | Target / Output |
|---|---|---|---|---|---|---|---|---|---|---|
| Lee et al.[17] | South Korea | Retrospective | Periodontology / Implantology | Dental implant system classification | Dental Imaging | Panoramic & periapical radiographs | Three Korean dental hospitals | 2D DICOM | Multi-Class Classification | Implant system type |
| Tobias et al.[18] | Israel | Prospective Observational | Periodontology / Community Dentistry | Gingival health monitoring and Modified Gingival Index (MGI) classification | Dental Imaging | Smartphone intraoral photographs | iGAM mobile-app pilot study, Hebrew University–Hadassah School of Dental Medicine | 2D RGB images | Binary Classification | Gingivitis severity |
| Heimisdottir et al.[19] | USA | Case–Control | Pediatric Dentistry / Public Health | Early childhood caries (ECC) risk classification | Metabolomics | Supragingival biofilm dataset | ZOE 2.0 preschool cohort, North Carolina Head Start centers | Biospecimen + metabolite matrix | Binary Classification | Localized ECC status |
| Karhade et al.[20] | USA | Cross-Sectional | Pediatric Dentistry / Public Health | Early childhood caries (ECC) status screening | Clinical Tabular | Demographic & behavioral ECC data | ZOE 2.0 cohort with external validation in NHANES 2011–2018 | Tabular variables | Binary Classification | ECC case status |
| Del Real et al.[21] | Chile | Retrospective | Orthodontics | Orthodontic extraction requirement prediction | Clinical Tabular | Clinical & cephalometric orthodontic data | Private orthodontic clinic, Santiago, Chile (2018–2019) | Tabular + numeric | Binary Classification | Extraction decision |
| Boitor et al.[22] | Romania | Cross-Sectional | Periodontology | Identification of periodontal and lifestyle risk factors associated with metabolic syndrome | Clinical Tabular | Periodontal & lifestyle dataset (A) | Hospitalized patients, Sibiu County Emergency Clinical Hospital (Cardiology/Diabetes departments) | Tabular clinical + questionnaire | Binary Classification | Metabolic syndrome |
| Boitor et al.[23] | Romania | Cross-Sectional | Periodontology | Prediction of metabolic syndrome in hospitalized patients with periodontal disease | Clinical Tabular | Periodontal, systemic & HRQoL dataset (B) | Hospitalized patients, Sibiu County Emergency Clinical Hospital (periodontal exam dataset) | Tabular + EQ-5D-5L | Binary Classification | Metabolic syndrome |
| Fatapour et al.[24] | USA | Retrospective | Oral and Maxillofacial Surgery | Prediction of locoregional recurrence in oral tongue squamous cell carcinoma | Oncology Registry | Population-based cancer registry | SEER 2000–2018 cancer registry (USA) | Tabular longitudinal data | Binary Classification | Locoregional recurrence |
| Gomez-Rios et al.[35] | Spain | Cohort Study | Pediatric Dentistry / Special Care Dentistry | Prediction of second deep sedation requirement in pediatric dental patients | Clinical Tabular | Pediatric clinical & sedation dataset | Private dental clinic, Cartagena, Spain (pediatric deep sedation, 2006–2018) | Tabular + tooth-level encoding | Binary Classification | Second deep sedation |
| Hamdan et al.[25] | USA | Retrospective | Operative Dentistry / Oral & Maxillofacial Radiology | Dental restoration segmentation and detection | Dental Imaging | Panoramic radiographs | Adult radiograph records, Marquette University School of Dentistry (2018–2023) | 2D pixel-level labeled images | Segmentation | Dental restorations |
| Kong et al.[26] | South Korea | Retrospective | Prosthodontics / Implantology | Dental implant system classification | Dental Imaging | Periapical radiographs | Implant periapical radiographs, Wonkwang University Dental Hospital (2005–2019) | 2D ROI crops | Multi-Class Classification | Implant system type |
| Kwack et al.[27] | South Korea | Retrospective | Oral and Maxillofacial Surgery | Prediction of medication-related osteonecrosis of the jaw (MRONJ) | Clinical Tabular | MRONJ questionnaire & clinical dataset | Osteoporosis patients on antiresorptive therapy, Dankook University Dental Hospital (2019–2022) | Tabular | Binary Classification | MRONJ occurrence |
| Cheong et al.[28] | United Kingdom | Retrospective | Oral and Maxillofacial Radiology | Paranasal sinus disease detection | Medical Imaging | MRI sinus slices | OASIS-3 public MRI dataset | 2D grayscale MRI | Binary Classification | Paranasal sinus disease |
| Gonzalez et al.[29] | USA | Diagnostic Accuracy (In Vitro) | Pediatric Dentistry | Primary proximal surface caries detection | Dental Imaging | Bitewing radiographs | Pediatric bitewings, Marquette University School of Dentistry | 2D radiographs | Object Detection | Proximal caries location |
| Jeong et al.[30] | South Korea | Experimental | Endodontics / Oral & Maxillofacial Radiology | Multiple root dental disease diagnosis | Dental Imaging | Periapical X-rays | Public Kaggle dataset: “Root Disease X-Ray Images” | 2D root-region images | Multi-Class / Multi-Label Classification | Root disease category |
| Nantakeeratipat et al.[31] | Thailand | Prospective Cross-Sectional | Periodontology / Preventive Dentistry | Dental plaque detection and grading | Dental Imaging | Smartphone plaque images | Dental student plaque images, Srinakharinwirot University, Bangkok | 2D RGB images | Multi-Class / Binary Classification | Plaque severity |
| Yoon et al.[32] | South Korea | Retrospective | Oral and Maxillofacial Surgery | Prediction of intensive care unit (ICU) admission in odontogenic infections | Clinical Tabular | ICU admission dataset | OMFS Department, Dankook University Dental Hospital (2013–2023) | Tabular clinical & laboratory data | Binary Classification | ICU admission |
| Akadiri et al.[33] | Nigeria | Diagnostic Accuracy (In Vitro) | Oral & Maxillofacial Surgery / Oral & Maxillofacial Radiology | Bony jaw lesion classification (benign vs malignant) | Medical Imaging | Panoramic & CT jaw lesions | Radiology archives, University of Port Harcourt Teaching Hospital, Nigeria | 2D panoramic + sectional CT | Binary Classification | Jaw lesion pathology |
| Pais et al.[34] | Portugal | In Silico | Periodontology / Implantology | Peri-implantitis prediction | Metagenomics | Saliva & peri-implant biofilm | NCBI SRA BioProject PRJNA1163384 (oral metagenomes) | Microbiome OTU/relative abundance table | Binary Classification | Peri-implantitis |
2D: two-dimensional; CT: computed tomography; DICOM: Digital Imaging and Communications in Medicine; ECC: early childhood caries; EQ-5D-5L: EuroQol five-dimension five-level questionnaire; HRQoL: health-related quality of life; ICU: intensive care unit; MGI: Modified Gingival Index; MRI: magnetic resonance imaging; MRONJ: medication-related osteonecrosis of the jaw; NCBI: National Center for Biotechnology Information; NHANES: National Health and Nutrition Examination Survey; OMFS: Oral and Maxillofacial Surgery; OTU: operational taxonomic unit; RGB: red–green–blue color model; ROI: region of interest; SEER: Surveillance, Epidemiology, and End Results Program; SRA: Sequence Read Archive
3.3. Study characteristics
Data sources were heterogeneous, spanning three broad domains (Table 1). First, imaging studies used panoramic and periapical radiographs, bitewings, smartphone intraoral photographs, MRI sinus slices and combined panoramic/CT images for tasks such as implant system classification, restoration segmentation, proximal caries detection, root-disease diagnosis, plaque and gingivitis grading, sinonasal disease detection and bony jaw lesion classification [17], [18], [25], [26], [28], [29], [30], [31], [33]. Second, clinical/tabular datasets incorporated demographic, behavioral, orthodontic, periodontal, systemic, sedation, oncologic and ICU-related variables for early childhood caries (ECC) status, metabolic syndrome, orthodontic extraction, locoregional cancer recurrence, second deep sedation, MRONJ and ICU admission [20], [21], [22], [23], [24], [27], [32], [35]. Third, omics studies analyzed supragingival biofilm metabolomics and oral metagenomic profiles to discriminate ECC and peri-implantitis, respectively [19], [34]. Almost all problems were framed as supervised classification (binary or multi-class), with one object-detection and one segmentation task.
As summarized in Table 2, ground truth definitions were generally aligned with accepted standards: implant types derived from electronic health records and inventory logs, radiology ground-truth databases or expert consensus, histopathological examination for jaw lesions, International Caries Detection and Assessment System and decayed–missing–filled surfaces (ICDAS/dmfs) criteria for ECC, National Cholesterol Education Program Adult Treatment Panel III (NCEP ATP III) and related criteria for metabolic syndrome, American Association of Oral and Maxillofacial Surgeons (AAOMS) criteria for medication-related osteonecrosis of the jaw (MRONJ), prospectively recorded intensive care unit (ICU) admission and sedation outcomes, and study-defined clinical groupings for peri-implant health and disease. Imaging labels were usually provided by board-certified specialists, sometimes with resident or student assistance; however, public datasets and some omics-based studies reported less detail regarding annotation procedures.
Table 2.
Dataset annotation, validation, and preprocessing characteristics of the included studies.
| Author | Ground Truth Source | Annotation Method | Annotation Expertise | Dataset Size | Training Split | Validation Split | Test Split | External Validation Size | Validation approach | Class Balance | Missing Data | Data Augmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lee et al. [17] | Electronic dental & medical records | Implant system labeling from EMR and clinical documentation | 5 periodontal residents + 3 board-certified periodontists | 11,980 radiographs | 9584 (80%) | Nil | 2396 (20%) | 0 | Random train–test split | Imbalanced: 6 implant classes (TSIII 46.9%, Implantium 21.0%, Superline 19.7%, Astra 3.2%, SLActive BL 4.5%, BLT 4.7%) | NR | No (only resizing + brightness/contrast normalization) |
| Tobias et al. [18] | Dental selfie images | MGI scoring (0–4) and tooth localization | Single board-certified dentist (blinded) | 1520 tooth images from 190 selfies (44 participants) | 1336 images used in 5-fold CV | 5-fold CV (no fixed validation set) | 184 images held out as test set | 0 | 5-fold cross-validation + hold-out | Imbalanced: MGI scores 0–4 = 471, 475, 195, 98, 283; healthy vs inflamed = 471 vs 1049 | NR | No (hand-crafted features; no image augmentation) |
| Heimisdottir et al. [19] | Community dental screening records | ICDAS-based caries diagnosis | Calibrated dentists/examiners | 289 metabolomic samples | 10-fold CV on full dataset | 10% per fold (within CV) | No independent test set | 0 | 10-fold cross-validation | Imbalanced: ECC 39% cases vs 61% noncases | QRILC imputation for left-censored metabolite values; samples with > 30% missing excluded | No (preprocessing only: normalization, log2 transform, QRILC imputation) |
| Karhade et al. [20] | Clinical dental exams | dmfs> 0 and ICDAS ≥ 3 criteria | Trained and calibrated examiners in ZOE 2.0; NHANES examiners for external validation | 6404 participants | 80% | 20% (internal validation) | No independent test set | 2308 | External validation (NHANES) | Imbalanced: ECC 54% (ZOE 2.0) vs 27% (NHANES external) | Complete-case analysis; no imputation reported | No |
| Del Real et al. [21] | Orthodontic treatment records | Extraction decision + cephalometric landmarking | Two experienced orthodontists (>20–40 yrs); cephalometric tracing by trained operator | 214 orthodontic cases | ∼81% (inner loop of nested 10-fold CV) | ∼9% (inner validation folds) | ∼10% (outer-fold test per CV) | 0 | Nested cross-validation | Imbalanced: Extraction 38% vs non-extraction 62% | Incomplete records excluded; no imputation | No (tabular numeric data only; no augmentation) |
| Boitor et al. [22] | Hospital cardiology/diabetes records | Literature-based metabolic syndrome criteria | Single dentist performed periodontal exam; metabolic syndrome diagnosis by internists | 148 patients | 70% | Nil | 30% | 0 | Random split | NR: Metabolic syndrome class distribution not reported | NR (final analyzed sample n = 148) | No (categorical recoding only; no augmentation) |
| Boitor et al. [23] | Hospital records | NCEP ATP III criteria | Single dentist performed periodontal exam; medical diagnosis by hospital physicians | 296 patients | 70% | Nil | 30% | 0 | Random split | Imbalanced: Metabolic syndrome 168 vs 128 | NR (final analyzed sample n = 296) | No (categorical recoding only; no augmentation) |
| Fatapour et al. [24] | SEER registry | Algorithmic derivation of recurrence label | SEER tumor registrars + automated algorithm | 14,530 (5-year) + 7100 (10-year) OTSCC patients | 80% | 5-fold CV within training set | 20% | 0 | Split + CV | Imbalanced: Recurrence 657 vs 13,873 (5-year); 971 vs 6129 (10-year) | Cases with missing key variables excluded (complete-case analysis) | No (class balancing via under-sampling only; no augmentation) |
| Gomez-Rios et al. [35] | Clinical treatment charts | Rule-based extraction of sedation outcomes | Treating clinicians; examiner calibration not reported | 230 pediatric records | 80% | 0% (3-fold CV within training set) | 20% | 0 | Split + CV | Imbalanced: Second sedation 55 vs 175 | “Not recorded” category used for some variables; no imputation | No (automated preprocessing only: normalization + one-hot encoding) |
| Hamdan et al. [25] | Radiology ground truth database | Pixel/tooth-level restoration labeling with calibration and final OMFR review | Two radiology faculty + three trained students; final reviewer = oral & maxillofacial radiologist | 100 panoramic radiographs (1108 restorations on 960 teeth; 2601 teeth total) | 70% | 20% | 10% | 0 | Random split | Imbalanced: Pixel-level 745,545 restoration vs 66,081,455 background; tooth-level 960 vs 1641 | NR | Yes (horizontal & vertical flip on training images) |
| Kong et al. [26] | Medical & inventory records | Implant system labeling based on EMR and inventory logs | Prosthodontist curated dataset | 4800 periapical implant images (1200 per system) | 80% | 10% | 10% | 0 | Random split | Balanced: 4 implant classes, 1200 images per class | Poor-quality images excluded; no imputation | No (geometric standardization only: cropping, vertical alignment, rotation) |
| Kwack et al. [27] | Postoperative clinical follow-up | AAOMS criteria for MRONJ | Oral & maxillofacial surgeons | 340 patients | 71.5% (n = 243) | 0% (5-fold CV within training set) | 28.5% (n = 97) | 0 | Split + CV | Imbalanced: MRONJ 130 vs controls 210 | Incomplete clinical records excluded (complete-case analysis) | No (tabular preprocessing only; no augmentation) |
| Cheong et al. [28] | Radiology consensus | Majority voting | Three consultant head & neck radiologists | 1376 MRI slices | 80% (n = 1100) | 10% (n = 138) | 10% (n = 138) | 0 | Random split | Imbalanced: Sinonasal disease 777 “Yes” vs 599 “No” | Low-quality images excluded during preprocessing; no imputation | No (no augmentation; only selection of a single MRI slice) |
| Gonzalez et al. [29] | Expert bounding-box annotations | Bounding box annotation + arbitration | Two expert pediatric dentists + board-certified oral & maxillofacial radiologist | 66 bitewing radiographs (755 surfaces: 178 caries, 577 normal) | 70% | 20% | 10% | 0 | Random split | Imbalanced: Carious surfaces 178 vs sound 577 | Unsuitable radiographs excluded; no imputation | Yes (horizontal flip + random augmentations; ∼8 augmented images per original) |
| Jeong et al. [30] | Public Kaggle dataset | Pre-labeled dataset | NR (original annotators not specified) | 2999 intraoral X-rays | 80% | 10% | 10% | 0 | Random split | Balanced: Root disease classes ∼280–380 images each | NR | No (standard preprocessing only; no augmentation) |
| Nantakeeratipat et al. [31] | Image-based quantification | ImageJ surface area analysis for plaque levels | Dental researchers; clinical plaque measured on dental students | 600 undyed tooth images (AutoML subsets: 300 for 3-class; 299 for 2-class) | 80% (240 per subset) | 10% (30 per subset) | 10% (30 for 3-class; 29 for 2-class) | 0 | Random split | Balanced: 3-class plaque dataset (100/class); Imbalanced: 2-class dataset 196 vs 103 | NR (all participants contributed images) | Yes (image flipping and rotation for plaque dataset balancing) |
| Yoon et al. [32] | Hospital records | Direct clinical outcome (ICU admission) | Oral & maxillofacial surgeons | 210 patients | ∼80% per fold (CV training) | ∼20% per fold (CV validation) | Nil | 0 | Cross-validation | Imbalanced: 176 non-ICU vs 34 ICU (16.2% ICU) | Incomplete examinations excluded (complete-case analysis) | No (tabular data only; no augmentation) |
| Akadiri et al. [33] | Histopathology reports | Biopsy-confirmed diagnosis | Pathologists; images curated by oral & maxillofacial surgeons | 86 images across two projects (46 + 40; plus 5 test per project) | All labeled images used for training | Nil | 5 test images per project | 0 | Small hold-out split | Imbalanced: Jaw lesions Project 1 (26 benign vs 20 malignant); Balanced: Project 2 (20 vs 20) | Poor-quality or unconfirmed pathology images excluded; no imputation | No (no augmentation; only high-quality image selection) |
| Pais et al. [34] | Original clinical implant study | Study-defined peri-implant health vs disease grouping | Examiner qualifications not specified in current article | 100 metagenomes (60 biofilm, 40 saliva) from 40 patients | No fixed training split (5-fold CV used) | Nil | Nil | 0 | Cross-validation | NR: HI vs PI counts not fully reported (40 saliva, 60 biofilm samples) | Low-quality sequence reads excluded during QC; no imputation | No (no augmentation; only metagenomic preprocessing and normalization) |
Note: Training split refers to data used to fit the model. Validation split refers to internal validation during model development (e.g., cross-validation folds or a validation subset used for model selection/tuning). Test split refers to an independent hold-out subset used only for final performance evaluation after training/tuning. External validation refers to evaluation on an independent cohort (different institution, population, or time period).
EMR: Electronic Medical Records; MGI: Modified Gingival Index; CV: Cross-Validation; ICDAS: International Caries Detection and Assessment System; ECC: Early Childhood Caries; QRILC: Quantile Regression Imputation of Left-Censored data; dmfs: decayed, missing, and filled surfaces; NHANES: National Health and Nutrition Examination Survey; ZOE 2.0: ZOE Oral Health Study version 2.0; NCEP ATP III: National Cholesterol Education Program Adult Treatment Panel III; SEER: Surveillance, Epidemiology, and End Results Program; OTSCC: Oral Tongue Squamous Cell Carcinoma; OMFR: Oral and Maxillofacial Radiology; MRONJ: Medication-Related Osteonecrosis of the Jaw; AAOMS: American Association of Oral and Maxillofacial Surgeons; MRI: Magnetic Resonance Imaging; ICU: Intensive Care Unit; AutoML: Automated Machine Learning; ImageJ: Image Processing and Analysis in Java; QC: Quality Control; HI: Healthy Implant; PI: Peri-implantitis; NR: Not Reported
Sample sizes ranged from 66 bitewings or 86 CT/panoramic images in the smallest diagnostic studies to > 20,000 registry patients in the SEER-based cancer recurrence work, with many datasets in the low hundreds. Class imbalance was frequent, particularly for adverse outcomes (recurrence, ICU admission, MRONJ, second sedation) and in pixel-level segmentation. Handling of missing data was inconsistent: several studies used complete-case analysis, some introduced “not recorded” categories, and only the metabolomics work reported more advanced censoring/imputation strategies. Imaging studies typically excluded non-diagnostic images rather than imputing missingness. Data augmentation was used in a minority of imaging studies (mostly flips and rotations) and was absent from all tabular and omics work.
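For context, the flip-and-rotation augmentations reported by the minority of imaging studies correspond to standard geometric transforms. The torchvision pipeline below is a hypothetical equivalent; the included studies relied on platform built-ins rather than hand-written code.

```python
# Illustrative geometric augmentation pipeline (hypothetical; the included
# imaging studies used platform-specific augmentation features instead).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip
    transforms.RandomVerticalFlip(p=0.5),    # vertical flip
    transforms.RandomRotation(degrees=15),   # small random rotations
    transforms.ToTensor(),                   # PIL image -> tensor
])
# augmented = augment(pil_image)  # applied independently per training image
```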
3.4. AutoML platforms
A wide variety of AutoML systems was employed (Table 3). Local, general-purpose frameworks (H2O AutoML, TPOT, Auto-WEKA, Auto-sklearn, and the in-house ORIENTATE platform) were used mainly for tabular, metabolomic, and metagenomic data [18], [19], [21], [22], [23], [24], [27], [32], [34], [35]. Cloud-based and commercial platforms, including Google Cloud AutoML Tables and AutoML Vision, Vertex AI, LandingLens, Google Teachable Machine, and Neuro-T, dominated imaging tasks [17], [25], [26], [28], [29], [30], [31], [33], with AutoML Tables also applied to tabular ECC screening [20]. One study combined local evolutionary TPOT with a proprietary cloud “digital phenomics” system (O2Pmgen) to analyze metagenomic data [34].
Table 3.
AutoML platform characteristics and computational considerations of included studies.
| Author | AutoML Platform | Vendor | Platform Type | Automated Feature Engineering (Yes/No) | Automated Model Selection (Yes/No) | Automated Hyperparameter Optimization (Yes/No) | Automated Pipeline End to End (Yes/No) | Explainability (Yes/No) | Code or Model Public (Yes/No) | Hardware or Computation | Computational or Cost Barrier |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lee et al. [17] | Neuro-T AutoML (Neuro-T 2.0.1) | Neurocle Inc. | Local | Yes | Yes | Yes | No | No | No | NR | Computing power constraints and need for larger high-quality datasets; commercial AutoML software |
| Tobias et al. [18] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | NR (method described as lightweight; no hardware reported) | NR |
| Heimisdottir et al. [19] | TPOT AutoML | TPOT Developers | Local | Yes | Yes | Yes | No | Yes | No | TPOT AutoML: 50 random seeds (up to 24 h each) with 10-fold CV; hardware NR | NR |
| Karhade et al. [20] | Google Cloud AutoML Tables | Google Cloud | Cloud | Yes | Yes | Yes | No | No | No | Google Cloud AutoML Tables; compute resources NR | NR |
| Del Real et al. [21] | Auto-WEKA | Auto-WEKA Project Team | Local | Yes | Yes | Yes | No | No | No | Local machine; Auto-WEKA memory cap 2 GB, time budgets 5–60 min and ≥ 18 h; hardware NR | NR |
| Boitor et al. [22] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML; CPU/GPU and memory configuration NR | NR |
| Boitor et al. [23] | H2O AutoML; Auto-sklearn | H2O.ai; Auto-sklearn Team | Local | No | Yes | Yes | No | Yes | No | H2O AutoML + Auto-sklearn with 15-min time budget; hardware NR | NR |
| Fatapour et al. [24] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML with 5-fold CV, 600-s runtime per run (×5); hardware NR | NR |
| Gomez-Rios et al. [35] | ORIENTATE AutoML | In-house Development Team | Local (web-based application) | Yes | Yes | No | No | Yes | Yes | Server with Intel i9–9280X CPU + 2 × NVIDIA RTX 2080Ti GPUs + 32 GB RAM; compute-limited feature selection | Full feature-selection search over many predictors is computationally expensive and constrained by available processing power; case study uses a small dataset |
| Hamdan et al. [25] | LandingLens | Landing AI | Cloud | Yes | Yes | Yes | Yes | No | No | Cloud platform; hardware abstracted from user (CPU/GPU NR) | NR |
| Kong et al. [26] | Google Cloud AutoML Vision | Google LLC | Cloud | Yes | Yes | Yes | Yes | No | No | Google Cloud n1-standard-8 VM + NVIDIA Tesla V100 GPU; 32 node-hours (∼US$100.80) | Cloud training and deployment incur ongoing node-hour costs; ethical and security concerns about uploading medical images to third-party cloud servers |
| Kwack et al. [27] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML training multiple model families; CPU/GPU configuration NR | NR |
| Cheong et al. [28] | Vertex AI AutoML | Google Cloud | Cloud | Yes | Yes | Yes | Yes | No | No | Local PC (Intel i7-8750H + 16 GB RAM) for labeling; Vertex AI training on Google Cloud (∼158.5 min/model) | Cloud AutoML requires paid compute time and internet access; hand-coded ML pipelines would demand substantial specialist time and cost |
| Gonzalez et al. [29] | LandingLens | Landing AI | Cloud | Yes | Yes | Yes | Yes | No | No | LandingLens cloud training (40 epochs); CPU/GPU NR | Data collection and expert annotation for pediatric bitewing datasets are time- and resource-intensive, limiting sample size and generalizability |
| Jeong et al. [30] | Vertex AI AutoML | Google Cloud | Cloud | Yes | Yes | Yes | Yes | No | No | Vertex AI training in Iowa region; 1 h 38 min (8 node-hours); ∼202,315 KRW | Cloud usage incurs node-hour costs; high volumes of training data could increase expense; dependence on Google Cloud platform |
| Nantakeeratipat et al. [31] | Vertex AI AutoML | Google Cloud | Cloud | Yes | Yes | Yes | Yes | No | No | Vertex AI (Gemini 1.5 Pro, M125 release); training region specified; hardware NR | Manual cropping of teeth is time-consuming; model generalization limited by small dataset; cloud compute costs not quantified |
| Yoon et al. [32] | H2O AutoML | H2O.ai | Local | No | Yes | Yes | No | Yes | No | H2O AutoML with 5-fold CV; CPU/GPU configuration NR | NR |
| Akadiri et al. [33] | Google Teachable Machine | Google | Cloud | Yes | No | No | Yes | No | No | In-browser Teachable Machine (MobileNet transfer learning); ∼10–12 min training; hardware NR | Low computational and financial cost; main barriers are small data volume, limited AI literacy and infrastructure in resource-limited settings |
| Pais et al. [34] | TPOT AutoML; O2Pmgen | TPOT Developers; Bioenhancer Systems Ltd. | Hybrid (Local AutoML + Cloud SaaS) | Yes | Yes | Yes | Yes | Yes | No | TPOT (100 generations, pop. size 50, 5-fold CV) on 16-core 2.6 GHz VPS; O2Pmgen run on cloud SaaS digital phenomics platform | High cost and complexity of metagenomic sequencing and bioinformatics; need for specialized computing resources, evolutionary AutoML runs and commercial SaaS licenses |
AutoML: Automated Machine Learning; AI: Artificial Intelligence; ML: Machine Learning; CV: Cross-Validation; CPU: Central Processing Unit; GPU: Graphics Processing Unit; RAM: Random Access Memory; VM: Virtual Machine; NR: Not Reported; GB: Gigabyte; SaaS: Software as a Service; VPS: Virtual Private Server; TPOT: Tree-based Pipeline Optimization Tool; ORIENTATE: Optimization and Resource-efficient INtelligent Tool for Automated feaTure sElection; RTX: Real-Time Ray Tracing; KRW: South Korean Won; US$: United States Dollar
Automated model selection and hyperparameter optimization were near-universal features across platforms. Automated feature engineering, however, was less consistent and often limited in H2O-based tabular studies, where feature derivation remained largely manual. True end-to-end automation (from data upload to deployable model/API) was most evident in commercial cloud systems (LandingLens, AutoML Vision, Vertex AI, Teachable Machine) and in the hybrid TPOT/O2Pmgen workflow, whereas local tools required separate preprocessing and integration steps.
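The gap between local frameworks and end-to-end systems is visible in TPOT, one of the local tools used by the included studies: the evolutionary search automates preprocessing, model selection, and tuning, but data preparation and integration of the exported pipeline remain manual steps. A minimal sketch on synthetic data, assuming the classic TPOT API:

```python
# TPOT sketch (synthetic data): evolutionary search over preprocessing,
# model, and hyperparameter choices; the winning pipeline is exported as
# plain Python, which keeps the result auditable but leaves data
# preparation and deployment to the user.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=0, verbosity=2)
tpot.fit(X_train, y_train)              # evolutionary pipeline search
print(tpot.score(X_test, y_test))       # hold-out performance
tpot.export("best_pipeline.py")         # exported, human-readable pipeline
```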
Use of explainability methods was uneven. Clinical and omics studies using H2O, TPOT, Auto-sklearn or ORIENTATE frequently reported feature importance or SHapley Additive exPlanations (SHAP) analyses to identify influential predictors [19], [22], [23], [24], [27], [32], [34], [35]. In contrast, most imaging papers using commercial cloud platforms did not report saliency mapping or case-level explanations and remained largely “black box” to end-users. None of the studies released full code or pre-trained models publicly.
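As an illustration of the feature-importance analyses these studies reported, the sketch below computes SHAP values for a synthetic gradient-boosting model; this is not any included study’s pipeline, and H2O offers comparable built-in SHAP contributions for its tree models.

```python
# SHAP sketch (synthetic data and model): global feature importance of the
# kind several tabular AutoML studies reported for their best models.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)        # ranks predictors by mean |SHAP| value
```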
Computational descriptions ranged from short browser-based training times in Teachable Machine [33] to multi-hour or multi-node cloud training on Google infrastructure [26], [28], [30], and long TPOT runs with repeated cross-validation [19], [34]. Reported barriers included limited local processing power and memory for exhaustive feature search, ongoing node-hour costs and data-governance concerns for cloud platforms, annotation burden for imaging datasets, and the high cost and complexity of sequencing and bioinformatics pipelines in omics work.
3.5. Model performance and evaluation characteristics
Across tasks, AutoML models generally achieved strong discrimination, particularly for imaging applications (Table 4). Implant-classification, root-disease and restoration-segmentation models reached accuracies and AUCs around 0.95–0.98 and, in some small studies, perfect metrics [17], [25], [26], [29], [30], [33]. Smartphone-based gingivitis and plaque models achieved AUC/area under the precision–recall curve (AUPRC) values above 0.90 with F1-scores > 0.80 [18], [31]. For tabular data, ECC screening and status models performed moderately (AUC ∼0.74–0.80) but were parsimonious and relied on readily available variables [20]. Orthodontic extraction, metabolic-syndrome, and cancer-recurrence studies frequently reported very high performance on internal validation procedures such as cross-validation or on independent hold-out test sets, depending on the study design, with F1-scores often ≥ 0.90 [21], [22], [23], [24]. MRONJ and ICU-admission prediction achieved AUROCs of 0.75–0.84, suggesting clinically useful risk stratification using questionnaire and routine clinical data [27], [32]. Metagenomic peri-implantitis models produced AUCs up to 0.98 for small combinations of taxa, whereas metabolomic ECC models achieved more modest discrimination (AUC ∼0.75) but yielded biologically plausible metabolite signatures [19], [34].
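When comparing these figures, the class imbalance documented in Table 2 matters: AUROC can remain high while AUPRC, which is anchored to the rare positive class, is considerably lower. A small synthetic illustration:

```python
# AUROC vs. AUPRC under class imbalance (synthetic scores): with a ~10%
# positive class, AUROC here is high while AUPRC is substantially lower.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.r_[np.ones(30), np.zeros(270)]                     # 10% positives
y_score = np.r_[rng.uniform(0.4, 1.0, 30), rng.uniform(0.0, 0.7, 270)]

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))
```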
Table 4.
Performance outcomes and clinical implications of included AutoML studies.
| Author | Evaluation metrics | Main findings | Clinical Impact | Limitations | Future Work |
|---|---|---|---|---|---|
| Lee et al. [17] | AUC: 0.954 (95% CI 0.933–0.970; SE 0.011), Youden index: 0.808, sensitivity: 0.955, specificity: 0.853; panoramic AUC: 0.929; periapical AUC: 0.961; per-class AUC range: 0.903–0.981 | Automated DCNN with Neuro-T achieved high accuracy (AUC 0.954) for six implant-system classifications and outperformed most dental professionals. | May serve as a clinical decision-support tool to identify implant systems from routine radiographs and assist management of implant complications. | Limited to six implant types from three hospitals; small dataset; no CBCT/3D inputs; computing-resource constraints; limited interpretability; no external validation. | Larger and more diverse 2D/3D datasets, improved optimization, and clinical feasibility and efficacy studies are required. |
| Tobias et al. [18] | AUC (training 5-fold CV, per tooth position): 0.72–0.99, AUC (test-set, stage 1 healthy vs inflamed): 0.69–1.0, AUC (test-set, stage 2 mild vs severe): 0.84–1.0, F1-scores: 0.82–1.0 across tooth positions | H2O AutoML with tailored image features and a two-stage cascade accurately classified gum-health status (healthy vs inflamed; mild vs severe) with high AUC despite a small dataset. | Supports remote, noninvasive gingivitis monitoring via dental selfies and mHealth apps, especially when in-person visits are limited. | Small single-center pilot dataset; limited to anterior teeth; single-dentist gingivitis grading; manual bounding boxes required; no external validation; restricted generalizability. | Collect larger, diverse datasets with multiple graders; automate tooth localization; explore deep learning; validate links between frontal and overall gum health; evaluate real-time use in the iGAM app. |
| Heimisdottir et al. [19] | Balanced accuracy: 0.66 (10-fold CV), AUC: 0.75 for ECC classification, feature-importance (top metabolites): catechin 1.1% decrease in accuracy when permuted, epicatechin 0.87%, fucose 0.48%, imidazole propionate 0.50%, 9,10-DiHOME 0.47%, N-acetylneuraminate 0.43% | TPOT AutoML using 503 metabolites achieved modest ECC discrimination (AUC ∼0.75) and highlighted biologically relevant metabolites. | Provides biochemical insight into ECC and suggests metabolite biomarkers that may support future ECC risk assessment or disease-activity monitoring. | Cross-sectional design; prevalent ECC only; pooled plaque samples; metabolomics alone insufficient for strong prediction; potential dietary/environmental confounding; no external or longitudinal validation. | Conduct longitudinal and site-specific studies integrating metabolomics with microbial, host, behavioral and environmental data; develop refined ECC taxonomies and multi-omics AutoML models with external validation. |
| Karhade et al. [20] | AUC: 0.74 (best model), sensitivity: 0.67, specificity: 0.66, PPV: 0.64, accuracy: 0.67; external replication-AUC: 0.80, sensitivity: 0.73, PPV: 0.49 | A parsimonious AutoML model using only child age and parent-reported oral health provided modest but useful ECC screening and status imputation. | May augment ECC screening where clinical exams are unavailable and enable ECC burden estimation using minimal proxy information. | Cross-sectional design; ECC prevalence rather than risk prediction; high-risk cohort; reliance on parent-reported oral health; modest performance; no prospective validation | Refine models with biological/contextual predictors; test in diverse prospective cohorts; evaluate feasibility and public health impact in real-world settings. |
| Del Real et al. [21] | Accuracy: up to 93.93%, sensitivity: 0.939, false-positive rate: 0.08, precision: 0.94, F-score: 0.939, MCC: 0.87, ROC-AUC: 0.915, PR-AUC: 0.913 (best model using combined clinical + radiographic data) | Auto-WEKA AutoML generated accurate extraction-prediction models (accuracy up to 93.9%, ROC-AUC 0.915) using cephalometric and clinical features. | May support orthodontists in complex extraction vs non-extraction decisions by providing an additional data-driven recommendation. | Single-center retrospective dataset with small sample; no external validation; risk of overfitting; measurement reliability not assessed; AutoML models difficult to interpret. | Include larger, multi-orthodontist datasets; assess generalizability and overfitting; explore model interpretability; evaluate AutoML support tools in clinical practice. |
| Boitor et al. [22] | Precision: 0.897, recall: 1.000, accuracy: 0.932, F1-score: 0.945 (H2O AutoML – Distributed Random Forest model) | H2O AutoML Distributed Random Forest achieved high metabolic-syndrome prediction accuracy and identified periodontal and systemic factors as major predictors. | Supports the idea that controlling periodontal disease and lifestyle factors may help reduce systemic inflammation and cardiometabolic risk in metabolic syndrome. | Single-center cross-sectional design; limited sample size; incomplete class-distribution reporting; no external/prospective validation; potential overfitting from single train–test split. | Use larger multi-center, longitudinal cohorts; incorporate lifestyle/biomarker data; perform external validation; clarify causal links and clinical usefulness of AutoML models. |
| Boitor et al. [23] | Precision: 1.00 (XGBoost), recall: 1.00, accuracy: 1.00, specificity: 1.00, balanced accuracy: 1.00, F1-score: 1.00; DRF model—precision: 1.00, recall: 0.885, accuracy: 0.932; Auto-sklearn RF—precision: 0.929, recall: 1.00, accuracy: 0.955, balanced accuracy: 0.944 | H2O and Auto-sklearn produced high-performing metabolic-syndrome models; H2O XGBoost reached 100% accuracy, with SHAP identifying key periodontal and systemic predictors. | May aid identification of individuals at high risk for metabolic syndrome and support multidisciplinary prevention strategies based on periodontal and systemic factors. | Single-center retrospective dataset; small sample; no external validation; simple 70/30 split risks model drift; causality between periodontal disease and metabolic syndrome cannot be inferred. | Larger multi-center cohorts with external/prospective validation; update models periodically to address drift; expand AutoML + SHAP pipelines to guide metabolic-syndrome management. |
| Fatapour et al. [24] | AutoML (H2O): Gradient Boosting Machine (GBM) selected as best model; Accuracy: 0.818 (5-year), 0.800 (10-year); Precision: 0.977 / 0.940; Recall (Sensitivity): 0.830 / 0.828; AUC (ROC): 0.75 / 0.74 | H2O AutoML–based SEER framework predicted 5- and 10-year oral tongue squamous cell carcinoma recurrence with ∼80–82% accuracy; SHAP analysis identified number of prior tumors, patient age, tumor site, histology, and treatment variables as the most influential predictors. | May support population-level risk stratification and follow-up planning for OTSCC by identifying patients at higher risk of locoregional recurrence using routinely available clinical variables; positioned as a screening and decision-support tool rather than a standalone clinical predictor. | Single population-based retrospective study using registry data; potential information and coding bias; absence of detailed histopathologic and treatment-timing variables (e.g., depth of invasion, perineural invasion, radiation dose); recurrence rate lower than reported literature due to strict inclusion criteria; no independent external cohort or prospective validation. | Apply the AutoML framework to other cancer types and updated SEER releases; incorporate additional clinical and histopathologic variables; evaluate alternative tree-based models and extended training; perform independent external and prospective validation to assess real-world generalizability. |
| Gomez-Rios et al. [35] | F1-score (class of interest – second sedation): 0.83, balanced accuracy: 0.81, precision: 0.92, recall: 0.75; AUC (ROC): 0.92; class “no sedation”—precision: 0.87, recall: 0.96, F1-score: 0.92 | ORIENTATE AutoML enabled non-technical researchers to build interpretable models; an 8-predictor decision tree achieved F1 = 0.83 and AUC 0.92 for sedation prediction. | May support preventive planning by identifying children—especially SHCN—at higher risk of second sedation and highlighting modifiable factors. | Single-center retrospective cohort of 230 cases; highly imbalanced outcome; no external validation; limited generalizability; SHAP importance may be misinterpreted causally. | Apply ORIENTATE to larger datasets; improve and benchmark feature-selection algorithms; integrate causal inference; explore clinical deployment and broad validation. |
| Hamdan et al. [25] | Sensitivity (pixel): 86.64%, specificity (pixel): 99.78%, accuracy (pixel): 99.63%, precision (pixel): 82.4%, F1-score (pixel): 0.844; sensitivity (tooth): 98.34%, specificity (tooth): 98.13%, accuracy (tooth): 98.21%, precision (tooth): 98.85%, F1-score (tooth): 0.98; AUC: 0.978 (95% CI 0.959–0.998) | LandingLens no-code vision model accurately segmented dental restorations (AUC 0.978) with pixel- and tooth-level performance comparable to coded deep-learning systems. | Suggests that no-code AI platforms can democratize dental imaging model development and potentially improve diagnostic workflow efficiency. | Only 100 panoramic images from a single institution/device; limited age and inclusion criteria; no cross-validation or external validation; only one no-code platform tested; difficulty segmenting certain restorations. | Validate multiple no-code platforms on larger, diverse datasets; explore additional imaging tasks; incorporate cross-validation and ensembling; evaluate regulatory pathways and clinical deployment. |
| Kong et al. [26] | Accuracy: 0.981, precision: 0.963, recall: 0.961, specificity: 0.985, F1-score: 0.962; average precision (AUPRC): 0.983; per-class accuracy ranged from 0.965 to 1.000 | Google AutoML Vision CNN achieved high accuracy for classifying four implant systems from periapical radiographs, enabling accurate no-code implant identification. | May support future tools for identifying unknown implant systems in clinical practice, aiding treatment planning and complication management. | Single-center high-quality periapical images; only four implant systems; no external validation; real-world variability not represented; AutoML algorithmic transparency limited; potential cloud-storage data concerns. | Collect larger, multi-center datasets; include more implant systems; test performance on heterogeneous real-world images; address privacy and governance for cloud deployment. |
| Kwack et al. [27] | AUC (training): 0.8283, sensitivity (training): 83.7%, specificity (training): 71.3%; AUC (test): 0.7526, sensitivity (test): 88.6%, specificity (test): 52.8% | H2O AutoML developed MRONJ prediction models (train AUC 0.8283; test AUC 0.7526) identifying medication duration, age, and surgical factors as key predictors. | May help clinicians estimate MRONJ risk preoperatively and counsel osteoporotic patients using easily obtainable questionnaire data. | Single-center retrospective dataset; limited to female osteoporosis patients; reliance solely on questionnaire data; no imaging/lab variables; no external validation; non-uniform variables may lead to over/underfitting. | Conduct multi-center, larger studies including radiographic/laboratory/genetic data; perform prospective validation; develop advanced models for MRONJ decision-support. |
| Cheong et al. [28] | Sensitivity: 91.3%, specificity (precision): 92.8%, accuracy: 92%; TP: 63, TN: 64, FP: 5, FN: 6 (Vertex AI image classification model) | Vertex AI AutoML MRI classifier achieved high sinonasal disease detection accuracy (∼92%), showing clinicians can develop MRI models without coding. | May support MRI pre-screening systems for sinonasal disease to streamline radiology workflows and reduce reporting backlogs. | Only one coronal MRI slice analyzed; may miss disease in other planes; dataset differs from real-world scans; no symptom correlation; no external validation or multi-slice/multi-label testing. | Expand to multi-slice and multi-label MRI tasks; implement saliency mapping; test generalizability on real-world clinical MRI; evaluate CDSS deployment. |
| Gonzalez et al. [29] | Sensitivity: 0.888, specificity: 0.988, precision: 0.958, accuracy: 0.964, F1-score: 0.921, ROC-AUC: 0.965 (no-code AI, LandingLens™) | LandingLens no-code caries model accurately identified pediatric proximal caries (ROC-AUC 0.965), demonstrating strong clinical potential. | Supports AI-assisted early detection of proximal caries in pediatric patients, especially where expert radiology support is limited. | Small single-center dataset (66 bitewings); no external validation; exclusion of low-quality images reduces real-world applicability; radiographic gold standard instead of histology; possible missed early lesions. | Build larger multi-institutional pediatric datasets; include varied sensors and image quality; perform external validation; conduct prospective radiographic–clinical studies. |
| Jeong et al. [30] | AUC: 0.967, precision: 95.6%, recall (reproduction rate): 95.2%, log loss: 0.085 (Vertex AI AutoML, multi-label dental root disease classification) | Google Vertex AutoML root-disease classifier achieved high performance (AUC 0.967; precision 95.6%; recall 95.2%) supporting clinical diagnostic use. | Shows that non-programmers can use no-code AutoML to build clinically meaningful dental X-ray diagnostic models, improving accessibility of AI in dentistry. | Limited to 2999 images from a public non-Korean dataset; uncertain ground-truth process; no external validation or clinical deployment; restricted to root-disease classification. | Use larger, locally collected datasets; expand disease categories; validate externally and prospectively; assess clinical workflow integration and cost-effectiveness. |
| Nantakeeratipat et al. [31] | AUPRC (3-class model): 0.907, precision: 86.2%, recall: 83.3%, F1-score: 84.7%; heavy-label precision 100% and recall 90%; AUPRC (2-class model): 0.964, precision: 93.1%, recall: 93.1%, F1-score: 93.1% | Vertex AI AutoML on smartphone photos accurately classified plaque levels; binary plaque acceptability achieved AUPRC 0.964 and F1 93.1%, enabling dye-free detection. | Could support chairside and at-home plaque monitoring via smartphones, reducing the need for disclosing dyes and enabling scalable preventive care. | Single-center sample of 100 dental students; restricted to upper anterior teeth; small, standardized dataset; manual cropping required; no external validation or testing in diverse populations. | Develop larger, more diverse datasets; build multi-tooth/image-level models; validate across different clinics, smartphones and lighting; integrate into user-friendly apps. |
| Yoon et al. [32] | AUROC (model 1 training): 0.9191, AUROC (model 1 cross-validation): 0.8289; AUROC (model 2 training): 0.9263, AUROC (model 2 cross-validation): 0.8415 (H2O AutoML GLM best performer) | H2O-AutoML GLM using clinical ± MDCT fascial-space data achieved AUROC ≥ 0.83 for predicting ICU admission in odontogenic infection. | Suggests ML models based on simple clinical and laboratory data could assist early ICU triage decisions for patients with oral infections. | Single-center retrospective dataset of 210 patients; no external validation; chart-based variables only; ML model complexity prevents simple clinical thresholds; limited generalizability. | Develop systems to gather/share large-scale blood test and exam data; perform external and prospective validation of ICU prediction models across institutions. |
| Akadiri et al. [33] | Accuracy: 1.00, precision: 1.00, recall (sensitivity): 1.00, F1-score: 1.00, ROC-AUC: 1.00 for both models (Project 1: malignant vs benign; Project 2: fibrous dysplasia vs ossifying fibroma) | Google Teachable Machine no-code CNN distinguished malignant vs benign lesions and FD vs OF with perfect metrics (AUC 1.0) despite very small datasets. | May improve diagnostic workflows in OMFS in low-resource settings by augmenting radiologist and surgeon interpretations. | Very small training and test samples; no external validation; high similarity between train and test sets; overfitting likely; black-box CNN with limited interpretability. | Collect larger multi-center datasets with standardized imaging; use rigorous validation (cross-validation + external); incorporate explainable AI; expand AI education for OMFS clinicians. |
| Pais et al. [34] | AUC: 0.70–0.98 (ML-driven biomarker scoring models), sensitivity: 80–100%, specificity: 80–95% for top models; saliva-based models achieved sensitivity: 95–100% and specificity: 95%; single-microorganism models showed sensitivity: up to 95% but specificity: ≤ 42%, AUC: ≤ 0.56 | AI-driven AutoML using metagenomic microbiome data yielded high-performing peri-implantitis classifiers; saliva-based 2–4 microbe combinations achieved 80–100% AUC, sensitivity and specificity. | Indicates that saliva- and biofilm-based microbial biomarkers may form the basis of future non-invasive, cost-effective diagnostic tools. | Small sample (40 patients, 100 metagenomes) from a single population; high-dimensional data increases overfitting risk; no demographic/clinical covariates; dependence on reference databases; no external validation. | Build larger, diverse cohorts; externally validate microbiome models; integrate demographic and clinical factors; develop affordable qPCR-style biomarker assays; create interpretable clinical decision tools. |
AI: Artificial Intelligence; AutoML: Automated Machine Learning; AUC: Area Under the Curve; ROC-AUC: Receiver Operating Characteristic Area Under the Curve; AUROC: Area Under the Receiver Operating Characteristic Curve; AUPRC: Area Under the Precision–Recall Curve; CBCT: Cone-Beam Computed Tomography; CDSS: Clinical Decision Support System; CI: Confidence Interval; CNN: Convolutional Neural Network; CV: Cross-Validation; DCNN: Deep Convolutional Neural Network; DiHOME: Dihydroxy-octadecenoic Acid; DRF: Distributed Random Forest; ECC: Early Childhood Caries; F1: F1 Score; FD: Fibrous Dysplasia; FN: False Negative; FP: False Positive; GBM: Gradient Boosting Machine; GLM: Generalized Linear Model; ICU: Intensive Care Unit; iGAM: intelligent Gingival Assessment Model; IOS: Intraoral Scanner; Log Loss: Logarithmic Loss; MCC: Matthews Correlation Coefficient; MDCT: Multidetector Computed Tomography; mHealth: Mobile Health; ML: Machine Learning; MRI: Magnetic Resonance Imaging; MRONJ: Medication-Related Osteonecrosis of the Jaw; OF: Ossifying Fibroma; OMFS: Oral and Maxillofacial Surgery; ORIENTATE: Outcome-oRiented Interpretable ENsemble Auto-ML Tool; OTSCC: Oral Tongue Squamous Cell Carcinoma; PPV: Positive Predictive Value; PR: Precision–Recall; PR-AUC: Precision–Recall Area Under the Curve; qPCR: Quantitative Polymerase Chain Reaction; RF: Random Forest; ROC: Receiver Operating Characteristic; SE: Standard Error; SEER: Surveillance, Epidemiology, and End Results Program; SHAP: SHapley Additive exPlanations; SHCN: Special Health Care Needs; TN: True Negative; TP: True Positive; TPOT: Tree-based Pipeline Optimization Tool; XGBoost: Extreme Gradient Boosting; Youden Index: Youden's J Statistic (Sensitivity + Specificity − 1).
Despite frequently strong discrimination metrics, key methodological components necessary for clinical decision support were seldom evaluated. Across all nineteen included studies, none reported formal calibration assessment, such as calibration plots, Brier scores, or calibration slope and intercept. Similarly, quantitative evaluation of subgroup performance was not undertaken; results were almost universally presented as overall accuracy or AUC values without stratification by demographic characteristics, acquisition settings, or disease severity. Although one investigation illustrated ambiguous cases, it did not provide statistical comparison between groups [31]. Reporting of failure behavior was limited. Sixteen studies relied exclusively on aggregate metrics or confusion matrices without describing the clinical contexts of error [17], [18], [19], [20], [21], [22], [23], [24], [27], [28], [29], [30], [32], [33], [34], [35], whereas only three offered narrative or visual insight into misclassifications [25], [26], [31]. Transparency of the AutoML process also varied. Explicit version identifiers or sufficiently detailed pipeline information enabling reproducibility were available in only three studies [19], [31], [34], partial configuration details without version traceability were present in one report [35], and the remaining investigations did not provide enough information to reconstruct the modeling workflow. Finally, none of the studies clearly documented safeguards ensuring separation of training and testing data at the patient level. Together, these observations indicate that, although discrimination performance was frequently high, reporting of calibration, subgroup robustness, failure characteristics, and reproducibility requirements was generally insufficient to assess clinical deployment considerations.
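To make these omitted quantities concrete, the following minimal sketch (in Python, using synthetic stand-in data rather than data from any included study; scikit-learn ≥ 1.2 is assumed for `penalty=None`) shows how a Brier score and a calibration slope and intercept could be reported from a hold-out set:

```python
# Illustrative only: synthetic outcomes and predicted risks stand in for
# study data; no included study reported these quantities.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.05, 0.95, 500)   # model-predicted risks (stand-in)
y_true = rng.binomial(1, y_prob)        # observed binary outcomes (stand-in)

# Brier score: mean squared difference between predicted risk and outcome.
brier = brier_score_loss(y_true, y_prob)

# Calibration slope and intercept: refit the outcome on the logit of the
# predicted risk; slope ~1 and intercept ~0 indicate good calibration.
logit = np.log(y_prob / (1 - y_prob)).reshape(-1, 1)
recal = LogisticRegression(penalty=None).fit(logit, y_true)  # sklearn >= 1.2

print(f"Brier={brier:.3f}, slope={recal.coef_[0][0]:.2f}, "
      f"intercept={recal.intercept_[0]:.2f}")
```

A slope near 1 and an intercept near 0 indicate well-calibrated risks; systematic deviations signal over- or underconfidence that discrimination metrics alone cannot reveal.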
Across the included studies, performance was predominantly estimated using internal validation (18 of 19 studies), including single random train–test splits, k-fold cross-validation, or nested cross-validation within the same dataset. As summarized in Table 2, external validation was uncommon, with most studies reporting an external validation size of 0; the principal exception was Karhade et al. [20], which evaluated the model in an independent cohort.
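Because patient-level separation safeguards were likewise undocumented, the brief sketch below (all identifiers, labels, and array shapes are hypothetical) illustrates one standard safeguard, grouped cross-validation, which keeps every image from a given patient on one side of the training–test boundary:

```python
# Minimal sketch, assuming several images per patient; identifiers and
# shapes here are hypothetical, not drawn from any included study.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(12, 4)                      # e.g., features for 12 radiographs
y = np.random.randint(0, 2, 12)                # image-level labels
patients = np.repeat([101, 102, 103, 104], 3)  # 3 images per patient

# GroupKFold assigns whole patients to folds, so patient-specific patterns
# can never leak from training data into the test split.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, patients):
    assert set(patients[train_idx]).isdisjoint(patients[test_idx])
```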
Studies consistently framed AutoML outputs as decision-support tools rather than standalone diagnostic systems. Proposed uses included assisting identification of unknown implant systems, guiding complex extraction decisions, flagging children at risk of second deep sedation or severe ECC, supporting cardiometabolic risk assessment from periodontal data, triaging oncologic follow-up or ICU admission, and enabling smartphone-based plaque and gingivitis monitoring at home. Omics studies emphasized AutoML’s role in biomarker discovery and prioritization rather than immediate clinical deployment.
At the same time, every study acknowledged significant limitations. Many datasets were small, single-center, and subject to spectrum and selection bias; external validation was rare, and prospective or interventional evaluations were absent. Several of the highest-performing models (including those with perfect metrics) were trained and tested on very limited image sets, raising a high risk of overfitting. Cross-sectional designs limited inference to disease status rather than future risk; reliance on self-reported or registry variables introduced potential misclassification; calibration and fairness were largely unreported; and explainability for complex models remained limited. None of the AutoML systems described had obtained regulatory approval, and real-world deployment was not evaluated.
Overall, the synthesis of Tables 1–4 indicates that AutoML can achieve competitive, and often excellent, predictive performance across diverse dental applications using heterogeneous data sources, while lowering technical barriers for non-programmer clinicians. However, substantial heterogeneity in study design, outcome definitions, AutoML platforms, and validation strategies, together with small samples and lack of external or prospective evaluation, precludes formal meta-analysis and underscores the need for larger, multicenter, rigorously validated and implementation-focused AutoML studies in dentistry.
3.6. Risk of Bias
Using the PROBAST tool (Table 5), a high risk of bias was most frequently observed in the analysis domain, with 15 of 19 studies (78.9%) rated as high risk, mainly due to small sample sizes, lack of external validation, reliance on single random data splits, and overfitting concerns. The participants domain also showed substantial risk of bias, with 13 studies (68.4%) rated as high risk, largely attributable to single-center recruitment, spectrum bias, case–control enrichment, and restricted study populations.
Table 5.
Risk of bias of included studies.
| Study | Participants | Predictors | Outcome | Analysis | Overall |
|---|---|---|---|---|---|
| Lee et al. [17] | low | low | low | high | high |
| Tobias et al. [18] | high | high | high | high | high |
| Heimisdottir et al. [19] | high | high | low | high | high |
| Karhade et al. [20] | low | low | low | high | high |
| Del Real et al. [21] | high | low | low | low | high |
| Boitor et al. [22] | high | high | high | high | high |
| Boitor et al. [23] | high | high | low | high | high |
| Fatapour et al. [24] | low | low | low | low | low |
| Gomez-Rios et al. [35] | low | low | low | high | high |
| Hamdan et al. [25] | high | low | low | high | high |
| Kong et al. [26] | low | low | low | low | low |
| Kwack et al. [27] | high | low | low | high | high |
| Cheong et al. [28] | high | low | low | high | high |
| Gonzalez et al. [29] | high | low | low | high | high |
| Jeong et al. [30] | low | low | low | high | high |
| Nantakeeratipat et al. [31] | high | low | low | high | high |
| Yoon et al. [32] | high | low | low | low | high |
| Akadiri et al. [33] | high | low | low | high | high |
| Pais et al. [34] | high | low | low | high | high |
| Overall (%) | low: 31.6%, high: 68.4% | low: 78.9%, high: 21.1% | low: 89.5%, high: 10.5% | low: 21.1%, high: 78.9% | low: 10.5%, high: 89.5% |
In contrast, the predictors domain demonstrated predominantly low risk of bias, with 4 studies (21.1%) rated as high risk, mainly due to subjective feature engineering or high-dimensional handcrafted predictors. The outcome domain showed the lowest overall risk of bias, with only 2 studies (10.5%) rated as high risk, as most studies employed objective and well-defined reference standards.
Overall, while predictor measurement and outcome assessment were methodologically robust in most studies, the dominant limitation across the literature remains inadequate statistical analysis and validation strategies, substantially limiting the generalizability of many reported models.
4. Discussion
4.1. Translational Maturity of AutoML in Dentistry
This systematic review synthesizes current evidence on the application of AutoML in oral healthcare and clarifies its present role within dental AI research. Overall, AutoML was primarily used as a methodological facilitator for model development, exploratory analysis, and feasibility testing, rather than as a clinically integrated or deployment-ready technology. This mirrors trends observed across broader healthcare domains, where AutoML is largely employed to reduce technical barriers and accelerate experimentation rather than support validated clinical decision-making systems [36], [37]. The predominance of retrospective designs and moderate sample sizes therefore reflects the early translational stage of AutoML adoption in oral health.
4.2. Performance, Validation, and Generalizability
Across included studies, AutoML-based models demonstrated consistently strong discrimination on internal validation (e.g., cross-validation/validation splits) and/or hold-out testing, particularly in imaging tasks such as radiographic diagnosis, implant classification, and photographic assessment of oral conditions. Similar findings have been reported in general medical imaging, where deep learning models often achieve high internal accuracy [38]. However, these results were typically derived from single-center datasets with narrow inclusion criteria, which can inflate performance estimates. Models trained on such homogeneous data may capture site-specific patterns that fail to generalize, as evidenced by marked performance drops when medical imaging models are tested across institutions or populations [39], [40]. Consequently, although current dental AutoML models exhibit strong internal discrimination, their true clinical utility remains uncertain without external validation on diverse datasets.
Most studies in this review remained at the internal validation stage, while external validation was uncommon and prospective clinical evaluation was absent. Consequently, the available literature supports the technical feasibility of clinician-accessible AutoML, but does not yet justify conclusions regarding implementation effectiveness, workflow impact, or patient outcomes.
4.3. Platform Differences: Local versus Cloud Ecosystems
Interestingly, several studies reported comparable performance across different AutoML platforms. This suggests that data quality, outcome definition, and task formulation are more influential determinants of model performance than the specific algorithm or platform used. Emphasis on standardized annotations, refined cohort definitions, and consistent outcome variables is therefore likely to yield greater performance gains than increasing algorithmic complexity [39], [41], [42]. In this context, richer and more representative data inputs are more valuable than fine-tuning sophisticated models on poorly curated datasets.
However, comparable predictive performance did not imply equivalence in how systems were implemented, governed, or interpreted in practice. While predictive accuracy did not appear to depend consistently on the specific AutoML engine, important differences emerged in infrastructure, transparency, computational control, and implementation pathways. Local frameworks such as H2O, TPOT, Auto-WEKA, Auto-sklearn, and ORIENTATE were used predominantly for structured clinical, registry, and omics datasets. These environments typically require greater user involvement in preprocessing and pipeline preparation; however, they simultaneously enable closer methodological supervision of model construction. Studies employing local systems far more frequently incorporated interpretability strategies, including feature-importance ranking and SHAP-based analyses [19], [22], [23], [24], [27], [32], [34], [35], suggesting stronger compatibility with research scenarios in which auditability and reproducibility are central aims. In addition, maintaining computation within institutional infrastructure generally avoids external data transfer and may therefore simplify governance of sensitive clinical information.
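As a hedged illustration of such a local workflow (the file name, target column, and settings below are hypothetical and not drawn from any reviewed study), an H2O AutoML run keeps computation and interpretability outputs within institutional infrastructure:

```python
# Hypothetical example: dataset path, target column, and settings are
# assumptions for illustration, not details from a reviewed study.
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                           # local JVM; no cloud upload
frame = h2o.import_file("periodontal_records.csv")   # hypothetical dataset
frame["outcome"] = frame["outcome"].asfactor()       # binary target as categorical

aml = H2OAutoML(max_models=20, nfolds=5, seed=1)
aml.train(y="outcome", training_frame=frame)         # remaining columns as predictors

print(aml.leaderboard.head())      # cross-validated ranking of candidate models
h2o.explain(aml.leader, frame)     # variable importance and SHAP-style summaries
```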
In contrast, cloud ecosystems such as LandingLens, Vertex AI, AutoML Vision, and Teachable Machine were concentrated primarily in image-based applications and emphasized accessibility, rapid deployment, and abstraction of algorithmic complexity. Multiple reports demonstrated that clinicians with minimal programming background could produce high-performing models within short time frames through graphical interfaces and automated optimization pipelines [25], [28], [30], [33]. Nevertheless, this convenience is accompanied by reduced visibility into internal search processes, restricted customization, limited reporting of intermediate modeling decisions, and infrequent application of explainability tools.
Operational trade-offs are evident. Cloud implementations introduce ongoing node-hour expenses, reliance on internet connectivity, and dependence on vendor-managed environments [26], [28], [30], whereas locally deployed platforms are more often constrained by hardware capacity, memory limitations, and prolonged optimization cycles, particularly in evolutionary approaches [19], [34]. Taken together, platform selection therefore appears not as a purely technical preference but as a strategic alignment with the primary objective of a project. Investigations focused on biomarker discovery, hypothesis exploration, or transparent clinical risk modeling may derive greater value from local environments that prioritize interpretability and methodological control. By contrast, rapid prototyping, educational use, and early feasibility demonstrations may benefit more from cloud infrastructures that minimize engineering overhead and accelerate experimentation.
For the field to progress toward clinical translation, future reports should describe platform-specific constraints with greater clarity, including data custody, ownership of derived models, retraining permissions, financial dependencies, and availability of explanation interfaces. Without such information, evaluation of scalability and readiness for real-world integration remains incomplete. Strengthening reporting in these areas will be essential for transforming AutoML from a proof-of-concept accelerator into a trustworthy component of deployable dental decision-support systems.
4.4. Democratization of Model Development
Beyond these technical and operational considerations, AutoML also reshapes who can participate in model development and how quickly clinically relevant questions can be translated into empirical investigation. Its contribution in dentistry is not limited to a simple decrease in programming effort: although AutoML does not remove the need for appropriate data preparation, clinical supervision, or critical interpretation of results, it alters how and by whom artificial intelligence investigations can be initiated and executed. By automating algorithm selection, model construction, and hyperparameter optimization, AutoML enables clinicians to move from passive recipients of externally developed systems toward active participants in model development [3], [11], [36]. This transition is particularly relevant in healthcare environments where access to dedicated data-science teams may be limited, financially restrictive, or associated with long development timelines [3], [36].
AutoML also markedly increases development velocity. Instead of investing substantial time in building bespoke pipelines before any evaluation is possible, clinicians can explore feasibility, compare alternative predictor definitions, and test multiple hypotheses within short timeframes. This capacity for rapid iteration has been identified as a central advantage of AutoML in clinical research [8], [10], [12]. Furthermore, AutoML promotes methodological consistency. Standardized and automated workflows reduce variability arising from individual programming strategies, thereby improving reproducibility and facilitating comparisons across studies and institutions, which are prerequisites for trustworthy medical AI [3], [36].
Dependence on specialized engineers is not only a logistical matter but may also restrict which ideas are pursued. When development requires extensive technical mediation, many clinically relevant hypotheses may remain untested. AutoML lowers this structural barrier and broadens participation in AI research, enabling more practitioner-driven questions to be explored [11]. Therefore, even if AutoML does not eliminate all technical requirements, its principal value lies in democratizing innovation, accelerating exploratory research, and enabling clinician-led model development, thereby expanding the research capacity of the dental community.
4.5. Methodological Quality, Reporting, and Evaluation Practices
At present, AutoML functions mainly as an efficient framework for model selection and hyperparameter optimization within predefined search spaces and remains fundamentally constrained by the quality and consistency of input data, which it cannot inherently correct [36], [43]. This reinforces the need to prioritize data curation and standardization over reliance on black-box optimization for marginal performance improvements. In dental AI, assembling high-quality, heterogeneous datasets with precise outcome definitions is likely to have a greater impact than switching between AutoML platforms.
Despite favorable internal metrics, evaluation in dental AutoML studies remains largely discrimination-focused. Our synthesis highlights a persistent translational gap: high AUC or accuracy does not ensure that predictions are reliable, equitable, or safe for routine care. Without calibration analysis, predicted risks cannot be assumed to represent true probabilities at clinically relevant thresholds [44], [45]. Similarly, the lack of subgroup evaluation limits confidence in performance across populations, acquisition settings, and disease spectra, while sparse reporting of failure modes obscures where models may break down. Reproducibility is further constrained by incomplete documentation of platform configuration, workflow traceability, and safeguards against data leakage, particularly in no-code or low-code environments [46]. Future studies should therefore report calibration, subgroup performance, structured error analysis, and transparent patient-level data separation in addition to discrimination metrics. These elements are also essential for interpreting decision-analytic methods such as decision-curve analysis (DCA) when clinical use is proposed [47].
Most studies emphasized discrimination metrics such as accuracy, AUC, sensitivity, and specificity, whereas calibration and DCA were infrequently reported. DCA is designed to estimate clinical utility by comparing the net benefit of model-guided decisions with default strategies (e.g., treat all vs treat none) across clinically reasonable threshold probabilities. Its interpretation therefore requires explicit specification of the intended clinical action, appropriate comparators, and a justified threshold range aligned with the predicted outcome [48], [49]. Importantly, DCA should clarify the consequences of decisions under prespecified preferences rather than be used to identify an “optimal” threshold by maximizing net benefit [49]. Practical guidance further stresses restricting analyses to plausible thresholds and avoiding presentation choices that may exaggerate benefit [50], [51]. Although conclusions are often directionally correct, shortcomings in application and reporting, particularly unclear threshold justification and limited attention to optimism, are common [52].
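For orientation, net benefit at a threshold probability pt is the true-positive proportion minus the false-positive proportion weighted by the odds of the threshold, pt/(1 − pt) [48]. The sketch below (synthetic outcomes and risks; the threshold range is illustrative and would require clinical justification in any real application) shows the computation underlying a decision curve:

```python
# Illustrative computation only; outcomes, risks, and thresholds are
# synthetic stand-ins, not values from any reviewed study.
import numpy as np

rng = np.random.default_rng(1)
y_prob = rng.uniform(0.05, 0.95, 500)   # model-predicted risks (stand-in)
y_true = rng.binomial(1, y_prob)        # observed binary outcomes (stand-in)

def net_benefit(y_true, y_prob, pt):
    """Net benefit of acting on patients whose predicted risk exceeds pt."""
    n = len(y_true)
    act = y_prob >= pt
    tp = np.sum(act & (y_true == 1))
    fp = np.sum(act & (y_true == 0))
    return tp / n - fp / n * pt / (1 - pt)  # FPs weighted by threshold odds

for pt in (0.1, 0.2, 0.3):                  # prespecified plausible range
    model_nb = net_benefit(y_true, y_prob, pt)
    treat_all = net_benefit(y_true, np.ones_like(y_prob), pt)
    print(f"pt={pt:.1f}: model={model_nb:.3f}, treat-all={treat_all:.3f}")
# 'Treat none' has net benefit 0 at every threshold by definition.
```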
In this synthesized AutoML evidence, explicit linkage between predicted risk, clinical action, and threshold justification was uncommon. Consequently, even when favorable curves were presented, implications for patient management remain uncertain, and clinical readiness may be overstated. This limitation parallels broader critiques of medical AI, where models with high discrimination often fail clinically due to poor calibration or misaligned decision thresholds [44], [47]. Notably, almost no study examined how AutoML-based predictions would influence clinical decisions or patient outcomes. Without such evaluations, high AUC values may be misleading, particularly in dentistry, where diagnostic and treatment decisions are often subjective and carry significant consequences.
4.6. Transparency, Interpretability, and Accountability
In addition to numerical validity, translation into practice also depends on whether systems are understandable, auditable, and accountable. Across studies, AutoML models were consistently framed as decision-support tools rather than autonomous systems. This aligns with ethical and regulatory consensus in healthcare AI, which emphasizes augmenting clinician judgment while preserving human oversight and accountability [53], [54]. None of the reviewed studies proposed unsupervised or closed-loop automation, and all retained clinicians as final decision-makers. This cautious, assistive approach is appropriate given current technological and regulatory constraints and supports the paradigm of augmented intelligence rather than replacement.
Differences in AutoML application across data modalities were evident. Imaging-based studies frequently reported high predictive performance but rarely explored model reasoning, failure modes, or dataset biases. Few provided visual explanations such as saliency maps or systematic misclassification analyses, reflecting broader concerns about opacity in medical imaging AI [55], [56], [57]. In contrast, studies using structured clinical or omics data more frequently incorporated interpretability analyses, such as feature importance rankings or SHAP values [58], [59]. However, these efforts were largely associative and did not establish causal relationships or yield novel clinical insights beyond existing knowledge [46], [60]. Overall, interpretability in dental AutoML remains limited, which may hinder trust, regulatory approval, and clinical uptake. Future studies should adopt more robust explainability methods, particularly for image-based models, and routinely report known limitations and performance degradation under domain shifts.
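As a minimal, synthetic example of the SHAP-style attribution those studies reported (the model, predictors, and data below are illustrative assumptions; the open-source `shap` package is required):

```python
# Synthetic illustration of SHAP-based feature attribution; not a
# reproduction of any reviewed analysis. Requires the 'shap' package.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                    # 5 hypothetical predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # outcome driven by two of them

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)            # efficient for tree ensembles
shap_values = explainer.shap_values(X)

# Mean absolute SHAP value per predictor gives a global importance ranking;
# the first two features should dominate by construction.
print(np.abs(shap_values).mean(axis=0))
```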
Concerns regarding transparency, reproducibility, and accountability were also prominent. While automation lowers technical barriers, it often obscures critical aspects of model development [61]. Many dental AutoML studies provided limited documentation of model selection, feature engineering, or training procedures, which undermines trust and reproducibility. In regulated healthcare settings, insufficient access to technical details complicates auditing and accountability, particularly since clinicians and institutions remain responsible for AI-influenced decisions [62], [63]. Emerging guidelines such as DECIDE-AI and MINIMAR emphasize the need for improved documentation and governance, even for automated systems [64], [65]. Dental AI studies should therefore report sufficient technical detail to enable independent validation and clearly define responsibility for model errors.
4.7. Task Formulation and the Economics of Annotation
Most AutoML applications were framed as supervised classification, while object detection and segmentation were comparatively rare. This imbalance is notable, given the growing interest in localization tasks within dental AI. The three formulations differ substantially in both informational output and clinical utility. Classification provides a direct image- or patient-level decision, making it attractive for screening, triage, and rapid support, but it lacks spatial information regarding lesion extent or anatomical relationships [66]. Detection augments classification by localizing candidate regions, typically via bounding boxes, thereby offering spatial context without pixel-level delineation [67]. Segmentation yields the most detailed representation, enabling quantitative assessment, surgical planning, and longitudinal monitoring [68]. The richer outputs of detection and segmentation, however, come with substantially higher methodological demands. Pixel-level annotation requires expert labor, standardized protocols, and specialized tools and remains a central bottleneck in medical imaging AI [69]. In contrast, classification labels are often retrievable from routine documentation because diagnoses are already captured during daily care [70]. This dramatically lowers the threshold for dataset creation, an important consideration in AutoML projects that are frequently initiated without dedicated annotation infrastructure. The dominance of classification may also mirror the nature of many dental decisions, which commonly involve categorical judgments such as disease presence, referral, or risk assignment [71], [72]. In these situations, image-level outputs may sufficiently support the clinical objective, whereas fine-grained delineation does not necessarily yield proportional gains in actionability.
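A purely schematic contrast of the three output granularities (shapes, labels, and values below are illustrative assumptions, unrelated to any reviewed system) may help clarify why annotation costs diverge so sharply:

```python
# Schematic only: contrasts the information content of the three task
# formulations; all values and shapes are illustrative assumptions.
import numpy as np

image = np.zeros((512, 512), dtype=np.uint8)     # one radiograph

# Classification: a single image-level judgment (cheapest to label).
classification = {"caries_present": 0.91}

# Detection: coarse localization via bounding boxes (moderate labeling cost).
detection = [{"box_xywh": (120, 80, 60, 40), "label": "caries", "score": 0.87}]

# Segmentation: a pixel-level mask delineating lesion extent (most costly).
segmentation = np.zeros(image.shape, dtype=bool)
segmentation[80:120, 120:180] = True
print(f"labeled pixels: {segmentation.sum()}")   # quantitative lesion extent
```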
There is no indication that AutoML itself is intrinsically limited to classification. On the contrary, AutoML approaches have demonstrated effectiveness across segmentation and other complex imaging tasks [73]. The distribution observed here is therefore more plausibly explained by practical constraints, particularly the expert time required for dense annotation [66], [69], alongside governance considerations [74] and the computational demands of pixel-wise optimization, which collectively make classification the most accessible pathway. Comparable trends are reported throughout dental AI research, where classification commonly outweighs segmentation in areas such as caries and periodontal imaging [75], [76]. Taken together, these observations suggest that prevailing task choices might be driven primarily by feasibility and resource considerations rather than technological limitations of AutoML.
4.8. Gaps in Clinical Translation
We identified no studies addressing survival analysis, longitudinal modeling, or multi-step treatment decision pathways. This absence likely reflects technical limitations of current AutoML platforms rather than a lack of clinical relevance. Complex questions involving disease progression or time-to-event outcomes require modeling approaches that are not yet well supported by off-the-shelf AutoML systems [77]. As AutoML tools evolve to accommodate longitudinal and survival data, dental research should expand into these clinically meaningful applications. Another major gap is the lack of evaluation beyond retrospective development and internal validation. No study advanced to prospective clinical implementation. This mirrors broader findings that most published clinical AI models do not progress to prospective validation or implementation [78], [79]. Without real-world evidence of effectiveness, usability, and generalizability, claims of clinical benefit remain speculative.
4.9. Real-World Barriers to Adoption
Several practical factors may limit the incorporation of AutoML systems into routine oral healthcare. Clinical translation requires alignment with infrastructure, governance, regulatory frameworks, professional responsibilities, and economic sustainability [6], [80]. Consequently, movement from experimental validation to everyday deployment remains largely theoretical [12], [65].
Integration into existing workflows represents a primary challenge [5], [6]. Dental practices function within heterogeneous digital environments that combine imaging devices, proprietary software, practice-management platforms, and electronic records [5], [81]. Embedding automated outputs without disrupting established routines often demands interface development, vendor cooperation, and continuous technical support [6], [65]. In high-throughput settings, even minor delays may compromise usability. While AutoML reduces the expertise required for model development, deployment, interoperability, and lifecycle management remain substantial challenges [12], [36].
Data protection and governance create further constraints. Cloud-based computation raises concerns regarding confidentiality, cybersecurity, data ownership, and cross-border transfer [6], [64]. Such issues are particularly relevant for small or privately run practices with limited IT resources. Legal ambiguity around secondary use and breach responsibility may deter adoption despite promising performance.
Regulatory approval forms another major boundary between development and patient care. Systems influencing diagnosis or treatment planning increasingly fall within medical-device legislation, requiring evidence of safety, effectiveness, transparency, and post-market oversight [46], [64], [65]. Automated model search may complicate certification by reducing traceability and hindering reconstruction of development pathways [12], [36]. Limited access to proprietary processes can conflict with expectations for auditability and reproducibility [46], [64]. At the same time, clinicians generally retain legal responsibility, creating uncertainty around liability when decisions are AI-supported [62], [63].
Human and organizational dimensions are equally influential. Uptake depends on trust, perceived usefulness, and compatibility with established diagnostic reasoning [5], [6]. Tools viewed as opaque or as constraining professional autonomy may face resistance unless accompanied by transparent communication of limitations and appropriate indications [46], [64]. Implementation research repeatedly demonstrates that strong accuracy metrics alone rarely guarantee adoption without tangible workflow benefit and adequate training [6], [65].
Economic feasibility further shapes decision-making. Licensing, subscriptions, hardware, cybersecurity, integration, and staff education require substantial investment [5], [6]. With reimbursement pathways often undefined, return on investment remains uncertain, particularly for smaller providers. Thus, progression from promising performance to sustainable routine use cannot yet be assumed [36], [80]. These barriers frame the broader question of whether AutoML is presently ready for reliable chairside deployment.
4.10. Readiness for Chairside Use
From a clinical standpoint, the key issue is whether the body of evidence summarized in this review is sufficient to justify routine use of AutoML systems in patient care. When considered collectively, the answer remains cautious. Although many investigations report encouraging internal performance, the broader conditions that typically precede clinical deployment, including consistent multicenter validation, prospective demonstration of benefit, seamless incorporation into everyday workflows, regulatory approval pathways, and sustainable governance mechanisms, have seldom been satisfied within the same framework.
Readiness for chairside adoption depends on more than algorithmic accuracy. Systems intended to inform diagnosis or treatment must demonstrate reliability across diverse clinical environments, produce outputs that clinicians can interpret and defend, and operate within legal, ethical, and financial structures acceptable to healthcare providers. Across these dimensions, the current literature remains preliminary and fragmented. Accordingly, AutoML in dentistry should presently be viewed primarily as an enabling methodology for accelerating research, hypothesis generation, and rapid prototyping rather than as a mature clinical solution. Transition toward dependable real-world use will require coordinated progress not only in predictive performance but also in validation methodology, implementation science, and institutional governance.
4.11. Future Directions
Looking ahead, several priorities emerge to ensure AutoML delivers meaningful clinical value in oral health. External validation on independent, multi-center datasets is essential to establish generalizability. Promising models should undergo prospective clinical evaluation, including pilot implementation studies or randomized trials. Reporting standards should require calibration assessment, decision-analytic evaluation, and full disclosure of model development details. In addition, fairness and bias analyses should be routinely conducted to ensure equitable performance across demographic subgroups. Without these steps, oral healthcare risks replicating patterns seen elsewhere in AI, where technically impressive models fail to translate into patient benefit.
4.12. Strengths and Limitations
This review has notable strengths. To our knowledge, it represents the first systematic review focused specifically on AutoML in oral healthcare, enabling a targeted assessment of its opportunities and challenges. Nevertheless, limitations must be acknowledged. The number of eligible studies was small and heterogeneous, precluding meta-analysis and reflecting the early stage of adoption. Most studies relied on retrospective, single-center datasets with limited population diversity, and none evaluated implementation in routine clinical workflows or reported patient outcome data. As such, the available evidence remains largely methodological rather than clinically outcome-driven. Additionally, although major biomedical databases and grey literature sources were searched, AutoML studies might be published in computer-science or engineering venues rather than traditional biomedical journals and may therefore not be consistently indexed in PubMed, Web of Science, or Cochrane. Therefore, some relevant evidence may not have been captured, and the database scope should be interpreted as a potential limitation. Finally, clinical implications should remain conservative because most included studies were internally validated within single datasets; robust multi-center external validation and real-world clinical evaluation are required before AutoML models can be considered ready for routine use.
5. Conclusion
AutoML in oral healthcare currently serves primarily as a methodological tool for early-stage model development rather than a clinically validated technology. While AutoML has democratized access to advanced machine learning and accelerated exploratory research, its application remains largely confined to proof-of-concept studies. Progress toward clinical adoption will require rigorous external validation, transparent and standardized reporting, and evaluation within real-world clinical settings. As AutoML technologies mature, their contribution to dental practice will depend on robust data curation, appropriate governance, and careful integration into clinical decision-making processes to ensure reliability, reproducibility, and patient benefit.
Ethical approval
Nil.
Funding
This study was not supported by any funding.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
Nil.
Consent to Participate
For this type of study, informed consent is not required.
Code Availability
Nil.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.jdsr.2026.03.001.
Appendix A. Supplementary material
Data availability
This review is based on previously published studies. All data extracted and analyzed during this systematic review are included within the article and its Supplementary Materials.
References
- 1.Al-Raeei M. From diagnosis to treatment: the comprehensive integration of artificial intelligence in modern dental care. Curr Trends Dent. 2025;2(1):13–17. [Google Scholar]
- 2.Sohrabniya F., Hassanzadeh-Samani S., Ourang S.A., Jafari B., Farzinnia G., Gorjinejad F., et al. Exploring a decade of deep learning in dentistry: a comprehensive mapping review. Clin Oral Invest. 2025;29:143. doi: 10.1007/s00784-025-06216-5. [DOI] [PubMed] [Google Scholar]
- 3.Luo G., Stone B.L., Johnson M.D., Tarczy-Hornoch P., Wilcox A.B., Mooney S.D., et al. Automating Construction of Machine Learning Models With Clinical Big Data: Proposal Rationale and Methods. JMIR Res Protoc. 2017;6 doi: 10.2196/resprot.7757. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Luu J., Borisenko E., Przekop V., Patil A., Forrester J.D., Choi J. Practical guide to building machine learning-based clinical prediction models using imbalanced datasets. Trauma Surg Acute Care Open. 2024;9 doi: 10.1136/tsaco-2023-001222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Müller A., Mertens S.M., Göstemeyer G., Krois J., Schwendicke F. Barriers and enablers for artificial intelligence in dental diagnostics: a qualitative study. J Clin Med. 2021;10(8):1612. doi: 10.3390/jcm10081612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Hassan M., Kushniruk A., Borycki E. Barriers to and Facilitators of Artificial Intelligence Adoption in Health Care: Scoping Review. JMIR Hum Factors. 2024;11 doi: 10.2196/48633. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Baratchi M., Wang C., Limmer S., Van Rijn J.N., Hoos H., Bäck T., et al. Automated machine learning: past, present and future. Artif Intell Rev. 2024;57:122. [Google Scholar]
- 8.Park Y. Automated machine learning with R: AutoML tools for beginners in clinical research. J Minim Invasive Surg. 2024;27:129–137. doi: 10.7602/jmis.2024.27.3.129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Shujaat S. Automated machine learning in dentistry: a narrative review of applications, challenges, and future directions. Diagn (Basel) 2025;15(3):273. doi: 10.3390/diagnostics15030273. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Karmaker S.K., Hassan M.M., Smith M.J., Xu L., Zhai C., Veeramachaneni K. AutoML to date and beyond: challenges and opportunities. ACM Comput Surv. 2021;54(8):1–35. [Google Scholar]
- 11.Faes L., Wagner S.K., Fu D.J., Liu X., Korot E., Ledsam J.R., et al. Automated deep learning design for medical image classification by health-care professionals with no coding experience: a feasibility study. Lancet Digit Health. 2019;1:e232–e242. doi: 10.1016/S2589-7500(19)30108-6. [DOI] [PubMed] [Google Scholar]
- 12.Thirunavukarasu A.J., Elangovan K., Gutierrez L., Hassan R., Li Y., Tan T.F., et al. Clinical performance of automated machine learning: A systematic review. Ann Acad Med Singap. 2024;53:187–207. doi: 10.47102/annals-acadmedsg.2023113. [DOI] [PubMed] [Google Scholar]
- 13.Luke A.M., Rezallah N.N.F. Accuracy of artificial intelligence in caries detection: a systematic review and meta-analysis. Head Face Med. 2025;21:24. doi: 10.1186/s13005-025-00496-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Macrì M., D’Albis V., D’Albis G., Forte M., Capodiferro S., Favia G., et al. The Role and Applications of Artificial Intelligence in Dental Implant Planning: A Systematic Review. Bioeng (Basel) 2024;11:778. doi: 10.3390/bioengineering11080778. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Page M.J., McKenzie J.E., Bossuyt P.M., Boutron I., Hoffmann T.C., Mulrow C.D., et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. doi: 10.1136/bmj.n71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Wolff R.F., Moons K.G.M., Riley R.D., Whiting P.F., Westwood M., Collins G.S., et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med. 2019;170:51–58. doi: 10.7326/M18-1376. [DOI] [PubMed] [Google Scholar]
- 17.Lee J.H., Kim Y.T., Lee J.B., Jeong S.N. A performance comparison between automated deep learning and dental professionals in classification of dental implant systems from dental imaging: a multi-center study. Diagn (Basel) 2020;10(11):910. doi: 10.3390/diagnostics10110910. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Tobias G., Spanier A.B. Modified gingival index (MGI) classification using dental selfies. Appl Sci. 2020;10:8923. [Google Scholar]
- 19.Heimisdottir L.H., Lin B.M., Cho H., Orlenko A., Ribeiro A.A., Simon-Soro A., et al. Metabolomics Insights in Early Childhood Caries. J Dent Res. 2021;100:615–622. doi: 10.1177/0022034520982963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Karhade D.S., Roach J., Shrestha P., Simancas-Pallares M.A., Ginnis J., Burk Z.J.S., et al. An Automated Machine Learning Classifier for Early Childhood Caries. Pedia Dent. 2021;43:191–197. [PMC free article] [PubMed] [Google Scholar]
- 21.Real A.D., Real O.D., Sardina S., Oyonarte R. Use of automated artificial intelligence to predict the need for orthodontic extractions. Korean J Orthod. 2022;52:102–111. doi: 10.4041/kjod.2022.52.2.102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Boitor O., Stef L., Antonescu E., Stoica F., Mihăilă R. The importance of some risk factors in patients presenting the association of metabolic syndrome with periodontal disease. Acta Med Transilv. 2023;28:53. [Google Scholar]
- 23.Boitor O., Stoica F., Mihaila R., Stoica L.F., Stef L. Automated Machine Learning to Develop Predictive Models of Metabolic Syndrome in Patients with Periodontal Disease. Diagn (Basel) 2023;13:3631. doi: 10.3390/diagnostics13243631. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Fatapour Y., Abiri A., Kuan E.C., Brody J.P. Development of a Machine Learning Model to Predict Recurrence of Oral Tongue Squamous Cell Carcinoma. Cancers (Basel) 2023;15:2769. doi: 10.3390/cancers15102769. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Hamdan M., Badr Z., Bjork J., Saxe R., Malensek F., Miller C., et al. Detection of dental restorations using no-code artificial intelligence. J Dent. 2023 doi: 10.1016/j.jdent.2023.104768. [DOI] [PubMed] [Google Scholar]
- 26.Kong H.J. Classification of dental implant systems using cloud-based deep learning algorithm: an experimental study. J Yeungnam Med Sci. 2023;40:S29–S36. doi: 10.12701/jyms.2023.00465. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Kwack D.W., Park S.M. Prediction of medication-related osteonecrosis of the jaw (MRONJ) using automated machine learning in patients with osteoporosis associated with dental extraction and implantation: a retrospective study. J Korean Assoc Oral Maxillofac Surg. 2023;49:135–141. doi: 10.5125/jkaoms.2023.49.3.135. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Cheong R.C.T., Jawad S., Adams A., Campion T., Lim Z.H., Papachristou N., et al. Enhancing paranasal sinus disease detection with AutoML: efficient AI development and evaluation via magnetic resonance imaging. Eur Arch Otorhinolaryngol. 2024;281:2153–2158. doi: 10.1007/s00405-023-08424-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Gonzalez C., Badr Z., Gungor H.C., Han S., Hamdan M.D. Identifying Primary Proximal Caries Lesions in Pediatric Patients From Bitewing Radiographs Using Artificial Intelligence. Pedia Dent. 2024;46:332–336. [PubMed] [Google Scholar]
- 30.Jeong H.-J. Preliminary test of google vertex artificial intelligence in root dental x-ray imaging diagnosis. J Korean Soc Radio. 2024;18:267–273. [Google Scholar]
- 31.Nantakeeratipat T., Apisaksirikul N., Boonrojsaree B., Boonkijkullatat S., Simaphichet A. Automated machine learning for image-based detection of dental plaque on permanent teeth. Front Dent Med. 2024;5 doi: 10.3389/fdmed.2024.1507705. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Yoon J.H., Park S.M. Prediction of intensive care unit admission using machine learning in patients with odontogenic infection. J Korean Assoc Oral Maxillofac Surg. 2024;50:216–221. doi: 10.5125/jkaoms.2024.50.4.216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Akadiri O.A., Yarhere K.S. Embracing Computer Vision for Diagnostic Maxillofacial Imaging - An Artificial Intelligence Machine Learning (AIML) Pilot Project. Niger Med J. 2025;66:973–982. doi: 10.71480/nmj.v66i3.748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Pais R.J., Botelho J., Machado V., Alcoforado G., Mendes J.J., Alves R., et al. Exploring AI–driven machine learning approaches for optimal classification of peri-implantitis based on oral microbiome data: a feasibility study. Diagn (Basel) 2025;15:425. doi: 10.3390/diagnostics15040425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Gomez-Rios I., Egea-Lopez E., Ortiz Ruiz A.J. ORIENTATE: automated machine learning classifiers for oral health prediction and research. BMC Oral Health. 2023;23:408. doi: 10.1186/s12903-023-03112-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Waring J., Lindvall C., Umeton R. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif Intell Med. 2020;104 doi: 10.1016/j.artmed.2020.101822. [DOI] [PubMed] [Google Scholar]
- 37.Thirunavukarasu A.J., Elangovan K., Gutierrez L., Li Y., Tan I., Keane P.A., et al. Democratizing Artificial Intelligence Imaging Analysis With Automated Machine Learning: Tutorial. J Med Internet Res. 2023;25 doi: 10.2196/49949. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Aggarwal R., Sounderajah V., Martin G., Ting D.S.W., Karthikesalingam A., King D., et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med. 2021;4:65. doi: 10.1038/s41746-021-00438-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Zech J.R., Badgeley M.A., Liu M., Costa A.B., Titano J.J., Oermann E.K. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Med. 2018;15 doi: 10.1371/journal.pmed.1002683. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Yu A.C., Mohajer B., Eng J. External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review. Radio Artif Intell. 2022;4 doi: 10.1148/ryai.210064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Beam A.L., Kohane I.S. Big Data and Machine Learning in Health Care. JAMA. 2018;319:1317–1318. doi: 10.1001/jama.2017.18391. [DOI] [PubMed] [Google Scholar]
- 42.Cahan E.M., Hernandez-Boussard T., Thadaney-Israni S., Rubin D.L. Putting the data before the algorithm in big data addressing personalized healthcare. NPJ Digit Med. 2019;2:78. doi: 10.1038/s41746-019-0157-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Zöller M.A., Huber M.F. Benchmark and survey of automated machine learning frameworks. J Artif Intell Res. 2021;70:409–474. [Google Scholar]
- 44.Van Calster B., McLernon D.J., Van Smeden M., Wynants L., Steyerberg E.W., Bossuyt P., et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17:230. doi: 10.1186/s12916-019-1466-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Steyerberg E.W., Vickers A.J., Cook N.R., Gerds T., Gonen M., Obuchowski N., et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–138. doi: 10.1097/EDE.0b013e3181c30fb2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Supplementary Materials
Supplementary material associated with this article is available online.
Data Availability Statement
This review is based on previously published studies. All data extracted and analyzed during this systematic review are included within the article and its Supplementary Materials.

