Abstract
Background
Machine learning (ML) applied to radiomics has revolutionized neuro-oncological imaging, yet the diagnostic performance of ML models based specifically on ^18F-FDG PET features in glioma remains poorly characterized.
Objective
To systematically evaluate and quantitatively synthesize the diagnostic accuracy of ML models trained on ^18F-FDG PET radiomics for glioma classification.
Methods
We conducted a PRISMA-compliant systematic review and meta-analysis registered on OSF (10.17605/OSF.IO/XJG6P). PubMed, Scopus, and Web of Science were searched up to January 2025. Studies were included if they applied ML algorithms to ^18F-FDG PET radiomic features for glioma classification and reported at least one performance metric. Data extraction included demographics, imaging protocols, feature types, ML models, and validation design. Meta-analysis was performed using random-effects models with pooled estimates of accuracy, sensitivity, specificity, AUC, F1 score, and precision. Heterogeneity was explored via meta-regression and Galbraith plots.
Results
Twelve studies comprising 2,321 patients were included. Pooled diagnostic metrics were: accuracy 92.6% (95% CI: 91.3–93.9%), AUC 0.95 (95% CI: 0.94–0.95), sensitivity 85.4%, specificity 89.7%, F1 score 0.78, and precision 0.90. Heterogeneity was high across all domains (I² >75%). Meta-regression identified ML model type and validation strategy as partial moderators. Models using CNNs or PET/MRI integration achieved superior performance.
Conclusion
ML models based on ^18F-FDG PET radiomics demonstrate strong and balanced diagnostic performance for glioma classification. However, methodological heterogeneity underscores the need for standardized pipelines, external validation, and transparent reporting before clinical integration.
Supplementary Information
The online version contains supplementary material available at 10.1186/s40644-025-00915-8.
Keywords: Glioma, ^18F-FDG PET, Radiomics, Machine learning, Meta-analysis, Diagnostic accuracy
Introduction
Gliomas are the most common and aggressive primary brain tumors in adults, with glioblastoma multiforme (GBM) representing the deadliest subtype [1, 2]. Accurate and noninvasive glioma characterization, particularly in distinguishing low-grade from high-grade lesions, is critical for clinical decision-making, prognostic stratification, and therapeutic planning. Although magnetic resonance imaging (MRI) remains the clinical gold standard for glioma evaluation, its sensitivity to microstructural and metabolic alterations is limited, often resulting in diagnostic ambiguity [3].
^18F-fluorodeoxyglucose positron emission tomography (^18F-FDG PET) offers a complementary perspective by capturing glucose metabolism, which reflects tumor aggressiveness and cellular proliferation [4]. However, traditional PET biomarkers such as SUVmax or SUVmean provide only coarse quantifications of metabolic activity and lack the spatial resolution to delineate intra-tumoral heterogeneity or tumor boundaries in regions of physiologic uptake. These limitations have constrained the clinical utility of PET in routine glioma evaluation [5].
Radiomics has emerged as a transformative approach to overcome these challenges by extracting high-dimensional, quantitative features from medical images that characterize tumor intensity, texture, and shape [6]. When combined with machine learning (ML) algorithms, PET radiomics can enable sophisticated classification models capable of predicting tumor grade, molecular markers, or clinical outcomes. A diverse array of ML architectures, such as support vector machines (SVM), random forests (RF), and convolutional neural networks (CNNs), has been developed to harness these radiomic features [7].
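To make this workflow concrete, the sketch below trains a radiomics-style classifier with scikit-learn. It is a minimal illustration under stated assumptions, not a reproduction of any included study: the random arrays stand in for a real feature matrix that would, in practice, be extracted from PET volumes with a tool such as PyRadiomics.

```python
# Minimal sketch of a radiomics-ML pipeline (synthetic stand-in data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 100))    # placeholder: 120 patients x 100 radiomic features
y = rng.integers(0, 2, size=120)   # placeholder labels: 0 = low grade, 1 = high grade

model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=200, random_state=0))
print("cross-validated AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```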
Despite numerous promising studies, there remains a lack of consensus on model performance, generalizability, and methodological rigor in this domain. No prior meta-analysis has systematically synthesized the diagnostic performance of ML models using ^18F-FDG PET radiomics for glioma. The present study addresses this gap by conducting a PRISMA-guided systematic review and meta-analysis, benchmarking pooled performance metrics across architectures and imaging contexts, and evaluating sources of heterogeneity to inform the translational readiness of AI-based PET models in neuro-oncology.
Methods
This systematic review and meta-analysis was conducted in accordance with the PRISMA 2020 guidelines [8] and was prospectively registered in the Open Science Framework (OSF; DOI: 10.17605/OSF.IO/XJG6P). The protocol outlined predefined objectives, eligibility criteria, and statistical plans to ensure methodological transparency and reproducibility.
Search strategy and study selection
A comprehensive electronic search was performed across PubMed, Scopus, and Web of Science databases to identify relevant studies evaluating the diagnostic performance of machine learning (ML) models based on ^18F-FDG PET radiomics for glioma characterization. The search strategy combined MeSH terms and keywords related to “machine learning,” “positron emission tomography,” and “glioma,” and was last updated in January 2025 (Table 1).
Table 1.
Comprehensive literature search strategies implemented across chosen databases
| Database | Search strategy | Results |
|---|---|---|
| PubMed | (“Machine Learning“[Title/Abstract] OR “Artificial Intelligence“[Title/Abstract] OR “deep learning“[Title/Abstract] OR “Machine Learning“[MeSH Terms] OR “Artificial Intelligence“[MeSH Terms]) AND (“18F-FDG PET“[Title/Abstract] OR “FDG-PET“[Title/Abstract] OR “Fluorodeoxyglucose F18“[MeSH Terms] OR “PET“[Title/Abstract] OR “positron emission tomography“[Title/Abstract] OR “positron emission tomography“[MeSH Terms]) AND (“Glioma“[Title/Abstract] OR “Glioblastoma“[Title/Abstract] OR “Glioblastoma“[MeSH Terms] OR “Glioma“[MeSH Terms]) | 80 |
| Scopus | ( TITLE-ABS-KEY ( “Machine Learning” ) OR TITLE-ABS-KEY ( “Artificial Intelligence” ) OR TITLE-ABS-KEY ( “deep learning” ) OR TITLE-ABS-KEY ( “Transfer Learning” ) ) AND ( TITLE-ABS-KEY ( “18F-FDG PET” ) OR TITLE-ABS-KEY ( “FDG-PET” ) OR TITLE-ABS-KEY ( “Fluorodeoxyglucose F18” ) OR TITLE-ABS-KEY ( “18F Fluorodeoxyglucose” ) OR TITLE-ABS-KEY ( “8FDG” ) OR TITLE-ABS-KEY ( “Fluorine-18-fluorodeoxyglucose” ) OR TITLE-ABS-KEY ( “Fluorine 18 fluorodeoxyglucose” ) OR TITLE-ABS-KEY ( “Fluorodeoxyglucose F 18” ) OR TITLE-ABS-KEY ( “Positron Emission Tomography” ) OR TITLE-ABS-KEY ( “PET Imaging” ) OR TITLE-ABS-KEY ( “PET Imagings” ) OR TITLE-ABS-KEY ( “Positron-Emission Tomography Imaging” ) OR TITLE-ABS-KEY ( “Positron Emission Tomography Imaging” ) OR TITLE-ABS-KEY ( “Positron-Emission Tomography Imagings” ) OR TITLE-ABS-KEY ( “PET Scan” ) OR TITLE-ABS-KEY ( “PET Scans” ) ) AND ( TITLE-ABS-KEY ( “glioma” ) OR TITLE-ABS-KEY ( “glioblastoma” ) OR TITLE-ABS-KEY ( “Gliomas” ) OR TITLE-ABS-KEY ( “Glial Cell Tumors” ) OR TITLE-ABS-KEY ( “Glial Cell Tumor” ) OR TITLE-ABS-KEY ( “Mixed Glioma” ) OR TITLE-ABS-KEY ( “Mixed Gliomas” ) OR TITLE-ABS-KEY ( “Malignant Glioma” ) OR TITLE-ABS-KEY ( “Malignant Gliomas” ) ) | 264 |
| WOS | ((((((TS=(“Machine Learning” )) OR TS=(“Artificial Intelligence”)) OR TS=(“ deep learning”)) OR TS=(“Transfer Learning”))) AND ( (((((((((((((((TS=(“18F-FDG PET” )) OR TS=(“FDG-PET”)) OR TS=(“Fluorodeoxyglucose F18”)) OR TS=(“18F Fluorodeoxyglucose”)) OR TS=(“18FDG”)) OR TS=(“Fluorine-18-fluorodeoxyglucose”)) OR TS=(“Fluorine 18 fluorodeoxyglucose”)) OR TS=(“Fluorodeoxyglucose F 18”)) OR TS=(“Positron Emission Tomography”)) OR TS=(“PET Imaging”)) OR TS=(“PET Imagings”)) OR TS=(“Positron-Emission Tomography Imaging”)) OR TS=(“Positron Emission Tomography Imaging”)) OR TS=(“Positron-Emission Tomography Imagings”)) OR TS=(“PET Scan”)) OR TS=(“PET Scans”))) AND (((((((((TS=(“glioma”)) OR TS=(“glioblastoma”)) OR TS=(“Gliomas”)) OR TS=(“Glial Cell Tumors”)) OR TS=(“Glial Cell Tumor”)) OR TS=(“Mixed Glioma”)) OR TS=(“Mixed Gliomas”)) OR TS=(“Malignant Glioma”)) OR TS=(“Malignant Gliomas”)) | 82 |
Study screening was conducted using the Rayyan Intelligent System for Systematic Reviews, which facilitated blinded, independent screening by two reviewers. Discrepancies were resolved through consensus.
Eligibility criteria
Studies were included if they met all the following criteria:
Population: Included patients with glioma of any grade or subtype.
Intervention: Applied machine learning (ML) models using features derived from ^18F-FDG PET radiomics, either alone or in combination with other imaging modalities (e.g., MRI).
Outcomes: Reported at least one diagnostic performance metric: accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), F1 score, or precision.
Study Design: Original peer-reviewed research articles using retrospective or prospective cohorts, with clearly described model training and validation processes.
Data Availability: Sufficient methodological and quantitative data available for extraction and synthesis.
Exclusion criteria were:
Non-original studies (e.g., reviews, editorials, letters)
Studies not using ^18F-FDG PET
Studies not involving machine learning algorithms
Lack of performance metrics or insufficient reporting for meta-analysis
Data extraction and quality assessment
From each eligible study, we extracted structured information on study design, sample size, patient demographics, glioma subtype, PET acquisition parameters, feature extraction methods, ML architecture, validation strategy, and performance outcomes. Data were organized into a standardized extraction form and cross-validated for accuracy.
We conducted a formal risk of bias assessment using the QUADAS-2 tool, which evaluates four key domains: patient selection, index test, reference standard, and flow/timing. Each domain was assessed independently by two reviewers and classified as “low,” “high,” or “unclear” risk of bias. Discrepancies were resolved through discussion. Results are visualized in Fig. 1.
Fig. 1.
QUADAS-2 Risk of Bias Assessment Across Included Studies. Traffic light plot summarizing risk of bias judgments across four domains using the QUADAS-2 tool: (D1) Patient selection, (D2) Index test (machine learning model), (D3) Reference standard, and (D4) Flow and timing. Each domain is rated as low risk (green), high risk (red), some concerns (yellow), or unclear (blue). Overall, most studies showed low risk of bias, though a few exhibited limitations in reference standard reporting or test interpretation
Statistical analysis
Meta-analysis was conducted using Stata version 18. Pooled performance estimates were computed using random-effects models with restricted maximum likelihood (REML) estimation. Outcomes of interest included pooled accuracy, sensitivity, specificity, AUC, F1 score, and precision, each reported with 95% confidence intervals. Heterogeneity was assessed using the I² statistic and corresponding p-values.
Potential publication bias was evaluated using funnel plots, while Galbraith (radial) plots were used to explore study precision and outliers. Meta-regression was conducted to examine the influence of moderators such as ML model type, glioma subtype, validation strategy, and sample size. All procedures adhered to best practices in diagnostic test accuracy meta-analyses in artificial intelligence applications.
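As a rough illustration of the pooling step, the sketch below implements an inverse-variance random-effects pool of study-level proportions. It uses the DerSimonian–Laird between-study variance estimator for brevity, whereas the analyses reported here used REML in Stata 18; the accuracies and sample sizes shown are hypothetical.

```python
# Minimal random-effects pooling sketch (DerSimonian-Laird, not the REML used in Stata).
import numpy as np

def pool_proportions(p, n):
    """Pool study-level proportions with inverse-variance random effects."""
    p, n = np.asarray(p, float), np.asarray(n, float)
    var = p * (1 - p) / n                 # binomial variance of each proportion
    w = 1.0 / var                         # fixed-effect weights
    fixed = np.sum(w * p) / np.sum(w)
    q = np.sum(w * (p - fixed) ** 2)      # Cochran's Q
    df = len(p) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)         # between-study variance (DL estimator)
    w_re = 1.0 / (var + tau2)             # random-effects weights
    pooled = np.sum(w_re * p) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    i2 = (max(0.0, (q - df) / q) * 100) if q > 0 else 0.0
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2

# Hypothetical per-study accuracies and sample sizes:
acc = [0.74, 0.85, 0.90, 0.93, 0.97]
n = [60, 120, 250, 300, 90]
pooled, ci, i2 = pool_proportions(acc, n)
print(f"pooled={pooled:.3f}, 95% CI={ci}, I²={i2:.1f}%")
```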
Results
Study selection process
A comprehensive literature search initially yielded 426 records from selected databases. Following the removal of 149 duplicate entries, 277 unique studies underwent title and abstract screening.
Out of these, 165 studies were excluded at this stage for not meeting inclusion criteria. The remaining 112 full-text articles were retrieved for detailed assessment. After full-text evaluation, 100 articles were excluded due to insufficient or irrelevant data, most commonly due to missing performance metrics, lack of PET-based radiomics, or inappropriate ML methodology (Fig. 2).
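The arithmetic of this flow can be checked directly from the counts reported above:

```python
# Sanity check of the PRISMA flow counts reported in the text.
records, duplicates = 426, 149
screened = records - duplicates        # 277 records screened on title/abstract
fulltext = screened - 165              # 112 full texts retrieved
included = fulltext - 100              # 12 studies included
assert (screened, fulltext, included) == (277, 112, 12)
```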
Fig. 2.
PRISMA 2020 Flow Diagram of Study Selection Process. The figure outlines the systematic screening and selection of studies for inclusion in the meta-analysis. Out of 426 records initially identified, 277 were screened after duplicate removal. Following abstract and full-text assessments, 12 studies met the eligibility criteria and were included in the final synthesis. Reasons for exclusion included irrelevance to the topic, insufficient performance data, or lack of machine learning applications using ^18F-FDG PET radiomics in glioma
Ultimately, 12 studies met all predefined inclusion criteria and were incorporated into the final qualitative and quantitative synthesis.
This flow demonstrates a rigorous and transparent selection pathway, ensuring that only high-quality, relevant evidence was synthesized in the meta-analysis. It also highlights the relatively narrow evidence base in this emerging field, despite a broad initial search, underscoring the novelty and importance of the review.
Study characteristics
A total of 12 studies published between 2014 and 2024 were included, comprising 2,321 patients evaluated for glioma characterization using machine learning (ML) algorithms applied to ^18F-FDG PET radiomics [9–20]. Among them, 1,088 (46.9%) were male and 741 (31.9%) were female, while the rest had unspecified sex. The pooled mean age of participants was 53.2 years (SD: 1.62), consistent with the epidemiology of glioma presentation in mid-to-late adulthood (Supplementary Table 1).
Most studies focused on differentiating high-grade gliomas (HGGs), particularly glioblastoma multiforme (GBM), from low-grade gliomas (LGGs). While some studies included a wide spectrum of WHO grades, others exclusively targeted grade IV lesions.
Risk of bias assessment using QUADAS-2 revealed that most studies were rated as low risk across all domains. Some concerns were noted in a small number of studies, particularly in the reference standard domain and flow/timing, reflecting limited reporting on model interpretation procedures or outcome assessment protocols. Figure 1 illustrates the domain-wise and overall judgments across all 12 included studies.
Radiotracer uptake and quantitative PET metrics
Radiotracer analysis across studies consistently showed elevated uptake values in higher-grade gliomas. For example:
Kong et al. (2019) reported a mean SUVmax of 13.5 ± 4.37 in GBM, significantly higher than 6.73 ± 2.67 in LGG.
Wei et al. (2022) observed SUVmean differences across grades, with grade IV tumors averaging 13.92 ± 5.57, compared to 9.77 ± 4.87 in lower grades.
Volumetric measures such as metabolic tumor volume (MTV) and total lesion glycolysis (TLG) were also predictive in select studies.
These patterns support the use of semi-quantitative PET biomarkers as discriminative features when enhanced through radiomic and ML pipelines.
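These metrics are simple functions of a segmented SUV image. The sketch below computes them from a synthetic volume, assuming a fixed 42%-of-SUVmax threshold segmentation, which is one common convention rather than the method of any particular included study.

```python
# Minimal sketch: SUVmax, SUVmean, MTV, and TLG from a segmented SUV volume.
import numpy as np

rng = np.random.default_rng(0)
suv = rng.gamma(shape=2.0, scale=2.0, size=(64, 64, 32))  # synthetic SUV volume
voxel_volume_ml = 0.2 * 0.2 * 0.2   # hypothetical 2 mm isotropic voxels (cm³ = mL)

mask = suv >= 0.42 * suv.max()      # illustrative 42%-of-SUVmax threshold segmentation
suv_max = suv[mask].max()
suv_mean = suv[mask].mean()
mtv_ml = mask.sum() * voxel_volume_ml   # metabolic tumor volume
tlg = suv_mean * mtv_ml                 # total lesion glycolysis
print(f"SUVmax={suv_max:.2f}  SUVmean={suv_mean:.2f}  MTV={mtv_ml:.1f} mL  TLG={tlg:.1f}")
```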
Radiomics and feature selection
All studies performed radiomic extraction, yielding between 40 and more than 780 features per scan. Commonly used feature classes included:
First-order statistics (mean, skewness, kurtosis).
Textural matrices (GLCM, GLRLM, GLSZM).
Shape and volume descriptors.
To manage high dimensionality, studies used robust feature selection strategies, such as:
Intraclass correlation coefficients (ICC > 0.8).
LASSO regression, mutual information, and recursive feature elimination.
Dimensionality reduction via PCA or correlation filtering.
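As an illustration of one such strategy, the sketch below applies an L1-penalized (LASSO-style) logistic selector to a hypothetical feature matrix. Note that in a real pipeline the selector should be refit within each cross-validation fold to avoid leakage, a point taken up in the Discussion.

```python
# Minimal sketch: LASSO-style feature selection on a radiomic feature matrix.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 400))    # placeholder: 400 radiomic features
y = rng.integers(0, 2, size=120)

X_std = StandardScaler().fit_transform(X)
# L1-penalized logistic regression drives uninformative coefficients to zero:
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso).fit(X_std, y)
X_sel = selector.transform(X_std)
print(f"retained {X_sel.shape[1]} of {X.shape[1]} features")
```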
Machine learning algorithms
A diverse set of ML models was used, tailored to both binary and multiclass classification tasks:
CNNs were used in Wei et al. and Shahram et al., particularly in multimodal PET/MRI frameworks.
Random Forests (RF), Support Vector Machines (SVMs), and XGBoost were commonly applied in studies like Kong et al., Pan et al., and Grahovac et al.
Artificial Neural Networks (ANN) were used in two studies, achieving accuracy > 80% on internal validation.
Several studies adopted ensemble modeling or stacked learning, combining the predictive strengths of multiple classifiers.
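A minimal sketch of such a stacked design follows, using scikit-learn classifiers as stand-ins for the architectures named above; the composition is illustrative, not taken from any included study.

```python
# Minimal sketch of stacked learning over two base classifiers.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner combining base predictions
    cv=5,                                  # base predictions generated out-of-fold
)
# Usage: stack.fit(X_train, y_train); stack.predict_proba(X_test)
```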
Validation strategies
Internal validation was the norm, using:
k-fold cross-validation (typically 5-fold or 10-fold).
Holdout validation with 70/30 or 80/20 train-test splits.
Only a few studies (e.g., Kong et al., Pan et al.) reported external validation cohorts, highlighting a gap in generalizability. None reported model calibration metrics (e.g., Brier score), and only one study reported decision curve analysis to assess clinical utility.
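The two internal designs look roughly as follows in scikit-learn (synthetic data; fold counts and split ratios mirror those reported above):

```python
# Minimal sketch of the two internal-validation designs reported above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)
clf = RandomForestClassifier(random_state=0)

# (a) k-fold cross-validation (here stratified 5-fold):
cv_auc = cross_val_score(clf, X, y, scoring="roc_auc",
                         cv=StratifiedKFold(5, shuffle=True, random_state=0))

# (b) holdout validation with an 80/20 train-test split:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(cv_auc.mean(), holdout_acc)
```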
Performance metrics
The extracted ML models achieved strong discriminative metrics across studies:
Accuracy ranged from 74.0 to 97.0%.
AUC values exceeded 0.90 in over half the studies.
Sensitivity and Specificity often remained > 85%, suggesting balanced diagnostic performance.
These findings were further confirmed through meta-analytically pooled estimates, which showed high pooled accuracy (92.6%), specificity (89.7%), and AUC (0.95), though heterogeneity remained substantial (I² > 75% in most domains).
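For reference, all six metrics pooled in this review can be computed from a model's predictions as in the sketch below (hypothetical labels and probabilities):

```python
# Minimal sketch: the six diagnostic metrics pooled in this review.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]                       # hypothetical labels
y_prob = [0.1, 0.4, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.85, 0.05]  # hypothetical probabilities
y_pred = [int(p >= 0.5) for p in y_prob]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", accuracy_score(y_true, y_pred))
print("sensitivity", recall_score(y_true, y_pred))   # TP / (TP + FN)
print("specificity", tn / (tn + fp))                 # TN / (TN + FP)
print("precision  ", precision_score(y_true, y_pred))
print("F1 score   ", f1_score(y_true, y_pred))
print("AUC        ", roc_auc_score(y_true, y_prob))
```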
Meta-analytic performance estimates
The pooled accuracy of machine learning models across studies was 92.61% (95% CI: 91.29–93.93%), with notable heterogeneity (I² = 77.3%, p < 0.001). The pooled sensitivity was 85.42% (95% CI: 84.09–86.75%), with substantial heterogeneity across studies (I² = 94.5%, p < 0.001). The overall specificity was 89.71% (95% CI: 88.52–90.89%), again with high between-study variance (I² = 93.8%, p < 0.001) (Figs. 3, 4, 5, 6, 7 and 8).
Fig. 3.
Forest Plot for Accuracy. Forest plot showing study-level and pooled estimates for classification accuracy. The overall pooled accuracy was 92.61% (95% CI: 91.29–93.93) with significant heterogeneity (I² = 77.3%, p < 0.001)
Fig. 4.
Forest Plot for Sensitivity. Pooled sensitivity across studies was 85.42% (95% CI: 84.09–86.75), indicating strong ability to identify glioma cases, though between-study heterogeneity remained high (I² = 94.5%, p < 0.001)
Fig. 5.
Forest Plot for Specificity. The overall specificity was 89.71% (95% CI: 88.52–90.89), suggesting models performed well in distinguishing non-glioma cases. Substantial heterogeneity was noted (I² = 93.8%, p < 0.001)
Fig. 6.
Forest Plot for AUC. The area under the ROC curve was consistently high across studies, with a pooled AUC of 0.95 (95% CI: 0.94–0.95), demonstrating excellent overall discriminative ability (I² = 96.5%)
Fig. 7.
Forest Plot for F1 Score. The pooled F1 score was 0.78 (95% CI: 0.75–0.81), reflecting a good balance between sensitivity and precision across models (I² = 93.0%)
Fig. 8.
Forest Plot for Precision. The overall pooled precision was 0.90 (95% CI: 0.87–0.92), showing that the majority of positive glioma classifications were true positives (I² = 94.4%)
The area under the receiver operating characteristic curve (AUC) further demonstrated excellent discriminatory performance, with a pooled value of 0.95 (95% CI: 0.94–0.95) and a high level of heterogeneity (I² = 96.5%, p < 0.001). Reflecting the balance between precision and recall, the pooled F1 score was 0.78 (95% CI: 0.75–0.81), while the pooled precision was 0.90 (95% CI: 0.87–0.92), both with significant heterogeneity (I² = 93.0% and 94.4%, respectively; p < 0.001 for both).
Heterogeneity and risk of bias
Across all pooled estimates, heterogeneity remained substantial, likely due to variation in radiomic feature sets, patient population characteristics, machine learning architecture, and validation methods (e.g., internal vs. external validation). The Galbraith plot illustrated dispersion around the regression line, further confirming the presence of heterogeneity in effect sizes. Visual inspection of the funnel plot did not reveal substantial asymmetry, suggesting minimal publication bias; however, the limited number of studies per performance metric precluded formal statistical testing such as Egger’s regression (Figs. 9 and 10).
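Both diagnostics are straightforward to reproduce; the sketch below draws a funnel plot and a Galbraith (radial) plot from hypothetical study-level effect sizes and standard errors.

```python
# Minimal sketch: funnel and Galbraith (radial) plots for hypothetical study effects.
import matplotlib.pyplot as plt
import numpy as np

es = np.array([0.90, 0.93, 0.88, 0.95, 0.92, 0.85])  # hypothetical effect sizes
se = np.array([0.03, 0.02, 0.05, 0.01, 0.04, 0.06])  # hypothetical standard errors
pooled = np.average(es, weights=1 / se**2)            # fixed-effect pooled estimate

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(es, se)                                   # funnel: effect vs. SE
ax1.invert_yaxis()                                    # more precise studies at the top
ax1.axvline(pooled, linestyle="--")
ax1.set(xlabel="effect size", ylabel="standard error", title="Funnel plot")
ax2.scatter(1 / se, es / se)                          # radial: precision vs. z-score
ax2.set(xlabel="precision (1/SE)", ylabel="standardized effect (ES/SE)",
        title="Galbraith plot")
plt.tight_layout()
plt.show()
```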
Fig. 9.
Funnel Plot of Diagnostic Accuracy Estimates. The funnel plot shows the distribution of effect sizes against their standard errors across included studies. Most points fall within the pseudo 95% confidence boundaries, suggesting no strong evidence of publication bias, although some asymmetry at the base may reflect small-study effects
Fig. 10.
Galbraith Plot (Radial Plot) of Standardized Effects. This Galbraith plot depicts the precision (1/SE) of each study against its standardized effect size (ES/SE). Most studies lie within the 95% confidence band, confirming consistency with the overall pooled effect but highlighting moderate heterogeneity
These findings support the overall utility of machine learning models based on ^18F-FDG PET radiomics in glioma prediction, while also highlighting the need for standardization in image preprocessing, feature engineering, and model evaluation frameworks to reduce methodological heterogeneity in future research.
Discussion
Principal findings
This systematic review and meta-analysis provides the first comprehensive synthesis of diagnostic performance for machine learning (ML) models trained on ^18F-FDG PET radiomics features in glioma characterization. Across 12 studies involving 2,321 patients, pooled estimates demonstrated strong classification metrics: accuracy (92.6%), AUC (0.95), sensitivity (85.4%), specificity (89.7%), F1 score (0.78), and precision (0.90). These results affirm the potential of PET radiomics–based ML models as high-performing, noninvasive tools in neuro-oncology.
However, the consistently high heterogeneity across metrics (I² >75% in all domains) underscores significant variability in methodology, study populations, feature engineering strategies, and model validation approaches. Despite this, the performance remained robust across architectures and tasks, indicating the resilience of radiomics-derived ML frameworks when applied to metabolic brain imaging.
Interpretation of model performance
The diagnostic accuracy and AUC values achieved by the included models exceed those commonly reported for conventional imaging biomarkers in glioma. High precision and F1 scores suggest strong reliability in positive classifications and balanced performance across true positives and false positives. These findings align with prior evidence from individual studies, yet the pooled results offer a broader validation across diverse settings.
CNNs and RFs were frequently associated with superior performance, likely due to their ability to capture nonlinear patterns and hierarchical spatial features. Interestingly, models using hybrid PET/MRI input or ensemble approaches demonstrated improved performance, suggesting that multimodal integration enhances the discriminative power of radiomics-based classifiers.
Sources of heterogeneity
The substantial heterogeneity observed is multifactorial. First, variations in PET acquisition protocols (e.g., scanner type, voxel resolution, reconstruction method) likely introduce systematic biases in image features. Second, the choice of radiomic feature sets (e.g., handcrafted vs. deep features, inclusion of GLSZM or GLRLM) and selection techniques (e.g., PCA, ICC filtering, LASSO) may impact both the relevance and redundancy of the input data. Finally, disparities in validation design, particularly the lack of external testing in most studies, limit generalizability and amplify variance across reported outcomes.
Meta-regression and subgroup analyses support this interpretation, revealing that sample size, model type, and cross-validation strategy partially explain performance variation. Studies with external validation or robust cross-validation (e.g., 10-fold, Monte Carlo sampling) exhibited more consistent results.
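A meta-regression of this kind can be approximated with inverse-variance weighted least squares, as in the sketch below (hypothetical study values; a dedicated random-effects meta-regression, as available in Stata, additionally estimates a between-study variance term).

```python
# Minimal sketch: weighted least squares as a stand-in for meta-regression of
# study-level accuracy on a moderator (here, an external-validation indicator).
import numpy as np
import statsmodels.api as sm

acc = np.array([0.74, 0.85, 0.90, 0.93, 0.97, 0.88])   # hypothetical accuracies
se = np.array([0.05, 0.04, 0.03, 0.02, 0.02, 0.03])    # hypothetical standard errors
external = np.array([0, 0, 1, 1, 1, 0])                # moderator: external validation?

X = sm.add_constant(external)                          # intercept + moderator
fit = sm.WLS(acc, X, weights=1 / se**2).fit()          # inverse-variance weights
print(fit.params, fit.pvalues)                         # moderator slope and p-value
```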
Beyond the explored moderators such as model type and validation strategy, other unexamined sources of heterogeneity likely contribute to the variability observed across studies. Notably, variations in image pre-processing pipelines, including normalization schemes, voxel resampling, and denoising techniques, can significantly affect radiomic feature stability. Segmentation approaches also differed widely, ranging from manual delineation to semi-automatic or atlas-based methods, each introducing varying degrees of operator dependence and boundary definition errors. Additionally, differences in PET scanner types, reconstruction algorithms, and acquisition parameters (e.g., time per bed position, matrix size) may influence signal intensity and radiomic feature extraction. These methodological inconsistencies may obscure the comparability of performance metrics across studies and limit generalizability. Future research should prioritize the adoption of standardized imaging protocols, such as those proposed by the Image Biomarker Standardization Initiative (IBSI) and promote open-access preprocessing pipelines to improve reproducibility. Furthermore, phantom-based harmonization strategies and feature robustness testing across scanner types may help mitigate technical heterogeneity and enhance cross-study reliability.
Comparison with prior evidence and positioning within the field
Our findings build on a growing body of evidence demonstrating the diagnostic power of machine learning (ML) models in neuro-oncology, particularly in differentiating gliomas from other central nervous system (CNS) malignancies using PET radiomics. Notably, the study by Kong et al. (2019) highlighted the value of ^18F-FDG-PET radiomics in distinguishing glioblastoma from CNS lymphoma [19]. Their analysis of 107 features across multiple SUV-normalized maps found that several first-order and texture features achieved AUCs exceeding 0.97, outperforming traditional metrics like SUVmax. This is consistent with our pooled findings of high AUC (0.95) and accuracy (92.6%) across studies utilizing ^18F-FDG PET radiomics for glioma stratification.
In a related study, Wei et al. (2022) emphasized the importance of standardized pipelines and preprocessing in PET radiomics for glioma characterization, recommending robust feature engineering and cross-validation practices to minimize bias and improve reproducibility. Our review affirms this recommendation, as we observed methodological heterogeneity, especially in feature selection and ML algorithms, as a key source of performance variation and statistical heterogeneity (I² >75%) [18].
Furthermore, Pan et al. (2022) demonstrated that radiomics derived from multimodal neuroimaging, including PET and MRI, can enhance glioma grading and molecular subtype prediction. They reported strong correlations between PET radiomics features and isocitrate dehydrogenase (IDH) mutation status, supporting our observation that ML models integrating PET features can achieve clinically relevant predictive performance [16].
Compared to earlier reviews that predominantly focused on MRI radiomics, such as the work by Bi et al., which reported AUCs around 0.91 for glioma detection with AI models, our meta-analysis reveals that PET-based models may offer a superior discriminatory profile. Importantly, we are among the first to benchmark performance across multiple metrics (e.g., precision, F1 score) and investigate moderators (e.g., sample size) through meta-regression [21].
Collectively, this positions our study at the intersection of diagnostic innovation and translational neuroimaging, supporting the integration of ML-enhanced ^18F-FDG PET into clinical workflows for glioma diagnosis and risk stratification.
Limitations in validation practices
A major methodological shortcoming across the included studies lies in the limited use of external validation and independent test sets. While many studies reported strong internal performance metrics, often via k-fold cross-validation or random train-test splits, only a minority employed independent datasets from separate institutions or imaging centers. This lack of external validation raises concerns regarding the reproducibility and generalizability of reported results in real-world clinical settings.
Internal validation strategies, though useful for initial model development, are prone to optimistic bias, particularly in small sample settings or when feature selection is performed prior to data partitioning. Moreover, many studies did not report whether data leakage was prevented or if preprocessing steps (e.g., normalization, feature selection) were confined to the training data alone. Such methodological lapses can artificially inflate performance estimates and obscure true model robustness [22, 23].
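The second point can be demonstrated directly: on pure noise, selecting features before partitioning inflates cross-validation scores, whereas nesting the selection step inside a pipeline keeps it confined to the training folds. A minimal sketch:

```python
# Minimal sketch: data leakage from pre-partition feature selection vs. a
# leakage-free Pipeline that refits selection within each training fold.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))    # many features, few patients: leakage-prone
y = rng.integers(0, 2, size=100)   # pure noise labels: true accuracy is ~0.5

leaky_X = SelectKBest(f_classif, k=20).fit_transform(X, y)  # WRONG: selection sees all labels
safe = Pipeline([("scale", StandardScaler()),
                 ("select", SelectKBest(f_classif, k=20)),   # refit per fold
                 ("clf", SVC())])
print("leaky CV accuracy:", cross_val_score(SVC(), leaky_X, y, cv=5).mean())  # optimistic
print("safe CV accuracy :", cross_val_score(safe, X, y, cv=5).mean())         # ~chance
```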
Crucially, none of the included studies conducted real-world deployment testing, decision curve analysis, or prospective clinical trials, despite the clinical promise of these models. This gap limits our ability to assess how these models would perform when applied to heterogeneous populations, imaging protocols, or scanner vendors in routine clinical workflows.
While the lack of external validation was acknowledged as a limitation, it remains a critical barrier to clinical adoption. To facilitate external validation in future studies, researchers should prioritize the use of publicly available, multi-institutional datasets and engage in collaborative data sharing frameworks such as The Cancer Imaging Archive (TCIA) or OpenNeuro. Prospective studies should also consider temporal validation, in which models are tested on data from a different time period than the training set, to better simulate real-world deployment. Additionally, journals and funding agencies could incentivize external validation by requiring or prioritizing studies that incorporate it. Frameworks like the CLAIM checklist and TRIPOD-AI extension offer structured guidance to ensure external datasets are appropriately integrated, and validation metrics such as calibration curves and decision curve analysis should accompany performance reporting to assess real-world utility. Integrating these elements into the model development pipeline can significantly improve the credibility, generalizability, and clinical relevance of ML-based PET radiomics tools.
While the majority of included studies demonstrated low risk of bias according to QUADAS-2, a few domains, particularly reference standard and flow/timing, exhibited either high risk or insufficient information. These findings underscore the need for improved methodological transparency and standardized reporting frameworks in future AI-driven diagnostic studies.
To move toward clinical translation, future work must prioritize rigorous external validation using multicenter cohorts, temporal splits, or prospective designs. Transparent reporting of training, validation, and test set separation, as well as calibration and interpretability measures, should become standard practice for all studies developing AI-driven radiomic tools.
Clinical and translational implications
The diagnostic potential of machine learning models trained on ^18F-FDG PET radiomics in glioma is evident from our pooled analysis; however, translating these tools into clinical practice requires careful consideration of regulatory, infrastructural, and implementation challenges.
One critical pathway for translation is alignment with the U.S. Food and Drug Administration’s (FDA) regulatory framework for Artificial Intelligence/Machine Learning–Based Software as a Medical Device (AI/ML SaMD). According to the FDA’s proposed action plan, AI tools intended for diagnostic use must demonstrate clinical validity, real-world robustness, and continuous learning safeguards. Notably, none of the included studies reported adherence to regulatory-grade validation standards such as multi-site reproducibility, prospective testing, or model versioning, all of which are prerequisites for SaMD qualification [24].
Moreover, model reporting practices in current PET radiomics literature often fall short of reproducibility and interpretability benchmarks. The Checklist for Artificial Intelligence in Medical Imaging (CLAIM) provides a structured guideline for transparent reporting of model development, training-test separation, and performance metrics. Adherence to CLAIM would improve confidence in model integrity and comparability across institutions [25, 26].
From an implementation standpoint, integration of AI-PET models into neuro-oncology workflows will require more than diagnostic accuracy. Interoperability with PACS systems, interpretability via saliency mapping or SHAP values, and calibration to institution-specific imaging protocols are necessary for clinician acceptance and ethical deployment. Additionally, engagement with regulatory agencies, radiologists, and clinical trial designers early in development will be essential to ensure that future AI tools are not only technically sound but also implementation-ready [27–30]. Despite this promise, clinical translation remains limited. A key barrier is the lack of standardization in feature definitions, reproducibility across centers, and model calibration on independent datasets. Few studies reported decision curve analyses or reader-assistive applications, indicating a gap between technical development and bedside utility.
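As an illustration of the SHAP route, a minimal sketch for a tree-based radiomics classifier follows (synthetic data; assumes the shap package is installed, and notes that its output layout varies across versions):

```python
# Minimal sketch: SHAP attributions for a tree-based radiomics classifier.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))        # placeholder radiomic feature matrix
y = rng.integers(0, 2, size=120)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)  # exact, fast SHAP for tree ensembles
sv = explainer.shap_values(X)          # per-class attributions; layout varies by shap version
pos = sv[1] if isinstance(sv, list) else sv[..., 1]   # attributions for the positive class
print(np.abs(pos).mean(axis=0)[:5])    # mean |SHAP| of the first five features
```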
Limitations
This study has several limitations. First, most included studies were retrospective and single-center, increasing the risk of selection and reporting biases. Second, the heterogeneity across PET protocols and ML pipelines limits the interpretability of pooled metrics. Third, while we explored sources of heterogeneity via meta-regression, the modest number of studies per subgroup restricted statistical power. Lastly, although risk of bias was assessed with QUADAS-2, we did not apply prediction-model-specific appraisal tools such as PROBAST, a constraint we aim to address in future updates.
Future directions
To advance the field, future research should prioritize:
External validation using multicenter datasets with harmonized PET acquisition protocols.
Model explainability and interpretability, including saliency mapping and SHAP value analyses.
Prospective trials integrating ML-based PET assessment into clinical decision-making workflows.
Open-source pipelines and reporting standards that facilitate benchmarking and replication across sites.
Multi-task learning frameworks that combine radiomics with genomics, histopathology, and clinical data for holistic glioma profiling [31, 32].
In addition to emphasizing external validation, future studies should adopt robust calibration assessments to evaluate clinical utility. Metrics such as the Brier score and calibration slope provide insight into the reliability of predicted probabilities, ensuring that model outputs align with observed event rates. Furthermore, decision curve analysis (DCA) should be encouraged to quantify the net clinical benefit of ML-based PET models across a range of threshold probabilities. Incorporating these metrics into radiomics research would help bridge the gap between technical performance and actionable, risk-informed decision-making in neuro-oncology.
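Two of these measures are simple to compute from predicted probabilities, as the sketch below shows for hypothetical data; the net-benefit calculation follows the standard decision-curve definition TP/n − (FP/n) · t/(1 − t) at threshold probability t.

```python
# Minimal sketch: Brier score and decision-curve net benefit for
# hypothetical predicted probabilities.
import numpy as np
from sklearn.metrics import brier_score_loss

y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])                   # hypothetical outcomes
p = np.array([0.2, 0.3, 0.8, 0.6, 0.9, 0.4, 0.7, 0.1, 0.95, 0.55])  # predicted probabilities

print("Brier score:", brier_score_loss(y, p))   # mean squared error of probabilities

def net_benefit(y, p, threshold):
    """Decision-curve net benefit at a given threshold probability."""
    n = len(y)
    tp = np.sum((p >= threshold) & (y == 1))
    fp = np.sum((p >= threshold) & (y == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

for t in (0.1, 0.3, 0.5):
    print(f"net benefit @ t={t:.1f}: {net_benefit(y, p, t):.3f}")
```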
Conclusion
Machine learning models leveraging ^18F-FDG PET radiomics achieve high diagnostic performance in glioma classification across a range of architectures and feature types. While these tools show considerable promise, methodological heterogeneity and lack of external validation remain key challenges. Standardization, explainability, and prospective validation will be essential to unlock the clinical potential of AI-enhanced PET in precision neuro-oncology.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary Material 1: Supplementary Table 1. Summary of Included Studies Evaluating Machine Learning Models Applied to ^18F-FDG PET Radiomics in Glioma. This table presents detailed characteristics of the 12 included studies, covering publication year, glioma type, imaging modality, PET acquisition parameters, segmentation method, feature extraction and selection techniques, machine learning algorithms, training strategies, cross-validation approaches, and key diagnostic or prognostic findings. A diverse array of supervised models (e.g., CNN, RF, ANN, SVM) were applied across varying feature pipelines, with most studies utilizing ^18F-FDG PET or multimodal PET/MRI inputs. Radiomics features frequently included GLCM, GLRLM, and GLSZM texture matrices. The majority of studies employed cross-validation schemes such as k-fold or LOOCV, with a subset reporting external validation. This synthesis highlights methodological heterogeneity and varying performance across studies, underscoring the need for standardized benchmarking in AI-driven glioma diagnostics.
Acknowledgements
None.
Author contributions
Study concept and design: MAA. Acquisition of the data: MAA, FK. Analysis and interpretation of the data: MAA, AG, AS. Drafting of the manuscript: AND. Critical revision of the manuscript for important intellectual content: AS, SGA, AM, MH, DS, DZZ, MS, SE, MA, AM, HG, AS, AS, SZ, AD, MS, AG, FK. Administrative, technical, and material support: MAA. Study supervision: MAA, AG.
Funding
None.
Data availability
Data will be made available upon reasonable request from the corresponding author.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ali Shahriari, Sasan Ghazanafar Ahari and Ali Mousavi contributed equally to this work.
Contributor Information
Alireza Ghaedamini, Email: alirezaghaedamini@gmail.com.
Mahsa Asadi Anar, Email: Mahsa.boz@gmail.com.
References
- 1.Wu W, Klockow JL, Zhang M, Lafortune F, Chang E, Jin L, et al. Glioblastoma multiforme (GBM): an overview of current therapies and mechanisms of resistance. Pharmacol Res. 2021;171:105780. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Colopi A, Fuda S, Santi S, Onorato A, Cesarini V, Salvati M, et al. Impact of age and gender on glioblastoma onset, progression, and management. Mech Ageing Dev. 2023;211:111801. [DOI] [PubMed] [Google Scholar]
- 3.Bernstock JD, Gary SE, Klinger N, Valdes PA, Ibn Essayed W, Olsen HE, et al. Standard clinical approaches and emerging modalities for glioblastoma imaging. Neurooncol Adv. 2022;4(1):vdac080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Forghani R, Savadjiev P, Chatterjee A, Muthukrishnan N, Reinhold C, Forghani B. Radiomics and artificial intelligence for biomarker and prediction model development in oncology. Comput Struct Biotechnol J. 2019;17:995–1008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Badve C, Nirappel A, Lo S, Orringer DA, Olson JJ. Congress of neurological surgeons systematic review and evidence-based guidelines for the role of imaging in newly diagnosed WHO grade II diffuse glioma in adults: update. J Neurooncol. 2025;174(1):7–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Rogers W, Thulasi Seetha S, Refaee TAG, Lieverse RIY, Granzier RWY, Ibrahim A, et al. Radiomics: from qualitative to quantitative imaging. Br J Radiol. 2020;93(1108):20190948. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Noori Mirtaheri P, Akhbari M, Najafi F, Mehrabi H, Babapour A, Rahimian Z, et al. Performance of deep learning models for automatic histopathological grading of meningiomas: a systematic review and meta-analysis. Front Neurol. 2025;16:1536751. doi:10.3389/fneur.2025.1536751. [DOI] [PMC free article] [PubMed]
- 8.Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Chiu F-Y, Yen Y. Efficient Radiomics-Based classification of Multi-Parametric MR images to identify volumetric habitats and signatures in glioblastoma: A machine learning approach. Cancers. 2022;14(6). [DOI] [PMC free article] [PubMed]
- 10.Cui C, Yao X, Xu L, Chao Y, Hu Y, Zhao S, et al. Improving the classification of PCNSL and brain metastases by developing a machine learning model based on ^18F-FDG PET. J Pers Med. 2023;13(3). [DOI] [PMC free article] [PubMed]
- 11.Jeong J, Lee M, John F, Robinette N, Amit-Yousif A, Barger G, et al. Feasibility of multimodal MRI-based deep learning prediction of high amino acid uptake regions and survival in patients with glioblastoma. Front Neurol. 2019;10. [DOI] [PMC free article] [PubMed]
- 12.Karabacak M, Patil S, Gersey ZC, Komotar RJ, Margetis K. Radiomics-Based machine learning with natural gradient boosting for continuous survival prediction in glioblastoma. Cancers. 2024;16:21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Grahovac M, Spielvogel CP, Krajnc D, Ecsedi B, Traub-Weidinger T, et al. Machine learning predictive performance evaluation of conventional and fuzzy radiomics in clinical cancer imaging cohorts. Eur J Nucl Med Mol Imaging. 2023;50(6):1607–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.MA S, ZK K-R HABAZG. Automated glioblastoma patient classification using hypoxia levels measured through magnetic resonance images. BMC Neurosci. 2024;25(1):26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Ouyang J, Chen KT, Duarte Armindo R, Davidzon GA, Hawk KE, Moradi F, et al. Predicting FDG-PET images from Multi-Contrast MRI using deep learning in patients with brain neoplasms. J Magn Reson Imaging. 2024;59(3):1010–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Pan J, Lv R, Zhou G, Si R, Wang Q, Zhao X et al. The detection of invisible abnormal metabolism in the FDG-PET images of patients with Anti-LGI1 encephalitis by machine learning. Front Neurol. 2022;13. [DOI] [PMC free article] [PubMed]
- 17.Inano R, Oishi N, Kunieda T, Arakawa Y, et al. Voxel-based clustered imaging by multiparameter diffusion tensor images for glioma grading. NeuroImage Clin. 2014;5:396–407. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Wei W, et al. Artificial intelligence algorithm-based positron emission tomography (PET) and magnetic resonance imaging (MRI) in the treatment of glioma biopsy. Contrast Media Mol Imaging. 2022;2022:5411801. [DOI] [PMC free article] [PubMed]
- 19.Kong Z, Jiang C, et al. ^18F-FDG-PET-based radiomics features to distinguish primary central nervous system lymphoma from glioblastoma. NeuroImage Clin. 2019;23:101912. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Zhou Y, Ma X, Zhang T, Wang J, Zhang T, Tian R, et al. Use of radiomics based on ^18F-FDG PET/CT and machine learning methods to aid clinical decision-making in the classification of solitary pulmonary lesions: an innovative approach. Eur J Nucl Med Mol Imaging. 2021;48(9):2904–13. [DOI] [PubMed]
- 21.Bi WL, Hosny A, Schabath MB, Giger ML, Birkbak NJ, Mehrtash A, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. Cancer J Clin. 2019;69(2):127–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Norouzkhani N, Mobaraki H, Varmazyar S, Zaboli H, Mohamadi Z, Nikeghbali G, et al. Artificial intelligence networks for assessing the prognosis of Gastrointestinal cancer to immunotherapy based on genetic mutation features: a systematic review and meta-analysis. BMC Gastroenterol. 2025;25(1):310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Lopez E, Etxebarria-Elezgarai J, Amigo JM, Seifert A. The importance of choosing a proper validation strategy in predictive models. A tutorial with real examples. Anal Chim Acta. 2023;1275:341532. [DOI] [PubMed] [Google Scholar]
- 24.Abulibdeh R, Celi LA, Sejdić E. The illusion of safety: A report to the FDA on AI healthcare product approvals. PLOS Digit Health. 2025;4(6):e0000866. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Tejani AS, Klontzas ME, Gatti AA, Mongan JT, Moy L, Park SH, et al. Checklist for artificial intelligence in medical imaging (CLAIM): 2024 update. Radiol Artif Intell. 2024;6(4):e240300. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Mongan J, Moy L, Kahn CE. Jr. Checklist for artificial intelligence in medical imaging (CLAIM): A guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Dondi F, Gatta R, Gazzilli M, Bellini P, Viganò GL, Ferrari C, et al. [18F]FDG PET-Based radiomics and machine learning for the assessment of gliomas and glioblastomas: A systematic review. Information. 2025;16(1):58. [Google Scholar]
- 28.Yousefi M, Akhbari M, Mohamadi Z, Karami S, Dasoomi H, Atabi A, et al. Machine learning based algorithms for virtual early detection and screening of neurodegenerative and neurocognitive disorders: a systematic-review. Front Neurol. 2024;15:2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Sharifi G, Hajibeygi R, Zamani SAM, Easa AM, Bahrami A, Eshraghi R, et al. Diagnostic performance of neural network algorithms in skull fracture detection on CT scans: a systematic review and meta-analysis. Emerg Radiol. 2025;32(1):97–111. [DOI] [PubMed] [Google Scholar]
- 30.Ghanikolahloo M, Taher HJ, Abdullah AD, Asadi Anar M, Tayebi A, Rahimi R, et al. The role of 18F-FDG PET/MRI in assessing pathological complete response to neoadjuvant chemotherapy in patients with breast cancer: a systematic review and meta-analysis. Radiat Oncol. 2024;19(1):164. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Kumar R, Sporn K, Khanna A, Paladugu P, Gowda C, Ngo A, et al. Integrating radiogenomics and machine learning in musculoskeletal oncology care. Diagnostics. 2025;15(11):1377. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Wang Y, Hu Z, Wang H. The clinical implications and interpretability of computational medical imaging (radiomics) in brain tumors. Insights Imaging. 2025;16(1):77. [DOI] [PMC free article] [PubMed] [Google Scholar]