Journal of Imaging Informatics in Medicine
. 2024 Jan 12;37(2):766–777. doi: 10.1007/s10278-023-00945-3

Diagnostic Performance of Artificial Intelligence in Detection of Primary Malignant Bone Tumors: a Meta-Analysis

Mohammad Amin Salehi 1,#, Soheil Mohammadi 1,✉,#, Hamid Harandi 1, Seyed Sina Zakavi 2, Ali Jahanshahi 3, Mohammad Shahrabi Farahani 4, Jim S Wu 5
PMCID: PMC11031503  PMID: 38343243

Abstract

We aim to conduct a meta-analysis of studies that evaluated the diagnostic performance of artificial intelligence (AI) algorithms in the detection of primary bone tumors, distinguishing them from other bone lesions, and to compare them with clinician assessment. A systematic search was conducted using a combination of keywords related to bone tumors and AI. After extracting contingency tables from all included studies, we performed a meta-analysis using a random-effects model to determine the pooled sensitivity and specificity, accompanied by their respective 95% confidence intervals (CI). Quality was assessed with a modified version of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist and the Prediction Model Study Risk of Bias Assessment Tool (PROBAST). The pooled sensitivities for AI algorithms and clinicians on internal validation test sets for detecting bone neoplasms were 84% (95% CI: 79–88) and 76% (95% CI: 64–85), and the pooled specificities were 86% (95% CI: 81–90) and 64% (95% CI: 55–72), respectively. At external validation, the pooled sensitivity and specificity for AI algorithms were 84% (95% CI: 75–90) and 91% (95% CI: 83–96), respectively; for clinicians, they were 85% (95% CI: 73–92) and 94% (95% CI: 89–97). The sensitivity and specificity for clinicians with AI assistance were 95% (95% CI: 86–98) and 57% (95% CI: 48–66). Caution is needed when interpreting these findings owing to potential limitations. Further research is needed to bridge this gap in scientific understanding and promote effective implementation in medical practice.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10278-023-00945-3.

Keywords: Bone tumor, Artificial intelligence, Deep learning, Machine learning, Bone malignancies

Introduction

Although relatively uncommon, bone tumors are the third most common cause of cancer death in patients under the age of 20 [1, 2]. Accurate assessment of the presence and extent of bone tumors, both primary and metastatic, is essential for proper staging and treatment. Accurate diagnosis hinges on a comprehensive evaluation of three essential factors: clinical presentation, imaging appearance, and detailed histopathologic assessment [3]. Bone abnormalities can be detected with a variety of imaging techniques, including magnetic resonance imaging (MRI), positron emission tomography (PET), X-ray, scintigraphy, and computed tomography (CT) [4]. MRI is considered the modality of choice for local staging of bone tumors [5]. Traditional methods, while still valuable for initial bone tumor diagnosis, face challenges in accurately assessing lesions within complex anatomical regions [5].

Artificial intelligence (AI) algorithms, particularly deep learning, have made great strides in image recognition [6], and radiological images are no exception to this rule. Within machine learning, deep learning is a specialized subset characterized by the integration of three or more neural network layers that mimic the functions of human neurons [7].
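As an illustrative sketch only (not a model from any of the included studies), the following PyTorch snippet shows what a minimal "deep" classifier with three stacked layers might look like; the input size, layer widths, and binary benign-versus-malignant output are hypothetical.

```python
import torch.nn as nn

# Minimal sketch of a three-layer ("deep") classifier; all sizes are illustrative.
model = nn.Sequential(
    nn.Linear(64 * 64, 256),  # layer 1: flattened 64x64 image patch -> 256 units
    nn.ReLU(),
    nn.Linear(256, 64),       # layer 2
    nn.ReLU(),
    nn.Linear(64, 2),         # layer 3: logits for benign vs. malignant
)
```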

The application of AI in radiology dates back to 1992 when it was first utilized for the detection of microcalcifications in mammography. However, in recent times, there has been a surge in interest and attention towards it [8].

The use of AI in radiology has been the subject of considerable discussion over the past few years, with a focus on its advantages and disadvantages. Despite the potential benefits of incorporating AI into radiology practice, radiologists must take certain barriers into consideration, including ethical dilemmas, privacy concerns, regulatory compliance, the need for proper training of AI systems, and the safe integration of AI technologies into routine clinical workflows [9]. Additionally, the accuracy and precision of any system implemented in clinical practice must be carefully evaluated, including its reliability and potential limitations, to ensure appropriate use in real-world scenarios.

In the domain of AI applications for bone tumor detection, there has been a growing interest in recent years. This surge in interest is driven by the potential advantages AI offers in improving the accuracy and efficiency of bone tumor diagnosis [10]. However, to effectively harness the power of AI in this specialized field, it is crucial to recognize the current state of AI approaches, their associated results, and the existing gaps and limitations. These aspects are pivotal in justifying the need for a comprehensive meta-analysis.

In this study, we conduct a meta-analysis of studies that evaluated the diagnostic performance of AI algorithms in detecting primary bone tumors and distinguishing them from other bone lesions. Furthermore, we compare the diagnostic accuracy of the AI algorithms with that of experienced radiologists.

Materials and Methods

Protocol and Registration

This review was registered with the International Prospective Register of Systematic Reviews (PROSPERO; CRD42022321526).

Search Strategy and Study Selection

To find articles on AI algorithms in bone tumor diagnosis, we searched PubMed, Scopus, Web of Science Core Collection, CINAHL, EBSCO, IEEE, Medline, and the ACM Digital Library. The initial search was conducted in March 2022 and updated in January 2023. We used a combination of keywords related to bone tumors and AI (Tables E3-E7). We included original studies investigating the diagnostic performance of artificial intelligence algorithms in the detection of bone tumors in comparison with a control group (healthy controls, patients with benign bone lesions, or patients with metastatic bone lesions). Language, publication date, and participant age were not restricted. We excluded non-original studies, studies on metastasis outcomes, and studies not written in English. Our meta-analysis incorporated only studies providing the complete set of four numbers in a contingency table.

Data Extraction

We extracted the following variables from each study: author and publication year; country of study; number of participants per trait; imaging modality; algorithm and architecture; imaging dimensions; evaluation metrics; level; training size; type and size of validation; imaging view; the numbers of true negatives (TN), false positives (FP), true positives (TP), and false negatives (FN); sensitivity and specificity; positive predictive value; the area under the curve (AUC); and accuracy. We constructed a contingency table for each study from the nominal data provided. To guarantee the precision of the data for analysis, two reviewers independently assessed each table.
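For illustration, the sketch below recomputes the evaluation metrics listed above from a single extracted contingency table; the counts are hypothetical, not taken from any included study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Recompute per-study metrics from an extracted 2x2 contingency table."""
    return {
        "sensitivity": tp / (tp + fn),               # recall on diseased cases
        "specificity": tn / (tn + fp),               # recall on controls
        "ppv": tp / (tp + fp),                       # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical study: 45 TP, 7 FP, 8 FN, 40 TN
print(diagnostic_metrics(tp=45, fp=7, fn=8, tn=40))
```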

Quality Assessment

We employed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist to assess adherence to reporting guidelines in the reviewed studies. This checklist comprises 22 recommendations intended to promote transparent reporting of research involving the development and/or validation of prediction models [11]. Because not all items on this checklist are relevant to AI studies (e.g., reporting follow-up time in diagnostic accuracy studies), we used a modified version of TRIPOD (Table E6). To evaluate bias and applicability, we used the Prediction Model Study Risk of Bias Assessment Tool (PROBAST) checklist, which assesses papers across four key domains: participants, predictors, outcomes, and analysis (Table E7) [12]. For the first domain, we evaluated the training and testing images for bias and applicability.

Statistical Analysis

We performed a meta-analysis to assess the diagnostic performance of AI algorithms and clinicians. When at least three eligible studies were identified, we used a random-effects model, which accounts for between-study heterogeneity, to determine the pooled sensitivity and specificity with their respective 95% confidence intervals (CI). Additionally, summary receiver operating characteristic (ROC) curves were constructed to estimate the pooled sensitivities and specificities.
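The actual analysis was run in Stata with hierarchical bivariate models (see Statistical Analysis, below); as a simplified, univariate stand-in, the sketch below pools logit-transformed sensitivities with a DerSimonian-Laird random-effects model. The study counts are hypothetical.

```python
import numpy as np

def pool_sensitivity_dl(tp, fn):
    """Pool per-study sensitivities on the logit scale with a
    DerSimonian-Laird random-effects model (simplified stand-in for the
    bivariate model used in the published analysis)."""
    tp = np.asarray(tp, dtype=float)
    fn = np.asarray(fn, dtype=float)
    sens = (tp + 0.5) / (tp + fn + 1.0)          # 0.5 continuity correction
    y = np.log(sens / (1.0 - sens))              # logit(sensitivity)
    v = 1.0 / (tp + 0.5) + 1.0 / (fn + 0.5)      # approx. within-study variance
    w = 1.0 / v
    y_fe = np.sum(w * y) / np.sum(w)             # fixed-effect mean
    q = np.sum(w * (y - y_fe) ** 2)              # Cochran's Q
    tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_re = 1.0 / (v + tau2)                      # random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))   # back-transform to a proportion
    return expit(y_re), (expit(y_re - 1.96 * se), expit(y_re + 1.96 * se))

# Hypothetical (TP, FN) counts from three studies
pooled, ci = pool_sensitivity_dl(tp=[45, 80, 30], fn=[8, 12, 9])
print(f"pooled sensitivity = {pooled:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```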

Furthermore, we performed a meta-regression analysis to investigate possible sources of heterogeneity among the studies, examining factors such as the use of data augmentation, imaging view, imaging modality, use of transfer learning, localization of pathology in the model output, and the type of approach used in each study. Statistical significance was defined as P < 0.05. All analyses were performed with Stata version 17 (StataCorp, College Station, TX) [13, 14].
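A minimal sketch of the meta-regression idea, under hypothetical study-level estimates: logit-transformed sensitivities are regressed on a binary covariate (here, whether data augmentation was used) with inverse-variance weights. The published analysis was run in Stata rather than Python.

```python
import numpy as np
import statsmodels.api as sm

sens = np.array([0.82, 0.90, 0.78, 0.88, 0.85])   # illustrative per-study estimates
n = np.array([120, 90, 150, 60, 110])             # illustrative test-set sizes
augmented = np.array([1, 0, 1, 0, 0])             # covariate: data augmentation used

y = np.log(sens / (1 - sens))                     # logit-transformed sensitivity
var = 1 / (n * sens * (1 - sens))                 # approx. variance of each logit
fit = sm.WLS(y, sm.add_constant(augmented), weights=1 / var).fit()
print(fit.params, fit.pvalues)                    # slope P < 0.05 => significant covariate
```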

Publication Bias

To reduce the potential impact of publication bias, we checked the reference lists of the included studies. In addition, we formally assessed publication bias using Deeks' regression test of funnel plot asymmetry, which regresses the diagnostic log odds ratio against effective sample size [15].
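Reference [15] describes Deeks' funnel plot asymmetry test; a sketch of that regression, on hypothetical 2x2 counts, follows. A non-significant slope is conventionally read as low potential for publication bias.

```python
import numpy as np
import statsmodels.api as sm

def deeks_asymmetry_test(tp, fp, fn, tn):
    """Deeks' test: weighted regression of the diagnostic log odds ratio
    against 1/sqrt(effective sample size); a significant slope suggests
    funnel plot asymmetry and possible publication bias."""
    tp, fp, fn, tn = (np.asarray(a, float) + 0.5 for a in (tp, fp, fn, tn))
    ldor = np.log(tp * tn / (fp * fn))          # diagnostic log odds ratio
    n_dis, n_ctl = tp + fn, fp + tn             # diseased / non-diseased counts
    ess = 4 * n_dis * n_ctl / (n_dis + n_ctl)   # effective sample size
    X = sm.add_constant(1.0 / np.sqrt(ess))
    fit = sm.WLS(ldor, X, weights=ess).fit()
    return fit.params[1], fit.pvalues[1]        # slope coefficient and its P-value

# Hypothetical per-study counts
slope, p = deeks_asymmetry_test([40, 55, 30], [6, 9, 4], [8, 10, 7], [35, 50, 28])
print(f"slope = {slope:.2f}, P = {p:.2f}")
```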

Results

Study Selection and Characteristics

Our database search identified 3406 articles, of which 794 were retrieved from PubMed and 1406 from Scopus; the counts for the other databases are shown in Fig. 1. A total of 1280 records were removed by automated and manual deduplication before screening. Title and abstract screening of the remaining 2126 articles excluded 462 that were not original studies, nine that were not written in English, and 1583 that were irrelevant to our topic.

Fig. 1. Flowchart of study selection

Full-text screening was conducted on the remaining 72 records, of which 45 studies were excluded for the following reasons: full text not found (n = 13), insufficient data for extraction (n = 17), and segmentation-only analyses (n = 15). Disagreements at each phase were resolved through discussion within the research team (Fig. 1).

As shown in Tables 1 and 2, 24 studies were included in the qualitative and quantitative synthesis. Twenty-three studies evaluated their algorithm through internal validation [10, 16-37] and nine through external validation [1, 10, 18, 21, 24, 25, 38-40]; five used both types of validation [10, 18, 21, 24, 25]. Internal validation involves setting aside a portion of the initial patient dataset to evaluate the model's performance, whereas external validation employs an entirely distinct patient population [41]. Fourteen studies implemented the algorithm on X-ray images [1, 10, 16, 19, 20, 24, 25, 29-33, 36], eleven utilized MRI [18, 21-23, 27, 28, 34, 35, 37, 38, 40], two used CT, and one used panoramic radiographs [26]; Arana et al. used both CT and X-ray [17], and Gitto et al. performed PET-CT alongside CT [39]. In machine learning, a tuning set, often referred to as a validation set or a development set, is a subset of the dataset used to fine-tune the hyperparameters of a machine learning algorithm; it also guides decisions about the model's architecture and configuration [42].
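A schematic sketch of the data-splitting terminology above, using synthetic arrays and scikit-learn; the site names, array shapes, and split fractions are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_site_a, y_site_a = rng.normal(size=(1000, 64)), rng.integers(0, 2, 1000)  # developing site
X_site_b, y_site_b = rng.normal(size=(300, 64)), rng.integers(0, 2, 300)    # separate site

# Internal split: training / tuning (hyperparameter choices) / held-out test
X_train, X_rest, y_train, y_rest = train_test_split(
    X_site_a, y_site_a, test_size=0.3, random_state=0)
X_tune, X_test, y_tune, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Internal validation: evaluate on (X_test, y_test), drawn from the same population.
# External validation: evaluate on (X_site_b, y_site_b), never used in training or tuning.
```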

Table 1. Study characteristics, internal validation

First author Year Country Imaging modality Target condition View Comparison group No. of images per set (Training / Tuning / Testing) Reference standard Model output
Reinus 1994 United States X-ray Focal bone lesions NR N/A NR NR NR NR Binary classification and differential diagnosis of tumors
Arana 1998 Spain X-ray, CT scan Calvarial lesions NR Comparison with other algorithm NR NR NR Histologic diagnosis Binary classification
Do 2017 USA X-ray Bone tumors AP view or image with best visualization of the tumor N/A NR NR 710 Pathologic diagnosis and pathognomonic features Pathological diagnosis and differential diagnosis of tumors
Ho 2019 South Korea X-ray Knee bone tumors NR Comparison of multiple architectures NR NR 963 Bone tumor specialists Binary and ternary classification
Gitto 2020 Italy 1.5-T MRI Cartilaginous tumors of the bone T1W & T2W Expert clinician NR NR NR Histological diagnosis Binary classification
He 2020 China X-ray Primary bone tumors NR Expert clinicians NR NR NR Histological diagnosis Binary and ternary classification
Altameem 2020 Saudi Arabia X-ray Bone cancer NR Comparison of multiple algorithms NR NR NR NR Bone cancer detection
Chianca 2021 Italy 1.5-T MRI Spinal lesions T1W, T2W & DWI Comparison of other algorithms and expert clinicians NR NR NR Post-operative pathological diagnosis Binary and ternary classification
Eweje 2021 United States MRI Bone and soft tissue tumors T1W, T2W & T1C Comparison of other algorithms and expert clinicians NR NR NR Histological diagnosis Binary classification
Lee 2021 Korea Panoramic radiographs jaw lesions Panoramic Comparison of multiple algorithms 322 NR 134 CT and histologically confirmed diagnosis Binary classification
Pan 2021 China X-ray Bone tumors AP & lateral N/A NR NR NR Pathological diagnosis Binary and ternary classification
Sharma 2021 India X-ray Bone cancer NR Comparison with other algorithm 65 NR 40 NR Detection of cancerous bone from healthy bone
Liu 2021 China X-ray Bone tumors AP & lateral Five radiologists 784 97 101 Pathological diagnosis Binary and ternary classification
Gitto-A 2022 Italy 1.5-T MRI Spine bone tumors T2W & DWI Models with different feature numbers NR NR NR Histological diagnosis Binary classification
Park 2022 Korea X-ray Proximal femur tumors AP Expert clinicians and CNN model without applying image pre-processing procedures 538 NR NR Confirmed using MRI and tissue specimens Ternary classification
Liu-A 2022 China MRI Primary spine tumors Axial and Sagittal Expert clinicians 14,072 NR 6161 Pathological diagnosis Binary classification
Consalvo 2022 Germany X-ray Ewing sarcoma and acute osteomyelitis NR NR 127.4 27.3 27.3 Histopathological diagnosis Binary classification
Liu-B 2022 China MRI Spinal tumors Sagittal (T1W, T2W, and FS-T2W) Expert clinicians 15,778 NR 4815 Pathologically confirmed by trocar biopsy or surgery Binary classification
Von Schacky 2022 Germany X-ray Primary bone tumors NR Expert clinicians NR NR NR Histopathological diagnosis Binary classification
Keyang Zhao 2022 China 3-T MRI Musculoskeletal tumors Axial, sagittal, and coronal (T1W, T2W, DWI, and CET1-W) Expert clinicians 180 62 62 Pathological diagnosis Binary classification
Shen Zhao 2022 China MRI Vertebrae tumor NR NR NR NR NR NR Detection of vertebral tumor

Postero-anterior (PA), Magnetic resonance imaging (MRI), Tesla (T), T1-weighted (T1W), T2-weighted (T2W), fat saturated T2-weighted (FS-T2W), post contrast T1-weighted (T1C), contrast-enhanced T1-weighted (CET1-W), diffusion-weighted imaging (DWI), computed tomography (CT), not reported (NR), not applicable (N/A)

Table 2. Study characteristics, external validation

First author Year Country Imaging modality Target condition View Comparison group No. of images per set (Training / Tuning / Testing) Reference standard Model output
Ho 2019 South Korea X-ray Knee bone tumors NR Comparison of multiple architectures NR NR 963 Bone tumor specialists Ternary classification
He 2020 China X-ray Primary bone tumors NR Expert clinicians NR NR NR Histological diagnosis Binary and ternary classification
Chianca 2021 Italy 1.5-T MRI Spinal lesions T1W, T2W, and DWI Comparison of other algorithms NR NR NR Post-operative pathological diagnosis Binary and ternary classification
Eweje 2021 United States MRI Bone & soft tissue tumors T1W, T2W, and T1C Comparison of other algorithms NR NR NR Histological diagnosis Binary classification
Gitto 2021 Italy CT and PET-CT scan Atypical cartilaginous tumor or appendicular chondrosarcoma NR Musculoskeletal radiologist 84 NR 36 Histological diagnosis Binary classification
Gitto-B 2022 Italy 1.5-T and 3-T MRI ACT & CST of long bones T1W and T2W Expert bone tumor radiologist 93 NR 65 Post-operative pathological diagnosis Binary classification
Li 2022 China X-ray Bone tumors NR Expert clinicians 740 262 428 Histological diagnosis Four-way classification
Von Schacky 2022 Germany X-ray Primary bone tumors NR Expert clinicians NR NR NR Histopathological diagnosis Binary Classification

Magnetic resonance imaging (MRI), Tesla (T), T1-weighted (T1W), T2-weighted (T2W), post contrast T1-weighted (T1C), diffusion-weighted imaging (DWI), computed tomography (CT), positron emission tomography (PET), atypical cartilaginous tumor (ACT), grade II chondrosarcoma (CS2), not reported (NR)

Most of the included studies detected tumors in multiple body regions simultaneously. The reference standard for the diagnosis of the lesion was histopathology in 20 studies [1, 10, 17-25, 27-30, 34, 37, 39, 40, 43] and a combination of expert radiologist review and biopsy in two studies [26, 31]. Five studies did not report their reference standard [16, 32, 33, 35, 36].

Study Participants and Algorithm Development

The total number of study participants or bone lesions could not be determined because some articles provided insufficient data. The mean age of participants varied widely between studies (median, 33.1; IQR, 26-53), as did the percentage of disease-positive participants (median, 46.6; IQR, 30.6-75.8). Across all experiments, a cumulative total of 37,501 images was used to train the algorithms, and 19,130 images served as testing sets. Overall, 43.2% of the images contained primary bone malignancies, while the remainder served as controls. Eleven studies used data augmentation, and six used transfer learning (Tables E1-E3).

Model Performance

Supplementary Tables E2 and E3 present the results of the included studies by evaluation metric. Specifically, 21 studies reported sensitivity and specificity, all 24 reported accuracy, 14 reported the area under the receiver operating characteristic curve, 19 reported positive and negative predictive values, and 18 reported the F1 score.

Quality Assessment

Figure 2 illustrates each study's compliance with the modified version of the TRIPOD statement. None of the included studies complied with items 8 and 9, which pertain to sample size and to the handling of missing data, respectively. Adherence was also notably low for several other key items, such as the description of participant flow through the study (24%), explanation of how the prediction model is to be used (24%), and the dates of image collection (30%). Of the 27 items evaluated, 17 received adherence ratings of 50% or above.

Fig. 2. Adherence to Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) and Prediction Model Study Risk of Bias Assessment Tool (PROBAST) reporting guidelines

We also applied PROBAST to the included studies, which yielded 37 (55.22%) low-risk-of-bias ratings and 31 (67.39%) ratings of low concern for applicability across the assessed domains, as depicted in Fig. 2. Notably, 23 (95.83%) studies were classified as having a high risk of patient selection bias owing to their predetermined inclusion and exclusion criteria, and five studies (20.83%) raised concerns regarding the applicability of the results.

Meta-Analysis

A total of 27 studies containing adequate data were included in the meta-analysis, from which we extracted 86 contingency tables for binary bone tumor detection. Of these, 42 contingency tables from 21 studies pertained to AI algorithms with internal validation, 11 tables from eight studies to AI algorithms with external validation, ten tables from two studies to clinicians' performance with internal validation, and 16 tables from three studies to clinicians' performance with external validation. An additional seven contingency tables were extracted from a single study investigating clinicians who used AI as an assistive tool [34]. A meta-analysis was conducted for all five groups; however, the clinicians-with-AI-assistance and clinicians-with-internal-validation groups each comprised fewer than three studies, so their results require validation in further research. Figure 3 is a forest plot displaying the effect estimates and confidence intervals for four groups; each row corresponds to an AI model, as some studies evaluated multiple algorithms. Figure 4 shows the hierarchical summary receiver operating characteristic (HSROC) curves for the internal and external validation test sets.

Fig. 3. Forest plots of algorithm and clinician performance. The plot shows the effect size estimate (diamond), confidence interval (horizontal line), and individual study effect sizes (squares) with their corresponding weights

Fig. 4. Hierarchical summary receiver operating characteristic (HSROC) curves for A algorithms on internal validation, B algorithms on external validation, C clinicians on internal validation, D clinicians on external validation

The pooled sensitivities for AI algorithms and clinicians on internal validation test sets were 84% (95% CI: 79-88) and 76% (95% CI: 64-85), and the pooled specificities were 86% (95% CI: 81-90) and 64% (95% CI: 55-72), respectively. At external validation, the pooled sensitivity and specificity for AI algorithms were 84% (95% CI: 75-90) and 91% (95% CI: 83-96); for clinicians, they were 85% (95% CI: 73-92) and 94% (95% CI: 89-97), respectively (Table 3). The pooled sensitivity and specificity from the seven contingency tables of clinicians with AI assistance were 95% (95% CI: 86-98) and 57% (95% CI: 48-66). The positive and negative likelihood ratios, AUC, and diagnostic odds ratio of each group are also presented in Table 3.

Table 3. Pooled sensitivities, specificities, area under the curve, positive likelihood ratio, negative likelihood ratio, and diagnostic odds ratio

Parameter Sensitivity (%) Specificity (%) AUC Positive likelihood ratio Negative likelihood ratio Diagnostic odds ratio No. of contingency tables
Algorithms, internal validation 84 (79-88) 86 (81-90) 0.92 (0.89-0.94) 6.0 (4.2-8.6) 0.18 (0.13-0.25) 33 (17-61) 42
Algorithms, external validation 84 (75-90) 91 (83-96) 0.93 (0.91-0.95) 9.6 (4.9-19.0) 0.18 (0.11-0.28) 54 (25-117) 11
Clinicians, internal validation 76 (64-85) 64 (55-72) 0.74 (0.70-0.78) 2.1 (1.7-2.6) 0.38 (0.26-0.54) 6 (4-9) 10
Clinicians, external validation 85 (73-92) 94 (89-97) 0.96 (0.94-0.97) 13.6 (7.2-25.7) 0.16 (0.09-0.30) 85 (29-250) 16
Clinicians with AI assistance 95 (86-98) 57 (48-66) 0.66 (0.62-0.70) 2.2 (1.8-2.7) 0.09 (0.03-0.26) 24 (8-75) 7

Area under the curve (AUC), artificial intelligence (AI)
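As a quick arithmetic check of how Table 3's secondary metrics follow from the pooled estimates (the table's values are pooled directly across studies, so small discrepancies are expected):

```python
sens, spec = 0.84, 0.86          # algorithms, internal validation (Table 3)
lr_pos = sens / (1 - spec)       # 0.84 / 0.14 ~ 6.0 (matches Table 3)
lr_neg = (1 - sens) / spec       # 0.16 / 0.86 ~ 0.19 (table: 0.18)
dor = lr_pos / lr_neg            # ~ 32 (table: 33, pooled directly)
print(lr_pos, lr_neg, dor)
```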

The findings from our analyses of studies using internal validation demonstrate that the use of MRI was associated with lower sensitivity and specificity (75%, CI: 67-83, and 74%, CI: 66-83, respectively) compared with other imaging modalities (90%, CI: 86-94, P = 0.00, and 92%, CI: 89-95, P = 0.00). The use of data augmentation was associated with lower sensitivity (82%, CI: 74-90 versus 86%, CI: 80-91, P = 0.00) and higher specificity (90%, CI: 85-96 versus 82%, CI: 75-89, P = 0.10). Transfer learning was associated with higher sensitivity (97%, CI: 94-100 versus 81%, CI: 76-86, P = 0.36) and specificity (97%, CI: 93-100 versus 83%, CI: 78-89, P = 0.43), although neither difference was statistically significant. There was no significant difference in sensitivity or specificity between studies that used patients without bone lesions as the control group and those that used patients with other types of tumors (excluding the specific tumor under investigation); the P-values for sensitivity and specificity were 0.26 and 0.77, respectively. Other parameters and meta-regression results for the remaining groups can be found in Tables E9-E12.

Publication Bias

Publication bias was assessed separately for all five groups: internally validated AI, externally validated AI, internally validated clinicians, externally validated clinicians, and clinicians with AI assistance. The slope coefficients were −31.15 (95% CI: −48.92, −13.38; P = 0.00), −13.91 (95% CI: −31.17, 3.35; P = 0.10), −3.26 (95% CI: −27.74, 21.22; P = 0.77), −24.45 (95% CI: −65.62, 16.72; P = 0.22), and −985.26 (95% CI: −1311.26, −659.23; P = 0.00), respectively. The non-significant slopes for the externally validated AI and both clinician-only groups indicate a weak association between sample size and effect size, and thus low potential for publication bias, in those groups; the significant slopes for the internally validated AI and AI-assisted clinician groups suggest possible funnel plot asymmetry (Table E8).

Discussion

To the best of our knowledge, no prior meta-analysis has examined the performance of artificial intelligence in detecting primary bone lesions on imaging, although there is a growing body of original literature on the subject. Zheng and colleagues conducted a meta-analysis of artificial intelligence for detecting metastasis in various parts of the body, including lymph nodes and bone, on imaging [44]. The authors reported a diagnostic odds ratio of 22.14 (95% CI: 18.52-26.46) for all metastases, substantially lower than the diagnostic odds ratios from our internal and external validation analyses of 33 (95% CI: 17-61) and 54 (95% CI: 25-117).

In certain cases, the machines outperformed human experts, demonstrating robust performance. Based on these findings, automated methods may offer an alternative to traditional expert-driven approaches for the diagnosis of certain medical conditions. Despite the promising results of our study, caution must be exercised when interpreting the findings due to the potential for underreporting of radiologists’ performance. This may be attributed to a range of factors, including the following:

  1. The studies under review did not specify the radiologists' level of experience.

  2. The imaging modalities used in some studies to detect bone tumors may not be the modality of choice for this purpose.

  3. Few studies have simultaneously examined the quality of the algorithms and the accuracy of radiologists on the same dataset.

In a 2020 meta-analysis of 21 studies, Younis et al. reported pooled sensitivity and specificity of 89.2% and 76.3%, respectively, for 18-fluorodeoxyglucose positron emission tomography (18F-FDG-PET) combined with CT, as interpreted by expert radiologists, in the detection of bone and soft tissue sarcomas [45]. A meta-analysis of 31 studies examined expert radiologists' performance in detecting Ewing sarcoma (ES) using 18F-FDG PET and PET/computed tomography (PET/CT); the pooled sensitivity and specificity for total ES lesions were 92.6% and 74.1%, respectively [46].

Furthermore, several secondary outcomes emerged from the data in addition to the general findings. The meta-regression findings in Table E9 show that applying AI to the identification of bone tumors on X-ray and CT images gave considerably better results than on MRI images, which was predictable given previous studies showing that CT is a superior modality for visualizing bone structure and abnormalities [47]. Furthermore, the application of transfer learning, a machine learning approach that enhances learning on a new task by leveraging knowledge gained from a related task [49], improved the performance of the automated methods. It should be noted, however, that data augmentation, a strategy for constructing incremental or sampling algorithms using unobserved data or latent variables [48], adversely affected the sensitivity of the machine learning algorithms [43, 49].

For a proper evaluation of this study's results, its limitations must be considered. The scarcity of available data limits the precision of our conclusions. As shown in Table E1, the assessment of AI accuracy on external validation was predicated solely on data derived from five studies [1, 10, 18, 38, 39]. Analyzing the performance of radiologists was a particular challenge: Table E1 shows that just two studies reported radiologist data on internal validation [34, 43] and three on external validation [1, 10, 38]. The present scarcity of research on this topic creates a notable opportunity for researchers to conduct further investigations and bridge the existing gap in scientific understanding. Such work would offer valuable insight into the viability of this technology in real-world contexts and thus promote its effective implementation for the advancement of medical practice. Other limitations of this review were the absence of precise participant and image counts in some studies, particularly regarding those with or without a bone lesion, and the limited number of studies for each specific body region, which prevented meaningful subgroup analyses by anatomic area. Consequently, future research in this field could enable more accurate assessment of AI performance in detecting bone tumors within distinct body regions. We intended to perform a sensitivity analysis restricted to studies at low risk of bias as measured by the modified PROBAST method; however, the quality assessment showed that the low-risk-of-bias subgroup contained too few studies (four) to permit this analysis.

Supplementary Information

Below is the link to the electronic supplementary material.

Author Contribution

Soheil Mohammadi and Mohammad Amin Salehi designed the project, contributed to protocol development, literature search, screening, data extraction, and writing of the original draft. Hamid Harandi contributed to writing of the original draft. Seyed Sina Zakavi and Ali Jahanshahi contributed to screening and protocol development. Mohammad Shahrabi Farahani contributed to the writing of the original draft. Jim S. Wu encouraged and supervised the project, reviewed the manuscript, and contributed to the writing of the original and final draft.

Data Availability

The data that support the findings of this study are available from the authors upon reasonable request.

Declarations

Conflict of Interest

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Mohammad Amin Salehi and Soheil Mohammadi contributed equally.

References

  • 1. Li J, Li S, Li X, Miao S, Dong C, Gao C, et al. Primary bone tumor detection and classification in full-field bone radiographs via YOLO deep learning model. European Radiology. 2022.
  • 2. Kerr DL, Dial BL, Lazarides AL, Catanzano AA, Lane WO, Blazer DG 3rd, et al. Epidemiologic and survival trends in adult primary bone tumors of the spine. Spine J. 2019;19(12):1941-1949. doi: 10.1016/j.spinee.2019.07.003.
  • 3. Georgeanu VA, Mămuleanu M, Ghiea S, Selișteanu D. Malignant Bone Tumors Diagnosis Using Magnetic Resonance Imaging Based on Deep Learning Algorithms. Medicina (Lithuania). 2022;58(5).
  • 4. Salazar C, Leite M, Sousa A, Torres J. Correlation between imagenological and histological diagnosis of bone tumors. A retrospective study. Acta Ortop Mex. 2019;33(6):386-390.
  • 5. Goyal N, Kalra M, Soni A, Baweja P, Ghonghe NP. Multi-modality imaging approach to bone tumors - State-of-the art. J Clin Orthop Trauma. 2019;10(4):687-701. doi: 10.1016/j.jcot.2019.05.022.
  • 6. Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts H. Artificial intelligence in radiology. Nat Rev Cancer. 2018;18(8):500-510. doi: 10.1038/s41568-018-0016-5.
  • 7. Han SH, Kim KW, Kim S, Youn YC. Artificial Neural Network: Understanding the Basic Concepts without Mathematics. Dement Neurocogn Disord. 2018;17(3):83-89. doi: 10.12779/dnd.2018.17.3.83.
  • 8. Driver CN, Bowles BS, Bartholmai BJ, Greenberg-Worisek AJ. Artificial Intelligence in Radiology: A Call for Thoughtful Application. Clin Transl Sci. 2020;13(2):216-218. doi: 10.1111/cts.12704.
  • 9. Nair AV, Ramanathan S, Sathiadoss P, Jajodia A, Blair Macdonald D. Barriers to artificial intelligence implementation in radiology practice: What the radiologist needs to know. Radiologia (Engl Ed). 2022;64(4):324-332. doi: 10.1016/j.rxeng.2022.04.001.
  • 10. von Schacky CE, Wilhelm NJ, Schäfer VS, Leonhardt Y, Jung M, Jungmann PM, et al. Development and evaluation of machine learning models based on X-ray radiomics for the classification and differentiation of malignant and benign bone tumors. European Radiology. 2022;32(9):6247-6257. doi: 10.1007/s00330-022-08764-w.
  • 11. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Annals of Internal Medicine. 2015;162(1):55-63. doi: 10.7326/M14-0697.
  • 12. Wolff RF, Moons KG, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Annals of Internal Medicine. 2019;170(1):51-58. doi: 10.7326/M18-1376.
  • 13. Dwamena B. MIDAS: Stata module for meta-analytical integration of diagnostic test accuracy studies. 2009.
  • 14. Harbord RM, Whiting P. Metandi: meta-analysis of diagnostic accuracy using hierarchical logistic regression. The Stata Journal. 2009;9(2):211-229. doi: 10.1177/1536867X0900900203.
  • 15. Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. Journal of Clinical Epidemiology. 2005;58(9):882-893. doi: 10.1016/j.jclinepi.2005.01.016.
  • 16. Altameem T. Fuzzy rank correlation-based segmentation method and deep neural network for bone cancer identification. Neural Computing and Applications. 2020;32(3):805-815. doi: 10.1007/s00521-018-04005-8.
  • 17. Arana E, Marti-Bonmati L, Bautista D, Paredes R. Calvarial eosinophilic granuloma: Diagnostic models and image feature selection with a neural network. Academic Radiology. 1998;5(6):427-434. doi: 10.1016/S1076-6332(98)80030-5.
  • 18. Chianca V, Cuocolo R, Gitto S, Albano D, Merli I, Badalyan J, et al. Radiomic Machine Learning Classifiers in Spine Bone Tumors: A Multi-Software, Multi-Scanner Study. Eur J Radiol. 2021;137:109586. doi: 10.1016/j.ejrad.2021.109586.
  • 19. Consalvo S, Hinterwimmer F, Neumann J, Steinborn M, Salzmann M, Seidl F, et al. Two-Phase Deep Learning Algorithm for Detection and Differentiation of Ewing Sarcoma and Acute Osteomyelitis in Paediatric Radiographs. Anticancer Research. 2022;42(9):4371-4380. doi: 10.21873/anticanres.15937.
  • 20. Do BH, Langlotz C, Beaulieu CF. Bone Tumor Diagnosis Using a Naïve Bayesian Model of Demographic and Radiographic Features. J Digit Imaging. 2017;30(5):640-647. doi: 10.1007/s10278-017-0001-7.
  • 21. Eweje FR, Bao B, Wu J, Dalal D, Liao WH, He Y, et al. Deep Learning for Classification of Bone Lesions on Routine MRI. EBioMedicine. 2021;68:103402. doi: 10.1016/j.ebiom.2021.103402.
  • 22. Gitto S, Cuocolo R, Albano D, Chianca V, Messina C, Gambino A, et al. MRI radiomics-based machine-learning classification of bone chondrosarcoma. Eur J Radiol. 2020;128:109043. doi: 10.1016/j.ejrad.2020.109043.
  • 23. Gitto S, Cuocolo R, van Langevelde K, van de Sande MAJ, Parafioriti A, Luzzati A, et al. MRI radiomics-based machine learning classification of atypical cartilaginous tumour and grade II chondrosarcoma of long bones. EBioMedicine. 2022;75.
  • 24. He Y, Pan I, Bao B, Halsey K, Chang M, Liu H, et al. Deep learning-based classification of primary bone tumors on radiographs: A preliminary study. EBioMedicine. 2020;62:103121. doi: 10.1016/j.ebiom.2020.103121.
  • 25. Ho NH, Yang HJ, Kim SH, Jung ST, Joo SD. Regenerative semi-supervised bidirectional w-network-based knee bone tumor classification on radiographs guided by three-region bone segmentation. IEEE Access. 2019;7:154277-154289. doi: 10.1109/ACCESS.2019.2949125.
  • 26. Lee A, Kim MS, Han SS, Park PG, Lee C, Yun JP. Deep learning neural networks to differentiate Stafne's bone cavity from pathological radiolucent lesions of the mandible in heterogeneous panoramic radiography. PLoS ONE. 2021;16(7).
  • 27. Liu H, Jiao M, Yuan Y, Ouyang H, Liu J, Li Y, et al. Benign and malignant diagnosis of spinal tumors based on deep learning and weighted fusion framework on MRI. Insights Imaging. 2022;13(1):87. doi: 10.1186/s13244-022-01227-2.
  • 28. Liu H, Jiao ML, Xing XY, Ou-Yang HQ, Yuan Y, Liu JF, et al. BgNet: Classification of benign and malignant tumors with MRI multi-plane attention learning. Front Oncol. 2022;12:971871. doi: 10.3389/fonc.2022.971871.
  • 29. Liu Y, Yang P, Pi Y, Jiang L, Zhong X, Cheng J, et al. Automatic identification of suspicious bone metastatic lesions in bone scintigraphy using convolutional neural network. BMC Med Imaging. 2021;21(1):131. doi: 10.1186/s12880-021-00662-9.
  • 30. Pan D, Liu R, Zheng B, Yuan J, Zeng H, He Z, et al. Using Machine Learning to Unravel the Value of Radiographic Features for the Classification of Bone Tumors. Biomed Res Int. 2021;2021:8811056. doi: 10.1155/2021/8811056.
  • 31. Park CW, Oh SJ, Kim KS, Jang MC, Kim IS, Lee YK, et al. Artificial intelligence-based classification of bone tumors in the proximal femur on plain radiographs: System development and validation. PLoS One. 2022;17(2):e0264140. doi: 10.1371/journal.pone.0264140.
  • 32. Reinus WR, Wilson AJ, Kalman B, Kwasny S. Diagnosis of focal bone lesions using neural networks. Investigative Radiology. 1994;29(6):606-611. doi: 10.1097/00004424-199406000-00002.
  • 33. Sharma A, Yadav DP, Garg H, Kumar M, Sharma B, Koundal D. Bone Cancer Detection Using Feature Extraction Based Machine Learning Model. Comput Math Methods Med. 2021;2021:7433186. doi: 10.1155/2021/7433186.
  • 34. Zhao K, Zhang M, Xie Z, Yan X, Wu S, Liao P, et al. Deep Learning Assisted Diagnosis of Musculoskeletal Tumors Based on Contrast-Enhanced Magnetic Resonance Imaging. Journal of Magnetic Resonance Imaging. 2022;56(1):99-107. doi: 10.1002/jmri.28025.
  • 35. Zhao S, Chen B, Chang H, Chen B, Li S. Reasoning discriminative dictionary-embedded network for fully automatic vertebrae tumor diagnosis. Med Image Anal. 2022;79:102456. doi: 10.1016/j.media.2022.102456.
  • 36. Do NT, Jung ST, Yang HJ, Kim SH. Multi-Level Seg-Unet Model with Global and Patch-Based X-ray Images for Knee Bone Tumor Detection. Diagnostics (Basel). 2021;11(4).
  • 37. Fouad H, Hassanein AS, Soliman AM, Al-Feel H. Internet of Medical Things (IoMT) Assisted Vertebral Tumor Prediction Using Heuristic Hock Transformation Based Gautschi Model - A Numerical Approach. IEEE Access. 2020;8:17299-17309. doi: 10.1109/ACCESS.2020.2966272.
  • 38. Gitto S, Bologna M, Corino VDA, Emili I, Albano D, Messina C, et al. Diffusion-weighted MRI radiomics of spine bone tumors: feature stability and machine learning-based classification performance. Radiol Med. 2022.
  • 39. Gitto S, Cuocolo R, Annovazzi A, Anelli V, Acquasanta M, Cincotta A, et al. CT radiomics-based machine learning classification of atypical cartilaginous tumours and appendicular chondrosarcomas. EBioMedicine. 2021;68:103407. doi: 10.1016/j.ebiom.2021.103407.
  • 40. Ouyang H, Meng F, Liu J, Song X, Li Y, Yuan Y, et al. Evaluation of Deep Learning-Based Automated Detection of Primary Spine Tumors on MRI Using the Turing Test. Front Oncol. 2022;12:814667. doi: 10.3389/fonc.2022.814667.
  • 41. Shung D, Simonov M, Gentry M, Au B, Laine L. Machine Learning to Predict Outcomes in Patients with Acute Gastrointestinal Bleeding: A Systematic Review. Dig Dis Sci. 2019;64(8):2078-2087. doi: 10.1007/s10620-019-05645-z.
  • 42. Montesinos-López O, Montesinos A, Crossa J. Overfitting, Model Tuning, and Evaluation of Prediction Performance. 2022. p. 109-139.
  • 43. Liu S, Feng M, Qiao T, Cai H, Xu K, Yu X, et al. Deep Learning for the Automatic Diagnosis and Analysis of Bone Metastasis on Bone Scintigrams. Cancer Manag Res. 2022;14:51-65. doi: 10.2147/CMAR.S340114.
  • 44. Zheng Q, Yang L, Zeng B, Li J, Guo K, Liang Y, et al. Artificial intelligence performance in detecting tumor metastasis from medical radiology imaging: A systematic review and meta-analysis. EClinicalMedicine. 2021;31:100669. doi: 10.1016/j.eclinm.2020.100669.
  • 45. Younis MH, Abu-Hijleh HA, Aldahamsheh OO, Abualruz A, Thalib L. Meta-Analysis of the Diagnostic Accuracy of Primary Bone and Soft Tissue Sarcomas by 18F-FDG-PET. Med Princ Pract. 2020;29(5):465-472. doi: 10.1159/000505651.
  • 46. Seth N, Seth I, Bulloch G, Siu AHY, Guo A, Chatterjee R, et al. 18F-FDG PET and PET/CT as a diagnostic method for Ewing sarcoma: A systematic review and meta-analysis. Pediatr Blood Cancer. 2022;69(3):e29415. doi: 10.1002/pbc.29415.
  • 47. Zimmer WD, Berquist TH, McLeod RA, Sim FH, Pritchard DJ, Shives TC, et al. Bone tumors: magnetic resonance imaging versus computed tomography. Radiology. 1985;155(3):709-718. doi: 10.1148/radiology.155.3.4001374.
  • 48. van Dyk DA, Meng X-L. The Art of Data Augmentation. Journal of Computational and Graphical Statistics. 2001;10(1):1-50. doi: 10.1198/10618600152418584.
  • 49. Torrey L, Shavlik J. Transfer learning. Handbook of Research on Machine Learning Applications. 2009.
