Abstract
Background
Osteoporosis is one of the most common metabolic diseases that is characterized by a decrease in bone density and a loss of the quality of the bone structure. The use of deep learning in the prediction of osteoporosis can provide a non-invasive, cost-effective, and efficient approach. The aim of this study is to investigate the diagnostic accuracy of deep learning in the prediction of osteoporosis.
Methods
This is a systematic review and meta-analysis study that was conducted on the diagnostic accuracy of deep learning algorithms for predicting osteoporosis. A literature search was performed in electronic databases including PubMed, Elsevier, and Google Scholar to identify relevant articles until December 1, 2023. Articles were searched in databases by combining related terms such as “deep learning”, “convolutional neural network”, and “osteoporosis”. We conducted title, abstract, and full-text screening based on inclusion/exclusion criteria. Various metrics, such as sensitivity, specificity, and area under the curve (AUC), were used to assess the diagnostic performance of deep learning models.
Results
Out of the 181 articles initially identified, 10 studies were included in the analysis. All studies used a convolutional neural network (CNN) as the deep learning model. Three studies investigated multiple deep learning models. Eight studies used various architectures of CNN, such as ResNet, VGG, and EfficientNet. The pooled sensitivity and specificity were 0.86 (95% CI, 0.82–0.89) and 0.89 (95% CI, 0.85–0.91), respectively. The bivariate approach’s pooled SROC curve produced an AUC of 0.94 (95% CI 0.91–0.95). The Diagnostic Odds Ratio (DOR) for the deep learning models was 49.09 (95% CI, 28.74–83.84). Deeks’ funnel plot asymmetry test (P = 0.4) suggested no potential publication bias.
Conclusions
Deep learning has an acceptable performance for the diagnosis of osteoporosis, even better than other ML algorithms. However, further research is needed to validate the findings of this study in clinical trials.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12891-024-08120-7.
Keywords: Osteoporosis, BMD, Deep learning, Machine learning, Diagnosis
Background
Osteoporosis is a disease that is characterized by a decrease in bone density and a loss of the quality of the bone structure, leading to an increase in bone fragility and in the risk of fracture [1, 2]. It is considered one of the most common metabolic diseases, and it is known as a silent disease because it does not cause specific symptoms in the affected person. Osteoporosis is one of the most common diseases in old age and is growing more prevalent as the world’s population ages quickly, creating significant medical, economic, and social challenges [3]. Often, the occurrence of a fracture is the first sign of the disease [4, 5]. Genetic factors and race are important factors affecting bone density. Also, physiological, environmental, and lifestyle factors can significantly contribute to reaching the maximum bone density and maintaining it throughout life [6, 7]. In 2019, the global incident cases of osteoporosis increased to 41.5 million. The projected global total number of incident cases of osteoporosis between 2030 and 2034 is estimated to reach 263.2 million (154.4 million for females and 108.8 million for males) [8]. About 20% of patients who experience a bone fracture require long-term care in treatment centers, and 60% of patients never regain the functional independence they had before the fracture [9]. Fractures caused by osteoporosis, in addition to the pain and discomfort that directly affect the patient, also cause long-term complications and disabilities that can severely reduce quality of life and can ultimately lead to death [10].
Prevention and timely diagnosis of osteoporosis play a significant role in maintaining bone density and treating this disease [11, 12]. The conventional method for the diagnosis of osteoporosis is to measure bone mineral density (BMD) using dual-energy X-ray absorptiometry (DEXA, or DXA) [13]. While this method is the gold standard, it has some limitations, such as its high cost, exposure to radiation, the need for interpretation of results by experts, and its unsuitability for mobile screening events. In addition, bone density is not the only factor related to this disease, and many factors can contribute to the occurrence of osteoporosis [6, 13].
With the advancement of artificial intelligence and machine learning (ML), the use of these methods has increased in the healthcare domain, especially in the prognosis and diagnosis of various diseases, and favorable results have been obtained [14, 15]. In recent years, the application of deep learning (DL), as a subfield of ML, in medical imaging has increased significantly due to its potential to revolutionize disease diagnosis and risk prediction [16]. DL algorithms, particularly convolutional neural networks (CNNs), have demonstrated impressive performance in a variety of medical applications, from neuroimaging analysis to cancer detection [17, 18]. Their capacity to automatically learn complex patterns and features from large datasets makes DL an appealing opportunity for improving osteoporosis diagnosis. The use of DL in the prediction of osteoporosis may provide a non-invasive, cost-effective, and efficient approach to enhance diagnostic accuracy [19]. In recent years, the research and literature on the development of DL models for the prediction of osteoporosis, with different methods and heterogeneous study designs, have increased. To provide a comprehensive overview, we conducted a systematic review and meta-analysis to synthesize the current evidence on the diagnostic accuracy of DL models in predicting osteoporosis.
This systematic review and meta-analysis aim to contribute valuable insights into the evolving landscape of osteoporosis diagnostics, shedding light on the potential of DL as a transformative tool in this field. By synthesizing and critically evaluating the available evidence, we want to provide clinicians, researchers, and policymakers with a robust foundation for understanding the current state of DL in osteoporosis prediction and its implications for clinical practice. Ultimately, this work aims to guide future research endeavors, inform healthcare decision-making, and foster the integration of innovative technologies for improved osteoporosis management.
Methods
Data resources and search strategy
This study was conducted based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [20] (Fig. 1). The study protocol was registered in PROSPERO (CRD42023461046). A literature search was performed in electronic databases including MEDLINE (PubMed), EMBASE (Elsevier), and Google Scholar to identify relevant articles until December 1, 2023. For Google Scholar, we only used the first 50 results because Google Scholar returns the most relevant results for the search term first. The following combinations of Medical Subject Headings (MeSH) terms and keywords were used for literature retrieval: (“Deep learning” OR “deep-learn” OR “deep-learning” OR “deep learn” OR “transfer learning” OR “convolutional neural network” OR “CNN”) AND (“Osteoporosis” OR “Bone Mineral Density” OR “BMD” OR “Bone Loss”). Methodological search filters were avoided. The detailed search strategy is shown in Supplementary File 1, Table S1. For reference management, all records were imported into EndNote 20, and any duplicates were removed.
Fig. 1.
Preferred reporting items for systematic reviews and meta-analyses (PRISMA) flow diagram
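The Boolean search string described in the search strategy can be assembled programmatically, which makes it easier to reuse across databases. The following is a minimal sketch in Python; the term lists are copied from the search string in the text, while the helper name `build_query` and the variable names are hypothetical and are not part of the original study.

```python
# Illustrative sketch only: the term lists are copied from the search string
# reported in the text; build_query is a hypothetical helper name.
DL_TERMS = [
    "Deep learning", "deep-learn", "deep-learning", "deep learn",
    "transfer learning", "convolutional neural network", "CNN",
]
DISEASE_TERMS = ["Osteoporosis", "Bone Mineral Density", "BMD", "Bone Loss"]


def build_query(*term_groups):
    """OR the quoted terms within each group, then AND the groups together."""
    blocks = []
    for group in term_groups:
        quoted = " OR ".join(f'"{term}"' for term in group)
        blocks.append(f"({quoted})")
    return " AND ".join(blocks)


query = build_query(DL_TERMS, DISEASE_TERMS)
print(query)
```

Database-specific syntax (for example, field tags) would still need to be added by hand for each source.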
Inclusion and exclusion criteria
To select eligible studies, the inclusion and exclusion criteria were determined based on the PICOS (Population, Intervention, Comparison, Outcome, Study design) framework (Table 1).
Table 1.
The inclusion and exclusion criteria based on the PICOS framework
| | Inclusion criteria | Exclusion criteria |
|---|---|---|
| Population (P) | Studies involving patients with suspected or diagnosed osteoporosis. | Studies with fewer than 10 participants. |
| Intervention (I) | Studies using deep learning models for the diagnosis of osteoporosis | Studies using non-deep learning methods (e.g., traditional machine learning, statistical methods) |
| Comparator (C) | Studies comparing deep learning models to conventional diagnostic methods, such as dual-energy X-ray absorptiometry (DXA), CT scans, or expert radiologist assessment | |
| Outcome (O) | • Studies reporting diagnostic accuracy measures, such as sensitivity, specificity, and area under the curve (AUC) • Studies that provide data enabling the calculation of these metrics | Studies not reporting on diagnostic accuracy or not providing sufficient data to calculate these metrics. |
| Study Design (S) | • Peer-reviewed original studies that employed DL algorithms for the prediction or diagnosis of osteoporosis • Published in the English language | Reviews, editorials, commentaries, letters to the editor, case series, case reports, conference abstracts, and preprint articles |
Study selection
The screening process was conducted independently and in duplicate by two reviewers (M.A. and F.A.). Each reviewer independently screened titles and abstracts to identify studies that met the inclusion criteria. Full-text articles were then independently reviewed to confirm eligibility. Any discrepancies between the two reviewers were discussed and resolved through consensus. In cases where consensus could not be reached, a third reviewer (M.H.) was consulted to make the final decision. We recorded the number of articles screened and the reasons for exclusion at each stage.
Data extraction
Two investigators (M.A. and F.A.) independently extracted data from the selected studies. The extracted data included the first author’s name, publication year, study design, country, sample size, participants’ gender, imaging modality, image location, reference test, DL model, and model performance metrics (such as AUC, sensitivity, specificity, and accuracy). The numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were also calculated for each study. If multiple DL models were developed in one study, data were gathered from each model. If more than one image group (anatomic location or image type) was used for model development, data were extracted from all of them. The extracted data were recorded in an Excel spreadsheet. Any discrepancies in data extraction were resolved by discussion, and a third reviewer (M.H.) was consulted when necessary. The data extraction form is shown in Supplemental File 1, Table S2.
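When a study reports only sensitivity and specificity together with group sizes, the 2 × 2 counts can be back-calculated. The text does not state how the counts were derived, so the following Python sketch shows one common approach under that assumption; the function name `reconstruct_2x2` and the example numbers are hypothetical and do not come from any included study.

```python
def reconstruct_2x2(sensitivity, specificity, n_diseased, n_healthy):
    """Back-calculate TP, FN, TN, and FP from reported rates and group sizes.

    Counts are rounded to the nearest integer, so the reconstructed table
    reproduces the reported rates only up to rounding error.
    """
    tp = round(sensitivity * n_diseased)   # diseased patients flagged positive
    fn = n_diseased - tp                   # diseased patients missed
    tn = round(specificity * n_healthy)    # healthy subjects flagged negative
    fp = n_healthy - tn                    # healthy subjects flagged positive
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}


# Hypothetical study: 120 patients with osteoporosis and 200 without.
table = reconstruct_2x2(sensitivity=0.85, specificity=0.90,
                        n_diseased=120, n_healthy=200)
print(table)
```

Because of the rounding step, it is worth recomputing sensitivity and specificity from the reconstructed counts and comparing them with the published values before pooling.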
Assessment of risk of bias
Two raters (M.A. and F.A.) independently assessed the potential risks of bias in the selected studies using the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool, which includes four key domains: (1) patient selection; (2) index test; (3) reference standard; and (4) flow and timing [21]. Any disagreements were resolved with the third author.
Statistical analysis
In the included studies, various metrics, such as sensitivity, specificity, and area under the curve (AUC), were used to assess the diagnostic performance of DL models. Summary estimates of sensitivity, specificity, and AUC were calculated using bivariate random-effects models for diagnostic meta-analysis [22]. Forest plots were used to display the pooled sensitivity and specificity with 95% confidence intervals (95% CI). To quantify how strongly a positive or negative test result shifts the probability of disease, the positive and negative likelihood ratios (LR+/LR−) were also computed. The Diagnostic Odds Ratio (DOR) was calculated with its 95% CI, and a forest plot was constructed. Based on the bivariate method, the summary receiver-operating characteristic (SROC) curve was plotted, and the AUC was computed. Heterogeneity between studies was investigated using the inconsistency index (I²). An I² > 50% was taken to indicate significant heterogeneity, in which case a random-effects model was applied. To explore the potential sources of heterogeneity, sensitivity analyses, meta-regression analysis, and subgroup analyses were performed. Subgroup analyses were done based on the type of image (lumbar spine / hip images), validation type (internal / external validation), and DL methods (transfer / non-transfer learning). In addition to examining heterogeneity, subgroup analysis was used to compare the diagnostic accuracy of DL models in each subgroup. Deeks’ funnel plot asymmetry test was used to evaluate publication bias. A P-value of less than 0.05 was considered statistically significant. All statistical analyses were conducted using the midas and metandi packages in STATA version 17 (StataCorp LP, College Station, TX).
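For reference, the per-study quantities named above follow the standard textbook definitions, written here in terms of the 2 × 2 counts. The symbols Q (Cochran's heterogeneity statistic) and k (the number of pooled estimates) are introduced for illustration and do not appear in the original text; the pooled values themselves come from the bivariate model rather than from these formulas applied to summed counts.

```latex
\[
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\[4pt]
LR^{+} &= \frac{\text{Sensitivity}}{1 - \text{Specificity}}, &
LR^{-} &= \frac{1 - \text{Sensitivity}}{\text{Specificity}}, \\[4pt]
DOR &= \frac{LR^{+}}{LR^{-}} = \frac{TP \cdot TN}{FP \cdot FN}, &
I^{2} &= \max\!\left(0,\ \frac{Q - (k - 1)}{Q}\right) \times 100\%.
\end{aligned}
\]
```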
Results
Study selection
Figure 1 illustrates our study selection process. A total of 181 articles were identified from electronic database searching. After reviewing the titles and abstracts of the articles and excluding 107 duplicates and irrelevant articles, 76 studies were obtained for further assessment of eligibility. After reviewing the full text of the articles, 10 articles were eventually selected.
Study characteristics
The selected articles were published between 2019 and 2023. The studies were conducted in Japan (n = 3), China (n = 2), South Korea (n = 2), Taiwan (n = 1), Poland (n = 1), and Turkey (n = 1). The total number of participants in the studies was 69,006, of whom 76.7% were female. Eight studies were performed on patients aged 50 years or older [23–30]. In nine studies, DXA was used as the reference standard for confirming osteoporosis [23–31], and in one study, a CT scan was used [32]. In the included studies, X-ray (n = 6) [23–25, 28, 30, 31], CT (n = 2) [29, 32], and both MRI and CT (n = 1) [27] images were used as the training and test datasets. In two studies, datasets were collected from multiple centers [24, 28]. In the included studies, DL prediction models were constructed from lumbar spine (n = 5) [23, 24, 27, 29, 32] and hip (n = 4) [25, 26, 30, 31] images. One of the studies used both hip and lumbar spine images [28]. Table 2 provides a summary of the information on the selected studies.
Table 2.
Summary of the data obtained from the included studies
| Study/ Year | Country | Study Design | Number of patients | Mean age | Center | Imaging modality | Image type | Reference test | DL Methods | External validation test |
|---|---|---|---|---|---|---|---|---|---|---|
| Dzierżak et al./ 2022 [32] | Poland | retrospective | 100 | NA | single-center | CT | lumbar spine | CT scan (with standard protocol) | VGG16, VGG19, MobileNetV2, ResNet50, InceptionResNetV2, Xception | NO |
| Hsieh et al./ 2021 [28] | Taiwan | Cohort | 36,279 | 67.2 ± 12.5; 72.8 ± 11.2 | multi center | X-Ray | lumbar spine - hip | DXA | VGG-16 | Yes |
| Yamamoto et al./2020 [26] | Japan | retrospective | 1131 | > 60 | single-center | X-Ray | hip | DXA | EfficientNet b3, GoogleNet, ResNet34, ResNet18, EfficientNet b4 | NO |
| Mao et al./2022 [24] | China | retrospective | 6908 | > 50 | multi center | X-Ray | lumbar spine | DXA | DenseNet | NO |
| Hong et al./2023 [23] | South Korea | cohort | 9276 | > 50 | single-center | X-Ray | lumbar spine | DXA | EfficientNet-B4 | Yes |
| Liu et al./2019 [31] | China | retrospective | 89 | NA | single-center | X-Ray | hip | DXA | U-NET | NO |
| Yamamoto et al./2021 [25] | Japan | retrospective | 1699 | > 60 | single-center | X-Ray | hip | DXA | ResNet50, ResNet18, ResNet34, ResNet101, ResNet152 | NO |
| Jang et al./2021 [30] | South Korea | retrospective | 1001 | 72.5 ± 12.5 | single-center | X-Ray | hip | DXA | VGG16 | Yes |
| Yasaka et al./2020 [29] | Japan | retrospective | 278 | > 60 | single-center | CT | lumbar spine | DXA | CNN | Yes |
| Küçükçiloğlu et al./2023 [27] | Turkey | retrospective | 220 | 65 ± 9.9; 68 ± 8.7 | single-center | MRI - CT | lumbar spine | DXA | CNN | NO |
Mean age: mean age of participants. Center: data gathering centers (Single center/Multi center). Image type: anatomical site of images (hip or lumbar spine images). DXA: dual-energy X-ray absorptiometry. DL Methods: Deep learning methods. NA: Not Available
All studies used CNN as the DL model. Three studies investigated multiple DL models [25, 26, 32]. Eight studies used various architectures of CNN, such as ResNet [32], VGG, and EfficientNet [23–26, 28, 30–32]. Two studies used k-fold cross-validation [25, 27], and the others applied simple random sampling to split the data into training, validation, and testing datasets. Four studies used an external dataset for validation [23, 28–30].
Risk of bias within studies
Using the QUADAS-2 method, we also assessed the studies’ quality and the possibility of bias. There were several studies with low risk of bias in the “patient selection” and “reference standard” categories; however, one study was ambiguous. Every study was determined to have a comparatively low risk of bias in the “index test” domain. Six studies were considered unclear in “flow and timing,” three studies had a low risk of bias, and one study had a high risk. Supplemental File 1, Fig.S1 and Table S3 provide detailed information on the results of the quality assessment.
Diagnostic accuracy of DL models
The DL models developed in the included studies had AUCs ranging from 70% [30, 31] to 98.8% [27]. The sensitivity and specificity of these studies ranged from 62% [23] to 98.4% [27] and 72.3% [31] to 99.2% [27], respectively. The detailed results of the developed DL models are provided in Supplemental File 1, Table 4. The pooled sensitivity and specificity were 0.86 (95% CI 0.82–0.89) and 0.89 (95% CI 0.85–0.91), respectively. Figures 2 and 3 show the forest plots for the sensitivities and specificities with a 95% CI. Sensitivity (I² = 97.9%, p < 0.001) and specificity (I² = 99.13%, p < 0.01) showed statistically significant heterogeneity. The bivariate approach’s pooled SROC curve produced an AUC of 0.94 (95% CI 0.91–0.95) (Fig. 4). The DOR for the DL models was 49.09 (95% CI 28.74–83.84). The detailed information on the meta-analysis of diagnostic accuracy is provided in Supplemental File 2, Fig. S1–S3.
Fig. 2.
Forest plot of the pooled sensitivity of deep learning in predicting osteoporosis
Fig. 3.
Forest plot of the pooled specificity of deep learning in predicting osteoporosis
Fig. 4.

Summarized receiver-operating curves (SROC) using the bivariate approach
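As a plausibility check, the reported likelihood ratios and DOR can be approximated from the pooled sensitivity (0.86) and specificity (0.89) alone. The Python sketch below does this arithmetic; the function name `likelihood_ratios` is hypothetical, and the result is only an approximation because the bivariate model estimates these quantities jointly rather than from rounded point estimates.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) computed from point estimates only."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    dor = lr_pos / lr_neg
    return lr_pos, lr_neg, dor


# Pooled point estimates reported in the text.
lr_pos, lr_neg, dor = likelihood_ratios(0.86, 0.89)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, DOR = {dor:.1f}")
```

The arithmetic gives values of roughly 7.8, 0.16, and 49.7, which are close to the model-based estimates of 7.9, 0.16, and 49.09 reported in this study.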
Meta-regression analysis and subgroup analysis
A meta-regression analysis was conducted to explore possible sources of between-study heterogeneity. As shown in Fig. 5, the DL methods and image type significantly accounted for the heterogeneity.
Fig. 5.

Meta-regression and subgroup analysis
Furthermore, subgroup analyses were performed to identify sources of heterogeneity based on the type of image, validation type, and DL methods (Table 3). Based on image type, the diagnostic accuracy of models developed from lumbar spine images (sensitivity, 0.9; specificity, 0.94; PLR, 14.9; NLR, 0.11; DOR, 136) was not significantly better than that of models developed from hip images (sensitivity, 0.83; specificity, 0.83; PLR, 5; NLR, 0.21; DOR, 24). Subgroup meta-analyses were also performed based on validation type (internal/external). There were no significant differences in diagnostic accuracy between the internal and external validation subgroups in sensitivity (0.87 vs. 0.78), specificity (0.89 vs. 0.89), PLR (7.8 vs. 6.8), NLR (0.15 vs. 0.25), DOR (54 vs. 28), or AUC (0.94 vs. 0.88). Regarding the DL methods, the non-transfer learning models (sensitivity, 0.97; specificity, 0.97; PLR, 35.5; NLR, 0.04; DOR, 1004) did not show significantly higher diagnostic accuracy than the transfer learning models (sensitivity, 0.83; specificity, 0.87; PLR, 6.2; NLR, 0.2; DOR, 31).
Table 3.
The results of subgroup analyses
| Subgroup | DL models, N (%) | Sen (95% CI) | Spe (95% CI) | PLR (95% CI) | NLR (95% CI) | DOR (95% CI) | AUC (95% CI) |
|---|---|---|---|---|---|---|---|
| Overall | 26 (100) | 0.86 (0.82–0.89) | 0.89 (0.85–0.92) | 7.9 (5.7–10.9) | 0.16 (0.12–0.21) | 49 (29–84) | 0.94 (0.91–0.95) |
| Type of Image | | | | | | | |
| Lumbar Spine | 13 (50) | 0.9 (0.83–0.94) | 0.94 (0.91–0.96) | 14.9 (9.1–24.5) | 0.11 (0.07–0.19) | 136 (54–342) | 0.97 (0.95–0.98) |
| Hip | 13 (50) | 0.83 (0.8–0.85) | 0.83 (0.8–0.86) | 5.0 (4.0–6.2) | 0.21 (0.18–0.24) | 24 (17–33) | 0.89 (0.86–0.92) |
| Validation Type | | | | | | | |
| Internal | 21 (81) | 0.87 (0.83–0.90) | 0.89 (0.85–0.92) | 7.8 (5.6–10.9) | 0.15 (0.11–0.19) | 54 (30–97) | 0.94 (0.92–0.96) |
| External | 5 (19) | 0.78 (0.69–0.85) | 0.89 (0.75–0.95) | 6.8 (2.9–16.3) | 0.25 (0.17–0.37) | 28 (9–87) | 0.88 (0.84–0.90) |
| DL Methods | | | | | | | |
| Transfer learning | 22 (85) | 0.83 (0.80–0.86) | 0.87 (0.83–0.89) | 6.2 (4.9–7.9) | 0.2 (0.17–0.23) | 31 (22–46) | 0.91 (0.88–0.93) |
| Non-transfer learning | 4 (15) | 0.97 (0.93–0.98) | 0.97 (0.94–0.99) | 35.5 (17.1–73.6) | 0.04 (0.02–0.07) | 1004 (359–2814) | 0.99 (0.98–1) |
Sen: sensitivity, Spe: specificity, PLR: positive likelihood ratios, NLR: negative likelihood ratios, DOR: diagnostic odds ratio, AUC: area under the curve, DL Methods: deep learning methods
Sensitivity analysis and publication bias
Sensitivity analyses were conducted to investigate the sources of heterogeneity. The bivariate box plot (Supplementary file 2, Fig. S4) showed that six models were outliers. We excluded these models and repeated the meta-analysis. Excluding the outliers had no significant effect on the results of this meta-analysis (Supplementary file 2, Table S1). Deeks’ funnel plot asymmetry test was conducted to evaluate potential publication bias (Fig. 6). No significant publication bias was detected among the studies (P = 0.4).
Fig. 6.
Deeks’ funnel plot asymmetry test for publication bias
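Deeks’ test is usually described as a regression of the log diagnostic odds ratio on the inverse square root of the effective sample size (ESS), weighted by ESS, where a slope far from zero suggests small-study effects. The Python sketch below illustrates only the point estimates of that regression. It is not the midas implementation used in this study, the function names and input values are hypothetical, and the P-value (which requires a t-distribution) is deliberately left out.

```python
import math


def effective_sample_size(n_diseased, n_healthy):
    """ESS = 4 * n1 * n2 / (n1 + n2), as used in Deeks' funnel plot."""
    return 4.0 * n_diseased * n_healthy / (n_diseased + n_healthy)


def deeks_regression(ln_dors, ess_values):
    """Weighted least squares of ln(DOR) on 1/sqrt(ESS), with weights = ESS.

    Returns (intercept, slope). Only point estimates are computed here.
    """
    xs = [1.0 / math.sqrt(e) for e in ess_values]
    w_sum = sum(ess_values)
    # Weighted means of the predictor and the outcome.
    x_bar = sum(w * x for w, x in zip(ess_values, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ess_values, ln_dors)) / w_sum
    # Weighted covariance and variance terms.
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(ess_values, xs, ln_dors))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ess_values, xs))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical inputs: three studies whose ln(DOR) falls as ESS grows.
example_ess = [100.0, 400.0, 1600.0]
example_ln_dor = [3.0, 2.5, 2.25]
intercept, slope = deeks_regression(example_ln_dor, example_ess)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

In this made-up example the smaller studies show larger odds ratios, so the slope is positive; in the present meta-analysis the reported test was non-significant (P = 0.4).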
Discussion
This systematic review and meta-analysis investigated the diagnostic accuracy of DL models for predicting osteoporosis. The results suggest that DL models have the potential to be a valuable tool. DL models are a non-invasive technique that can help radiologists and physicians by aiding in the early diagnosis and prediction of osteoporosis. This is important since early diagnosis improves the prognosis and allows for more effective treatment and a higher survival rate. In addition, osteoporosis screening can be enhanced by DL algorithms without interfering with clinical workflow, and significant osteoporotic fractures can be treated with sufficient discriminative performance [33, 34]. Yen et al. recently conducted a meta-analysis on the performance of DL models in diagnosing osteoporosis [35]. That study reported high diagnostic accuracy. However, it had limitations, including the small number of reviewed studies and the absence of some analyses, including meta-regression, subgroup analysis, sensitivity analysis, and evaluation of publication bias. In our study, these limitations were addressed.
In the reviewed studies, DL models have been applied to the prediction of osteoporosis based on hip and lumbar images of patients. An appropriate DL model is characterized by high values of accuracy metrics such as AUC, sensitivity, and specificity, meaning that it can accurately distinguish patients from healthy subjects. The resulting pooled AUC, sensitivity, and specificity were 0.94 (95% CI 0.91–0.95), 0.86 (95% CI 0.82–0.89, I² = 97.9%), and 0.89 (95% CI 0.85–0.91, I² = 99.13%), respectively. Yen et al. obtained similar results in their study, with an AUROC of 0.88, a sensitivity of 0.81, and a specificity of 0.87 [35]. In addition, our findings have shown that DL models were better than other ML methods in predicting osteoporosis, as reported in a similar study by Rahim et al. [36]. However, it is important to note that this is a relatively new area of research, and more studies are required to validate the generalizability of these findings and to optimize the performance of DL models for clinical use.
A pooled DOR of 49.09 (95% CI 28.74–83.84, I² = 100%) was found in this meta-analysis, which suggests that DL is generally effective for diagnosing osteoporosis and compares favorably with other ML techniques. Likelihood ratios are important metrics that indicate how much a test result changes the odds of disease and can aid clinical judgment [37]. In this study, a pooled LR+ of 7.9 (95% CI 5.7–10.9) was found. This indicates that a positive DL prediction is 7.9 times more likely in a person with osteoporosis than in a person without it, so a positive result substantially raises the probability of disease. Likewise, a pooled LR− of 0.16 (95% CI 0.12–0.21) was obtained. This means that a negative prediction is 0.16 times as likely in a person with osteoporosis as in a person without it, so the DL models are also good at ruling out osteoporosis. Rahim et al., in their systematic review, reported an LR+ and LR− of 3.7 and 0.22, respectively, for ML models in the prediction of osteoporosis [36]. Compared with these values, our results suggest that DL performs better than other ML algorithms.
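Likelihood ratios act on pre-test odds rather than directly on probabilities, which is why they are useful at the bedside. The Python sketch below converts a pre-test probability into a post-test probability using the pooled LR+ (7.9) and LR− (0.16) reported in this study. The 20% pre-test probability and the function name `post_test_probability` are assumptions chosen purely for illustration.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Apply Bayes' rule in odds form: post-test odds = pre-test odds * LR."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


PRE_TEST = 0.20  # hypothetical prevalence in a screened population

after_positive = post_test_probability(PRE_TEST, 7.9)   # pooled LR+
after_negative = post_test_probability(PRE_TEST, 0.16)  # pooled LR-
print(f"After a positive result: {after_positive:.1%}")
print(f"After a negative result: {after_negative:.1%}")
```

Under this assumed prevalence, a positive prediction raises the probability of osteoporosis to about 66%, and a negative prediction lowers it to about 4%. Different pre-test probabilities would give different post-test values.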
In our meta-analysis, we observed substantial heterogeneity across the included studies, as indicated by the high I² values. To explore the sources of heterogeneity, we conducted subgroup analyses and meta-regressions to assess the impact of these factors on the outcomes. Our meta-regression analysis identified that both the image type and the specific deep learning model used were significant contributors to heterogeneity. This suggests that the variability in image types and differences in DL architecture may be influencing the predictive performance outcomes across studies.
In this study, a subgroup analysis based on the image types, including hip and lumbar spine images, was conducted to evaluate this factor’s impact on the diagnostic accuracy outcomes. The results of the analysis showed that the diagnostic accuracy of models developed from lumbar spine images was slightly better than that of models developed from hip images. However, various factors, such as the quality of the images and the method of model training, can affect these results.
Most of the reviewed studies used transfer learning models. VGG and ResNet architectures were used more than others and had high performance. Therefore, it seems that these types of architectures may work better in diagnosing osteoporosis. It should be noted, however, that various factors, such as the size of the dataset, the quality of the images, and the model training parameters, can have an impact on the performance of the models.
External validation of a DL model is essential to establishing its reliability before use in clinical practice [38]. The included studies often relied on results from internal validation rather than conducting external validation of the model. This could cause the performance of the model to be overestimated. Therefore, only after external validation on independent test sets can a true representation of these algorithms’ performance be achieved.
One of the strengths of this study is that it is the first comprehensive assessment of the diagnostic accuracy of DL models for predicting osteoporosis. As a result, this study’s findings can be used as a reference. While heterogeneity was a significant challenge in our meta-analysis, we took several steps to understand and address its sources. These efforts enhance the robustness of our results, but it is important for future research to standardize methodologies and reporting practices to reduce heterogeneity in this field. However, a limitation of the study is that the types of datasets (hip and lumbar spine images) in the included studies varied, which may have introduced some heterogeneity into the results. In addition, other factors such as various DL models, different image modality, type of validation (internal and external), and data gathering center (single or multi-center) can cause heterogeneity in this meta-analysis. In our analysis, we took steps to assess and address the potential impact of publication bias. The absence of significant publication bias, as indicated by Deeks’ funnel plot asymmetry test, strengthens the confidence in our findings. Nevertheless, we recommend caution in interpreting these results, as some bias may remain undetected.
Despite the high performance of DL in the prediction of osteoporosis, further research is needed to validate the findings of this study in prospective clinical trials. Additionally, research is needed to develop and optimize DL models that can be integrated into clinical practice for the prediction of osteoporosis. Finally, it is important to develop strategies for addressing the ethical and societal implications of using DL for predicting osteoporosis.
Conclusions
The results of this systematic review and meta-analysis showed that the pooled sensitivity, specificity, and AUC of DL models for the diagnosis of osteoporosis were 86%, 89%, and 94%, respectively. In conclusion, DL has an acceptable performance for the diagnosis of osteoporosis, even better than other ML algorithms. Transfer learning models, especially VGG and ResNet, were used more than others and had high performance. Therefore, it seems that these types of architectures may work better in diagnosing osteoporosis. Our findings indicate that DL models can achieve high sensitivity and specificity, which are crucial for the early detection of osteoporosis. Therefore, we suggest that DL-based decision support systems can be used as promising non-invasive predictive tools in the early diagnosis of osteoporosis. However, our study’s results should be viewed as part of an evolving field, where continued research is needed. More large-scale and multi-center clinical studies are still required to refine these algorithms further and validate their performance across at-risk populations and real-world settings.
Acknowledgements
Not applicable.
Abbreviations
- BMD: bone mineral density
- DEXA: dual-energy X-ray absorptiometry
- ML: machine learning
- DL: deep learning
- CNN: convolutional neural network
- PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
- TP: true positives
- FP: false positives
- TN: true negatives
- FN: false negatives
- AUC: area under the curve
- DOR: Diagnostic Odds Ratio
- SROC: summary receiver-operating characteristic curve
Author contributions
M.A. and F.A. wrote the main manuscript text and performed the formal analysis. M.H. and P.A. prepared the figures and edited the manuscript. All authors reviewed the manuscript.
Funding
No funding was received to assist with the preparation of this manuscript.
Data availability
All data generated or analysed during this study are included in this published article and its supplementary information files.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.