Abstract
Fractures are among the most common injuries in children, yet their radiographic detection is challenging due to the unique anatomy of the developing skeleton, leading to significant diagnostic errors. To address this, a systematic review and meta-analysis was conducted to evaluate how accurately and efficiently artificial intelligence (AI) detects fractures in children and adolescents. Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, a systematic search of PubMed, EMBASE, and Web of Science identified 11 studies published between 2019 and 2024 evaluating AI for detecting appendicular skeletal fractures in patients under 21 years. A meta-analysis revealed that standalone AI demonstrated a statistically significantly higher sensitivity compared to human interpretation (mean difference: 0.04, 95% CI [0.02, 0.07], p = 0.0005) with non-inferior specificity. Furthermore, AI-assisted diagnosis led to a statistically significant improvement in clinician sensitivity (mean difference: 0.07, p = 0.003). In summary, AI exhibits high diagnostic performance for paediatric fractures and serves as a promising adjunct tool to enhance clinical efficiency and accuracy; however, further large-scale, multi-centre prospective trials are required to validate its real-world applicability and address current limitations before widespread adoption.
Keywords: artificial intelligence, diagnostic accuracy, fracture, fracture detection, imaging, meta-analysis, paediatric, systematic review
Introduction and background
Fractures represent one of the most common injuries in children presenting to emergency departments, often posing diagnostic challenges due to subtle radiographic findings and the variability of paediatric skeletal anatomy. Artificial intelligence (AI) has demonstrated the potential to improve diagnostic accuracy and efficiency in detecting fractures. This systematic review evaluates the performance of AI algorithms in paediatric fracture detection.
AI is the application of algorithms that provide machines with the ability to solve problems that traditionally require human intelligence. Now a fundamental part of everyday life, AI comprises several subfields, including machine learning, deep learning, and generative AI. Machine learning is a subset of AI that enables machines to learn patterns from data and make decisions without being explicitly programmed. Deep learning, a further subset of machine learning, uses neural networks with multiple layers to analyse complex patterns in data, making it particularly powerful for image and speech recognition. Generative AI, another subfield, focuses on creating new content based on learned patterns, such as generating realistic images, text, and even synthetic medical data to augment AI model training.
Fractures are one of the most common reasons children present to the emergency department, with an estimated 50% of all children sustaining a fracture during childhood. These injuries can be challenging to diagnose due to subtle radiographic findings and variations in skeletal anatomy during growth. In paediatrics, missed fractures are a significant cause of delayed treatment and may lead to long-term disability. This is particularly important given medicolegal considerations, as studies have found that surgical specialities generate the highest number of malpractice claims, with orthopaedic surgery ranking first. Studies indicate that emergency physicians may miss up to 11% of paediatric fractures [1], and missed fractures have been identified as the most common cause of misdiagnosis, accounting for up to 44% of errors [2].
Artificial intelligence in orthopaedics and fracture detection
Advances in computational power, coupled with the increasing availability of large-scale medical datasets, have facilitated the rapid expansion of AI into healthcare. Within the field of orthopaedics, AI has seen widespread application in diagnostic imaging analysis, preoperative risk stratification, and clinical decision support, with the goals of enhancing patient outcomes and optimising clinical workflows. One of the key areas where AI has shown significant promise is in fracture detection. The accurate identification of fractures in children is crucial, as misinterpretations can lead to long-term complications and functional impairment. Paediatric fractures, particularly those involving the appendicular skeleton, can be subtle and challenging to diagnose, often requiring an experienced radiologist for accurate interpretation. However, given the global shortage of paediatric radiology specialists and the increasing demand for rapid diagnostic tools, AI-based systems have emerged as potential adjuncts to traditional radiographic interpretation. The application of AI in fracture detection typically involves convolutional neural networks (CNNs) trained on extensive datasets of radiographic images.
Evidence indicates that these AI systems can achieve diagnostic accuracy on par with, or superior to, human radiologists. Moreover, the integration of AI as a decision support tool has been shown to improve the diagnostic performance of junior clinicians and decrease the time required for interpretation. The National Institute for Health and Care Excellence (NICE) has advocated for the integration of AI into clinical practice to improve fracture detection accuracy [3]. Beyond fracture detection, AI has been implemented in various other aspects of orthopaedic care. It is used in automated image analysis for conditions such as osteoarthritis, scoliosis, and osteoporosis, enabling earlier, more precise diagnoses. AI has also been applied to preoperative planning, where machine learning models can predict the optimal implant size and alignment for joint replacement surgeries [4].
Robotic-assisted surgery, guided by AI algorithms, has been shown to improve the precision of procedures like total knee and hip arthroplasty [5]. AI is also utilised in predictive analytics to identify patients at higher risk of postoperative complications such as infections, thromboembolism, and implant failure. Additionally, natural language processing (NLP) is used to extract clinical insights from unstructured electronic health records, streamlining patient management. In orthopaedic trauma care, AI-based decision support systems help triage patients and optimise resource allocation [6].
Despite its potential, the integration of AI into clinical practice presents several challenges. Variability in model performance across different populations, inconsistencies in ground truth labelling, and the lack of standardised validation methodologies remain key barriers to widespread adoption. Furthermore, ethical and medicolegal considerations, such as the transparency of AI decision-making and clinician accountability, must be addressed to ensure safe and effective implementation. This systematic review aims to evaluate the current literature on the use of AI in detecting paediatric fractures by summarising its diagnostic performance, advantages, limitations, and potential role in clinical workflows. By critically analysing existing studies, this review will provide insights into the effectiveness of AI-based fracture detection systems and highlight areas for future research.
Review
Methods
This systematic review was registered with the PROSPERO International Prospective Register of Systematic Reviews (CRD42024619744) and follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
Literature Review
A comprehensive search was performed using the PubMed, EMBASE, and Web of Science databases for studies published in English between January 2019 and December 2024. The search used database-specific Boolean strategies with terms and word variations of ‘fracture’, ‘paediatrics’, and ‘Artificial Intelligence’. The eligibility criteria included studies focusing on fracture detection in the appendicular skeleton using plain radiographs in patients aged 0-21 years. Randomised controlled trials, prospective and retrospective cohort studies, and diagnostic accuracy studies that evaluated an AI algorithm and reported quantitative outcomes were included.
The exclusion criteria comprised studies not written in English; systematic reviews, case reports, editorials, opinion articles, and pictorial reviews; and multimedia files such as online videos and podcasts. Studies were also excluded if the reference standard was unclear, if no distinct paediatric subgroup was identified, if they focused on adults or conditions unrelated to fractures, or if they were qualitative-only studies. Studies employing other imaging modalities, such as CT or ultrasound, or those that were not fully accessible, were also excluded. Two reviewers independently screened all titles and abstracts. A full-text review was then performed for the remaining 22 studies, of which 11 were subsequently excluded (Figure 1). A third reviewer was consulted to resolve any conflicts.
Figure 1. PRISMA chart depicting the selection of studies.
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
Data from each study - including the AI system used, number of patients and radiographs, participant age, inclusion and exclusion criteria, reference standard, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) - were entered into a spreadsheet (Microsoft Excel) by a single reviewer. When required, the standard deviation (SD) was calculated from the 95% confidence interval (CI): the standard error was first derived as SE = (upper limit − lower limit) / 3.92, and then converted using SD = SE × √N.
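The CI-to-SD conversion described above can be sketched as a small helper function; the example values below are hypothetical and are not taken from any of the included studies.

```python
import math

def sd_from_ci(lower, upper, n):
    """Recover the standard deviation from a reported 95% confidence
    interval: SE = (upper - lower) / 3.92, then SD = SE * sqrt(N).
    The divisor 3.92 is 2 * 1.96, the full width of a normal 95% CI
    in standard-error units."""
    se = (upper - lower) / 3.92
    return se * math.sqrt(n)

# Hypothetical example: sensitivity 0.92 with 95% CI [0.90, 0.95]
# estimated from 300 radiographs
print(round(sd_from_ci(0.90, 0.95, 300), 4))  # ≈ 0.2209
```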
Risk of Bias
A modified QUADAS-2 tool, combining the QUADAS-2 framework with the CLAIM checklist, was used to assess risk of bias. This approach ensures a standardised evaluation of diagnostic accuracy studies while addressing AI-specific considerations. The assessment focused on four domains:
Patient selection: Risk of bias was assessed by evaluating the representativeness and balance of the patient population, data sources, and inclusion/exclusion criteria. Applicability considered alignment with real-world paediatric fracture demographics and the AI’s intended usage.
Index test (AI algorithm): Risk of bias included evaluation of reported performance metrics, uncertainty measures, and algorithm transparency. Applicability assessed external validation and the AI’s generalisability across diverse settings and populations.
Reference standard: Risk of bias examined the reliability of the “ground truth,” including blinding and clinician expertise. Applicability ensured the reference standard aligned with accepted clinical practices.
Flow and timing: Bias was assessed based on consistent application of the index test and reference standard, avoiding temporal discrepancies.
Meta-analysis was performed using Review Manager (RevMan) software [7]. A random-effects model was employed using the inverse variance method with mean difference as the outcome measure, assuming high heterogeneity due to variability in study designs, populations, AI models, and reference standards across the included studies. Forest plots were generated to visualise the data, and funnel plots were used as a visual aid to assess for publication bias.
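The random-effects inverse-variance pooling performed in RevMan can be illustrated with a minimal DerSimonian-Laird sketch. The per-study mean differences and standard errors below are illustrative placeholders, not the review's actual data.

```python
import math

def dersimonian_laird(effects, ses):
    """Pool per-study mean differences with a DerSimonian-Laird
    random-effects model using inverse-variance weights. Returns the
    pooled mean difference, its 95% CI, and the I^2 statistic."""
    k = len(effects)
    w = [1 / se**2 for se in ses]                       # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed)**2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    w_re = [1 / (se**2 + tau2) for se in ses]           # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, ci, i2

# Illustrative sensitivity differences (AI minus human) and standard errors
md, ci, i2 = dersimonian_laird([0.05, 0.03, 0.06, 0.02, 0.04],
                               [0.02, 0.015, 0.025, 0.02, 0.018])
print(f"MD = {md:.3f}, 95% CI [{ci[0]:.3f}, {ci[1]:.3f}], I2 = {i2:.0f}%")
```

When between-study heterogeneity is absent (Q ≤ df), τ² is truncated to zero and the estimate reduces to the fixed-effect inverse-variance result.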
Results
The systematic search yielded 473 studies, from which 113 duplicates were removed. After title and abstract screening, 338 studies were excluded, leaving 22 full-text articles for evaluation. Following this review, 11 studies that met the inclusion criteria were analysed (Table 1).
Table 1. Summary of the included studies.
AI: artificial intelligence; CNN: convolutional neural network; R-CNN: region-based convolutional neural network
| Study | AI system | Body part | No. of radiographs | No. of participants | Age, years | Inclusion criteria | Exclusion criteria | Ground truth | Ground truth blinded? | Single centre? |
| Choi et al., 2020 [8] | Developed a dual-input CNN | Elbow | 95 | 48 | Range: 0-19; no median or mean provided, but % in each age group reported | Paediatric ED patients with suspected traumatic supracondylar fracture, 01/2013-12/2018 | Not the initial radiograph; patients with non-supracondylar fracture, dislocation, or underlying bone dysplasia | All radiographs were re-reviewed by two paediatric radiologists | Yes | Two centres, same city |
| Dupuis et al., 2022 [9] | Rayvolve | All (except radiographs of the skull, ribs, and spine) | 2,634 radiographic sets (5,865 images) | 2,549 | Mean age 8.5 ± 4.5 | Radiography sets from 03/2019 to 03/2020 from all consecutive trauma paediatric ED patients | Febrile lameness, whitlow or other infectious context, foreign body detection, or some other indisputably non-traumatic context | Senior radiologist from a panel of 11, including an author | Unclear | Yes |
| Hayashi et al., 2022 [10] | BoneView | Hand/wrist, elbow/upper arm, shoulder/clavicle, foot/ankle, leg/knee | 300 (60 per body part) | 300 | 2-21, 10.8 ± 4.9 | Post-traumatic X-rays from a US-based data provider | Pelvis, skull, spine, or rib cage radiographs; non-acute injury; poor image quality; or quota reached | Two board-certified MSK radiologists marked a bounding box; cases of disagreement were resolved by a third radiologist | Yes | US data provider |
| Nguyen et al., 2022 [11] | BoneView | Hand/wrist, elbow/upper arm, shoulder/clavicle, foot/ankle, leg/knee | 300 | 300 | 10.8 ± 4.9 | Anonymised radiographs of paediatric patients aged 2-21 years | Body part not intended for use with BoneView (pelvis, skull, spine, rib cage); poor-quality examinations or poor image quality | Two MSK radiologists; discrepancies were resolved by consensus with a third radiologist | Yes | US data provider |
| Zech et al., 2022 [12] | Faster R-CNN model | Wrist | 125 (test) | 395 | Mean = 10.1, range: 0.8-17.8 | 0-18-year-old patients with wrist radiographs | Follow-up imaging, cast or splint | The report by the paediatric fellowship-trained attending radiologist at the time of clinical interpretation was used to establish the ground truth | No | Yes |
| Zech et al., 2023 [13] | BoneView | Upper extremity radiographs (finger/hand, wrist/forearm, elbow, humerus, shoulder/clavicle) | 819 (internal test) | 819 | Internal test: 10.24 ± 5.86 | All upper extremity radiographic examinations performed on patients aged 0-21 years between 1/1/2015 and 12/31/2021 (n = 44,729 examinations) from the initial encounter | Follow-up or bone age examination, ACJ and scapula, missing location, prior surgery, cast/splint, subacute injury | Silver standard: NLP algorithm applied to the report. Gold standard: manual review by a resident based on the NLP algorithm, with a bounding box over the fracture or elbow effusion. For test data, any discrepancy between the original report and the manual review was decided by the senior author (MSK radiologist with 13 years of experience) | Yes | Yes |
| Altmann-Schneider et al., 2023 [14] | BoneView | Lower limb, elbow, and forearm | 2,100 lower limbs, 2,051 forearms, 1,104 elbows | 1,000 per body part | Forearm 7.8 ± 3.9; elbow 7.7 ± 3.7; lower limb 4.9 ± 4.0 | Radiographs of the forearm, lower leg, or elbow in at least two projections performed on the day of attendance at the emergency department | Radiographs of patients with fractures highly specific for child abuse (e.g., metaphyseal corner fractures) were excluded, as the AI software is not trained to detect them | All radiographs had an original report from the attending paediatric radiology staff member with varying degrees of experience; each radiograph received a second reading by a certified paediatric radiologist | No | Yes |
| Gasmi et al., 2023 [15] | Rayvolve | All | 878 | 878 | <18; mean age: female 8.4, male 8.3 | <18 years, recent non-life-threatening injury, and at least one appendicular radiograph | Radiographs not available for review; not reported | Two paediatric radiologists | Yes | Yes |
| Dupuis et al., 2024 [16] | SmartUrgences | Elbow | 741 radiographic sets, 1,601 images | 695 | 7.27 ± 3.97 | This single-centre retrospective study was conducted on elbow radiography sets collected from January 2018 to December 2021 from all consecutive patients younger than 18 years who were referred by the paediatric emergency room in a trauma context | Wrong X-ray location, osteitis search, infectious context, foreign body search, or no Milvue interpretation | Read by a junior and then a senior radiologist during routine care | No | Yes |
| Kavak et al., 2024 [17] | CNN model You Only Look Once (YOLO) v8 | All | 7,150 | Not reported | Mean age 8.3 years | Availability of a radiograph of an appendicular part taken after a recent trauma | Radiographs featuring implants, casts, or any other pathological lesions in the bones, as well as patients presenting fractures highly specific to child abuse (e.g., metaphyseal corner fractures), were excluded | Consensus bounding box between three radiologists | Unclear | Yes |
| Zech et al., 2024 [18] | Childfx | Upper extremity | 1,693 | 240 | Mean = 11.3 years, range: 0–22 | Same as Zech et al. 2023 | Same as Zech et al. 2023 | MSK radiologist + paediatric radiologist; disagreements were resolved by a second musculoskeletal radiologist | Yes | Yes |
This review focused on paediatric fractures of the appendicular skeleton. Of the 11 included studies, eight were single-centre. The exceptions were a two-centre study by Choi et al., which was limited to a single city, and studies by Nguyen et al. and Hayashi et al., which used data from a US provider. Several studies excluded radiographs of limbs in a plaster or cast and metaphyseal corner fractures. All included studies used retrospective data, and six directly compared AI performance with human readers. The AI systems evaluated included BoneView (four studies), Rayvolve (two studies), SmartUrgences, YOLO v8, Childfx, a custom CNN, and a Faster R-CNN model. In four studies, the ground truth was not blinded to clinical details, introducing a potential source of bias. Risk of bias analysis (Figure 2) identified a high risk of bias for the reference standard in Dupuis et al. (2022) due to the use of a single radiologist's report as the gold standard. The Kavak et al. (2024) study was found to have a high risk of bias in the index test, as the AI algorithm was unable to investigate lateral radiographs, while the Nguyen et al. (2022) study was considered high risk for flow and timing since radiographs were read with AI assistance immediately after being read without it.
Figure 2. Risk of bias analysis.
AI Diagnostic Accuracy
The overall sensitivity of AI in detecting fractures ranged from 88% to 96% across the included studies (Table 2). This performance varied based on the AI model used, with BoneView studies reporting sensitivity between 80.5% and 92.9%, Rayvolve studies demonstrating sensitivity of approximately 95%, and other models reporting sensitivity ranging from 88% to 96%. Six studies directly compared AI performance with human readers. For example, Nguyen et al. (2022) reported that AI achieved an AUC of 0.932, outperforming all human readers with a stand-alone sensitivity and specificity of 91% and 90%, respectively. Zech et al. (2022) found that AI demonstrated an accuracy of 88%, significantly higher than the residents' accuracy of 80%. Zech et al. (2023) showed that AI sensitivity reached 90.8% in overnight preliminary interpretations. Zech et al. (2024) noted improved AUC scores for both radiology and paediatric residents when assisted by AI, with AI-assisted interpretation reducing radiograph interpretation times from 52.1 seconds to 38.9 seconds (p = 0.030).
Table 2. Literature review results.
AI: artificial intelligence; NPV: negative predictive value; NR: not reported; PPV: positive predictive value
| Study | Diagnosis | No. of patients | Sensitivity | Specificity | PPV | NPV | Accuracy |
| Altmann-Schneider et al., 2023 [14] | Lower limb | 1,000 | 90.6 (88.0–92.8) | 97.1 (96.1–97.9) | 92.6 | 96.2 | NR |
| Forearm | 1,000 | 96.0 (94.7–97.1) | 92.9 (91.0–94.4) | 94.4 | 94.9 | NR | |
| Elbow | 1,000 | 80.5 (88.7–93.8) | 63.7 (59.7–67.6) | 69 | 89.5 | NR | |
| Dupuis et al., 2024 [16] | Fracture detection | 695 | 92.9 (89.0−95.5) | 76.8 (72.6-80.5) | 70.8 (65.9−75) | 94.7 (91.7−96) | 82.9 (79.9-85.5) |
| Dupuis et al., 2022 [9] | AI detection | 2,634 | 95.7 (94.0-96.9) | 91.2 (89.8-92.5) | NR | NR | 92.6 (91.5-93.6) |
| Nguyen et al., 2022 [11] | AI detection | 300 | 91 | 90 | NR | NR | 93.2 |
| Human detection | 300 | 73.17 (65.33-80.07) | 89.58 (83.55- 93.97) | NR | NR | NR | |
| AI-assisted | 300 | 82.67 (75.65-88.36) | 90.33 (84.43-94.55) | NR | NR | NR | |
| Kavak et al., 2024 [17] | AI detection | 5,150 | 95.8 (95.5-96.6) | 72.3 (71.13-73.57) | 91.3 | 84.96 | 90 |
| Zech et al., 2024 [18] | AI detection | 240 | 90 (83.2-94.7) | 88.3 (81.2-93.5) | NR | NR | 89.2 (85.2-93.1) |
| Human detection (resident) | 240 | 78.1 (72.0-84.1) | 75.6 (71.0-80.1) | NR | NR | 76.8 (73.0-80.6) | |
| Human detection (attending) | 240 | 84.2 (78.6-89.7) | 89.2 (85.0-93.4) | NR | NR | 86.7 (83.2-90.2) | |
| AI-assisted (resident) | 240 | 87.6 (84.5-90.8) | 86.9 (82.7-91.2) | NR | NR | 87.6 (84.5-90.8) | |
| AI-assisted (attending) | 240 | 85.8 (80.2-91.4) | 92.1 (88.2-95.9) | NR | NR | 89.0 (85.6-92.4) | |
| Zech et al., 2023 [13] | AI detection | 819 | 92.2 (89.6-94.8) | 86.6 (83.3-89.9) | 89.4 (87.3-91.5) | NR | NR |
| Human detection | 819 | 87.0 (83.8-90.3) | 83.2 (79.5-86.8) | 85.1 (82.7-87.5) | NR | NR | |
| Zech et al., 2022 [12] | AI detection | 125 | 88 (78-94) | 89 (76-96) | NR | NR | 88 (81-93) |
| Human detection (residents) | 500 | 78 (73-82) | 85 (79-90) | NR | NR | 80 (77-84) | |
| AI-assisted (residents) | 500 | 91 (87-94) | 96 (91-98) | NR | NR | 93 (90-95) | |
| Hayashi et al., 2022 [10] | AI detection | 300 | 91.3 (85.6-95.3) | 90.0 (84.0-94.3) | NR | NR | 93 |
| Gasmi et al., 2023 [15] | AI detection | 878 | 95.7 (93-99) | 91.6 (89-94) | 74.7 (69-80) | 98.8 (98-99) | 79 (74-83) |
| Human detection | Paediatric radiologists | 98.4 (97-100) | 99.7 (99-100) | 98.9 (97-100) | 99.0 (99-100) | 98 (97-99) | |
| Emergency physicians | 81.9 (76-88) | 95.0 (93-97) | 81.0 (75-87) | 95.0 (94-97) | NR | |
| Senior residents | 95.1 (92-98) | 98.0 (96-99) | 92.5 (89-96) | 98.7 (98-100) | NR | |
| Junior residents | 90.1 (86-94) | 96.6 (95-98) | 87.2 | 97.4 (96-99) | NR | |
| Choi et al., 2020 [8] | AI detection | 258 | 93.9 (90.0-93.0) | 92.2 (87.4-95.6) | 80.5 (71.7-87.1) | 97.8 (94.5-99.1) | 99.2 (94.7-100.0) |
| Radiologist 1 | 95 | 95.7 (78.1-99.9) | 97.2 (90.3-99.7) | 91.7 (73.7-97.7) | 98.6 (91.1-99.8) | 97.7 (92.4-99.7) | |
| Radiologist 2 | 95 | 95.7 (78.1-99.9) | 97.2 (90.3-99.7) | 91.7 (73.7-97.7) | 98.6 (91.1-99.8) | 99.7 (95.6-100.0) | |
| Radiologist 3 | 95 | 95.7 (78.1-99.9) | 100.0 (95.0-100.0) | 100 | 98.6 (91.4-99.8) | 97.8 (92.4-99.7) | |
| Radiologist 1 w/AI | 95 | 100.0 (85.2-100.0) | 97.2 (90.3-99.7) | 92.0 (74.6-97.8) | 100 | 99.3 (94.9-100.0) |
To synthesise these findings, a meta-analysis was performed. The pooled analysis (Figure 3) of five studies assessing sensitivity revealed a statistically significant improvement in favour of AI, with a mean difference of 0.04 (95% CI: 0.02, 0.07, Z = 3.49, p = 0.0005). This analysis showed moderate heterogeneity between studies (I² = 46%, p = 0.11).
Figure 3. Forest plot for sensitivity - AI vs. human reader.
AI: artificial intelligence; CI: confidence interval; SD: standard deviation
Specificity ranged widely from 63.7% to 92.2%, with significant variation between studies. BoneView studies reported a specificity between 63.7% and 92.9%. Rayvolve studies reported specificity above 90%. SmartUrgences demonstrated significantly lower specificity for plastered patients at 54.5% compared to 95.5% for uncasted patients. The pooled analysis for specificity (Figure 4), which also included five studies, showed no significant difference between AI and human interpretation, with a mean difference of 0.00 (95% CI: -0.05, 0.05, Z = 0.02, p = 0.99). The specificity analysis exhibited substantial heterogeneity (I² = 81%, p = 0.001).
Figure 4. Forest plot for specificity - AI detection vs. human reader.
AI: artificial intelligence; CI: confidence interval; SD: standard deviation
Anatomic Location and Fracture Type
AI performance varied by anatomical location and fracture type. The study by Altmann-Schneider et al. (2023) divided fractures into lower leg, forearm, and elbow categories. The highest sensitivity and specificity were seen in forearm fractures at 92.9% and 98.1%, respectively. The lowest sensitivity, 87.5%, and the lowest specificity, 80.5%, were observed for lower leg fractures. Fractures such as complete diaphyseal and metaphyseal radius, ulna, and tibia fractures had detection rates between 99% and 100%. More inconspicuous fractures, such as Salter-Harris II fractures of the proximal tibia, showed a detection rate of 60%, bowing fractures of the radius had a detection rate of 18%, and avulsion fractures of the ulnar epicondyle showed a detection rate of 25%.
The study by Zech et al. demonstrated a greater improvement in detecting non-obvious fractures, particularly where fractures were not displaced or angulated (p = 0.001). This improvement was still observed in more obvious cases (p = 0.013). The study by Nguyen also showed a greater improvement in the detection of non-obvious fractures, with the highest gains in diagnostic performance noted in buckle fractures and Salter-Harris II and IV fractures, with absolute differences of 20.63% (p < 0.001), 11.31% (p = 0.003), and 29.17% (p = 0.006), respectively. The study by Dupuis et al. reported the lowest sensitivity for pelvis fractures at 75% and noted that sensitivity for plastered patients was significantly lower at 54.5% compared to 95.5% for patients without casts.
AI-Assisted Interpretation
Four studies evaluated the impact of AI-assisted interpretation on clinician performance. Zech et al. (2022) found that AI assistance significantly improved resident accuracy from 80% to 93%, particularly for buckle fractures, while Nguyen et al. (2022) reported a 10% absolute increase in sensitivity across all readers when using AI. The meta-analysis for AI-assisted detection showed a statistically significant improvement in sensitivity (Figure 5), with a mean difference of 0.07 (p = 0.003). This analysis, however, showed considerable heterogeneity (I² = 72%, p = 0.01). The analysis for specificity (Figure 6) showed a trend towards improvement with a mean difference of 0.05, though this was not statistically significant (p = 0.08), and also demonstrated significant heterogeneity (I² = 76%, p = 0.005). The study by Zech et al. (2024) was the only one to assess efficiency and found that AI-assisted interpretation also significantly reduced radiograph interpretation times.
Figure 5. Forest plot for sensitivity - AI-assisted vs. human reader.
AI: artificial intelligence; CI: confidence interval; SD: standard deviation
Figure 6. Forest plot for specificity - AI-assisted vs. human reader.
AI: artificial intelligence; CI: confidence interval; SD: standard deviation
Discussion
This systematic review and meta-analysis provide compelling evidence that AI holds significant promise in the challenging domain of paediatric fracture detection. The pooled data clearly demonstrate that standalone AI systems possess a statistically significantly higher sensitivity compared to human readers, without a corresponding drop in specificity. This finding suggests that AI can identify more true fractures than clinicians alone. While a 4% absolute increase in sensitivity may appear modest, its clinical significance is potentially substantial in high-volume settings like an emergency department. Perhaps more importantly for clinical practice, the analysis also revealed that AI-assisted interpretation leads to a statistically significant improvement in clinicians' own sensitivity.
These results collectively underscore the potential of AI not as a replacement for human expertise, but as a powerful augmenting tool that can enhance diagnostic capabilities. By integrating AI assistance into clinical workflows, clinicians may benefit from an increased ability to detect subtle fractures, thereby reducing the rate of missed injuries, while maintaining the high specificity required to avoid unnecessary interventions. Publication bias was assessed using funnel plots (Figures 7, 8), which appeared symmetrical, suggesting a limited effect of publication bias on the studies included in the meta-analysis.
Figure 7. Funnel plot - sensitivity.
MD: mean difference; SE: standard error
Figure 8. Funnel plot - specificity.
MD: mean difference; SE: standard error
A critical aspect of interpreting these meta-analyses is the consideration of heterogeneity. The analysis of standalone AI specificity and both analyses for AI-assisted interpretation showed considerable to substantial heterogeneity between the included studies. This variability suggests that while the overall trend is positive, the magnitude of AI's effect differs significantly across different contexts. This is likely attributable to the wide diversity in the AI models themselves, the specific patient populations studied, and the methodologies employed in each trial. Therefore, while AI demonstrates high diagnostic accuracy, the limited external validation of many of these models remains a concern, raising questions about their generalisation to broader paediatric populations and varied clinical settings.
There are also concerns regarding data security and the potential ethical implications of AI; however, one survey indicated that 64% of parents were comfortable with an AI program diagnosing their child's fracture, while 82% supported AI being used as an adjunct to a clinician’s diagnosis. This suggests that while there is growing trust in AI applications, human oversight remains a crucial factor for widespread acceptance and successful implementation in paediatric fracture detection [19].
Limitations
This systematic review is limited by potential sources of bias and methodological issues. A primary concern is the overlap in authorship and patient cohorts in several studies, which could skew results and reduce the generalisability of the findings. Bias may also be present in the meta-analysis, as studies focusing on specific anatomical locations might report higher sensitivity than those examining all fracture types. The funding of included studies by the AI software companies creates a potential conflict of interest and a significant risk of bias, which must be considered when interpreting data. Additionally, all included studies were retrospective, which inherently limits the ability to establish causality or control for confounding variables. The scope and design of the included studies also present constraints. Most were single-centre studies with small external test datasets, and each assessed only one AI software, making it impossible to compare the performance of different systems.
The review highlights a critical need for multicentre randomised controlled trials that use a "gold standard" reference like CT imaging for fracture confirmation. Furthermore, the exclusion of crucial clinical information, such as patient history and physical exams, from the diagnostic process is a major weakness that could lead to misclassification. High heterogeneity was noted, likely reflecting the diversity in AI models and patient populations. Finally, the inconsistent distribution of fracture types meant that less obvious injuries, like buckle or Salter-Harris fractures, were more likely to be missed, affecting overall accuracy.
Conclusions
AI demonstrates high diagnostic performance in detecting paediatric fractures, with a statistically significantly higher sensitivity than human readers. When used as an assistive tool, it significantly improves clinicians' detection rates. Future research should prioritise multi-centre, prospective, randomised controlled trials to evaluate AI performance across diverse clinical settings, enhancing the generalisability of findings. Comparative studies involving multiple AI systems and the integration of clinical data are essential to determine the most effective and cost-efficient tools for clinical practice.
Disclosures
Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the following:
Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work.
Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work.
Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.
Author Contributions
Concept and design: Jordan Calleja, Gregory Firth
Acquisition, analysis, or interpretation of data: Jordan Calleja, Jacques Calleja, Kyle Muscat
Drafting of the manuscript: Jordan Calleja, Jacques Calleja, Kyle Muscat
Critical review of the manuscript for important intellectual content: Jordan Calleja, Gregory Firth
References
- 1.National trends and cost of litigation in UK National Health Service (NHS): a specialty-specific analysis from the past decade. Lane J, Bhome R, Somani B. Scott Med J. 2021;66:168–174. doi: 10.1177/00369330211052627. [DOI] [PubMed] [Google Scholar]
- 2.Diagnostic error in the emergency department: learning from national patient safety incident report analysis. Hussain F, Cooper A, Carson-Stevens A, Donaldson L, Hibbert P, Hughes T, Edwards A. BMC Emerg Med. 2019;19:77. doi: 10.1186/s12873-019-0289-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Artificial intelligence technologies to help detect fractures on X-rays in urgent care: early value assessment. [ Jan; 2025 ]. 2025. https://www.nice.org.uk/guidance/hte20
- 4.Understanding the use of artificial intelligence for implant analysis in total joint arthroplasty: a systematic review. Shah AK, Lavu MS, Hecht CJ 2nd, Burkhart RJ, Kamath AF. Arthroplasty. 2023;5:54. doi: 10.1186/s42836-023-00209-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Robotic arm-assisted total knee arthroplasty is associated with improved surgical and postoperative outcomes compared with imageless computer navigation: a large single-centre study. Tay ML, Kawaguchi K, Bolam SM, Bayan A, Young SW. Bone Joint J. 2025;107-B:804–812. doi: 10.1302/0301-620X.107B8.BJJ-2024-1499.R1. [DOI] [PubMed] [Google Scholar]
- 6.Artificial intelligence in orthopedic surgery: current applications, challenges, and future directions. Han F, Huang X, Wang X, et al. MedComm (2020) 2025;6:0. doi: 10.1002/mco2.70260. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.RevMan. Version 9.10.0. [ Jan; 2025 ]. 2025. https://revman.cochrane.org/
- 8.Using a dual-input convolutional neural network for automated detection of pediatric supracondylar fracture on conventional radiography. Choi JW, Cho YJ, Lee S, et al. Invest Radiol. 2020;55:101–110. doi: 10.1097/RLI.0000000000000615. [DOI] [PubMed] [Google Scholar]
- 9.External validation of a commercially available deep learning algorithm for fracture detection in children. Dupuis M, Delbos L, Veil R, Adamsbaum C. Diagn Interv Imaging. 2022;103:151–159. doi: 10.1016/j.diii.2021.10.007. [DOI] [PubMed] [Google Scholar]
- 10.Automated detection of acute appendicular skeletal fractures in pediatric patients using deep learning. Hayashi D, Kompel AJ, Ventre J, Ducarouge A, Nguyen T, Regnard NE, Guermazi A. Skeletal Radiol. 2022;51:2129–2139. doi: 10.1007/s00256-022-04070-0. [DOI] [PubMed] [Google Scholar]
- 11.Assessment of an artificial intelligence aid for the detection of appendicular skeletal fractures in children and young adults by senior and junior radiologists. Nguyen T, Maarek R, Hermann AL, et al. Pediatr Radiol. 2022;52:2215–2226. doi: 10.1007/s00247-022-05496-3. [DOI] [PubMed] [Google Scholar]
- 12.Detecting pediatric wrist fractures using deep-learning-based object detection. Zech JR, Carotenuto G, Igbinoba Z, Tran CV, Insley E, Baccarella A, Wong TT. Pediatr Radiol. 2023;53:1125–1134. doi: 10.1007/s00247-023-05588-8. [DOI] [PubMed] [Google Scholar]
- 13.Artificial intelligence to identify fractures on pediatric and young adult upper extremity radiographs. Zech JR, Jaramillo D, Altosaar J, Popkin CA, Wong TT. Pediatr Radiol. 2023;53:2386–2397. doi: 10.1007/s00247-023-05754-y. [DOI] [PubMed] [Google Scholar]
- 14.Artificial intelligence-based detection of paediatric appendicular skeletal fractures: performance and limitations for common fracture types and locations. Altmann-Schneider I, Kellenberger CJ, Pistorius SM, et al. Pediatr Radiol. 2024;54:136–145. doi: 10.1007/s00247-023-05822-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Comparison of diagnostic performance of a deep learning algorithm, emergency physicians, junior radiologists and senior radiologists in the detection of appendicular fractures in children. Gasmi I, Calinghen A, Parienti JJ, Belloy F, Fohlen A, Pelage JP. Pediatr Radiol. 2023;53:1675–1684. doi: 10.1007/s00247-023-05621-w. [DOI] [PubMed] [Google Scholar]
- 16.External validation of an artificial intelligence solution for the detection of elbow fractures and joint effusions in children. Dupuis M, Delbos L, Rouquette A, Adamsbaum C, Veil R. Diagn Interv Imaging. 2024;105:104–109. doi: 10.1016/j.diii.2023.09.008. [DOI] [PubMed] [Google Scholar]
- 17.Detecting pediatric appendicular fractures using artificial intelligence. Kavak N, Kavak RP, Güngörer B, Turhan B, Kaymak SD, Duman E, Çelik S. Rev Assoc Med Bras (1992) 2024;70:0. doi: 10.1590/1806-9282.20240523. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Artificial intelligence improves resident detection of pediatric and young adult upper extremity fractures. Zech JR, Ezuma CO, Patel S, et al. Skeletal Radiol. 2024;53:2643–2651. doi: 10.1007/s00256-024-04698-0. [DOI] [PubMed] [Google Scholar]
- 19.A survey of patient acceptability of the use of artificial intelligence in the diagnosis of paediatric fractures: an observational study. Roberts F, Roberts T, Gelfer Y, Hing C. Ann R Coll Surg Engl. 2024;106:694–699. doi: 10.1308/rcsann.2024.0008. [DOI] [PMC free article] [PubMed] [Google Scholar]