Abstract
Background
Vertebral fractures are the most common osteoporotic fractures in older individuals. Recent studies suggest that artificial intelligence performs as well as humans in detecting osteoporotic fractures, such as fractures of the hip, distal radius, and proximal humerus. However, whether artificial intelligence performs as well in detecting vertebral fractures on plain lateral spine radiographs has not yet been reported.
Questions/purposes
(1) What is the accuracy, sensitivity, specificity, and interobserver reliability (kappa value) of an artificial intelligence model in detecting vertebral fractures, based on Genant fracture grades, using plain lateral spine radiographs compared with values obtained by human observers? (2) Do patients’ clinical data, including the anatomic location of the fracture (thoracic or lumbar spine), T-score on dual-energy x-ray absorptiometry, or fracture grade severity, affect the performance of an artificial intelligence model? (3) How does the artificial intelligence model perform on external validation?
Methods
Between 2016 and 2018, 1019 patients older than 60 years were treated for vertebral fractures in our institution. Seventy-eight patients were excluded because of missing CT or MRI scans (24% [19]), poor image quality in plain lateral radiographs of the spine (54% [42]), multiple myeloma (5% [4]), or prior spine instrumentation (17% [13]). The plain lateral radiographs of 941 patients (one radiograph per person), with a mean age of 76 ± 12 years and 1101 vertebral fractures between T7 and L5, were retrospectively evaluated for training (n = 565), validation (n = 188), and testing (n = 188) of an artificial intelligence deep-learning model. The gold standard for diagnosis (ground truth) of a vertebral fracture was the independent interpretation of the CT or MRI images by a spine surgeon and a radiologist. If the two observers disagreed, they rechecked the corresponding CT or MRI images together to reach a consensus. For the Genant classification, the injured vertebral body height was measured in the anterior, middle, and posterior thirds. Fractures were classified as Grade 1 (< 25%), Grade 2 (26% to 40%), or Grade 3 (> 40%). The framework of the artificial intelligence deep-learning model comprised object detection, data preprocessing of radiographs, and classification to detect vertebral fractures. Approximately 90 seconds was needed to complete the procedure and obtain the artificial intelligence model results when the model was applied clinically. The accuracy, sensitivity, specificity, interobserver reliability (kappa value), receiver operating characteristic curve, and area under the curve (AUC) were analyzed. The bootstrapping method was applied to our testing and external validation datasets. Accuracy, sensitivity, and specificity were used to investigate whether the anatomic location of the fracture or the T-score on the dual-energy x-ray absorptiometry report affected the performance of the artificial intelligence model.
The receiver operating characteristic curve and AUC were used to investigate the relationship between the performance of the artificial intelligence model and fracture grade. External validation with a similar age population and plain lateral radiographs from another medical institute was also performed to investigate the performance of the artificial intelligence model.
Results
The artificial intelligence model with the ensemble method demonstrated excellent accuracy (93% [773 of 830] of vertebrae), sensitivity (91% [129 of 141]), and specificity (93% [644 of 689]) for detecting vertebral fractures of the lumbar spine. The interobserver reliability (kappa value) between the artificial intelligence model and human observers was 0.72 (95% CI 0.65 to 0.80; p < 0.001) for thoracic vertebrae and 0.77 (95% CI 0.72 to 0.83; p < 0.001) for lumbar vertebrae. The AUCs for Grade 1, 2, and 3 vertebral fractures were 0.919, 0.989, and 0.990, respectively. The model performed less well in discriminating normal osteoporotic lumbar vertebrae (specificity 91% [260 of 285]) than normal nonosteoporotic lumbar vertebrae (specificity 95% [222 of 234]). Sensitivity was higher for detecting osteoporotic (dual-energy x-ray absorptiometry T-score ≤ -2.5) lumbar vertebral fractures (97% [60 of 62]), implying easier detection, than for nonosteoporotic vertebral fractures (83% [39 of 47]). The artificial intelligence model also detected lumbar vertebral fractures better than thoracic vertebral fractures in the external dataset, which used various radiographic techniques. Based on the dataset for external validation, the overall accuracy, sensitivity, and specificity with the bootstrapping method were 89%, 83%, and 95%, respectively.
Conclusion
The artificial intelligence model detected vertebral fractures on plain lateral radiographs with high accuracy, sensitivity, and specificity, especially for osteoporotic lumbar vertebral fractures (Genant Grades 2 and 3). The rapid reporting of results using this artificial intelligence model may improve the efficiency of diagnosing vertebral fractures. The testing model is available at http://140.113.114.104/vght_demo/corr/. One or multiple plain lateral radiographs of the spine in the Digital Imaging and Communications in Medicine format can be uploaded to see the performance of the artificial intelligence model.
Level of Evidence
Level II, diagnostic study.
Introduction
Vertebral fractures are associated with back pain, kyphotic deformity, reduced quality of life, and increased morbidity and mortality [17]. Despite an increase in the proportion of diagnosed vertebral fractures, underdiagnosis [7] still occurs [8, 18, 32]; a multicenter, multinational study reported that the incidence of false-negative diagnoses of vertebral fractures was 34% [6]. Early screening for vertebral fractures is important considering the increased risk of subsequent hip fractures [22], which may further negatively impact patients’ quality of life and increase the social burden; in the year after hip fractures occur, patient mortality is increased [15]. Early detection of vertebral fractures is important not only for orthopaedists, but also for all physicians to improve patients’ quality of life and reduce social and medical burdens.
Recent developments in artificial intelligence (AI) deep-learning models have revealed remarkable advances in evaluating medical images [19]. Automated detection of vertebral fractures on CT images by a computer software system has a sensitivity of 95.7% [3]. However, plain radiographs remain the first-line modality for detecting vertebral fractures because they are readily available and have lower radiation exposure and expense than CT [1], even though CT is better able to visualize vertebral fractures with higher accuracy [30].
Back pain, which can be caused by vertebral fractures, is one of the most common health problems among older people and can result in disability [21]. In addition to orthopaedic surgeons, other physicians may need to diagnose vertebral fractures in various clinical settings, such as hospital inpatient or outpatient clinics or in the emergency room. Radiologists or orthopaedists may not be immediately available for consultation during peak times in a clinic or smaller hospitals [19]; hence, AI-assisted vertebral fracture detection may be a useful alternative. Moreover, if the performance of the AI model is stable and convincing, AI might localize the vertebral fracture in the plain lateral radiographs, thus assisting orthopaedic surgeons in identifying vertebral fractures and helping them to avoid missing these injuries, especially when they are busy and potentially distracted. However, we are aware of no clinical investigations evaluating AI deep-learning models for detecting vertebral fractures using plain lateral radiographs. We therefore sought to evaluate an automated AI deep-learning model for detecting vertebral fractures using plain lateral radiographs of the thoracolumbar or lumbar spine from a single medical center based on the Genant vertebral fracture grades.
Specifically, we asked (1) What is the accuracy, sensitivity, specificity, and interobserver reliability (kappa value) of an AI model in detecting vertebral fractures based on the Genant fracture grades using plain lateral spine radiographs compared with values obtained by human observers? (2) Do patients’ clinical data, including the anatomic location of the fracture (thoracic or lumbar spine), T-score on dual-energy x-ray absorptiometry (DEXA), or fracture grade severity, affect the performance of an AI model? (3) How does the AI model perform on external validation?
Patients and Methods
Patients
In this retrospective study, the inclusion criteria were patients older than 60 years with plain lateral radiographs and radiographic reports of vertebral fractures on CT or MRI. The exclusion criteria were a history of malignancy, vertebral osteomyelitis, or spinal instrumentation. The presence of multiple vertebral fractures on plain lateral radiographs was not an exclusion criterion.
Between 2016 and 2018, 1019 patients were treated for vertebral fractures at our institution. Seventy-eight patients were excluded because of missing CT or MRI scans (24% [19]), poor image quality in plain lateral spine radiographs (54% [42]), multiple myeloma (5% [4]), and prior spine instrumentation (17% [13]). The plain lateral radiographs of 941 patients (one radiograph per person) with a mean age of 76 ± 12 years and 1101 vertebral fractures between T7 and L5 were retrospectively evaluated for training (n = 565), validating (n = 188), and testing (n = 188) of an AI deep-learning model (Fig. 1).
Fig. 1.
This Standards for Reporting of Diagnostic Accuracy Studies (STARD) chart shows the training, validation, and testing datasets. All datasets (n = 941 patients) were randomly divided into training (n = 565), validation (n = 188), and testing (n = 188) with a ratio of 6:2:2. However, not all thoracic vertebrae in plain lateral radiographs were detected by YOLOv3 based on the AI framework. YOLOv3 detected eight vertebrae (three thoracic and five lumbar vertebrae), regardless of fracture status, from T10 to L5. The vertebrae detected by YOLOv3 were initially classified into thoracic and lumbar locations and further subcategorized into fractured and nonfractured vertebrae based on prior human labels. The normal vertebrae were under-sampled using the Python program; PLRs = plain lateral radiographs.
Gold Standard for Diagnosis
We considered human interpretation of CT or MRI scans as the gold standard for vertebral fracture diagnosis. All human labels on the plain lateral radiographs of the spine were judged independently by a spine surgeon (PHC) and a radiologist (HTHW), based on the CT or MRI images. If the two observers disagreed, they rechecked the corresponding CT or MRI images together to reach a consensus. Accordingly, we took advantage of the higher diagnostic accuracy of CT or MRI in identifying vertebral fractures on the plain lateral radiograph. The CT or MRI scans served as our references and were used to strengthen and improve the accuracy of our ground truth in judging whether a vertebra was fractured in each plain lateral radiograph. Each plain radiograph, together with its human labels, was defined as the ground truth in our dataset for AI model training, validation, and testing.
To understand when a CT or an MRI would have been available for a particular patient, it is important to know about the practice patterns and indications for ordering imaging tests. After vertebral fractures were diagnosed on the plain lateral spine radiographs, orthopaedic physicians at our center would very likely have ordered CT or MRI, depending on the indication, to further evaluate the vertebral fracture patterns. Based on the policy of Taiwan’s National Health Insurance Administration, CT and MRI scans cannot be ordered simultaneously to evaluate fracture patterns. In clinical practice at our institution, most orthopaedic physicians generally order a CT scan if the patient has a one-level vertebral fracture and no neurologic deficit, or MRI if the patient has multiple vertebral fractures or a neurologic deficit.
Plain Radiographic Techniques
AP and lateral radiographs of the thoracolumbar (T5/T6 to L5) or lumbar (T8/T9 to S1) spine were obtained. The dataset included only lateral radiographs of the spine. The most caudal mobile disc level was defined as L5 to S1, regardless of lumbar anomalies. The radiography machine used a high-voltage generator (SHIMADZU, UD150B-40) with a tube voltage of 94 kVp and an average exposure of 56 mAs over 360 msec. We did not routinely perform imaging of the T4 to L4 levels to survey for osteoporotic vertebral fractures, based on the policy of Taiwan’s National Health Insurance Administration.
DEXA
The DEXA indications covered by our National Health Insurance Administration are women older than 50 years, postmenopausal women receiving antiosteoporosis treatment, or patients who have sustained fragility fractures. Completion of a DEXA examination was not an inclusion criterion for this study, so not all patients in our dataset had completed a DEXA examination. In our testing dataset, we used the lower T-score of the two hips as the result. We did not investigate why some patients did not complete DEXA examinations in this series.
Training, Validating, and Testing the Dataset
We selected plain lateral radiographs taken between January 2016 and December 2018 for training, validation, and testing. We used 565 plain lateral radiographs for training, 188 for validation, and 188 for testing. A total of 941 patients (one plain lateral radiograph per person) with 1101 fractured vertebrae (Table 1) and 6358 normal vertebrae were included (Table 2). There were 655 fractured vertebrae in the training group, 226 in the validation dataset, and 220 in the testing set. The training set had 3752 normal vertebrae, the validation group 1280, and the testing dataset 1326. In total, there were 4407 fractured and normal vertebrae combined in the training dataset, 1506 in the validation set, and 1546 in the testing group.
Table 1.
Demographic data of included patients
| Parameter | Value |
| Age in years | 76 ± 12 |
| Males | 30 (283) |
| DEXA T-score | -3.0 ± 1.21 |
| Fracture location | |
| T7 | 1 (11) |
| T8 | 1 (11) |
| T9 | 2 (17) |
| T10 | 2 (21) |
| T11 | 7 (79) |
| T12 | 24 (265) |
| L1 | 28 (309) |
| L2 | 13 (145) |
| L3 | 11 (121) |
| L4 | 8 (90) |
| L5 | 3 (32) |
Data are presented as % (n) or mean ± SD. There were 941 patients and 1101 vertebrae included in the dataset.
Table 2.
Numbers of patients and vertebrae for training, validating, and testing
| Parameter | Training data | Validation data | Test data | |||
| Fracture (n = 655) | Normal (n = 3752) | Fracture (n = 226) | Normal (n = 1280) | Fracture (n = 220) | Normal (n = 1326) | |
| Anatomic location | ||||||
| Thoracic | 34 (223) | 49 (1827) | 38 (87) | 46 (589) | 36 (79) | 48 (637) |
| Lumbar | 66 (432) | 51 (1925) | 62 (139) | 54 (691) | 64 (141) | 52 (689) |
Data presented as % (n). The vertebrae above T9 were not consistently clearly visualized on plain lateral radiographs because of the diaphragm or lung markings. Accordingly, YOLOv3 detected approximately eight vertebrae (three thoracic and five lumbar) in each plain lateral radiograph in the dataset, regardless of whether a vertebra was fractured or nonfractured. The vertebrae marked with a bounding box by YOLOv3 were categorized by thoracic and lumbar location and subcategorized into fractured and nonfractured vertebrae based on prior human labels by physicians.
All human labels in each image were agreed on and interpreted by one experienced spine surgeon (PHC; 10 years of clinical practice) and a radiologist (HTHW; 20 years of clinical practice) based on vertebral body height measurement of the plain lateral radiographs. The grade of each vertebral fracture was correlated with corresponding CT or MRI reports.
All digitally obtained plain lateral radiographs were presented in Digital Imaging and Communications in Medicine format with the same computer software (Smart Viewer 3.2; Taiwan Electronic Data Processing Cooperation) for vertebral body height measurement. A total of 131 patients in the testing dataset completed DEXA examinations of their bilateral hips.
For the Genant classification of vertebral fractures [11], the injured vertebral body height was measured at the anterior, middle, and posterior thirds of the vertebra. The assumed preinjured vertebral body height was defined as the mean of the heights of the vertebral bodies one level cephalic and one level caudal to the injured level. The percentage of vertebral body height loss was calculated as (preinjury vertebral body height − measured vertebral body height) / preinjury vertebral body height. Fractures were graded as Grade 1 (< 25%), Grade 2 (26% to 40%), or Grade 3 (> 40%). An endplate fracture was not defined as the mild stage of vertebral fracture based on Genant grading and was not included in the AI model training.
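The height-loss calculation and grading thresholds above can be sketched as follows; this is a minimal illustration, and the function names are ours rather than the study's.

```python
def height_loss_percent(preinjury_height, measured_height):
    """Percentage of vertebral body height loss:
    (preinjury height - measured height) / preinjury height."""
    return (preinjury_height - measured_height) / preinjury_height * 100

def genant_grade(cephalic_height, caudal_height, measured_height):
    """Genant grade of an injured vertebra. The assumed preinjured height
    is the mean of the adjacent cephalic and caudal vertebral body heights."""
    preinjury = (cephalic_height + caudal_height) / 2
    loss = height_loss_percent(preinjury, measured_height)
    if loss > 40:
        return 3  # Grade 3 (> 40%)
    if loss > 25:
        return 2  # Grade 2 (26% to 40%)
    if loss > 0:
        return 1  # Grade 1 (< 25%)
    return 0      # no measurable height loss

# For example, a vertebra measuring 21 mm between neighbors of 30 mm each
# has lost 30% of its assumed preinjured height: a Grade 2 fracture.
```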
Dataset Processing
There were 655 and 226 fractured vertebrae in the training and validation datasets, respectively (Fig. 1). There were 5032 normal vertebrae, a number that exceeded the number of fractured vertebrae (n = 881) by nearly sixfold (Table 2). For optimal AI model training, the number of normal and fractured vertebrae should be approximately equal because imbalanced datasets may result in poor AI performance; therefore, normal vertebrae were under-sampled for the training and validation datasets. To equalize the datasets, we randomly selected normal vertebrae using a Python program (GPU specification and software environment: NVIDIA GeForce GTX 1080 Ti; CUDA 9.0; cuDNN 7.0.5; Python 3.6.7; Keras 2.2.4; TensorFlow 1.6.0).
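The under-sampling step can be sketched in a few lines of Python; this is an illustrative reconstruction, not the authors' script.

```python
import random

def undersample_normals(normal_vertebrae, n_fractured, seed=42):
    """Randomly select as many normal vertebrae as there are fractured
    ones, so the two classes are approximately balanced for training."""
    rng = random.Random(seed)
    return rng.sample(normal_vertebrae, n_fractured)

# e.g., balance the 5032 normal vertebrae against the 881 fractured ones
normals = list(range(5032))          # stand-ins for normal-vertebra crops
balanced_normals = undersample_normals(normals, 881)
```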
Not all thoracic vertebrae in each plain lateral radiograph were detected by You Only Look Once version 3 (YOLOv3) because of interference from the diaphragm or lung markings, particularly above the T9 level, regardless of whether the vertebrae were fractured or nonfractured. Generally, three thoracic and five lumbar vertebrae (an average of eight vertebrae) in each plain lateral radiograph could be detected by YOLOv3 (Fig. 1). The vertebrae with a bounding box detected by YOLOv3 were initially classified into thoracic and lumbar locations and further subcategorized into fractured and nonfractured vertebrae based on prior human labels.
Detection of Fractured and Nonfractured Vertebrae by You Only Look Once Version 3
The backbone of the YOLOv3 software [23] is the Darknet53 architecture, a convolutional neural network (CNN) model (Supplementary Appendix 1; Supplemental Digital Content 1, http://links.lww.com/CORR/A502). YOLOv3 selects features at various scales and concatenates them to predict the center point, width, and height of a bounding box. With logistic regression and anchor boxes, better object detection and more accurate bounding can be achieved [23] (Fig. 2A). We detected the thoracic and lumbar vertebrae using YOLOv3 and further subcategorized these vertebrae into fractured and nonfractured vertebrae based on labels determined by human observers (Fig. 1).
Fig. 2.
A-C (A) This illustration shows each vertebral body detected with YOLOv3 on the original plain lateral radiographs, regardless of vertebral fracture status. A drawback of the current bounding box was that it was horizontal, without rotation, and therefore not always parallel to the vertebral endplates, which may lead to bias. The optimal bounding box would be parallel to the vertebral endplates so that each vertebra is well-centered in the box. (B) Automatic data preprocessing was categorized into image-quality and image-size preprocessing, which were combined into four data preprocessing methods to optimize image quality for AI analysis. (C) This illustration shows the development of the AI deep-learning ensemble model with various architectures of the pretrained models.
Automatic Data Preprocessing: Image-quality and Image-size Preprocessing
The original plain lateral radiographs were not always of good quality. To improve image quality, we adopted automatic data preprocessing with two parts: image-quality and image-size preprocessing (Fig. 2B). Image-quality preprocessing (Supplementary Appendix 1; Supplemental Digital Content 1, http://links.lww.com/CORR/A502) applied four techniques step by step (color inversion, Gaussian blur, a median filter, and contrast-limited adaptive histogram equalization) to reduce noise and adjust brightness and contrast (the contrast-enhanced method).
Image-size preprocessing (Supplementary Appendix 1; Supplemental Digital Content 1, http://links.lww.com/CORR/A502) was conducted with either of two methods: images were resized to an identical format (224 × 224 pixels) (resize), or the original image ratio was preserved by filling with black blocks to match the identical format (padding). The two categorizations were combined into four methods for automatic data preprocessing: (1) the contrast-enhanced method with resize, (2) the contrast-enhanced method with padding, (3) no contrast-enhanced method with resize, and (4) no contrast-enhanced method with padding (Fig. 2B). In a preliminary test, automatic data preprocessing of the whole plain lateral radiograph was also applied before object detection by YOLOv3; however, placing automatic data preprocessing before object detection did not improve the performance of the AI model.
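The padding variant of image-size preprocessing preserves the aspect ratio by adding black borders until the image is square before scaling to 224 × 224. A minimal sketch of the border arithmetic (the function name is illustrative):

```python
def pad_to_square(width, height):
    """Return (left, right, top, bottom) black-border widths that make a
    width x height image square while preserving its aspect ratio; the
    squared image can then be scaled to 224 x 224 without distortion."""
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return left, pad_w - left, top, pad_h - top

# A 300 x 200 crop gets 50-pixel black blocks above and below.
```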
Development of the AI Deep-learning Ensemble Model
The concept of transfer learning [20] (Supplementary Appendix 1; Supplemental Digital Content 1, http://links.lww.com/CORR/A502) was adopted to train the AI model based on well-known pretrained models, including ResNet34, ResNet50, DenseNet121, DenseNet169, and DenseNet201 (Fig. 2C). The weights obtained from pretraining on ImageNet data were the initial weights for our AI model training. To identify the best combination of data preprocessing method and pretrained model, we processed the data independently with the four data preprocessing methods, trained each of the five pretrained models separately on each, and analyzed the results.
The no contrast-enhanced with padding method performed best with the ResNet34, DenseNet121, and DenseNet201 models in the preliminary results (Supplementary Appendix Table 1; Supplemental Digital Content 2, http://links.lww.com/CORR/A503). Accordingly, an ensemble model was developed consisting of these three pretrained models (Supplementary Appendix 1; Supplemental Digital Content 1, http://links.lww.com/CORR/A502), each of which judged the fracture status of a vertebra after plain lateral radiograph input (Fig. 3A). If two or more of the three models agreed that the vertebra was fractured, the average probability was computed from the agreeing models.
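One plausible reading of this aggregation rule (the source wording is ambiguous) is a majority vote whose reported probability is the mean over the models on the winning side; a sketch under that assumption, with an illustrative function name:

```python
def ensemble_decision(probabilities, threshold=0.5):
    """Majority vote over per-model fracture probabilities (e.g., from
    ResNet34, DenseNet121, and DenseNet201). Returns (is_fracture,
    average probability of the models on the winning side)."""
    positive = [p for p in probabilities if p >= threshold]
    negative = [p for p in probabilities if p < threshold]
    if len(positive) > len(negative):
        return True, sum(positive) / len(positive)
    return False, sum(negative) / len(negative)
```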
Fig. 3.
A-B (A) The AI deep learning ensemble model consisted of the ResNet34, DenseNet121, and DenseNet201 pretrained models, which were used to assess the vertebral fractures based on their own judgment, with input from the original plain lateral radiographs of the thoracolumbar or lumbar spine. (B) Our ensemble model (red line) revealed greater accuracy and stable performance than for the ResNet34 (blue line), DenseNet121 (orange line), and DenseNet201 (green line) pretrained models. This study applied bootstrapping to evaluate the model performance. The x-axis and y-axis represent the number of resamplings and accuracy for each resampling, respectively; PLR = plain lateral radiograph. A color image accompanies the online version of this article.
Optimal Framework of the AI Model
The ensemble model provided stable performance, with greater accuracy and less variation than the ResNet34, DenseNet121, and DenseNet201 pretrained models (Fig. 3B). Consequently, the AI deep-learning model framework included YOLOv3 for detection of fractured and nonfractured vertebrae, automatic data preprocessing, and the ensemble model for detecting vertebral fractures (Fig. 2).
Ethical Approval
Ethical approval for this study was obtained from the institutional review board of Taipei Veterans General Hospital (IRB no: 2017-10-008BC).
AI Model Evaluation and Statistical Analysis
An independent testing dataset (n = 188) was used to compare the AI model with human observers in terms of accuracy, sensitivity, specificity, and kappa values. We also used a dataset from another medical center, which used a different plain radiographic technique, for external validation; these images involved patients from a population of similar age. That dataset included 52 patients (mean age 72 ± 10 years), 76 fractured vertebrae, and 284 normal vertebrae; 14 vertebrae with pedicle screws were excluded. The fractured vertebral levels were T10: 0, T11: 5, T12: 14, L1: 25, L2: 12, L3: 11, L4: 6, and L5: 3 (Supplementary Appendix Table 2; Supplemental Digital Content 3, http://links.lww.com/CORR/A504). Gradient-weighted Class Activation Mapping (Grad-CAM) (Supplementary Appendix 1; Supplemental Digital Content 1, http://links.lww.com/CORR/A502) [4] was used to generate heatmaps as evidence that the model recognized vertebral fractures. A confusion matrix was constructed to evaluate performance. Various classification probability thresholds were set for the AI model, receiver operating characteristic curves were drawn, and the area under the curve (AUC) was computed to investigate the model’s performance. The receiver operating characteristic curves were used to investigate the relationship between the Genant fracture grade and the performance of the AI model. To obtain point and interval estimates, we used the bootstrap method (Supplementary Appendix 1; Supplemental Digital Content 1, http://links.lww.com/CORR/A502), resampling the test data with replacement 1000 times in the testing and external validation datasets; the mean accuracy, sensitivity, specificity, and 95% CIs were computed (IBM SPSS, version 21.0).
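The bootstrap evaluation can be sketched with the standard library alone; this illustrative version computes a percentile 95% CI for accuracy (sensitivity and specificity follow the same pattern):

```python
import random

def bootstrap_accuracy(y_true, y_pred, n_resamples=1000, seed=0):
    """Resample (truth, prediction) pairs with replacement n_resamples
    times; return the mean accuracy and a percentile 95% CI."""
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    accs = sorted(
        sum(t == p for t, p in (rng.choice(pairs) for _ in range(n))) / n
        for _ in range(n_resamples)
    )
    lower = accs[int(0.025 * n_resamples)]       # 2.5th percentile
    upper = accs[int(0.975 * n_resamples) - 1]   # 97.5th percentile
    return sum(accs) / n_resamples, (lower, upper)
```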
Clinical Application of the AI model
Original plain lateral radiographs were input directly by physicians, and the AI model performed the entire procedure automatically, providing output on fractured vertebrae with location labeling and fracture probability. Approximately 90 seconds was required from input of the plain lateral radiographs to output of the final results (Fig. 4).
Fig. 4.
A-B These images show the clinical application of the AI model. (A) The procedure consisted of clicking “sent to AI,” waiting 90 seconds for the AI operation, and then clicking “view AI result.” (B) The results included fracture location labelling with a red rectangle; the numbers indicate the probability of fracture, which was automatically reported by the AI model. The average fracture probability was computed from the models that agreed the vertebra was fractured.
Results
The AI model can be applied clinically (Fig. 4). The current testing model is available at http://140.113.114.104/vght_demo/corr/. One or multiple plain lateral radiographs in Digital Imaging and Communications in Medicine format can be uploaded to see the AI model’s performance. The web application interface for vertebral fracture detection is also shown (Supplementary Appendix 2; Supplemental Digital Content 4, http://links.lww.com/CORR/A505).
Performance of the AI Model for Thoracic or Lumbar Spine Vertebral Fractures
After applying the AI model to the independent testing dataset and comparing its performance with that of human observers, the model, evaluated with the bootstrapping method, delivered a mean accuracy in detecting lumbar vertebral fractures of 92% (95% CI 92.15% to 92.67%; p < 0.001), a sensitivity of 91% (95% CI 90.83% to 91.62%; p < 0.001), and a specificity of 94% (95% CI 93.27% to 93.92%; p < 0.001) (Table 3).
Table 3.
Testing dataset on anatomic locations to compare AI performance and human labels in accuracy, sensitivity, and specificity with bootstrapping method
| Anatomic location | Accuracy | Sensitivity | Specificity | ||||||
| Mean, % | 95% CI, % | p value | Mean, % | 95% CI, % | p value | Mean, % | 95% CI, % | p value | |
| Thoracic | 92 | 91.84-92.36 | < 0.001 | 95 | 94.75-95.34 | < 0.001 | 89 | 88.72-89.58 | < 0.001 |
| Lumbar | 92 | 92.15-92.67 | < 0.001 | 91 | 90.83-91.62 | < 0.001 | 94 | 93.27-93.92 | < 0.001 |
| Overall | 92 | 91.61-92.15 | < 0.001 | 91 | 90.55-91.32 | < 0.001 | 93 | 92.46-93.19 | < 0.001 |
The confusion matrix for thoracic vertebral fractures had an imbalance of false positives and false negatives; however, detection of lumbar vertebral fractures was balanced (Table 4). Misclassification by the AI model may still occur, however. For misclassified vertebrae, Grad-CAM mapping was applied [27] to visualize the fracture-discriminative area as a hot spot on the Grad-CAM heatmap. Common causes of false-positive misclassification included osteoporosis (DEXA T-score ≤ -2.5) (9 of 21 vertebrae) and lung markings or the diaphragm (5 of 21 vertebrae) (Table 5). The Grad-CAM heatmap revealed that the AI model focused on the concave portion of the vertebra (Fig. 5A) and on the lung markings or diaphragm (Fig. 5B), which resulted in misclassified fractures. False-negative misclassification often occurred for Grade 1 fractures (11 of 16 vertebral fractures) with an inconspicuous depression in the vertebra (Fig. 5C), lung markings (3 of 16), or Grade 3 fractures with decreased vertebral body height greater than 80% (Fig. 5D) on Grad-CAM.
Table 4.
Confusion matrix of thoracic and lumbar spines on a balanced testing dataset
| Thoracic spine (n = 134) | ||
| Prediction | Truth | |
| Fracture | Normal | |
| Fracture | 46 (129) | 4 (10) |
| Normal | 4 (12) | 46 (131) |
| Lumbar spine (n = 225) | ||
| Prediction | Truth | |
| Fracture | Normal | |
| Fracture | 47 (75) | 7 (11) |
| Normal | 3 (4) | 43 (68) |
Data presented as % (n).
A balanced testing dataset means the number of positive cases approximately equals the number of negative cases; this testing set has the same number of normal and fractured vertebrae.
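The accuracy, sensitivity, and specificity reported throughout follow directly from confusion-matrix counts like those in Table 4; a minimal sketch (the function name is ours):

```python
def metrics(tp, fp, fn, tn):
    """Diagnostic metrics from confusion-matrix counts: tp/fp/fn/tn are
    true-positive, false-positive, false-negative, true-negative counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall on fractured vertebrae
        "specificity": tn / (tn + fp),   # recall on normal vertebrae
    }

# e.g., the lumbar counts in Table 4: tp=75, fp=11, fn=4, tn=68
```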
Table 5.
Causes of false-positive and false-negative on balanced testing dataset
| Parameter | Number of fractures with parameter of interest | |
| False positive (n = 21) | ||
| T-score > -2.5 | 3 | |
| T-score ≤ -2.5 | 9 | |
| Bean-can effect | 1 | |
| Lung marking | 5 | |
| Other | 3 | |
| False negative (n = 16) | ||
| Grade 1 | 11 | |
| Grade 3 | 2 | |
| Lung marking | 3 | |
| Other | 0 |
Fig. 5.

A-D These images show an illustration of vertebral fractures detected by the AI model in the Gradient-weighted Class Activation Mapping (Grad-CAM) heat map: (A) false positive = osteoporosis (DEXA T score = -3.3); (B) false positive = lung marking or diaphragm; (C) false negative = Grade 1 fracture; (D) false negative = Grade 3 fracture (decreased vertebral body height > 80%).
Performance of the AI Model in Osteoporotic or Nonosteoporotic Vertebral Fractures
The lumbar vertebra dataset was categorized into two subgroups according to DEXA T-scores (> -2.5 represented nonosteoporosis; ≤ -2.5 represented osteoporosis). With the bootstrapping method, the mean sensitivity for detecting nonosteoporotic and osteoporotic lumbar vertebral fractures was 83% (95% CI 82.55% to 83.59%; p < 0.001) and 97% (95% CI 96.71% to 97.19%; p < 0.001), respectively. The specificity for detecting nonosteoporotic and osteoporotic normal lumbar vertebrae was 94.96% (95% CI 94.64% to 95.27%; p < 0.001) and 91.35% (95% CI 90.95% to 91.75%; p < 0.001), respectively (Table 6). The higher sensitivity (97%) for detecting osteoporotic vertebral fractures implies that they were easier to detect than nonosteoporotic vertebral fractures. The lower specificity (91%) for osteoporotic normal lumbar vertebrae suggests that the AI model may misclassify a normal osteoporotic vertebra as fractured.
Table 6.
Testing dataset on osteoporosis to evaluate AI performance in lumbar vertebral fractures compared between AI performance and human labels in terms of accuracy, sensitivity, and specificity with bootstrapping method
| T-score in the lumbar spine | Accuracy, mean % (95% CI; p value) | Sensitivity, mean % (95% CI; p value) | Specificity, mean % (95% CI; p value) |
| > -2.5 | 89.01 (88.71-89.31; < 0.001) | 83.07 (82.55-83.59; < 0.001) | 94.96 (94.64-95.27; < 0.001) |
| ≤ -2.5 | 94.15 (93.92-94.38; < 0.001) | 96.95 (96.71-97.19; < 0.001) | 91.35 (90.95-91.75; < 0.001) |
Only 131 patients in the testing dataset completed DEXA examinations of the hip. In all, 58 patients were categorized as nonosteoporotic (fractured vertebrae: 47; normal vertebrae: 234) and 73 patients were included in the osteoporotic group (fractured vertebrae: 62; normal vertebrae: 285).
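The bootstrapped sensitivity and specificity estimates above can be illustrated with a minimal sketch. This is not the authors' implementation; the resample count, seed, and percentile-based 95% CI are assumptions for illustration only.

```python
import random

def sensitivity(y_true, y_pred):
    """Fraction of true fractures (label 1) the model flags as fractured."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0

def bootstrap_metric(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Resample (truth, prediction) pairs with replacement and collect the
    metric on each resample; report the mean and a percentile 95% CI."""
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        stats.append(metric([t for t, _ in sample], [p for _, p in sample]))
    stats.sort()
    mean = sum(stats) / len(stats)
    lo, hi = stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]
    return mean, (lo, hi)
```

For example, a model that detects 45 of 50 fractures and mislabels 5 of 50 normal vertebrae yields a bootstrapped mean sensitivity near 0.90 with a CI reflecting the sample size.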
Performance of the AI Model in Vertebral Fractures with Different Grades
The AI model performed better for detecting Grade 2 and Grade 3 lumbar vertebral fractures than for Grade 1 fractures (Fig. 6). The sensitivity for detecting Grade 2 lumbar vertebral fractures was 98% (53 of 54 vertebral fractures) (Table 7). The AUCs for Grade 2 and Grade 3 fractures were similar, suggesting that more severe fractures were more easily detected by the AI model.
Fig. 6.

This graph shows the receiver operating characteristic (ROC) curve and area under the curve (AUC) for each Genant grade. The AUC for Grade 1, Grade 2, and Grade 3 fractures were 0.919, 0.989, and 0.990, respectively. Our model discriminated Grade 2 and Grade 3 fractures better than it discriminated Grade 1 fractures. A color image accompanies the online version of this article.
Table 7.
Testing dataset on Genant fracture grading in the lumbar spine to compare AI performance with human labels in accuracy and sensitivity with bootstrapping method
| Degree of lumbar fractures | Accuracy, mean % (95% CI; p value) | Sensitivity, mean % (95% CI; p value) |
| Grade 1 | 84 (83.55-84.23; < 0.001) | 78 (77.39-78.53; < 0.001) |
| Grade 2 | 95 (94.74-95.14; < 0.001) | 99 (98.41-100.0; < 0.001) |
| Grade 3 | 94 (93.50-93.97; < 0.001) | 97 (97.12-97.57; < 0.001) |
A total of 141 fractured lumbar vertebrae were included in the test dataset; the fractured lumbar vertebrae included Grade 1 (n = 50), Grade 2 (n = 54), and Grade 3 (n = 37) fractures.
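The Genant grading used for these subgroups maps percentage vertebral height loss to a semiquantitative grade. A minimal sketch follows, using the cutoffs stated in this study (Grade 1 < 25%, Grade 2 26% to 40%, Grade 3 > 40%); the handling of the exact 25% to 26% boundary and the helper for computing height loss are assumptions for illustration, not the authors' code.

```python
def height_loss_pct(fractured_mm, expected_mm):
    """Percentage height loss of the injured vertebral body relative to the
    expected (reference) height, measured anteriorly, in the middle, or
    posteriorly."""
    return 100.0 * (expected_mm - fractured_mm) / expected_mm

def genant_grade(loss_pct):
    """Semiquantitative Genant grade for a confirmed fracture.
    Assumption: values falling in the unstated 25-26% gap are treated as
    Grade 2."""
    if loss_pct > 40:
        return 3
    if loss_pct >= 25:
        return 2
    return 1
```

For example, an anterior height of 18 mm against an expected 24 mm is a 25% loss, graded as a Grade 2 fracture under this convention.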
Interobserver Reliability of AI Performance and Human Observers
The interobserver agreement (kappa) between the AI model and human observers was 0.72 (95% CI 0.65 to 0.80; p < 0.001) for thoracic vertebrae and 0.77 (95% CI 0.72 to 0.83; p < 0.001) for lumbar vertebrae. These results indicate good agreement between the AI model and human labels in the detection of thoracic and lumbar vertebral fractures.
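Cohen's kappa, the chance-corrected agreement statistic reported here, can be computed from two label sequences as in this minimal sketch (not the authors' implementation; it assumes complete paired labels and chance agreement below 1):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) /
    (1 - chance agreement)."""
    n = len(labels_a)
    # Observed proportion of cases on which the two raters agree.
    po = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement from each rater's marginal label frequencies.
    cats = set(labels_a) | set(labels_b)
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)
```

Perfect agreement yields kappa = 1.0; agreement no better than chance yields kappa = 0.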
Performance of the AI Model in External Validation
External validation was performed as part of the AI performance evaluation, as noted earlier. On the external dataset, which was acquired with different radiographic techniques, the AI model detected lumbar vertebral fractures better than thoracic vertebral fractures. With the bootstrapping method, the mean overall accuracy, sensitivity, and specificity were 89%, 83%, and 95%, respectively (Supplementary Appendix Table 2; Supplemental Digital Content 3, http://links.lww.com/CORR/A504).
Discussion
Artificial intelligence has been applied to image analysis with good performance for detecting fragility-related fractures, such as fractures of the hips [4], distal radius [10], and proximal humerus [5]. Although vertebral fracture is the most common fragility fracture among older patients, orthopaedic surgeons may not always be immediately available in smaller hospitals to assess images of patients who may have spine fractures. For that reason, AI-assisted screening for vertebral fractures may be a useful way to share the clinical load of orthopaedic surgeons. Our artificial intelligence model detected vertebral fractures at the T10 to L5 levels on plain lateral radiographs with high accuracy, sensitivity, and specificity, especially osteoporotic lumbar vertebral fractures of Genant Grades 2 and 3. The rapid reporting of results using this AI-based model may improve the efficiency with which vertebral fractures are diagnosed. However, clinical users should be aware that our AI model cannot identify malignant vertebral fractures caused by metastasis or multiple myeloma, and formal reports by radiologists would still be required for final judgment.
Limitations
This study has limitations. First, the AI model detected only eight vertebrae (T10 to L5) on plain lateral radiographs; it is difficult for the model to detect vertebral fractures at or above T9 because native lung markings and the diaphragm are major obstacles to better performance. We might continue to improve the AI model with a larger dataset or with other models. Second, the AI model was developed to detect vertebral fractures without investigating underlying causes such as neoplasm, osteomyelitis, or multiple myeloma; thus, as noted, this AI model should not be used if cancer or infection is prominent in the differential diagnosis. Users must consider this major limitation to avoid missed diagnoses, and clinical evaluation and correlation are also important for diagnosing pathologic vertebral fractures.
In addition, only plain lateral radiographs were used to detect vertebral fractures, although some burst fractures show interpedicular-space widening [25] on the AP projection. The aim of the AI model is to assist physicians in detecting or screening for vertebral fractures; even with its assistance, physicians must still judge vertebral fractures on both the AP and lateral projections of plain spine radiographs and make appropriate clinical correlations rather than relying solely on the model. Moreover, the baseline information was based on vertebral body height measurements and interpretations by one experienced spine surgeon (PHC) and one radiologist (HTHW), which may have limited the reliability of the AI model's performance and introduced bias. If more experienced physicians were involved in the human-determined labeling of baseline data, greater reliability might be achieved, resulting in improved AI performance. To overcome this limitation and increase the reliability of the ground truth, we labeled each vertebral body on the plain lateral radiograph according to the corresponding CT or MRI scan, which reduced human-determined bias and increased consensus. We further note that the number of plain lateral radiographs (n = 941) was a small sample for AI model training and validation, and a larger dataset might improve the AI model's diagnostic performance. We applied the concept of transfer learning with pretrained models to overcome the small sample size and used an ensemble model to improve AI performance (Supplementary Appendix Table 3; Supplemental Digital Content 5, http://links.lww.com/CORR/A506). As yet, there is no consensus on the necessary sample size for deep-learning models of this kind.
Data in this study were derived from vertebral fractures in patients older than 60 years; therefore, the AI model might not perform equally well for detecting vertebral fractures in younger populations because of different baseline data. A prospective study should be conducted to evaluate the AI model in both older and younger populations. Relatedly, the AI model was unable to differentiate acute from subacute fractures, degenerative spondylolisthesis, and disc degeneration. We might expand the AI model into multifunctional models with larger datasets for different clinical diagnoses.
In our study, we used the receiver operating characteristic curve and the Youden index to select a threshold that balances the costs of false positives and false negatives. Clinically, the AI model might be used to screen for more vertebral fractures to avoid missed diagnoses, in which case the threshold could be set lower. Hence, different thresholds could be chosen, and they would influence the AI model's performance. This was one of our major limitations.
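Threshold selection by the Youden index can be sketched as follows: sweep the candidate score thresholds and keep the one maximizing J = sensitivity + specificity - 1. This is an illustrative stand-in, not the authors' code; the score scale and tie-breaking rule are assumptions.

```python
def youden_threshold(scores, labels):
    """Return the score cutoff maximizing Youden's J over the ROC curve.
    scores: model fracture probabilities; labels: 1 = fracture, 0 = normal."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        # Classify as fractured when score >= t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / pos + tn / neg - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

Lowering the chosen cutoff below the Youden-optimal value trades specificity for sensitivity, matching the screening-oriented use described above.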
Diagnostic Effectiveness of this AI Model
Our AI model demonstrated accuracy, sensitivity, and specificity greater than 90% and AUCs above 0.9, with a system that provided answers in about 90 seconds. The AI model may help orthopaedic surgeons avoid missing vertebral fractures, especially when they are busy and potentially distracted. With the assistance of an AI model, primary care physicians may not always need to consult orthopaedic surgeons about patients with suspected vertebral fractures; however, future clinical studies will be needed to determine whether this tool is effective in that setting. Furthermore, our AI model can identify osteoporotic vertebral fractures on plain lateral radiographs in patients older than 60 years, and thus assist orthopaedic surgeons in identifying vertebral fractures that are difficult to diagnose because of osteoporosis; this may help identify sentinel events and perhaps prevent subsequent hip fractures [22].
A previous study [3] reported an automatic system for detecting vertebral fractures of the thoracic and lumbar spine using CT scans. One explanation for the difference in performance may be that CT can visualize vertebral fractures with higher accuracy and sensitivity. Another advantage of that investigation was that it included vertebral levels from T1 to L5, providing many more vertebrae for fracture identification or screening than our study, which included only eight vertebrae (T10 to L5).
However, radiation exposure associated with CT remains a concern for patients. The radiation exposure associated with CT of the spine varies among CT scanners [28, 29]. The calculated effective doses for CT of the thoracic and lumbar spine are approximately 10 mSv and 5.6 mSv, respectively [24, 29], whereas the calculated effective doses for lumbar AP radiographs and plain lateral radiographs are 2.2 mSv and 1.5 mSv, respectively [29]. Physicians routinely order plain radiographs as part of the initial evaluation of spinal disorders including vertebral fractures.
Another issue is that endplate fractures, Schmorl nodes [16], and Scheuermann kyphosis [26] share similar radiographic findings in patients with vertebral fractures and are not categorized according to the Genant classification. These deformities could be easily diagnosed by experienced radiologists or orthopaedists but are not discriminated by an AI model.
Parameters that Influence the AI Model’s Effectiveness
Numerous factors were associated with the varying effectiveness of our AI model, although it generally performed well across a variety of settings. Clinical factors associated with poorer performance included lung markings, the diaphragm, and the bean-can effect (a biconcave normal appearance of age-related endplate changes) [13], which may bias the detection of vertebral fractures by the AI model. Using morphometry, Guglielmi et al. [14] reported a failure rate exceeding 20% for T5 to T10 fractures. Aging-related vertebral wedge changes [9] have also been reported to resemble osteoporotic vertebral fractures [19].
We investigated whether it is easier to detect vertebral fractures in the osteoporotic spine than in the nonosteoporotic spine. One possible explanation is that in osteoporotic vertebral fractures, bone mineral content is unchanged but bone mineral density increases because of the reduced vertebral area [12]; denser fractured vertebrae may therefore be more conspicuous in patients with osteoporosis.
Burns et al. [3] reported higher sensitivity for the automatic detection of Genant Grade 3 fractures on CT images than for Genant Grade 1 and 2 fractures. Previous studies have reported that radiologists do not diagnose Genant Grade 1 vertebral fractures very well [2, 31]. In our study, some severely fractured vertebrae were misidentified as normal by the AI model. A possible reason is that the bounding box for a severely deformed vertebra might mis-bound the adjacent normal vertebral endplates, which was a drawback of the AI model.
Using YOLOv3, we observed vertebrae that were not well centered and were clipped after object detection. One possible reason is that YOLOv3 bounding boxes are rectangular and axis-aligned, without rotation, and therefore cannot remain parallel to the endplates; physiologic thoracic kyphosis and lumbar lordosis give each endplate its own orientation.
Use of the Model in Practice
Our AI model can be deployed on computers in emergency rooms and outpatient and inpatient clinics for use by primary physicians and orthopaedic surgeons. We provided the clinical application interface in our computer system (Fig. 4). The procedure consists of clicking "sent to AI," waiting about 90 seconds for the AI operation, and clicking "view AI result." It is neither difficult nor time-consuming for clinical users to operate the interface and view the AI reports. One or more plain lateral radiographs in Digital Imaging and Communications in Medicine (DICOM) format can be uploaded directly without processing. Using the AI model in clinical practice for assisted identification or screening of vertebral fractures offers several advantages: it shares the clinical load of orthopaedic surgeons, helps avoid missed diagnoses, and makes clinical pathways more efficient.
Future Directions
The AI model in this report had only a single function—labeling vertebral fractures between the T10 and L5 levels—but we continue to expand the functions of our model with more datasets for training and validation. Examples of such improvements include automatic Genant fracture grading, description of the affected anatomic levels, identification of degenerative lumbar disorders (such as spondylolisthesis and pars interarticularis fracture), and discrimination of endplate erosion caused by osteomyelitis.
Conclusion
We developed an AI model for the detection of vertebral fractures that demonstrated excellent diagnostic performance. In clinical use, the procedure consists merely of clicking "sent to AI" and "view AI result," with results available in only 90 seconds. If the performance of our tool can be validated by others, this kind of rapid, accurate reporting of results may help reduce clinical load and avoid missed vertebral fractures. However, clinical correlation with patients' chief complaints, history and demographics, symptoms, and physical examinations is still needed. We continue to develop our AI model, and future versions will include automatic fracture grading with the Genant classification and its anatomic location, identification of degenerative spine disorders such as pars fracture and spondylolisthesis, and discrimination of endplate erosion caused by osteomyelitis.
Supplementary Material
Acknowledgments
We thank Dr. Chien-Jen Hsu, associate professor, from Kaohsiung Veterans General Hospital for providing the dataset for external validation.
Footnotes
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
The institution of one or more of the authors (HHC, HHSL, HTHW, MCC, PHC) has received, during the study period, funding from the Ministry of Science and Technology Taiwan (MOST 108-3011-F-075-001).
Each author certifies that neither he nor she, nor any member of his or her immediate family, has funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article.
Ethical approval for this study was obtained from the institutional review board of Taipei Veterans General Hospital (IRB no: 2017-10-008BC).
This work was performed at Taipei Veterans General Hospital, Taipei, Taiwan and National Chiao Tung University, Hsinchu, Taiwan.
Contributor Information
Yi-Chu Li, Email: yvonne111305@gmail.com.
Hung-Hsun Chen, Email: hhchen@nctu.edu.tw.
Henry Horng-Shing Lu, Email: hslu@stat.nctu.edu.tw.
Hung-Ta Hondar Wu, Email: htwu@vghtpe.gov.tw.
Ming-Chau Chang, Email: mcchang@vghtpe.gov.tw.
References
- 1.Berry GE, Adams S, Harris MB, et al. Are plain radiographs of the spine necessary during evaluation after blunt trauma? Accuracy of screening torso computed tomography in thoracic/lumbar spine fracture diagnosis. J Trauma. 2005;59:1410-1413. [DOI] [PubMed] [Google Scholar]
- 2.Binkley N, Krueger D, Gangnon R, Genant HK, Drezner MK. Lateral vertebral assessment: a valuable technique to detect clinically significant vertebral fractures. Osteoporosis Int. 2005;16:1513-1518. [DOI] [PubMed] [Google Scholar]
- 3.Burns JE, Yao J, Summers RM. Vertebral body compression fractures and bone density: automated detection and classification on CT images. Radiology. 2017;284:788-797. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Cheng CT, Ho TY, Lee TY, et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol. 2019;29:5469-5477. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Chung SW, Han SS, Lee JW, et al. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm. Acta Orthop. 2018;89:468-473. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Delmas PD, van de Langerijt L, Watts NB, et al. Underdiagnosis of vertebral fractures is a worldwide problem: the IMPACT study. J Bone Miner Res. 2005;20:557-563. [DOI] [PubMed] [Google Scholar]
- 7.Fink HA, Milavetz DL, Palermo L, et al. What proportion of incident radiographic vertebral deformities is clinically diagnosed and vice versa? J Bone Miner Res . 2005;20:1216-1222. [DOI] [PubMed] [Google Scholar]
- 8.Francis RM, Aspray TJ, Hide G, Sutcliffe AM, Wilkinson P. Back pain in osteoporotic vertebral fractures. Osteoporosis Int. 2008;19:895-903. [DOI] [PubMed] [Google Scholar]
- 9.Frobin W, Brinckmann P, Kramer M, Hartwig E. Height of lumbar discs measured from radiographs compared with degeneration and height classified from MR images. Eur Radiol. 2001;11:263-269. [DOI] [PubMed] [Google Scholar]
- 10.Gan K, Xu D, Lin Y, et al. Artificial intelligence detection of distal radius fractures: a comparison between the convolutional neural network and professional assessments. Acta Orthop. 2019;90:394-400. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Genant HK, Wu CY, van Kuijk C, Nevitt MC. Vertebral fracture assessment using a semiquantitative technique. J Bone Miner Res. 1993;8:1137-1148. [DOI] [PubMed] [Google Scholar]
- 12.Gregson CL, Hardcastle SA, Cooper C, Tobias JH. Friend or foe: high bone mineral density on routine bone density scanning, a review of causes and management. Rheumatology (Oxford). 2013;52:968-985. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Griffith JF, Guglielmi G. Vertebral fracture. Radiol Clin North Am. 2010;48:519-529. [DOI] [PubMed] [Google Scholar]
- 14.Guglielmi G, Palmieri F, Placentino MG, D'Errico F, Stoppino LP. Assessment of osteoporotic vertebral fractures using specialized workflow software for 6-point morphometry. Eur Radiol. 2009;70:142-148. [DOI] [PubMed] [Google Scholar]
- 15.Johnell O, Kanis JA. An estimate of the worldwide prevalence and disability associated with osteoporotic fractures. Osteoporosis Int. 2006;17:1726-1733. [DOI] [PubMed] [Google Scholar]
- 16.Kyere KA, Than KD, Wang AC, et al. Schmorl's nodes. Eur Spine J. 2012;21:2115-2121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Lenchik L, Rogers LF, Delmas PD, Genant HK. Diagnosis of osteoporotic vertebral fractures: importance of recognition and description by radiologists. AJR Am J Roentgenol. 2004;183:949-958. [DOI] [PubMed] [Google Scholar]
- 18.Lentle BC, Brown JP, Khan A, et al. Recognizing and reporting vertebral fractures: reducing the risk of future osteoporotic fractures. Can Assoc Radiol J. 2007;58:27-36. [PubMed] [Google Scholar]
- 19.Olczak J, Fahlberg N, Maki A, et al. Artificial intelligence for analyzing orthopedic trauma radiographs. Acta Orthop. 2017;88:581-586. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Pan SJ, Yang Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering. 2010;22:1345-1359. [Google Scholar]
- 21.Prince MJ, Wu F, Guo Y, et al. The burden of disease in older people and implications for health policy and practice. Lancet. 2015;385:549-562. [DOI] [PubMed] [Google Scholar]
- 22.Puisto V, Heliövaara M, Impivaara O, et al. Severity of vertebral fracture and risk of hip fracture: a nested case-control study. Osteoporosis Int. 2011;22:63-68. [DOI] [PubMed] [Google Scholar]
- 23.Redmon J, Farhadi A. YOLOv3: an incremental improvement. Available at: https://arxiv.org/abs/1804.02767. Accessed July 10, 2020.
- 24.Richards PJ, George J, Metelko M, Brown M. Spine computed tomography doses and cancer induction. Spine. 2010;35:430-433. [DOI] [PubMed] [Google Scholar]
- 25.Ruiz Santiago F, Tomas Munoz P, Moya Sanchez E, Revelles Paniza M, Martinez Martinez A, Perez Abela AL. Classifying thoracolumbar fractures: role of quantitative imaging. Quant Imaging Med Surg. 2016;6:772-784. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Sardar ZM, Ames RJ, Lenke L. Scheuermann's kyphosis: diagnosis, management, and selecting fusion levels. J Am Acad Orthop Surg. 2019;27:e462-e472. [DOI] [PubMed] [Google Scholar]
- 27.Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision. 2017:618-626. [Google Scholar]
- 28.Simpson AK, Whang PG, Jonisch A, Haims A, Grauer JN. The radiation exposure associated with cervical and lumbar spine radiographs. J Spinal Disord Tech. 2008;21:409-412. [DOI] [PubMed] [Google Scholar]
- 29.Tozakidou M, Reisinger C, Harder D, et al. Systematic radiation dose reduction in cervical spine CT of human cadaveric specimens: how low can we go? Am J Neuroradiol. 2018;39:385-391. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.VandenBerg J, Cullison K, Fowler SA, Parsons MS, McAndrew CM, Carpenter CR. Blunt thoracolumbar-spine trauma evaluation in the emergency department: a meta-analysis of diagnostic accuracy for history, physical examination, and imaging. J Emerg Med. 2019;56:153-165. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Vokes TJ, Dixon LB, Favus MJ. Clinical utility of dual-energy vertebral assessment (DVA). Osteoporosis Int. 2003;14:871-878. [DOI] [PubMed] [Google Scholar]
- 32.Wong CC, McGirt MJ. Vertebral compression fractures: a review of current management and multimodal therapy. J Multidiscip Healthc. 2013;6:205-214. [DOI] [PMC free article] [PubMed] [Google Scholar]