Abstract
Aiming at the difficulties of lumbar vertebrae segmentation in computed tomography (CT) images, we propose an automatic lumbar vertebrae segmentation method based on deep learning. The method consists of two parts: lumbar vertebra positioning and lumbar vertebrae segmentation. First, we propose a UNet-based lumbar spine localization network, which directly locates the lumbar spine region in the image. Then, we propose a three-dimensional XUnet lumbar vertebrae segmentation method to achieve automatic lumbar vertebrae segmentation. The proposed method was validated on lumbar spine CT images from the public VerSe 2020 dataset and from our hospital dataset. Qualitative comparison and quantitative analysis show that the proposed method achieves good lumbar vertebrae segmentation performance and can be further applied to the detection of spinal anomalies and to surgical treatment.
Keywords: CT images, lumbar vertebrae segmentation, spinal anomalies, surgical treatment
1. INTRODUCTION
Over the past few decades, various medical imaging modalities, including x-rays, computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET), have been widely used to examine spinal structural abnormalities. 1 , 2 CT imaging is widely used to observe the internal structures of the human body and to quantify the volume of various tissues, owing to its fast scanning speed and clear images. 3 Accurate lumbar vertebrae segmentation is an important step in subsequent analysis and treatment, including the examination of vertebral fractures, biomechanical modeling, and anterior or posterior lumbar inter-body fusion. 4 , 5 In particular, these analyses and treatments place high demands on the accuracy of lumbar vertebrae segmentation. Before the maturity of artificial intelligence technology, the segmentation of vertebral bodies was mostly performed manually by experienced radiologists. However, manual vertebra delineation is labor-intensive and time-consuming, and the delineation standards of different radiologists are inconsistent, resulting in large differences in the segmentation results. Therefore, an automatic and efficient vertebra segmentation method is of great significance for subsequent morphological analysis and clinical treatment.
The essence of automatic vertebral body segmentation from CT images is to segment the vertebrae in CT images so as to provide a reliable basis for clinical diagnosis and treatment research; it can be understood as a pixel-level classification task. Traditional image segmentation methods, such as threshold-based and region-based segmentation, can be applied to medical image segmentation. However, because of the influence of imaging equipment, imaging principles, and other factors, the content and shape of medical images are relatively complex, and these traditional methods still face great challenges in terms of segmentation accuracy.
Recently, significant progress has been made in image segmentation based on deep learning, mainly with convolutional neural networks. There are many segmentation architectures with good performance, such as fully convolutional networks (FCN), 6 SegNet, 7 U-Net, 8 and 3D UNet. 9 The networks used in many image segmentation studies are based on the above networks or their corresponding 3D versions. The U-Net 8 model adopts a symmetrical U-shaped structure and achieves good results in many medical segmentation tasks, but its skip connections force the encoder and decoder to perform feature fusion only at the same depth; neglecting the fusion of semantic information at different scales results in the loss of certain details. UNet++ 10 builds on the structure of the UNet model: it alleviates the problem of unknown network depth with an efficient ensemble of U-Nets of different depths, redesigns the skip connections to aggregate features of different semantic scales in the decoder sub-networks, and devises a pruning scheme to accelerate the inference speed of UNet++. 10 UNet3+ 11 utilizes full-scale skip connections and deep supervision. DeepLab V1 12 uses VGG16 13 as its backbone and applies atrous convolution and conditional random fields to improve classification accuracy. DeepLab V2 14 uses ResNet-101 15 as its backbone, applies atrous convolution more flexibly, and proposes atrous spatial pyramid pooling. DeepLab V3 16 abandons conditional random fields, improves atrous spatial pyramid pooling, and uses atrous convolution to deepen the network.
Aiming at the difficulties of lumbar vertebrae segmentation in CT images, we propose a two-stage automatic lumbar vertebrae segmentation method based on deep learning. The proposed method consists of two stages: lumbar vertebra positioning and lumbar vertebrae segmentation. First, we propose a UNet-based lumbar spine localization network, which directly locates the lumbar spine region in the image. Then, we propose a three-dimensional XUnet lumbar vertebrae segmentation method to achieve automatic lumbar vertebrae segmentation. Different from existing methods, a low-parameter Inception block is introduced in the encoder to capture rich feature information, and multi-scale features are fused in the decoder to take the surrounding context into account and enable sound segmentation performance. The proposed method was validated on lumbar spine CT images from the public VerSe 2020 dataset 17 and from our hospital dataset. Qualitative comparison and quantitative analysis show that the proposed method achieves good lumbar vertebrae segmentation performance and can be further applied to the detection of spinal anomalies and to surgical treatment.
2. RELATED WORKS
In recent years, various machine learning algorithms have been widely used in the field of image segmentation. Jimenez-Pastor et al. 18 introduced a two-stage decision forest combined with morphological image processing to automatically detect and identify vertebral bodies in CT scans with arbitrary fields of view. Korez et al. 19 combined a convolutional neural network with a deformable model: the network learns spine features and outputs a spine probability map, which guides the deformable model to generate the spine boundary and thus achieve spine segmentation.
Benefiting from improvements in computational performance and the growth of datasets available for training, deep learning has been widely applied to image segmentation. Model-based spine segmentation methods are gradually being replaced by data-driven deep learning methods as public spine image datasets become more comprehensive. Sekuboyina et al. 20 used two convolutional neural networks for spine localization and segmentation, respectively; the localization network takes two-dimensional slices of the spine image as input, its ground truth is obtained by downsampling the real spine mask, and the segmentation network uses patches to segment the vertebrae one by one. Similarly, Janssens et al. 21 used two three-dimensional FCN 6 networks for spine localization and segmentation: the localization network regresses the bounding box of the spine, and the segmentation network directly performs voxel-level multi-class classification. Lessmann et al. 22 used only one three-dimensional FCN 6 for spine segmentation; they segment the vertebrae one by one in an iterative manner according to prior rules on the appearance of vertebrae, determine whether the vertebra in the current patch has already been segmented by adding a memory component, and include a classification component dedicated to labeling vertebrae that have been segmented. However, such patch-based spine segmentation methods have difficulty handling the overlap between adjacent vertebrae. Payer et al. 23 proposed a multi-stage method for spine segmentation: they first use 3D U-Net to roughly localize the spine, then use SpatialConfiguration-Net 24 to perform heat-map regression for vertebra localization and identification, and finally use 3D U-Net to perform binary segmentation of the identified vertebrae. These deep learning-based spine segmentation studies have achieved good segmentation results.
3. METHOD
3.1. Method overview
Aiming at the difficulties of lumbar vertebrae segmentation in CT images, we propose a novel two-stage lumbar vertebrae segmentation method based on deep learning. The method consists of two stages: lumbar vertebra positioning and lumbar vertebrae segmentation. First, we propose a UNet-based lumbar spine localization network, which directly locates the lumbar spine region in the image and crops out the sub-image containing the lumbar spine, so as to improve the performance of the subsequent segmentation. Then, we propose a three-dimensional XUnet lumbar vertebrae segmentation method with an encoder-decoder framework to achieve automatic lumbar vertebrae segmentation. In the encoder, a low-parameter Inception block is introduced to capture rich feature information; in the decoder, multi-scale features are fused to take the surrounding context into account and enable sound segmentation performance.
3.2. Spine positioning method and image cropping
3.2.1. UNet—based localization network model
In order to obtain the accurate positioning of the lumbar spine, this paper proposes a positioning network model based on UNet, as shown in Figure 1. The network model structure is as follows:
FIGURE 1. UNet-based localization network.
Suppose the input is $X_0$, and the encoder features $X_i$, $i \in [1,4]$, together with the bottleneck feature $X_5$ are obtained after feature extraction by the encoder; then the decoding feature can be expressed as:

$$Y_j = \sigma\left(W_{3\times 3\times 3} \ast \mathrm{Cat}\left(X_{5-j}[16{:}48],\ \mathrm{Up}(Y_{j-1})\right)\right),\quad j = 1,\ldots,4,\quad Y_0 = X_5 \tag{1}$$

where $\mathrm{Up}(\cdot)$ is upsampling, $\mathrm{Cat}(\cdot)$ is concatenation along the channel direction, $\sigma$ is the activation function, and $W_{3\times 3\times 3}\ast$ means that the convolution operation is performed with a convolution kernel of size 3*3*3.
According to the method described in Equation (1), multi‐scale feature fusion is realized in the decoding process. The specific flow of data in the network is as follows:
The first layer of the encoder: the input scale is 96*96*128*1; the feature map X 1 is obtained after two convolutions for feature extraction, and a feature map with 64 channels and a size of 48*48*64 is obtained after one downsampling;
The second layer of the encoder: the input is the feature map with a scale of 48*48*64; the feature map X 2 is obtained after two convolutions for feature extraction, and a feature map with 64 channels and a size of 24*24*32 is obtained after one downsampling;
The third layer of the encoder: the input is the feature map with a scale of 24*24*32; the feature map X 3 is obtained after two convolutions for feature extraction, and a feature map with 64 channels and a size of 12*12*16 is obtained after one downsampling;
The fourth layer of the encoder: the input is the feature map with a scale of 12*12*16; the feature map X 4 is obtained after two convolutions for feature extraction, and a feature map with 64 channels and a size of 6*6*8 is obtained after one downsampling;
The fifth layer of the encoder: the input is the feature map with a scale of 6*6*8, and the feature map X 5 with 64 channels and a size of 6*6*8 is obtained after two convolutions for feature extraction;
The first layer of the decoder: the multi-scale feature X 4[16:48] and the deconvolved feature map X 5 are spliced along the channel direction to obtain the feature Y 1, and then a feature map with 64 channels and a size of 12*12*16 is obtained by performing feature fusion through two convolutions;
The second layer of the decoder: the multi-scale feature X 3[16:48] and the upsampled feature map are spliced along the channel direction to obtain the feature Y 2, and then a feature map with 64 channels and a size of 24*24*32 is obtained by performing feature fusion through two convolutions;
The third layer of the decoder: the multi-scale feature X 2[16:48] and the upsampled feature map are spliced along the channel direction to obtain the feature Y 3, and then a feature map with 64 channels and a size of 48*48*64 is obtained by performing feature fusion through two convolutions; and
The fourth layer of the decoder: the multi-scale feature X 1[16:48] and the upsampled feature map are spliced along the channel direction to obtain the feature map Y 4, and then a feature map with 64 channels and a size of 96*96*128 is obtained by performing feature fusion through two convolutions. Finally, a convolution kernel of size 1*1*1 is used to generate the localization heat map.
Through the method proposed in this paper, the precise positioning of the lumbar vertebrae can be obtained, which assists the subsequent accurate segmentation of the lumbar vertebrae.
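For illustration, the following is a minimal Keras sketch of the localization network under the sizes stated above (input 96*96*128*1, 64 channels per level, channel-slice skip connections X i[16:48], and a final 1*1*1 convolution); the activation functions, padding, transposed-convolution upsampling, and function names are assumptions for illustration rather than the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model


def conv_block(x, filters=64):
    """Two 3*3*3 convolutions used at every encoder/decoder level."""
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv3D(filters, 3, padding="same", activation="relu")(x)


def build_localization_unet(input_shape=(96, 96, 128, 1)):
    inputs = layers.Input(shape=input_shape)

    # Encoder: four levels with downsampling, 64 channels each (X1 ... X4).
    skips = []
    x = inputs
    for _ in range(4):
        x = conv_block(x)
        skips.append(x)
        x = layers.MaxPooling3D(pool_size=2)(x)
    x = conv_block(x)  # bottleneck X5 at 6*6*8

    # Decoder: splice the channel slice X_i[16:48] with the upsampled feature (Y1 ... Y4).
    for skip in reversed(skips):
        x = layers.Conv3DTranspose(64, 2, strides=2, padding="same")(x)
        sliced = layers.Lambda(lambda t: t[..., 16:48])(skip)
        x = layers.Concatenate(axis=-1)([sliced, x])
        x = conv_block(x)

    # Final 1*1*1 convolution produces the localization heat map (sigmoid is an assumption).
    heatmap = layers.Conv3D(1, 1, activation="sigmoid")(x)
    return Model(inputs, heatmap)
```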
3.2.2. Image cropping
After the localization of the lumbar vertebrae is completed, a bounding box containing the lumbar vertebrae can be obtained and the image containing the lumbar vertebrae can be cropped. For each dimension, the bounding box is determined by the first and last indices at which the HU value exceeds a certain threshold; this pair of indices constitutes the range of the bounding box in that dimension. After the three-dimensional range is obtained, the image can be cropped directly. However, the model prediction may contain several separate bright regions; in that case the range obtained by the above rule becomes too large and the cropped image contains many black areas. Therefore, it is also necessary to keep only the largest connected bright region. In addition, to ensure that the cropped image completely contains the lumbar spine, the computed bounding box needs to be expanded, for example by 0.1 times the corresponding image size in each dimension. The image is then cropped according to the resulting bounding box.
This scheme uses four-neighborhood connected-component analysis to obtain the candidate lumbar vertebra region, selects the voxels whose HU value is greater than 0.3 times the maximum HU value of the image as possible lumbar vertebrae, and finally expands the obtained bounding box by 0.1 times the corresponding image size in each dimension. Figure 2 shows the cropping results of the lumbar vertebrae image for two samples, where the lumbar vertebrae region is outlined by the red rectangle. It can be seen that a lot of redundant information is removed after cropping.
FIGURE 2. Cropping result of the lumbar vertebrae image for two samples, where the lumbar vertebrae region is outlined by the red rectangle.
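As a sketch of this cropping step, the snippet below thresholds the localization output, keeps the largest connected bright region, computes the tight bounding box, and expands it by 10% of the image size; the threshold factor, expansion ratio, and the function name crop_lumbar_region follow the description above but are illustrative assumptions, not the exact implementation.

```python
import numpy as np
from scipy import ndimage


def crop_lumbar_region(volume, heatmap, thr_ratio=0.3, expand_ratio=0.1):
    """Crop the sub-volume around the lumbar spine predicted by the localization network.

    volume  : 3D CT volume
    heatmap : 3D localization output, assumed to have the same shape as `volume`
    """
    # Threshold at a fraction of the maximum response to get candidate voxels.
    mask = heatmap > thr_ratio * heatmap.max()

    # Keep only the largest connected bright region to avoid spurious detections.
    labels, num = ndimage.label(mask)
    if num > 1:
        sizes = ndimage.sum(mask, labels, range(1, num + 1))
        mask = labels == (np.argmax(sizes) + 1)

    # Tight bounding box: first/last index above threshold in each dimension.
    coords = np.argwhere(mask)
    lo, hi = coords.min(axis=0), coords.max(axis=0) + 1

    # Expand the box by a fraction of the image size, clipped to the volume.
    margin = (expand_ratio * np.array(volume.shape)).astype(int)
    lo = np.maximum(lo - margin, 0)
    hi = np.minimum(hi + margin, volume.shape)

    slices = tuple(slice(int(l), int(h)) for l, h in zip(lo, hi))
    return volume[slices], slices
```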
3.3. A lumbar spine segmentation model based on 3D multi‐scale XUNet
In order to accurately segment the lumbar spine, we propose a 3D multi-scale XUNet-based spine segmentation method (Figure 3). XUnet uses the Inception block for feature extraction. The Inception block includes two feature extraction layers: the first layer includes three convolution blocks (denoted as convblock1-1, convblock1-2, and convblock1-3) and a max pooling, where convblock1-1 consists of 64 convolution kernels of size 1*1*1, convblock1-2 consists of 32 convolution kernels of size 1*1*1, and convblock1-3 consists of 16 convolution kernels of size 3*3*3. The second layer includes three convolution blocks (denoted as convblock2-1, convblock2-2, and convblock2-3), where convblock2-1 consists of 128 convolution kernels of size 3*3*3, convblock2-2 consists of 32 convolution kernels of size 5*5*5, and convblock2-3 consists of 32 convolution kernels of size 1*1*1. The feature extraction results from convblock1-1, convblock2-1, convblock2-2, and convblock2-3 are concatenated and fused as the final output of the Inception block. This enables the aggregation of features at different semantic scales. At the same time, the Inception block is used to replace the plain convolution operation, which reduces redundant parameters in the network while increasing the width and depth of the network and improving its expressive ability.
FIGURE 3. Schematic diagram of 3D multi-scale XUNet network.
The specific structure of the network is as follows:
Suppose the input is $X_0$, and the encoder features $X_i$, $i \in [1,4]$, are obtained after feature extraction by the encoder; then the decoding feature can be expressed as:

$$Y_j = \sigma\left(W_{3\times 3\times 3} \ast \mathrm{Cat}\left(\mathrm{Down}(X_{m-1}),\ X_m,\ \mathrm{Up}(Y_{j-1})\right)\right),\quad Y_0 = X_4 \tag{2}$$

where $X_{m-1}$ and $X_m$ denote the encoder features at the two scales adjacent to decoder layer $j$, $\mathrm{Down}(\cdot)$ is downsampling, $\mathrm{Up}(\cdot)$ is upsampling, $\mathrm{Cat}(\cdot)$ is concatenation along the channel direction, $\sigma$ is an activation function, and $W_{3\times 3\times 3}\ast$ means that the convolution operation is performed with a convolution kernel of size 3*3*3.
Multi‐scale feature fusion is realized in the decoding process, and the specific flow of data in the network is as follows:
The first layer of the encoder: the input scale is 64*64*128*1; the feature map X 1 is obtained after two convolutions for feature extraction, and a feature map with 32 channels and a size of 32*32*64 is obtained after one downsampling;
The second layer of the encoder: the input is the feature map with a scale of 32*32*64; the feature map X 2 is obtained after two convolutions for feature extraction, and a feature map with 64 channels and a size of 16*16*32 is obtained after one downsampling;
The third layer of the encoder: the input is the feature map with a scale of 16*16*32; the feature map X 3 is obtained after two convolutions for feature extraction, and a feature map with 128 channels and a size of 8*8*16 is obtained after one downsampling;
The fourth layer of the encoder: the input is the feature map with a scale of 8*8*16; after one convolution and one Inception block for feature extraction, the feature map X 4 with 256 channels and a size of 4*4*8 is obtained;
The first layer of the decoder: the multi-scale feature maps X 2 and X 3 and the deconvolved feature map X 4 are spliced along the channel direction to obtain the feature map Y 1, and then a feature map with 128 channels and a size of 16*16*32 is obtained by performing feature fusion through two convolutions;
The second layer of the decoder: the multi-scale feature maps and the upsampled feature map are spliced along the channel direction to obtain the feature map Y 2, and then a feature map with 64 channels and a size of 32*32*64 is obtained by performing feature fusion through two convolutions; and
The third layer of the decoder: the feature maps X 1 and X 2 and the upsampled multi-scale feature map are spliced along the channel direction to obtain the feature map Y 3, and then a feature map with 32 channels and a size of 64*64*128 is obtained by performing feature fusion through two convolutions. Finally, a Softmax function is applied to complete the segmentation and output the segmentation result.
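As an illustration of the multi-scale fusion in one decoder layer, the sketch below resizes a higher-resolution encoder feature down to the decoder resolution, upsamples the deeper feature, splices the three inputs along the channel direction, and fuses them with two 3*3*3 convolutions; the pooling/upsampling operators, activation, and function name are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers


def decoder_fusion(skip_shallow, skip_same, deeper, out_channels):
    """One multi-scale decoder layer of the segmentation network (sketch)."""
    # Bring the shallower (higher-resolution) encoder feature down to the decoder resolution.
    down = layers.AveragePooling3D(pool_size=2)(skip_shallow)
    # Bring the deeper (lower-resolution) feature up to the decoder resolution.
    up = layers.UpSampling3D(size=2)(deeper)
    # Splice along the channel direction and fuse with two 3*3*3 convolutions.
    y = layers.Concatenate(axis=-1)([down, skip_same, up])
    y = layers.Conv3D(out_channels, 3, padding="same", activation="relu")(y)
    y = layers.Conv3D(out_channels, 3, padding="same", activation="relu")(y)
    return y
```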
Considering that too many parameters of the model will bring unnecessary computing overhead, the Inception block is used to replace the original convolution, which reduces the computing overhead while increasing the depth and width of the network, and improves the expressiveness of the model. The specific steps are as follows:
The Inception block includes two feature extraction layers. The first layer includes three convolution blocks (denoted as convblock1-1, convblock1-2, and convblock1-3) and a max pooling, where convblock1-1 consists of 64 convolution kernels of size 1*1*1, convblock1-2 consists of 32 convolution kernels of size 1*1*1, and convblock1-3 consists of 16 convolution kernels of size 3*3*3. The second layer includes three convolution blocks (denoted as convblock2-1, convblock2-2, and convblock2-3), where convblock2-1 consists of 128 convolution kernels of size 3*3*3, convblock2-2 consists of 32 convolution kernels of size 5*5*5, and convblock2-3 consists of 32 convolution kernels of size 1*1*1. The feature extraction results from convblock1-1, convblock2-1, convblock2-2, and convblock2-3 are concatenated and fused as the final output of the Inception block.
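A minimal Keras sketch of such an Inception-style block is given below; it follows the kernel counts and the four concatenated outputs described above (64 + 128 + 32 + 32 = 256 channels, matching the 256-channel feature map X 4), while the exact branch wiring, pooling configuration, and activations are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers


def inception_block_3d(x):
    """Inception-style block with two feature-extraction layers (sketch)."""
    # Branch A: convblock1-1, 64 kernels of size 1*1*1, passed directly to the output.
    a = layers.Conv3D(64, 1, padding="same", activation="relu")(x)

    # Branch B: convblock1-2 (32 @ 1*1*1) followed by convblock2-1 (128 @ 3*3*3).
    b = layers.Conv3D(32, 1, padding="same", activation="relu")(x)
    b = layers.Conv3D(128, 3, padding="same", activation="relu")(b)

    # Branch C: convblock1-3 (16 @ 3*3*3) followed by convblock2-2 (32 @ 5*5*5).
    c = layers.Conv3D(16, 3, padding="same", activation="relu")(x)
    c = layers.Conv3D(32, 5, padding="same", activation="relu")(c)

    # Branch D: max pooling followed by convblock2-3 (32 @ 1*1*1).
    d = layers.MaxPooling3D(pool_size=3, strides=1, padding="same")(x)
    d = layers.Conv3D(32, 1, padding="same", activation="relu")(d)

    # Concatenate and fuse the four branch outputs as the final output of the block.
    return layers.Concatenate(axis=-1)([a, b, c, d])
```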
4. RESULTS
4.1. Experimental data
The VerSe 2020 public dataset and our own hospital dataset are used in this paper. Our study focuses specifically on the segmentation of the lumbar vertebrae, so we selected the CT scans that cover the lumbar vertebrae. For the VerSe 2020 dataset, after careful selection, we obtained a total of 156 CT samples containing the lumbar vertebrae; the experiments on this dataset were performed with five-fold cross-validation. For our own hospital dataset, there are 500 CT samples in total. All datasets are provided with ground truth segmentations from radiologists. A total of 400 samples are used for training and the other 100 samples for testing. Figure 4 shows three different sections (horizontal, sagittal, and coronal) of one 3D CT image from the dataset.
FIGURE 4. Example of computed tomography (CT) image of lumbar spine. (a) horizontal plane; (b) sagittal plane; (c) coronal plane.
Data preprocessing mainly includes generating spine heatmaps and windowing the images. Figure 5 shows an example of the heatmaps, and Figure 6 shows an image before and after windowing. The lumbar spine heatmap is obtained by Gaussian smoothing of the lumbar spine mask; it is used as the target of the subsequent localization network and roughly represents the location of the lumbar spine. According to the literature, a window level of 400 and a window width of 1800 are appropriate for lumbar skeletal tissue and better present the lumbar vertebrae to the network.
FIGURE 5. Mask and heat map of lumbar vertebrae for an example.
FIGURE 6. An example image before and after windowing.
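The two preprocessing steps can be sketched as follows; the window level of 400 and width of 1800 follow the values above, while the Gaussian sigma, the normalization, and the function names are illustrative assumptions rather than the original implementation.

```python
import numpy as np
from scipy import ndimage


def apply_window(volume_hu, level=400.0, width=1800.0):
    """Clip a CT volume to the bone window and rescale it to [0, 1]."""
    low, high = level - width / 2.0, level + width / 2.0
    windowed = np.clip(volume_hu, low, high)
    return (windowed - low) / (high - low)


def make_heatmap(lumbar_mask, sigma=3.0):
    """Turn a binary lumbar-spine mask into a smooth localization heatmap."""
    heatmap = ndimage.gaussian_filter(lumbar_mask.astype(np.float32), sigma=sigma)
    return heatmap / (heatmap.max() + 1e-8)  # normalize so the peak is close to 1
```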
4.2. Experimental environment
This work is implemented in the Python programming language, using the TensorFlow deep learning framework together with image processing libraries. The code was developed under the Windows system with PyCharm as the integrated development environment, and the convolutional neural network part uses the popular deep learning framework TensorFlow. The software environment and hardware used in this paper are as follows.
Software environment: Windows 10, PyCharm 2020.3 (64-bit), TensorFlow 1.16. Hardware: Nvidia 1080Ti GPU with 11 GB of video memory, 64 GB of RAM, and a 2 TB hard disk.
4.3. Evaluation metrics
The performance is evaluated using several commonly utilized evaluation metrics, including dice similarity coefficient (DSC), false negative dice (FND), false positive dice (FPD), maximum symmetric surface distance (MSD), and average symmetric surface distance (ASD). DSC, FND, and FPD are defined by the following equations:
$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FN + FP} \tag{3}$$

$$\mathrm{FND} = \frac{2\,FN}{2\,TP + FN + FP} \tag{4}$$

$$\mathrm{FPD} = \frac{2\,FP}{2\,TP + FN + FP} \tag{5}$$
where TP, FN, and FP represent the numbers of true positive, false negative, and false positive voxels, respectively. DSC captures the relative overlap between the segmentation volume and the ground truth volume. FND and FPD describe the degree of under-segmentation and over-segmentation, respectively. DSC, FND, and FPD values range from 0 to 1.
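For reference, a small sketch of how DSC, FND, and FPD can be computed from binary masks is shown below, following the definitions in Equations (3)–(5); the function name and mask convention are illustrative.

```python
import numpy as np


def overlap_metrics(pred, gt):
    """Compute DSC, FND, and FPD from binary segmentation and ground-truth masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # true positive voxels
    fn = np.logical_and(~pred, gt).sum()   # false negative voxels (under-segmentation)
    fp = np.logical_and(pred, ~gt).sum()   # false positive voxels (over-segmentation)
    denom = 2 * tp + fn + fp
    return 2 * tp / denom, 2 * fn / denom, 2 * fp / denom
```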
The metrics MSD and ASD quantify the distance between the segmentation contour surface and the ground truth contour surface. Let $S(A)$ and $S(B)$ denote the surface voxels of the segmentation result $A$ and the ground truth $B$, and let $d(a, S(B))$ denote the contour distance between a point of the segmentation surface and the ground truth surface. MSD is defined by the following equation:

$$\mathrm{MSD} = \max\left\{\max_{a \in S(A)} d\big(a, S(B)\big),\ \max_{b \in S(B)} d\big(b, S(A)\big)\right\} \tag{6}$$

ASD is defined by the following equations:

$$\mathrm{ASD} = \frac{1}{|S(A)| + |S(B)|}\left(\sum_{a \in S(A)} d\big(a, S(B)\big) + \sum_{b \in S(B)} d\big(b, S(A)\big)\right) \tag{7}$$

$$d\big(a, S(B)\big) = \min_{b \in S(B)} \lVert a - b \rVert \tag{8}$$
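A corresponding sketch of the surface-distance metrics, using binary erosion to extract surface voxels and Euclidean distance transforms, is given below; the voxel-spacing handling and the function name are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage


def surface_distances(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric surface distances (MSD and ASD) between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    # Surface voxels: foreground voxels that disappear under binary erosion.
    pred_surf = pred & ~ndimage.binary_erosion(pred)
    gt_surf = gt & ~ndimage.binary_erosion(gt)
    # Distance maps to each surface, taking the voxel spacing into account.
    dist_to_gt = ndimage.distance_transform_edt(~gt_surf, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~pred_surf, sampling=spacing)
    d_pred = dist_to_gt[pred_surf]  # prediction surface -> ground-truth surface
    d_gt = dist_to_pred[gt_surf]    # ground-truth surface -> prediction surface
    msd = max(d_pred.max(), d_gt.max())
    asd = (d_pred.sum() + d_gt.sum()) / (len(d_pred) + len(d_gt))
    return msd, asd
```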
4.4. Experimental results
To verify the performance of our method, the proposed method is compared with the classic UNet method, the iterative fully convolutional network (IFCN) method, 22 and the nnUnet method. 25 These methods were experimented on the same dataset as our method, with the same preprocessing. Figures 7, 8, and 9 show different 2D views of the 3D segmentation results. From left to right are (a) original image, (b) ground truth, (c) Unet, (d) IFCN, (e) nnUnet, and (f) proposed method. The five lumbar vertebrae are marked with different colors to distinguish them. As can be seen from the figures, the method proposed in this paper segments the lumbar vertebrae well and outperforms the other classic methods, while achieving performance comparable to the promising nnUnet approach.
FIGURE 7. Horizontal visualization of lumbar vertebrae segmentation results of (a) original image, (b) ground truth, (c) Unet, (d) IFCN, (e) nnUnet, and (f) proposed method.
FIGURE 8. Sagittal visualization of lumbar vertebrae segmentation results of (a) original image, (b) ground truth, (c) Unet, (d) IFCN, (e) nnUnet, and (f) proposed method.
FIGURE 9. Coronal visualization of lumbar vertebrae segmentation results of (a) original image, (b) ground truth, (c) Unet, (d) IFCN, (e) nnUnet, and (f) proposed method.
This paper further visualizes the surface of the 3D vertebrae. Figure 10 shows the visualization results of the 3D surfaces of the segmented vertebrae. From left to right are (a) ground truth, (b) UNet segmentation result, (c) IFCN segmentation result, (d) nnUnet segmentation result, and (e) the segmentation result of the proposed method. The 3D visualization results further verify the superiority of the method and are consistent with the 2D results.
FIGURE 10. 3D visualization of the lumbar vertebrae of (a) ground truth, (b) Unet result, (c) IFCN, (d) nnUnet, and (e) proposed method.
We further verified the performance of the proposed method through quantitative analysis on the VerSe dataset and our hospital dataset. The quantitative results of the different methods are shown in Tables 1 and 2. The method proposed in our study achieves better segmentation results, and the quantitative analysis is consistent with the visual evaluation. The performance on the hospital dataset is better than on the VerSe dataset, mainly because the hospital dataset provides more training data.
TABLE 1.
Quantitative evaluation of different methods on the VerSe 2020 dataset.
| Lumbar segments | Methods | DSC | FND | FPD | MSD | ASD |
|---|---|---|---|---|---|---|
| Lumbar segment 1 | UNet | 0.8634 | 0.1943 | 0.0788 | 15.73 | 2.353 |
| | IFCN | 0.8814 | 0.1112 | 0.1248 | 7.636 | 1.890 |
| | nnUNet | 0.9067 | 0.1060 | 0.0716 | 6.589 | 1.872 |
| | Proposed | 0.9125 | 0.0974 | 0.0688 | 6.321 | 1.719 |
| Lumbar segment 2 | UNet | 0.8173 | 0.2768 | 0.0885 | 19.14 | 2.380 |
| | IFCN | 0.8826 | 0.1125 | 0.1239 | 15.74 | 2.002 |
| | nnUNet | 0.9090 | 0.0998 | 0.0734 | 7.586 | 1.914 |
| | Proposed | 0.9088 | 0.0968 | 0.0856 | 6.570 | 1.901 |
| Lumbar segment 3 | UNet | 0.7855 | 0.3507 | 0.0782 | 20.81 | 2.134 |
| | IFCN | 0.8834 | 0.1104 | 0.1228 | 11.33 | 1.938 |
| | nnUNet | 0.8930 | 0.0897 | 0.0778 | 10.96 | 1.656 |
| | Proposed | 0.9107 | 0.1072 | 0.0714 | 6.566 | 1.169 |
| Lumbar segment 4 | UNet | 0.8267 | 0.3277 | 0.3277 | 25.46 | 3.675 |
| | IFCN | 0.8802 | 0.1159 | 0.0953 | 17.18 | 2.569 |
| | nnUNet | 0.8813 | 0.0953 | 0.1214 | 10.11 | 1.803 |
| | Proposed | 0.9011 | 0.0975 | 0.0975 | 10.85 | 1.775 |
| Lumbar segment 5 | UNet | 0.8320 | 0.3262 | 0.1235 | 19.10 | 4.391 |
| | IFCN | 0.8481 | 0.1455 | 0.0970 | 12.07 | 2.352 |
| | nnUNet | 0.8790 | 0.1379 | 0.0721 | 9.286 | 0.9870 |
| | Proposed | 0.8878 | 0.1166 | 0.0788 | 8.762 | 0.9991 |
Abbreviations: ASD, average symmetric surface distance; DSC, dice similarity coefficient; FND, false negative dice; FPD, false positive dice; IFCN, iterative fully convolutional neural; MSD, maximum symmetric surface distance.
TABLE 2.
Quantitative evaluation of different methods on the hospital dataset.
| Lumbar segments | Methods | DSC | FND | FPD | MSD | ASD |
|---|---|---|---|---|---|---|
| L1 | UNet | 0.9470 | 0.0394 | 0.0435 | 6.681 | 0.3950 |
| | IFCN | 0.9612 | 0.0310 | 0.0289 | 5.087 | 0.2802 |
| | nnUNet | 0.9717 | 0.0298 | 0.0266 | 2.828 | 0.1881 |
| | Proposed | 0.9805 | 0.0206 | 0.0182 | 2.349 | 0.1683 |
| L2 | UNet | 0.9511 | 0.0322 | 0.0313 | 5.318 | 0.2886 |
| | IFCN | 0.9661 | 0.0284 | 0.0252 | 3.731 | 0.2404 |
| | nnUNet | 0.9752 | 0.0255 | 0.0230 | 3.162 | 0.1757 |
| | Proposed | 0.9821 | 0.0200 | 0.0157 | 2.625 | 0.1516 |
| L3 | UNet | 0.9512 | 0.0380 | 0.0324 | 5.385 | 0.2815 |
| | IFCN | 0.9691 | 0.0276 | 0.0257 | 3.599 | 0.2304 |
| | nnUNet | 0.9721 | 0.0241 | 0.0214 | 2.671 | 0.1865 |
| | Proposed | 0.9802 | 0.0204 | 0.0190 | 2.336 | 0.1649 |
| L4 | UNet | 0.9499 | 0.0392 | 0.0372 | 6.403 | 0.2937 |
| | IFCN | 0.9668 | 0.0302 | 0.0297 | 3.605 | 0.2187 |
| | nnUNet | 0.9736 | 0.0262 | 0.0265 | 2.975 | 0.1613 |
| | Proposed | 0.9775 | 0.0229 | 0.0237 | 2.449 | 0.1584 |
| L5 | UNet | 0.9385 | 0.0792 | 0.0462 | 6.881 | 0.3814 |
| | IFCN | 0.9577 | 0.0689 | 0.0256 | 5.174 | 0.2649 |
| | nnUNet | 0.9730 | 0.0578 | 0.0159 | 3.312 | 0.1979 |
| | Proposed | 0.9700 | 0.0592 | 0.0196 | 3.793 | 0.2260 |
Abbreviations: ASD, average symmetric surface distance; DSC, dice similarity coefficient; FND, false negative dice; FPD, false positive dice; IFCN, iterative fully convolutional neural; MSD, maximum symmetric surface distance.
The positioning part and the Inception part are further evaluated with ablation experiments. Table 3 shows the quantitative results of the ablation experiments. Removing either part slightly decreases the performance of the proposed method, which further verifies the contribution of the developed modules.
TABLE 3.
Quantitative evaluation of the ablation experiments on the VerSe 2020 dataset.
| Lumbar segments | Ablation part | DSC | FND | FPD | MSD | ASD |
|---|---|---|---|---|---|---|
| Lumbar segment 1 | w/o position | 0.8900 | 0.1072 | 0.0826 | 11.072 | 3.161 |
| | w/o inception | 0.8947 | 0.1063 | 0.0942 | 9.282 | 4.738 |
| | Proposed | 0.9125 | 0.0974 | 0.0688 | 6.321 | 1.719 |
| Lumbar segment 2 | w/o position | 0.8955 | 0.1068 | 0.1021 | 8.512 | 2.835 |
| | w/o inception | 0.8843 | 0.1217 | 0.1095 | 9.018 | 3.544 |
| | Proposed | 0.9088 | 0.0968 | 0.0856 | 6.570 | 1.901 |
| Lumbar segment 3 | w/o position | 0.8924 | 0.1303 | 0.0847 | 8.060 | 1.810 |
| | w/o inception | 0.8893 | 0.1322 | 0.0891 | 10.11 | 2.658 |
| | Proposed | 0.9107 | 0.1072 | 0.0714 | 6.566 | 1.169 |
| Lumbar segment 4 | w/o position | 0.8826 | 0.1080 | 0.1080 | 15.50 | 2.451 |
| | w/o inception | 0.8716 | 0.1215 | 0.1215 | 18.81 | 3.597 |
| | Proposed | 0.9011 | 0.0975 | 0.0975 | 10.85 | 1.775 |
| Lumbar segment 5 | w/o position | 0.8695 | 0.1340 | 0.0970 | 13.94 | 3.172 |
| | w/o inception | 0.8760 | 0.1377 | 0.1003 | 11.99 | 2.317 |
| | Proposed | 0.8878 | 0.1166 | 0.0788 | 8.762 | 0.9991 |
Abbreviations: ASD, average symmetric surface distance; DSC, dice similarity coefficient; FND, false negative dice; FPD, false positive dice; MSD, maximum symmetric surface distance.
5. SUMMARY
Aiming at the difficulties of lumbar vertebrae segmentation in CT images, we propose an automatic lumbar vertebrae segmentation method based on deep learning. The method consists of two parts: lumbar vertebra positioning and lumbar vertebrae segmentation. First, we propose a UNet-based lumbar spine localization network, which directly locates the lumbar spine region in the image. Then, we propose a three-dimensional XUnet lumbar vertebrae segmentation method to achieve automatic lumbar vertebrae segmentation. The proposed method is validated on lumbar spine CT images from the public VerSe 2020 dataset and from our hospital dataset. Qualitative comparison and quantitative analysis show that the proposed method achieves good lumbar vertebrae segmentation performance and can be further applied to the detection of spinal anomalies and to surgical treatment.
AUTHOR CONTRIBUTIONS
Hongjiang Lu and Mingying Li: methodology, investigation, formal analysis, writing ‐ original draft. Kun Yu and Yingjiao Zhang: data curation, investigation, formal analysis, validation. Liang Yu: conceptualization, supervision, writing ‐ review & editing.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
ACKNOWLEDGMENTS
The authors have nothing to report.
Lu H, Li M, Yu K, Zhang Y, Yu L. Lumbar spine segmentation method based on deep learning. J Appl Clin Med Phys. 2023;24:e13996. 10.1002/acm2.13996
DATA AVAILABILITY STATEMENT
All data included in this study are available upon request by contact with the corresponding author.
REFERENCES
- 1. Jeon I, Kong E. Application of simultaneous 18F-FDG PET/MRI for evaluating residual lesion in pyogenic spine infection: a case report. Infect Chemother. 2020;52(4):626-633.
- 2. Sneath RJS, Khan A, Hutchinson C. An objective assessment of lumbar spine degeneration ageing seen on MRI using an ensemble method-a novel approach to lumbar MRI reporting. Spine (Phila Pa 1976). 2022;47(5):E187-E195.
- 3. Rousseau J, Dreuil S, Bassinet C, Cao S, Elleaume H. Surgivisio® and O-arm® O2 cone beam CT mobile systems for guidance of lumbar spine surgery: comparison of patient radiation dose. Phys Med. 2021;85:192-199.
- 4. Radaelli D, Zanon M, Concato M, et al. Spine surgery and fat embolism syndrome. Defining the boundaries of medical accountability by hospital autopsy. Front Biosci (Landmark Ed). 2021;26(12):1760-1768.
- 5. Cuzzocrea F, Ghiara M, Gaeta M, Fiore MR, Benazzo F, Gentile L. Carbon fiber screws in spinal tumor and metastasis: advantages in surgery, radio-diagnostic and hadrontherapy. J Biol Regul Homeost Agents. 2019;33(4):1265-1268.
- 6. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:3431-3440.
- 7. Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(12):2481-2495.
- 8. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2015:234-241.
- 9. Çiçek Ö, Abdulkadir A, Lienkamp SS, et al. 3D U-Net: learning dense volumetric segmentation from sparse annotation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2016:424-432.
- 10. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, et al. UNet++: a nested U-Net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer; 2018:3-11.
- 11. Huang H, Lin L, Tong R, et al. UNet 3+: a full-scale connected UNet for medical image segmentation. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020:1055-1059.
- 12. Liang-Chieh C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs. International Conference on Learning Representations. 2015.
- 13. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations. 2015.
- 14. Chen LC, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2017;40(4):834-848.
- 15. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770-778.
- 16. Florian LC, Adam SH. Rethinking atrous convolution for semantic image segmentation. Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF; 2017;6.
- 17. Sekuboyina A, Husseini ME, Bayat A, et al. VerSe: a vertebrae labelling and segmentation benchmark for multi-detector CT images. Med Image Anal. 2021;73:102166.
- 18. Jimenez-Pastor A, Alberich-Bayarri A, Fos-Guarinos B, et al. Automated vertebrae localization and identification by decision forests and image-based refinement on real-world CT data. Radiol Med. 2020;125(1):48-56.
- 19. Korez R, Likar B, Pernuš F, et al. Model-based segmentation of vertebral bodies from MR images with 3D CNNs. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2016:433-441.
- 20. Sekuboyina A, Kukačka J, Kirschke JS, et al. Attention-driven deep learning for pathological spine segmentation. International Workshop on Computational Methods and Clinical Applications in Musculoskeletal Imaging. Springer; 2017:108-119.
- 21. Janssens R, Zeng G, Zheng G. Fully automatic segmentation of lumbar vertebrae from CT images using cascaded 3D fully convolutional networks. 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE; 2018:893-897.
- 22. Lessmann N, Van Ginneken B, De Jong PA, et al. Iterative fully convolutional neural networks for automatic vertebra segmentation and identification. Med Image Anal. 2019;53:142-155.
- 23. Payer C, Stern D, Bischof H, et al. Coarse to fine vertebrae localization and segmentation with SpatialConfiguration-Net and U-Net. VISIGRAPP (5: VISAPP). 2020:124-133.
- 24. Payer C, Štern D, Bischof H, et al. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Med Image Anal. 2019;54:207-219.
- 25. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203-211.