Abstract
Early and accurate diagnosis of osteosarcoma (OS) is of great clinical significance, and machine learning (ML) based methods are increasingly adopted. However, current ML-based methods for osteosarcoma diagnosis consider only X-ray images, usually fail to generalize to new cases, and lack explainability. In this paper, we explore the capability of deep learning models in diagnosing primary OS with higher accuracy, explainability, and generality. Concretely, we analyze the added value of integrating biochemical data, i.e., alkaline phosphatase (ALP) and lactate dehydrogenase (LDH), and design a model that incorporates the numerical features of ALP and LDH and the visual features of X-ray imaging through a late fusion approach in the feature space. We evaluate this model on real-world clinical data from 848 patients aged 4 to 81. The experimental results reveal the effectiveness of incorporating ALP and LDH simultaneously in a late fusion approach: the accuracy over the 2608 considered cases increased to 97.17%, compared to 94.35% for the baseline. Grad-CAM visualizations consistent with the assessments of orthopedic specialists further support the model's explainability.
Keywords: Osteosarcoma diagnosis, Neural network interpretability, Machine learning, Deep learning
Introduction
Primary bone tumors account for 0.2% of human tumors, with an annual incidence of approximately 20 cases per 10,000 people [1, 2]. These tumors often cause damage to skeletal structures, resulting in long-term functional disability and reduced quality of life for patients [3–5]. Currently, X-ray imaging, owing to its cost-effectiveness, is the most common imaging tool for diagnosing primary bone tumors [6, 7]. Among all types of bone tumors, osteosarcoma (OS) is the most common primary malignancy of bone [8]. However, OS is difficult to identify, either because of its rarity or because of doctors' limited experience in interpreting X-ray images of OS. Consequently, the diagnosis of OS is often missed or delayed in the clinic, and by the time it is made, lung metastases may already be present. Therefore, early and accurate diagnosis of OS is of great significance.
In order to facilitate medical diagnoses, such as OS, with higher accuracy and efficiency, researchers are increasingly resorting to machine-learning-based methods [9–12]. For example, He et al. [13] used a ResNet-based machine learning (ML) method to diagnose bone tumors with X-ray imaging. Olczak et al. [14] applied deep learning models to orthopedic radiographs, achieving accuracy competitive with that of senior orthopedic surgeons. Fave et al. [15] used an ML model to predict a particular non-small cell lung cancer patient's response to treatment.
Although promising, these machine-learning-based methods in medical diagnosis usually fail to generalize to new cases and lack explainability. From our analysis, directly applying deep learning models to X-ray images can lead to overfitting: the model performs well on the training set but poorly in practice. Essentially, such models have trouble generalizing what they learned from the training set, leading to low F1 scores (see Sect. 6 for details).
In this paper, we explore the capability of machine-learning-based methods in diagnosing OS with higher accuracy, explainability, and generality. In particular, we leverage visual interpretability methods for deep neural networks. The results reveal that a naive artificial intelligence (AI) model incorrectly places its attention on the normal tissues around the knee in a patient's X-ray image, and in some cases the areas it focuses on are not even in the lower extremities, which explains the poor performance. To this end, we further analyze the added value of integrating biochemical data, i.e., alkaline phosphatase (ALP), in bone tumor diagnosis with deep learning techniques. We design a novel model with late fusion in the feature space that incorporates the numerical features of ALP information and the visual features of X-ray imaging. To be specific, we take the backbone of ResNet as the image encoder to extract features from X-ray images and concatenate them with the ALP information. With the biochemical information normalized and concatenated to the feature vector, we then use a two-layer Multilayer Perceptron (MLP) as the classifier head to predict the label from the concatenated features. We then evaluate this model on real-world clinical data from 848 patients aged 4 to 81, collected from Peking University People's Hospital. The dataset consists of 2608 X-ray images and corresponding clinical records, divided into non-tumor patients and OS patients. We evaluate the performance gain against a purely image-based Deep Neural Network (ResNet). The accuracy over the 2608 considered cases increased to 97.17%, compared to 94.35% for the baseline.
Furthermore, the improvement is explainable both medically and computationally. ALP is an enzyme that plays a role in bone metabolism, and elevated levels of ALP in the blood can be associated with certain bone disorders, including OS. Integrating such clinical data with imaging data in machine-learning models can provide additional information and improve diagnostic accuracy. Our fusion model is thus able to consider both the visual characteristics of the tumor and the patient's biochemical profile. Experimental results demonstrate that it is capable of learning complex patterns and relationships that may exist between imaging features and ALP levels. In summary, our contributions are:
Through the investigation into important reference factors in OS diagnosis, we propose a novel late fusion network structure that integrates visual information from X-rays and numerical information from patient basic information and biochemical tests of ALP and LDH.
Utilizing real-world clinical data from 848 patients aged 4 to 81, collected at Peking University People's Hospital, our exhaustive experiments demonstrate the potential of incorporating additional clinical information to enhance the performance of OS diagnostic systems, with ALP and LDH contributing significantly to the performance. Specifically, the average accuracy increased to 97.17%, compared to 94.35% for the baseline.
Through visual interpretability analysis, we verified that the visual attention of the proposed network is consistent with orthopedic experts when diagnosing OS, emphasizing the high interpretability of our method. At the same time, interpretability research enhances the reliability of ML-based medical diagnosis systems in real-world applications and broadens the application of AI in the medical field.
Related works
Convolutional neural networks
Convolutional neural networks (CNNs) stand out as a key architecture in deep learning, particularly for image classification tasks [16]. They leverage multiple convolutional layers to automatically learn hierarchical features from input images.
Residual Networks (ResNets), introduced by He et al. [17], have been pivotal in enabling the training of deeper networks. ResNets incorporate shortcut connections, enhancing gradient flow during backpropagation.
In medical image processing, ResNets have gained prominence, as reviewed comprehensively by Xu et al. [18].
However, existing research on applications of CNNs in medical diagnosis systems focuses on visual features only [19, 20] and neglects the use of patients’ biochemical indicators and basic information, which are powerful auxiliary information in real-world diagnostics.
Interpretability of neural networks
Research on the interpretability of neural networks seeks to elucidate the mechanisms underlying their operations [21]. Neural network interpretability research is crucial for understanding decision-making processes, especially in critical applications like healthcare [22]. In the context of OS research, it’s essential due to the high-stakes decisions involved. This work enhances transparency in AI diagnosis systems, fostering trust and addressing issues like algorithmic bias.
Grad-CAM (Gradient-weighted Class Activation Mapping), introduced by Selvaraju et al. [23], is a widely embraced method for visualizing deep neural network activations. This approach is essential for interpreting predictions and understanding model decisions. Grad-CAM leverages gradients of the classification score with respect to the final convolutional feature map, identifying image regions that significantly influence the classification score. Through gradient-based localization, it generates visual explanations from deep networks, pinpointing specific regions crucial to a classification decision. Notably, Grad-CAM extends beyond its precursor, the Class Activation Mapping (CAM) technique [24], which was designed for specific network architectures. It adapts to any CNN architecture, facilitating activation visualization in any network layer. This versatile technique demonstrates effectiveness across diverse tasks, including weakly supervised object localization and segmentation [25].
Machine-learning-based medical diagnosis
With the development of ML, a large body of related research has emerged to explore the application of ML methods in medical diagnosis [26–32].
Bai et al. [33] explored the feasibility of using machine-learning models such as logistic regression, naïve Bayes, and random forest, to predict the risk of end-stage kidney disease (ESKD) in chronic kidney disease (CKD) patients. However, this paper only focuses on baseline characteristics and routine blood test results, neglecting the significant visual features.
Soni et al. [19] integrated the space transformer network (STN) with a CNN to address irregular image orientation (e.g., rotated or inclined images), simplifying the diagnosis of lung disease for both specialists and physicians. Chowdhury et al. [20] designed a parallel-dilated CNN-based COVID-19 detection system. Yet, these works are limited to visual features extracted from X-ray images, ignoring other commonly used biochemical test indicators that can assist the judgment. In addition, the lack of interpretability research on DNNs makes such methods untrustworthy and difficult to apply in the real world.
Anand et al. [34] developed a Deep Convolutional Extreme Learning Machine (DC-ELM) algorithm for assessing cancer type by analyzing histopathology images. Nevertheless, histopathology requires the collection of tissue samples through biopsies or surgical procedures, which severely limits the method's wide application, such as screening large groups for OS.
The studies most closely aligned with our research are [35, 36], where the integration of visual and clinical features was investigated for bone tumor classification. Nevertheless, there are substantial distinctions between their approaches and ours. [35] employs a conventional method for extracting visual features, while [36] uses a fusion module outside the deep neural network (DNN), neglecting to incorporate the fusion process within the DNN. Moreover, both methods require extra time and human resources for annotating regions of interest, and they lack an analysis of the visual comprehension capabilities of the ML models.
In conclusion, existing research on machine-learning-based medical diagnosis either generalizes poorly or lacks interpretability, making it unsuitable for the diagnosis of OS.
Medical backgrounds
Osteosarcoma is the most common primary bone tumor among children and adolescents [37, 38]. Its unknown etiology, significant histological heterogeneity, lack of biomarkers, high aggressiveness, and potential for early metastasis make OS a devastating disease. The current treatment modalities for OS include preoperative neoadjuvant chemotherapy, surgical excision, and postoperative chemotherapy [39]. The 5-year survival rate is 60% for patients with localized OS, but it plummets to 20% for patients with metastases or local recurrence [40], which highlights the importance of early diagnosis and effective treatment of OS. However, due to its rarity, occult onset, and tendency to occur in children and adolescents, OS is often misdiagnosed in its early stage as "growing pains" by primary hospitals that lack the relevant experience, delaying correct diagnosis and treatment and leading to a poor prognosis. Therefore, a method that can assist in the diagnosis of OS in primary hospitals is urgently needed. Since machine-learning-based methods have been employed in medical diagnosis with high accuracy and efficiency, we plan to develop an ML-based early diagnosis tool for OS, so that OS can be correctly diagnosed and treated as soon as possible.
Overview
Figure 1 illustrates the workflow of our work, which consists of three main components: (1) data collection and preprocessing; (2) feature extraction, fusion, and classification that provides a preliminary diagnosis; (3) interpretability research that enhances the reliability of the system.
Fig. 1.

Overview of our work in the form of a flowchart. We first perform two tests on the patient and then produce a preliminary diagnosis through alignment, normalization, feature extraction, feature fusion, and classification. Subsequently, visual interpretability analysis is carried out with Grad-CAM to provide doctors with a comprehensive reference
In the following, we will introduce the design details and considerations in these processes respectively.
Data collection
We obtained a real-world dataset of OS diagnosis and treatment records from Peking University People's Hospital, which serves as the primary source for our experiments and guarantees both authenticity and validity. It comprises 848 patients, with 2608 X-ray images and their corresponding clinical records. The dataset consists of two categories: non-tumor knee injury patients and OS patients. The non-tumor knee injury category includes patients with knee injuries such as meniscal injuries, ligament tears, and common inflammation. Consequently, the primary distinction between the two categories lies in the presence or absence of OS. Despite the challenge of distinguishing between the two categories, this setting accurately reflects real-world scenarios in which patients may present with various diseases causing leg pain.
Among the 848 patients in the dataset, there are 454 non-tumor patients and 394 OS patients, covering a wide age range from 4 to 81. Since X-rays may be taken from multiple angles, the number of X-ray images is roughly twice as large; the final numbers of selected X-ray images are 1432 and 1176, respectively. Regarding preprocessing, all images were adjusted to appropriate brightness and contrast by professional radiologists. Since the raw images contain both cropped images and original desktop screenshots, filtering and cropping each image by hand would be time-consuming. Therefore, we filter the screenshots according to the filename and crop them at a fixed location that covers the most information. The selected X-ray images have a minimum resolution of 224 × 224, ensuring high image quality.
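A minimal sketch of this filtering-and-cropping step follows. The filename tag and the centered square crop are illustrative assumptions; the paper only states that screenshots are identified by filename and cropped at one fixed location.

```python
# Sketch of the preprocessing described above. The "screenshot" filename
# tag and the centered-square crop location are assumptions for
# illustration, not the paper's exact rule.

def is_screenshot(filename: str) -> bool:
    """Heuristic: assume desktop screenshots carry a telltale filename tag."""
    return "screenshot" in filename.lower()

def crop_box(width: int, height: int):
    """Largest centered square crop as a (left, top, right, bottom) box.

    Because every selected image is at least 224 x 224, the resulting
    square is always large enough for the network's 224 x 224 input.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```

The box returned by `crop_box` could then be passed to any image library's crop routine before resizing to 224 × 224.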
Figure 2 shows sample images for the two categories. As mentioned above, the X-ray images can be hard to distinguish because of how the non-tumor category is defined: some non-tumor patients have suspicious tissues that are visually similar to OS. Meanwhile, some OS patients are still in the early stages of the disease, where the tumor tissue is not obvious, which makes accurate machine-learning-based diagnosis even harder.
Fig. 2.

X-ray images for the two categories. a On the left, from a patient who has OS. b On the right, from a patient who does not have OS
As a consequence, to achieve early and accurate diagnosis of OS, it is necessary to integrate additional clinical information (e.g., biochemical data) so that the AI diagnostic system can comprehensively analyze the patient's condition and give a diagnosis like a real doctor. Apart from basic information such as age and gender, we chose two biochemical indicators as the additional information.
The first biochemical indicator is ALP, a glycoprotein that plays a crucial role in various biological processes. It is found in many tissues throughout the body, including the liver, bones, intestines, and kidneys. ALP is involved in the hydrolysis of phosphate ester bonds under alkaline conditions, leading to the release of inorganic phosphate. The serum ALP level can be easily obtained through blood biochemical analysis and is an important marker reflecting bone formation [41]. OS is characterized by the production of osteoid tissue or immature bone, so the abnormal osteogenesis of OS leads to changes in serum ALP. Many researchers have found that serum ALP is a valuable tumor marker with high specificity in OS [42], and a high serum ALP level is related to higher OS activity [43]. Other studies suggest that ALP has predictive value for lung metastasis and patient prognosis: patients with higher serum ALP levels have a worse prognosis and a higher likelihood of lung metastasis. In summary, based on its important predictive value in OS, the serum ALP level was selected as one of the training indicators for the machine-learning model.
The other biochemical indicator is lactate dehydrogenase (LDH), an enzyme that plays a crucial role in energy metabolism. It is found in various tissues and cells throughout the body, including the heart, liver, muscles, and red blood cells. LDH is a nicotinamide adenine dinucleotide (NAD+) dependent enzyme that catalyzes the reversible conversion of pyruvate to lactate. Previous studies have found that OS patients whose tumor tissue has lower expression of LDH mRNA exhibit longer survival times, so LDH may be a significant predictor of poor prognosis in OS. The serum LDH level has also been found to be associated with the prognosis of OS patients [44]: a higher serum LDH level is closely related to a lower event-free survival rate. The National Comprehensive Cancer Network guidelines likewise indicate that LDH is associated with the diagnosis and prognosis of OS [45].
Finally, the patient’s X-rays and biochemical indicators can be linked through the medical record number, and Fig. 3 is an example input of the improved AI diagnostic system.
Fig. 3.

An example input of the improved AI diagnostic system. Basic patient information and biochemical results of ALP and LDH are provided as auxiliary data for diagnosis
Approach and experimental design
As discussed in the "Data collection" section, X-ray imaging by itself cannot serve as the sole basis for diagnosis. Since the situation of real-world patients can be extremely complicated, a purely image-based model cannot provide high-confidence predictions and may cause serious misdiagnoses. To improve the accuracy of OS diagnosis, we suggest that AI diagnostic systems incorporate more clinical information (e.g., biochemical data) besides the imaging data. In this way, they can analyze the patient's condition comprehensively and provide a diagnosis like a real doctor. We select two biochemical markers, along with basic information such as age and gender, as additional information.
There are two primary methods for integrating X-ray imaging data and biochemical data: early fusion and late fusion. In early fusion, the features from the biochemical data are encoded directly into the images. However, this approach is not ideal because convolutional networks are designed to imitate human vision, focusing on learning local features and filters, and may not perform well at reading characters rendered in images. Therefore, we choose a late fusion method instead. Since the biochemical indicators are already normalized, all-numerical inputs, there is no need to build a sub-network to encode the non-image inputs. We define an adjustable concatenate layer that performs the late fusion on the flattened image features (i.e., below the average pooling layer) and additional inputs of varying lengths. To better utilize the concatenated features, we define the fully connected block as a two-layer MLP, where the intermediate dimension is 512 and the output dimension is 2.
Denoting the visual inputs from X-ray imaging as X and the auxiliary inputs from clinical information as Y, the inference of our network is denoted as:
$$\hat{y} = G\big(\mathrm{Concat}(F(X),\, N(Y))\big) \tag{1}$$
where F is the feature extractor (backbone) of the CNN that encodes the X-ray image X into a feature vector, and N encodes the non-numerical inputs and normalizes them. Concat represents the concatenate layer that appends N(Y) to the end of the feature vector of X. G is the classification head, a two-layer MLP consisting of two fully connected layers. The detailed network structure is illustrated in Fig. 4.
Fig. 4.
The proposed network. Our network has two major types of inputs, i.e., the visual input of X-ray imaging and the non-visual auxiliary information, including textual (gender) and numerical (age, ALP, LDH) inputs. First, the visual input is encoded into a 1 × 512 feature vector by the feature extractor of a CNN backbone, while the auxiliary inputs are encoded and normalized through the corresponding normalization operations, forming a 1 × 4 vector. Second, the two vectors are concatenated by the concat layer along the second dimension. Last, an MLP containing two fully connected layers followed by a softmax layer produces the outputs from the fused feature vector
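The late-fusion structure described above can be sketched in PyTorch. The layer sizes (512-d image features, 4 auxiliary inputs, 512-unit hidden layer, 2 outputs) follow the paper; the `LateFusionNet` name and the injectable `encoder` argument are illustrative assumptions. In practice the encoder would be a torchvision ResNet18/50 truncated below its final fully connected layer and initialized with ImageNet weights.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Sketch of the proposed late-fusion network (an assumption-laden
    reconstruction, not the authors' released code): an image encoder
    produces a 512-d feature vector, which is concatenated with the
    normalized 4-d auxiliary vector (gender, age, ALP, LDH) and fed to
    a two-layer MLP classifier head."""

    def __init__(self, encoder: nn.Module, feat_dim: int = 512,
                 n_aux: int = 4, hidden: int = 512, n_classes: int = 2):
        super().__init__()
        self.encoder = encoder  # e.g. ResNet18 up to the average pooling
        self.head = nn.Sequential(
            nn.Linear(feat_dim + n_aux, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(x).flatten(1)      # (B, feat_dim)
        fused = torch.cat([feat, aux], dim=1)  # concat layer: (B, feat_dim + n_aux)
        return self.head(fused)                # (B, n_classes) logits
```

Keeping the fusion inside the network, rather than in a separate post-hoc module, lets gradients from the classifier head shape both the image features and their interaction with the auxiliary inputs.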
One potential weakness of the method is that, since the network can exploit the additional information for classification and the image encoder starts from a good initialization, the encoder weights may remain largely fixed in the initial stage of training, with little gradient propagated back through the image branch. Another probable weakness is that the network may waste effort trying to distinguish visually similar images that could easily be classified with the additional non-image information. These can be the main aspects of future study.
We summarize the fusion components in Fig. 5 and describe them in detail in the following subsections.
Fig. 5.

Involved fusion models. 'Baseline' indicates the model with only X-ray images as inputs, while the remaining models are our proposed networks with different combinations of auxiliary information. We have 7 such models in total to investigate the effectiveness of the proposed fusion network
Baseline
The baseline model exclusively utilizes features extracted from X-ray images in the training dataset and serves as a reference for subsequent experiments involving additional clinical data. The architecture of the baseline model is based on a conventional CNN that utilizes ResNet18. Small adjustments were made to optimize the network for the specific task and ensure fairness of comparison. Specifically, the backbone of the ResNet18 model uses the same ImageNet pre-trained weights as initialization, and the fully connected layer is implemented as a two-layer MLP to ensure comparability with the improved versions in terms of the number of neurons. The activation was adjusted for binary classification, converting the detected features into a single prediction per image.
Integration of basic information
According to the Surveillance, Epidemiology, and End Results database, factors including age and gender are related to the incidence of OS, which is probably connected to bone growth, hormonal changes, and pubertal development. The incidence of OS shows an obvious age-related bimodal distribution, with the first peak in adolescents aged 10–14 years and the second peak in adults older than 65 years [46]. OS is more likely to occur in males than in females; the overall male-to-female ratio was 1.3:1 (males: IR 8.1, 95% CI 7.7–8.6; females: IR 6.2, 95% CI 5.8–6.6), where IR and CI stand for incidence rate and confidence interval, respectively. The incidence peak for females (10–14 years old) occurs earlier than for males (15–19 years old) [47]. Thus, age and gender are valuable inputs for the machine-learning model.
We one-hot encode gender and subtract 0.5 from the result. Therefore, the female gender is represented as 0.5 and the male gender as − 0.5. This encoding method is more suitable as input for neural networks. Since the patient samples collected may not be comprehensive, and there may be younger or older patients, min–max normalization is not very appropriate for age information. Instead, we directly use age divided by 100 as the encoding of age information.
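The encoding above can be sketched as a small helper. The function name and the string labels for gender are illustrative assumptions; the ±0.5 gender encoding and the age/100 scaling follow the paper.

```python
def encode_basic_info(gender: str, age: float):
    """Encode basic patient information as in the paper's scheme
    (sketch with assumed labels): female -> 0.5, male -> -0.5
    (one-hot minus 0.5), and age divided by 100."""
    g = 0.5 if gender == "female" else -0.5
    return [g, age / 100.0]
```

The resulting two numbers form the first part of the 1 × 4 auxiliary vector, ahead of the normalized ALP and LDH values.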
Integration of biochemical indicators
The serum ALP and LDH levels were obtained from the biochemical analysis of patients. Serum ALP and LDH levels have been proven to be two important biochemical indicators of OS, and high serum levels of ALP and LDH are related to poor prognosis.
Because several different ALP kits were used in the biochemical analysis of OS patients, the normal ranges differ between kits. To make the test results from different kits comparable, we normalize them according to the following linear projection:
$$N(y) = \frac{y - L}{H - L} \tag{2}$$
where y is the biochemical test input, N(y) is the normalized result corresponding to y, L is the lower limit of the recommended normal range of the ALP or LDH kit used in the inspection, and H is the upper limit of the normal range. Test results that surpass H have a normalized result larger than 1, and test results lower than L have a normalized result smaller than 0; both cases indicate an abnormal serum ALP/LDH level.
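Eq. (2) amounts to a one-line helper; the function name is illustrative:

```python
def normalize_marker(y: float, low: float, high: float) -> float:
    """Linear projection of Eq. (2): values within the kit's normal range
    [low, high] map to [0, 1]; values outside the range fall below 0 or
    above 1, flagging an abnormal serum ALP/LDH level."""
    return (y - low) / (high - low)
```

For instance, with a kit whose normal range is 50–150 U/L (hypothetical numbers), a reading of 40 normalizes below 0 and a reading of 150 normalizes to exactly 1.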
Evaluation
The evaluation consists of two corresponding components: (1) numerical analysis, which examines the model’s performance across various metrics. (2) interpretability research, which investigates the reasons behind the model’s predictions.
Experimental setup
The neural networks in our experiments are implemented in Python using widely used ML libraries such as PyTorch, Scikit-learn, and OpenCV. ResNet is chosen as the convolutional backbone for the experiments due to its great success in deep learning tasks and its performance across multiple computer vision fields. Following the standard image input size of ResNet, we downsampled all X-ray images to 224 × 224 by resizing. Random horizontal flipping is usually applied to enhance the robustness of training, but we omitted this operation because each patient already has several X-ray images taken at different angles, as mentioned in the "Data collection" section. To test the model's performance on this dataset more objectively and comprehensively, we divided the dataset into five equal parts, using each part in turn as the test set (20%) and the rest as the training set (80%), i.e., 5-fold cross-validation. We train each model for 50 epochs with a batch size of 256. To achieve the best performance, we adopted a learning rate scheduler to adjust the learning rate as training progressed. The initial learning rate was 0.01, and it was reduced to one-tenth of its previous value at the 20th, 35th, and 45th epochs; the minimum learning rate was therefore 0.00001.
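The schedule described above maps directly onto PyTorch's `MultiStepLR`. The optimizer choice below (plain SGD) and the placeholder model are assumptions; the paper specifies only the initial rate, the milestone epochs, and the decay factor.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Sketch of the training schedule described above: start at 0.01 and
# divide the learning rate by 10 at epochs 20, 35, and 45. The SGD
# optimizer and the tiny placeholder model are illustrative assumptions.
model = torch.nn.Linear(4, 2)  # stand-in for the fusion network
optimizer = SGD(model.parameters(), lr=0.01)
scheduler = MultiStepLR(optimizer, milestones=[20, 35, 45], gamma=0.1)

for epoch in range(50):
    # ... one training epoch over batches of size 256 ...
    scheduler.step()
```

After the 50 epochs, the learning rate has been decayed three times, landing at the stated minimum of 0.00001.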
Numerical results
We test our model with four commonly used metrics: Accuracy, Precision, Recall, and F1 score.
Accuracy is the ratio of correctly classified samples to the total number of samples. It is the most straightforward evaluation metric, indicating the proportion of samples that the model predicts correctly.
Precision is the proportion of true positive samples among the samples predicted as positive by the model. It measures the accuracy of the model’s positive predictions.
Recall is the proportion of true positive samples among the actual positive samples. It measures the model’s ability to find positive samples.
The F1 score is the harmonic mean of precision and recall, providing a balanced measure of the model’s accuracy and recall.
The following equations show the definition of these metrics in detail:
$$\begin{aligned}
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}\\
\mathrm{Precision} &= \frac{TP}{TP + FP}\\
\mathrm{Recall} &= \frac{TP}{TP + FN}\\
\mathrm{F1} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\end{aligned} \tag{3}$$
where TP (True Positive) is the number of true positive samples, i.e., patients correctly predicted as suffering from OS. TN (True Negative) is the number of true negative samples, i.e., patients correctly predicted as not having OS. FP (False Positive) is the number of false positive samples, i.e., patients incorrectly predicted as having OS (misdiagnosis). FN (False Negative) is the number of false negative samples, i.e., patients incorrectly predicted as not having OS (missed diagnosis).
These metrics are commonly used together to evaluate the performance of classification models since accuracy alone cannot fully assess the model’s performance, and there is often a trade-off between precision and recall. The F1 score combines the information from precision and recall to provide a comprehensive evaluation of the model’s classification ability.
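The four metrics follow directly from the confusion counts; a small helper (function name illustrative) makes the relationships concrete:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Accuracy, Precision, Recall, and F1 score from the
    confusion-matrix counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Note how the F1 score, as the harmonic mean, is pulled toward whichever of precision and recall is lower, which is why it complements accuracy on imbalanced error patterns.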
The primary results obtained using ResNet18 are presented in Table 1. The numerical results indicate that incorporating the ALP value is more effective (+2% accuracy, +1.16% precision, +3.37% recall, +2.34% F1 score) than incorporating LDH. Additionally, using gender and age as additional network inputs also contributes to the diagnosis of OS to some extent. Although the LDH value (+0.54% accuracy, +0.07% precision, +1.08% recall, +0.66% F1 score), as well as gender, does not significantly improve performance on its own, their combination with the other information leads to the best results. The overall best performance is achieved by adding all the discussed information to the model input, yielding 97.19% accuracy, 98.26% precision, 95.49% recall, and a 96.85% F1 score.
Table 1.
Main results with ResNet18 as the backbone
| Method | Add. Info | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| Baseline | – | 0.9435 | 0.9588 | 0.9144 | 0.9357 |
| Gender | Encoded gender | 0.9493 | 0.9571 | 0.9300 | 0.9432 |
| Age | Encoded age | 0.9643 | 0.9680 | 0.9532 | 0.9600 |
| Gender & age | Encoded gender and age | 0.9612 | 0.9652 | 0.9481 | 0.9566 |
| LDH | LDH value | 0.9489 | 0.9595 | 0.9262 | 0.9423 |
| ALP | ALP value | 0.9635 | 0.9704 | 0.9481 | 0.9591 |
| ALP&LDH | ALP and LDH value | 0.9643 | 0.9662 | 0.9538 | 0.9599 |
| All | All the above information | 0.9719 | 0.9826 | 0.9549 | 0.9685 |
The baseline method is the network without any additional information, by adding different combinations of basic information (gender and age) of patients and biochemical information (LDH and ALP value). The best result for each metric (column) is highlighted in bold. The average performance of our fusion network consistently surpasses that of the baseline
The results align well with intuition, indicating that the added inputs contribute significantly to OS diagnosis. Consequently, the results with additional information outperform the baseline across all four metrics. However, certain additional inputs, such as LDH, do not exhibit the anticipated efficacy. One likely explanation is that an abnormal LDH value can signal various other diseases, particularly liver-related conditions such as acute hepatitis, chronic active hepatitis, cirrhosis, and liver cancer. Consequently, a patient experiencing leg pain and displaying an abnormal LDH value is not necessarily afflicted with OS. The situation is similar for ALP: although ALP works well for the diagnosis of OS, it cannot serve as a decisive diagnostic factor in isolation. Nevertheless, patients with both abnormal LDH and abnormal ALP values are more likely to have OS, which probably explains the strong performance of ALP and LDH when utilized together.
Typically, age and gender are not regarded as diagnostic factors. An intriguing observation is that age and gender, even when employed in isolation, can contribute to the diagnosis of OS. Moreover, their combined use with laboratory indicators proves highly effective. Their effectiveness may stem from variations in the normal ALP and LDH ranges among individuals of different ages and genders; combining age and gender with these indicators therefore leads to better judgment of the ALP and LDH levels and enhances diagnostic accuracy.
To effectively illustrate the strengths and weaknesses of various models regarding specific metrics, supplemental experiments with ResNet50 were conducted. The comprehensive bar charts depicting the experimental outcomes are presented in Fig. 6. As depicted in Fig. 6, utilizing ResNet50 as the image feature extractor yields superior results. The highest accuracy reaches 98.23%, and there are improvements in the best recall, precision, and F1 score. It is noteworthy that as the image encoder strengthens, the model achieves more precise predictions solely with X-ray images, leaving less room for improvement. Nevertheless, the inclusion of additional inputs consistently enhances performance compared to the baseline, affirming the effectiveness of our proposed method.
Fig. 6.
The complete results on ResNet18 and ResNet50 compared with the baseline. The integration of biochemical data consistently improved the model performance compared with the baseline, with ALP and LDH as the most efficient auxiliary data
Interpretability research
We employ Grad-CAM to analyze the entire dataset, with Layer 4 of ResNet chosen as the target layer; the heatmap of each image indicates the region the model scrutinizes to make predictions. Figure 7 illustrates instances that were correctly classified. The figure reveals that the model distinctly concentrates on suspicious tumor tissue when assessing patients with OS. Subsequently, the corresponding areas in the X-ray images were evaluated by specialists and described in technical terms: the radiographic features of these areas can be summarized as fluffy, cloud-like, and sunburst-like abnormal tissues, which are the characteristic imaging changes of OS. This implies that the model recognizes these image patterns as indicative of OS characteristics and renders judgments accordingly. Conversely, for non-tumor patients, the model's attention is ultimately directed toward the lower end of the bone in the image, signifying the absence of suspicious tumor tissue and leading to the determination that the patient is not afflicted with OS. In summary, the model correctly captured the region of interest for both categories of patients and made accurate predictions.
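The Grad-CAM procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a tiny stand-in CNN replaces the ResNet backbone (its last convolutional block plays the role of Layer 4), and the heatmap is computed the standard way, by weighting the target layer's activations with the spatially pooled gradients of the class score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the ResNet image encoder (illustrative assumption only).
class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        fmap = self.features(x)                    # target-layer activations
        logits = self.head(fmap.mean(dim=(2, 3)))  # global average pooling
        return fmap, logits

def grad_cam(model, x, class_idx):
    fmap, logits = model(x)
    fmap.retain_grad()                    # keep gradients of a non-leaf tensor
    logits[0, class_idx].backward()       # gradient of the chosen class score
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * fmap).sum(dim=1))           # weighted combination
    cam = cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
    return cam.detach()

model = TinyCNN()
x = torch.randn(1, 1, 32, 32)              # dummy single-channel "X-ray"
cam = grad_cam(model, x, class_idx=1)
print(cam.shape)  # torch.Size([1, 32, 32])
```

In practice the heatmap is upsampled to the input resolution and overlaid on the X-ray, which is how the visualizations in Fig. 7 are typically produced.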
Fig. 7.
The Grad-CAM results of correctly classified samples. The model distinctly concentrates on suspicious tumor tissue when assessing patients with OS
From an alternative perspective, we seek to understand why models occasionally make incorrect predictions and which factors lead them to erroneous decisions. Figure 8 depicts the samples misclassified by the baseline model. In the false negative samples, the X-ray images do not reveal evident tumor tissues that capture the model's attention, leading it to incorrectly conclude the absence of OS in the patient. Conversely, in the false positive samples, the model misinterprets abnormal knee X-ray textures, resulting from other conditions, as tumor tissue. Consequently, these patients receive a misdiagnosis of OS.
Fig. 8.
The Grad-CAM results of samples misclassified by the baseline model. The baseline model is unable to focus on evident tumor tissues, resulting in an incorrect diagnosis of the absence of OS
Beyond analyzing the baseline model, we continue to investigate the effect of additional inputs on the attention of the model. We selected the samples that were misclassified by the baseline model and correctly classified by the improved model with all the discussed additional inputs. As shown in Fig. 9, when the model is not assisted by additional inputs, its attention is scattered or omits the core region. In contrast, with the help of auxiliary inputs, the model's attention correctly focuses on the region of interest. Take the first case as an example: the original model failed to find the key region, while the improved model, although not perfect, successfully shifted part of its attention to the tumor tissues. Meanwhile, we analyzed the gradient contribution of the additional inputs; the statistical results show that ALP contributes 2.415 times more to the gradient than LDH. This finding again emphasizes that ALP is significant for the diagnosis of OS.
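A gradient-contribution analysis of this kind can be sketched as below. The sketch rests on stated assumptions: a toy fused classifier stands in for the paper's model, and "contribution" is taken as the mean absolute gradient of the OS logit with respect to each biochemical input; the paper does not specify its exact aggregation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier over the two biochemical inputs (illustrative stand-in).
clf = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))

# Dummy batch; columns are assumed to be [ALP, LDH] after normalization.
bio = torch.randn(16, 2, requires_grad=True)
os_logit = clf(bio)[:, 1].sum()   # sum of the "OS" logits over the batch
os_logit.backward()               # populates bio.grad

# Mean absolute gradient per input as a rough contribution measure.
alp_contrib = bio.grad[:, 0].abs().mean()
ldh_contrib = bio.grad[:, 1].abs().mean()
print(f"ALP/LDH gradient ratio: {alp_contrib / ldh_contrib:.3f}")
```

Applied to the trained fusion model over the real dataset, a statistic of this form yields the ALP-to-LDH ratio reported above.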
Fig. 9.
The change of attention given additional inputs. The model’s attention correctly focuses on the region of interest with the help of auxiliary inputs
Conclusion
In this paper, we present a deep learning-based approach to improve the diagnosis of OS by integrating biochemical data with X-ray imaging. Our fusion model achieves a high accuracy rate of 97.17% on real-world clinic data, demonstrating the potential of incorporating additional clinical information to enhance the performance of diagnostic systems.
Our approach has several advantages over traditional diagnostic methods. First, it provides a more comprehensive analysis of the patient’s condition by incorporating additional clinical information. Second, it improves the accuracy of diagnosis by leveraging the strengths of both biochemical data and imaging data. Third, it provides more explainable results by generating Grad-CAM visualizations that highlight the regions of the X-ray images that contribute to the model’s decision.
We explore two main approaches to integrating biochemical data and imaging data, early fusion and late fusion, and find that late fusion is more effective for our task. We use a two-layer MLP to better utilize the concatenated features and adjust the fully connected layer to improve the performance of the model.
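The late-fusion design described above can be sketched as follows. The dimensions and the toy convolutional backbone are illustrative assumptions, not the paper's exact configuration (which uses a ResNet encoder): image features and the numerical features (ALP, LDH, age, gender) are concatenated in feature space and passed through a two-layer MLP head.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    def __init__(self, img_dim=64, num_dim=4, hidden=32, n_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(      # stand-in for the ResNet encoder
            nn.Conv2d(1, img_dim, 3, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(           # two-layer MLP on fused features
            nn.Linear(img_dim + num_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, image, numeric):
        # Late fusion: concatenate image and numerical features, then classify.
        feats = torch.cat([self.backbone(image), numeric], dim=1)
        return self.mlp(feats)

model = LateFusionNet()
xray = torch.randn(2, 1, 64, 64)   # dummy X-ray batch
labs = torch.randn(2, 4)           # normalized ALP, LDH, age, gender
print(model(xray, labs).shape)     # torch.Size([2, 2])
```

Because fusion happens after the image encoder, the backbone can be swapped (e.g. ResNet18 for ResNet50) without changing the fusion head, which is what the ResNet50 experiments exploit.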
Our approach is capable of being applied to other medical imaging modalities beyond X-rays, such as magnetic resonance imaging (MRI) or computed tomography (CT) scans. Additionally, it is worthwhile to incorporate more clinical data, such as patient history and symptoms. Moreover, extending our work to other medical fields beyond OS diagnosis, such as cancer detection and diagnosis of other diseases, would also be a valuable undertaking.
Considering the limitations, there remain weaknesses that can be addressed in future research. For example, our model may waste computation trying to distinguish visually similar images that could easily be classified with additional non-image information. Additionally, our model is trained and evaluated on a relatively small dataset; although the dataset is balanced in terms of categories, the imbalance of its age distribution affects the generalization of the model to a certain extent. Collaborating with more medical and health organizations to collect more diagnostic data and addressing dataset imbalance is an important part of future work. In future work, we will investigate approaches to input-aware inference with adaptive weighted fusion modules. In addition, we will focus on improving the efficiency of data collection through deep learning-based image preprocessing, such as image enhancement and segmentation.
In conclusion, our study demonstrates the potential of deep learning-based approaches for improving the diagnosis of primary OS. By integrating biochemical data with imaging data, we highlight the importance of combining multiple sources of clinical data, achieving higher accuracy and a more comprehensive analysis of the patient's condition. We hope that our work contributes to the development of more effective and efficient diagnostic systems and inspires further research and advancements in the field of medical image analysis and diagnosis.
Acknowledgements
This work is supported by the Natural Science Foundation of Jiangsu Province under Grant BK20230083, the Xicheng District financial science and technology special project XCSTS-SD2022-15, and Peking University People's Hospital research and development funds RDX2023-01.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available due to privacy restrictions and ethical considerations related to patient data.
Declarations
Conflict of interest
The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
This research was approved by the Ethics Review Committee, Peking University People’s Hospital (Approval No. 2023PHB271-001).
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Shidong Wang and Yangyang Shen have contributed equally to this work.
Contributor Information
Xiaodong Tang, Email: tang15877@126.com.
Beilun Wang, Email: beilun@seu.edu.cn.
References
- 1. Gaume M, Chevret S, Campagna R, Larousserie F, Biau D. The appropriate and sequential value of standard radiograph, computed tomography and magnetic resonance imaging to characterize a bone tumor. Sci Rep. 2022;12:6196.
- 2. Shao R, et al. Bone tumors effective therapy through functionalized hydrogels: current developments and future expectations. Drug Delivery. 2022;29:1631–47.
- 3. Mullard M, et al. Sonic hedgehog signature in pediatric primary bone tumors: effects of the gli antagonist gant61 on Ewing's sarcoma tumor growth. Cancers. 2020;12:3438.
- 4. Fauske L, Bruland OS, Grov EK, Bondevik H, et al. Cured of primary bone cancer, but at what cost: a qualitative study of functional impairment and lost opportunities. Sarcoma. 2015;2015:484196.
- 5. Davies M, Lalam R, Woertler K, Bloem JL, Åström G. Ten commandments for the diagnosis of bone tumors. Semin Musculoskelet Radiol. 2020;24(3):203–13.
- 6. Stefanini FS, Gois FWC, de Arruda TCSB, Bitencourt AGV, Cerqueira WS. Primary bone lymphoma: pictorial essay. Radiol Bras. 2020;53:419–23.
- 7. Miwa S, Otsuka T. Practical use of imaging technique for management of bone and soft tissue tumors. J Orthop Sci. 2017;22:391–400.
- 8. Lindsey BA, Markel JE, Kleinerman ES. Osteosarcoma overview. Rheumatol Ther. 2017;4:25–43.
- 9. Richens JG, Lee CM, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun. 2020;11:3923.
- 10. Lei Y, et al. Applications of machine learning to machine fault diagnosis: a review and roadmap. Mech Syst Signal Process. 2020;138:106587.
- 11. Shung DL, Sung JJ. Challenges of developing artificial intelligence-assisted tools for clinical medicine. J Gastroenterol Hepatol. 2021;36:295–8.
- 12. Tătaru OS, et al. Artificial intelligence and machine learning in prostate cancer patient management—current trends and future perspectives. Diagnostics. 2021;11:354.
- 13. He F, et al. Study on machine learning model of primary bone tumor around knee joint assisted diagnosis based on X-ray images. Prog Mod Biomed. 2021;21 (in Chinese).
- 14. Olczak J, et al. Artificial intelligence for analyzing orthopedic trauma radiographs: deep learning algorithms—are they on par with humans for diagnosing fractures? Acta Orthop. 2017;88:581–6.
- 15. Fave X, et al. Delta-radiomics features for the prediction of patient outcomes in non-small cell lung cancer. Sci Rep. 2017;7:588.
- 16. Alzubaidi L, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8:1–74.
- 17. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. arXiv preprint (2015). arXiv:1512.03385
- 18. Xu W, Fu Y-L, Zhu D. ResNet and its application to medical image processing: research progress and challenges. Comput Methods Programs Biomed. 2023;240:107660.
- 19. Soni M, et al. Hybridizing convolutional neural network for classification of lung diseases. Int J Swarm Intell Res (IJSIR). 2022;13:1–15.
- 20. Chowdhury NK, Rahman MM, Kabir MA. PDCOVIDNet: a parallel-dilated convolutional neural network architecture for detecting COVID-19 from chest X-ray images. Health Inf Sci Syst. 2020;8:27.
- 21. Zhang Y, Tiňo P, Leonardis A, Tang K. A survey on neural network interpretability. IEEE Trans Emerg Top Comput Intell. 2021;5:726–42.
- 22. Vellido A. The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural Comput Appl. 2020;32:18069–83.
- 23. Selvaraju RR, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. arXiv preprint (2016). arXiv:1610.02391
- 24. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. arXiv preprint (2015). arXiv:1512.04150
- 25. Chan L, Hosseini MS, Plataniotis KN. A comprehensive analysis of weakly-supervised semantic segmentation in different image domains. Int J Comput Vis. 2021;129:361–84.
- 26. Zhang X, et al. Prospective clinical research of radiomics and deep learning in oncology: a translational review. Crit Rev Oncol Hematol. 2022;179:103823.
- 27. Sarki R, Ahmed K, Wang H, Zhang Y. Automated detection of mild and multi-class diabetic eye diseases using deep learning. Health Inf Sci Syst. 2020;8:32.
- 28. Abd El-Wahab BS, Nasr ME, Khamis S, Ashour AS. BTC-fCNN: fast convolution neural network for multi-class brain tumor classification. Health Inf Sci Syst. 2023;11:3.
- 29. Bansal P, Gehlot K, Singhal A, Gupta A. Automatic detection of osteosarcoma based on integrated features and feature selection using binary arithmetic optimization algorithm. Multimedia Tools Appl. 2022;81:8807–34.
- 30. Zhao Y, et al. Identification of gastric cancer with convolutional neural networks: a systematic review. Multimedia Tools Appl. 2022;81:11717–36.
- 31. Bhandari B, Alsadoon A, Prasad P, Abdullah S, Haddad S. Deep learning neural network for texture feature extraction in oral cancer: enhanced loss function. Multimedia Tools Appl. 2020;79:27867–90.
- 32. Anoop V, Bipin PR, Anoop BK. Automated biomedical image classification using multi-scale dense dilated semi-supervised U-Net with CNN architecture. Multimedia Tools Appl. 2024;83:30641–73. 10.1007/s11042-023-16659-1.
- 33. Bai Q, Su C, Tang W, Li Y. Machine learning to predict end stage kidney disease in chronic kidney disease. Sci Rep. 2022;12:8377.
- 34. Anand D, Arulselvi G, Balaji G, Chandra GR. A deep convolutional extreme machine learning classification method to detect bone cancer from histopathological images. Int J Intell Syst Appl Eng. 2022;10:39–47.
- 35. von Schacky CE, et al. Development and evaluation of machine learning models based on X-ray radiomics for the classification and differentiation of malignant and benign bone tumors. Eur Radiol. 2022;32:6247–57.
- 36. Liu R, et al. A deep learning-machine learning fusion approach for the classification of benign, malignant, and intermediate bone tumors. Eur Radiol. 2022;32:1371–83.
- 37. Cole S, Gianferante DM, Zhu B, Mirabello L. Osteosarcoma: a surveillance, epidemiology, and end results program-based analysis from 1975 to 2017. Cancer. 2022;128:2107–18.
- 38. Meltzer PS, Helman LJ. New horizons in the treatment of osteosarcoma. N Engl J Med. 2021;385:2066–76.
- 39. Bian J, et al. Research progress in the mechanism and treatment of osteosarcoma. Chin Med J. 2023;136:2412–20.
- 40. Gorlick R, et al. Children's oncology group's 2013 blueprint for research: bone tumors. Pediatr Blood Cancer. 2013;60:1009–15.
- 41. Gu R, Sun Y. Does serum alkaline phosphatase level really indicate the prognosis in patients with osteosarcoma? A meta-analysis. J Cancer Res Ther. 2018;14:S468–72.
- 42. Su Z, Huang F, Yin C, Yu Y, Yu C. Clinical model of pulmonary metastasis in patients with osteosarcoma: a new multiple machine learning-based risk prediction. J Orthop Surg. 2023;31:10225536231177102.
- 43. Basoli S, et al. The prognostic value of serum biomarkers for survival of children with osteosarcoma of the extremities. Curr Oncol. 2023;30:7043–54.
- 44. Fu Y, Lan T, Cai H, Lu A, Yu W. Meta-analysis of serum lactate dehydrogenase and prognosis for osteosarcoma. Medicine. 2018;97:e0741.
- 45. Biermann JS, et al. NCCN guidelines insights: bone cancer, version 2.2017. J Natl Compreh Cancer Netw. 2017;15(2):155–67. 10.6004/jnccn.2017.0017.
- 46. Ottaviani G, Jaffe N. The epidemiology of osteosarcoma. Cancer Treat Res. 2010;2:3–13.
- 47. Sadykova LR, et al. Epidemiology and risk factors of osteosarcoma. Cancer Invest. 2020;38:259–69.