Abstract
Background
Hip osteoarthritis (HOA) profoundly impairs individuals’ quality of life. Accurate Kellgren–Lawrence (KL) grading is essential for guiding interventions to delay the progression of HOA. However, manual KL grading is constrained by inherent subjectivity and low interobserver reliability. This study aimed to develop and validate a deep learning–based model for the automated grading of HOA.
Methods
We retrospectively collected 20,745 hip radiographs from two Chinese hospitals for model development, 1,928 radiographs from a third hospital for external validation, and 1,249 hips from the Osteoarthritis Initiative (OAI) dataset. A ResNet-50 network with a Convolutional Block Attention Module was trained and evaluated. Comprehensive performance was evaluated across multiple metrics and compared with orthopedic surgeons of varying clinical experience. In addition, Gradient-weighted Class Activation Mapping (Grad-CAM) was used for interpretability.
Results
The model achieved 90.83% (95% confidence interval [CI]: 89.96–91.72) accuracy (area under the receiver operating characteristic curve [AUC]: 0.94) on the internal dataset, 86.67% (95% CI: 85.11–88.12) accuracy (AUC: 0.93) externally, and 82.29% (95% CI: 80.22–84.39) accuracy (AUC: 0.90) on the OAI dataset, with most misclassifications confined to adjacent KL grades. In the reader comparison study, it matched deputy chief surgeons. Grad-CAM confirmed that the model predominantly attended to clinically relevant anatomical features associated with KL grading.
Conclusions
The developed model enables automatic and objective assessment of HOA severity using KL grading across diverse populations and imaging conditions. This tool shows potential to support disease monitoring and large-scale epidemiologic research, enhancing standardization and reproducibility in HOA assessment.
Keywords: Classification, deep learning, hip osteoarthritis, radiography
Introduction
Osteoarthritis (OA) is a chronic degenerative joint disease characterized by cartilage degradation, osteophyte formation, and synovial inflammation and has emerged as a major global public health challenge [1]. Recent epidemiological data indicate that since 1990, the global prevalence of OA has increased by more than 132%, reaching 595 million individuals by 2020, and OA is one of the top ten leading causes of disability worldwide [2]. The burden of OA in China is similarly substantial, with more than 130 million individuals affected as of 2019, which poses significant challenges to both public health and the healthcare system [3]. Beyond severely limiting patients’ daily activities and quality of life, OA imposes a considerable socioeconomic burden. For example, OA-related healthcare expenditures in the United States reached $80 billion in 2016 [4], while in Hong Kong, the combined direct and indirect costs exceeded $400 million as early as 2003 [5]. Although hip osteoarthritis (HOA) accounts for only approximately 6% of all OA cases, its impact on lower limb function is particularly debilitating. Up to 30% of patients with advanced disease eventually require joint replacement surgery, underscoring the high clinical importance of HOA in OA management [6]. Early identification and accurate assessment of HOA severity are therefore essential to delay disease progression, optimize therapeutic strategies, and reduce long-term healthcare costs [7].
Currently, the Kellgren–Lawrence (KL) grading system based on standard X-rays is widely regarded as the gold standard for assessing disease severity in both clinical and research settings [8,9]. The KL system classifies HOA into grades 0 to 4 based on characteristic radiographic features and serves as a critical tool for clinical decision-making [10]. However, the accuracy of KL grading is heavily reliant on the clinician’s subjective interpretation, which can lead to substantial inter-observer variability. This limitation is particularly evident in primary care settings, where early-stage cases are susceptible to misclassification, resulting in missed or delayed diagnoses and suboptimal timing of intervention [11]. Therefore, improving the accuracy and consistency of KL grading is essential to enhance HOA diagnosis and management, reduce disability rates, and optimize healthcare resource allocation [12–15].
In recent years, deep learning (DL) technologies—particularly convolutional neural networks (CNNs)—have demonstrated remarkable advancements in medical imaging and are now widely applied to the diagnosis and classification of orthopedic diseases [16–23]. Within OA research, CNNs have been successfully utilized for automated analysis and KL grading of knee radiographs. For example, Tiulpin et al. [16] developed a model capable of accurately identifying knee OA severity, achieving grading performance highly consistent with expert assessments. In contrast, research on the automated evaluation of HOA remains in its early stages. Although several artificial intelligence (AI) studies have reported promising results in related hip disorders—such as hip fractures [24], developmental dysplasia of the hip [25], and osteonecrosis of the femoral head [18]—intelligent KL grading for HOA continues to face notable limitations. First, restricted training datasets and sample homogeneity limit the generalizability of current models. Second, the absence of multicenter validation hinders the assessment of model robustness across diverse populations and imaging conditions. Collectively, these limitations hamper the clinical translation of AI-based technologies in HOA diagnosis and management.
Consequently, this study sought to develop a DL-based model for the automated assessment of HOA severity. The model was trained and evaluated using multicenter, multisample data, and its performance was systematically compared with the grading results of clinical experts. This research seeks to enhance the accuracy and consistency of radiographic grading for HOA, reduce clinician workload, facilitate early screening and personalized treatment planning, and promote the integration of AI into osteoarthritis management.
Methods
Datasets
This multicenter, retrospective study included hip anteroposterior radiographs collected from three tertiary hospitals in China: the Second Hospital of Jilin University (Hospital #1), China–Japan Union Hospital of Jilin University (Hospital #2), and the Affiliated Hospital of Shandong Second Medical University (Hospital #3). All radiographic data underwent rigorous de-identification to remove personally identifiable information, thereby ensuring patient privacy. The study protocol was approved by the Ethical Institutional Review Board of the Second Hospital of Jilin University (No. 2025-127) and received ethical endorsement from all participating centers. The institutional review boards of the three hospitals waived the requirement for patient informed consent owing to the retrospective design and the use of anonymized data. All hip radiographs were exported and saved in the Digital Imaging and Communications in Medicine (DICOM) format.
Data from Hospital #1 were collected between January 2020 and August 2024, while data from Hospital #2 were collected between August 2022 and August 2024. A total of 20,745 hip radiographs from adult patients meeting the inclusion criteria were obtained from both centers. During data collection, patients were screened based on their unique registration numbers and names, and only the first anteroposterior hip radiograph from each patient’s medical record was included, ensuring that each patient contributed only one image. This patient-level selection rule prevented data leakage between the training and test sets. Inclusion criteria were age ≥18 years and radiographic coverage extending from the anterior superior iliac spine to the proximal third of the femur. Exclusion criteria included (1) the presence of other hip conditions that could affect KL grade assessment, such as hip fractures, tumors, developmental dysplasia of the hip, osteonecrosis of the femoral head, or infection; (2) implantation of hip prostheses; and (3) inadequate image quality precluding KL grade assessment.
To further assess the generalizability and robustness of the developed DL model in varied clinical contexts, an independent external validation dataset was collected from Hospital #3, which was not involved in model training. This dataset, collected between October 2024 and February 2025, comprised 1,928 hip radiographs from adult patients meeting the same inclusion and exclusion criteria and was used solely for external performance validation. The three centers differed in geographic location, demographic characteristics, and imaging equipment. This multicenter heterogeneity enabled a systematic evaluation of the model’s adaptability across diverse and real-world clinical environments. Additionally, to address potential limitations of validation within the same national healthcare system, we incorporated data from the Osteoarthritis Initiative (OAI) database. From the OAI baseline cohort, we randomly selected 800 participants. After applying our inclusion and exclusion criteria, 1,249 hips were included. The participant selection flow is illustrated in Figure 1.
Figure 1.
Flowchart of the participant recruitment and dataset construction process for the study.
KL grading and annotation protocol
To ensure the accuracy and consistency of the labeled data, KL grading was independently performed by two advanced orthopedic surgeons (JLZ and JLX). Cases with concordant grades were accepted as the final labels. For cases with discrepancies, a third advanced orthopedic professor (YGQ) adjudicated the final label through a consensus conference. Prior to grading, all participating experts underwent standardized training and reviewed the KL grading criteria along with a reference atlas of radiographic examples to enhance interrater reliability [26]. A blinded reading protocol was employed during the grading process; readers were blinded to the image source, acquisition site, and all clinical information, relying solely on radiographic features to assign grades. This approach minimized potential bias. All labeled data were stored in a centralized system with rigorous version control and access management protocols.
Image preprocessing
Prior to model training, all imaging data underwent standardized preprocessing. First, all DICOM images were converted to PNG format, and PNG images were read and uniformly converted to RGB format. The images were then cropped and resized to 224 × 224 pixels to meet the input specifications of deep neural networks. Global pixel mean and standard deviation values, calculated from the training set, were subsequently used to normalize all images. To enhance generalizability and mitigate overfitting, a variety of data augmentation strategies from the Albumentations library were applied during the training phase. These included geometric transformations such as random rotation, translation, and scaling, as well as photometric adjustments including brightness–contrast modulation, Gaussian noise, multiplicative noise, and coarse dropout. For the test set, only resizing and normalization were performed, with no augmentation applied.
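The normalization step described above can be sketched as follows. This is a minimal NumPy illustration using toy data; the Albumentations augmentations are omitted, and the helper names are ours, not from the authors' code:

```python
import numpy as np

def compute_global_stats(train_images):
    """Global pixel mean/std computed from the training set ONLY,
    so no test-set statistics leak into preprocessing."""
    pixels = np.concatenate([img.ravel() for img in train_images])
    return float(pixels.mean()), float(pixels.std())

def normalize(img, mean, std):
    """Zero-mean, unit-variance scaling with training-set statistics."""
    return (img.astype(np.float32) - mean) / std

# Toy example: two fake 224 x 224 "radiographs" standing in for the training set
rng = np.random.default_rng(0)
train = [rng.integers(0, 256, (224, 224)).astype(np.float32) for _ in range(2)]
mean, std = compute_global_stats(train)
x = normalize(train[0], mean, std)
```

The same `mean` and `std` would then be applied unchanged to the test images, which, as stated above, receive only resizing and normalization.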
Model development and experimental environment
The backbone architecture of the model was based on ResNet-50. To further enhance the model’s capacity to extract discriminative features from HOA radiographs, we integrated a Convolutional Block Attention Module (CBAM) into the bottleneck residual blocks. CBAM sequentially applies channel and spatial attention mechanisms to guide the network’s focus toward more informative regions and feature channels, thereby strengthening its representational power. The original fully connected layer at the network’s output was replaced with a new classification layer corresponding to the number of KL grades. Additionally, dropout regularization was incorporated to mitigate the risk of overfitting.
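The attention mechanism described above can be sketched in PyTorch. The module follows the published CBAM design (channel attention followed by spatial attention); the reduction ratio and kernel size below are common defaults, not values reported by the authors:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze spatial dims via average- and max-pooling, pass both
    through a shared MLP, and gate the channels with a sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Pool over channels, then learn a 2-D attention map with one conv."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in CBAM."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```

In a bottleneck residual block, the module would typically be applied to the block output before the residual addition; the replaced classification head would be `nn.Linear(2048, 5)` for the five KL grades, preceded by dropout as described.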
The training strategy combined transfer learning with progressive fine-tuning. To ensure class balance across different grades, the dataset was split using stratified sampling into a training set and an internal test set at an 80% to 20% ratio, maintaining a uniform distribution of KL grades in both subsets. Optimization was performed using the AdamW optimizer, coupled with a cosine annealing learning rate scheduler to enhance training stability. To mitigate overconfidence and improve generalizability, label smoothing cross-entropy was adopted as the loss function. An early stopping strategy was implemented throughout the training process, with the best-performing model weights on the validation set preserved to prevent overfitting. The overall architecture of the proposed CBAM-enhanced ResNet-50 model is depicted in Figure 2, and the implementation is available at https://github.com/guowangX66/hip-OA-KL-grading.
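The optimization recipe above can be sketched as follows. All hyperparameter values are illustrative assumptions (the paper does not report them), and a tiny stand-in network replaces the CBAM-enhanced ResNet-50:

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in classifier; the real model is the CBAM-enhanced ResNet-50.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 5))

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)  # illustrative values
scheduler = CosineAnnealingLR(optimizer, T_max=50)                 # anneal over 50 epochs
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)               # label-smoothing CE

# One illustrative optimization step on random data
x = torch.randn(4, 3, 224, 224)
y = torch.randint(0, 5, (4,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch in a real training loop
```

Early stopping would wrap this step in an epoch loop that tracks validation accuracy and restores the best checkpoint after a fixed patience.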
Figure 2.
A schematic illustration of the convolutional block attention module-enhanced ResNet-50 based framework for hip osteoarthritis Kellgren–Lawrence grading.
Model training and testing were conducted on a single workstation equipped with an Intel Xeon Platinum 8360Y CPU, 16 × 32 GiB (3200 MHz, DDR4) memory totaling 512 GB RAM, and one NVIDIA A100 GPU. The system ran Red Hat Enterprise Linux 9.4 with CUDA 11.8. The complete framework was developed in Python 3.11 using PyTorch.
The choice of the ResNet-50 architecture combined with the CBAM was based on rigorous empirical evaluation and technical considerations tailored to the KL grading task for hip osteoarthritis. ResNet-50, with its residual connections, effectively mitigates the vanishing gradient problem and enables stable training of deep networks. Its parameter scale offered an optimal balance for our dataset size, avoiding both underfitting and overfitting. CBAM enhances feature refinement by introducing channel and spatial attention mechanisms, which is particularly advantageous in this context—radiographic features critical to KL grading, such as joint space narrowing, osteophytes, subchondral sclerosis, and cysts, often appear across varying locations and scales. This attention-driven enhancement improves the model’s ability to capture subtle differences between KL grades while suppressing irrelevant background signals. Additionally, the use of an ImageNet-pretrained ResNet-50 backbone ensures robust low-level feature extraction, with proven transferability to radiographic tasks. CBAM’s lightweight design introduces minimal additional parameters, thereby preserving the computational efficiency required for clinical deployment.
Model visualization
To interpret the decision-making process of the deep learning model, Gradient-weighted Class Activation Mapping (Grad-CAM) was employed to visualize the discriminative image regions contributing to classification [27]. Grad-CAM is a post hoc explanation tool that localizes image regions contributing to the predicted class; it provides qualitative, correlation-based evidence and does not establish causal mechanisms. The last convolutional block of ResNet-50 was selected as the target layer. The CAMs were overlaid on the input X-ray images to evaluate whether the model focused on clinically relevant anatomical structures. To contextualize the heatmaps clinically, two advanced orthopedic surgeons (JLZ and JLX) independently reviewed a stratified sample of correctly and incorrectly classified cases and assessed whether the model’s attention overlapped with the radiographic hallmarks used in KL grading. Discrepancies were adjudicated by a third advanced orthopedic professor (YGQ) through a consensus conference.
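The Grad-CAM computation can be sketched in a self-contained way. The toy CNN below stands in for ResNet-50 (whose last convolutional block is the real target layer); the hook-based implementation is ours, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Grad-CAM: weight the target layer's activations by the
    spatially pooled gradients of the class score, then apply ReLU."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        score = model(image)[0, class_idx]  # score for the class of interest
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    a, g = acts[0].detach(), grads[0]              # both (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = F.relu((weights * a).sum(dim=1))         # (1, H, W), negatives suppressed
    return cam / (cam.max() + 1e-8)                # normalize to [0, 1] for overlay

# Toy CNN standing in for ResNet-50; net[2] is its last conv layer.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 5))
cam = grad_cam(net, net[2], torch.randn(1, 3, 32, 32), class_idx=1)
```

In practice the low-resolution map is upsampled to the input size and blended with the radiograph as a heatmap overlay.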
Expert comparison for model validation
To comprehensively evaluate the clinical utility of the developed deep learning model for automated KL grading of HOA severity, a human–machine comparison study was conducted. Six orthopedic physicians, who did not participate in the establishment of the reference standard but had experience in musculoskeletal imaging, were recruited and categorized into three groups based on their level of clinical experience: (1) two orthopedic residents with 3 years of experience; (2) two attending orthopedic surgeons with 5 years of experience; and (3) two deputy chief orthopedic surgeons with 10 years of experience. All participants independently graded hip radiographs from the external validation dataset using the KL grading system, blinded to both the model’s predictions and patient clinical information. The model’s predictions and the physicians’ manual grading results were then compared against the expert consensus reference standard. Accuracy, weighted F1 score, and Cohen’s kappa coefficient were used to systematically evaluate differences in grading performance between the model and the physicians, thereby assessing the model’s feasibility and reliability as a clinical decision support tool.
Statistical analysis
Continuous variables are presented as mean (range), and categorical variables are expressed as frequencies (n) and percentages (%). Model performance metrics included accuracy, precision, recall, F1 score, receiver operating characteristic (ROC) curve analysis, and area under the ROC curve (AUC), providing a comprehensive assessment of the model’s discriminatory ability in grading HOA severity.
To compare the model’s performance with that of orthopedic physicians across different experience levels in the image grading task, accuracy, precision, recall, and F1 score were evaluated. Interrater reliability among physicians was assessed using Cohen’s kappa coefficient to quantify the degree of agreement between manual graders. McNemar’s test was employed to determine statistically significant differences in grading accuracy between the model and each physician group. All statistical analyses were conducted using R software (version 4.4.2; Vienna, Austria). A p-value < 0.05 was considered statistically significant.
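For illustration, unweighted Cohen's kappa and an exact (binomial) McNemar test can be computed from scratch. This is a sketch with toy inputs, not the authors' R code, and the exact-binomial form is one common variant of McNemar's test (large-sample chi-square versions also exist):

```python
import numpy as np
from math import comb

def cohens_kappa(a, b, n_classes):
    """Unweighted Cohen's kappa between two raters' integer labels."""
    cm = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        cm[i, j] += 1
    cm /= cm.sum()
    po = np.trace(cm)                       # observed agreement
    pe = cm.sum(axis=1) @ cm.sum(axis=0)    # agreement expected by chance
    return (po - pe) / (1 - pe)

def mcnemar_exact(b, c):
    """Exact McNemar test on the discordant counts b and c
    (cases where exactly one of model/reader is correct)."""
    n, k = b + c, min(b, c)
    # two-sided binomial tail with p = 0.5
    p = sum(comb(n, i) for i in range(k + 1)) * 2 / 2 ** n
    return min(p, 1.0)

# Toy inputs: five paired gradings, and hypothetical discordant counts
kappa = cohens_kappa([0, 1, 2, 2, 3], [0, 1, 2, 1, 3], n_classes=5)
p_value = mcnemar_exact(b=12, c=30)
```

A small `p_value` here would indicate that the two graders' accuracies differ more than chance alone explains, which is how the model–physician accuracy comparisons above are tested.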
Results
Baseline characteristics
Table 1 summarizes the baseline characteristics of the study population from the three participating hospitals. Within the internal dataset, Hospital #1 contributed 8,357 participants, encompassing 13,394 hip radiographs, with a mean age of 53.85 years; 3,527 (42.20%) were men and 4,830 (57.80%) were women. Hospital #2 contributed 5,609 participants, comprising 7,351 hip radiographs, with a mean age of 55.69 years; 2,265 (40.38%) were men and 3,344 (59.62%) were women. In the internal dataset from both centers, KL grade 0 was the most prevalent, accounting for more than half of all gradings. In the external dataset, Hospital #3 contributed 1,256 participants, encompassing 1,928 hip radiographs, with a mean age of 55.35 years; 548 (43.63%) were men and 708 (56.37%) were women. Unlike the internal dataset, the external dataset showed a more balanced distribution between KL grade 0 and KL grade 1 radiographs. The OAI dataset included 632 participants (381 men, 251 women) with 1,249 hips and a mean age of 62.28 years; similarly, KL grade 0 accounted for more than half of the cases. Table S1 shows the KL grade distribution across the training and test datasets.
Table 1.
Demographic characteristics of participants.
| Characteristics | Internal: Hospital #1 | Internal: Hospital #2 | External: Hospital #3 | External: OAI dataset |
|---|---|---|---|---|
| No. of participants | 8,357 | 5,609 | 1,256 | 632 |
| No. of hips | 13,394 | 7,351 | 1,928 | 1,249 |
| Mean age (range) | 53.85 (18–94) | 55.69 (18–98) | 55.35 (18–88) | 62.28 (41–78) |
| Sex (%) | | | | |
| Men | 3,527 (42.20) | 2,265 (40.38) | 548 (43.63) | 381 (60.28) |
| Women | 4,830 (57.80) | 3,344 (59.62) | 708 (56.37) | 251 (39.72) |
| KL grades (%) | | | | |
| 0 | 7,567 (56.50) | 3,783 (51.46) | 767 (39.78) | 769 (61.57) |
| 1 | 3,776 (28.19) | 2,289 (31.14) | 726 (37.66) | 295 (23.62) |
| 2 | 1,556 (11.62) | 957 (13.02) | 234 (12.14) | 167 (13.37) |
| 3 | 294 (2.20) | 198 (2.69) | 57 (2.96) | 11 (0.88) |
| 4 | 201 (1.50) | 124 (1.69) | 44 (2.28) | 7 (0.56) |
| Total hips | 20,745 (internal combined) | | 3,177 (external + OAI combined) | |
No., number; KL, Kellgren–Lawrence.
Model performance
The validated model was applied to the automated classification of HOA severity based on the KL grading system and systematically evaluated using two independent test datasets. In the internal test set, the model achieved an overall accuracy of 90.83% (95% CI: 89.96–91.72). The macro-averaged AUC, precision, recall, and F1 score were 0.94, 89.35% (95% CI: 87.33–91.21), 91.42% (95% CI: 89.77–92.87), and 90.21% (95% CI: 88.44–91.83), respectively (Figure 3A,B; Table S2). Across individual KL grades, model performance remained consistently high. For KL grade 0, the model demonstrated an AUC of 0.94, with precision, recall, and F1 score of 93.57% (95% CI: 92.60–94.56), 95.73% (95% CI: 94.89–96.52), and 94.63% (95% CI: 93.95–95.29), respectively. KL grade 1 yielded an AUC of 0.88, with corresponding metrics of 89.57% (95% CI: 87.60–91.29), 80.47% (95% CI: 78.31–82.75), and 84.77% (95% CI: 83.15–86.34). For KL grade 2, the AUC was 0.95, with corresponding metrics of 81.72% (95% CI: 78.32–84.74), 93.01% (95% CI: 90.62–95.16), and 86.99% (95% CI: 84.63–89.19). KL grade 3 achieved an AUC of 0.95, with precision of 91.93% (95% CI: 86.54–96.74), recall of 90.89% (95% CI: 84.78–96.04), and an F1 score of 91.36% (95% CI: 87.06–95.19). For KL grade 4, performance further improved, with an AUC of 0.98 and respective precision, recall, and F1 scores of 89.97% (95% CI: 82.86–96.05), 96.99% (95% CI: 92.42–100.00), and 93.31% (95% CI: 88.46–97.10) (Figure 3A,B; Table S2). Importantly, the model’s classification errors were predominantly confined to adjacent KL grades, indicating that misclassifications rarely occurred across non-sequential severity levels. A small number of exceptions were observed; notably, 11 radiographs graded as KL grade 0 by expert consensus were misclassified by the model as KL grade 2 (Figure 3C).
Figure 3.
Model performance on the internal test dataset. (A) Receiver operating characteristic curves for the hip osteoarthritis Kellgren–Lawrence classification task. (B) The bar charts and radar charts demonstrate the different performances of the model for the hip osteoarthritis Kellgren–Lawrence grading classification. (C) Confusion matrix of the classification results. The strong diagonal indicates high classification accuracy across all classes, with rows representing true labels and columns representing predicted labels.
The model exhibited robust generalizability in the external test set, achieving 86.67% (95% CI: 85.11–88.12) accuracy and macro-averaged AUC, precision, recall, and F1 score values of 0.93, 87.58% (95% CI: 84.99–89.89), 87.78% (95% CI: 87.61–91.65), and 88.55% (95% CI: 86.26–90.67), respectively (Figure 4A,B; Table S2). Performance remained consistent across individual KL grades. For KL grade 0, the model achieved an AUC of 0.91, with precision, recall, and F1 scores of 88.58% (95% CI: 86.31–90.68), 89.96% (95% CI: 87.78–92.09), and 89.26% (95% CI: 87.45–90.80), respectively. KL grade 1 yielded an AUC of 0.86, with a precision of 86.43% (95% CI: 83.63–89.09), a recall of 79.88% (95% CI: 77.06–82.69), and an F1 score of 83.02% (95% CI: 80.91–85.11). For KL grade 2, performance improved, with an AUC of 0.94, a precision of 82.05% (95% CI: 78.15–85.46), a recall of 91.90% (95% CI: 88.98–94.84), and an F1 score of 86.68% (95% CI: 84.03–89.12). KL grade 3 achieved an AUC of 0.95, with precision of 89.39% (95% CI: 80.88–96.88), recall of 89.44% (95% CI: 80.77–96.43), and an F1 score of 89.33% (95% CI: 83.19–94.83). The highest performance was observed for KL grade 4, with an AUC of 0.99 and corresponding metrics of 91.46% (95% CI: 82.35–98.11) for precision, 97.73% (95% CI: 92.16–100.00) for recall, and 94.44% (95% CI: 88.57–98.82) for F1 score (Figure 4A,B; Table S2). Consistent with findings from the internal dataset, the majority of classification errors were confined to adjacent KL grades (Figure 4C), reinforcing the model’s reliability in differentiating between gradations of disease severity. These results highlight the CNN model’s strong discriminatory capability and its potential for generalizable application across diverse clinical environments and imaging conditions.
Figure 4.
Model performance on the external test dataset. (A) Receiver operating characteristic curves for the hip osteoarthritis Kellgren–Lawrence classification task. (B) The bar charts and radar charts demonstrate the different performances of the model for the hip osteoarthritis Kellgren–Lawrence grading classification. (C) Confusion matrix of the classification results. The strong diagonal indicates high classification accuracy across all classes, with rows representing true labels and columns representing predicted labels.
In the OAI dataset, the model performance further declined, possibly due to different imaging protocols or poorer image quality. The model achieved an accuracy of 82.29% (95% CI: 80.22–84.39), with macro-averaged values of 0.90 for AUC, 80.99% (95% CI: 72.81–86.99) for precision, 87.16% (95% CI: 82.39–90.12) for recall, and 83.59% (95% CI: 76.43–88.17) for F1 score (Figure 5A,B; Table S2). Among all grades, KL grade 1 showed the largest decline, with an AUC of 0.78, precision of 62.40% (95% CI: 56.50–67.58), recall of 69.08% (95% CI: 63.97–74.27), and F1 score of 65.53% (95% CI: 60.93–69.89). Similarly, most grading errors were concentrated between adjacent grades, particularly between KL grade 0 and KL grade 1 (Figure 5C).
Figure 5.
Model performance on the Osteoarthritis Initiative dataset. (A) Receiver operating characteristic curves for the hip osteoarthritis Kellgren–Lawrence classification task. (B) The bar charts and radar charts demonstrate the different performances of the model for the hip osteoarthritis Kellgren–Lawrence grading classification. (C) Confusion matrix of the classification results. The strong diagonal indicates high classification accuracy across all classes, with rows representing true labels and columns representing predicted labels.
Performance comparison between model and experts on external dataset
Using the external test dataset, a comprehensive evaluation was conducted to compare the performance of the model against orthopedic surgeons of varying clinical seniority in KL grading. The model significantly outperformed attending and resident orthopedic surgeons in accuracy (all p < 0.001) and showed performance comparable to deputy chief orthopedic surgeons: no statistically significant difference was found relative to one deputy chief surgeon (p = 0.271), whereas a significant difference was observed relative to the other (p = 0.043) (Table S3). Figures S1–S6 present ROC curves (A) and confusion matrices (B) for the six surgeons, illustrating differences in classification preferences and recognition capability across KL grades. Inter-rater reliability for KL grading among the six surgeons showed substantial agreement, with most discrepancies occurring between adjacent grades (Figure S7). Misclassifications spanning two or more grades were rare, indicating clinically acceptable consistency.
Model-assisted expert performance on external dataset
Two weeks later, the six physicians independently repeated the classifications with model assistance. Both resident orthopedic surgeons showed significant improvements in accuracy (from 74.60% [95% CI: 72.72–76.50] and 70.59% [95% CI: 68.46–72.56] to 81.31% [95% CI: 79.56–83.09] and 80.44% [95% CI: 78.58–82.16]), as did both attending orthopedic surgeons (from 78.87% [95% CI: 76.97–80.76] and 76.60% [95% CI: 74.48–78.48] to 84.63% [95% CI: 82.99–86.10] and 85.28% [95% CI: 83.71–86.88]) (Table S3; Figure 6A). Similar improvements were observed for macro-AUC, macro-precision, macro-recall, and macro-F1 (Table S3; Figures S1C,D–S6C,D). Notably, accuracy for both deputy chief orthopedic surgeons also increased (from 85.44% [95% CI: 83.87–86.98] and 84.38% [95% CI: 82.78–85.94] to 90.77% [95% CI: 89.42–92.12] and 88.60% [95% CI: 87.14–89.99]); their classification performance in collaboration with the model was significantly superior to their independent performance (Table S3; Figure 6B). Concordance analysis further showed that diagnostic agreement among the six surgeons improved overall with model assistance (Figure S7). These findings demonstrate the stability and consistency of the model and indicate potential clinical value for diagnostic assistance. The detailed classification of each KL grade by the six surgeons, independently and with model assistance, is shown in Table S4.
Figure 6.
Performance change for six different levels of orthopedic surgeons with model assistance. (A) Changes in six key diagnostic metrics for six surgeons (two from each seniority level: Resident, Attending, Deputy Chief) before and after using the model. The plots consistently show performance gains with model assistance across all metrics and experience levels. (B) Comparison of Macro-average ROC curves for each seniority group, with and without AI assistance. The improved Macro-AUC values in all three plots highlight the model’s positive impact on diagnostic accuracy. Abbreviations: ROC, Receiver operating characteristic; AUC, Area under the curve.
Model interpretation and visualization
To further assess the model’s capacity to recognize radiographic features relevant to osteoarthritis, an interpretability analysis was conducted using Grad-CAM to visualize the model’s decision-making process. Heatmaps generated by Grad-CAM were superimposed on the original hip radiographs to highlight regions of interest identified by the model. As illustrated in Figure S8, the model predominantly focused on areas exhibiting abnormal changes in joint space and adjacent bony structures—key radiographic indicators strongly associated with KL grading of hip osteoarthritis. These findings support the model’s strong clinical interpretability and alignment with radiological diagnostic patterns. Conversely, Figure S9 displays Grad-CAM heatmaps for misclassified cases, where the model’s attention was partially directed toward non-critical anatomical regions or background areas, suggesting potential causes for grading errors. These results underscore both the model’s capacity to identify relevant pathological features and the importance of interpretability in understanding and improving model performance in clinical applications. It should be noted that the heatmaps display the areas on which the neural network focused its attention when making a prediction and do not provide causal explanations for the model’s decisions.
Discussion
Globally, standard anteroposterior hip radiographs remain the preferred imaging modality for the initial diagnosis, severity assessment, and routine follow-up of HOA [9]. Compared with computed tomography (CT) and magnetic resonance imaging (MRI), radiography offers several advantages, including ease of acquisition, low cost, rapid examination, and relatively low radiation exposure, making it an integral component of routine clinical workflows. Critically, the most widely adopted international standard—the KL grading system—is based on radiographic findings, with its core assessment criteria readily identifiable on standard X-rays [28]. Although MRI provides superior visualization of early pathological changes such as cartilage degeneration, synovial inflammation, and bone marrow edema, and CT offers enhanced assessment of osseous structures, conventional radiographs adequately capture the key features necessary for grading HOA severity according to the KL system. As such, the development of an AI model capable of automatically detecting and accurately interpreting these radiographic features is well aligned with current clinical practice. It holds significant promise for improving diagnostic efficiency, reducing interobserver variability, and enabling more consistent assessment of disease severity in both clinical and research settings.
Using standard anteroposterior hip radiographs, we developed and validated a fully automated deep learning model for KL grading of HOA, demonstrating strong performance across a large and diverse multicenter dataset. In a completely independent external validation dataset, the model achieved an overall accuracy of 86.67% and an AUC of 0.93—comparable to the grading performance of deputy chief orthopedic surgeons. In the OAI dataset, model accuracy decreased further to 82.29%. A key strength of this study lies in its multicenter design, which incorporates heterogeneity in geographic regions, imaging protocols, equipment parameters, and patient demographics, thereby enhancing the model’s generalizability. This automated, objective, and efficient grading tool has the potential to compensate for interobserver variability in manual assessment, providing reliable support for standardized management of HOA and large-scale epidemiological research.
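The accuracy figures above are reported with 95% confidence intervals. The exact interval method is not stated in this section; a common choice for a fixed test set is a percentile bootstrap over the evaluated hips, sketched below (function name and parameters are illustrative assumptions, not the study's implementation).

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Point estimate of accuracy plus a percentile-bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)              # resample hips with replacement
        accs[b] = np.mean(y_true[idx] == y_pred[idx])
    point = float(np.mean(y_true == y_pred))
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return point, float(lo), float(hi)

# toy example: 100 graded hips, 90 predicted correctly
y_true = np.array([0] * 100)
y_pred = np.array([0] * 90 + [1] * 10)
point, lo, hi = bootstrap_accuracy_ci(y_true, y_pred)
```

The same resampling loop extends directly to per-grade sensitivity or AUC by swapping in the relevant metric.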
Although deep learning has advanced substantially in the field of OA imaging—particularly in automated grading of knee osteoarthritis (KOA) [13–16]—research on HOA has progressed at a comparatively slower pace. Early studies related to HOA often focused on binary classification tasks and were typically limited to single-center datasets, thereby constraining their clinical applicability and generalizability [29,30]. More recent efforts have explored finer-grained, multi-task, or multicenter approaches to enhance model robustness and clinical relevance. For instance, von Schacky et al. [21] developed a model capable of assessing five independent radiographic features of HOA using the OAI dataset as well as external datasets, thereby improving interpretability. However, their approach did not provide overall KL grading, which remains the most widely used standard in clinical decision-making. In another study, Chen et al. [31] proposed a unified model to simultaneously identify HOA and osteonecrosis of the femoral head. While their results demonstrated the potential of AI to address multiple hip pathologies, the model was not specifically optimized for accurate KL grading of HOA or for generalizability across multiple clinical centers.
Objective and consistent KL grading of HOA is essential for implementing a stepped-care treatment approach, moving beyond simplistic binary classification tasks. Given the inherent interobserver and intraobserver variability associated with traditional radiographic assessment, CNN models offer the potential to deliver highly consistent, objective, and reproducible quantitative evaluations, thereby substantially improving diagnostic accuracy and reliability [32,33]. Enhanced grading accuracy is critical in clinical practice, enabling clinicians to more effectively stratify patients across treatment pathways—ranging from lifestyle modification and conservative management in early-stage disease to pharmacological or injection therapies for moderate disease and ultimately to surgical decision-making, such as joint replacement, in severe cases [34,35]. Moreover, CNN models can markedly increase the efficiency of radiographic interpretation, particularly when applied to large-scale datasets, making them valuable tools in epidemiological surveillance, clinical trial evaluation, and long-term monitoring of disease progression [32,33]. Importantly, through autonomous feature learning, CNNs may also identify deeper radiographic patterns not captured by traditional KL criteria, potentially uncovering novel prognostic indicators or symptom-related features.
Another key finding of this study is the model’s consistent performance across the full spectrum of KL grades. The CNN model achieved particularly high accuracy in KL grades 0 and 4, where radiographic features tend to be more pronounced and diagnostically distinct. Importantly, even in intermediate grades (KL grades 1 to 3), where grading boundaries are inherently more ambiguous and inter-clinician variability is more prevalent, the model demonstrated strong discriminative performance. Interpretability analysis using Grad-CAM further revealed that the model’s attention was predominantly directed toward clinically relevant anatomical structures—such as joint space narrowing, acetabular margins, and osteophyte formation. This concordance with the interpretive focus of radiologists and orthopedic surgeons enhances both the transparency and clinical credibility of the model’s decision-making process.
Despite the promising results, this study has several limitations. First, as a retrospective study, it is inherently subject to potential selection bias and incomplete data capture. Second, although the reference standard was established through consensus among multiple experienced experts, it remains constrained by only moderate interobserver agreement, a known limitation of the KL grading system [36]. Third, Grad-CAM improved transparency by highlighting model-attended regions that typically corresponded to KL-relevant structures. However, Grad-CAM is sensitive to layer selection, preprocessing, and model confidence and should not be over-interpreted as causal attribution. In our expert review, correctly classified cases showed high qualitative concordance between heatmaps and KL hallmarks, whereas misclassifications more often displayed dispersed or off-target attention. We therefore position Grad-CAM as an interpretability aid that complements, but does not replace, formal performance validation. Fourth, the dataset primarily included patients with primary HOA; as such, the generalizability of the model to secondary HOA requires further validation. Fifth, this study utilized PNG format images without standardizing physical resolution. Different imaging equipment and protocols may impact model performance, though our model demonstrated good generalization ability and robustness on the independent external test dataset. Finally, there was an imbalance in sample sizes across KL grades. Although this was addressed through weighted loss functions during model training, the class imbalance may have influenced predictive performance for certain grades. Nevertheless, the observed sample distribution closely reflects the true epidemiological profile of HOA in routine clinical settings. This real-world representativeness not only strengthens the clinical relevance of the model but also enhances its utility in managing commonly encountered pathological presentations.
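The class imbalance noted above was addressed with weighted loss functions during training. The exact weighting scheme is not specified in this section; a standard option is inverse-frequency class weighting in the cross-entropy loss, sketched below in NumPy on a toy batch (all names are illustrative, and a PyTorch pipeline would typically pass such weights to `torch.nn.CrossEntropyLoss(weight=...)` instead).

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes=5):
    """Weights inversely proportional to class frequency (KL grades 0-4),
    so under-represented grades contribute more to the loss."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy, normalized by the total sample weight."""
    logits = logits - logits.max(axis=1, keepdims=True)         # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return float((w * per_sample).sum() / w.sum())

# toy batch: KL grade 4 is the rarest, so it receives the largest weight
labels = np.array([0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4])
weights = inverse_frequency_weights(labels)
loss = weighted_cross_entropy(np.zeros((len(labels), 5)), labels, weights)
```

With uniform (all-zero) logits the weighted loss equals log(5) regardless of the weights, which makes the toy batch a convenient sanity check.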
Future research should focus on conducting prospective, multicenter clinical validation studies to further test the model’s applicability and stability across healthcare systems of different tiers and geographic regions. It is also necessary to investigate the extent to which the model enhances physician diagnoses in real-world clinical settings, physician acceptance of the model, and its actual impact on diagnostic and treatment efficiency. Through multidisciplinary collaboration, we hope this automated tool will contribute positively to improving the accessibility, standardization, and efficiency of HOA diagnosis and to reducing healthcare disparities across regions and populations.
Conclusion
In summary, this multicenter study developed and validated an automated deep learning model based on the ResNet-50 architecture with an integrated Convolutional Block Attention Module (CBAM), which demonstrates accurate and consistent KL grading of HOA from standard radiographs. This objective, efficient, and reproducible tool shows potential to support future clinical decision-making and may contribute to the standardization of radiographic assessment. Further rigorous external validation, interpretability analyses, and real-world studies will be required before considering clinical deployment. Meanwhile, the model’s scalability and robustness across diverse datasets suggest possible utility in large-scale epidemiologic studies and clinical trials, supporting the advancement of standardized, AI-assisted evaluation in OA research.
Supplementary Material
Acknowledgments
We gratefully acknowledge the participating hospitals and the Osteoarthritis Initiative for providing the data used in this study. We also thank Y.H.L., a student majoring in Artificial Intelligence at Jilin University, for valuable assistance in constructing the model. Finally, we thank the review panel members for their help during the human–machine comparison experiments using the external test dataset.
Funding Statement
This work was supported by the Key Project of the National Natural Science Foundation of China (No. U21A20390), National Natural Science Foundation of China (No. 82472620), Department of Finance of Jilin Province (No. 2023SCZ69), Jilin Province Development and Reform Commission (No. 2023C039-3) and Jilin Provincial Scientific and Technological Development Program (No. 20230203089SF).
Disclosure statement
The authors declare that they have no conflicts of interest related to this article.
Ethics approval
The study protocol was approved by the Ethical Institutional Review Board of the Second Hospital of Jilin University (No. 2025-127) and was ethically endorsed by the participating centers. The Ethical Institutional Review Board of the three hospitals waived the requirement for patient informed consent due to the retrospective design and the use of anonymized data. All procedures were performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments.
Generative artificial intelligence
ChatGPT (GPT-4o) was used only to polish the grammar of the manuscript.
Patient consent for publication
Not applicable.
Role of the funder/sponsor
The funding organizations had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Data availability statement
Due to patient privacy concerns, the radiographic datasets generated and analyzed during this study are not publicly available. However, de-identified data may be available from the corresponding author upon reasonable request and with ethical review board approval.
References
- 1. Katz JN, Arant KR, Loeser RF. Diagnosis and treatment of hip and knee osteoarthritis: a review. JAMA. 2021;325(6):568–578. doi: 10.1001/jama.2020.22171.
- 2. GBD 2021 Osteoarthritis Collaborators. Global, regional, and national burden of osteoarthritis, 1990–2020 and projections to 2050: a systematic analysis for the Global Burden of Disease Study 2021. Lancet Rheumatol. 2023;5(9):e508–e522. doi: 10.1016/S2665-9913(23)00163-7.
- 3. Chen H, Zhang L, Shi X, et al. Evaluation of osteoarthritis disease burden in China during 1990–2019 and forecasting its trend over the future 25 years. Arthritis Care Res (Hoboken). 2024;76(7):1006–1017. doi: 10.1002/acr.25322.
- 4. Dieleman JL, Baral R, Birger M, et al. US spending on personal health care and public health, 1996–2013. JAMA. 2016;316(24):2627–2646. doi: 10.1001/jama.2016.16885.
- 5. Woo J, Lau E, Lau CS, et al. Socioeconomic impact of osteoarthritis in Hong Kong: utilization of health and social services, and direct and indirect costs. Arthritis Rheum. 2003;49(4):526–534. doi: 10.1002/art.11198.
- 6. Hannon CP, Delanois RE, Nandi S, et al.; Management of Osteoarthritis of the Hip Work Group; Staff of the American Academy of Orthopaedic Surgeons. American Academy of Orthopaedic Surgeons clinical practice guideline summary: management of osteoarthritis of the hip. J Am Acad Orthop Surg. 2024;32(20):e1027–e1034. doi: 10.5435/JAAOS-D-24-00420.
- 7. Chen N, Feng Z, Li F, et al. A fully automatic target detection and quantification strategy based on object detection convolutional neural network YOLOv3 for one-step X-ray image grading. Anal Methods. 2023;15(2):164–170. doi: 10.1039/d2ay01526a.
- 8. Mourad C, Vande Berg B. Osteoarthritis of the hip: is radiography still needed? Skeletal Radiol. 2023;52(11):2259–2270. doi: 10.1007/s00256-022-04270-8.
- 9. Walsh PJ, Walz DM. Imaging of osteoarthritis of the hip. Radiol Clin North Am. 2022;60(4):617–628. doi: 10.1016/j.rcl.2022.03.005.
- 10. Hunter CW, Deer TR, Jones MR, et al. Consensus guidelines on interventional therapies for knee pain (STEP guidelines) from the American Society of Pain and Neuroscience. J Pain Res. 2022;15:2683–2745. doi: 10.2147/JPR.S370469.
- 11. Brejnebøl MW, Lenskjold A, Ziegeler K, et al. Interobserver agreement and performance of concurrent AI assistance for radiographic evaluation of knee osteoarthritis. Radiology. 2024;312(1):e233341. doi: 10.1148/radiol.233341.
- 12. Olsson S, Akbarian E, Lind A, et al. Automating classification of osteoarthritis according to Kellgren–Lawrence in the knee using deep learning in an unfiltered adult population. BMC Musculoskelet Disord. 2021;22(1):844. doi: 10.1186/s12891-021-04722-7.
- 13. Pi S-W, Lee B-D, Lee MS, et al. Ensemble deep-learning networks for automated osteoarthritis grading in knee X-ray images. Sci Rep. 2023;13(1):22887. doi: 10.1038/s41598-023-50210-4.
- 14. Yoon JS, Yon C-J, Lee D, et al. Assessment of a novel deep learning-based software developed for automatic feature extraction and grading of radiographic knee osteoarthritis. BMC Musculoskelet Disord. 2023;24(1):869. doi: 10.1186/s12891-023-06951-4.
- 15. Norman B, Pedoia V, Noworolski A, et al. Applying densely connected convolutional neural networks for staging osteoarthritis severity from plain radiographs. J Digit Imaging. 2019;32(3):471–477. doi: 10.1007/s10278-018-0098-3.
- 16. Tiulpin A, Thevenot J, Rahtu E, et al. Automatic knee osteoarthritis diagnosis from plain radiographs: a deep learning-based approach. Sci Rep. 2018;8(1):1727. doi: 10.1038/s41598-018-20132-7.
- 17. Shen X, Luo J, Tang X, et al. Deep learning approach for diagnosing early osteonecrosis of the femoral head based on magnetic resonance imaging. J Arthroplasty. 2023;38(10):2044–2050. doi: 10.1016/j.arth.2022.10.003.
- 18. Shen X, He Z, Shi Y, et al. Development and validation of an automated classification system for osteonecrosis of the femoral head using deep learning approach: a multicenter study. J Arthroplasty. 2024;39(2):379–386.e2. doi: 10.1016/j.arth.2023.08.018.
- 19. Badgeley MA, Zech JR, Oakden-Rayner L, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit Med. 2019;2(1):31. doi: 10.1038/s41746-019-0105-1.
- 20. Cheng C-T, Ho T-Y, Lee T-Y, et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol. 2019;29(10):5469–5477. doi: 10.1007/s00330-019-06167-y.
- 21. von Schacky CE, Sohn JH, Liu F, et al. Development and validation of a multitask deep learning model for severity grading of hip osteoarthritis features on radiographs. Radiology. 2020;295(1):136–145. doi: 10.1148/radiol.2020190925.
- 22. Chee CG, Kim Y, Kang Y, et al. Performance of a deep learning algorithm in detecting osteonecrosis of the femoral head on digital radiography: a comparison with assessments by radiologists. AJR Am J Roentgenol. 2019;213(1):155–162. doi: 10.2214/AJR.18.20817.
- 23. Li Y, Li Y, Tian H. Deep learning-based end-to-end diagnosis system for avascular necrosis of femoral head. IEEE J Biomed Health Inform. 2021;25(6):2093–2102. doi: 10.1109/JBHI.2020.3037079.
- 24. Zheng Z, Ryu BY, Kim SE, et al. Deep learning for automated hip fracture detection and classification: achieving superior accuracy. Bone Joint J. 2025;107-B(2):213–220. doi: 10.1302/0301-620X.107B2.BJJ-2024-0791.R1.
- 25. Li R, Wang X, Li T, et al. Deep learning-based automated measurement of hip key angles and auxiliary diagnosis of developmental dysplasia of the hip. BMC Musculoskelet Disord. 2024;25(1):906. doi: 10.1186/s12891-024-08035-3.
- 26. Altman RD, Gold GE. Atlas of individual radiographic features in osteoarthritis, revised. Osteoarthritis Cartilage. 2007;15:A1–A56. doi: 10.1016/j.joca.2006.11.009.
- 27. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 618–626.
- 28. Kellgren JH, Lawrence JS. Radiological assessment of osteo-arthrosis. Ann Rheum Dis. 1957;16(4):494–502. doi: 10.1136/ard.16.4.494.
- 29. Xue Y, Zhang R, Deng Y, et al. A preliminary examination of the diagnostic value of deep learning in hip osteoarthritis. PLoS One. 2017;12(6):e0178992. doi: 10.1371/journal.pone.0178992.
- 30. Masuda M, Soufi M, Otake Y, et al. Automatic hip osteoarthritis grading with uncertainty estimation from computed tomography using digitally-reconstructed radiographs. Int J Comput Assist Radiol Surg. 2024;19(5):903–915. doi: 10.1007/s11548-024-03087-1.
- 31. Chen C, Liu P, Feng Y, et al. Diagnostic performance for severity grading of hip osteoarthritis and osteonecrosis of femoral head on radiographs: deep learning model vs. board-certified orthopaedic surgeons. Osteoarthr Imaging. 2023;3(2):100092. doi: 10.1016/j.ostima.2023.100092.
- 32. Huang W, Randhawa R, Jain P, et al. Development and validation of an artificial intelligence-powered platform for prostate cancer grading and quantification. JAMA Netw Open. 2021;4(11):e2132554. doi: 10.1001/jamanetworkopen.2021.32554.
- 33. Ström P, Kartasalo K, Olsson H, et al. Artificial intelligence for diagnosis and grading of prostate cancer in biopsies: a population-based, diagnostic study. Lancet Oncol. 2020;21(2):222–232. doi: 10.1016/S1470-2045(19)30738-7.
- 34. Postler AE, Lützner C, Goronzy J, et al. When are patients with osteoarthritis referred for surgery? Best Pract Res Clin Rheumatol. 2023;37(2):101835. doi: 10.1016/j.berh.2023.101835.
- 35. Hannon CP, Goodman SM, Austin MS, et al. 2023 American College of Rheumatology and American Association of Hip and Knee Surgeons clinical practice guideline for the optimal timing of elective hip or knee arthroplasty for patients with symptomatic moderate-to-severe osteoarthritis or advanced symptomatic osteonecrosis with secondary arthritis for whom nonoperative therapy is ineffective. Arthritis Care Res (Hoboken). 2023;75(11):2227–2238. doi: 10.1002/acr.25175.
- 36. Kohn MD, Sassoon AA, Fernando ND. Classifications in brief: Kellgren–Lawrence classification of osteoarthritis. Clin Orthop Relat Res. 2016;474(8):1886–1893. doi: 10.1007/s11999-016-4732-4.