Scientific Reports. 2025 Nov 25;15:41940. doi: 10.1038/s41598-025-25755-1

Deep learning framework for automated frame selection in kidney ultrasound

Amirali Seraj 1, Seyed Pedram Monazami 1, Raheleh Davoodi 1, Javad Seraj 2, Hadi Ghattan Kashani 3, Abdoulreza Sajjadian Moosavi 4, Masoud Shariat Panahi 5
PMCID: PMC12647705  PMID: 41290740

Abstract

Manual selection of optimal frames from kidney ultrasound videos is a time-consuming and subjective process that can introduce variability into clinical assessments. This study presents a fully automated deep learning–based framework designed to identify the most diagnostically informative frames, thereby enhancing the efficiency and consistency of kidney ultrasound interpretation. A curated dataset of 1,203 frames from 211 patients was constructed and annotated by clinical experts into three quality-based categories: Good, Bad, and Null. Multiple convolutional neural network models–including InceptionV3, ResNet34/50, EfficientNet, VGG16, YOLOv8x-cls and YOLO11x-cls–were trained and systematically compared for the task of frame classification. The YOLO11x-cls model, optimized using multi-class cross-entropy loss and evaluated through 5-fold patient-level cross-validation, consistently outperformed the baseline architectures. It achieved perfect classification metrics (F1-score of 100%) on the Good class. Additionally, YOLO11x-cls attained the highest average cross-validation accuracy (90%) with minimal performance variance across folds. These results highlight the potential of the proposed YOLO–based pipeline as a robust and efficient solution for automated best-frame selection in kidney ultrasound imaging. The method holds promise for integration into clinical workflows, where it can reduce manual effort and improve diagnostic reliability and reproducibility.

Keywords: Kidney ultrasound, Frame selection, Deep learning, YOLOv8x-cls, Medical image classification, Diagnostic automation

Subject terms: Ultrasonography, Biomedical engineering

Introduction

Ultrasound imaging plays a vital role in the assessment of kidney function and structure, offering a radiation-free, low-cost, and non-invasive diagnostic modality1. Accurate evaluation of kidney size, shape, and echogenicity through ultrasound is central to diagnosing chronic kidney disease (CKD), detecting congenital anomalies, and identifying lesions such as cysts and stones2–4. However, the quality and interpretability of kidney ultrasound exams critically depend on selecting optimal frames from lengthy video sequences.

Currently, frame selection in kidney ultrasound is a manual, time-intensive task performed by radiologists. Each exam may contain hundreds of frames, of which only a few are clinically informative1. This manual process is laborious, introduces inter-observer variability, and hampers the scalability of computer-aided diagnostic systems. In many existing deep learning-based pipelines, models rely on pre-selected, high-quality images provided by experts, making automation incomplete and limiting real-time applications3.

Automating best-frame selection is thus essential to streamline clinical workflows and fully enable AI-driven kidney ultrasound interpretation. Prior research in other domains such as breast and lung ultrasound has demonstrated the value of automated frame selection using video summarization algorithms, including unsupervised clustering5,6, motion-based heuristics7, and deep reinforcement learning (RL)8,9. In particular, Huang et al.8 proposed an RL-based keyframe selector for breast ultrasound, while Torti et al.9 applied clustering and convolutional neural network (CNN)-based techniques to lung ultrasound.

Yet, despite the success of such approaches in other applications, few studies have addressed the challenge of automated frame selection in kidney ultrasound specifically. Even recent advances in automated kidney diagnosis–whether for CKD2,10–12, congenital anomalies13, or function prediction14–continue to rely on manually chosen images. Although prior research in kidney ultrasound has made significant strides in segmentation and classification, most existing models rely heavily on pre-selected, high-quality frames and do not address the automation of frame selection. To the best of our knowledge, our work is the first to systematically benchmark multiple deep learning architectures, including the YOLO11x-cls15 classifier, for the task of fully automated key-frame selection in kidney ultrasound. Our study introduces a streamlined and clinically aligned pipeline that eliminates manual pre-selection, enabling real-time and scalable deployment in nephrology imaging. This contribution marks a fundamental shift from semi-automated image analysis toward true end-to-end automation in ultrasound workflows.

In this study, we propose a fully automated deep learning framework to classify and select diagnostically relevant frames from kidney ultrasound videos. We curate a labeled dataset from 211 patients and categorize frames into three quality-based classes: Good, Bad, and Null. We benchmark multiple CNN-based classifiers and demonstrate that a member of the You Only Look Once (YOLO) family, the YOLO11x-cls architecture, outperforms classical models in accuracy and generalization.

The key contributions of this work are:

  • We construct and release a dataset, Kidney-ultrasound-cls, with expert-annotated labels reflecting diagnostic frame quality.

  • We evaluate and compare a range of state-of-the-art convolutional neural network architectures for ultrasound image classification.

  • We show that YOLO11x-cls achieves superior classification performance, offering a reliable tool for automatic best-frame extraction.

Our approach enables end-to-end automation of kidney ultrasound analysis, eliminating the need for manual frame curation and laying the groundwork for scalable AI-driven nephrological tools.

Related work

Automated frame selection in medical ultrasound

Frame selection plays a foundational role in automated video-based diagnosis, particularly for ultrasound imaging. Early studies applied unsupervised and heuristic-based approaches–such as pixel-level motion analysis and feature clustering–to identify keyframes5,6. Although efficient, these methods were limited in their ability to capture clinically relevant patterns.

More recently, deep learning–based techniques have emerged for intelligent frame selection. Huang et al.8 utilized reinforcement learning to optimize keyframe extraction in breast ultrasound. Similarly, Morshed et al.16 applied CNN-based clustering in lung ultrasound, improving classification performance. RL-based summarization has also shown promise in self-supervised contexts17. Despite these advancements, most efforts have focused on breast and lung ultrasound; the kidney domain remains underexplored. Yin et al.13 proposed a multi-instance learning framework for pediatric kidney images; to our knowledge, theirs is the only prior study to explore frame-level learning in kidney ultrasound, and it did not explicitly address automated keyframe extraction.

Kuo et al.3 applied multi-frame aggregation, yet relied on manual pre-selection. Thus, a gap remains in fully automated frame selection for kidney ultrasound–a critical step this study seeks to address.

GPU-accelerated keyframe selection for pulmonary ultrasound9 and kidney anomaly classification using multi-frame aggregation3 point to growing interest in this space.

Deep learning for kidney ultrasound analysis

Deep learning has demonstrated strong capabilities in analyzing kidney ultrasound images. Yu and Wu18 and Sudharson et al.4 used CNN and hybrid CNN-SVM models to classify kidney abnormalities with accuracies up to 97%. Kuo et al.3 combined image features and structured clinical data for renal function prediction. Su et al.19 further improved CKD classification by fusing image-based CNN features with patient-level metadata. These studies confirm the diagnostic potential of deep models in nephrology. However, they all depend on high-quality input frames–often manually curated–highlighting the importance of robust, automated frame selection.

Kidney ultrasound segmentation and multi-task learning

Kidney segmentation is another critical task in ultrasound imaging, aiding in structure localization and quantitative analysis. Several recent models have pushed segmentation accuracy toward clinical usability. For instance, multi-branch aware network (MBANet)20 and Fast-Unet++21 achieved Dice scores exceeding 0.95. Hybrid architectures, such as DDTransUNet22, a hybrid network combining Transformer and CNN, with a dual-branch encoder and dual attention mechanism for ultrasound image segmentation, offer promising improvements in spatial precision. Additionally, semi-supervised and generative adversarial network (GAN)-augmented frameworks23 have been proposed to reduce annotation burden. Karimi et al.24 applied a similar hybrid framework to spleen ultrasound.

Integrating these techniques into end-to-end pipelines remains an active area of research. Accurate frame selection can serve as a crucial preprocessing step, filtering out irrelevant or noisy frames prior to segmentation or disease prediction. This study contributes by establishing a reliable front-end system for this purpose.

In summary, while deep learning has significantly advanced kidney ultrasound interpretation, existing workflows typically rely on manually selected frames. Automated keyframe extraction in kidney imaging remains insufficiently addressed in literature. Most prior work has either focused on segmentation or classification using pre-selected frames. Our study fills this gap by proposing and validating a fully automated, YOLO11x-cls–based system that classifies ultrasound frames into diagnostically meaningful categories, thereby enabling scalable and standardized clinical pipelines. This study is, to the best of our knowledge, the first to benchmark multiple deep learning models for best-frame classification in kidney ultrasound, with expert-annotated data and quantitative validation.

Materials and methods

In this study, we propose a deep learning–based pipeline for the automatic selection of optimal frames from kidney ultrasound video sequences. The primary aim is to identify frames that are diagnostically informative and clinically valid, allowing for their use in downstream tasks such as kidney segmentation, area measurement, and disease classification. By automating this step, we aim to reduce inter-observer variability and streamline the diagnostic workflow.

Crucially, expert radiological input was integrated into the data annotation process to ensure that the model learns to distinguish clinically meaningful frames from suboptimal or irrelevant ones. This expert-informed approach enhances the clinical applicability and reliability of the system. Figure 1 illustrates the overall pipeline of our proposed framework, from frame extraction to model-based classification.

Fig. 1.

Fig. 1

Overview of the proposed deep learning framework for automatic classification of kidney ultrasound frames. The pipeline begins with frame extraction from DICOM video files, followed by manual annotation by expert reviewers into three clinically meaningful categories: Good, Bad, and Null. The annotated dataset is then used to train a YOLO11x-cls model, which learns to automatically assign new frames into the same classes based on visual features. This modular design ensures a clinically informed training process while maintaining full automation during inference. The integration of domain expertise in the labeling phase enhances the model’s diagnostic alignment, and the architecture is structured to be easily expandable for future downstream tasks such as segmentation or disease detection.

Model architecture and training setup

For the classification task, we employed YOLO as implemented by Ultralytics25, a high-capacity convolutional neural network tailored for image classification. This model was selected for its superior performance in initial experiments, particularly in capturing subtle spatial and structural cues in medical images. All input frames were resized to 224 × 224 pixels to match the model’s requirements while preserving relevant anatomical features. Training was performed using the official YOLO11 implementation without any architectural modifications. The YOLO11x-cls model was trained using standard supervised learning on the labeled kidney ultrasound dataset. To ensure stable optimization and reproducibility, we retained all training parameters from the official implementation. Table 3 provides a comprehensive overview of the hyperparameters, including input resolution, optimizer configuration, learning rate schedule, and momentum settings.

Table 3.

Training hyperparameters used for the YOLO11x-cls model in the kidney ultrasound frame classification task. The model was trained using AdamW optimizer with a warm-up strategy and momentum to ensure stable convergence and generalization across varying patient data. This table outlines the critical hyperparameters that guided the training of YOLO11x-cls. These values were chosen based on empirical best practices and were kept consistent with the official implementation to ensure reproducibility. The combination of a conservative learning rate, momentum stabilization, and weight decay was essential for optimizing model performance in a medical imaging context where overfitting is a common risk.

Parameter Value
Model YOLO11x-cls
Input image size 224 × 224
lr0 0.001
lrf 0.001
Optimizer AdamW
Momentum value 0.937
Weight decay 0.0005
Warmup epochs 3.0
Warmup momentum 0.8

Specifically, we adopted a conservative learning rate (0.000714) in conjunction with the AdamW optimizer, which decouples weight decay from gradient updates and helps prevent overfitting. A 3-epoch linear warm-up phase was applied to gradually ramp up the learning rate and avoid early training instability. The use of momentum (0.937) and warm-up momentum (0.8) further contributed to smooth convergence across all training epochs. These choices proved effective for our classification task, as reflected in the model’s strong generalization and low validation loss. The architectural design of YOLOv8x-cls, illustrated in Fig. 2, consists of a backbone network that extracts hierarchical features at multiple spatial resolutions, a Feature Pyramid Network (FPN) that combines these features across scales, and a head that produces final classification outputs. This structure enables the model to capture both global contextual patterns and local texture details–properties that are essential for distinguishing between similar-looking ultrasound frames.
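Expressed in code, the Table 3 settings map directly onto keyword arguments of the Ultralytics training API. The sketch below is illustrative, not the authors' exact script: the dataset directory name is a placeholder, and the training call assumes the `ultralytics` package and pretrained `yolo11x-cls.pt` weights are available.

```python
# Hyperparameters from Table 3, expressed as Ultralytics trainer keyword arguments.
hyp = {
    "imgsz": 224,             # input image size (224 x 224)
    "lr0": 0.001,             # initial learning rate
    "lrf": 0.001,             # final learning-rate fraction
    "optimizer": "AdamW",
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3.0,
    "warmup_momentum": 0.8,
}

def train_yolo11x_cls(data_dir="kidney-ultrasound-cls/"):
    """Launch classification training with the Table 3 settings.
    Requires the `ultralytics` package; the dataset path is a placeholder."""
    from ultralytics import YOLO
    model = YOLO("yolo11x-cls.pt")   # pretrained classification weights
    return model.train(data=data_dir, epochs=100, **hyp)
```

Keeping the hyperparameters in a single dictionary makes it straightforward to reproduce the configuration or sweep individual values without touching the training call.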

Fig. 2.

Fig. 2

Schematic overview of the YOLOv8 architecture used in this study. The model is composed of three main components: a backbone for hierarchical feature extraction, an FPN for multi-scale feature fusion, and a head for task-specific predictions. The architecture enables efficient classification by integrating low-level and high-level semantic features across spatial scales. This modular structure allows the model to maintain high sensitivity to fine-grained anatomical details in ultrasound images. The output heads are optimized using a combination of cross-entropy loss (for class prediction), L1 loss (for localization, when applicable), and objectness loss. In our classification setting, only the class-specific outputs were used31.

Although YOLO is originally designed for object detection tasks, its classification variant (YOLO11x-cls) retains the same foundational structure, replacing bounding box outputs with class probabilities. In our implementation, only the classification head and corresponding cross-entropy loss were used, while the objectness and localization heads remained inactive. This adaptation allows the architecture’s efficiency and multi-scale sensitivity to be leveraged for image-level categorization in medical ultrasound.
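Once per-frame class probabilities are available, the final best-frame choice reduces to a simple selection rule. The helper below is a hypothetical sketch (not part of the published pipeline) that keeps the most confident Good frame, assuming softmax outputs ordered (Good, Bad, Null):

```python
import numpy as np

def pick_best_frame(probs):
    """Given an (n_frames, 3) array of softmax outputs ordered
    (Good, Bad, Null), return the index of the frame with the highest
    Good-class probability, or None if no frame is classified Good."""
    probs = np.asarray(probs, dtype=float)
    preds = probs.argmax(axis=1)               # per-frame predicted class
    good = np.where(preds == 0)[0]             # frames classified as Good
    if good.size == 0:
        return None
    return int(good[probs[good, 0].argmax()])  # most confident Good frame

# Illustrative scores for three frames of one video.
scores = [[0.2, 0.7, 0.1],     # Bad
          [0.8, 0.1, 0.1],     # Good, confidence 0.8
          [0.9, 0.05, 0.05]]   # Good, confidence 0.9 -> best frame
best = pick_best_frame(scores)
```

In deployment, a top-k variant of the same rule could return several candidate frames for radiologist review rather than a single pick.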

The model was trained to minimize the multi-class cross-entropy loss, defined as:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c} \qquad (1)$$

Here, $N$ is the batch size, $C$ is the number of classes (Good, Bad, Null), $y_{i,c}$ is the one-hot encoded ground truth for sample $i$ and class $c$, and $\hat{y}_{i,c}$ is the predicted probability for class $c$ from the softmax output.
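A minimal NumPy sketch of the multi-class cross-entropy loss in Eq. (1); the probabilities below are illustrative, not model outputs:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Eq. (1): batch mean of -sum_c y_{i,c} * log(p_{i,c}).
    Probabilities are clipped to avoid log(0)."""
    p = np.clip(y_pred_probs, eps, 1.0)
    return float(-np.mean(np.sum(y_true_onehot * np.log(p), axis=1)))

# Hypothetical softmax outputs for two frames (classes: Good, Bad, Null).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.9, 0.05, 0.05],
                   [0.2, 0.7, 0.1]])
loss = cross_entropy(y_true, y_pred)   # ~0.231
```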

For optimization, the AdamW optimizer was used, which decouples weight decay from the gradient update, leading to improved regularization. The parameter update rule is given by:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_t\right) \qquad (2)$$

where $\eta$ refers to the learning rate, $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected first and second moment estimates, $\epsilon$ is a small constant for numerical stability, and $\lambda$ indicates the weight decay coefficient.

To ensure stable convergence during early training, we employed a learning rate warm-up strategy over the first $T_{\mathrm{w}} = 3$ epochs:

$$\eta_t = \eta_0 \cdot \frac{t}{T_{\mathrm{w}}}, \qquad 0 \le t \le T_{\mathrm{w}} \qquad (3)$$

After the warm-up phase, the learning rate was fixed at $\eta_0$.

Momentum was applied to accelerate convergence:

$$v_t = \mu\, v_{t-1} + \nabla_{\theta}\mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_t \qquad (4)$$

Here, $\mu$ is the momentum factor, determining the contribution of past gradients.
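The update rules above can be combined into a compact NumPy sketch. Table 3 supplies $\beta_1 = 0.937$, $\lambda = 0.0005$, $\eta_0 = 0.001$, and the 3-epoch warm-up; the second-moment decay $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ are standard AdamW defaults assumed here:

```python
import numpy as np

def warmup_lr(t, lr0=0.001, warmup_epochs=3.0):
    """Eq. (3): linear ramp from 0 to lr0 over the warm-up epochs,
    then hold the learning rate at lr0."""
    return lr0 * min(t / warmup_epochs, 1.0)

def adamw_step(theta, grad, m, v, t, lr,
               beta1=0.937, beta2=0.999, eps=1e-8, weight_decay=0.0005):
    """One AdamW update (Eq. (2)): weight decay is applied directly to the
    parameters, decoupled from the gradient-based step."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA (cf. Eq. (4))
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```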

Cross-validation and baseline comparisons

To assess the model’s generalization, we implemented 5-fold cross-validation with patient-wise partitioning. All frames from a single patient were assigned to either the training or validation set—never both—to avoid data leakage and overestimation of performance. Each fold used an 80/20 split between training and validation data. This strategy ensures that the model is evaluated on truly unseen patient data, providing a realistic estimate of clinical deployment performance.
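The patient-wise partitioning can be sketched as follows (a minimal illustration; in practice scikit-learn's `GroupKFold` provides the same leakage guarantee):

```python
import numpy as np

def patient_level_folds(patient_ids, k=5, seed=0):
    """Split frame indices into k folds such that every frame from a given
    patient lands in exactly one fold (no patient-level leakage)."""
    patient_ids = np.asarray(patient_ids)
    unique = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)                       # randomize patient order
    folds = []
    for chunk in np.array_split(unique, k):   # k disjoint patient groups
        val_idx = np.where(np.isin(patient_ids, chunk))[0]
        train_idx = np.where(~np.isin(patient_ids, chunk))[0]
        folds.append((train_idx, val_idx))
    return folds
```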

In addition to YOLO, we trained several well-established classification architectures (InceptionV326, VGG1627, ResNet34, ResNet5028, and EfficientNet29) using the same dataset and settings for fair comparison. Their training configurations and performance results are detailed in the Results section.

Metric definitions

To evaluate model performance across classes, the following metrics were computed:

  • Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$ – the proportion of correctly predicted positive samples among all predicted positives.

  • Recall (Sensitivity): $\mathrm{Recall} = \frac{TP}{TP + FN}$ – the proportion of correctly predicted positive samples among all actual positives.

  • F1-Score: Harmonic mean of precision and recall: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

  • Accuracy: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ – the overall correctness across all predictions.

These metrics provide complementary insights. For instance, high recall ensures diagnostically relevant frames are not missed, while high precision avoids including low-quality frames.
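For reference, all four metrics follow directly from per-class confusion counts. The counts below are hypothetical, chosen to mirror a 118-sample test set in which all 45 Good frames are found with no false positives:

```python
def class_metrics(tp, fp, fn, tn):
    """Per-class precision, recall, F1, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Hypothetical Good-class counts: 45 true positives, 73 correct rejections.
p, r, f1, acc = class_metrics(tp=45, fp=0, fn=0, tn=73)
```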

Dataset

The dataset utilized in this study comprises ultrasound videos of the kidney from 211 patients. No patient metadata has been retained, adhering strictly to ethical guidelines. Each image has been anonymized through the assignment of coded names and IDs. All images have been shared with explicit consent from the respective patients, ensuring full compliance with ethical standards. Patient anonymity has been rigorously maintained throughout the dataset.

Importantly, the dataset includes a heterogeneous group of patients encompassing both healthy subjects and individuals with a variety of kidney conditions, such as hydronephrosis, cystic lesions, and other structural abnormalities. This diversity ensures that the dataset reflects real-world variability in anatomical presentation and image quality. Due to strict ethical guidelines and patient confidentiality, detailed clinical metadata, including specific diagnoses and patient history, were not retained. Nevertheless, the dataset’s manual annotation by expert radiologists ensures that the frames are reliably categorized according to diagnostic usefulness. This carefully curated dataset thus balances patient privacy with the need for diversity and expert labeling, providing a solid foundation for developing robust frame quality classification models. All videos were stored in DICOM (digital imaging and communications in medicine) format, with frame counts ranging from 30 to 800 per patient, depending on the duration and scanning protocol. Frames were extracted programmatically using the pydicom and cv2 libraries.
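The extraction step can be sketched as below. The normalization helper is generic; the DICOM reader assumes `pydicom` and `opencv-python` are installed, and the file paths are placeholders:

```python
import numpy as np

def to_uint8(frame):
    """Normalize one ultrasound frame to 8-bit grayscale for saving."""
    frame = frame.astype(np.float64)
    lo, hi = frame.min(), frame.max()
    if hi == lo:                               # blank frame: avoid divide-by-zero
        return np.zeros(frame.shape, dtype=np.uint8)
    return ((frame - lo) / (hi - lo) * 255).astype(np.uint8)

def extract_frames(dicom_path, out_dir):
    """Dump every frame of a multi-frame DICOM video as a PNG.
    Requires pydicom and opencv-python; paths are placeholders."""
    import cv2
    import pydicom
    ds = pydicom.dcmread(dicom_path)
    frames = ds.pixel_array        # (n_frames, H, W) or (n_frames, H, W, 3)
    for i, frame in enumerate(frames):
        cv2.imwrite(f"{out_dir}/frame_{i:04d}.png", to_uint8(frame))
```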

Due to significant variation in image quality across frames and patients, a rigorous manual annotation protocol was adopted by an expert. Each extracted frame was labeled as one of three categories:

  • Good: Frames containing clear, well-centered, and complete kidney structures suitable for diagnostic measurement.

  • Bad: Frames with partial kidney visibility, motion blur, or poor contrast that reduce diagnostic utility.

  • Null: Frames in which the kidney is not visible or irrelevant anatomical content dominates.

Annotation was performed using the 3D Slicer platform by a trained assistant and reviewed weekly under the supervision of a senior radiologist. Ambiguous cases were discussed in collaborative review sessions to ensure consistency and accuracy of labels. The final dataset comprised 1,203 labeled frames, with the following distribution: 448 Good frames, 354 Bad frames and 401 Null frames.

Given the high redundancy within patient videos (i.e., multiple frames capturing similar views at slightly different angles), we did not apply any data augmentation, in order to preserve the authenticity and variability of real-world clinical data.

To illustrate the qualitative characteristics of each image category, Fig. 3 presents representative samples from the annotated dataset. Good frames (Fig. 3b) clearly display a full, well-contrasted kidney with diagnostic value. In contrast, Bad frames (Fig. 3a) exhibit partial kidney visibility, blur, or degraded contrast, making them suboptimal for clinical use. Null frames (Fig. 3c) contain no recognizable kidney anatomy or irrelevant content, and are thus excluded from downstream analysis.

Fig. 3.

Fig. 3

Representative examples from the labeled dataset used for training, illustrating the three image categories: (a) Bad, (b) Good, and (c) Null. (a) Bad frames depict partial visibility of the kidney, suboptimal contrast, or poor anatomical delineation. (b) Good frames show clear, centered kidneys with sufficient quality for diagnosis. (c) Null frames lack visible kidney structures and are not diagnostically usable. This figure visualizes the visual variability within the dataset and underscores the challenge of accurately classifying ambiguous frames. The Bad category, in particular, contains subtle degradations in visibility that are difficult to detect without domain knowledge–highlighting the value of expert annotation during dataset construction.

These examples reflect the inherent ambiguity in ultrasound imaging and the need for expert-informed annotation to distinguish diagnostically meaningful frames. They also highlight the importance of a robust classification model capable of learning these subtle distinctions.

Results

Classification performance

Table 1 presents a detailed breakdown of the classification performance across seven deep learning models, including both traditional CNN architectures (such as InceptionV3, ResNet variants, and VGG16) and the more recent YOLO11x-cls classifier. Each model was evaluated on a held-out test set, and results are reported separately for the Good, Bad, and Null frame categories using precision, recall, and F1-score.

Table 1.

Class-wise classification performance of seven deep learning models–InceptionV3, EfficientNet, VGG16, ResNet34, ResNet50, YOLOv8x-cls, and YOLO11x-cls–on the kidney ultrasound test set. Precision, recall, and F1-score are reported for each of the three frame classes: Good (clinically usable), Bad (suboptimal quality), and Null (no visible kidney). YOLO11x-cls achieves perfect classification on the Good class and superior balance across all categories compared to other models. This table provides a comparative overview of how each model handles the three-class classification task. Performance is evaluated not only based on overall accuracy but also in terms of how well each class is individually detected and distinguished. The YOLO11x-cls model outperforms all others, particularly in the Bad and Null classes, which are more prone to misclassification due to overlapping visual features with the Good class.

Model Class Precision (%) Recall (%) F1-Score (%)
InceptionV3 Good 98 93 95
InceptionV3 Bad 70 61 66
InceptionV3 Null 77 88 82
EfficientNet Good 73 91 81
EfficientNet Bad 64 23 33
EfficientNet Null 76 93 84
VGG16 Good 100 89 94
VGG16 Bad 73 61 67
VGG16 Null 75 93 83
ResNet34 Good 97 84 90
ResNet34 Bad 74 81 77
ResNet34 Null 89 95 92
ResNet50 Good 100 93 97
ResNet50 Bad 86 39 53
ResNet50 Null 68 100 81
YOLOv8x-cls Good 97 100 98
YOLOv8x-cls Bad 84 87 86
YOLOv8x-cls Null 93 93 93
YOLO11x-cls Good 100 100 100
YOLO11x-cls Bad 93 90 92
YOLO11x-cls Null 93 95 94

The YOLO11x-cls model demonstrated superior and balanced performance, achieving perfect scores (100%) across all metrics for the Good class, indicating flawless identification of diagnostically valuable frames. More notably, YOLO11x-cls maintained high F1-scores for the challenging Bad (92%) and Null (94%) classes, highlighting its robustness in distinguishing subtle differences in frame quality. In contrast, models like EfficientNet struggled with poor recall in the Bad category (only 23%), leading to a low F1-score (33%), which suggests that this architecture frequently misclassifies lower-quality frames. Even stronger models like ResNet50, despite achieving high precision for Bad frames (86%), suffered from significantly low recall (39%), reflecting inconsistency in detecting these ambiguous cases. Overall, these findings demonstrate that YOLO11x-cls not only excels at recognizing optimal frames but also reliably filters out non-diagnostic or misleading content, making it a highly effective solution for clinical frame triage.

Among the tested architectures, YOLO11x-cls achieved the highest overall performance, notably reaching a perfect score of 100% across all metrics in the Good class. This is particularly significant because the Good category is clinically critical: it represents the diagnostically usable frames that matter most for downstream measurement and analysis. The model also generalized well to the more challenging Bad and Null classes, with F1-scores of 92% and 94%, respectively. These results highlight YOLO11x-cls’s ability to learn subtle differences between frame classes, making it a robust and clinically reliable model for ultrasound frame triage.

Cross-validation results

To rigorously assess the generalization capability of the models, we employed a 5-fold cross-validation strategy with patient-level separation. The results, summarized in Table 2, include both the mean accuracy across folds and the best validation accuracy achieved by each model. Among the evaluated architectures, YOLO11x-cls outperformed all others, achieving the highest average accuracy (90% ± 5.9%) and peak fold accuracy (95.90%), reflecting both strong overall performance and training stability. In contrast, EfficientNet showed the weakest generalization, with a considerably lower average accuracy of 70% ± 2.1% and limited performance gain in the best fold. Notably, while ResNet50 reached a respectable best fold accuracy of 90.08%, its lower mean performance indicates less consistent learning across data partitions. This contrast further underscores YOLO11x-cls’s ability to adapt effectively across varying patient subsets, making it a reliable choice for real-world deployment in diverse clinical settings. The weaker results of EfficientNet and ResNet34 are potentially due to lower capacity or inadequate feature extraction in ultrasound settings.

Table 2.

Cross-validation results for seven deep learning models on the kidney ultrasound classification task. The table reports both the average accuracy (± standard deviation) across five folds and the best validation accuracy achieved in any single fold. YOLO11x-cls demonstrates the highest average accuracy and peak validation performance, indicating strong generalization across different patient subsets. This table highlights the consistency and robustness of each model under a 5-fold cross-validation scheme. The performance variance across folds is also reflected through the standard deviation, offering insights into each model’s stability when trained on different subsets of the data.

Model Average accuracy across folds (%) Best validation accuracy (%)
InceptionV3 – 94.21
EfficientNet 70 ± 2.1 72.08
VGG16 – 88.75
ResNet34 – 89.20
ResNet50 – 90.08
YOLOv8x-cls – 95.44
YOLO11x-cls 90 ± 5.9 95.90

Qualitative and visual analysis

The model’s performance is further illustrated in Fig. 4, which presents the confusion matrix of YOLO11x-cls on the test set. As shown, the classifier achieved perfect recognition of the “Good” frames (45 out of 45 correctly predicted), which are the most clinically important for kidney measurement and assessment.

Fig. 4.

Fig. 4

Confusion matrix for YOLO11x-cls predictions on Test Set. Rows represent true labels and columns show predicted labels. The model achieves perfect classification on the “Good” class (45/45), with minimal confusion in “Bad” and “Null” categories, confirming its robustness and high sensitivity in distinguishing diagnostically relevant frames. This visualization demonstrates the model’s ability to separate diagnostically valuable frames (Good) from those with lower or no clinical utility (Bad and Null). The matrix reveals that YOLO11x-cls made only five misclassifications in total out of 118 samples, all of which occurred between the more visually similar Bad and Null categories.

Misclassifications were minimal and confined to the Bad and Null categories–primarily involving a few Bad frames being misclassified as Null (3 cases), and vice versa (2 cases). Importantly, no Good frames were misclassified, and there were no false positives in the Good category, confirming the model’s high precision and clinical reliability in selecting diagnostically acceptable frames. The misclassified images are shown in Fig. 5. These results emphasize the model’s sensitivity to subtle visual distinctions and its ability to avoid overestimating diagnostic quality, which is critical for medical image triage systems.

Fig. 5.

Fig. 5

Representative examples of the 5 misclassified validation samples by the YOLO11x-cls model. All errors occurred between the Bad and Null classes, which are often visually ambiguous or borderline. No Good frames were misclassified, underscoring the model’s reliability in identifying diagnostically valuable frames. These results highlight the clinical robustness of the proposed automated selection approach.

The learning dynamics of YOLO11x-cls during training are visualized in Fig. 6, which plots the validation accuracy over 100 epochs. The model rapidly improves within the first 10 epochs, surpassing 95% early in training. While small fluctuations are observed, the trend remains upward and stable. Toward the end of training, the model maintains high accuracy, peaking close to 98%. This consistent trajectory confirms successful convergence and strong generalization to unseen data. The curve’s smooth profile reflects the effectiveness of the YOLO11x-cls architecture and its robust training behavior.

Fig. 6.

Fig. 6

Validation accuracy curve of the YOLO11x-cls model across 100 training epochs. The curve shows a consistent upward trend with minor fluctuations, stabilizing above 94% in later epochs. This indicates successful convergence and strong generalization performance. The accuracy curve demonstrates that the model quickly escaped the initial low-performance phase, reaching over 90% accuracy within the first 20 epochs. Despite minor oscillations caused by validation data variance, the trend stabilizes around a high accuracy plateau (>94%), reflecting reliable learning behavior and minimal risk of overfitting.

Complementing the accuracy trend, Fig. 7 shows the validation loss curve of YOLO11x-cls during training. The loss drops sharply within the initial epochs, indicating that the model quickly learned meaningful patterns from the data, then declines gradually and stabilizes at approximately 0.09. The absence of late-stage spikes or divergence confirms that the model avoided overfitting and maintained strong generalization throughout training. Together with the accuracy curve in Fig. 6, this trend provides strong empirical evidence of model reliability.

Fig. 7.

Validation loss curve of YOLO11x-cls over 100 training epochs. After a rapid reduction in the first 10 epochs, the validation loss continues to decrease gradually with only minor oscillations, stabilizing around a low plateau (≈0.09). This convergence in later epochs indicates effective learning, minimal overfitting, and strong generalization to unseen data, reinforcing the effectiveness of the selected architecture and training configuration.
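The convergence criteria read off Figs. 6 and 7 — a low late-epoch plateau and no late-stage spikes — can be checked numerically from a per-epoch loss log. The sketch below uses a synthetic curve shaped like Fig. 7; the function name and spike threshold are illustrative choices, not part of our pipeline.

```python
def convergence_summary(val_losses, tail=10, spike_factor=1.5):
    """Summarize late-training behavior from a per-epoch validation-loss list:
    the mean loss over the final `tail` epochs, plus a crude divergence flag
    that fires if any late epoch exceeds `spike_factor` times that plateau."""
    plateau = sum(val_losses[-tail:]) / tail
    diverged = any(l > spike_factor * plateau for l in val_losses[-tail:])
    return plateau, diverged

# Synthetic curve shaped like Fig. 7: sharp early drop, then a ~0.09 plateau.
losses = [0.9, 0.5, 0.3, 0.2, 0.15, 0.12] + [0.09 + 0.003 * (i % 3) for i in range(94)]
plateau, diverged = convergence_summary(losses)
print(round(plateau, 2), diverged)  # low plateau near 0.09, no late spikes
```

In practice the per-epoch values would come from the training framework’s logs (e.g. the results file written during training) rather than a hand-built list.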

These results indicate that YOLO11x-cls not only achieves high numerical performance but also exhibits consistency, training stability, and interpretability in practice. Overall, YOLO11x-cls demonstrates superior performance in both quantitative metrics and qualitative behavior compared to conventional CNN architectures. Its robustness in detecting diagnostically valuable frames makes it a strong candidate for integration into clinical ultrasound workflows, where accurate and efficient frame selection is critical.

Discussion

A key innovation of this study lies in its shift from conventional static-image-based kidney analysis to a dynamic, frame-level classification approach tailored for ultrasound video sequences. Unlike prior works that assume access to optimal images, our pipeline addresses a critical and underexplored gap: the automated selection of diagnostically valuable frames. By adopting YOLO11x-cls–the classification variant of an architecture family originally designed for object detection–we adapted a high-capacity, multi-scale backbone to a novel clinical task, thereby broadening its applicability in medical imaging.

The results of our study demonstrate the effectiveness of deep learning methods for automated frame selection in kidney ultrasound imaging, particularly highlighting the superior performance of the YOLO11x-cls model. While all evaluated models showed varying levels of success, YOLO11x-cls consistently outperformed traditional CNN architectures in terms of precision, recall, F1-score, and cross-validation accuracy. From a clinical perspective, the ability to automatically identify diagnostically optimal frames is of great significance. Ultrasound examinations of the kidney often contain dozens to hundreds of frames, many of which are redundant, blurry, or lack complete anatomical visibility. In current clinical workflows, the selection of a single high-quality frame for kidney measurement is a manual and time-intensive task performed by radiologists or technicians. Our proposed method addresses this bottleneck by introducing an automated system that not only reduces human workload but also increases reproducibility and reduces observer variability.
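Once per-frame class probabilities are available (in the Ultralytics API, each classification result exposes its softmax scores through a probs attribute), the best-frame rule reduces to picking the frame with the highest Good confidence. The helper below is a hypothetical sketch of that selection step, operating on precomputed probability dicts rather than on the model itself.

```python
def select_best_frame(frame_probs, good="Good"):
    """Given per-frame class-probability dicts for an ultrasound clip, return
    the index of the frame with the highest Good probability, or None if no
    frame's top-1 prediction is Good."""
    best_idx, best_conf = None, 0.0
    for idx, probs in enumerate(frame_probs):
        predicted = max(probs, key=probs.get)  # top-1 class for this frame
        if predicted == good and probs[good] > best_conf:
            best_idx, best_conf = idx, probs[good]
    return best_idx

# Hypothetical clip of four frames: frame 2 is the most confident Good frame.
clip = [
    {"Good": 0.10, "Bad": 0.70, "Null": 0.20},
    {"Good": 0.55, "Bad": 0.40, "Null": 0.05},
    {"Good": 0.97, "Bad": 0.02, "Null": 0.01},
    {"Good": 0.20, "Bad": 0.10, "Null": 0.70},
]
print(select_best_frame(clip))  # -> 2
```

Returning None when no frame is classified Good lets a deployment fall back to flagging the clip for manual review rather than forcing a suboptimal choice.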

The YOLO11x-cls model’s perfect classification performance on the Good class (100% precision, recall, and F1-score) is especially noteworthy, as these frames directly impact downstream tasks such as renal length measurement, segmentation, and disease classification. Its strong generalization ability on Bad and Null frames indicates its robustness in distinguishing between subtle variations in frame quality–variations that even human observers may occasionally misjudge. In contrast, traditional models such as EfficientNet and ResNet variants, while relatively strong on the Good class, suffered noticeable drops in performance on the Bad class, suggesting they may confuse poor-quality frames with diagnostically acceptable ones. This distinction is crucial because incorrect inclusion of suboptimal frames in the diagnostic pipeline can negatively affect measurement accuracy and model interpretability.

Furthermore, the cross-validation results reinforce the generalizability of our approach. YOLO11x-cls achieved the highest average accuracy across folds, with relatively low standard deviation, indicating that it maintains its performance across diverse patient data. The consistent convergence behavior observed during training (as seen in Figs. 6 and 7) further supports the model’s training stability and reliability. To visually assess the model’s classification behavior, Fig. 8 presents a qualitative selection of ultrasound frames from the validation set, each labeled with the class predicted by YOLO11x-cls. The figure includes representative examples from all three categories–Good, Bad, and Null–with color-coded labels for clarity.
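Patient-level splitting is what prevents frames from one patient leaking across folds. A minimal sketch of such a split uses scikit-learn’s GroupKFold with patient IDs as the grouping variable; our study used 5 folds, and the toy data below is purely illustrative.

```python
from sklearn.model_selection import GroupKFold

def patient_level_folds(frame_ids, patient_ids, n_splits=5):
    """Yield (train, val) index arrays such that all frames from any one
    patient fall on the same side of each split."""
    gkf = GroupKFold(n_splits=n_splits)
    yield from gkf.split(frame_ids, groups=patient_ids)

# Hypothetical toy data: 12 frames drawn from 6 patients (2 frames each).
frames = list(range(12))
patients = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
for train, val in patient_level_folds(frames, patients, n_splits=3):
    # No patient may appear on both sides of a split.
    assert not {patients[i] for i in train} & {patients[i] for i in val}
```

A naive frame-level shuffle would place near-duplicate frames from the same patient in both train and validation sets, inflating the apparent accuracy; grouping by patient removes that bias.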

Fig. 8.

Qualitative results from a validation batch using YOLO11x-cls. Each ultrasound frame is labeled with the model’s predicted class: Good, Bad, or Null. Good frames typically exhibit a complete, centered kidney with clear cortical–medullary definition; Bad frames show partial visibility, low contrast, or motion blur; and Null frames lack discernible kidney structures. The model’s high-confidence, accurate predictions across all three categories highlight its visual discrimination ability, clinical reliability, and capacity to generalize beyond the training data.

The Good frames predicted by the model consistently display complete and well-positioned kidneys with clear anatomical detail, suitable for diagnostic measurement. In contrast, frames labeled as Bad exhibit poor contrast or partial organ visibility, while Null frames show little to no kidney content. The model’s high-confidence and accurate predictions across this diverse batch of images further validate its ability to capture subtle diagnostic features and make clinically aligned judgments, even in the presence of image variability.

One of the key advantages of our pipeline is the integration of expert knowledge during the annotation process. By involving radiologists in the labeling and review of frames, we ensured that the training data closely reflects real-world clinical standards. This expert-informed dataset serves as a strong foundation for the model to learn clinically relevant distinctions, rather than relying solely on raw pixel patterns.

While the proposed YOLO11x-cls–based framework demonstrates strong performance, several limitations should be acknowledged. First, the dataset–although annotated by clinical experts–was relatively small (1,203 frames from 211 patients) and derived from a single institution, which may limit the model’s generalizability across different clinical settings, ultrasound machines, or operators. Second, only B-mode ultrasound images were considered, whereas other modalities such as Doppler or elastography could provide complementary diagnostic features. Third, inter-rater variability in the annotation process was not quantitatively assessed, although weekly reviews were conducted. Lastly, the model’s real-time inference capability in live clinical settings has not yet been validated. However, our YOLO11x-cls model processes each frame in approximately 18.5 ms (corresponding to ≈54 frames per second), supporting real-time integration in future applications (see Supplementary Table S1). Addressing these limitations in future work will be critical to ensure broader adoption and deployment of the proposed system.

Additionally, while we applied a unified training configuration across all baseline models to ensure consistency, this may not have captured the full potential of each architecture. However, our attempts to improve baseline model performance through alternative settings did not lead to better results. We chose this controlled approach to isolate architectural performance and maintain focus on the study’s primary goal—selecting diagnostically optimal frames. Under these conditions, YOLO11x-cls achieved the most reliable outcomes, with perfect classification of the Good class, which directly impacts downstream diagnostic tasks. This reinforces its practical value and robustness in clinical settings.
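The latency-to-throughput figure quoted here is a direct reciprocal; a one-line helper makes the arithmetic explicit (the 18.5 ms per-frame measurement comes from Supplementary Table S1).

```python
def throughput_fps(latency_ms):
    """Convert a per-frame inference latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

# 18.5 ms per frame corresponds to roughly 54 frames per second.
print(round(throughput_fps(18.5), 1))  # -> 54.1
```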

Looking forward, our proposed method has strong potential for clinical integration. It could be deployed as part of a real-time ultrasound software package that automatically identifies and stores the best frames during scanning. Alternatively, it could function as a post-processing tool to assist radiologists in reviewing stored DICOM videos. Moreover, combining this frame selection system with downstream applications—such as kidney segmentation, area measurement, or disease classification—could pave the way toward fully automated ultrasound-based kidney analysis pipelines.

In conclusion, this study presents a practical and highly effective approach for automating a critical yet often-overlooked task in renal ultrasound imaging. By demonstrating that deep learning, and particularly YOLO11x-cls, can reliably identify high-quality diagnostic frames, we lay the groundwork for more standardized, efficient, and intelligent ultrasound workflows in nephrology.

Conclusion

This study presents a novel and practical framework for automatic best-frame selection in kidney ultrasound videos using deep learning. By leveraging expert-annotated data and evaluating multiple state-of-the-art architectures, we demonstrate that YOLO11x-cls delivers outstanding classification accuracy and generalization. Notably, the model excels in identifying diagnostically meaningful frames–an essential step in clinical decision-making workflows.

Compared to existing methods that rely on manual review or traditional classification models, our approach achieves superior precision, consistency, and efficiency. The integration of expert knowledge during the dataset construction phase further enhances the system’s clinical relevance, enabling it to distinguish subtle anatomical differences that often challenge both human observers and general-purpose models.

The proposed method not only reduces the burden on radiologists but also paves the way for fully automated ultrasound pipelines involving downstream segmentation, measurement, and disease classification. Future research may focus on expanding the dataset and validating performance across multiple institutions and imaging systems to improve generalizability. Additionally, while our current selection criteria focus on general diagnostic quality, future extensions will aim to support disease-specific needs–such as prioritizing frames that highlight focal lesions or abnormal structures. With further development, this system holds promise for real-time deployment in point-of-care ultrasound applications.

Acknowledgements

We would like to thank all the patients who participated in this study, as well as the clinical staff involved in the data acquisition process.

Author contributions

A.S. and R.D. wrote the manuscript, designed and conducted the experiments, and developed and implemented the proposed method. A.S., S.P.M., and A.S.M. contributed to data collection and labeling. A.S. and J.S. were involved in assessing the project’s feasibility. R.D., H.G.K., A.S.M., and M.S.P. managed and supervised the project. All authors reviewed the manuscript.

Data availability

The dataset and source code used in this study are available upon reasonable request for academic and non-commercial use. Interested researchers may request access by contacting the corresponding author via email or by submitting a formal application through the request form available at https://github.com/am-a-s/Kidney-ultrasound-cls.

Declarations

Competing interests

The authors declare no competing interests.

Ethical approval

In accordance with ethical guidelines, studies involving human participants were reviewed and approved by the Ethics Committee of Physical Education and Sport Sciences at the University of Tehran (ID: IR.UT.SPORT.REC.1402.127)30. Prior to participation, all participants provided written informed consent. Additionally, explicit consent was obtained from each individual for the publication of any images in this manuscript.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-25755-1.

References

  • 1. Singla, R. K., Kadatz, M., Rohling, R. & Nguan, C. Kidney ultrasound for nephrologists: A review. Kidney Med. 4, 100464. 10.1016/j.xkme.2022.100464 (2022).
  • 2. Patil, S. & Choudhary, S. Deep convolutional neural network for chronic kidney disease prediction using ultrasound imaging. Bio-Algorithms Med-Syst. 17, 137–163. 10.1515/bams-2020-0068 (2021).
  • 3. Kuo, C.-C. et al. Automation of the kidney function prediction and classification through ultrasound-based kidney imaging using deep learning. npj Digital Med. 2, 29. 10.1038/s41746-019-0104-2 (2019).
  • 4. Sudharson, S. & Kokil, P. An ensemble of deep neural networks for kidney ultrasound image classification. Comput. Methods Programs Biomed. 197, 105709. 10.1016/j.cmpb.2020.105709 (2020).
  • 5. Meena, P., Kumar, H. & Kumar Yadav, S. A review on video summarization techniques. Eng. Appl. Artif. Intell. 118, 105667. 10.1016/j.engappai.2022.105667 (2023).
  • 6. Jadon, S. & Jasim, M. Unsupervised video summarization framework using keyframe extraction and video skimming. In 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), 140–145 (IEEE, 2020). 10.1109/iccca49541.2020.9250764.
  • 7. Zhou, P., Ding, Q., Luo, H. & Hou, X. Violence detection in surveillance video using low-level features. PLoS ONE 13, 1–15. 10.1371/journal.pone.0203668 (2018).
  • 8. Huang, R. et al. Extracting keyframes of breast ultrasound video using deep reinforcement learning. Med. Image Anal. 80, 102490. 10.1016/j.media.2022.102490 (2022).
  • 9. Torti, E., Gazzoni, M., Marenzi, E. & Leporati, F. GPU-based key-frame selection of pulmonary ultrasound images to detect COVID-19. J. Real-Time Image Proc. 21, 113. 10.1007/s11554-024-01493-x (2024).
  • 10. Bandara, M. S. et al. Ultrasound based radiomics features of chronic kidney disease. Acad. Radiol. 29, 229–235. 10.1016/j.acra.2021.01.006 (2022).
  • 11. Qin, X., Liu, X., Xia, L., Luo, Q. & Zhang, C. Multimodal ultrasound deep learning to detect fibrosis in early chronic kidney disease. Ren. Fail. 46, 2417740. 10.1080/0886022X.2024.2417740 (2024).
  • 12. Ma, F., Sun, T., Liu, L. & Jing, H. Detection and diagnosis of chronic kidney disease using deep learning-based heterogeneous modified artificial neural network. Futur. Gener. Comput. Syst. 111, 17–26. 10.1016/j.future.2020.04.036 (2020).
  • 13. Yin, S. et al. Multi-instance deep learning of ultrasound imaging data for pattern classification of congenital abnormalities of the kidney and urinary tract in children. Urology 142, 183–189. 10.1016/j.urology.2020.05.019 (2020).
  • 14. Hosseinzadeh, M. et al. A diagnostic prediction model for chronic kidney disease in internet of things platform. Multim. Tools Appl. 80, 16933–16950. 10.1007/s11042-020-09049-4 (2021).
  • 15. Jocher, G. & Qiu, J. Ultralytics YOLO11 (2024).
  • 16. Morshed, A. et al. Ultrasound-based AI for COVID-19 detection: A comprehensive review of public and private lung ultrasound datasets and studies. Multim. Tools Appl. 10.1007/s11042-025-20802-5 (2025).
  • 17. Mathews, R. P. et al. RL based unsupervised video summarization framework for ultrasound imaging. In Simplifying Medical Ultrasound (eds Aylward, S. et al.) 23–33 (Springer International Publishing, 2022).
  • 18. Wu, Y. & Yi, Z. Automated detection of kidney abnormalities using multi-feature fusion convolutional neural networks. Knowl.-Based Syst. 200, 105873. 10.1016/j.knosys.2020.105873 (2020).
  • 19. Su, X., Lin, S. & Huang, Y. Value of radiomics-based two-dimensional ultrasound for diagnosing early diabetic nephropathy. Sci. Rep. 13, 20427. 10.1038/s41598-023-47449-2 (2023).
  • 20. Chen, G., Dai, Y., Zhang, J., Yin, X. & Cui, L. MBANet: Multi-branch aware network for kidney ultrasound images segmentation. Comput. Biol. Med. 141, 105140. 10.1016/j.compbiomed.2021.105140 (2022).
  • 21. Ghelich Oghli, M. et al. Fully automated kidney image biomarker prediction in ultrasound scans using Fast-Unet++. Sci. Rep. 10.1038/s41598-024-55106-5 (2024).
  • 22. Zhang, C., Wang, L., Wei, G., Kong, Z. & Qiu, M. A dual-branch and dual attention transformer and CNN hybrid network for ultrasound image segmentation. Front. Physiol. 10.3389/fphys.2024.1432987 (2024).
  • 23. Ting, P., Wong, J., Ng, W. & Chan, C. S. Semi-supervised GAN-based radiomics model for data augmentation in breast ultrasound mass classification. Comput. Methods Programs Biomed. 203, 106018. 10.1016/j.cmpb.2021.106018 (2021).
  • 24. Karimi, A. et al. Improving spleen segmentation in ultrasound images using a hybrid deep learning framework. Sci. Rep. 15, 1670. 10.1038/s41598-025-85632-9 (2025).
  • 25. Ultralytics. YOLOv8 documentation – image classification model. https://docs.ultralytics.com/models/classify/ (2023). Accessed: 2025-04.
  • 26. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826. 10.1109/CVPR.2016.308 (IEEE Computer Society, 2016).
  • 27. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition (2015). arXiv:1409.1556.
  • 28. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. 10.1109/CVPR.2016.90 (2016).
  • 29. Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. 10.48550/arXiv.1905.11946 (2019).
  • 30. Ghattan Kashani, H. & Shariat Panahi, M. Intelligent detection, measurement and segmentation of ultrasound images of internal organs using neural networks. Ministry of Health and Medical Education (2024). Accessed from https://ethicsresearch.ut.ac.ir/article_96212_e01372fa3cd2d1bbbb74ec651851a8dc.pdf.
  • 31. Ultralytics. What is YOLOv8? https://yolov8.org/what-is-yolov8/ (2023).
