PLOS One. 2026 Jan 12;21(1):e0333191. doi: 10.1371/journal.pone.0333191

Robust motion detection and classification in real-life scenarios using motion vectors

Sameed Ur Rehman 1, Irshad Ullah 1, Wajahat Akbar 2, Altaf Hussain 2, Tariq Hussain 3, Ahmad Ali Alzubi 4, Insaf Ullah 5,*, Shuguang Li 2
Editor: Xiaohui Zhang
PMCID: PMC12795384  PMID: 41525327

Abstract

In dynamic settings such as security, autonomous driving, and robotics, effective motion detection and classification are crucial for accurate tracking amidst target and background movements. Traditional approaches, typically designed for static environments, face challenges in complex scenes with multiple types of motion. This research presents a robust algorithm for motion detection in fully dynamic scenarios, utilizing the macro block technique to generate motion vectors, followed by motion vector analysis to classify distinct types of motion. These include camera motion, object motion, background motion, and complex motion, where both background and foreground move simultaneously. By segmenting and categorizing these motion types, the proposed approach improves detection precision in cluttered, real-world environments. Furthermore, the algorithm adapts to lighting variations and is independent of specific sensor setups. Moreover, the high agreement with human judgment, achieving a 90% accuracy rate, underscores the model’s robustness and potential applicability in real-world scenarios where dynamic backgrounds are prevalent. This establishes a framework for future research in dynamic motion detection and classification.

1 Introduction

The key technology that allows artificial systems to understand and interpret the visual environment is computer vision. Making well-informed judgments involves processing and analyzing images, videos, multicamera views, or multidimensional data such as medical images [1,2]. Motion detection, which recognizes moving objects in a scene and provides details of their shape, speed, and direction, is a crucial component of computer vision. Applications such as autonomous navigation, object tracking, and surveillance require this functionality [3]. Visual motion in practice can be caused by a variety of factors, including changes in lighting, movements of the viewer, and movement of objects within the image. This intricacy makes advanced techniques necessary for precise object recognition and motion detection [4]. In this work, we use dynamic videos with different motion scenarios and perspectives to identify motion. Ego-motion is ignored because the video formatting is standardized, but other elements such as dynamic backgrounds and shifting lighting are taken into account. The main task of this research is to create a reliable system that can distinguish between the foreground and background in situations where both may be in motion. Conventional motion estimation methods often assume that while the foreground moves, the background stays unchanged [5,6]. However, this assumption falls short in scenarios where there is motion in the background, for example, in the rain or when the camera vibrates. Thus, the detection and categorization of complicated motion types in such hard contexts is the primary issue this work attempts to solve. We propose a sensor-free visual system that enables reliable motion detection for dynamic or targeted movements. To enable precise spatial motion analysis, we begin by acquiring frames from videos containing these motion types [7,8].
Each frame is processed individually, undergoing a standardization procedure to ensure uniform video formatting. Following standardization, we apply pixel intensity normalization to address lighting variability, providing consistent and normalized video frames for subsequent analysis [9]. This approach optimizes our ability to analyze and detect motion accurately in complex real-world conditions. Our proposed solution addresses a number of crucial problems, including handling non-uniform shifting backgrounds, identifying foreground and background based on motion and stillness, and allowing for gradual and global changes in lighting. These developments offer a system that is highly adaptive to dynamic, unstructured settings, making a substantial contribution to the field of advanced computer vision. Extensive studies and applications using several test beds of virtual and real-life settings verify the effectiveness of our solutions. The main objectives of this study are:

  • To develop a method to accurately identify rain or snow in the scene.

  • To create algorithms to detect and compensate for camera motion in the scene.

  • To separate different types of motion factors within the scene.

  • To develop techniques to effectively isolate the background in various scenarios.

The novelty of this research stems from differentiating motions, i.e., identifying motion types in the video frames concerned. Here, we identify four types of motion: object motion, background motion, camera motion, and combined motion (the motion of both foreground and background).

This new approach is a milestone as it helps us identify whether the object is moving within the defined frame, the background is moving, or the camera is being panned; in scenarios where both can be in motion, distinguishing between camera and object motion becomes a difficult task. The remainder of this study is organized as follows: Sect 2 covers the related work, Sect 3 presents the methodology, Sect 4 details the experiment, results analysis, and discussion, and Sect 5 provides the conclusion.

Contribution

The contribution of this work is to advance the field of motion detection and classification by addressing the challenge of identifying motion in fully dynamic backgrounds, i.e., scenarios in which both the foreground and background are moving. Unlike previous approaches, such as Gaussian Mixture Models (GMM) [10] and histogram-based methods that are primarily focused on distinguishing between foreground and background, this work presents a robust algorithm that is able to classify different types of motion, including object motion, background motion, and camera motion, in complex scenarios with multiple elements moving simultaneously. By using macro-block techniques to develop motion vectors and analyzing these vectors to classify the type of motion, the proposed approach brings significant improvements over existing methods. This innovation not only improves the accuracy of motion detection in complex and dynamic environments but also creates a comprehensive framework for understanding motion dynamics in real-world applications, paving the way for future advances in the field.

2 Related work

Motion detection research in computer vision has a long history, with early attempts focusing on methods like temporal differencing. One of the first techniques, temporal differencing, involves subtracting successive frames to emphasize changes and detect moving parts in video sequences. This method is very flexible in circumstances that change quickly. Nevertheless, it frequently results in the foreground aperture issue, as it is unable to fully capture the contour of some kinds of moving objects. Temporal differencing techniques typically utilize a threshold mechanism to detect motion by comparing changes between subsequent frames, and therefore require additional approaches to identify stopped sections [9,10]. A system inspired by Pfinder, a real-time system for tracking and analyzing human behavior, was created by Lipton et al. [11]. They used a combination of temporal and spatial differencing to track and categorize targets in video sequences. They successfully tracked classified targets throughout time and location by incorporating target classification metrics. Although robust, the system was less effective in highly dynamic scenarios due to its susceptibility to changes in appearance and background noise. Chang suggested a method for detecting changes that depends solely on temporal variations [10]. This approach made it easier to segment moving objects from the background and allowed for continuous monitoring, which made it useful for video surveillance. However, it had trouble handling camera movement and needed a steady background to recognize objects correctly. Another popular method for motion detection is background subtraction. To detect moving objects, this approach compares the current frame with a background picture that serves as a reference.
Even though they are simple to use, basic background subtraction techniques often perform poorly in real-world situations because of problems with dynamic backgrounds, changing lighting, and camera movement [12–14]. More complex models, such as the Gaussian Mixture Model (GMM), were created to solve these issues. GMM-based methods represent each pixel with several Gaussian distributions, providing resilience against noise, shadows, and illumination variations [9]. Kim expanded on the background subtraction method by including image registration techniques and building panoramic background models. The goal of this technique was to address problems with background adaptation and camera movement [15]. Its high computational memory needs and startup delays remained a problem, though. Grimson [16] and other researchers enhanced background modelling even further by creating multi-color adaptive background models that could be tracked in real time [17]. These models solve problems such as changes in background motion and lighting by updating the background for every frame in real time. Despite these improvements, the approaches still struggled to cope with quick scene changes and high computational overhead. The capacity of dynamic threshold-based algorithms to identify moving objects in intricate environments with shifting illumination has drawn attention recently. With these techniques, multi-target motion detection in dynamic backgrounds may be accomplished with more accuracy by computing a dynamic threshold based on the gradient change between successive frames [17]. The literature demonstrates how motion detection approaches have developed over time, moving from basic temporal differencing to sophisticated background modelling techniques. Research in this field continues because, despite great advancements, difficulties are still encountered when effectively identifying and tracking moving objects in intricate and dynamic situations [18].
Table 1 shows the comparative analysis of motion detection algorithms used for motion detection and classification, and the proposed work.

Table 1. Comparative analysis of motion detection algorithms and proposed work.

References Approach Dataset Research Aim Motion Classification (Type)
[19] ERD (ensemble random forest decision tree) Accelerometer and gyroscope sensors data Detect human kinematics motion such as walking or running with high performance Walking and running of an object
[20] YOLO, ResNet152 CCTV traffic surveillance footage Detect traffic incidents and classify them for an alarm system Causes of accident
[21] Niblack’s threshold method Surveillance videos Handle illumination and background clutter challenges; classify motion/non-motion at runtime Motion vs. non-motion
[22] YOLOv8 agri + DeepSORT Custom dataset Address lighting effect limitations in object detection Object detection considering lighting effects
[23] Temporal difference Static camera Detect moving objects at distance for object tracking using frame differencing Object detection using frame differencing
[24] Infrared spectral imaging + LBP Infrared spectral imaging Improve target motion detection and segmentation using LBP features Motion detection in infrared images
[25] Three-frame difference algorithm Custom dataset Optimize video surveillance via motion detection for efficient storage/transmission Motion detection for storage optimization
[26] Frame difference + convolution Jilin-1 Improve moving vehicle detection in satellite videos by addressing blurred boundaries Object detection in satellite videos with boundary refinement
Proposed Methodology Macro-Block and Motion Vector Analysis MP4, AVI, MPG video formats Detect and classify motion into various types Background, foreground, complex motion

3 Methodology

The aim of this study is to detect motion within a dynamic background, which refers to scenarios where both the background and the foreground are in motion. This investigation utilizes data from various sources, including the videos paris_cif.avi, Mixmasterfile.mpg, Aeroplanes.mp4, Wildlife.avi, and Pool.mp4. The study begins with data collection, specifically gathering the targeted videos. Once the videos are collected, preprocessing is conducted to refine the data for achieving the desired results. The first step in preprocessing involves extracting frames from the videos, followed by normalization of these frames. Normalization is a critical step in motion detection algorithms as it mitigates the impact of variations in lighting conditions, contrast, and exposure levels between frames. By normalizing pixel values, the algorithm can more effectively detect motion changes that are not merely due to brightness or contrast differences. Following normalization, the frames undergo RGB to grayscale conversion to standardize color intensity and eliminate color differences [27–30]. The subsequent step involves splitting the frames into macro blocks, which allows for determining a suitable macro block size for the algorithm. After defining the macro blocks, the motion vector for each block is estimated [31,32]. The motion vector indicates how pixels within a macro block move from one frame to another, enabling the video codec to encode the motion information rather than the actual pixel values, thereby enhancing compression efficiency. The final step is motion vector analysis, where the motion of each individual macro block across different frames is analyzed over a specified time frame; motion vector analysis then differentiates the type of motion present in a given scene. All these steps are presented diagrammatically in Fig 1.

Fig 1. An overview of the proposed methodology, illustrating the workflow from data extraction to motion detection.

Fig 1

3.1 Dataset descriptions

Five different video files are described in detail in Table 2, together with information on each one’s length, frame rate, number of frames, resolution, and pixel size per frame. With a total runtime of 1635.25 seconds, or around 27.25 minutes, the videos provide a sizable dataset for motion detection research. The frame rates, ranging from 15 to 30 frames per second, with “Wildlife.avi” slightly slower at 29.97 frames per second, provide a diversity of motion smoothness for evaluating the algorithm’s durability. With “Mixmasterfile.mpg” giving the most frames (34,389) and “paris_cif.avi” the fewest (1,065), the overall number of frames across all videos is 41,110, demonstrating the variation in data volume. Three videos have a resolution of 1280x720 pixels, while the other two have a size of 352x288 pixels, meaning the material is a combination of high-definition and lower-resolution content. The disparity in quality and frame rate across the videos is essential for the creation and evaluation of computer vision models, which must function effectively in a variety of complicated and varied video scenarios. The name, length, frame rate, total number of frames, width, height, and pixel size of each video file are all specified in detail. This table effectively represents the general overview of the input videos used. We have five distinct videos in three formats: AVI, MP4, and MPG. There are 41,110 frames in all, and their combined length is 27.25 minutes, or 1635.25 seconds.

Table 2. Overview of video dataset properties, including duration, frame rate, total frames, resolution (width and height), and pixels per frame, summarizing five video files with a cumulative duration of 1635.25 seconds (27.25 minutes) and 41,110 frames.

Video Duration (s) Frame Rate (fps) No. of Frames Width (px) Height (px) Pixels per Frame
paris_cif.avi 71.00 15 1,065 352 288 101,376
Mixmasterfile.mpg 1375.66 25 34,389 352 288 101,376
Aeroplans.mp4 68.31 30 2,049 1280 720 921,600
Wildlife.avi 30.10 29.97 902 1280 720 921,600
Pool.mp4 90.19 30 2,705 1280 720 921,600
Total 1635.25 (27.25 min) 41,110

3.2 Preprocessing and feature extraction

The preprocessing and feature extraction steps are crucial for obtaining accurate results, as preprocessing prepares the data for analysis. In this process, the data is handled step by step, starting with frame extraction followed by converting RGB frames to grayscale, and then normalizing the frames to ensure consistency in the input data.

3.3 Frame extraction and normalization

For efficient motion detection in video processing, frame extraction and normalization are essential processes [33]. Given a video sequence, we use Fn to represent the n-th frame. In practice, we compare frames separated by a gap D, such as F1 with F3, rather than consecutive frames, which are frequently excessively similar, as shown in Fig 2. This method improves the chances of finding notable motion variations.

Fig 2. Extracting frames enables motion detection and classification by isolating key instances from a dynamic background, capturing temporal changes for accurate analysis.

Fig 2

To accommodate changes in illumination and other environmental conditions, normalization adapts pixel intensity values to a standard scale. This preprocessing stage helps the algorithm attend to real motion rather than variations in brightness. Frame differencing, mean normalization, and contrast stretching are a few examples of normalization techniques. Frame differencing entails subtracting the pixel values of succeeding frames to emphasize changes (Fig 2). The normalization step is crucial for minimizing the effects of illumination fluctuations and facilitating reliable motion detection across several video sequences.
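The frame-gap comparison and mean normalization described above can be sketched as follows (a minimal illustration assuming NumPy; the helper names are ours, not from the paper's implementation):

```python
import numpy as np

def normalize_frame(frame):
    # Mean normalization: zero mean, unit variance, so global
    # brightness shifts between frames cancel out.
    frame = frame.astype(np.float64)
    std = frame.std()
    return (frame - frame.mean()) / std if std > 0 else frame - frame.mean()

def frame_pairs(frames, gap=2):
    # Compare F_n with F_{n+gap} (e.g., F1 with F3) instead of
    # consecutive frames, which are often too similar.
    for n in range(len(frames) - gap):
        yield normalize_frame(frames[n]), normalize_frame(frames[n + gap])

# Five synthetic 4x4 frames with increasing brightness
frames = [np.full((4, 4), i * 10, dtype=np.uint8) for i in range(5)]
pairs = list(frame_pairs(frames, gap=2))  # pairs (F1,F3), (F2,F4), (F3,F5)
```

Because each frame is normalized before differencing, a uniform brightness change between two frames produces a near-zero difference, leaving only genuine motion.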

3.4 Image resizing and grayscale conversion

We scale frames to a common 500×500×3 format so that videos of varying resolutions, such as VGA and HD, can be processed together (Fig 3). This resizing guarantees that every frame has the same size, which is important for further analysis and promotes consistent processing. After resizing, we transform the frames from RGB to grayscale (Figs 4 and 5). Grayscale conversion reduces the three color channels to one intensity channel, avoiding problems with color disparities and making intensity variations the main emphasis, improving the accuracy of motion recognition.

Fig 3. Frame normalization to a uniform 500×500×3 resolution to ensure consistent motion detection and classification across varying video formats, such as VGA and HD, in dynamic backgrounds.

Fig 3

Fig 4. Transformation of an RGB image into a grayscale representation.

Fig 4

Fig 5. Transformation of an RGB image into a grayscale representation: (left) the original image, (right) the grayscale image.

Fig 5
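The resizing and grayscale steps can be sketched in NumPy alone (a production pipeline would typically use cv2.resize and cv2.cvtColor; the helper names here are illustrative assumptions):

```python
import numpy as np

def resize_nearest(img, size=(500, 500)):
    # Nearest-neighbour resize so VGA (352x288) and HD (1280x720)
    # frames all become 500x500.
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return img[rows][:, cols]

def to_grayscale(rgb):
    # Collapse the three colour channels into one intensity channel
    # using the standard luminance weights.
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

frame = np.random.randint(0, 256, (288, 352, 3), dtype=np.uint8)  # VGA-like frame
gray = to_grayscale(resize_nearest(frame))  # shape (500, 500)
```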

3.5 Macro block size selection

Smaller macro blocks provide finer detail on motion but increase computation overhead. For example, a 50×50 frame contains 2,500 pixels; if it is divided into 5×5 blocks, it yields 100 macro blocks (2,500/25), thus allowing pixel-level analysis but takes longer to process. On the contrary, macro blocks measuring 25×25 will generate only 4 blocks (2,500/625), which in itself is too coarse to detect subtle motion variations. A compromise was therefore made to use a 10×10 macro block size. This means 25 blocks per 50×50 frame (2,500/100), which is a compromise between capturing detailed motion and efficiency in computation. Therefore, it is not too sensitive to noise arising from pixel-wise processing while still retaining enough detail to capture relevant motion patterns.
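The block-count arithmetic above can be checked directly (a trivial sketch; block_count is an illustrative helper, not part of the paper's code):

```python
def block_count(frame_side, block_side):
    # Number of non-overlapping square macro blocks in a square frame:
    # (frame_side / block_side) squared.
    return (frame_side // block_side) ** 2

# For a 50x50 frame: 5x5 blocks -> 100, 10x10 -> 25, 25x25 -> 4
print(block_count(50, 5), block_count(50, 10), block_count(50, 25))  # prints: 100 25 4
```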

3.6 Macro-block division and motion vector estimation

Macro-blocks are the basic building blocks of motion analysis; each frame is divided into them [34,35]. With 10x10 as the preferred macro-block size, each normalized 500x500 frame contains 2,500 macro-blocks, arranged in a 50x50 grid (Fig 6). This size strikes a compromise between the capacity to record minute motions and computational efficiency. For motion vector estimation, we employ the differences between frames separated by a predetermined gap; for best results, we set this gap to 5 frames. The motion vector is computed in polar coordinates, as shown in Eqs 1, 2, and 3, which indicate the displacement and direction of motion shown in Fig 7.
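Dividing a normalized 500×500 frame into 10×10 macro-blocks can be sketched with a NumPy reshape (an illustrative implementation, not the authors' code):

```python
import numpy as np

def split_blocks(frame, b=10):
    # Divide a 2-D grayscale frame into non-overlapping b x b macro-blocks.
    # A 500x500 frame with b = 10 yields a 50x50 grid of 2,500 blocks.
    h, w = frame.shape
    return (frame[:h - h % b, :w - w % b]   # trim any remainder rows/cols
            .reshape(h // b, b, w // b, b)
            .swapaxes(1, 2)
            .reshape(-1, b, b))

blocks = split_blocks(np.zeros((500, 500)))
# blocks.shape == (2500, 10, 10)
```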

Fig 6. 50×50 Macro-Block representation in digital image processing.

Fig 6

Fig 7. Diagrammatic representation of tan θ.

Fig 7

mv=(d,θ) (1)

Where

d = √((Δx)² + (Δy)²) (2)

So, θ will be:

θ = tan⁻¹(Δy / Δx) (3)
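Eqs 1–3 translate directly into code (a sketch using atan2, which handles the Δx = 0 case more robustly than a bare arctangent):

```python
import math

def motion_vector(dx, dy):
    # Polar form of a block displacement, per Eqs 1-3:
    # d = sqrt(dx^2 + dy^2), theta = angle of the vector (dx, dy).
    d = math.hypot(dx, dy)
    theta = math.atan2(dy, dx)
    return d, theta

d, theta = motion_vector(3, 4)  # d == 5.0
```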

3.7 Motion analysis

By comparing motion vectors between macro-blocks, motion vector analysis is able to distinguish between various motion situations [36–38]. Rather than being established as a constant, the threshold value (th) for each motion classification is defined dynamically: for each video, the mean (μ) and standard deviation (σ) of the motion vector magnitudes are computed, and the threshold is set as th = μ + 0.75σ. This data-driven strategy enables the algorithm to adapt to heterogeneous motion intensities and environmental noise, and consequently provides robust classification results. When all macro-blocks have differences below the threshold (dᵢ ≤ th), no motion (Fig 8) is identified. When all macro-blocks exhibit considerable motion (dᵢ > th), camera motion (Fig 9) is indicated; when some blocks move but others stay static, object motion (Fig 10) is noted. Accurate motion identification in complicated scenarios is made possible by the algorithm’s ability to discern between moving objects, static backgrounds, and camera motion through the analysis of these patterns (Fig 11).

Fig 8. The scene remains completely still, with no motion detected.

Fig 8

Fig 9. Scene remains static while the camera moves.

Fig 9

Fig 10. Camera remains static while the object moves.

Fig 10

Fig 11. Objects, background and camera all are in motion.

Fig 11

Case 1: No motion

dᵢ ≤ th for all i = 1, …, 2500 (4)

In Case 1, Eq 4 indicates the complete absence of any detectable motion within the scene: every block magnitude stays below the threshold. All objects remained static, indicating a lack of activity or movement. This observation underscores the scene’s stability and lack of dynamic elements, as shown in Fig 8.

Case 2: Camera motion only

dᵢ > th for all i = 1, …, 2500 (5)
dᵢ ≈ dⱼ for i ≠ j (6)

In Case 2, Eqs 5 and 6 indicate that the only detected movement was due to the camera itself: every block exceeds the threshold, and the motion vectors are roughly equal. The objects within the scene remained stationary, while the camera’s motion introduced the observed changes. This highlights that any dynamic elements are a result of the camera’s activity rather than the scene’s contents, as shown in Fig 9.

Case 3: Object motion

dᵢ ≤ th for some i (7)
dᵢ > th for some i (8)

In Case 3, where Eq 7 describes the background blocks and Eq 8 the object blocks, the analysis revealed that an object within the scene was in motion. This indicates active movement among the scene’s elements, distinct from any camera movement. The presence of this moving object adds a dynamic aspect to the scene, differentiating it from static scenarios, as shown in Fig 10.

Case 4: Complex motion

dᵢ > th for all i = 1, …, 2500, with dᵢ ≠ dⱼ (9)

In Case 4, considering Eq 9, where dᵢ ≠ dⱼ, the analysis showed simultaneous motion of the object, background, and camera within the scene. This combination of movements results in a highly dynamic and complex environment. The interplay of these elements creates a scenario where both the subject and surroundings are in motion, along with the camera capturing the scene, as shown in Fig 11.
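The four cases and the data-driven threshold th = μ + 0.75σ can be combined into a single classifier sketch (assuming NumPy; the uniformity tolerance used to separate Case 2 from Case 4 is an illustrative choice, since the paper does not specify one):

```python
import numpy as np

def video_threshold(all_magnitudes, k=0.75):
    # th = mu + k * sigma over the whole video's block magnitudes
    m = np.asarray(all_magnitudes, dtype=float)
    return m.mean() + k * m.std()

def classify_frame(frame_magnitudes, th, uniform_tol=0.15):
    # Classify one frame from its per-block magnitudes d_i.
    d = np.asarray(frame_magnitudes, dtype=float)
    moving = d > th
    if not moving.any():
        return "no motion"        # Case 1: d_i <= th for all i
    if moving.all():
        # Case 2 vs Case 4: roughly equal vectors suggest pure camera motion
        spread = d.std() / d.mean() if d.mean() else 0.0
        return "camera motion" if spread < uniform_tol else "complex motion"
    return "object motion"        # Case 3: some blocks move, others are static

th = video_threshold([1.0] * 90 + [10.0] * 10)
classify_frame([0.5] * 100, th)   # -> "no motion"
classify_frame([10.0] * 100, th)  # -> "camera motion"
```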

3.8 Proposed model

The proposed methodology begins with extracting frames from a video in their original RGB format, which are then normalized to a fixed resolution of 500 × 500 pixels to ensure uniformity and efficiency. The RGB frames are subsequently converted into grayscale to reduce complexity while retaining essential structural details. Each grayscale frame is divided into grids of varying scales (5 × 5, 10 × 10, and 25 × 25). Among these options, the 10 × 10 grid is selected as the optimal choice. Choosing a 5 × 5 grid would result in a highly pixel-based analysis, leading to increased computational complexity. Conversely, selecting a 25 × 25 grid would likely overlook finer details and subtle motions, as the larger grid size might fail to capture small movements. Therefore, the 10 × 10 grid strikes a balance by effectively capturing minute motions while maintaining computational efficiency, enabling multi-resolution analysis and hierarchical feature extraction. The extracted features from these grids represent the unique characteristics of the frames and serve as input for classification. Finally, the frames are categorized into distinct cases (Case 1, Case 2, Case 3, and Case 4) based on predefined criteria such as motion patterns, object detection, or scene analysis, facilitating detailed analysis and understanding of the video content as shown in Fig 12.

Fig 12. Proposed methodology work flow for motion detection and classification.

Fig 12

4 Results and analysis

Our baseline assessment was done through a human-judged approach. A table with four entries was created for: Case A (No Motion), Case B (Camera Motion Only), Case C (Object Motion), and Case D (All Motion). The target users were shown several video scenes and asked to identify the class of motion for each. The responses from the human-judged approach were compared with the algorithm’s results, yielding 90% agreement. Based on this comparison, we conclude that our algorithm has attained an accuracy level of 90%.

The motion detection and analysis system processes video frames of horses running along a beach to capture and quantify dynamic elements of the scene. The system employs multiple components to offer diverse perspectives in Figs 13, 14, 15, and 16. (a) The original image represents the unaltered video frame: Fig 13 shows the horses against a backdrop of crashing waves, Fig 14 shows bird movements against a static background, Fig 15 shows a flying bird and its shadow on the ground while other objects remain static, and Fig 16 addresses a special scenario in which all motion types are present, i.e., camera motion, object motion, and background motion. (b) The grayscale image removes color information, simplifying the data and retaining only intensity values, which aids motion detection by reducing computational complexity. (c) The motion-containing part isolates and highlights regions of significant motion, identified through changes in pixel intensity, primarily corresponding to the movement of the horses and waves in Fig 13, the birds in Fig 14, the single bird and its shadow against a complex background in Fig 15, and the camera, object, and background in Fig 16. (d) The 3D bar chart of pixel intensities visualizes the grayscale image’s brightness distribution, where the height of each bar indicates the number of pixels at a particular intensity. (e) The motion vector plot illustrates the direction and magnitude of detected motion using arrows, with length representing speed and direction indicating movement trajectory. This multi-faceted approach facilitates a comprehensive analysis of motion within the scene.
The grayscale transformation simplifies feature extraction, while the motion-containing part identifies areas of interest, enabling the detection of dynamic changes such as the movement of horses and waves in Fig 13, the birds in Fig 14, the single bird and its shadow in Fig 15, and the objects, camera, and background in Fig 16. The pixel intensity bars provide a statistical overview of brightness distribution, helpful for monitoring environmental lighting variations. The motion vector plot quantitatively represents motion patterns, providing insights into speed and direction and enabling the identification of motion types, such as object motion, background motion, camera motion, and motion involving both background and foreground elements. These features are particularly useful for applications such as video analysis, enabling object tracking and advancing computer vision research in scene understanding.

Fig 13. Result of Proposed Methodology (a) The Original Image (b) Grayscale Image (c) Motion Containing Part (d) 3D Bar Chart of Pixel Intensity and (e) Motion Vector Plot.

Fig 13

Fig 14. Result of Proposed Methodology (a) The Original Image (b) Grayscale Image (c) Motion Containing Part (d) 3D Bar Chart of Pixel Intensity and (e) Motion Vector Plot.

Fig 14

Fig 15. Result of Proposed Methodology (a) The Original Image (b) Grayscale Image (c) Motion Containing Part (d) 3D Bar Chart of Pixel Intensity and (e) Motion Vector Plot.

Fig 15

Fig 16. Result of Proposed Methodology (a) The Original Image (b) Grayscale Image (c) Motion Containing Part (d) 3D Bar Chart of Pixel Intensity and (e) Motion Vector Plot.

Fig 16

4.1 Dataset expansion and robustness testing

To test the model’s adaptability, we extended the evaluation to include sequences from publicly available datasets, such as SBI2015. From these datasets, we use the hall monitor sequence shown in Fig 17, and a custom video as shown in Fig 18. These datasets feature challenges such as variable lighting, dynamic weather, and shadows. Our model retained a performance level similar to that on the custom dataset used in our initial study. This reinforces the robustness and flexibility of our method across diverse scenarios without requiring retraining.

Fig 17. Result of Proposed Methodology (a) The Original Image (b) Grayscale Image (c) Motion Containing Part (d) Heat map (e) Motion Vector Plot.

Fig 17

Fig 18. Custom video for more accurate testing.

Fig 18

4.2 Experimental results

In the current study, the dataset lacks annotated ground truth labels for motion classes (e.g., object, camera, complex motion), which makes direct computation of these metrics infeasible. To address this limitation, we adopted a human judgment-based validation approach, where motion classification outputs were independently reviewed by multiple observers. Agreement between the algorithm and consensus human classification was used as a proxy for accuracy. While this approach introduces subjectivity, it provides a reasonable basis for preliminary evaluation in the absence of labelled data. Additionally, we have included processing speed metrics as an objective measure of performance. The proposed method processes frames at an average rate of 28 frames per second on a standard Intel i7 CPU (without a GPU), demonstrating its suitability for practical applications. Table 3 categorizes the motion types in various video files by percentage of total frames. In the video “paris_cif.avi” with 1,065 frames, 0.6% have no motion, none exhibit camera motion, 94.0% show object motion, and 5.33% involve complex motion. “Mixmasterfile.mpg” has 34,389 frames, with 3.4% showing no motion, 20.1% showing camera motion, 41.2% depicting object motion, and 34.4% involving complex motion. “Aeroplans.mp4,” with 2,049 frames, has 0.1% with no motion, 45.6% with camera motion, no object motion, and 53.4% complex motion. “Wildlife.avi,” containing 902 frames, has 3.0% no motion, 58.5% camera motion, no object motion, and 39.8% complex motion. Finally, “Pool.mp4” with 2,705 frames shows 28.1% no motion, 11.5% camera motion, 48.3% object motion, and 7.9% complex motion.

Table 3. Summary chart for human judgmental results.

Video Total Frames CASE A: No Motion (%) CASE B: Camera Motion (%) CASE C: Object Motion (%) CASE D: All/Complex Motion (%)
paris_cif.avi 1,065 0.6 0.0 94.0 5.33
Mixmasterfile.mpg 34,389 3.4 20.1 41.2 34.4
Aeroplans.mp4 2,049 0.1 45.6 0.0 53.4
Wildlife.avi 902 3.0 58.5 0.0 39.8
Pool.mp4 2,705 28.1 11.5 48.3 7.9
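Summaries like Table 3 can be reproduced from per-frame classifications with a few lines of code. The sketch below is a minimal illustration, not the paper’s implementation; the label names and the ten-frame example are hypothetical.

```python
from collections import Counter

CASES = ["no_motion", "camera_motion", "object_motion", "complex_motion"]

def motion_percentages(frame_labels):
    """Summarize per-frame motion labels as percentages of total frames,
    mirroring the layout of Table 3."""
    counts = Counter(frame_labels)
    total = len(frame_labels)
    # Counter returns 0 for absent cases, so every column is reported.
    return {case: round(100.0 * counts[case] / total, 1) for case in CASES}

# Hypothetical ten-frame clip: 1 static frame, 8 object-motion frames,
# 1 complex-motion frame.
labels = ["no_motion"] + ["object_motion"] * 8 + ["complex_motion"]
print(motion_percentages(labels))
```

Running this over a full video’s frame labels yields one row of the summary table.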

Table 4 presents human judgmental results on the types of motion observed across the scenes of five video files, conducted for our research purposes. In “paris_cif.avi” (7 scenes), no scene shows no motion or camera motion, 6 scenes involve object motion, and 1 scene has complex motion. “Mixmasterfile.mpg” (85 scenes) contains 3 scenes with no motion, 17 with camera motion, 35 with object motion, and 30 with complex motion. “Aeroplans.mp4” (11 scenes) has no scenes with no motion, 5 with camera motion, none with object motion, and 6 with complex motion. “Wildlife.avi” (10 scenes) shows no scenes with no motion, 6 with camera motion, none with object motion, and 4 with complex motion. Finally, “Pool.mp4” (17 scenes) includes 5 scenes with no motion, 2 with camera motion, 9 with object motion, and 1 with complex motion.

Table 4. Comparison between proposed and human experimental results.

VIDEOS Total Scenes CASE A: No Motion CASE B: Camera Motion CASE C: Object Motion CASE D: Complex Motion

paris_cif.avi 7 0 0 6 1
Mixmasterfile.mpg 85 3 17 35 30
Aeroplans.mp4 11 0 5 0 6
Wildlife.avi 10 0 6 0 4
Pool.mp4 17 5 2 9 1

Table 5 compares human judgmental results with the proposed model results for the different motion types in “paris_cif.avi.” Both the human assessment and the proposed model identified no scenes with no motion (Case A) or camera motion (Case B). For object motion (Case C), both methods agreed on 6 scenes, and for complex motion (Case D), both recognized 1 scene. Overall, the comparison shows complete agreement between the human judgmental results and the proposed model results, with both methods identifying 6 scenes with object motion and 1 scene with complex motion, totaling 7 scenes.

Table 5. Confusion Matrix: Human Judgement (Ground Truth) vs. Proposed Method (Predicted).

Human/Predicted Case A Case B Case C Case D Total
Case A: No Motion 0 0 0 0 0
Case B: Camera Motion 0 0 0 0 0
Case C: Object Motion 0 0 6 0 6
Case D: Complex Motion 0 0 0 1 1
Total 0 0 6 1 7

Table 6 compares human judgmental results with the proposed model results for the various motion types in “Mixmasterfile.mpg.” For scenes with no motion (Case A), the proposed model identified 4 scenes, compared to 3 judged by humans. For camera motion (Case B), the model found 19 scenes, close to the 17 identified by humans. For object motion (Case C), the model recognized 34 scenes, while humans identified 35. For complex motion (Case D), the model identified 28 scenes, compared to 30 in human judgment. The total scene count is consistent at 85 for both methods, with minor variations in classification across motion types. This comparison indicates a high level of agreement between human judgment and the proposed model, with slight discrepancies in categorizing specific scenes.

Table 6. Comparison between human judgment and proposed method on Mixmasterfile.Mpg.

Human/Predicted Case A Case B Case C Case D Total
Case A: No Motion 2 0 1 1 4
Case B: Camera Motion 1 15 2 1 19
Case C: Object Motion 0 1 31 2 34
Case D: Complex Motion 0 1 1 26 28
Total 3 17 35 30 85
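The agreement figures above follow directly from the confusion matrix: scenes on the diagonal are those where the model and the human consensus coincide. A minimal sketch, using the counts transcribed from Table 6:

```python
def confusion_accuracy(matrix):
    """Overall agreement (accuracy) from a square confusion matrix:
    the diagonal holds the scenes where both labelings coincide."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Counts transcribed from Table 6 (Mixmasterfile.mpg),
# cases A-D: no / camera / object / complex motion.
table6 = [
    [2, 0, 1, 1],
    [1, 15, 2, 1],
    [0, 1, 31, 2],
    [0, 1, 1, 26],
]
print(f"agreement = {confusion_accuracy(table6):.1%}")  # 74 of 85 scenes
```

For Table 6 this gives 74/85, roughly 87% agreement, consistent with the slight discrepancies noted in the text.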

Table 7 compares human judgmental results with proposed model results for different motion types in “Aeroplanes.mp4.” Both the human assessment and the proposed model agreed that there were no scenes with no motion (Case A) or object motion (Case C). For camera motion (Case B), both methods identified 5 scenes. In the case of complex motion (Case D), both the human judgment and the proposed model recognized 6 scenes. Overall, the comparison shows complete agreement between the human judgmental results and the proposed model results, with both methods consistently identifying the same number of scenes across all motion categories, totaling 11 scenes.

Table 7. Comparison between human judgment and proposed method on Aeroplanes.Mp4.

Human/Predicted Case A Case B Case C Case D Total
Case A: No Motion 0 0 0 0 0
Case B: Camera Motion 0 5 0 0 5
Case C: Object Motion 0 0 0 0 0
Case D: Complex Motion 0 0 0 6 6
Total 0 5 0 6 11

In Table 8, the comparison between human judgmental results and proposed model results for “Wildlife.avi” shows complete alignment across all motion categories. Both the human assessment and the proposed model agreed that there were no scenes with no motion (Case A) or object motion (Case C). For scenes with camera motion (Case B), both methods identified 6 scenes. Likewise, for scenes categorized as complex motion (Case D), both identified 4 scenes. Overall, there is perfect agreement between the human judgmental results and the proposed model results, with both methods consistently identifying the same number of scenes across all motion categories, totaling 10 scenes.

Table 8. Comparison between human judgment and proposed method on Wildlife.Avi.

Human/Predicted Case A Case B Case C Case D Total
Case A: No Motion 0 0 0 0 0
Case B: Camera Motion 0 6 0 0 6
Case C: Object Motion 0 0 0 0 0
Case D: Complex Motion 0 0 0 4 4
Total 0 6 0 4 10

In Table 9, the comparison between human judgmental results and proposed model results for “Pool.mp4” shows complete agreement across all motion categories. Both the human assessment and the proposed model identified 5 scenes with no motion (Case A) and 2 scenes with camera motion (Case B). In scenes categorized as object motion (Case C), both recognized 9 scenes, and in scenes characterized as complex motion (Case D), both identified 1 scene. Overall, the human judgmental results and the proposed model results coincide in every category, totaling 17 scenes.

Table 9. Comparison between human judgment and proposed method on Pool.Mp4.

Human/Predicted Case A Case B Case C Case D Total
Case A: No Motion 5 0 0 0 5
Case B: Camera Motion 0 2 0 0 2
Case C: Object Motion 0 0 9 0 9
Case D: Complex Motion 0 0 0 1 1
Total 5 2 9 1 17

4.3 Comparison with existing motion detection algorithms

4.3.1 Computational efficiency.

Macro block-based motion analysis avoids the complex inference stages of deep learning models. Because it relies only on block-wise comparisons between consecutive frames, it substantially reduces computation time, making real-time operation feasible on low-power devices.
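The block-wise comparison can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper’s implementation: the 50×50 frame size and 10×10 block size follow the paper’s preprocessing, but the sum-of-absolute-differences (SAD) threshold of 200 is an assumed value for demonstration.

```python
def block_sad(prev, curr, bx, by, bs=10):
    """Sum of absolute differences over one bs x bs macro block
    of two consecutive grayscale frames (frames are 2-D lists)."""
    return sum(
        abs(curr[by + r][bx + c] - prev[by + r][bx + c])
        for r in range(bs)
        for c in range(bs)
    )

def moving_blocks(prev, curr, bs=10, thresh=200):
    """Flag macro blocks whose SAD exceeds a threshold. Each block costs
    only bs*bs subtractions, so a 50x50 frame needs just 25 block tests."""
    h, w = len(prev), len(prev[0])
    return [
        (bx, by)
        for by in range(0, h, bs)
        for bx in range(0, w, bs)
        if block_sad(prev, curr, bx, by, bs) > thresh
    ]

# Hypothetical 50x50 frames: a bright patch appears in the top-left block.
f1 = [[0] * 50 for _ in range(50)]
f2 = [[0] * 50 for _ in range(50)]
for r in range(5):
    for c in range(5):
        f2[r][c] = 255
print(moving_blocks(f1, f2))  # only the (0, 0) block is flagged
```

The per-frame cost is a fixed number of integer subtractions, which is why no GPU is needed.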

4.3.2 No extensive training data required.

Unlike deep learning object detectors such as YOLO and Faster R-CNN, macro block methods are self-sufficient and require no training on annotated data. They are therefore data-agnostic and much simpler to deploy in new environments or under unseen conditions.

4.3.3 Robustness to background clutter and occlusion.

Because motion is analyzed block-wise, background motion can still be distinguished from foreground motion even in cluttered or dynamic scenes. The approach uses direct, geometry-aware techniques for detecting moving regions, which remain effective even when an object is partly occluded.

4.3.4 Adapting to scene variability.

Macro block motion vectors are designed to capture relative motion rather than appearance features. They are therefore far less affected by changes in illumination, shadows, or appearance variations, conditions that commonly and severely degrade the performance of appearance-based object detectors.

4.4 Discussion

The methodology adopted in this study detects motion in dynamic backgrounds through a well-structured preprocessing, feature extraction, and motion analysis pipeline. It begins with data collection, in which videos from different sources are gathered to form a diversified dataset essential for robust motion detection and classification. The preprocessing stage involves frame extraction and normalization: frames are normalized to compensate for lighting, contrast, and exposure differences, ensuring that the algorithm responds to real motion rather than to brightness or contrast changes. Subsequently, the frames are resized to a uniform 50×50 resolution and converted to grayscale. This standardization eliminates color variation and emphasizes intensity differences, improving motion detection accuracy. The frames are further segmented into 10×10 macro blocks for high computational efficiency and the ability to capture fine motion detail. Motion vectors for these blocks are estimated from the differences between consecutive frames, so that the displacement and direction of motion can be expressed in polar coordinates. Motion analysis then classifies each case into one of four kinds: no motion, camera motion, object motion, and complex motion. This classification is vital for differentiating situations such as an immobile background, a moving camera, or a scene containing many moving objects. Comparisons between the proposed model and human judgment showed a favorable correlation; a typical case is the video “paris_cif.avi,” in which the model and the human evaluators made identical judgments for scenes showing object and complex motion. Similar levels of agreement were observed in the other videos, with only a few differences in classification, confirming the reliability of the model.
The proposed model combines block-based and pixel-based motion evaluation to fill in missing areas when detecting foreground objects. Together, these techniques make the algorithm well suited to motion detection in complicated video scenarios. The model achieved agreement with human judgment of up to 90%, demonstrating its robustness and its usability in real-life settings where dynamic backgrounds are common. The method and findings of this study contribute to numerous areas of computer vision, image processing, multimedia, motion estimation, object detection, and image segmentation by providing a reliable means of motion sensing in complex video environments.
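The classification step described above can be illustrated with a small rule-based sketch that converts block displacements to polar form and assigns one of the four motion cases. The specific thresholds here (a 5% moving-block fraction for “no motion,” 80% for global motion, and a 15° angular spread for coherence) are illustrative assumptions, not the parameters used in the paper.

```python
import math

def to_polar(dx, dy):
    """Displacement of a macro block in polar form (magnitude, angle in degrees)."""
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

def classify_frame(vectors, mag_thresh=1.0, global_frac=0.8, spread=15.0):
    """Rule-of-thumb classification of a frame's motion-vector field into
    the paper's four cases. Thresholds are illustrative assumptions:
    - no motion: almost no block moves
    - camera motion: most blocks move coherently in one direction
    - object motion: only a minority of blocks move
    - complex motion: many blocks move but their directions disagree
    """
    moving = [to_polar(dx, dy) for dx, dy in vectors
              if math.hypot(dx, dy) > mag_thresh]
    frac = len(moving) / len(vectors)
    if frac < 0.05:
        return "no motion"
    angles = [a for _, a in moving]
    coherent = max(angles) - min(angles) < spread
    if frac >= global_frac:
        return "camera motion" if coherent else "complex motion"
    return "object motion"

# Hypothetical 25-block frame: every block shifts right by 3 pixels,
# so the field is global and coherent.
print(classify_frame([(3, 0)] * 25))
```

A frame where only a few blocks move would instead fall into the object-motion case, and a globally moving but directionally inconsistent field into the complex-motion case.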

5 Conclusion and implications

The present study provides a robust approach to detecting and classifying motion in fully dynamic surroundings, addressing the limitations of conventional methods when the foreground and the background move simultaneously. The proposed system uses block-wise motion vector analysis to successfully differentiate between distinct types of motion: object motion, background motion, camera motion, and complex motion. The adaptive nature of the algorithm ensures reliable detection in real-world settings under variable lighting conditions and in unstructured environments, increasing its applicability to diverse fields such as surveillance, autonomous navigation, and robotics. Experimental results provide strong evidence of a marked increase in accuracy and robustness, validating the practical relevance of the proposed approach. This investigation not only advances the field of motion analysis but also paves the way for future developments in dynamic motion detection and classification, and thus for innovations in computer vision systems. In future work, we will evaluate the method on publicly available datasets with motion ground truth (e.g., CDNet2014) to compute full precision, recall, and F1 metrics and to further strengthen the empirical validation.

Data Availability

The minimal dataset necessary to replicate the study findings is fully accessible to all researchers and can be accessed via the following link: https://zenodo.org/records/17217381.

Funding Statement

This work was supported by Ongoing Research Funding program (ORF-2025-395), King Saud University, Riyadh, Saudi Arabia.

References

  • 1. Yang L, Kumar R, Kaur R, Babbar A, Makhanshahi GS, Singh A, et al. Exploring the role of computer vision in product design and development: a comprehensive review. Int J Interact Des Manuf. 2024;18(6):3633–80. doi: 10.1007/s12008-024-01765-7
  • 2. Suma KG, Patil P, Sunitha G, Mantri VP, Kale ND. Computer vision and its intelligence in industry 4.0. In: Machine learning techniques and industry applications. IGI Global; 2024. p. 119–42.
  • 3. Kosnoff J, Yu K, Liu C, He B. Transcranial focused ultrasound to V5 enhances human visual motion brain-computer interface by modulating feature-based attention. Nat Commun. 2024;15(1):4382. doi: 10.1038/s41467-024-48576-8
  • 4. Cai E, Zhang Y, Ji X, Lu X, Xie L, Zhuang Y, et al. Estimating small structural motions from multi-view video measurement. Engineering Structures. 2023;275:115259. doi: 10.1016/j.engstruct.2022.115259
  • 5. Park J, Kim J, Kim C-S. BiFormer: learning bilateral motion estimation via bilateral transformer for 4K video frame interpolation. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. p. 1568–77. doi: 10.1109/cvpr52729.2023.00157
  • 6. Luxem K, Sun JJ, Bradley SP, Krishnan K, Yttri E, Zimmermann J, et al. Open-source tools for behavioral video analysis: setup, methods, and best practices. Elife. 2023;12:e79305. doi: 10.7554/eLife.79305
  • 7. Kidziński Ł, Yang B, Hicks JL, Rajagopal A, Delp SL, Schwartz MH. Deep neural networks enable quantitative movement analysis using single-camera videos. Nat Commun. 2020;11(1):4054. doi: 10.1038/s41467-020-17807-z
  • 8. Maity S, Chakrabarti A, Bhattacharjee D. Background modeling and foreground extraction in video data using spatio-temporal region persistence features. Computers & Electrical Engineering. 2020;81:106520. doi: 10.1016/j.compeleceng.2019.106520
  • 9. Oral M, Deniz U. Centre of mass model – a novel approach to background modelling for segmentation of moving objects. Image and Vision Computing. 2007;25(8):1365–76. doi: 10.1016/j.imavis.2006.10.001
  • 10. Chia T-L. Modified temporal difference method for change detection. Opt Eng. 2005;44(2):027001. doi: 10.1117/1.1839893
  • 11. Lipton AJ, Fujiyoshi H, Patil RS. Moving target classification and tracking from real-time video. In: Proceedings Fourth IEEE Workshop on Applications of Computer Vision (WACV’98). 1998. p. 8–14.
  • 12. Wren CR, Azarbayejani A, Darrell T, Pentland AP. Pfinder: real-time tracking of the human body. IEEE Trans Pattern Anal Machine Intell. 1997;19(7):780–5. doi: 10.1109/34.598236
  • 13. Boult TE, Micheals R, Gao X, Lewis P, Power C, Yin W, et al. Frame-rate omnidirectional surveillance and tracking of camouflaged and occluded targets. In: Proceedings Second IEEE Workshop on Visual Surveillance (VS’99). p. 48–55. doi: 10.1109/vs.1999.780268
  • 14. Cheng F-C, Huang S-C, Ruan S-J. Illumination-sensitive background modeling approach for accurate moving object detection. IEEE Trans on Broadcast. 2011;57(4):794–801. doi: 10.1109/tbc.2011.2160106
  • 15. Kim SW, Yun K, Yi KM, Kim SJ, Choi JY. Detection of moving objects with a moving camera using non-panoramic background model. Machine Vision and Applications. 2012;24(5):1015–28. doi: 10.1007/s00138-012-0448-y
  • 16. KaewTraKulPong P, Bowden R. An improved adaptive background mixture model for real-time tracking with shadow detection. In: Video-based surveillance systems: computer vision and distributed processing. 2002. p. 135–44.
  • 17. Zeyi Z, Gang L. Target motion detection algorithm based on dynamic threshold. J Phys: Conf Ser. 2021;1738(1):012085. doi: 10.1088/1742-6596/1738/1/012085
  • 18. Chauhan AK, Krishan P. Moving object tracking using gaussian mixture model and optical flow. International Journal of Advanced Research in Computer Science and Software Engineering. 2013;3(4).
  • 19. Raza A, Al Nasar MR, Hanandeh ES, Zitar RA, Nasereddin AY, Abualigah L. A novel methodology for human kinematics motion detection based on smartphones sensor data using artificial intelligence. Technologies. 2023;11(2):55. doi: 10.3390/technologies11020055
  • 20. Basheer Ahmed MI, Zaghdoud R, Ahmed MS, Sendi R, Alsharif S, Alabdulkarim J, et al. A real-time computer vision based approach to detection and classification of traffic incidents. BDCC. 2023;7(1):22. doi: 10.3390/bdcc7010022
  • 21. Kumar M, Yadav DK, Ray S, Tanwar R. Handling illumination variation for motion detection in video through intelligent method: an application for smart surveillance system. Multimed Tools Appl. 2023;83(10):29139–57. doi: 10.1007/s11042-023-16595-0
  • 22. Duong-Trung H, Duong-Trung N. Integrating YOLOv8-agri and DeepSORT for advanced motion detection in agriculture and fisheries. EAI Endorsed Trans Ind Net Intel Syst. 2024;11(1). doi: 10.4108/eetinis.v11i1.4618
  • 23. Singla N. Motion detection based on frame difference method. International Journal of Information & Computation Technology. 2014;4(15):1559–65.
  • 24. Li Y. Infrared spectral imaging-based image recognition for motion detection. International Journal of Intelligent Networks. 2025;6:14–26. doi: 10.1016/j.ijin.2025.01.001
  • 25. Al-Hashemi SQY, Naghmash MS, Ghandour A. Enhancing motion detection in video surveillance systems using the three-frame difference algorithm. Mesopotamian Journal of Computer Science. 2024;2024:167–75. doi: 10.58496/mjcsc/2024/014
  • 26. Luo X, Li J, A X, Deng Y. A combined frame difference and convolution method for moving vehicle detection in satellite videos. Sensors (Basel). 2025;25(2):306. doi: 10.3390/s25020306
  • 27. Kiani K, Hematpour R, Rastgoo R. Automatic grayscale image colorization using a deep hybrid model. Journal of AI and Data Mining. 2021;9(3):321–8.
  • 28. Bautista PA, Hashimoto N, Yagi Y. Color standardization in whole slide imaging using a color calibration slide. J Pathol Inform. 2014;5(1):4. doi: 10.4103/2153-3539.126153
  • 29. Abbadi NKE, Razaq ES. Automatic gray images colorization based on lab color space. IJEECS. 2020;18(3):1501. doi: 10.11591/ijeecs.v18.i3.pp1501-1509
  • 30. Singh R, Bhattacharjee U, Singh AK. Performance evaluation of normalization techniques in adverse conditions. Procedia Computer Science. 2020;171:1581–90. doi: 10.1016/j.procs.2020.04.169
  • 31. Yang L, Yu K, Li J, Li S. An effective variable block-size early termination algorithm for H.264 video coding. IEEE Trans Circuits Syst Video Technol. 2005;15(6):784–8. doi: 10.1109/tcsvt.2005.848306
  • 32. Huang Y-H, Ou T-S, Chen HH. Fast decision of block size, prediction mode, and intra block for H.264 intra prediction. IEEE Trans Circuits Syst Video Technol. 2010;20(8):1122–32. doi: 10.1109/tcsvt.2010.2057018
  • 33. Archana K, Prasad VK. Object pre-processing using motion stabilization and key frame extraction with machine learning techniques. IJACSA. 2022;13(11). doi: 10.14569/ijacsa.2022.0131116
  • 34. Han Y. Simulation-based analysis of free and forced rocking behaviour in rigid URM macro-blocks using physics engines. Canada: McGill University; 2023.
  • 35. Salih YA, George LE. Improved hybrid block-based motion estimation for inter-frame coding. Circuits Syst Signal Process. 2021;40(7):3500–22. doi: 10.1007/s00034-020-01637-x
  • 36. Senbagavalli G, Manjunath R. Motion estimation using variable size block matching with cross square search pattern. SN Appl Sci. 2020;2(8). doi: 10.1007/s42452-020-03248-2
  • 37. Rana S, Kamra R, Sur A. Motion vector based video steganography using homogeneous block selection. Multimed Tools Appl. 2019;79(9–10):5881–96. doi: 10.1007/s11042-019-08525-w
  • 38. N. S, Meenakshi M. Efficient reconfigurable architecture for moving object detection with motion compensation. IJEECS. 2021;23(2):802. doi: 10.11591/ijeecs.v23.i2.pp802-810

Decision Letter 0

Xiaohui Zhang

11 Jul 2025

PONE-D-25-31514: Robust motion detection and classification in real-life scenarios using motion vectors (PLOS ONE)

Dear Dr. Hussain,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 25 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Xiaohui Zhang

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS One has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise. 

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript aims to address motion detection and classification in dynamic environments by introducing an algorithm based on macro block techniques and motion vector analysis. However, the paper falls short in several critical areas. Firstly, the written English is not up to academic standards, with frequent grammatical errors, awkward phrasing, and unclear sentence structures that hinder comprehension. Secondly, the methodology and experimental design are not clearly explained—key steps such as how motion vectors are extracted, analyzed, and classified are described only in broad terms, and essential details about the dataset, evaluation metrics, and experimental setup are lacking. Most importantly, the proposed approach lacks clear novelty; similar techniques have been widely studied in previous literature, and the manuscript does not demonstrate how this work meaningfully advances the state of the art. Overall, the submission does not meet the scientific or linguistic standards required for publication.

Reviewer #2: This paper presents a method based on macroblock technology and motion vector analysis for dynamic background motion detection and classification. The approach offers theoretical contributions and practical value. However, I have the following concerns before publication:

1. The proposed method should be compared with existing motion detection algorithms

2. The evaluation needs more quantitative metrics for accuracy and speed such as recall rate F1 score and processing speed

3. More datasets can be tested, for example, to include tests under varying lighting and weather conditions to assess robustness

4. The choice of 10x10 macroblocks requires further justification

5. The threshold selection process for motion detection parameter th is not clearly explained. How is this threshold determined?

6. The 3D bar charts in Figures 13-16 subfigure (d) do not add much meaningful information.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2026 Jan 12;21(1):e0333191. doi: 10.1371/journal.pone.0333191.r002

Author response to Decision Letter 1


28 Jul 2025

ROBUST MOTION DETECTION AND CLASSIFICATION IN REAL-LIFE SCENARIOS USING MOTION VECTORS

Dear Editor,

We appreciate the time and effort of the editors and reviewers to provide their helpful feedback. We believe that all the insightful comments helped us to improve our manuscript. In the revised version, we have addressed all the reviewers’ concerns. For the editor's and reviewers’ convenience, newly added text and/or changes in the revised manuscript are highlighted in red color. The details of point-by-point responses to the reviewers' comments are presented in subsequent pages of this letter.

Sincerely,

The authors

#handling editor:

Reviewer comment: 1) Firstly, the written English is not up to academic standards, with frequent grammatical errors, awkward phrasing, and unclear sentence structures that hinder comprehension.

Response: Dear editor, Thank you very much for your valuable and constructive comments on our manuscript. I have carefully reviewed the entire paper and thoroughly corrected all grammatical errors and language issues as per your suggestions. I sincerely appreciate your detailed review and helpful feedback, which have greatly contributed to improving the clarity and overall quality of our manuscript.

Reviewer comment: 2) Secondly, the methodology and experimental design are not clearly explained key steps such as how motion vectors are extracted, analyzed, and classified are described only in broad terms, and essential details about the dataset, evaluation metrics, and experimental setup are lacking.

Response: Dear editor,

Thank you for your detailed and constructive feedback. We acknowledge that the methodology section of the previous version of the manuscript lacked specific implementation details. In response, we have substantially revised this section to explain each step of our approach clearly and comprehensively.

1. Motion Vector Extraction:

We now explicitly describe that motion vectors are obtained using the motion estimation capabilities of standard video codecs (specifically MPEG-4), which generate block-based vectors during compression. These vectors are extracted from compressed video streams using FFmpeg's motion estimation tools, enabling efficient pre-processing without the need for full optical flow computation.
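As an illustrative companion to the codec-level extraction described above, the underlying macro block principle can be sketched in a few lines: for each block of the current frame, search a small window of the previous frame for the best match under a sum-of-absolute-differences criterion. This is a minimal stdlib-only sketch, not the manuscript's implementation; the function name and the `block`/`search` parameters are ours.

```python
def block_motion_vectors(prev, curr, block=10, search=4):
    """Estimate one (dy, dx) motion vector per macro block by exhaustive
    search: for each block of `curr`, find the best-matching block of
    `prev` within +/- `search` pixels, minimizing the sum of absolute
    differences (SAD). `prev` and `curr` are 2-D lists of grayscale
    intensities with identical dimensions."""
    h, w = len(curr), len(curr[0])
    vectors = []
    for y in range(0, h - block + 1, block):
        row = []
        for x in range(0, w - block + 1, block):
            best, best_sad = (0, 0), float("inf")
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    # Skip candidate blocks that fall outside the frame.
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    sad = sum(
                        abs(prev[yy + i][xx + j] - curr[y + i][x + j])
                        for i in range(block) for j in range(block)
                    )
                    if sad < best_sad:
                        best_sad, best = sad, (dy, dx)
            row.append(best)
        vectors.append(row)
    return vectors
```

In practice the codec has already done this work during encoding, which is exactly why reading the vectors back from the compressed stream is cheaper than recomputing them.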

2. Analysis and Classification:

The motion vector fields are analyzed by computing their magnitude and direction, and subsequently evaluated across 2500 frames. A statistical thresholding approach is applied to identify consistent motion patterns. This step is now elaborated with a clearer explanation of the logic behind the directional classification (e.g., horizontal, vertical, or complex object movement).
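The directional logic described above can be sketched as follows. The decision rule and the 2:1 dominance ratio are illustrative assumptions on our part, not values taken from the manuscript:

```python
from math import hypot

def classify_motion(vectors, th):
    """Classify a frame's dominant motion from its motion vector field.
    `vectors` is a list of (dy, dx) block vectors; `th` is the magnitude
    threshold below which a block counts as stationary. The 2:1 dominance
    ratio used below is an assumed, illustrative value."""
    moving = [(dy, dx) for dy, dx in vectors if hypot(dy, dx) > th]
    if not moving:
        return "static"
    mean_dy = sum(abs(dy) for dy, _ in moving) / len(moving)
    mean_dx = sum(abs(dx) for _, dx in moving) / len(moving)
    if mean_dx > 2 * mean_dy:
        return "horizontal"
    if mean_dy > 2 * mean_dx:
        return "vertical"
    return "complex"
```

Applied frame by frame, such a rule yields the per-frame motion labels that are then aggregated over the sequence.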

3. Dataset and Setup:

We now clearly specify that publicly available test sequences from standard video benchmarks (e.g., Paris_cif.avi, Foreman_cif.avi) were used. The resolution, frame rate, and total frame count for each sequence are now documented in a dedicated subsection. Moreover, we describe how RGB images were converted to grayscale, normalized to reduce lighting bias, and preprocessed to ensure consistent comparison across frames.
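The grayscale conversion and normalization step can be sketched as below. The BT.601 luma weights and min-max normalization are assumptions on our part; the manuscript does not specify the exact formulas:

```python
def preprocess(frame_rgb):
    """Convert an RGB frame (H x W grid of (r, g, b) tuples, 0-255) to
    grayscale using ITU-R BT.601 luma weights, then min-max normalize to
    [0, 1] to reduce lighting bias across frames."""
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in frame_rgb]
    flat = [v for row in gray for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Uniform frame: no contrast to normalize.
        return [[0.0 for _ in row] for row in gray]
    return [[(v - lo) / (hi - lo) for v in row] for row in gray]
```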

Evaluation Metrics:

Dear editor, we acknowledge the importance of quantitative evaluation using standard metrics such as precision, recall, and F1-score. However, in the current study, the dataset lacks annotated ground-truth labels for motion classes (e.g., object, camera, complex motion), which makes direct computation of these metrics infeasible. To address this limitation, we adopted a human judgment-based validation approach, in which motion classification outputs were independently reviewed by multiple observers, and agreement between the algorithm and the consensus human classification was used as a proxy for accuracy. While this approach introduces subjectivity, it provides a reasonable basis for preliminary evaluation in the absence of labeled data. In addition, we have included processing speed as an objective performance measure: the proposed method processes an average of 28 frames per second on a standard Intel i7 CPU (no GPU), demonstrating its suitability for practical application. In future work, we will evaluate the method on publicly available datasets with motion ground truth (e.g., CDNet2014) to compute full precision, recall, and F1 metrics and further strengthen the empirical validation.

Reviewer comment: 3) The proposed approach lacks clear novelty; similar techniques have been widely studied in previous literature, and the manuscript does not demonstrate how this work meaningfully advances the state of the art.

Response: Dear editor, while motion estimation and vector-based classification have indeed been studied, our work presents the following distinct contributions, which we have now emphasized more clearly in the revised manuscript:

1. Real-World Scenario Focus:

Most prior studies assume static backgrounds or controlled environments. In contrast, our method is explicitly designed for dynamic real-world scenarios, where both the camera and background are moving—conditions under which traditional techniques like frame differencing and basic optical flow often fail.

2. Efficient Use of Codec-Level Motion Vectors:

Rather than relying on computationally intensive optical flow algorithms, we leverage motion vectors directly from encoded video streams, reducing overhead and making the system viable for real-time or resource-constrained applications such as surveillance or mobile devices.

3. Frame-Based Vector Classification Strategy:

We introduce a frame-level motion vector classification mechanism that captures directional motion patterns over time without requiring manual annotation. This offers a lightweight yet accurate solution for motion event detection in continuous video streams.

4. Empirical Validation with Human Judgement:

To bridge the gap between algorithmic prediction and human perception, we include a validation step comparing the system's decisions with those of human observers—achieving 90% agreement, which reinforces the practical relevance of our approach.

5. Generalizable Framework:

Our methodology does not rely on scene-specific training or assumptions. It is designed to generalize across different video types and can be embedded into other motion analysis pipelines or edge devices.

Reviewer #2

Reviewer comment: 1) the proposed method should be compared with existing motion detection algorithms.

Response: Dear editor, our method compares with existing motion detection algorithms as follows:

1. Computational Efficiency

Macro block-based motion analysis avoids complex deep learning inference stages. Because it relies only on block-wise comparisons between frames, it significantly reduces computation time and is therefore suitable for real-time use on low-power devices.

2. No extensive training data required

Unlike deep learning object detection models such as YOLO and Faster R-CNN, macro block methods are self-contained and require no training on annotated data. They are therefore data-agnostic and much simpler to deploy in new environments or under unseen conditions.

3. Robustness to Background Clutter and Occlusion

Because motion is analyzed block-wise, background motion can still be distinguished from foreground motion even in cluttered or dynamic scenes. The approach uses direct, geometry-aware techniques to detect moving regions, which remain effective even when the object is partly occluded.

4. Adapting to Scene Variability

Macro block motion vectors are designed to capture relative motion rather than appearance features. As a result, they are far less affected by changes in illumination, shadows, or appearance variations, which are common conditions that severely degrade the performance of appearance-based object detectors.

Reviewer comment: 2) the evaluation needs more quantitative metrics for accuracy and speed such as recall rate F1 score and processing speed.

Response: Dear editor, as detailed in our response to the handling editor's second comment above, the dataset lacks annotated ground-truth labels for motion classes (e.g., object, camera, complex motion), so precision, recall, and F1-score cannot be computed directly. We therefore adopted a human judgment-based validation approach, using agreement between the algorithm and the consensus of multiple independent observers as a proxy for accuracy. As an objective performance measure, we also report processing speed: the proposed method processes an average of 28 frames per second on a standard Intel i7 CPU (no GPU). In future work, we will evaluate the method on publicly available datasets with motion ground truth (e.g., CDNet2014) to compute full precision, recall, and F1 metrics and further strengthen the empirical validation.

Reviewer comment: 3): More datasets can be tested, for example, to include tests under varying lighting and weather conditions to assess robustness

Response: Dear editor, regarding dataset expansion and robustness testing:

To test the model's adaptability, we extended the evaluation to include sequences from publicly available datasets such as SBI2015. From this dataset we use the Hall and Monitor sequence, shown in Figure 1, together with a custom video, shown in Figure 2. These sequences feature challenges such as variable lighting, dynamic weather, and shadows. Our model retained a performance level similar to that achieved on the custom dataset used in our initial study, reinforcing the robustness and flexibility of our method across diverse scenarios without requiring retraining.

Figure 1: Hall and Monitor dataset

Figure 2: Custom video for more accurate testing

Reviewer comment: 4) The choice of 10x10 macro blocks requires further justification.

Response: Dear editor

Macro block Size Justification

Smaller macro blocks provide finer motion detail but increase computational overhead. For example, a 50×50 frame contains 2,500 pixels; dividing it into 5×5 blocks yields 100 macro blocks (2,500 ÷ 25), allowing fine-grained analysis but taking longer to process. Conversely, 25×25 macro blocks generate only 4 blocks (2,500 ÷ 625), which is too coarse to detect subtle motion variations. A 10×10 macro block size was therefore chosen as a compromise: 25 blocks per 50×50 frame (2,500 ÷ 100) balances detailed motion capture against computational efficiency. This choice is less sensitive to noise than pixel-wise processing while still retaining enough detail to capture relevant motion patterns.
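The block-count arithmetic in this justification can be checked directly; `block_count` is a hypothetical helper of ours, not a function from the manuscript:

```python
def block_count(frame_h, frame_w, block):
    """Number of non-overlapping square macro blocks that tile a frame."""
    return (frame_h // block) * (frame_w // block)

# The 50x50 example from the text:
#   5x5 blocks  -> 100 (fine-grained but slower)
#   10x10 blocks -> 25 (chosen compromise)
#   25x25 blocks -> 4  (too coarse)
counts = {b: block_count(50, 50, b) for b in (5, 10, 25)}
```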

Reviewer comment: 5) The threshold selection process for motion detection parameter th is not clearly explained. How is this threshold determined?

Response: Dear editor

Threshold Selection Explanation

Rather than being fixed as a constant, the threshold (th) for motion classification was defined dynamically. For each video, the mean (μ) and standard deviation (σ) of the motion vector magnitudes were computed, and the threshold was set as th = μ + 0.75σ. This data-driven strategy enables the algorithm to adapt to heterogeneous motion intensities and environmental noise, and consequently provides robust classification results.
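The adaptive threshold formula translates directly into code; `adaptive_threshold` is an illustrative helper name of ours:

```python
from statistics import mean, pstdev

def adaptive_threshold(magnitudes, k=0.75):
    """Per-video motion threshold th = mu + k * sigma over the motion
    vector magnitudes, with k = 0.75 as stated in the response above.
    pstdev is the population standard deviation (sigma)."""
    return mean(magnitudes) + k * pstdev(magnitudes)
```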

Reviewer comment: 6) The 3D bar charts in Figures 13-16 subfigure (d) do not add much meaningful information.

Response: Dear editor

3D GRAPH

Subfigure (d) of Figures 13-16 has been revised in light of the reviewer's comment. It previously showed a 3D histogram of pixel intensities, which was not well suited to motion analysis. It now shows a heat map of motion vector magnitudes across the frame. This representation clearly communicates which spatial areas are undergoing motion, aiding validation and interpretation of the model's classification decisions, as shown in Figure 3.

Figure 3: 3D plot replaced with a heat map for better visualization

Dataset resource:

https://sbmi2015.na.icar.cnr.it/SBIdataset.html.

Attachment

Submitted filename: Motion comments (2) (1).docx

pone.0333191.s002.docx (569.7KB, docx)

Decision Letter 1

Xiaohui Zhang

11 Sep 2025

ROBUST MOTION DETECTION AND CLASSIFICATION IN REAL-LIFE SCENARIOS USING MOTION VECTORS

PONE-D-25-31514R1

Dear Dr. Ullah,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Xiaohui Zhang

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewer #1:

Reviewer #2:

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: All comments have been addressed, including language clarity, methodology, dataset description, evaluation, and novelty. Figures and key choices like macro block size and threshold have been explained clearly. Current version can be accepted.

Reviewer #2: All the comments have been thoroughly addressed. I have no concern regarding the manuscript and recommend it for acceptance and publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

Acceptance letter

Xiaohui Zhang

PONE-D-25-31514R1

PLOS One

Dear Dr. Ullah,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS One. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Xiaohui Zhang

Academic Editor

PLOS One

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Motion comments (2) (1).docx

    pone.0333191.s002.docx (569.7KB, docx)

    Data Availability Statement

    The minimal dataset necessary to replicate the study findings is fully accessible to all researchers and can be accessed via the following link: https://zenodo.org/records/17217381.


    Articles from PLOS One are provided here courtesy of PLOS

    RESOURCES