MethodsX. 2024 Jun 14;13:102780. doi: 10.1016/j.mex.2024.102780

Behavioral profiling for adaptive video summarization: From generalization to personalization

Payal Kadam a,b, Deepali Vora a, Shruti Patil a, Sashikala Mishra a, Vaishali Khairnar c
PMCID: PMC11239710  PMID: 39007030

Abstract

In today's world of managing multimedia content, dealing with the sheer volume of CCTV footage poses challenges related to storage, accessibility, and efficient navigation. To tackle these issues, we propose a comprehensive video summarization technique that merges machine-learning methods with user engagement. Our methodology consists of two phases, each bringing improvements to video summarization. In Phase I we introduce a method for summarizing videos based on keyframe detection and behavioral analysis, utilizing YOLOv5 for object recognition, Deep SORT for object tracking, and a Single Shot Detector (SSD) for creating video summaries. In Phase II we present a user-interest-based video summarization system driven by machine learning. By incorporating user preferences into the summarization process, we enhance standard techniques with personalized content curation. Leveraging tools such as NLTK, OpenCV, TensorFlow, and the EfficientDet model enables our system to generate customized video summaries tailored to individual preferences. This approach not only enhances user interaction but also efficiently handles the overwhelming amount of video data on digital platforms. By combining these two methodologies, we advance the application of machine-learning techniques while offering a practical solution to the complex challenges of managing multimedia data.

Keywords: Video Summarization; Query-based Summarization; YOLOv5; Deep SORT Algorithm; Deep Learning; Computer Vision; Information Retrieval

Method name: Keyframe Extraction Based Single View Query Dependent Video Summarization

Graphical abstract



Specifications table

This table provides general information on the method.

Subject area: Computer Science
More specific subject area: Video Summarization
Name of your method: Keyframe Extraction Based Single View Query Dependent Video Summarization
Name and reference of original method: Not Applicable
Reagents/tools: Not Applicable
Experimental design: Our approach addresses issues such as data storage, real-time access, and quick browsing by employing keyframe detection and behavioral analysis for summary generation. By incorporating user queries, we enhance summary relevance for individual users. In Phase I, our system leverages advanced techniques including YOLOv5 for object recognition, Deep SORT for object tracking, and SSD for generating video summaries. In Phase II, a machine-learning-powered user-interest-based video summarization system uses NLTK, OpenCV, TensorFlow, and the EfficientDet model to craft personalized video summaries tailored to individual preferences. This solution not only elevates user experience but also effectively manages video data volumes on digital platforms, offering a significant advancement in multimedia data management through machine-learning applications.
Trial registration: Not Applicable
Ethics: The proposed work does not involve human subjects, animal subjects, or data from a social media platform. Human faces appearing in the results come from publicly available research datasets that were used to evaluate the proposed approach.
Value of the Protocol:
  • Relevance: By taking the query context into account, the summarization method prioritizes the video segments most pertinent to the user's information needs, enhancing the overall relevance of the summary.

  • Facilitation of Decision Making: Concise summaries that directly answer their questions save users time and cognitive effort, allowing them to make well-informed decisions more quickly.

  • Contextual expertise: Through query analysis, the summarization system gains a deeper understanding of the user's context and intent, allowing it to extract and present important information in a more meaningful way.

Background

The motivation behind developing a framework that combines machine learning with user interactivity stems from the challenges posed by the proliferation of video content in today's multimedia landscape, particularly CCTV footage. With vast amounts of video data generated constantly, traditional methods of storage, access, and browsing become increasingly cumbersome. There is therefore a pressing need for innovative solutions that can efficiently summarize video content while also providing real-time access and user-specific interaction.

Description of protocol

A framework combining machine learning and user interactivity, illustrated in Fig. 1, is used to create a concise representation of the input video. In today's multimedia era, maintaining massive volumes of CCTV footage poses considerable issues, such as data storage, real-time access, and fast browsing. To address this, we offer a keyframe-detection and behavioral-analysis-based summarization approach, with the added advantage of query input to generate user-specific summaries. Our system utilizes advanced techniques such as YOLOv5 [1] for object recognition, Deep SORT [2] for object tracking, and a Single Shot Detector (SSD) [3] for generating video summaries, as shown and described in Phase I. Furthermore, we provide a machine-learning-powered user-interest-based video summarization system that improves standard summarization methods by incorporating user preferences. Using tools such as NLTK, OpenCV, TensorFlow, and the EfficientDet model [4], our system generates personalized, short video summaries customized to individual preferences, as shown and described in Phase II. This solution not only enhances user experience but also effectively controls the volume of video data on digital platforms. The combination of these two methodologies offers a substantial improvement in machine-learning applications, providing a realistic answer to the issues presented by multimedia data management.

Fig. 1. Proposed comprehensive approach for video summarization.

Phase I

In today's digital landscape, an abundance of videos saturates various online platforms, demanding substantial bandwidth for seamless viewing experiences. With the continuous influx of video content onto the internet and cloud-based platforms, the necessity for efficient utilization of network resources becomes increasingly evident. Video summarization emerges as a practical solution to this challenge, providing concise yet comprehensive representations of original video material. By distilling key information into succinct summaries, this approach not only saves time but also optimizes space and conserves valuable network and multimedia infrastructure resources, offering a pragmatic strategy to navigate the vast expanse of digital video content efficiently [5].

The proposed video summarization pipeline, shown in Fig. 2, uses OpenCV to split videos into individual frames. Using the cv2.VideoCapture() function, we loaded the video and employed a while loop to extract each frame. Frames were saved with cv2.imwrite() using a naming convention containing the frame number and timestamp. Frame resolution was resized to meet YOLOv5's input size requirements. These frames served as input to the YOLOv5 object detection model, and the resulting frame set is formally represented as stated in Eq. (1):

VS = {f1, f2, f3, …, fn} (1)

Fig. 2. Phase I architecture: generation of single view video summarization.

Here VS denotes the full set of extracted frames, and f1, f2, f3, …, fn denote the consecutive frames.
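
The following is a minimal sketch, not the authors' exact implementation, of this frame-extraction step using OpenCV; the file name "input_video.mp4", the output directory "frames/", and the 640×640 target size are illustrative assumptions.

```python
# Sketch of the frame-extraction step: read each frame, resize it for the detector,
# and save it with a filename that encodes the frame number and timestamp.
import os
import cv2

def extract_frames(video_path="input_video.mp4", out_dir="frames", size=(640, 640)):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_id = 0
    while True:
        ok, frame = cap.read()           # read the next frame f_i
        if not ok:
            break                        # end of video
        timestamp_ms = int(cap.get(cv2.CAP_PROP_POS_MSEC))
        frame = cv2.resize(frame, size)  # match the detector's expected input size
        cv2.imwrite(os.path.join(out_dir, f"frame_{frame_id:06d}_{timestamp_ms}ms.jpg"), frame)
        frame_id += 1
    cap.release()
    return frame_id                      # total number of extracted frames, i.e. |VS|

if __name__ == "__main__":
    print("frames extracted:", extract_frames())
```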

To detect objects in the video frames, we used YOLOv5, a state-of-the-art deep-learning model for object recognition and categorization. YOLOv5 improves on previous YOLO versions with higher accuracy and faster performance. It is based on a single neural network that predicts class probabilities and bounding boxes directly from full images. We used the pre-trained YOLOv5 model available in the open-source repository to detect objects in the video frames. The model was trained on a large dataset of annotated images, which allows it to recognize a wide variety of objects and scenes. For each video frame, we passed the image through the YOLOv5 model to obtain the bounding boxes and class probabilities of the detected objects. The model detects objects belonging to 80 classes, including people, vehicles, animals, and household objects. We filtered the detected objects based on a confidence threshold, considering only objects with a predicted confidence score of at least 50%.
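
A hedged sketch of this detection step, loading the pre-trained YOLOv5 model from the ultralytics repository via torch.hub and filtering detections at a 0.5 confidence threshold; the "frames/" directory follows the earlier sketch.

```python
# Run YOLOv5 on each extracted frame and keep only detections above 50% confidence.
import glob
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)  # COCO-pretrained (80 classes)
CONF_THRESHOLD = 0.5

detections_per_frame = {}
for path in sorted(glob.glob("frames/*.jpg")):
    results = model(path)                       # inference on one frame
    det = results.xyxy[0]                       # tensor rows: [x1, y1, x2, y2, conf, class]
    det = det[det[:, 4] >= CONF_THRESHOLD]      # drop low-confidence boxes
    detections_per_frame[path] = [
        (model.names[int(c)], float(conf), [x1, y1, x2, y2])
        for x1, y1, x2, y2, conf, c in det.tolist()
    ]
```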

To track and analyze detected objects across video frames, we employed the state-of-the-art Deep SORT algorithm, a deep-learning-based object-tracking system. Using the pre-trained Deep SORT model from an open-source repository, the tracker assigned unique IDs to detected objects, employing appearance features and motion models for tracking. Appearance features were extracted through a deep CNN, while motion models predicted object movements. The algorithm gracefully handled complexities such as occlusions and object disappearance/reappearance using a Kalman filter. The resulting object tracks and behavior were crucial inputs for our video summarization algorithm, determining the keyframes for the summary.
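
As an illustration of the tracking step, the sketch below uses the open-source deep-sort-realtime package; this particular package, its parameters, and the helper function are assumptions, since the paper states only that a pre-trained Deep SORT model from an open-source repository was used.

```python
# Associate per-frame detections with persistent track IDs using Deep SORT
# (Kalman-filter motion model plus appearance embeddings).
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(max_age=30)   # keep a track alive up to 30 missed frames (occlusions)

def update_tracks(frame, frame_detections):
    # frame_detections: list of (class_name, confidence, [x1, y1, x2, y2]) from the detector
    raw = [([x1, y1, x2 - x1, y2 - y1], conf, cls)        # convert to (left, top, width, height)
           for cls, conf, (x1, y1, x2, y2) in frame_detections]
    tracks = tracker.update_tracks(raw, frame=frame)       # associates detections with track IDs
    return [(t.track_id, t.to_ltrb()) for t in tracks if t.is_confirmed()]

# Usage: feed frames and their YOLOv5 detections in temporal order, e.g.
#   frame = cv2.imread("frames/frame_000000_0ms.jpg")
#   ids = update_tracks(frame, detections_per_frame["frames/frame_000000_0ms.jpg"])
```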

After detecting objects and their behavior with YOLOv5 and DeepSORT, we employed a video summarization algorithm to select keyframes for a concise video synopsis. The algorithm prioritized frames using a frame-ranking system based on visual saliency and relevance to the video's narrative. Visual saliency was calculated using Deep Saliency, a pre-trained neural network, considering factors such as object motion, size, and color contrast. Object behavior information from DeepSORT influenced frame ranking, emphasizing significant changes or newly appearing objects. We determined the number of frames for the summary using a threshold-based approach tied to the length of the input data. OpenCV was then used to stitch the selected frames into a coherent video summary, capturing essential events and information. To prepare for object detection and behavior analysis with YOLOv5 and DeepSORT, we used OpenCV to convert the video into individual frames: using the "VideoCapture" function, we loaded the video, iterated through the frames, and applied pre-processing steps such as resizing or converting to grayscale. Processed frames were saved as individual image files in a specified directory with unique filenames based on frame numbers. These image frames served as input for the subsequent object detection and behavior analysis algorithms.
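
A minimal sketch of the final selection-and-stitching step: frames are ranked by a saliency/behavior score (the scoring model itself is abstracted away here), the top-ranked fraction is kept, and OpenCV writes the kept frames to a summary video. The 15% keep ratio and 25 fps are illustrative assumptions rather than the paper's settings.

```python
# Rank frames, keep the most salient fraction, and stitch them into a summary video.
import cv2

def stitch_summary(ranked_frames, out_path="summary.mp4", keep_ratio=0.15, fps=25):
    # ranked_frames: list of (frame_path, score); higher score = more salient
    ranked = sorted(ranked_frames, key=lambda x: x[1], reverse=True)
    keep = ranked[: max(1, int(len(ranked) * keep_ratio))]
    # restore temporal order (zero-padded frame numbers make lexicographic = temporal)
    keep = sorted(keep, key=lambda x: x[0])

    first = cv2.imread(keep[0][0])
    h, w = first.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for path, _ in keep:
        writer.write(cv2.imread(path))
    writer.release()
    return out_path
```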

To facilitate the tracking of objects across multiple frames and the identification of common frames between different videos, we assigned a unique identifier (ID) to each frame. We first counted the total frames in the video and assigned a unique ID to each frame based on its position in the video sequence. We then used YOLO v5 to detect objects in each frame and DeepSORT to track the detected objects across multiple frames. For each detected object, we assigned a unique object ID based on the output of DeepSORT. This allowed us to track the movement and behavior of each object across multiple frames. Finally, we identified the common frames between different videos by comparing the frame IDs of the detected objects. Frames that contained the same objects with the same IDs were considered common frames. We saved the frame IDs and object IDs as metadata for each frame, which could be used for further analysis or visualization of the video content. The procedure for allocating an ID is stated in Eq. (2),

ds = {I(f1), I(f2), I(f3), …, I(fn)} (2)

Here ds denotes the set of all frame IDs, and I(f1), I(f2), I(f3), …, I(fn) denote the IDs assigned to frames f1, f2, f3, …, fn, respectively.
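
A small sketch of the ID bookkeeping formalized in Eq. (2): frame IDs follow the temporal position of each frame, tracked object IDs are stored as per-frame metadata, and frames from two videos are treated as common when their object-ID sets match. Function and variable names are illustrative.

```python
# Build per-frame metadata (frame ID -> set of tracked object IDs) and find common frames.
def build_frame_metadata(tracks_per_frame):
    # tracks_per_frame: list (in temporal order) of lists of object IDs seen in each frame
    return {frame_id: set(object_ids)
            for frame_id, object_ids in enumerate(tracks_per_frame)}

def common_frames(meta_a, meta_b):
    # frames whose non-empty object-ID sets match across the two videos
    return [(fa, fb)
            for fa, ids_a in meta_a.items()
            for fb, ids_b in meta_b.items()
            if ids_a and ids_a == ids_b]

# Example:
#   video1 = build_frame_metadata([[1, 2], [1], [3]])
#   video2 = build_frame_metadata([[1], [1, 2]])
#   common_frames(video1, video2)   # -> [(0, 1), (1, 0)]
```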

Finally, we use the MobileNet SSD model [6] for person detection, implemented through the Caffe DNN framework. This model identifies individuals in each video frame, and a summary is generated based on pixel differences. For detailed people-focused summaries, we use the SSD detector to locate individuals and DeepSORT to track them continuously. Frames with detected people are ranked using visual saliency and behavior, and OpenCV stitches the selected frames into a person-focused summary video. This concise summary emphasizes the movement and behavior of individuals in the original video.
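
A hedged sketch of this person-detection step using a Caffe MobileNet-SSD model loaded through OpenCV's DNN module; the .prototxt/.caffemodel file names are placeholders, and class index 15 corresponds to "person" in the commonly used 21-class MobileNet-SSD weights.

```python
# Detect people in a frame with a Caffe MobileNet-SSD model via cv2.dnn.
import cv2
import numpy as np

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt", "MobileNetSSD_deploy.caffemodel")
PERSON_CLASS_ID = 15  # "person" in the 21-class VOC-style label map

def detect_people(frame, conf_threshold=0.5):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()          # shape (1, 1, N, 7): [.., class, conf, x1, y1, x2, y2]
    boxes = []
    for i in range(detections.shape[2]):
        conf = float(detections[0, 0, i, 2])
        cls = int(detections[0, 0, i, 1])
        if cls == PERSON_CLASS_ID and conf >= conf_threshold:
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])  # scale to pixel coords
            boxes.append(box.astype(int).tolist())
    return boxes
```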

Phase II

This section discusses the user-interest-based video summarization method, which is broken into several steps, shown in Fig. 3, for a thorough understanding. The method consists of several processes, beginning with frame extraction and progressing to object detection before converting images back into a video summary. Our methodology begins with the extraction of image frames from the video, using libraries such as OpenCV for image processing and TensorFlow for numerical computation. Subsequently, we use the EfficientDet machine-learning model to detect objects in the video frames, producing a comprehensive list of detected objects.

Fig. 3. Phase II architecture: generation of query dependent video summarization.
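
As a sketch of the Phase II detection step, the code below loads a pre-trained EfficientDet model from TensorFlow Hub; the specific hub URL and the D0 variant are assumptions, since the paper states only that an EfficientDet model is used.

```python
# Detect objects in a frame with a pre-trained EfficientDet model from TensorFlow Hub.
import cv2
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

detector = hub.load("https://tfhub.dev/tensorflow/efficientdet/d0/1")

def detect_objects(frame_bgr, score_threshold=0.5):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    inp = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)   # batch of one image
    out = detector(inp)
    classes = out["detection_classes"][0].numpy().astype(int)
    scores = out["detection_scores"][0].numpy()
    # return the COCO class IDs of confident detections
    return [int(c) for c, s in zip(classes, scores) if s >= score_threshold]
```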

In the process of converting videos into image frames, we rely on the robust capabilities of OpenCV, a powerful library for image processing. OpenCV provides an extensive set of functions and tools that enable us to efficiently read and manipulate video files, extracting individual frames with precision. Leveraging its intuitive interface, we navigate through the video stream, extracting each frame methodically for further analysis. Furthermore, TensorFlow serves as a cornerstone for numerical computation in our workflow. With its versatile architecture optimized for machine-learning tasks, TensorFlow allows us to perform complex numerical computations with ease and efficiency. By integrating TensorFlow into our pipeline, we unlock the potential for advanced data processing and analysis, facilitating tasks such as feature extraction and object detection. Together, the integration of OpenCV and TensorFlow streamlines the conversion of videos into image frames while providing a solid foundation for subsequent analytical tasks.

After collecting frames from the video with OpenCV, we perform image-processing activities such as scaling down the images to optimize subsequent analysis. Downscaling is critical for lowering computational overhead and memory consumption, particularly when working with large numbers of frames. This step entails shrinking each frame to a lower resolution while retaining important visual information, ensuring that the downscaled images remain true to the original material. To accomplish this, OpenCV provides a variety of functions designed for image manipulation, including resizing operations. Using these capabilities, we can easily scale down each frame to the necessary proportions while maintaining key visual elements. This downsampling not only allows for smoother processing but also improves the overall efficiency of subsequent activities such as object detection and feature extraction, resulting in a more streamlined and resource-efficient workflow.
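
A short sketch of this downscaling step: each frame is shrunk to a target width while preserving its aspect ratio (the 320-pixel width is an illustrative choice, not the paper's setting).

```python
# Downscale a frame to a fixed width, keeping the aspect ratio.
import cv2

def downscale(frame, target_width=320):
    h, w = frame.shape[:2]
    scale = target_width / float(w)
    return cv2.resize(frame, (target_width, int(h * scale)), interpolation=cv2.INTER_AREA)
```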

The input and processing of user queries are critical in customizing the video summarization process to individual preferences. We use the Natural Language Toolkit (NLTK), a comprehensive package for natural language processing tasks, to automate this important step. NLTK provides a variety of features, including tokenization, part-of-speech tagging, and semantic analysis, allowing us to interpret and understand user requests effectively. Through NLTK, we transform raw user input into structured data, facilitating deeper analysis and interpretation. After receiving the user query, we use NLTK's tokenization capabilities to break the text down into individual words or tokens, removing unnecessary parts such as articles and pronouns. This tokenization procedure improves the query's specificity and relevance, allowing us to focus on important keywords and concepts. Furthermore, NLTK offers tools for semantic analysis, which allow us to extract semantic meaning from queries and discover significant themes or topics of interest. Integrating NLTK into our workflow improves the accuracy and efficiency of query processing, allowing us to provide more personalized and relevant video summaries based on the user's preferences.
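
A minimal sketch of this query-processing step with NLTK: the query is tokenized and stop words (articles, pronouns, and similar function words) are removed so that only content-bearing keywords remain.

```python
# Tokenize a user query and strip English stop words with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # newer NLTK versions may also need "punkt_tab"
nltk.download("stopwords", quiet=True)

def process_query(query):
    tokens = word_tokenize(query.lower())
    stops = set(stopwords.words("english"))
    return [t for t in tokens if t.isalpha() and t not in stops]

# Example:
#   process_query("Show me the scenes with a cup of coffee")
#   -> ['show', 'scenes', 'cup', 'coffee']
```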

Our solution is based on comparing image features with the processed user query to extract relevant frames. Using spaCy, we turn the lists of objects detected in the frames and the token words from the query into documents, allowing us to calculate cosine similarity scores. The frames considered significant are then used to create a summarized video with OpenCV. To make future use of the code easier, we use pickle to save the values of the detected objects, avoiding duplicate processing.
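
A hedged sketch of this matching step: detected object names and query tokens are wrapped in spaCy documents and compared via cosine similarity, and detected-object lists are cached with pickle to avoid re-processing; the en_core_web_md model and the cache file name are assumptions.

```python
# Score frame relevance against the query with spaCy vectors; cache detections with pickle.
import os
import pickle
import spacy

nlp = spacy.load("en_core_web_md")        # medium model ships with word vectors
CACHE = "detected_objects.pkl"

def load_or_detect(frames, detect_fn):
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)          # reuse previously detected objects
    objects = {path: detect_fn(path) for path in frames}
    with open(CACHE, "wb") as f:
        pickle.dump(objects, f)
    return objects

def frame_relevance(object_names, query_tokens):
    doc_objects = nlp(" ".join(object_names))
    doc_query = nlp(" ".join(query_tokens))
    return doc_objects.similarity(doc_query)   # cosine similarity of document vectors

# Frames whose similarity exceeds a chosen threshold are kept and stitched into the
# query-dependent summary with OpenCV, as described above.
```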

Protocol validation

Datasets

Phase I

We evaluated our proposed framework on two publicly available benchmark datasets for video summarization, SumMe and TVSum [9,10]. The SumMe dataset contains 25 user videos documenting a range of occasions, including holidays, sports, and historical events; their durations vary from 1.5 to 6.5 minutes. The TVSum dataset consists of 50 videos divided into ten categories, such as animal videos, parades, bike riding, and so forth. Both the TVSum and SumMe datasets provide two kinds of ground-truth annotations at the frame level: indicator vectors and importance score vectors. Since many techniques employ frame-level importance values as the ground truth, we use the same arrangement. Because the YouTube videos offer only a limited number of keyframes as ground truth, we explicitly set those as our assessment target. Table 1 shows the original and summarized video lengths of the videos taken from the SumMe, TVSum, and real-time datasets. The comparison of original video lengths to their respective summarized lengths across the datasets showcases the efficiency and effectiveness of the summarization techniques in condensing video content while preserving its essence.

Table 1.

Phase I: Comparison of original video length and summarized video length.

Dataset Video Number Original Video length Summarized video length
TVSum Video 6 1 min 50 sec 52 sec
TVSum Video 15 1 min 38 sec 49 sec
TVSum Video 20 2 min 25 sec 1 min 40 sec
TVSum Video 25 1 min 44 sec 1 min 1 sec
TVSum Video 35 4 min 29 sec 2 min 55 sec
SumMe Video 2 2 min 38 sec 20 sec
SumMe Video 4 1 min 27 sec 50 sec
SumMe Video 11 6 min 38 sec 2 min 50 sec
SumMe Video 13 30 sec 18 sec
SumMe Video 24 2 min 35 sec 1 min 20 sec
Real time video Video 1 47 sec 43 sec
Real time video Video 2 55 sec 28 sec
Real time video Video 3 45 sec 10 sec
Real time video Video 4 1 min 22 sec 36 sec
Real time video Video 5 1 min 5 sec 31 sec

Across the TVSum dataset, original video lengths vary from as short as 1 minute 38 seconds to as long as 4 minutes 29 seconds, with summarized lengths ranging from 49 seconds to 2 minutes 55 seconds. Similarly, in the SumMe dataset, original video lengths span from 30 seconds to 6 minutes 38 seconds, with summarized lengths condensed to between 18 seconds and 2 minutes 50 seconds. Notably, in the real-time video dataset, the summarization process significantly reduces video lengths while retaining crucial information, with original videos ranging from 45 seconds to 1 minute 22 seconds and their summarized counterparts condensed to between 10 and 43 seconds. These findings underscore the effectiveness of video summarization techniques in facilitating efficient content browsing and retrieval, particularly in scenarios where time and attention are limited, such as real-time video consumption and large-scale video collections. The results of the performance evaluation across the datasets are shown in Table 2, which reports the differences in F-score, recall, and precision, indicating the efficacy of the video summarization method across data domains. The SumMe dataset's precision of 37.1%, recall of 45.1%, and F-score of 77.7% indicate a moderate level of summarization quality. In comparison, the TVSum dataset displays a slightly lower F-score of 72.7% with somewhat higher precision and recall values of 56.0% and 56.4%, respectively, suggesting a generally balanced performance in terms of precision and recall. Notably, on the real-time video dataset, recall and F-score are 49.2% and 56.2%, respectively, while precision is 66.2%, higher than in the other datasets.

Table 2.

Performance evaluation on different datasets.

Dataset Precision Recall F-Score
SumMe 37.1 45.1 77.7
TVSum 56.0 56.4 72.7
Real time video 66.2 49.2 56.2
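
For reference, a minimal sketch of how frame-level precision, recall, and F-score could be computed from predicted and ground-truth keyframe sets; these are the standard definitions, and the benchmarks' exact evaluation protocol may differ.

```python
# Standard precision/recall/F-score over predicted vs. ground-truth keyframe sets.
def precision_recall_fscore(predicted_frames, ground_truth_frames):
    predicted, truth = set(predicted_frames), set(ground_truth_frames)
    overlap = len(predicted & truth)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(truth) if truth else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return 100 * precision, 100 * recall, 100 * f_score   # values in percent
```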

The comparative analysis of F-score metrics across videos from the SumMe and TVSum datasets, shown in Table 3, demonstrates differing levels of summarization efficacy. F-scores in the SumMe dataset vary from 26.4% to 77.7%; Video 25 has the highest score, suggesting an especially effective summary. On the other hand, videos with lower F-scores, such as Video 21 and Video 16, indicate that summarization quality can still be improved. Similarly, F-scores in the TVSum dataset range widely, from 42.5% to 72.7%. Among these videos, Video 45 has the highest F-score, indicating a particularly successful summary, whereas Videos 8 and 34 have relatively lower F-scores. These results illustrate areas where refinement and optimization might improve summarization quality, underscoring the importance of considering specific video characteristics and content complexity when evaluating the effectiveness of video summarization algorithms.

Table 3.

F-Score (in percent) of videos from the SumMe and TVSum datasets.

Dataset Video Number Video Length in Sec F-Score
SumMe Video 4 100 28.8
SumMe Video 9 193 35.5
SumMe Video 16 300 28.6
SumMe Video 21 220 26.4
SumMe Video 25 172 77.7
TVSum Video 5 191 64.1
TVSum Video 6 110 56.2
TVSum Video 8 133 42.5
TVSum Video 14 510 46.8
TVSum Video 22 243 58.0
TVSum Video 31 260 66.9
TVSum Video 34 274 48.2
TVSum Video 38 149 43.6
TVSum Video 43 265 62.5
TVSum Video 45 446 72.7

The average ratings for informativeness and visual pleasantness, reported in Table 4, are consistently high across datasets, showing that the summarized videos are of good quality and perceived as successful. Both informativeness and visual pleasantness have high average ratings (9.1 and 9.3, respectively) in the SumMe dataset. Comparably, the TVSum dataset shows excellent evaluations; informativeness is rated 9.4, slightly higher than visual pleasantness at 9.1. Similar positive scores are given to both aspects in the real-time video dataset, where informativeness and visual pleasantness score 9.2 and 9.3, respectively. These results imply that the video summarization methods applied to different datasets are effective in providing visually appealing and informative summaries, improving user satisfaction with the summarized content. As the data in Table 5 show, our suggested approach performs promisingly across a variety of evaluation parameters when compared to state-of-the-art video summarization algorithms. Compared to other methods, such as TTH-RNN [7], SBOPM [8], and GCN [11], which have F-scores ranging from 46.6% to 50.5%, our method achieves an F-score of 56.2%. Notably, our method also performs well in terms of precision, scoring 66.2%, while the other methods' precision scores range from 40.5% to 50.8%.

Table 4.

Statistical data.

Dataset Measures Average Rating
SumMe Informativeness 9.1
SumMe Visual Pleasantness 9.3
TVSum Informativeness 9.4
TVSum Visual Pleasantness 9.1
Real time video Informativeness 9.2
Real time video Visual Pleasantness 9.3
Table 5.

Comparison of the proposed method with state-of-the-art video summarization methods.

Methodology F-Score Precision Recall
TTH-RNN [7] 46.6 40.5 53.2
SBOPM [8] 48.54 41.41 64.49
DSNet [10] 51.2 50.8 51.9
GCN [11] 50.5 50.0 51.3
Ours (Phase I) 56.2 66.2 49.2

Phase II

Table 6 shows the notable differences in the lengths of the generated videos when looking at the summarized outputs for different queries on the TVSum and YouTube datasets. Depending on the query used, the summarized video lengths differ significantly across the two datasets. For example, the summary durations for the queries "Cup of Coffee" and "Food" on the TVSum dataset varied from 1 second to 1 minute and 15 seconds. Comparably, the summarization times for the queries "Train arriving" and "Dog" on the YouTube dataset range from only 2 to 9 seconds.

Table 6.

Summarized output with different queries on the same video of the TVSum and YouTube datasets.

Sr. No; TVSum: Title of Video, Query, Original Video Length, Summarized Video Length; YouTube: Title of Video, Query, Original Video Length, Summarized Video Length
1. 37rzWOQsNIw Birds 3 m 11 s 38s Crazy Passenger Refuses to Miss Flight Typing on laptop 2 m 36s 4s
2. 37rzWOQsNIw Car 3 m 11 s 12s Crazy Passenger Refuses to Miss Flight Train arriving 2 m 36s 2s
3. 37rzWOQsNIw Food 3 m 11 s 1 m 15s Crazy Passenger Refuses to Miss Flight Cell phone 2 m 36s 5s
4. 37rzWOQsNIw Cup of Coffee 3 m 11 s 1s Crazy Passenger Refuses to Miss Flight Cat 2 m 36s 7s
5. 37rzWOQsNIw Burger 3 m 11 s 23s Crazy Passenger Refuses to Miss Flight Dog 2 m 36s 9s

As per Table 7, the lengths of the summarized outputs for different queries on different videos from the TVSum and YouTube datasets show significant differences, indicating the efficacy of query-dependent video summarization methods across a variety of content areas. The summary durations for the queries "Chopping vegetables" and "Birds and dogs" on the TVSum dataset vary from 1 second to 25 seconds. Comparably, the summarization durations on the YouTube dataset range from only 2 seconds for the query "Car scenes" to 1 minute 48 seconds for the query "Airplanes". It is worth noting that, even though the original video lengths varied, the summaries typically stayed quite short, demonstrating the capacity of query-dependent summarization techniques to effectively extract pertinent information from videos of any length. These results demonstrate how query-dependent video summarization algorithms can improve browsing experiences and content accessibility across a variety of video collections by offering concise, customized summaries that correspond with user queries.

Table 7.

Summarized output with different queries on different videos of the TVSum and YouTube datasets.

Sr. No; TVSum: Title of Video, Query, Original Video Length, Summarized Video Length; YouTube: Title of Video, Query, Original Video Length, Summarized Video Length
1. iVt07TCkFM0 Bicycles in the video 1 m 44s 20s Inside Virat Kohli's Spacious Nature Inspired Holiday Home | AD India Furniture 3 m 51s 35s
2. 37rzWOQsNIw Birds and dogs 3 m 11s 25s The True Scale of the World's Largest Airports Airplanes 6 m 44s 1 m 48s
3. LRw_obCPUt0 Burgers and salads 4 m 24s 20s Despicable Me 4 | Official Trailer Car scenes 2 m 26s 2s
4. sTEELN-vY30 Train accident 2 m 28s 21s Harley Davidson Street 750 Accident with Cow Cow accident 1 m 30s 5s
5. WG0MBPpPC6I Chopping vegetables 6 m 37s 1s HGV almost crushes car in shocking footage! Furniture 3 m 51s 35s

Table 8 presents the results of video summarization for two different queries, "cars" and "potted plants," using the TVSum and YouTube datasets. For the "cars" query, the original video lengths ranged from 1 minute 44 seconds to 2 minutes 29 seconds, with summarized lengths varying from 10 seconds to 1 minute 16 seconds across the datasets. Similarly, for the "potted plants" query, original video lengths varied from 51 seconds to 3 minutes 14 seconds, with summarized lengths ranging from 15 seconds to 49 seconds. The summarization process significantly reduced the length of the videos across both queries and datasets, showcasing the effectiveness of the summarization techniques employed. However, there appears to be variability in the extent of summarization achieved, likely influenced by factors such as content complexity and dataset characteristics. Further analysis could explore the quality and relevance of the summarized content relative to the original videos, as well as the potential impact of these variations on user experience and information retention. Table 9 compares the proposed approach with existing methods used for generating query-dependent video summaries.

Table 8.

Summarized output with the same query on different videos of the TVSum and YouTube datasets.

Sr. No; TVSum: Title of Video, Query, Original Video Length, Summarized Video Length; YouTube: Title of Video, Query, Original Video Length, Summarized Video Length
1. sTEELN-vY30 Cars 2 m 29s 36s Quick Bedroom Decorating Ideas _ Guest Room Decorations Potted Plants 51s 15s
2. iVt07TCkFM0 Cars 1 m 44s 15s Paris vlog _ coffee and pastries _ Discover all iconic coffee shops and cafés in Paris Potted Plants 3 m 50s 22 s
3. akI8YFjEmUw Cars 2 m 13s 18s Cafe Tour _ Istanbul-Kadikoy _ 4k video _ #coffee #vlog #lifestyle Potted Plants 2 m 20s 19s
4. XzYM3PfTM4w Cars 1 m 51s 1 m 16s Virtual Office Flythrough, London - FPV Drone Tour by Bad Wolf Horizon Potted Plants 3 m 14s 49 s
5. GsAD1KT1xo8 Cars 2 m 25s 10s The White Space _ Office Interior Cinematic Shoot _ 4K Potted Plants 1 m 46s 1 m
Table 9.

Comparison of the proposed method with state-of-the-art query-dependent video summarization methods.

Methodology F-Score
[12] 55.3
[13] 57.6
[14] 55.5
[15] 46.94
Ours (Phase II) 58.6

Results and discussion

Phase I

The generated video summaries exhibited sound performance across the SumMe, TVSum, and YouTube CCTV datasets as well as the real-time sample videos. The method was successful at keyframe extraction on the SumMe dataset, and the behavioral analysis added context awareness to the summaries. Similarly, applying the proposed method to the TVSum dataset produced consistent and competitive results, demonstrating its adaptability to different types of video content. In the case of YouTube CCTV videos, the algorithm proved useful in detecting keyframes and generating more meaningful summaries from them, which favors surveillance applications.

For the videos represented by the sample screenshots, the algorithm also showed promising real-time capability. It adapted to dynamic scenes and captured critical moments promptly, suggesting practical uses in live streaming or event summarization. The results obtained on different datasets highlight the generalization capability of the algorithm and encourage its application to various video summarization scenarios, including a number of current state-of-the-art video applications. Further work may be required to tune and optimize the approach based on limitations found during evaluation and to improve its performance across different video content and scenarios.

The proposed model separates the incoming video into image frames. Using YOLOv5, objects are detected in each frame, and the Deep SORT algorithm is used to analyze a person's behavior. Each frame is labelled with an ID, but only unique frames are used to produce the final output summary. We utilized a machine-learning model, the Single Shot Detector, to integrate all the distinct frames into a summary of the input video. Additionally, we tested our framework on the most popular datasets (SumMe and TVSum). Experiments show that the suggested strategy for video summarization effectively shortens the video content while retaining the important scenes. The resulting video summary samples from SumMe and TVSum are depicted in Fig. 4 and Fig. 5, respectively. We also experimented on real-time video and surveillance footage downloaded from YouTube and created the video summary shown in Fig. 6.

Fig. 4. Video summary samples from the SumMe dataset.

Fig. 5. Video summary samples from the TVSum dataset.

Fig. 6. Video summary samples from the real time videos.

Phase II

The observed differences in summarized video lengths across queries and datasets underscore the nuanced nature of query-dependent video summarization. Notably, the disparity in summarized durations between the TVSum and YouTube datasets highlights the influence of dataset characteristics and query semantics on the summarization process. Fig. 7 and Fig. 8 show sample outputs of the input videos and the generated query-dependent video summaries when different queries are given for the same input video from the TVSum and YouTube datasets.

Fig. 7. Video summary samples from the YouTube dataset: same input video and different input queries.

Fig. 8. Video summary samples from the TVSum dataset: same input video and different input queries.

The findings from Fig. 9 and Fig. 10 indicate the outcomes of video summarization conducted on two distinct queries, "cars" and "potted plants," utilizing both the TVSum and YouTube datasets. These results demonstrate a notable reduction in video lengths following the summarization process, indicating the efficacy of the employed techniques. However, the variability observed in the extent of summarization suggests potential influences from factors such as content complexity and dataset characteristics. Further scrutiny is warranted to assess the quality and relevance of the summarized content relative to the original videos, alongside investigating potential implications for user experience and information retention. Moreover, Table 9 provides a comparative analysis of the proposed approach against existing methodologies for generating query-focused video summaries, offering insight into the effectiveness and competitiveness of the proposed method within the research landscape.

Fig. 9. Video summary samples from the TVSum dataset when the same query is given for different videos (input query for Videos 1 and 2 is "Cars").

Fig. 10. Video summary samples from the YouTube dataset when the same query is given for different videos (input query for Videos 1 and 2 is "Potted Plants").

The findings depicted in Fig. 11 and Fig. 12 highlight considerable differences in the lengths of summarized outputs across various queries on videos from both the TVSum and YouTube datasets, underscoring the effectiveness of query-dependent video summarization methods across diverse content domains. Notably, despite variations in original video lengths, the summaries consistently remained succinct, showcasing the capability of query-dependent summarization techniques to efficiently extract relevant information from videos of varying durations. These results underscore the potential of query-dependent video summarization algorithms to enhance browsing experiences and improve content accessibility within large video collections by providing concise, tailored summaries aligned with user queries. This suggests promising avenues for leveraging such techniques to streamline information retrieval and facilitate more efficient consumption of multimedia content.

Fig. 11. Video summary samples from the TVSum dataset: different input videos and different input queries.

Fig. 12. Video summary samples from the YouTube dataset: different input videos and different input queries.

Future research directions

Future research on video summarization should advance the field and tackle its open challenges. One direction is to incorporate Explainable AI (XAI) techniques into video summarization algorithms. Transparency and interpretability in how summarization models make decisions would help users trust the resulting summaries and better understand how the algorithms arrive at them. This is essential as video summarization is adopted in critical applications such as surveillance and journalism, where interpretability is important for accountability and ethical considerations. Another area of future research is the development of unsupervised or self-supervised learning techniques. By reducing dependence on large annotated datasets, these methods would require less prior knowledge and further improve the scalability and adaptability of video summarization algorithms to new, diverse content. Summarization without labor-intensive labeled training data could exploit intrinsic video properties, temporal structures, and contextual cues. This direction fits the trajectory of the field, which is driven by an ever-increasing demand for efficient, data-efficient algorithms that allow models to generalize across myriad domains and content types.

Limitations

None.

CRediT author statement

Payal Kadam: Conceptualization, Methodology, Data curation, Writing – original draft, Writing – review & editing. Deepali Vora: Writing – review & editing, Project Administration, Funding Acquisition; provided overall project guidance, critically reviewed and edited the manuscript for intellectual content, and administered the project's execution. Shruti Patil: Project Administration; provided overall project guidance and critically reviewed and edited the manuscript for intellectual content. Sashikala Mishra: Project Administration; provided overall project guidance and critically reviewed and edited the manuscript for intellectual content. Vaishali Khairnar: Project Administration; provided overall project guidance and critically reviewed and edited the manuscript for intellectual content.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability

  • Data will be made available on request.

References

  • 1. Saxena N., Asghar M.N. YOLOv5 for road events based video summarization. In: Lecture Notes in Networks and Systems. Springer Nature Switzerland; 2023. pp. 996–1010.
  • 2. Harakannanavar S.S., Sameer S.R., Kumar V., Behera S.K., Amberkar A.V., Puranikmath V.I. Robust video summarization algorithm using supervised machine learning. Glob. Transit. Proceed. 2022;3(1):131–135. doi: 10.1016/j.gltp.2022.04.009.
  • 3. Ul Haq H.B., Asif M., Ahmad M.B., Ashraf R., Mahmood T. An effective video summarization framework based on the object of interest using deep learning. Math. Probl. Eng. 2022;2022:1–25. doi: 10.1155/2022/7453744.
  • 4. Aldahoul N., Karim H.A., Sabri A.Q.Md., Tan M.J.T., Momo Mhd.A., Fermin J.L. A comparison between various human detectors and CNN-based feature extractors for human activity recognition via aerial captured video sequences. IEEE Access. 2022;10:63532–63553.
  • 5. Kadam P., et al. Recent challenges and opportunities in video summarization with machine learning algorithms. IEEE Access. 2022;10:122762–122785. doi: 10.1109/ACCESS.2022.3223379.
  • 6. Manasa Y., Yalanati Ayyappa, Nagarapu Naveen, Pitchika Prakash, Veeranki M.N.V. Sai, Sunkara Sandeep. Extracting the essence of video content: movement change based video summarization using deep neural networks. 2023 7th International Conference on Trends in Electronics and Informatics (ICOEI). IEEE; 2023.
  • 7. Zhao B., Li X., Lu X. TTH-RNN: tensor-train hierarchical recurrent neural network for video summarization. IEEE Trans. Ind. Electron. 2021;68(4). doi: 10.1109/TIE.2020.2979573.
  • 8. Ma M., Mei S., Wan S., Hou J., Wang Z., Feng D.D. Video summarization via block sparse dictionary selection. Neurocomputing. 2020;378. doi: 10.1016/j.neucom.2019.07.108.
  • 9. Liu N., Sun X., Yu H., Zhang W., Xu G. D-MmT: a concise decoder-only multi-modal transformer for abstractive summarization in videos. Neurocomputing. 2021;456:179–189. doi: 10.1016/j.neucom.2021.04.072.
  • 10. Ji Z., Zhao Y., Zhu W., Lu J., Li J., Zhou J. DSNet: a flexible detect-to-summarize network for video summarization. IEEE Trans. Image Process. 2021;30. doi: 10.1109/TIP.2020.3039886.
  • 11. Kipf T.N., Welling M. Semi-supervised classification with graph convolutional networks. Proc. ICLR. 2017. pp. 1–13.
  • 12. Huang J.-H., Murn L., Mrak M., Worring M. Query-based video summarization with pseudo label supervision. arXiv; 2023. doi: 10.48550/ARXIV.2307.01945.
  • 13. Messaoud S., Lourentzou I., Boughoula A., Zehni M., Zhao Z., Zhai C., Schwing A.G. DeepQAMVS: query-aware hierarchical pointer networks for multi-video summarization. arXiv; 2021. doi: 10.48550/ARXIV.2105.06441.
  • 14. Ji Z., Zhang Y., Pang Y., Li X., Pan J. Multi-video summarization with query-dependent weighted archetypal analysis. Neurocomputing. 2019;332:406–416. doi: 10.1016/j.neucom.2018.12.038.
  • 15. Attention network for query-focused video summarization. arXiv. doi: 10.48550/ARXIV.2002.03740.
