Data in Brief. 2020 Dec 26;34:106701. doi: 10.1016/j.dib.2020.106701

A dataset of labelled objects on raw video sequences

Hyomin Choi, Elahe Hosseini, Saeed Ranjbar Alvar, Robert A Cohen, Ivan V Bajić
PMCID: PMC7797526  PMID: 33457477

Abstract

We present an object-labelled dataset called SFU-HW-Objects-v1, which contains object labels for a set of raw video sequences. The dataset is useful when both object detection accuracy and video coding efficiency need to be evaluated on the same data. Object ground truths for 18 of the High Efficiency Video Coding (HEVC) v1 Common Test Conditions (CTC) sequences have been labelled. The object categories used for the labeling are based on the Common Objects in Context (COCO) labels. A total of 21 of the 80 original COCO label classes are found in the test sequences. Brief descriptions of the labeling process and the structure of the dataset are presented.

Keywords: Object detection, Video coding, Video compression, Video coding for machines

Specifications Table

Subject: Computer Vision and Pattern Recognition
Specific subject area: Object detection, object classification, video compression
Type of data: Annotations
How data were acquired: Data was created by processing and analyzing HEVC v1 CTC test video sequences. The sequences were first passed through YOLOv3 object detector to find initial object locations and classes. Then this information was refined manually, frame by frame, using the Yolo_mark software tool, to create the final object labels.
Data format: Analyzed
Parameters for data collection: Raw primary data was converted from YUV420 to RGB24 format prior to data collection.
Description of data collection: The converted RGB24 data were passed through YOLOv3 object detector to find initial object locations and classes. Then this information was refined manually, frame by frame, using the Yolo_mark software tool, to create the final object labels.
Data source location:
  Institution: Simon Fraser University
  City/Town/Region: Burnaby, British Columbia
  Country: Canada
  Latitude and longitude (and GPS coordinates, if possible) for collected samples/data: Latitude: 49.276765, Longitude: −122.917957
  Primary data sources: Raw HEVC v1 CTC video sequences maintained by ITU-T JCT-VC:
  https://www.itu.int/en/ITU-T/studygroups/2017-2020/16/Pages/video/jctvc.aspx
Data accessibility:
  Repository name: Mendeley
  Direct URL to data: http://dx.doi.org/10.17632/hwm673bv4m.1
  Instructions for accessing these data: Secondary data, which this paper describes, is publicly available at the above URL.
  https://www.itu.int/en/ITU-T/studygroups/2017-2020/16/Pages/video/jctvc.aspx

Value of the Data

  • We provide a dataset of object labels for raw (uncompressed) HEVC v1 CTC video sequences.

  • Our dataset can benefit research at the intersection of video coding and computer vision. It is one of the datasets used in the MPEG-VCM (Video Coding for Machines) standardization group.

  • The dataset can be used to study the impact of video compression on object detection, or for developing and analyzing systems that perform video compression and object detection simultaneously, as in MPEG-VCM.

1. Data Description

We present a dataset called SFU-HW-Objects-v1, which contains bounding boxes and object class labels for High Efficiency Video Coding (HEVC) v1 Common Test Conditions (CTC) video sequences [1], [2]. The dataset contains only object labels; the video sequences themselves can be obtained from the Joint Collaborative Team on Video Coding (JCT-VC). Table 1 lists the 18 video sequences in this group, along with their characteristics and the number of object classes found in each sequence. Table 2 lists the object classes found in these sequences. Object class IDs follow the Common Objects in Context (COCO) [3] label indices.

Table 1. HEVC v1 CTC sequences and the number of object classes found in each sequence.

| Class | Sequence name | Width × Height | Frame count | Frame rate (Hz) | Bit depth | Number of object classes |
|---|---|---|---|---|---|---|
| A | Traffic | 2560 × 1600 | 150 | 30 | 8 | 2 |
| A | PeopleOnStreet | 2560 × 1600 | 150 | 30 | 8 | 4 |
| B | BQTerrace | 1920 × 1080 | 600 | 60 | 8 | 9 |
| B | BasketballDrive | 1920 × 1080 | 500 | 50 | 8 | 4 |
| B | Cactus | 1920 × 1080 | 500 | 50 | 8 | 1 |
| B | Kimono | 1920 × 1080 | 240 | 24 | 8 | 2 |
| B | ParkScene | 1920 × 1080 | 240 | 24 | 8 | 4 |
| C | BQMall | 832 × 480 | 600 | 60 | 8 | 3 |
| C | BasketballDrill | 832 × 480 | 500 | 50 | 8 | 4 |
| C | PartyScene | 832 × 480 | 500 | 50 | 8 | 6 |
| C | RaceHorses | 832 × 480 | 300 | 30 | 8 | 2 |
| D | BQSquare | 416 × 240 | 600 | 60 | 8 | 7 |
| D | BasketballPass | 416 × 240 | 500 | 50 | 8 | 4 |
| D | BlowingBubbles | 416 × 240 | 500 | 50 | 8 | 3 |
| D | RaceHorses | 416 × 240 | 300 | 30 | 8 | 2 |
| E | KristenAndSara | 1280 × 720 | 600 | 60 | 8 | 3 |
| E | Johnny | 1280 × 720 | 600 | 60 | 8 | 3 |
| E | FourPeople | 1280 × 720 | 600 | 30 | 8 | 4 |

Table 2. Object classes found in the HEVC v1 CTC sequences.

| Class ID | Object | Class ID | Object | Class ID | Object |
|---|---|---|---|---|---|
| 0 | Person | 17 | Horse | 56 | Chair |
| 1 | Bicycle | 24 | Backpack | 58 | Potted plant |
| 2 | Car | 25 | Umbrella | 60 | Dining table |
| 5 | Bus | 26 | Handbag | 63 | Laptop |
| 7 | Truck | 27 | Tie | 67 | Cell phone |
| 8 | Boat | 32 | Sports ball | 74 | Clock |
| 13 | Bench | 41 | Cup | 77 | Teddy bear |
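
For readers who want to work with the labels programmatically, the 21 classes of Table 2 can be held in a small lookup table. The sketch below is only an illustration: the names `SFU_HW_CLASSES` and `class_name` are hypothetical and are not part of the dataset.

```python
# Class IDs and object names copied from Table 2 (21 entries).
# The dictionary and function names are illustrative, not part of the dataset.
SFU_HW_CLASSES = {
    0: "Person", 1: "Bicycle", 2: "Car", 5: "Bus", 7: "Truck",
    8: "Boat", 13: "Bench", 17: "Horse", 24: "Backpack", 25: "Umbrella",
    26: "Handbag", 27: "Tie", 32: "Sports ball", 41: "Cup", 56: "Chair",
    58: "Potted plant", 60: "Dining table", 63: "Laptop", 67: "Cell phone",
    74: "Clock", 77: "Teddy bear",
}


def class_name(class_id: int) -> str:
    """Return the Table 2 name for a class ID, or a placeholder if it is not listed."""
    return SFU_HW_CLASSES.get(class_id, f"unknown({class_id})")


if __name__ == "__main__":
    print(class_name(26))
```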

Object labels are organized into separate folders, one for each sequence. The data file structure is shown in Fig. 1. There is one text file per frame of each sequence, containing the object labels for that frame. The filename indicates the name of the sequence, resolution, frame rate, and the frame index, as shown in Fig. 1. Within the file, each object is annotated by one row. The first element of each row is the object class ID based on the COCO object categories. For example, the first row in the right part of Fig. 1 shows class ID 26, which corresponds to “Handbag” in Table 2.

Fig. 1. Data files and their contents.

The remaining four numbers in the row specify the bounding box of the corresponding object. The first two are the center coordinates of the bounding box, measured from the top-left corner of the video frame and normalized by the frame resolution; the last two are the width and height of the bounding box, again normalized by the resolution. An example showing how these are computed is presented in Fig. 2. The example shows a frame with two giraffes, so there will be two rows in the annotation file, each starting with index 78 (the COCO class ID for “giraffe”). The big giraffe has a bounding box of width w1, height h1, centered at (x1,y1) relative to the top-left corner of the frame. Hence, the next four numbers in the annotation record for this giraffe will be

$x_1/N \quad y_1/M \quad w_1/N \quad h_1/M$

where M and N are the height and width of the frame, respectively. The small giraffe has a bounding box of width w2, height h2, centered at (x2,y2) relative to the top-left corner of the frame, so the annotation record for this giraffe will be

$x_2/N \quad y_2/M \quad w_2/N \quad h_2/M$

Fig. 2. Illustration of the object annotation format.
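
The normalization described above can be undone in a few lines. The following is a minimal sketch, assuming each row holds five whitespace-separated fields (class ID, then normalized center x, center y, width, height); the separator and the names `PixelBox` and `parse_annotation_row` are assumptions made for illustration and are not taken from the dataset.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """Bounding box in pixel units, as corner coordinates."""
    class_id: int
    left: float
    top: float
    right: float
    bottom: float


def parse_annotation_row(row: str, width: int, height: int) -> PixelBox:
    """Convert one row 'class_id xc yc w h' (normalized) to pixel corners.

    `width` is N and `height` is M in the notation of the text.
    Assumes whitespace-separated fields.
    """
    fields = row.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {row!r}")
    class_id = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:])
    # Undo the normalization: x and box width scale by N, y and box height by M.
    xc, w = xc * width, w * width
    yc, h = yc * height, h * height
    # The stored point is the box center, so step half a box to reach the corners.
    return PixelBox(class_id, xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


if __name__ == "__main__":
    # Made-up row for an 832 x 480 frame (not copied from the dataset).
    print(parse_annotation_row("26 0.5 0.5 0.25 0.5", 832, 480))
```

Corner coordinates are returned because many drawing and intersection-over-union routines expect them; the center form is easy to recover if needed.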

Fig. 3 shows examples of object annotations overlaid on the corresponding frames. As seen in these examples, many of the sequences in the dataset contain people, but they also contain a variety of other objects. The scale and density of the objects vary significantly among the sequences.

Fig. 3. Examples of object annotations overlaid on the corresponding frame.

2. Experimental Design, Materials and Methods

Raw HEVC v1 CTC video sequences in the YUV420 format are maintained by the Joint Collaborative Team on Video Coding (JCT-VC). They can be obtained via FTP following the procedure described in Section 2 of [4]. To create object annotations, we employed a semiautomatic labeling process illustrated in Fig. 4. First, raw YUV420 video sequences were converted to RGB444 (RGB24) and each frame was saved in a separate PNG file. The frames were then fed to the YOLOv3 object detector [5], using the Darknet software from [6], with pre-trained weights trained on the COCO 2014 dataset, to get the initial object annotations. Then, using the annotation editing tool Yolo_mark, incorrect positions of the boxes were manually corrected and falsely detected objects were removed. If an object was detected in an earlier frame and was still visible in the current frame, but was not automatically detected by the YOLOv3 object detector, we manually labelled it in the current frame and continued labeling it in subsequent frames until the object moved out of view. We also tried to fill in the gaps in detection: if an object was detected in frame n and frame n+k but not in the intermediate k−1 frames, then a box and the corresponding label for this object were added in the intermediate frames.
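
The first step of the process, converting YUV420 frames to RGB24, can be sketched as follows. The paper does not state which tool or conversion matrix was used, so everything specific here is an assumption: a planar layout (Y plane, then U, then V), 8-bit samples, even frame dimensions, BT.601 full-range coefficients, and nearest-neighbour chroma upsampling. A different matrix (for example BT.709) may be more appropriate for the high-definition sequences. The function names are illustrative.

```python
def yuv420_frame_size(width: int, height: int) -> int:
    """Bytes in one 8-bit planar YUV 4:2:0 frame (Y plus two quarter-size chroma planes)."""
    return width * height + 2 * ((width // 2) * (height // 2))


def _clamp8(value: float) -> int:
    """Round to the nearest integer and clip to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def yuv420_to_rgb24(frame: bytes, width: int, height: int) -> bytes:
    """Convert one planar YUV 4:2:0 frame to interleaved RGB24.

    Assumptions (not from the paper): Y, U, V plane order, even dimensions,
    BT.601 full-range coefficients, nearest-neighbour chroma upsampling.
    """
    if len(frame) != yuv420_frame_size(width, height):
        raise ValueError("frame length does not match the given resolution")
    cw, ch = width // 2, height // 2
    y_end = width * height
    y_plane = frame[:y_end]
    u_plane = frame[y_end:y_end + cw * ch]
    v_plane = frame[y_end + cw * ch:]
    rgb = bytearray(width * height * 3)
    for row in range(height):
        for col in range(width):
            y = y_plane[row * width + col]
            # Each chroma sample is shared by a 2 x 2 block of luma samples.
            c = (row // 2) * cw + (col // 2)
            u = u_plane[c] - 128
            v = v_plane[c] - 128
            i = 3 * (row * width + col)
            rgb[i] = _clamp8(y + 1.402 * v)
            rgb[i + 1] = _clamp8(y - 0.344136 * u - 0.714136 * v)
            rgb[i + 2] = _clamp8(y + 1.772 * u)
    return bytes(rgb)


if __name__ == "__main__":
    # A 2 x 2 mid-grey frame: four luma samples, one U and one V sample.
    grey = bytes([100, 100, 100, 100, 128, 128])
    print(list(yuv420_to_rgb24(grey, 2, 2)))
```

A pure-Python loop like this is slow for full-resolution frames and is meant only to make the sample layout explicit; an array library or a dedicated video tool would be the practical choice.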

Fig. 4. Labeling process.
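
In the dataset, gaps in detection were filled by hand. Purely as an illustration of the idea, the sketch below shows one way an automated starting point could be generated for the k−1 intermediate frames: linear interpolation between the boxes at frame n and frame n+k. This is a swapped-in technique, not the authors' procedure, and the name `interpolate_gap` is hypothetical.

```python
from typing import Dict, Tuple

# (xc, yc, w, h), all normalized as in the annotation files.
Box = Tuple[float, ...]


def interpolate_gap(box_n: Box, box_nk: Box, k: int) -> Dict[int, Box]:
    """Linearly interpolate boxes for frame offsets 1..k-1 between frames n and n+k.

    Returns a dict mapping offset (1 means frame n+1) to an interpolated box.
    An illustrative helper only; the dataset labels were filled in manually.
    """
    if k < 2:
        return {}  # no intermediate frames to fill
    filled: Dict[int, Box] = {}
    for offset in range(1, k):
        t = offset / k  # fraction of the way from frame n to frame n+k
        filled[offset] = tuple((1 - t) * a + t * b for a, b in zip(box_n, box_nk))
    return filled


if __name__ == "__main__":
    start = (0.25, 0.5, 0.125, 0.25)
    end = (0.75, 0.5, 0.125, 0.25)
    for offset, box in interpolate_gap(start, end, 4).items():
        print(offset, box)
```

Interpolated boxes would still need visual checking, since object motion between frames is rarely exactly linear.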

The final labels and bounding box coordinates were saved to a file corresponding to the current frame. As mentioned before, the annotation file is a text file whose name contains the sequence name, resolution, frame rate, and frame index.
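
Since the file name encodes the sequence name, resolution, frame rate, and frame index, these fields can be recovered with a regular expression. The exact naming pattern is shown only in Fig. 1 and is not spelled out in the text, so the pattern below (`<Sequence>_<W>x<H>_<fps>_seq_<index>.txt`) and the example file name are assumptions that should be checked against the actual files; `parse_label_filename` is a hypothetical helper.

```python
import re
from typing import NamedTuple


class FrameFileInfo(NamedTuple):
    sequence: str
    width: int
    height: int
    frame_rate: int
    frame_index: int


# ASSUMED pattern: <Sequence>_<W>x<H>_<fps>_seq_<index>.txt
# Adjust it if the real file names differ.
_LABEL_NAME = re.compile(
    r"^(?P<seq>[A-Za-z]+)_(?P<w>\d+)x(?P<h>\d+)_(?P<fps>\d+)_seq_(?P<idx>\d+)\.txt$"
)


def parse_label_filename(name: str) -> FrameFileInfo:
    """Split an annotation file name into its fields under the assumed pattern."""
    match = _LABEL_NAME.match(name)
    if match is None:
        raise ValueError(f"file name does not follow the assumed pattern: {name!r}")
    return FrameFileInfo(
        match["seq"],
        int(match["w"]),
        int(match["h"]),
        int(match["fps"]),
        int(match["idx"]),
    )


if __name__ == "__main__":
    # Hypothetical file name built from a Table 1 entry.
    print(parse_label_filename("BasketballDrill_832x480_50_seq_001.txt"))
```

The width and height recovered this way are the N and M needed to convert the normalized box coordinates back to pixels.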

Ethics Statement

The work did not involve any human or animal subjects, nor data from social media platforms.

CRediT Author Statement

Hyomin Choi: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Elahe Hosseini: Methodology, Data curation, Writing - original draft. Saeed Ranjbar Alvar: Conceptualization, Investigation, Validation. Robert A. Cohen: Supervision, Writing - review & editing. Ivan V. Bajić: Supervision, Writing - review & editing, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.

Acknowledgments

This work was supported in part by Huawei and the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant RGPIN-2016-04590.

References

  • 1. F. Bossen, “Common test conditions and software reference configurations,” Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-L1100, Geneva, CH, 2013.
  • 2. G. Corrêa, P. Assunção, L. Agostini, and L.A. da Silva Cruz, Complexity-Aware High Efficiency Video Coding, Springer, 2016.
  • 3. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.L. Zitnick, “Microsoft COCO: common objects in context,” Proceedings of the European Conference on Computer Vision (ECCV), Zurich, 2014.
  • 4. F. Bossen, “Common test conditions and software reference configurations,” ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 JCTVC-B300, July 2010, [Online] Available: https://www.itu.int/wftp3/av-arch/jctvc-site/2010_07_B_Geneva/JCTVC-B300.doc
  • 5. J. Redmon and A. Farhadi, “YOLOv3: an incremental improvement,” arXiv:1804.02767, 2018.
  • 6. A. Bochkovskiy et al., Windows and Linux version of Darknet, commit 4e47429, 2019, [Online] Available: https://github.com/AlexeyAB/darknet/commit/4e47429

Articles from Data in Brief are provided here courtesy of Elsevier
