Detection of Double-Compressed Videos Using Descriptors of Video Encoders

Yun Gu Lee; Gihyun Na; Junseok Byun

doi:10.3390/s22239291

2022 Nov 29;22(23):9291. doi: 10.3390/s22239291

Editor: Antonio Guerrieri

PMCID: PMC9740893 PMID: 36501993

Abstract

In digital forensics, video becomes important evidence in an accident or a crime. However, video editing programs are easily available in the market, and even non-experts can delete or modify a section of an evidence video that contains adverse evidence. The tampered video is compressed again and stored. Therefore, detecting a double-compressed video is one of the important methods in the field of digital video tampering detection. In this paper, we present a new approach to detecting a double-compressed video using the proposed descriptors of video encoders. The implementation of real-time video encoders is so complex that manufacturers should develop hardware video encoders considering a trade-off between complexity and performance. According to our observation, hardware video encoders practically do not use all possible encoding modes defined in the video coding standard but only a subset of the encoding modes. The proposed method defines this subset of encoding modes as the descriptor of the video encoder. If a video is double-compressed, the descriptor of the double-compressed video is changed to the descriptor of the video encoder used for double-compression. Therefore, the proposed method detects the double-compressed video by checking whether the descriptor of the test video is changed or not. In our experiments, we show descriptors of various H.264 and High-Efficiency Video Coding (HEVC) video encoders and demonstrate that our proposed method successfully detects double-compressed videos in most cases.

Keywords: video forgery detection, video tampering detection, double compression detection, digital forensics

1. Introduction

The price of digital imaging systems today has become considerably low, and the use of devices equipped with cameras has become a fact of daily life. Numerous surveillance cameras on roads and streets are constantly recording our surroundings, and when an accident or a crime occurs, video becomes important evidence for digital forensics [1,2]. For example, a video recorded by a car’s digital video recorder (DVR) camera (or black box camera) can be used as important evidence in the case of a car accident. Digital forensics is a forensic science that encompasses all types of digital media devices and digital technologies [3,4]. Delp defines digital forensics as scientific techniques for the preservation, collection, validation, identification, analysis, interpretation, documentation, and presentation of digital media evidence acquired from digital devices [5]. On the other hand, video editing programs are becoming popular, and even non-professional users can easily use such programs to delete or modify a section of an evidence video that contains adverse evidence. Therefore, detecting a forgery video plays a key role in digital forensics.

A video basically consists of a sequence of images (or frames). A forged video can be detected by applying image tampering detection methods [6,7,8] to each frame of the video. However, the performance of this approach has been unsatisfactory for many reasons [9]. Therefore, many algorithms have been proposed specifically to detect forged videos. Sitara [9] classified video tampering detection methods into three types: double compression detection, video inter-frame forgery detection, and region tampering detection. In video inter-frame forgery, a target video is tampered with in the temporal domain. Typical methods are frame removal, frame duplication, and frame insertion. Since this type of forgery alters the target video in the temporal domain, detection methods usually analyze motion, prediction error, and residuals in P frames [10,11,12,13]. In region tampering, an attacker copies a small region of a frame and pastes it into another frame. Chen [14] proposed a method to detect region tampering using features from motion residuals. Lin’s method utilizes temporal copy-and-paste and exemplar-based texture synthesis to detect tampered videos [15]. Su [16] utilized the difference between the current frame and a non-tampered reference frame to detect dynamic foreground removal from static-background video. The last type of video tampering detection is double compression detection. Figure 1 shows the procedure of video forgery using double compression. The original (compressed) video is decompressed first. Then, an attacker modifies the decoded (uncompressed) video as desired using editing software. Finally, the forged video is compressed again using a software video encoder. To detect a double-compressed video, Wang [17] proposed a method using specific static and temporal statistical perturbations of a double-compressed Moving Picture Experts Group (MPEG) video. The double quantization effect is used in [18] to detect double MPEG compression.
Markov-based features are proposed to detect double compression artifacts in [19]. A method to detect double Advanced Video Coding (AVC)/HEVC [20,21] encoding was proposed under the assumption that the former compression has a lower quality than the latter compression [22]. In [23], the probability distribution of quantized non-zero AC coefficients is utilized as features to detect double-compressed video. He et al. [24] proposed a method to detect double compression based on local motion vector field analysis in static-background videos. Bestagini [25] proposed a method to identify the codec and the size of a group of pictures that are used in the first coding step and analyze double-compressed videos. Based on quality degradation mechanism analysis, Jiang et al. [26] proposed a method to detect double compression with the same coding parameters. Recently, Li [27] proposed a semi-supervised learning method to detect double-compressed video using Gaussian density-based one-class classifiers. Mahfoudi proposed the statistical H.264 double-compression detection method based on discrete cosine transform (DCT) coefficients [28]. A motion-adaptive algorithm is proposed to detect HEVC double compression with the same coding parameters [29]. Since video forgery is normally performed on uncompressed domains, the process of video tampering consists of decoding the compressed video, editing the uncompressed video, and compressing the edited video again [27]. Therefore, double compression detection is generally an effective method of video forensics [14,27,30]. Hence, our study focuses on double compression detection.

Figure 1. Procedure of video forgery by double compression.

This paper presents a novel method to detect a double-compressed video under conditions that are slightly limited but frequently occur in practice. There are two assumptions in our study. The first assumption is that the model name of a camera that took a test video is known. Here, the test video denotes the video to be checked for tampering. Since the test video is submitted as evidence of a crime, the model name of the camera that took the test video is generally known. The second assumption is that an attacker is not enough of a video coding expert to develop his (or her) own software video encoder. Therefore, when compressing a forged uncompressed video, the attacker utilizes a software video encoder available either publicly or on the market. In other words, the software video encoder shown in Figure 1 should be available publicly or on the market. The main idea behind the proposed method is that video encoders usually utilize a subset of encoding modes among all possible modes in a trade-off between complexity and performance. The proposed method defines this subset of encoding modes as a descriptor for each video encoder. If the test video is double-compressed for forgery, the descriptor of the double-compressed video should belong to the descriptor of the software video encoder used for the last compression, not the descriptor of the camera model from which the test video was taken. Therefore, the proposed method detects the double-compressed video by comparing the descriptor of the camera model with the descriptor of the test video.

The contributions and novel parts of our study are as follows. First, this paper introduces a new approach to detecting a double-compressed video. While most existing methods analyze only the test video itself, the proposed algorithm considers the characteristics of the hardware video encoder that produced the test video. Second, the proposed method can complement existing methods to further improve detection accuracy. For example, the proposed method first checks whether a test video has been tampered with. If the proposed method fails to detect anything, existing methods can be applied to further examine the test video. Finally, if the proposed method decides that a test video is forged, the probability of a wrong decision is extremely low, so the decision can serve as strong evidence in a criminal case. There is no risk that an unforged video is judged to be forged.

The rest of the paper is organized as follows. Section 2 illustrates how video encoders utilize a subset of encoding modes among all possible modes for a trade-off. Section 3 introduces the H.264 and HEVC descriptor and proposes a method for detecting a double-compressed video. In Section 4, we show experimental results. Finally, we conclude our work in Section 5.

2. Characteristics of Hardware Video Encoder

2.1. Encoder Complexity

A video encoder for international video coding standards, such as H.264 [20] and HEVC [21], is usually much more complex than the corresponding video decoder [31,32]. The video encoder should choose the best prediction modes from the possible candidates, which requires heavy computation. As the number of modes (or options) to choose from increases, the complexity of the video encoder increases. Let us briefly consider the encoder complexity in terms of block sizes and prediction modes. In H.264, a macroblock (MB) of size $16 \times 16$ is encoded in an intra mode or an inter mode. The intra mode supports two types of block sizes: one $16 \times 16$ block or sixteen $4 \times 4$ blocks. There are nine intra prediction modes for the $4 \times 4$ block and four intra prediction modes for the $16 \times 16$ block. The detailed prediction modes are given in [20]. The video encoder should decide the block size of the intra block, that is, whether the MB is predicted as a single $16 \times 16$ block or as $4 \times 4$ blocks. It also selects the intra prediction mode for each sub-block. In the inter mode, the best matching block for an MB is found within previously reconstructed frames, which is called motion estimation and compensation. H.264 supports a block partition technique that divides the MB into sub-blocks to improve motion compensation performance. The MB is first partitioned into one of $16 \times 16$, $16 \times 8$, $8 \times 16$, and $8 \times 8$. For the $8 \times 8$ partition case, each $8 \times 8$ block can be further partitioned into $8 \times 8$, $8 \times 4$, $4 \times 8$, or $4 \times 4$ blocks. The video encoder should estimate the block partition size for each MB. HEVC is an international video coding standard that supports more flexible prediction modes than H.264.
While the basic coding unit of H.264 is an MB of size $16 \times 16$, the basic processing unit of HEVC is a coding tree unit (CTU) [21], which is divided into coding units (CUs) of variable size: $64 \times 64$, $32 \times 32$, $16 \times 16$, and $8 \times 8$. Further, HEVC has 35 intra prediction modes, far more than H.264. Therefore, the complexity of HEVC is higher than that of H.264.
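To make the growth in candidates concrete, the following minimal sketch counts the H.264 choices described above for a single macroblock. The constants come from the text; the function and variable names are our own, and the count ignores motion vectors and reference frames, which enlarge the search space further.

```python
# Illustrative count of H.264 mode candidates for one 16x16 macroblock (MB).
# Names are hypothetical; only the mode and partition counts come from the text.

INTRA_4X4_MODES = 9                               # modes per 4x4 intra block
MB_PARTITIONS = ["16x16", "16x8", "8x16", "8x8"]  # first-level inter partitions
SUB_PARTITIONS = ["8x8", "8x4", "4x8", "4x4"]     # choices for each 8x8 block


def inter_partition_layouts() -> int:
    """Number of distinct ways to partition one MB for inter prediction."""
    non_split = len(MB_PARTITIONS) - 1    # 16x16, 16x8, 8x16
    split = len(SUB_PARTITIONS) ** 4      # four 8x8 blocks, chosen independently
    return non_split + split


def intra_4x4_mode_evaluations() -> int:
    """Mode/block pairs to evaluate when an MB is coded as 4x4 intra blocks."""
    blocks_per_mb = (16 // 4) ** 2        # sixteen 4x4 blocks in a 16x16 MB
    return blocks_per_mb * INTRA_4X4_MODES


print(inter_partition_layouts())      # 259
print(intra_4x4_mode_evaluations())   # 144
```

Even under these simplifications, an exhaustive encoder faces hundreds of candidates per macroblock, which is the cost that a restricted mode subset avoids.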

There are two approaches to implementing a video encoder: software and hardware. Software-based video encoders are inherently flexible, so their algorithms are easy to modify and upgrade. Further, there is no limitation on the size of input videos; as the size increases, only the processing time increases. This flexibility makes the software-based encoder adaptable to various types of applications. As the software-based encoder does not guarantee real-time processing, it is suitable for offline applications. Meanwhile, since the processing blocks in a hardware-based encoder run in parallel, its throughput is generally higher than that of a software-based encoder. However, the hardware video encoder cannot process a video larger than the designed hardware specification due to limitations of memory size, operation frequency, etc. Once the video encoder is implemented in an application-specific integrated circuit (ASIC), it is very difficult to modify or upgrade the encoding algorithms. Hence, hardware-based video encoders are suitable for consumer electronics where real-time encoding is an essential feature.

2.2. Implementation of Video Encoder in Hardware

This paper proposes a method to detect a video forged from a video shot with a camera, such as a car DVR (or a black box). Here, the camera encodes input frames using a hardware video encoder. Hence, this subsection examines the hardware implementation of video encoders.

To achieve high coding performance, video encoders generally need to examine all possible prediction modes and choose the best mode among the candidates. For example, HEVC video encoders need to examine 35 intra prediction candidates and choose the best one. Recent video coding standards are so complex that it is impractical even for hardware-based video encoders to evaluate all possible prediction modes in real time. Therefore, many researchers have proposed new hardware architectures with novel fast algorithms that reduce the processing time and implementation cost while maintaining the coding efficiency. Video encoders need to determine many parameters, such as block size, block partition, intra prediction mode, and so on. Top-tier hardware vendors can develop new advanced algorithms that run in real time for high-performance video encoders. However, since such algorithms increase the development cost, other hardware vendors may find it difficult to develop advanced algorithms for every encoding-mode decision. These hardware vendors need to consider a trade-off between the development cost and the encoder performance. Therefore, some hardware vendors adopt simple and straightforward methods that reduce the development cost without significant performance degradation.

One of the simple and straightforward methods is to use only a subset of encoding modes among all possible ones. Since the three intra prediction modes 0 (vertical), 1 (horizontal), and 2 (DC) for the $4 \times 4$ block are statistically dominant in most video sequences, hardware vendors may develop a video encoder that uses only these three of the nine intra prediction modes for $4 \times 4$ blocks. Such an encoder does not include any hardware block for processing or predicting the remaining six intra prediction modes. Although this simple approach may degrade the coding performance for some videos, it significantly reduces the development cost, hardware complexity, and processing time. According to our observations, many hardware manufacturers adopt this simple way of implementation. The proposed method utilizes this feature to detect a tampered video. In the experimental results, we will show real examples of this simple implementation.
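The restricted mode decision described above can be sketched as follows. This is only an illustration under simplifying assumptions: the cost is a plain sum of absolute differences (SAD) rather than the rate-distortion cost a real encoder would use, all neighbouring pixels are assumed to be available, and the function names are hypothetical.

```python
# Sketch: 4x4 intra mode decision restricted to a subset of modes.
# Only modes 0 (vertical), 1 (horizontal), and 2 (DC) are implemented.


def predict_4x4(mode, top, left):
    """Return the 4x4 prediction for `mode` from the neighbouring pixels."""
    if mode == 0:   # vertical: copy the pixel above down each column
        return [list(top) for _ in range(4)]
    if mode == 1:   # horizontal: copy the left pixel across each row
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:   # DC: rounded mean of the eight neighbouring pixels
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"mode {mode} is not implemented in this sketch")


def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(block[r][c] - pred[r][c]) for r in range(4) for c in range(4))


def choose_mode(block, top, left, allowed_modes=(0, 1, 2)):
    """Pick the cheapest mode, searching only the allowed subset."""
    return min(allowed_modes, key=lambda m: sad(block, predict_4x4(m, top, left)))


# A block of vertical stripes is predicted exactly by mode 0 (vertical).
top = [10, 50, 90, 130]
left = [10, 10, 10, 10]
block = [list(top) for _ in range(4)]
print(choose_mode(block, top, left))  # 0
```

Because `allowed_modes` never contains modes 3 to 8, a bitstream produced this way cannot contain them, whatever the video content is.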

For a video decoder to comply with the video coding standard, the decoder should be able to decode all encoding modes defined in the standard. On the other hand, if the bitstream generated by a video encoder is decodable by the standard decoder, the video encoder is considered to generate a bitstream that conforms to the video coding standard. Hence, the video encoder can comply with the standard even if only some of the encoding modes defined by the standard are implemented. For example, an H.264 video encoder that uses only three intra prediction modes for the $4 \times 4$ intra block complies with the video coding standard.

3. Detection of a Double-Compressed Video

3.1. Descriptor of a Hardware Video Encoder

3.1.1. Descriptor Structure

As mentioned in the previous section, some hardware video encoders support only a subset of encoding modes among all possible ones. Which encoding modes are included in the subset can vary from video encoder to video encoder. Hence, the subset can be used as the descriptor of a video encoder. Let us first examine the descriptor of an H.264 video encoder. There are many encoding modes to be determined by the H.264 encoder. The proposed descriptor includes the intra prediction modes and inter block partition. The profile, level, and group of pictures (GOP) are also included in the descriptor. Table 1 illustrates an example of the H.264 video encoder’s descriptor. In this example, the video encoder supports the baseline profile with level 3.1. A plane intra prediction mode (or mode 3) is not used in intra $16 \times 16$ . The encoder considers only three prediction modes of 0, 1, and 2 for intra $4 \times 4$ . The $8 \times 8$ block in the inter block is not decomposed further. This video encoder supports limited encoding modes compared to the full features of the baseline profile.

Table 1.

Example of the H.264 encoder’s descriptor. BL stands for baseline.

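Encoding Mode		Number
Profile		66 (BL)
Level		3.1
GOP size		6
Intra $16 \times 16$	mode0	385,332
Intra $16 \times 16$	mode1	30,440
Intra $16 \times 16$	mode2	95,570
Intra $16 \times 16$	mode3	0
Intra $4 \times 4$	mode0	690,994
Intra $4 \times 4$	mode1	950,632
Intra $4 \times 4$	mode2	1,476,582
Intra $4 \times 4$	mode3	0
Intra $4 \times 4$	mode4	0
Intra $4 \times 4$	mode5	0
Intra $4 \times 4$	mode6	0
Intra $4 \times 4$	mode7	0
Intra $4 \times 4$	mode8	0
Inter block partition	$16 \times 16$	1,935,837
Inter block partition	$16 \times 8$	0
Inter block partition	$8 \times 16$	0
Inter block partition	$8 \times 8$	1,171,112
Inter block partition	$4 \times 8$	0
Inter block partition	$8 \times 4$	0
Inter block partition	$4 \times 4$	0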
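As a minimal sketch of how the counts in Table 1 could be turned into a descriptor, the snippet below keeps, for each category, the set of modes that occur at least once. The dictionary layout and the function name `build_descriptor` are our own assumptions; the counts are copied from Table 1.

```python
# Sketch: derive a descriptor (the set of used modes) from Table 1's counts.
# The data layout and names are hypothetical; the numbers are from Table 1.

TABLE1_COUNTS = {
    "intra16x16": {0: 385_332, 1: 30_440, 2: 95_570, 3: 0},
    "intra4x4": {
        0: 690_994, 1: 950_632, 2: 1_476_582,
        3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0,
    },
    "inter_partition": {
        "16x16": 1_935_837, "16x8": 0, "8x16": 0, "8x8": 1_171_112,
        "4x8": 0, "8x4": 0, "4x4": 0,
    },
}


def build_descriptor(counts):
    """Map each category to the frozenset of modes with a non-zero count."""
    return {
        category: frozenset(mode for mode, n in modes.items() if n > 0)
        for category, modes in counts.items()
    }


descriptor = build_descriptor(TABLE1_COUNTS)
print(sorted(descriptor["intra16x16"]))       # [0, 1, 2]
print(sorted(descriptor["intra4x4"]))         # [0, 1, 2]
print(sorted(descriptor["inter_partition"]))  # ['16x16', '8x8']
```

The profile, level, and GOP size from Table 1 would be stored alongside these sets; they are omitted here to keep the sketch short.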

CU Type	CU Size
	$64 \times 64$	$32 \times 32$	$16 \times 16$	$8 \times 8$
Skip CU	11,609	48,399	35,748	28,864
Intra CU	0	35,040	35,838	26,458
Inter CU	16,097	32,277	14,679	6978

IPMN	Number	IPMN	Number	IPMN	Number	IPMN	Number
0	12,864	9	0	18	198	27	0
1	49,047	10	2908	19	0	28	0
2	1550	11	0	20	0	29	0
3	0	12	0	21	0	30	0
4	0	13	0	22	0	31	0
5	0	14	0	23	0	32	0
6	0	15	0	24	0	33	0
7	0	16	0	25	0	34	519
8	0	17	0	26	1854

Size	Number	Size
$32 \times 32$	1,535,249	$16 \times 16$
$32 \times 16$	0	$16 \times 32$
$32 \times 8$	0	$32 \times 24$
$8 \times 32$	0	$24 \times 32$

Mode	Video #1	Video #2	Video #3
$M_{0}$	142,342	51,244	15,249
$M_{1}$	522,342	123,142	292,342
$M_{2}$	152,342	24,922	0
$M_{3}$	0	0	125,242
$M_{4}$	0	0	8754

Encoding Mode ($4 \times 4$ Intra Pred.)	Video DB				Model DB
	Video #1	Used	Video #2	Used	
mode0	722,157	O	1,095,886	O	O
mode1	747,453	O	1,490,516	O	O
mode2	1,324,614	O	1,790,590	O	O
mode3	0	×	0	×	×
mode4	0	×	0	×	×
mode5	0	×	0	×	×
mode6	0	×	0	×	×
mode7	0	×	0	×	×
mode8	0	×	0	×	×

Encoding Mode ($2 N \times 2 N$ Intra Pred.)	Video DB				Model DB
	Video #1	Used	Video #2	Used	
mode0	3210	O	25,622	O	O
mode1	2702	O	31,559	O	O
mode2	54	O	1205	O	O
mode3	2	O	399	O	O
mode4	0	×	14	O	O
mode5	1	O	21	O	O
mode6	0	×	183	O	O
…	…	…	…	…	…
mode34	0	×	48	O	O
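In the two tables above, the "Model DB" column is marked O whenever at least one video of the camera model uses the mode (mode4 and mode6 of the second table, for example). Reading this as a set union is our interpretation, not a statement taken from the text. Under that assumption, a model descriptor could be accumulated as sketched below. Only modes 0 to 6 of the second table are included, because the remaining rows are elided in the table.

```python
# Sketch: accumulate a camera-model descriptor as the union of the modes
# used by several videos from that model. The union rule is an assumption
# inferred from the table; the counts are copied from its modes 0-6.

VIDEO_1 = {0: 3210, 1: 2702, 2: 54, 3: 2, 4: 0, 5: 1, 6: 0}
VIDEO_2 = {0: 25_622, 1: 31_559, 2: 1205, 3: 399, 4: 14, 5: 21, 6: 183}


def used_modes(counts):
    """Modes that occur at least once in a single video."""
    return {mode for mode, n in counts.items() if n > 0}


def model_descriptor(videos):
    """Union of the used modes over all videos of one camera model."""
    return set().union(*(used_modes(v) for v in videos))


print(sorted(used_modes(VIDEO_1)))                    # [0, 1, 2, 3, 5]
print(sorted(model_descriptor([VIDEO_1, VIDEO_2])))   # [0, 1, 2, 3, 4, 5, 6]
```

A union only grows as more videos are added, so a mode that one video happens not to use is not mistaken for a mode the encoder lacks.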

Encoding Mode	Model DB	Test Video
	$D_{b}$	$D_{T}^{1}$	$D_{T}^{2}$	$D_{T}^{3}$
A	O	O	O	O
B	O	O	O	O
C	O	O	×	×
D	×	×	×	O
E	×	×	×	×
F	O	O	×	O
G	O	O	O	×
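One plausible reading of the table above is sketched below: a test video is flagged when it uses an encoding mode that the camera model's descriptor does not contain. This decision rule is our assumption, chosen to be consistent with the statement that an unforged video should not be judged as forged; the precise comparison rule is not given in this part of the text. The sets list the modes marked O in each column.

```python
# Sketch: compare test-video descriptors with a model descriptor.
# Assumed rule: any mode used by the test video but absent from the
# model descriptor indicates re-encoding by a different encoder.

D_B = frozenset("ABCFG")            # model descriptor (Model DB column)
TEST_DESCRIPTORS = {
    "D_T^1": frozenset("ABCFG"),
    "D_T^2": frozenset("ABG"),
    "D_T^3": frozenset("ABDF"),
}


def unexpected_modes(model, test):
    """Modes used by the test video that the camera model never uses."""
    return test - model


def is_suspected_double_compressed(model, test):
    """Flag the test video if it contains any unexpected mode."""
    return bool(unexpected_modes(model, test))


for name, desc in TEST_DESCRIPTORS.items():
    print(name, is_suspected_double_compressed(D_B, desc),
          sorted(unexpected_modes(D_B, desc)))
```

Under this rule only $D_{T}^{3}$ is flagged, because it uses mode D. $D_{T}^{2}$ uses fewer modes than the model descriptor, which the rule treats as inconclusive rather than as evidence of forgery.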

Model	$16 \times 16$ Intra Prediction Modes
	Mode 0		Mode 1		Mode 2		Mode 3
	Num	%	Num	%	Num	%	Num	%
a	122,842	49.184	23,182	9.286	103,715	41.526	10	0.004
b	262,525	66.623	61,987	15.731	69,535	17.646	1	0.000

CM	CU Size	Intra $2 N \times 2 N$ Prediction Modes
		0	1	2	3	4	5	6	7	8	9	10	11	…	33	34
1	$32 \times 32$	O	O	O	×	×	×	×	×	×	×	O	×	…	×	O
	$16 \times 16$	O	O	O	×	O	×	O	×	O	×	O	×	…	×	O
	$8 \times 8$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
2	$32 \times 32$	O	O	O	×	×	×	×	×	×	×	O	×	…	×	×
	$16 \times 16$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
	$8 \times 8$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
3	$32 \times 32$	O	O	×	×	×	×	×	×	×	×	O	×	…	×	O
	$16 \times 16$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
	$8 \times 8$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
4	$32 \times 32$	O	O	O	O	O	O	O	O	O	O	O	O	…	×	O
	$16 \times 16$	O	O	O	O	O	O	O	O	O	O	O	O	…	×	O
	$8 \times 8$	O	O	O	O	O	O	O	O	O	O	O	O	…	×	O
5	$32 \times 32$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
	$16 \times 16$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
	$8 \times 8$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
6	$32 \times 32$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O
6	$16 \times 16$	O	O	O	O	O	O	O	O	O	O	O	O	…	O	O

Camera Model	Intra $N \times N$ Prediction Modes
	0	1	2	3	4	5	6	7	8	9	10	…	33	34
a	×	×	×	×	×	×	×	×	×	×	×	…	×	×
b	O	O	×	O	×	O	×	O	×	O	×	…	O	×
c	O	O	O	O	O	O	O	O	O	O	O	…	×	O
d	O	O	O	O	O	O	O	O	O	O	O	…	O	×
e	O	O	O	O	O	O	O	O	O	O	O	…	O	O

Model	Profile	Level	Width	Height
A	Baseline	3	1280	720
B	High	4.1	1920	1080
C	High	4	1920	1080
D	High	4.2	1920	1080
E	High	4.2	1920	1080
F	High	4.2	1920	1080
G	High	4.1	1920	1080
H	High	4	1920	1080
I	Main	4	1920	1080
J	Baseline	3	1280	720
K	Baseline	3.1	1280	720

Input	# of Camera Models	Detectable	Not Detectable	Accuracy (%)
Unforged Video (HEVC)	13	13	0	100
Forged Video (HEVC) using open-source	13	13	0	100
Forged Video (HEVC) using HM source	13	13	0	100
Unforged Video (H.264)	11	10	1	100
Forged Video (H.264) using open-source	11	10	1	100
Forged Video (H.264) using JM source	11	10	1	100
