Abstract
Advances in both lossy image compression and semantic content understanding have been greatly fueled by deep learning techniques, yet these two tasks have been developed separately for the past decades. In this work, we address the problem of directly executing semantic inference from quantized latent features in the deep compressed domain, without pixel reconstruction. Although different methods have been proposed for this problem setting, they are either restricted to a specific architecture or sub-optimal in terms of compressed domain task accuracy. In contrast, we propose a lightweight, plug-and-play solution that is generally compliant with popular learned image coders and deep vision models, making it attractive for a wide range of applications. Our method adapts prevalent pixel domain neural models deployed for various vision tasks to directly accept quantized latent features (rather than pixels). We further suggest training the compressed domain model by transferring knowledge from its pixel domain counterpart. Experiments show that our method is compliant with popular learned image coders and vision task models. Under fair comparison, our approach outperforms a baseline method by a) more than 3% top-1 accuracy for compressed domain classification, and b) more than 7% mIoU for compressed domain semantic segmentation, at various data rates.
Keywords: Learned image compression, compressed domain semantic inference, compressed representation, deep learning
I. Introduction
“A picture (image) is worth a thousand words!” This saying, originated by Henrik Ibsen, provides us with two different perspectives on an image. First, the information conveyed in a single image is immense, for which lossy compression is typically required for efficient exchange across bandwidth-constrained networks. Second, the information contained in an image provides semantic meaning for image content understanding and decision-making. However, for the past several decades, image compression and image content understanding have been developed separately.
Traditional lossy image coding techniques, e.g., JPEG [1], JPEG 2000 [2], and BPG (Intra Profile of High-Efficiency Video Coding), rely on handcrafted tools to exploit pixel or block correlations for differential compression. Although the energy compaction can be well facilitated, the spatial coherency of content semantics is typically destroyed in differentiated residuals (i.e., transform coefficients). Thus, compressed images are usually decoded to pixel-domain representations for subsequent content understanding tasks.
Along with the explosive growth of deep neural network (DNN)-based semantic engines used for discriminative feature aggregation and task execution [3], [4], [5], recent years have witnessed the emergence of numerous DNN-based learned image compression (LIC) approaches [6], [7], [8], [9], [10] that present competitive efficiency to the most recent intra profile of the Versatile Video Coding Standard (VVC) [11]. Similar to the DNNs used in vision tasks, these LICs extract and aggregate features layer by layer for the compact representation of the input image, which significantly differs from the pixel or block based differential coding mechanisms in traditional image coding methods.
This work considers the problem of performing vision tasks directly using (entropy decoded) latent representation produced by LIC approaches, which we formulate as the compressed domain semantic inference problem, as shown in Fig. 1. Compressed domain semantic inference is of significant importance from several aspects:
As image compression and image content understanding have been treated separately for decades, it is a fundamental breakthrough to use the same set of latent features for different tasks. In theory, this is challenging since these latent features need to be both compact (for compression) and discriminative (for semantics).
Having semantic understanding without pixel reconstruction can eliminate the decoding latency, which could reduce the processing delay in many critical public infrastructure applications. For example, in many metropolitan areas, one may need to concurrently process enormous numbers of images collected from surveillance cameras in real time for traffic monitoring and forecasting, safety control, etc., imposing very stringent latency requirements [12].
Furthermore, semantic execution using compressed latent features may also bring security advantages. For example, it prevents the disclosure of pixel-domain representation to potential attackers in a cloud image analysis system [13] by completely skipping the pixel decoding and reconstruction process.
Fig. 1. Semantic inference using compressed latent features without pixel reconstruction.

For a learned image coding system, an image x is transformed by a DNN encoder E (including a quantizer) into a (quantized) compressed representation ŷ, which is then entropy encoded into binary bits for storage or transmission. At the receiver side, we directly use the entropy-decoded feature ŷ to execute vision tasks. The skipping of the DNN decoder to avoid pixel reconstruction is highlighted by the cross mark.
Early attempts have been made to address the compressed domain semantic inference problem. Several methods were developed for handcrafted codecs such as JPEG or JPEG 2000 compressed images [14], [15], [16], as well as for latent features of LICs [17], [18], [19]. These methods target either a particular codec or a specific vision model. For example, Torfason et al. [18] designed a compressed domain variant of ResNet [20] by re-arranging the ResNet blocks. A similar approach is adopted by Wang et al. [19] and achieves even better performance than pixel domain inference at certain bit rates. However, it is not clear how these methods can be generalized to more recent vision task models, such as EfficientNets [21].
This paper therefore targets a more unified setting: adapting any pixel domain vision model to compressed domain inference. Towards this goal, we propose a simple yet effective method that turns existing pixel domain model architectures into compressed domain ones. Our method is motivated by the observation that most architectures for vision tasks start with a downsampling operation, which would be redundant if we take the compressed representation (which already has a low spatial dimension) as input. Our method consists of two steps: a) removing downsampling layers and b) inserting a learnable upsampling layer, which in combination aligns the spatial dimensions of the compressed representation and the pixel domain model input. As an additional contribution, we also present a pixel-to-compressed domain knowledge transfer technique, which relies on a pre-trained pixel domain model to guide the training of its compressed domain counterpart.
Our methods are decoupled from both learned image coders and vision models and can be used in a plug-and-play fashion. In our experiments, we demonstrate this by applying our methods to different learned image coders, different pixel domain architectures, and different vision tasks. The proposed method avoids complex and computationally expensive processes such as model architecture search, while achieving a better rate-accuracy trade-off than previous baselines. The development of LICs and vision models themselves is, however, out of the scope of this paper; such pre-trained models [22], [23] are publicly available. Note that our method currently focuses on learning-based image coders and vision models. Traditional image coders and vision models that rely on handcrafted features are therefore out of scope for this work.
To summarize, the contributions of this paper are as follows.
We propose a simple yet effective method to adapt pixel domain models to process quantized latent features of deep compressed images, making compressed domain semantic inference feasible.
We introduce a pixel-to-compressed domain knowledge transfer loss for compressed domain semantic inference and show that it largely improves the compressed domain vision task accuracy.
We conduct extensive experiments on compressed domain classification and semantic segmentation. Results demonstrate that our method offers a better rate-accuracy-complexity trade-off than previous baseline methods.
Our method is a lightweight, plug-and-play solution, which is generally compliant with most existing learned image coders and vision models, making it attractive for practical applications.
II. Related Work
A. Learned Lossy Image Compression
Learning-based lossy image coders have shown better rate-distortion efficiency than conventional handcrafted image compression standards such as JPEG [1], JPEG 2000 [2], and BPG in recent years. Existing LICs mainly adopt a compressive autoencoder framework as described in [24], where stacked convolutions and nonlinear activations are applied in the encoder to extract and aggregate information as latent features for quantization and entropy coding at the bottleneck layer, and an inverse process is devised in the decoder to convert bits into decoded pixels.
To improve compression performance, most LICs focus on the development of better nonlinear transforms and more accurate entropy models. As for the nonlinear transform, stacked convolutional layers and generalized divisive normalization (GDN) are used in [6], [25], [7], while ResNet [20] blocks and attention mechanisms are applied in [8], [9]. As for entropy context modeling, fully factorized hyperpriors are utilized by Ballé et al. [25] to better predict the probability of the latent features, which is further improved by including autoregressive neighbors in [7]. Such joint utilization of hyperpriors and autoregressive neighbors is then widely applied in [7], [8], [9].
As introduced in Sec. I, our motivation is to perform vision tasks directly upon compressed representations from learned coders to reduce pixel reconstruction delay and to pursue a unified compression-vision framework. In our case, only entropy parsing is involved while the DNN decoder is completely skipped. It can thus be said that compressed domain semantic inference performs a “partial decoding”. Though such partial decoding, especially when using autoregressive entropy models, can still be slow, recent research has proposed entropy models that are significantly faster than the autoregressive model [7] (see [26], [27]), in which case entropy decoding consumes less than 1/4 of the DNN decoder time. Thus, by skipping the DNN decoder, we can save a significant portion of the computation time in the decoding process.
B. Image and Video Coding for Machines
A recent concept in image/video coding that is relevant to our problem is so-called Video Coding for Machines (VCM). In this paradigm, researchers shift attention from image/video coding for human vision to coding for machine vision, motivated by the fact that in many applications images and videos are only used for content analysis and never reconstructed to pixels [28]. The notion of VCM was first introduced in [28], where the authors proposed to code video frames and human poses into separate bit streams to facilitate action recognition, based on the assumption that action is solely determined by human poses. In [29] and [30], researchers proposed face image coding algorithms optimized for both image reconstruction quality and face recognition accuracy. Sun et al. [31] proposed to encode objects in the image separately, so the bit stream itself contains object information (location, dimension, and category), allowing object localization at the decoder side without pixel reconstruction.
The essential difference between the VCM paradigm and our setting lies in the way images are encoded. VCM methods aim at designing image coders to make the bit stream optimized for certain assumptions or vision tasks. In the aforementioned works, a static background with only human movement is assumed in [28], only face images are considered in [29], [30], and a specific object detector at the encoder side is used in [31]. In contrast, we keep the existing image coders unchanged, and we aim at developing methods that work universally for all vision tasks and models. Specifically, we disentangle the encoder and the vision tasks so that the overall system is modular and flexible.
C. Compressed Domain Machine Vision
Some early works have investigated the use of dimensionality reduced data for machine vision, such as self-supervised representation learning using autoencoder [32], compressive sensing for image classification [33], [34], and sparse representations for image super resolution and face recognition [35]. These methods mostly exploit low-dimensionality and sparsity of image data, but not rate-distortion optimized compressed representation as in our case.
A few recent works have explored the problem of task inference from the compressed representations of conventional image coders. For example, Gueguen et al. [14] proposed to perform classification in the DCT domain for JPEG compressed images, and showed that the classification accuracy using DCT coefficients was similar to that of using pixel values. Ehrlich and Davis [36] derived a method of reparameterizing convolutional layers to make DCT domain inference equivalent to pixel domain inference, since the block-based DCT is a linear transform and can therefore be implemented using a single convolutional layer. Similarly, an image classifier was trained on discrete wavelet transform (DWT) coefficients of JPEG 2000 images in [16], and the result rivaled that of methods using RGB pixels. There are also a few works [37], [38], [39], [40] that explored visual recognition from compressed videos. Though the reconstruction of P-frames can be skipped, a full reconstruction of I-frames is still required in most of these methods. In contrast, our methods do not require any pixel data for task inference.
To the best of our knowledge, only a few works [18], [17] have explored vision task inference in the deep compressed domain. In [18], Torfason et al. designed a compressed domain ResNet variant [20] and trained it jointly with a deep image coder for image classification and semantic segmentation. Specifically, they replaced a subset of the ResNet layers with an additional network with different input/output channels and resolution. However, they did not achieve satisfactory results compared to the original ResNet used for pixel domain vision tasks. In addition, joint training is often practically impossible in real-world applications where a proprietary image compression encoder is used. In contrast, our method directly uses quantized latent features from any given LIC for compressed domain task execution, so that joint training of compression and vision tasks is not required, making our method more general and coder-agnostic.
D. Knowledge Transfer
In the context of deep learning, knowledge transfer refers to adapting information from one neural network model to another, which may or may not have the same architecture. For example, transferring learned information from a labeled domain to an unlabeled domain (e.g., synthetic-to-real [41]) is also known as domain adaptation, in which the model architecture is typically the same. Another example is to transfer learned information from an ensemble of large models to a single small model, typically for the same dataset and modality. Methods in this setting are known as knowledge distillation [42], [43], [44], [45] or feature adaptation [46], [47].
The research most relevant to our work is so-called cross-modality knowledge transfer, such as video to sound [48], RGB to depth [49], [50], [51], RGB to flow [50], and luminance to chrominance [51]. Among them, both FitNets [44] and Supervision Transfer [50] use a feature space discrepancy loss to transfer knowledge between different models. In this work, we present a new application of this technique: transferring knowledge from a pixel domain pre-trained model to our compressed domain model, which we term pixel-to-compressed domain knowledge transfer. We show that this method clearly improves compressed domain vision task accuracy in all our experiments.
III. Problem Formulation
We first summarize the pipeline of learning-based image coders, including the autoencoder framework and the rate-distortion optimization, to introduce the notations and provide background knowledge to our problem. We then define the problem of compressed domain semantic inference.
A. Learned Image Coders
Most learned image coders employ an autoencoder-like structure [25], [7], [8], which consists of an encoder network E_φ and a decoder network D_θ, with φ and θ denoting the network parameters. Given a normalized RGB image x, the forward propagation of an autoencoder with quantization can be expressed as:
y = E_φ(x),  ŷ = Q(y),  x̂ = D_θ(ŷ),    (1)
where y is the latent space representation with dimensions depending on the encoder architecture, Q(·) denotes a quantization function, and x̂ is the reconstructed image. The quantization function behaves differently during training and testing. During training, a widely adopted choice is to add element-wise uniform noise U(−0.5, 0.5) [6], while in testing, it is element-wise nearest-integer rounding. Thus, the quantized latent space representation ŷ is continuous during training but discrete during testing.
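As a minimal NumPy sketch (the function name `quantize` is ours, not from the paper), the train/test behavior of the quantizer Q can be written as:

```python
import numpy as np

def quantize(y, training, rng=None):
    """Quantizer Q of Eq. (1): additive uniform noise U(-0.5, 0.5) during
    training (a differentiable proxy for rounding), element-wise
    nearest-integer rounding during testing."""
    if training:
        rng = np.random.default_rng() if rng is None else rng
        return y + rng.uniform(-0.5, 0.5, size=y.shape)  # continuous y_hat
    return np.round(y)                                   # discrete y_hat
```

Note that the training-time noise perturbs each latent element by at most 0.5, mimicking the error range of the rounding applied at test time.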
From the perspective of rate-distortion theory [52], the image x is a continuous random variable and thus has infinite (information) entropy. On the other hand, its low-dimensional representation, ŷ, resides in a discrete space with finite entropy (i.e., the number of bits needed to losslessly code it). Since both the dimension and the entropy of ŷ are much smaller than those of the original data in pixels, we refer to ŷ as the compressed representation, and to the space of ŷ as the compressed domain, as opposed to the pixel domain in which the image x resides.
To efficiently perform image compression, the goal of learned image coders is to minimize a weighted sum of the entropy of compressed representation (rate) and the distance between the reconstructed and original image (distortion):
min_{φ, θ}  R + λ · D,    (2)
or equivalently (note that we omit the network parameter notations for simplicity),
L = E[−log₂ p(ŷ)] + λ · E[d(x, x̂)],    (3)
where we implicitly model ŷ as a random variable (and thus p(ŷ) is the probability mass function of ŷ) estimated by an entropy model, λ is a constant scaling factor, and d(·, ·) is a distance function. Various entropy models have been summarized in Sec. II, and popular choices for d include the mean squared error (MSE) and the multiscale structural similarity (MS-SSIM) [53]. In practice, the expectation in the above equation is approximated by averaging over the training dataset, and the optimization problem is solved using the Stochastic Gradient Descent (SGD) algorithm.
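The per-image objective of Eq. (3) can be sketched as follows (a NumPy illustration under our own assumptions; `uniform_pmf` is a toy stand-in for a learned entropy model, not part of the paper):

```python
import numpy as np

def rd_loss(x, x_hat, y_hat, pmf, lam):
    """Per-image rate-distortion objective of Eq. (3): rate + lambda * distortion.
    `pmf` returns the entropy model's estimated probability of each element
    of y_hat; MSE is used as the distance d."""
    rate = -np.log2(pmf(y_hat)).sum()        # estimated bits to code y_hat
    distortion = np.mean((x - x_hat) ** 2)   # d(x, x_hat) as MSE
    return rate + lam * distortion

# Toy entropy model: every integer symbol assumed to have probability 2^-8,
# i.e., each latent element costs exactly 8 bits.
def uniform_pmf(y_hat):
    return np.full(y_hat.shape, 2.0 ** -8)
```

Averaging this quantity over the training set and minimizing it with SGD recovers the training procedure described above; in a real LIC, the pmf is conditioned on hyperpriors and/or autoregressive context rather than fixed.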
Once the image coder is faithfully trained, it compresses an image by first transforming it into the (quantized) compressed representation, followed by coding the compressed representation into bit streams using a properly designed entropy engine such as Huffman coding [54] or arithmetic coding, conditioned on the estimated probabilities p(ŷ). However, since neither the training of the image coder nor the entropy coding is the focus of this paper, we do not go into further detail here. Interested readers may refer to [55], [8] for detailed descriptions. In this paper, we directly use pre-trained popular LICs in testing without additional training. As a result, ŷ is discrete whenever we refer to it.
B. Compressed Domain Semantic Inference
Suppose at the receiver side, we obtain a bit stream that carries image data, and we would like to analyze its content semantics using a neural vision model. In the conventional decode-then-analyze scheme, we first apply entropy decoding to interpret the bits into the compressed representation ŷ, and then reconstruct ŷ back into pixels using the decoder network D_θ for task execution in the subsequent vision model:
o = V_ψ(D_θ(ŷ)),    (4)
where V_ψ is the vision model parameterized by ψ, and o is the model output. In contrast, for the proposed compressed domain semantic inference paradigm, we wish to let the vision model directly operate on the compressed representation ŷ for inference, i.e.,
o = V_ψ(ŷ),    (5)
where the decoder network D_θ is skipped (entropy decoding is still required since we need ŷ), and thus the computational complexity can be reduced compared to the pixel domain method in (4).
For this work, we assume that the training of compressed domain vision tasks is still performed in a supervised manner, i.e., we rely on pairs of compressed representations ŷ and labels ℓ to update the network parameters for error minimization. The vision model architecture, training data, and loss function all depend on the specific vision task. For example, in image classification, the label ℓ is typically the one-hot encoding of the object category, and the training loss is the cross entropy between the prediction o and the label ℓ. Note that during training, we only update the parameters of the vision model without modifying the LIC.
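To make the contrast between the two paradigms concrete, here is a toy sketch (all shapes and stand-in callables are ours, not the actual networks): the decoder maps the 14×14 latent back to 224×224 pixels, while the compressed domain model consumes ŷ directly.

```python
import numpy as np

# Stand-in for the DNN decoder: 16x nearest-neighbor upsampling.
def decoder(y_hat):
    return np.repeat(np.repeat(y_hat, 16, axis=0), 16, axis=1)

pixel_model = lambda img: float(img.mean())     # vision model on pixels, as in (4)
comp_model = lambda y_hat: float(y_hat.mean())  # vision model on y_hat, as in (5)

y_hat = np.round(np.random.default_rng(0).normal(size=(14, 14)))

o_pixel = pixel_model(decoder(y_hat))  # decode-then-analyze
o_comp = comp_model(y_hat)             # compressed domain inference, decoder skipped
```

Both paths start from the same entropy-decoded ŷ; only the second skips the DNN decoder, which is where the complexity saving comes from.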
IV. Method
This section describes how to transfer knowledge from pixel domain vision models to improve their compressed domain counterparts. We first propose a pixel-to-compressed domain knowledge transfer strategy that learns prior knowledge from pixel domain models to enhance the performance of compressed domain vision tasks. Next, we introduce a simple yet efficient method to quickly adapt pixel domain vision model architectures to execute compressed domain tasks. As shown in Sec. V later, we surpass the previous baseline method by a large margin and achieve accuracy similar to that of pixel domain models while requiring far fewer computational resources (due to skipping the DNN decoder).
A. Pixel-to-Compressed Domain Knowledge Transfer
Inspired by FitNets [44] and Supervision Transfer [50], which both apply a feature-domain distance loss to guide knowledge transfer between models, we suggest using a pixel domain pre-trained model to optimize its compressed domain counterpart during training.
Figure 2a illustrates our training strategy, which transfers knowledge from a pre-trained pixel domain model to a compressed domain model. At each training step, we sample a batch of images x from the dataset and encode them into compressed representations ŷ using the encoder of a selected LIC. We then use the compressed domain vision model to predict o based on the input ŷ, where o denotes the task-specific output vector. To train the compressed domain model for vision tasks, we minimize not only the vision task loss function but also the discrepancy between the intermediate feature maps of the compressed domain model and those of the pixel domain model, where the pixel domain model takes the uncompressed image x as input. Our design is based on the assumption that, since pixel domain models are well-trained feature extractors, the compressed domain model should extract features sufficiently close to those produced by the corresponding pixel domain model. During inference, the compressed domain model operates without the need for uncompressed images or the pixel domain model, as shown in Fig. 2b.
Fig. 2. Overview of proposed method.

(a) An illustration of our strategy for training compressed domain models using pixel-to-compressed domain knowledge transfer. Once the model is trained, it operates on compressed representations solely by itself without the need of pixel-domain knowledge, as is illustrated in (b).
We define the pixel-to-compressed domain knowledge transfer strategy as follows. Let {f_n^c} and {f_n^p}, n = 1, …, N, denote the feature maps from the compressed domain model and the pixel domain pre-trained model, respectively, where N is the total number of intermediate feature maps of each model, depending on the model architecture. Following the concept of feature pyramids in object detection methods [56], we pick the f_n^c such that they come from different network stages with different spatial dimensions. Specifically, for n = 1, …, N, we choose the feature maps with stride 8, 16, 32, …, where strides are w.r.t. the original image. The same rule holds for the f_n^p, and therefore f_n^c and f_n^p have the same dimension for distance computation.
The pair-wise discrepancy between f_n^c and f_n^p is measured by a function d_f(·, ·), which can be any function that measures the distance between multi-dimensional arrays, e.g., the Euclidean distance. Putting the vision task loss and the feature distance loss together, the compressed domain model is trained end-to-end by minimizing the following objective function:
L = L_task(o, ℓ) + Σ_{n=1}^{N} β_n · d_f(f_n^c, f_n^p),    (6)
where L_task and ℓ are the vision task-specific loss function and label, respectively, and the β_n are scaling factors that we can tune. For example, in image classification, L_task is the cross entropy loss, and ℓ is the one-hot vector for the ground truth class label of the input image. In our experiments, we find that the squared Euclidean distance (i.e., the squared error) performs best among commonly used distance functions, as will be discussed in Section V-E, so we adopt it as our default distance function in the remainder of this paper. Similarly, we find that setting β_n = 1 achieves reasonably good performance for most vision models, so we use β_n = 1 as the default setting unless otherwise specified.
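The objective above can be sketched for classification as follows (a PyTorch sketch; `transfer_loss` and its signature are our own illustration, not the paper's code):

```python
import torch
import torch.nn.functional as F

def transfer_loss(logits, labels, feats_c, feats_p, betas):
    """Objective of Eq. (6) with cross entropy as the task loss and the squared
    Euclidean distance as d_f. `feats_c` / `feats_p` are the paired stride-8/16/32
    feature maps of the compressed and pixel domain models; the pixel domain
    (teacher) features are detached since that model stays frozen."""
    loss = F.cross_entropy(logits, labels)
    for fc, fp, beta in zip(feats_c, feats_p, betas):
        loss = loss + beta * (fc - fp.detach()).pow(2).sum()
    return loss
```

With all betas set to 1 this matches the default setting; swapping `.pow(2).sum()` for another discrepancy measure changes d_f.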
B. Pixel-to-compressed Domain Architecture Adaptation
Since we require the intermediate feature maps of the pixel domain model and the compressed domain model to share the same dimensions, it is natural to expect these models to also share similar architectures. Moreover, it is widely known that developing neural network architectures can be computationally expensive [57]. Therefore, we opt to adapt existing pixel domain model architectures to the compressed domain rather than designing a new architecture from scratch. We propose to adapt pixel domain CNN architectures to accept the compressed representation as input for compressed domain tasks.
Our architecture adaptation strategy is shown in Fig. 3. We aim to take a pixel domain CNN architecture as input and produce its compressed domain counterpart with as few modifications as possible. Most pixel domain CNN models, despite their differences in detailed layer architectures and hyperparameters, employ a similar structure, as illustrated in Fig. 3a. The “Conv” blocks in the figure represent general CNN-based network blocks such as the residual blocks in ResNets [20] and the MBConv blocks in EfficientNets [21], with the subscript following “Conv” indicating the block index. Similarly, the “Downsample” layer in Fig. 3a stands for a downsampling operation, typically a max-pooling layer or a convolutional layer with stride greater than 1. We assume each downsampling operation spatially downsamples the input by a factor of two in each spatial dimension, which is the case for most vision models.
Fig. 3. Illustration of a pixel domain CNN architecture (a) and its compressed domain counterpart (b) produced by our architecture adaptation method.

We remove the downsampling layers of the pixel domain model and insert an upsampling operation to the input of the model. Spatial dimensions of intermediate feature maps are annotated between layers.
Given such a pixel domain CNN model, our architecture adaptation strategy can be described by the following steps:
Step 1: Remove the first 2× downsampling operation (usually a strided convolutional layer);
Step 2: For the second downsampling operation: reduce the stride to 1 if it is a convolutional layer, otherwise simply remove it;
Step 3: Insert a 4× upsampling operation before the first layer. Specifically, we use a sub-pixel convolutional layer [58] to perform the upsampling operation.
The reason for removing downsampling operations and inserting an upsampling operation is to solve the dimension mismatch between compressed representations and pixel domain inputs. Namely, the compressed representation produced by most learned image coders has a spatial dimension of H/16 × W/16, while pixel domain vision models typically accept input images at a size of H × W, where H × W is the resolution of the uncompressed image. Previous works address this problem either by increasing the resolution of input images [15], which is less practical as the computational cost of a CNN grows quadratically with the image height and width, or by redesigning the vision model architecture to match the dimension of compressed representations [18], which is difficult to generalize to various vision tasks.
To minimize model architecture modification while solving the dimension mismatch problem, we remove the first two 2× downsampling operations (or change the stride if the downsampling is achieved by strided convolution) and insert a 4× sub-pixel convolution operation before the first layer. By doing so, the dimension of the input feature map to the Conv2 block remains the same as in the pixel domain model, as shown in Fig. 3, thus preserving the receptive fields of the convolutional layers from Conv2 onward.
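The three adaptation steps can be sketched in PyTorch on a toy stem (the layer sizes mirror a ResNet-50 stem and a 192-channel latent, but the exact modules, including the 3×3 kernel of the sub-pixel convolution, are our own assumptions): Steps 1-2 drop the stem's two 2× downsamplers, and Step 3 inserts a 4× sub-pixel upsampler so the Conv2 input dimension is unchanged.

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """4x sub-pixel convolutional upsampling [58]: a convolution expands the
    channels by 4*4, then PixelShuffle rearranges them into a 4x larger map."""
    def __init__(self, in_ch, out_ch, factor=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * factor * factor,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(factor)

    def forward(self, y_hat):
        return self.shuffle(self.conv(y_hat))

# Pixel domain stem (removed by Steps 1-2) vs. compressed domain stem (Step 3).
pixel_stem = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3),  # first 2x downsampling: removed
    nn.MaxPool2d(3, stride=2, padding=1),      # second 2x downsampling: removed
)
comp_stem = SubPixelUpsample(192, 64)          # inserted 4x upsampling

# The body from Conv2 onward is reused unchanged (one conv here as a stand-in).
body = nn.Conv2d(64, 64, 3, padding=1)
```

Both stems feed Conv2 a 64 × 56 × 56 tensor (for a 3 × 224 × 224 image or a 192 × 14 × 14 latent), so the rest of the network and its receptive fields are untouched.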
Due to its simplicity, our model architecture adaptation method is not limited to any specific model but generalizes to most pixel domain CNN architectures. To demonstrate this, we show in Sec. V and Sec. VI that we can apply the architecture adaptation strategy to VGG [59], ResNets [20], and EfficientNets [21] for compressed domain classification, as well as to PSPNet [60] for segmentation.
V. Experiments: Compressed Domain Classification
In this section, we evaluate the proposed methods for compressed domain image classification. In Section VI, we show that our methods generalize to semantic segmentation as well.
As discussed in Section IV, our method places no restrictions on the choice of the learning-based image coder and vision model. For comparative studies, we pick several popular LICs and vision models and list them as follows:
where we use the MSE-optimized Cheng et al. (2020) and the MS-SSIM-optimized NLAIC. To verify that our method is agnostic to the choice of image coder and vision model, we conduct experiments on all pairs of the coders and classifiers listed above. For each coder-classifier pair, we keep the coder fixed and apply our method to adapt the classifier's model architecture, and we compare the rate-accuracy performance of compressed domain classification with that of pixel domain baselines.
Dataset:
We use the 1,000-class ImageNet [3] dataset to train and test all methods. The dataset consists of 1.28 million training and 50,000 validation images from 1,000 object categories. The ImageNet images are already compressed by JPEG but at a high bit rate (4.88 bits per pixel) with most visual details preserved, so we refer to them as the original undistorted images. For all methods, we first train them on the ImageNet train split for 16 epochs at the highest bit rate (details provided in Section V-A). Then, we start from the highest bit rate model and fine-tune it for another 8 epochs at lower bit rates. After each training procedure, we evaluate the final model on the ImageNet val split and report the top-1 classification accuracy.
Devices:
All GPU latencies are estimated by averaging over 5,000 runs on random inputs on an Nvidia 3090 GPU. Similarly, CPU latencies are estimated over 500 runs on an Intel 10700k CPU. For compressed domain models, the input dimension is 192 × 14 × 14 (i.e., 192 channels, 14 in height, and 14 in width), and for RGB-based models, the input dimension is 3 × 224 × 224 (i.e., 3 channels, 224 in height, and 224 in width). All computational complexities are estimated in testing mode (i.e., without back-propagation).
A. ResNet-50 Experiments
We start with ResNet-50 [20], a widely used image classification model, as the baseline vision model for our method. We describe in this section the experiment and implementation settings in detail; these settings remain the same for VGG and EfficientNets in the subsequent sections.
Given a pre-trained ResNet-50, we construct its compressed domain version by applying the proposed architecture adaptation method, and we refer to the resulting compressed domain architecture as ResNet-50-aa. We train ResNet-50-aa on the images compressed by NLAIC and Cheng et al. (2020), respectively, each at three different bit rates. Additionally, we train ResNet-50-aa with the proposed pixel-to-compressed domain knowledge transfer technique, and denote it by ResNet50-aa + transfer.
Implementation details:
We adopt the publicly available implementations of NLAIC [8] and Cheng et al. (2020) [9]. For each coder, we start by training ResNet-50-aa at the highest bit rate with model parameters initialized from the pre-trained (pixel domain) ResNet-50. We then use the trained model parameters to initialize the models for lower bit rates as well as for the knowledge transfer training. We train ResNet-50-aa for 16 epochs at the highest bit rate and 8 epochs for all subsequent experiments. At each training iteration, we perform standard data augmentation for image classification on the pixel domain images (3 × 224 × 224), including random resizing, cropping, and horizontal flipping. The augmented images are then encoded into compressed representations (c × 14 × 14) using the learned image coder. Each training batch consists of 256 images, and we use the Stochastic Gradient Descent (SGD) algorithm with an initial learning rate of 0.01, which decays by a factor of 10 at 3/4 of the total epochs and at the last epoch. During testing, all ImageNet val images are first resized to 3 × 256 × 256 and center-cropped to 3 × 224 × 224 before being encoded into compressed representations for classification.
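The step learning-rate schedule above can be expressed as a small helper (the function name is ours); it returns the learning rate for a given epoch under the stated rule:

```python
def sgd_lr_at_epoch(epoch, total_epochs, base_lr=0.01, decay=0.1):
    """Step schedule described in the text: the learning rate is multiplied
    by `decay` once at 3/4 of the total epochs and again at the last epoch."""
    lr = base_lr
    if epoch >= int(total_epochs * 3 / 4):
        lr *= decay
    if epoch >= total_epochs - 1:
        lr *= decay
    return lr
```

For the 16-epoch schedule this gives 0.01 for epochs 0-11, 0.001 for epochs 12-14, and 0.0001 at the final epoch.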
Methods in comparison:
We construct several baselines to benchmark our method. We list below the short name and the description for those baseline methods.
ResNet-50: the pixel domain ResNet-50 model, which achieves 76.15% accuracy on the original, “undistorted” images. This serves as the upper bound on all methods.
Decode + ResNet-50: ResNet-50 trained and tested on the images compressed and then reconstructed by the learned image coders. This represents the conventional decode-then-inference paradigm.
cResNet-51: the ResNet variant operating in the deep compressed domain, proposed by Torfason et al. in [18], which is the best-performing previous method. Since its implementation is not publicly available, we implement it to our best understanding and train/test it using the same settings as ours.
Table I shows the image classification accuracy as well as the computational complexity of our methods compared to the aforementioned baselines on the NLAIC-compressed ImageNet dataset. Similarly, Table II shows the same set of results when using Cheng et al. (2020) as the image coder. In both Table I and Table II, we only consider the complexity of the DNN models, without taking entropy decoding into account. Also note that the complexity of ResNet-50, cResNet-51, and ResNet-50-aa is independent of the image coder, so their complexity stays the same across Table I and Table II.
TABLE I.
ResNet-based compressed domain classification for NLAIC
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | Top-1 (%) @ 0.133 bpp | @ 0.321 bpp | @ 0.595 bpp |
|---|---|---|---|---|---|---|---|
| ResNet-50 | 25.6M | 4.1B | 23.2ms | 4.31ms | – | – | – |
| Decode + ResNet-50 | 39.4M | 101B | 257ms | 10.1ms | 68.12 | 72.55 | 74.11 |
| cResNet-51 | 29.8M | 3.3B | 21.0ms | 4.31ms | 60.67 | 66.49 | 68.99 |
| ResNet-50-aa | 27.3M | 4.3B | 22.1ms | 4.31ms | 67.43 | 72.08 | 73.24 |
| ResNet-50-aa + transfer | 27.3M | 4.3B | 22.1ms | 4.31ms | 68.37 | 72.82 | 73.79 |
TABLE II.
ResNet-based compressed domain classification for Cheng et al. (2020)
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | Top-1 (%) @ 0.164 bpp | @ 0.340 bpp | @ 0.753 bpp |
|---|---|---|---|---|---|---|---|
| ResNet-50 | 25.6M | 4.1B | 23.2ms | 4.31ms | – | – | – |
| Decode + ResNet-50 | 37.2M | 32B | 95.4ms | 6.37ms | 68.28 | 72.41 | 74.15 |
| cResNet-51 | 29.8M | 3.3B | 21.0ms | 4.31ms | 64.15 | 66.76 | 69.12 |
| ResNet-50-aa | 27.3M | 4.3B | 22.1ms | 4.31ms | 67.72 | 71.16 | 73.10 |
| ResNet-50-aa + transfer | 27.3M | 4.3B | 22.1ms | 4.31ms | 69.26 | 72.24 | 73.75 |
We first observe that our model architecture itself, ResNet-50-aa, outperforms the previous best method cResNet-51 [18] by more than 3% classification accuracy across all bit rates in both Tables I and II, while having computational complexity similar to cResNet-51. Also recall that our proposed architecture adaptation method requires fewer modifications and is agnostic to the base model, so our method is simpler, more general, and more robust than cResNet-51 at the same time. Compared to the decoding-based method, ResNet-50-aa performs only slightly worse than Decode + ResNet-50 in terms of classification accuracy. However, ResNet-50-aa operates directly in the compressed domain and thus avoids the computational cost of decoding. As can be seen from Table I, ResNet-50-aa is significantly faster than Decode + ResNet-50 when using NLAIC as the encoder/decoder (around 8 times faster on CPU and 2 times faster on GPU). This is, of course, partly due to the high complexity of NLAIC, but in the case of Cheng et al. (2020), ResNet-50-aa still runs around 3 times faster on CPU and 1.5 times faster on GPU than the decode-then-inference baseline (Decode + ResNet-50).
As can be observed in Tables I and II, if we train ResNet-50-aa with the pixel-to-compressed domain knowledge transfer strategy (denoted by ResNet-50-aa + transfer), the classification accuracy is further improved and becomes comparable to, or even surpasses, that of Decode + ResNet-50. It is worth mentioning that the knowledge transfer strategy is applied only during training, so the computational complexity at testing time is unchanged, and inference remains much faster than the decode-then-inference approach.
B. VGG Experiments
We conduct the same set of experiments as in the previous section on the VGG Nets [59], a classic family of convolutional neural networks. Specifically, we choose the VGG-11 network as our base model. As before, we construct a compressed domain version of VGG-11 and test it on images compressed by Cheng et al. (2020) and NLAIC, respectively. Since there is no previous work in this setting, we only compare our methods with the decode-then-inference baseline. We list it as well as our methods below:
VGG-11: the original pixel domain VGG-11 model, which achieves 69.03% top-1 accuracy on ImageNet.
Decode + VGG-11: VGG-11 trained and tested on compressed and then reconstructed images by the LIC.
VGG-11-aa: The compressed domain VGG-11 model constructed by our architecture adaptation method, where knowledge transfer is not applied during training.
VGG-11-aa + transfer: The same model as VGG-11-aa, adding the knowledge transfer training.
For training these methods, we use the same set of hyperparameters as in the ResNet experiments, except that the learning rate is set to 0.004 for VGG-11-aa, as the default learning rate of 0.01 leads to exploding gradients in VGG. The compressed image classification results of VGG-11-aa on ImageNet are shown in Table III for the compressed domain of NLAIC, and Table IV for the compressed domain of Cheng et al. (2020). From both tables, we observe that the standalone VGG-11-aa without knowledge transfer training suffers an accuracy drop compared to Decode + VGG-11. This gap is mitigated by using our knowledge transfer method during training, which improves VGG-11-aa by 0.3% to more than 2% accuracy. As can be seen from Table IV, VGG-11-aa + transfer achieves higher accuracy than the decoding-based counterpart at 0.164 bpp. When taking computational complexity into account, directly performing inference in the compressed domain offers a better speed-accuracy trade-off, as VGG-11-aa runs approximately 5 to 8 times faster than Decode + VGG-11 for NLAIC and around 3 times faster for Cheng et al. (2020).
TABLE III.
VGG-11-based compressed domain classification for NLAIC
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | Top-1 (%) @ 0.133 bpp | @ 0.312 bpp | @ 0.595 bpp |
|---|---|---|---|---|---|---|---|
| VGG-11 | 132.9M | 7.6B | 35.6ms | 1.72ms | – | – | – |
| Decode + VGG-11 | 146.7M | 105B | 259ms | 9.51ms | 58.24 | 63.19 | 66.00 |
| VGG-11-aa | 134.6M | 7.2B | 30.4ms | 1.58ms | 56.64 | 60.42 | 63.98 |
| VGG-11-aa + transfer | 134.6M | 7.2B | 30.4ms | 1.58ms | 57.65 | 62.71 | 65.31 |
TABLE IV.
VGG-11-based compressed domain classification for Cheng et al. (2020)
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | Top-1 (%) @ 0.164 bpp | @ 0.340 bpp | @ 0.753 bpp |
|---|---|---|---|---|---|---|---|
| VGG-11 | 132.9M | 7.6B | 35.6ms | 1.72ms | – | – | – |
| Decode + VGG-11 | 144.5M | 35.4B | 107ms | 4.88ms | 58.81 | 63.94 | 66.43 |
| VGG-11-aa | 134.6M | 7.2B | 30.4ms | 1.58ms | 59.09 | 60.93 | 62.81 |
| VGG-11-aa + transfer | 134.6M | 7.2B | 30.4ms | 1.58ms | 59.34 | 63.00 | 63.94 |
C. EfficientNet Experiments
Similar to the ResNet and VGG experiments, we also examine our method using EfficientNet [21] as the base classification model. EfficientNet is a family of CNN models whose architectures are automatically searched to maximize ImageNet accuracy under constraints on model complexity (FLOPs). Compared with VGG and ResNet, EfficientNet has a more complex architecture, consisting of modules such as depthwise convolutions and Squeeze-and-Excitation blocks [61].
Again, we keep the same experimental settings, including training hyperparameters and the fine-tuning strategy, as in Section V-B. The baseline methods and our approach are constructed exactly the same way as before. For completeness, we again list them below:
EfficientNet-B0: the original pixel domain model, which achieves 77.69% top-1 accuracy on ImageNet.
Decode + EfficientNet-B0: the EfficientNet-B0 but trained and tested on the compressed and then reconstructed images by the LICs.
EfficientNet-B0-aa: The compressed domain model constructed by our architecture adaptation method.
EfficientNet-B0-aa + transfer: The same model as above but additionally trained with the knowledge transfer loss.
We show the classification performance in the compressed domain of NLAIC and Cheng et al. (2020) in Table V and Table VI, respectively. Our observations are mostly consistent with the previous experiments. In both cases, EfficientNet-B0-aa suffers an accuracy drop (ranging from 0.39% to 2%) compared to the decoding baseline (Decode + EfficientNet-B0). Again, this accuracy drop is alleviated by incorporating the knowledge transfer training strategy. When considering both computational complexity and model accuracy, our EfficientNet-B0-aa + transfer achieves accuracy comparable to the decoding baseline while saving more than 20% running time on GPU and 80% on CPU.
TABLE VI.
EfficientNet-B0-based compressed domain image classification for Cheng et al. (2020)
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | Top-1 (%) @ 0.164 bpp | @ 0.340 bpp | @ 0.753 bpp |
|---|---|---|---|---|---|---|---|
| EfficientNet-B0 | 5.29M | 0.398B | 11.9ms | 5.40ms | – | – | – |
| Decode + EfficientNet-B0 | 17.0M | 28.1B | 84.7ms | 7.33ms | 64.94 | 67.99 | 70.71 |
| EfficientNet-B0-aa | 6.17M | 0.538B | 10.3ms | 5.18ms | 64.49 | 66.96 | 68.60 |
| EfficientNet-B0-aa + transfer | 6.17M | 0.538B | 10.3ms | 5.18ms | 65.09 | 67.72 | 70.23 |
D. Additional Experiment: Reshaping Baseline
Our proposed method, as described in Sec. IV, assumes that the compressed representation is fundamentally different from the features a classifier extracts from pixel domain images. Thus, directly applying pixel domain classifiers to compressed representations would lead to suboptimal classification accuracy. In this section, we design and conduct a baseline experiment to empirically verify this hypothesis.
We construct a baseline method in which a classifier directly operates on the compressed representations, as follows. We adopt the same configuration of LIC and classifier as in Sec. V-A, i.e., Cheng et al. (2020) for compression and ResNet-50 for classification. We again use the 1,000-class ImageNet dataset for training and evaluation, and all training hyperparameters are kept the same as in Sec. V-A. However, due to the dimension mismatch between the compressed representation (c × 14 × 14, for channel, height, and width) and the classifier input (which requires 3 channels), the classifier cannot accept compressed representations directly. We thus zero-pad the compressed representations along the channel dimension and reshape them (by the pixel shuffling operation) such that the channel dimension becomes 3. A ResNet-50 is then trained and tested on the reshaped representations. We refer to this method as the Reshape + ResNet-50 baseline.
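A minimal numpy sketch of this reshaping baseline is given below. The exact shapes are elided in the text, so the upscaling factor r is an assumption here: we require 3·r² ≥ c (e.g., r = 8 for c = 192, which needs no padding), and the function name is ours:

```python
import numpy as np

def reshape_to_rgb_like(z, r):
    """Zero-pad a (c, h, w) compressed representation to 3*r*r channels,
    then pixel-shuffle it into a 3-channel (3, h*r, w*r) tensor."""
    c, h, w = z.shape
    pad = 3 * r * r - c
    assert pad >= 0, "upscaling factor r too small for this channel count"
    z = np.concatenate([z, np.zeros((pad, h, w), z.dtype)], axis=0)
    # pixel shuffle: move the extra channels into the spatial dimensions
    z = z.reshape(3, r, r, h, w).transpose(0, 3, 1, 4, 2)
    return z.reshape(3, h * r, w * r)
```

For a 192 × 14 × 14 representation and r = 8, this yields a 3 × 112 × 112 RGB-like input that a stock ResNet-50 can consume.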
We report experimental results for Reshape + ResNet-50 compared to ResNet-50-aa in Table VII, where it is clear that this baseline incurs a large accuracy drop (more than 9% at all bit rates) w.r.t. our model. We thus conclude that a CNN classifier cannot effectively parse compressed representations merely through an element re-arrangement, and further processing is required for accurate compressed domain semantic inference. As shown in the table, our approach is a simple yet effective solution to this problem.
TABLE VII.
Baseline comparison for Cheng et al. (2020) with ResNet-50.
| Method | Params. | Top-1 (%) @ 0.164 bpp | @ 0.340 bpp | @ 0.753 bpp |
|---|---|---|---|---|
| Reshape + ResNet-50 | 25.6M | 53.63 | 54.59 | 63.15 |
| ResNet-50-aa | 27.3M | 67.72 | 71.16 | 73.10 |
E. Additional Experiment: Training Strategy
We also conduct an ablation analysis to evaluate how each component of our method contributes to the classification accuracy.
Dataset and implementation details:
We use a subset of the ImageNet training set for ablative analysis. This subset contains 200 categories with 600 training and 50 testing images in each category. We refer to this subset as ImageNet-200. All numbers reported in this section are averaged over three runs. We use NLAIC at 0.595 bpp to map the pixel domain image into the compressed representation. For each run, we train on the ImageNet-200 training set for 80 epochs with batch size 128 and initial learning rate 0.1, and the learning rate is multiplied by 0.1 at the 40th epoch. Other settings remain the same as in the image classification experiments in previous sections.
We first train our ResNet-50-aa from randomly initialized parameters on the compressed ImageNet-200 training set, and we report the classification accuracy on the test set in the first row of Table VIII. Then, we train ResNet-50-aa again but initialize it from parameters pre-trained on the pixel domain ImageNet-200, and we show the resulting accuracy in the second row of Table VIII. This is a classic use case of transfer learning via parameter (weight) initialization. As shown in the table, parameter initialization alone already leads to an improvement of more than 3%. This confirms the design philosophy of our architecture adaptation method: we minimize modifications to the original pixel domain architecture so that more parameters can be transferred to the compressed domain model. Since transferring parameters helps, transferring more parameters is presumably correlated with better performance.
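The parameter-initialization transfer described above can be sketched as copying every parameter whose name and shape match between the two models. The dict-of-arrays form below stands in for a framework state dict, and the helper name is ours:

```python
import numpy as np

def init_from_pixel_model(compressed_params, pixel_params):
    """Initialize a compressed domain model from its pixel domain counterpart:
    copy every parameter whose name and shape match, leaving the rest
    (e.g., the adapted input layers) untouched. Returns the updated
    parameter dict and the number of transferred tensors."""
    transferred = 0
    for name, value in pixel_params.items():
        if name in compressed_params and compressed_params[name].shape == value.shape:
            compressed_params[name] = value.copy()
            transferred += 1
    return compressed_params, transferred
```

Because the architecture adaptation changes only a few layers, most tensors match by name and shape and are therefore transferred.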
TABLE VIII.
Ablation study on compressed ImageNet-200.
| Model | Init. | Transfer func. | Top-1 acc. (%) |
|---|---|---|---|
| ResNet-50-aa | – | – | 67.59 |
| ResNet-50-aa | ✓ | – | 70.97 |
| ResNet-50-aa | ✓ | 1 − cos. | 67.95 |
| ResNet-50-aa | ✓ | MAE | 72.74 |
| ResNet-50-aa | ✓ | MSE | 73.90 |
Next, we train ResNet-50-aa using the pixel-to-compressed domain knowledge transfer strategy. We compare three commonly used distance functions: “cosine distance” (i.e., 1 - cosine similarity), mean absolute error (MAE), and mean squared error (MSE), and we substitute them into Eq. (6) for training with knowledge transfer. Results are shown in the last three rows of Table VIII. We observe that the squared loss function performs the best in our setting, so we use it as the default distance function in all our experiments. That is, the last row of Table VIII corresponds to the ResNet-50-aa + transfer method in previous sections.
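The three candidate distance functions compared in Table VIII can be written as follows, where a and b stand for the feature tensors substituted into Eq. (6):

```python
import numpy as np

def mse(a, b):
    """Mean squared error: the best-performing choice in Table VIII."""
    return float(np.mean((a - b) ** 2))

def mae(a, b):
    """Mean absolute error."""
    return float(np.mean(np.abs(a - b)))

def cosine_distance(a, b):
    """1 - cosine similarity, computed on the flattened tensors."""
    a, b = a.ravel(), b.ravel()
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Note that the cosine distance is scale-invariant, which may explain its weaker performance here: it ignores the feature magnitudes that MSE and MAE penalize.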
F. Discussion
Based on this set of experiments, we conclude that the performance of our methods is consistent across different CNN architectures. In particular, the pixel-to-compressed domain knowledge transfer strategy improves classification accuracy in all settings, enabling compressed domain classification to achieve accuracy comparable to the decode-then-inference paradigm with an improved rate-accuracy-complexity trade-off. Also recall that our approach leaves the image coder unaltered, meaning that we can always reconstruct the original image on demand at the same reconstruction quality with the pre-trained coder. Thus, the compressed representation becomes a more “unified” image representation, from which we can not only reconstruct pixels but also infer semantics.
From another perspective, there is still a gap between the accuracy of our compressed domain models and the decode-then-inference approach using native pixel domain models, and this gap tends to be smaller at low bit rates than at higher ones. This occurs consistently across all experiments and raises several interesting questions. First, what causes the accuracy gap? We heuristically attribute it to the high (spatial) frequency components in pixels. As reported in previous work [62], CNNs learn to capture high frequency components in images and largely rely on them to make predictions. However, spatial information is rearranged in the compressed domain (a 16 × 16 image patch is embedded into a 1 × 1 × c tensor, where c is the number of channels), making CNNs originally designed for pixels less effective at extracting sufficient high frequency components from compressed representations. Second, why is the accuracy gap smaller at lower bit rates? We conjecture that at lower bit rates, most high frequency components are discarded by the image coder anyway, which reduces the discrepancy between the compressed and pixel domains. Thus, with additional knowledge transfer training, compressed domain classification accuracy can even surpass the reconstructed pixel domain accuracy at low bit rates.
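The spatial rearrangement mentioned above can be illustrated with a space-to-depth sketch: folding each 16 × 16 patch into the channel dimension places all of its pixels at a single spatial location. This is only an analogy for intuition; the learned analysis transform is not a plain space-to-depth, and the function below is ours:

```python
import numpy as np

def space_to_depth(x, r):
    """Fold each non-overlapping r x r spatial patch of a (c, h, w) array
    into the channel dimension, giving a (c*r*r, h//r, w//r) array."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(c, h // r, r, w // r, r).transpose(0, 2, 4, 1, 3)
    return x.reshape(c * r * r, h // r, w // r)
```

Applied with r = 16 to a 3 × 224 × 224 image, this yields a 768 × 14 × 14 array: each 16 × 16 patch now occupies one spatial position, mirroring the 14 × 14 spatial grid of the compressed representation.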
VI. Experiments: Compressed Domain Semantic Segmentation
Semantic segmentation is another fundamental vision task, and we test the performance of our methods for this task in the compressed domain as well. We choose a popular semantic segmentation CNN model, PSPNet-50 [60], as the pixel domain base model. Similar to experiments conducted for image classification, we construct a compressed domain semantic segmentation CNN model using our architecture adaptation method based on PSPNet-50, and we refer to the resulting architecture as PSPNet-50-aa.
Dataset and metric:
We use the Cityscapes [5] dataset for training and evaluation. The Cityscapes training set consists of two subsets, fine and coarse; we only use the fine set, which contains 2,975 images. We report the mean Intersection over Union (mIoU, higher is better) on the Cityscapes val set, which contains 500 images. The original training and testing images in the dataset are losslessly compressed by the PNG [63] standard at 9.01 bpp, and we treat them as undistorted images.
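For reference, a minimal numpy sketch of the mIoU metric is given below; the `ignore_index=255` convention for ignored regions is the common Cityscapes setting, assumed here rather than taken from the paper:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean Intersection-over-Union over the classes present in the union.
    Pixels labeled `ignore_index` are excluded, matching the 'ignored'
    regions in the Cityscapes ground truth."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for k in range(num_classes):
        inter = np.sum((pred == k) & (gt == k))
        union = np.sum((pred == k) | (gt == k))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```

In practice the intersections and unions are accumulated over the whole val set before the per-class division; this per-image sketch keeps only the core definition.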
Implementation details:
We adopt the authors’ publicly available implementation but use some custom hyperparameters. We first train PSPNet-50 on the original (pixel domain) images for 120 epochs with batch size 16 and initial learning rate 0.006. The learning rate follows the cosine decay rule [64], and other hyperparameters remain at their default values. When transferring to the compressed domain, we train for only 80 epochs in total, with the initial learning rate set to 0.001 and decayed to 0.0001 after 40 epochs. We perform data augmentation on pixel domain images (3 × 640 × 640) and then use the LICs to encode the images into compressed representations (192 × 40 × 40, for channel, height, and width). During testing, PSPNet-50-aa is directly applied to the compressed representations of the testing images to predict the pixel-wise class map. For all methods, we train a single run with a fixed random seed.
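The cosine decay rule used above can be sketched as follows; since the final learning rate value is not specified in the text, it is left as a parameter, and the function name is ours:

```python
import math

def cosine_lr(epoch, total_epochs, init_lr=0.006, final_lr=0.0):
    """Cosine decay rule: anneal smoothly from `init_lr` at epoch 0
    down to `final_lr` at the last epoch."""
    t = epoch / max(total_epochs - 1, 1)
    return final_lr + 0.5 * (init_lr - final_lr) * (1.0 + math.cos(math.pi * t))
```

Unlike the step schedule used for classification, this decays the learning rate a little at every epoch.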
Methods in comparison:
We list the baselines as well as our methods in the following.
PSPNet-50: The original, pixel-domain PSPNet-50 [60], which achieves 74.98% mIoU on original images.
Decode + PSPNet-50: PSPNet-50 trained and tested on images compressed and then reconstructed by LICs.
PSPNet w/ cResNet-51: the compressed domain semantic segmentation method proposed in [18], except that we use PSPNet instead of DeepLab for a fair comparison. Equivalently, this corresponds to PSPNet with cResNet-51 as the “backbone” architecture.
PSPNet-50-aa: the compressed domain PSPNet-50, constructed by our architecture adaption approach.
PSPNet-50-aa + transfer: the same model as above but additionally trained with the knowledge transfer strategy.
Result:
Table IX and Table X show the compressed domain semantic segmentation performance of our methods compared to the baseline approaches, for the compressed domain of NLAIC [8] and Cheng et al. (2020) [9], respectively. Note that the bit rates shown in these two tables are computed on the ImageNet dataset to be consistent with previous experiments. CPU and GPU latencies are averaged over 50 and 1,000 runs, respectively.
TABLE IX.
Compressed domain semantic segmentation for NLAIC
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | mIoU (%) @ 0.133 bpp | @ 0.321 bpp | @ 0.595 bpp |
|---|---|---|---|---|---|---|---|
| PSPNet-50 | 49.1M | 288B | 903ms | 30.4ms | – | – | – |
| Decode + PSPNet-50 | 62.9M | 1.08T | 2.83s | 84.4ms | 67.47 | 71.59 | 73.28 |
| PSPNet w/ cResNet-51 [18] | 54.5M | 301B | 856ms | 31.6ms | 51.39 | 60.11 | 61.11 |
| PSPNet-50-aa | 50.8M | 282B | 827ms | 29.6ms | 63.93 | 67.95 | 68.49 |
| PSPNet-50-aa + transfer | 50.8M | 282B | 827ms | 29.6ms | 65.70 | 69.38 | 71.08 |
TABLE X.
Compressed domain semantic segmentation for Cheng et al. (2020)
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | mIoU (%) @ 0.164 bpp | @ 0.340 bpp | @ 0.753 bpp |
|---|---|---|---|---|---|---|---|
| PSPNet-50 | 49.1M | 288B | 903ms | 30.4ms | – | – | – |
| Decode + PSPNet-50 | 60.8M | 515B | 1.55s | 52.0ms | 59.90 | 66.94 | 70.43 |
| PSPNet w/ cResNet-51 [18] | 54.5M | 301B | 856ms | 31.6ms | 41.21 | 46.09 | 48.91 |
| PSPNet-50-aa | 50.8M | 282B | 827ms | 29.6ms | 59.20 | 63.66 | 64.15 |
| PSPNet-50-aa + transfer | 50.8M | 282B | 827ms | 29.6ms | 61.96 | 65.26 | 66.01 |
We first observe that our PSPNet-50-aa significantly outperforms PSPNet w/ cResNet-51 by more than 7% mIoU while requiring less computation. When trained with knowledge transfer, PSPNet-50-aa is further improved by around 2% mIoU at all bit rates, approaching the decoding-based approach (Decode + PSPNet-50). PSPNet-50-aa + transfer is still inferior to Decode + PSPNet-50 when looking only at the mIoU metric, but when using NLAIC as the image coder, our approach is around 3× more efficient in terms of FLOPs and CPU latency and runs more than 2× faster on GPU. Again, our method shows a computational advantage and potentially achieves a better speed-accuracy trade-off than the decoding-based approach.
Discussion:
We notice that the performance gap between the pixel and compressed domains in semantic segmentation (ranging from 1.5% to 4.4%) is larger than in image classification (mostly within 1%). We hypothesize that this is due to the “distortion” of spatial information in the compressed representations. Semantic segmentation requires fine-grained predictions, such as the precise boundary of each object category, whereas, as discussed in Sec. V-F, such spatial information is “distorted” in the compressed domain and may not be as easily extracted by CNNs as from the pixel domain. In fact, image coders are explicitly designed to reduce spatial redundancy, which makes the compressed representation less spatially dependent. This nature of the compressed representation is less suited to pixel-oriented tasks such as semantic segmentation, which we regard as a possible reason for the lower mIoU of PSPNet-50-aa + transfer compared to the decoding baseline.
We also visually show prediction examples in Fig. 4, where the coder is NLAIC at 0.595 bpp. In the first row, both the compressed domain prediction (b) and the pixel domain prediction (c) miss the poles in the white bounding box, indicating that compression itself makes thin objects difficult to segment, whether from the pixel or the compressed domain. In the second row, the compressed domain prediction (f) misses the thin fence in the original image while the pixel domain prediction (g) captures it. Apart from those thin objects, the compressed domain prediction is mostly consistent with the pixel domain prediction, which suggests that the compressed representation may distort the fine-grained pixel information needed to detect tiny or thin objects. A more rigorous investigation and a possible solution require additional research, which we leave to future work.
Fig. 4. Qualitative examples of compressed domain semantic segmentation.

(a) and (e) are the original images, (b) and (f) are the compressed domain predictions using PSPNet-50-aa + transfer, (c) and (g) are the pixel domain predictions using Decode + PSPNet-50, and (d) and (h) are the ground truth labels. Note that the blacked-out regions in (d) and (h) represent the “ignored” areas, which do not count in the metric computation. See text for discussion.
VII. Conclusion
This paper demonstrates compressed domain vision tasks by directly inferring content semantics from the latent features of deep compressed images. Different from existing methods that mostly rely on dedicated model development and joint training of compression and vision task engines, we leverage prevalent vision models through simple modifications of existing network architectures to accept compressed latent features (other than pixels), and through transferring knowledge from corresponding pixel domain models to further improve performance. It is worth pointing out that our method is generally applicable to existing deep image coders and well-known neural vision engines without exhaustive model redesign, which makes it more attractive for practical applications.
We show by experiments that the proposed method outperforms existing baseline methods on compressed domain classification and semantic segmentation. Although there still exists a performance gap between task inference directly in the compressed domain and in the reconstructed pixel domain, our method offers a generalizable solution to this challenging problem as well as an improved rate-accuracy trade-off.
TABLE V.
EfficientNet-B0-based compressed domain image classification for NLAIC
| Method | Params. | FLOPs | Latency (CPU) | Latency (GPU) | Top-1 (%) @ 0.133 bpp | @ 0.312 bpp | @ 0.595 bpp |
|---|---|---|---|---|---|---|---|
| EfficientNet-B0 | 5.29M | 0.398B | 11.9ms | 5.40ms | – | – | – |
| Decode + EfficientNet-B0 | 19.1M | 97.7B | 250ms | 9.77ms | 63.10 | 67.79 | 69.57 |
| EfficientNet-B0-aa | 6.17M | 0.538B | 10.3ms | 5.18ms | 61.72 | 67.40 | 67.77 |
| EfficientNet-B0-aa + transfer | 6.17M | 0.538B | 10.3ms | 5.18ms | 62.22 | 67.42 | 69.05 |
Acknowledgments
This work was partially supported by the National Institutes of Health under Grants 1R01CA277839-01, and the National Natural Science Foundation of China under Grant 62022038 and Grant U20A2018.
Biographies

Zhihao Duan (S’20) received the B.E. degree in Electrical Engineering from Shanghai Jiao Tong University, Shanghai, China, in 2018, and the M.S. degree in Electrical and Computer Engineering from Boston University, Boston, MA, USA, in 2020. He is currently a Ph.D. student in the Video and Image Processing Laboratory at Purdue University, West Lafayette, IN, USA. His research interest lies at the intersection of image processing, computer vision, and machine learning. Specifically, he is working towards a unified system for image compression and object recognition.

Zhan Ma (SM’19) is a Full Professor in the School of Electronic Science and Engineering, Nanjing University, Jiangsu, 210093, China. He received his Ph.D. degree from New York University, New York, in 2011, and the B.S. and M.S. degrees from the Huazhong University of Science and Technology, Wuhan, China, in 2004 and 2006, respectively. From 2011 to 2014, he was with Samsung Research America, Dallas, TX, and Futurewei Technologies, Inc., Santa Clara, CA, respectively. His research focuses on learned image/video coding and computational imaging. He was a 2018 PCM Best Paper Finalist and received the 2019 IEEE Broadcast Technology Society Best Paper Award and the 2020 IEEE MMSP Grand Challenge Best Image Coding Solution award.

Fengqing Zhu (SM’19) is an Associate Professor of Electrical and Computer Engineering at Purdue University. She received the B.S.E.E. (with highest distinction), M.S., and Ph.D. degrees in Electrical and Computer Engineering from Purdue University, West Lafayette, IN, USA, in 2004, 2006, and 2011, respectively. Her research interests include image processing, computer vision, video compression and digital health. Prior to joining Purdue in 2015, she was a Staff Researcher at Futurewei Technologies, Santa Clara, CA, where she received a Certification of Recognition for Core Technology Contribution in 2012. She is the recipient of an NSF CISE Research Initiation Initiative (CRII) award in 2017, a Google Faculty Research Award in 2019, and an ESI and trainee poster award for the NIH Precision Nutrition workshop in 2021.
Contributor Information
Zhihao Duan, Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907, U.S.A..
Zhan Ma, School of Electronic Science and Engineering, Nanjing University, Nanjing, Jiangsu 210093, China.
Fengqing Zhu, Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907, U.S.A..
References
- [1].Wallace G, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, Feb. 1992.
- [2].Skodras A, Christopoulos C, and Ebrahimi T, “The JPEG 2000 still image compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–58, Sep. 2001.
- [3].Deng J, Dong W, Socher R, Li L, Li K, and Li F, “ImageNet: A large-scale hierarchical image database,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, Jun. 2009.
- [4].Lin T, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, and Zitnick CL, “Microsoft COCO: Common objects in context,” Proceedings of the European Conference on Computer Vision, pp. 740–755, Sep. 2014.
- [5].Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, and Schiele B, “The Cityscapes dataset for semantic urban scene understanding,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, Jun. 2016.
- [6].Ballé J, Laparra V, and Simoncelli E, “End-to-end optimization of nonlinear transform codes for perceptual quality,” Picture Coding Symposium, pp. 1–5, Dec. 2016.
- [7].Minnen D, Ballé J, and Toderici G, “Joint autoregressive and hierarchical priors for learned image compression,” Advances in Neural Information Processing Systems, vol. 31, pp. 10794–10803, Dec. 2018.
- [8].Chen T, Liu H, Ma Z, Shen Q, Cao X, and Wang Y, “End-to-end learnt image compression via non-local attention optimization and improved context modeling,” IEEE Transactions on Image Processing, vol. 30, pp. 3179–3191, Feb. 2021.
- [9].Cheng Z, Sun H, Takeuchi M, and Katto J, “Learned image compression with discretized Gaussian mixture likelihoods and attention modules,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7936–7945, Jun. 2020.
- [10].Gao G, You P, Pan R, Han S, Zhang Y, Dai Y, and Lee H, “Neural image compression via attentional multi-scale back projection and frequency decomposition,” Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14677–14686, Oct. 2021.
- [11].Pfaff J, Filippov A, Liu S, Zhao X, Chen J, De-Luxán-Hernández S, Wiegand T, Rufitskiy V, Ramasubramonian AK, and Van der Auwera G, “Intra prediction and mode coding in VVC,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3834–3847, Apr. 2021.
- [12].Chen Z, Fan K, Wang S, Duan L, Lin W, and Kot AC, “Toward intelligent sensing: Intermediate deep feature compression,” IEEE Transactions on Image Processing, vol. 29, pp. 2230–2243, Sep. 2019.
- [13].Kakkad V, Patel M, and Shah M, “Biometric authentication and image encryption for image security in cloud framework,” Multiscale and Multidisciplinary Modeling, Experiments and Design, vol. 2, no. 4, pp. 233–248, May 2019.
- [14].Gueguen L, Sergeev A, Kadlec B, Liu R, and Yosinski J, “Faster neural networks straight from JPEG,” Proceedings of the International Conference on Neural Information Processing Systems, pp. 3937–3948, Dec. 2018.
- [15].Xu K, Qin M, Sun F, Wang Y, Chen Y, and Ren F, “Learning in the frequency domain,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1737–1746, Jun. 2020.
- [16].Chamain LD and Ding Z, “Improving deep learning classification of JPEG 2000 images over bandlimited networks,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4062–4066, May 2020.
- [17].Shen Q, Cai J, Liu L, Liu H, Chen T, Ye L, and Ma Z, “CodedVision: Towards joint image understanding and compression via end-to-end learning,” Proceedings of the Advances in Multimedia Information Processing, vol. 11164, pp. 3–14, Sep. 2018.
- [18].Torfason R, Mentzer F, Ágústsson E, Tschannen M, Timofte R, and Gool LV, “Towards image understanding from deep compression without decoding,” International Conference on Learning Representations, Feb. 2018.
- [19].Wang Z, Qin M, and Chen Y-K, “Learning from the CNN-based compressed domain,” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4000–4008, 2022.
- [20].He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Jun. 2016.
- [21].Tan M and Le Q, “EfficientNet: Rethinking model scaling for convolutional neural networks,” Proceedings of the International Conference on Machine Learning, vol. 97, pp. 6105–6114, Jun. 2019.
- [22].Wightman R, “PyTorch image models,” github.com/rwightman/pytorch-image-models, 2019.
- [23].Bégaint J, Racapé F, Feltman S, and Pushparaja A, “CompressAI: A PyTorch library and evaluation platform for end-to-end compression research,” 2020.
- [24].Theis L, Shi W, Cunningham A, and Huszár F, “Lossy image compression with compressive autoencoders,” International Conference on Learning Representations, Apr. 2017.
- [25].Ballé J, Minnen D, Singh S, Hwang S, and Johnston N, “Variational image compression with a scale hyperprior,” International Conference on Learning Representations, Apr. 2018.
- [26].He D, Zheng Y, Sun B, Wang Y, and Qin H, “Checkerboard context model for efficient learned image compression,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14766–14775, Jun. 2021.
- [27].Lu M and Ma Z, “High-efficiency lossy image coding through adaptive neighborhood information aggregation,” 2022.
- [28].Duan L, Liu J, Yang W, Huang T, and Gao W, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Transactions on Image Processing, vol. 29, pp. 8680–8695, Aug. 2020.
- [29].Hu Y, Yang S, Yang W, Duan L-Y, and Liu J, “Towards coding for human and machine vision: A scalable image coding approach,” Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1–6, Jul. 2020.
- [30].Wang S, Wang S, Yang W, Zhang X, Wang S, Ma S, and Gao W, “Towards analysis-friendly face representation with scalable feature and texture compression,” IEEE Transactions on Multimedia, pp. 1–1, Jul. 2021.
- [31].Sun S, He T, and Chen Z, “Semantic structured image coding framework for multiple intelligent applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3631–3642, Sep. 2021.
- [32].Hinton G and Salakhutdinov R, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
- [33].Qiu Q and Sapiro G, “Learning compressed image classification features,” Proceedings of the IEEE International Conference on Image Processing, pp. 5761–5765, Oct. 2014.
- [34].Xu Y, Liu W, and Kelly KF, “Compressed domain image classification using a dynamic-rate neural network,” IEEE Access, vol. 8, pp. 217711–217722, Dec. 2020.
- [35].Wright J, Ma Y, Mairal J, Sapiro G, Huang TS, and Yan S, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, Apr. 2010.
- [36].Ehrlich M and Davis L, “Deep residual learning in the JPEG transform domain,” Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3483–3492, Nov. 2019.
- [37].Wu C, Zaheer M, Hu H, Manmatha R, Smola AJ, and Krähenbühl P, “Compressed video action recognition,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6026–6035, Jun. 2018.
- [38].Babu R, Ramakrishnan K, and Srinivasan S, “Video object segmentation: A compressed domain approach,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 4, pp. 462–474, Apr. 2004.
- [39].Tan Z, Liu B, Chu Q, Zhong H, Wu Y, Li W, and Yu N, “Real time video object segmentation in compressed domain,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 1, pp. 175–188, Jan. 2021.
- [40].Fan Z, Liu J, and Wang Y, “Motion adaptive pose estimation from compressed videos,” Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11719–11728, Oct. 2021.
- [41].Bousmalis K, Silberman N, Dohan D, Erhan D, and Krishnan D, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 95–104, Jul. 2017.
- [42].Hinton G, Vinyals O, and Dean J, “Distilling the knowledge in a neural network,” https://arxiv.org/abs/1503.02531, Mar. 2015.
- [43].Long M, Cao Y, Wang J, and Jordan MI, “Learning transferable features with deep adaptation networks,” Proceedings of the International Conference on International Conference on Machine Learning, pp. 97–105, Jul. 2015.
- [44].Romero A, Ballas N, Kahou SE, Chassang A, Gatta C, and Bengio Y, “FitNets: Hints for thin deep nets,” International Conference on Learning Representations, May 2015.
- [45].Zamir AR, Sax A, Shen W, Guibas L, Malik J, and Savarese S, “Taskonomy: Disentangling task transfer learning,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3712–3722, Jun. 2018.
- [46].Song J, Chen Y, Wang X, Shen C, and Song M, “Deep model transferability from attribution maps,” Advances in Neural Information Processing Systems, vol. 32, Dec. 2019.
- [47].Kang G, Jiang L, Yang Y, and Hauptmann AG, “Contrastive adaptation network for unsupervised domain adaptation,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4888–4897, Jun. 2019.
- [48].Aytar Y, Vondrick C, and Torralba A, “SoundNet: Learning sound representations from unlabeled video,” Proceedings of the International Conference on Neural Information Processing Systems, vol. 29, pp. 892–900, Dec. 2016.
- [49].Hoffman J, Gupta S, Leong J, Guadarrama S, and Darrell T, “Cross-modal adaptation for RGB-D detection,” Proceedings of the IEEE International Conference on Robotics and Automation, pp. 5032–5039, May 2016.
- [50].Gupta S, Hoffman J, and Malik J, “Cross modal distillation for supervision transfer,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2827–2836, Jun. 2016.
- [51].Tian Y, Krishnan D, and Isola P, “Contrastive representation distillation,” International Conference on Learning Representations, Apr. 2020.
- [52].Cover TM and Thomas JA, Elements of Information Theory. USA: Wiley-Interscience, Jul. 2006.
- [53].Wang Z, Bovik AC, Sheikh HR, and Simoncelli EP, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
- [54].Huffman D, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, Sep. 1952.
- [55].Ma S, Zhang X, Jia C, Zhao Z, Wang S, and Wang S, “Image and video compression with neural networks: A review,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1683–1698, Jun. 2020.
- [56].Lin T, Dollár P, Girshick R, He K, Hariharan B, and Belongie S, “Feature pyramid networks for object detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 936–944, Jul. 2017.
- [57].Elsken T, Metzen J, and Hutter F, “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, no. 55, pp. 1–21, Dec. 2019.
- [58].Shi W, Caballero J, Huszár F, Totz J, Aitken A, Bishop R, Rueckert D, and Wang Z, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883, Jun. 2016.
- [59].Simonyan K and Zisserman A, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations, May 2015.
- [60].Zhao H, Shi J, Qi X, Wang X, and Jia J, “Pyramid scene parsing network,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6230–6239, Jul. 2017.
- [61].Hu J, Shen L, Albanie S, Sun G, and Wu E, “Squeeze-and-excitation networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 8, pp. 2011–2023, Aug. 2020.
- [62].Wang H, Wu X, Huang Z, and Xing EP, “High-frequency component helps explain the generalization of convolutional neural networks,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8681–8691, Jun. 2020.
- [63].Boutell T and Lane T, “RFC 2083: PNG (Portable Network Graphics) specification version 1.0,” Network Working Group, pp. 1–102, Mar. 1997.
- [64].Loshchilov I and Hutter F, “SGDR: Stochastic gradient descent with warm restarts,” International Conference on Learning Representations, Apr. 2017.
